Due to changes in the policies of funding agencies and in attitudes towards data sharing, the number of publicly available databases for studying the genetics of complex diseases has grown exponentially. These databases provide a freely available, independent source of information with considerable potential to increase the likelihood of identifying genes. However, integrating large-scale heterogeneous databases into novel data collections presents a major methodological challenge. We introduce a rigorous data integration framework and implement our method in a freely available, user-friendly R package called MIND. MIND estimates the (posterior) probability that a marker has an effect after taking all information into account. Specific features include:
1) Flexibility: Integrate data of almost any kind or information generated by any kind of activity
2) Optimal weights: Databases are weighted optimally, so that databases that do not contain disease-relevant information will not affect the results
3) Solid mathematical foundation: Relies on a solid mathematical foundation that provides the exact posterior probability
4) Excellent estimation properties: A very efficient parameterization gives excellent estimation properties
5) Tested empirically: We tested MIND empirically through an independent replication study of 6,544 SNPs in 6,298 samples from nuclear families. Results show that it identified effects that would otherwise require sample sizes that are 2.5 times larger or replication studies with up to 10 times as many markers.
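To make the idea of a posterior probability concrete, the following is a minimal sketch of a generic Bayes update and not MIND's actual estimator. The function name, the use of likelihood ratios, and the assumption that the information sources are conditionally independent are all illustrative choices that do not come from the MIND documentation.

```python
from math import prod


def posterior_effect_probability(prior, likelihood_ratios):
    """Generic Bayes update (illustrative only, not MIND's estimator).

    prior             -- assumed prior probability that the marker has an effect
    likelihood_ratios -- one ratio per information source:
                         P(data | effect) / P(data | no effect).
                         Sources are assumed conditionally independent.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if any(lr < 0.0 for lr in likelihood_ratios):
        raise ValueError("likelihood ratios must be non-negative")
    # Work on the odds scale: posterior odds = prior odds * product of ratios.
    posterior_odds = prior / (1.0 - prior) * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# A source with a likelihood ratio of 1 carries no information about the
# marker, so it leaves the prior unchanged.
print(posterior_effect_probability(0.001, [1.0]))
# Two informative sources raise the probability above the prior.
print(posterior_effect_probability(0.001, [10.0, 5.0]))
```

In this simplified picture, a database without disease-relevant information behaves like a source whose likelihood ratio is 1 for every marker.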
The mathematical foundation of MIND is described in the article:
– Bukszár J and Van den Oord EJCG. Mathematically-based integration of heterogeneous data, RUTCOR Research Reports, 16, 1-20, 2011 (download pdf)
An extensive SQLite database containing 136 features that can be used for data integration can be downloaded here. The database includes general genomic features such as transcription factor binding sites, transcript annotation, and coding sequence information, mainly downloaded from the UCSC Genome Browser and ENCODE. In addition, we added data from the NIH Roadmap Epigenomics Mapping Consortium, which has generated high-quality, genome-wide maps of several key histone modifications and of chromatin accessibility. Finally, the database includes a wide variety of disease studies for schizophrenia, bipolar disorder, major depressive disorder, and autism, such as the PGC2 GWAS meta-analyses, top regions from genome-wide linkage scans, and meta-analyses of gene expression studies.
In methylome-wide association studies (MWAS) there are many possible differences between cases and controls (e.g. related to lifestyle, diet, and medication use) that may affect the methylome and produce false positive findings. An effective approach to control for these confounders is to first capture the major sources of variation in the methylation data and then regress out these components in the association analyses. This approach is, however, computationally very challenging because the human genome comprises over 30 million possible methylation sites. We introduce MethylPCA, which is designed to handle this problem. Specifically, MethylPCA can:
1) Create blocks. Reducing the total number of sites has computational and statistical advantages (e.g., a decreased risk of false discoveries and less redundancy in the PCA), and the sum of substantially inter-correlated measurements is a more reliable indicator of the underlying signal than the individual measurements separately. Rather than using a sliding window of a pre-determined fixed length, MethylPCA combines adjacent sites adaptively based on the observed inter-correlations.
2) Perform PCA. The PCA is based on input methylation data and the output is PC scores, eigenvalues and loadings. The PCA is performed through eigen-decomposition of a much smaller inner product matrix calculated from the methylation data.
3) Perform association tests. It performs association tests with supplied covariates. Typical covariates are the PC scores calculated from the PCA procedure. It outputs the test statistics and p-values, as well as a QQ plot.
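The PCA step can be sketched as follows. This is a minimal NumPy illustration of PCA through eigen-decomposition of an inner product matrix, not MethylPCA's C++ implementation. It assumes that the inner product matrix is the sample-by-sample matrix of the centered data, which is small when there are far fewer samples than sites. The function and variable names are hypothetical.

```python
import numpy as np


def pca_via_inner_product(X, n_components):
    """PCA of a (samples x sites) matrix via the (samples x samples) matrix.

    Returns PC scores, eigenvalues of the inner product matrix, and loadings.
    n_components must not exceed the rank of the centered data (at most
    n_samples - 1), otherwise a zero eigenvalue would be used as a divisor.
    """
    Xc = X - X.mean(axis=0)                 # center every site
    G = Xc @ Xc.T                           # n x n inner product matrix
    eigvals, eigvecs = np.linalg.eigh(G)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals = eigvals[order]
    U = eigvecs[:, order]
    singular = np.sqrt(eigvals)
    scores = U * singular                   # equals Xc @ loadings
    loadings = Xc.T @ U / singular          # unit-length site weights
    return scores, eigvals, loadings


# Simulated data: 6 samples and 50 sites (arbitrary illustrative sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))
scores, eigvals, loadings = pca_via_inner_product(X, n_components=3)
print(scores.shape, eigvals.shape, loadings.shape)
```

The eigen-decomposition here works on a 6 x 6 matrix rather than a 50 x 50 one. With millions of sites and hundreds or thousands of samples the difference is far larger. The resulting scores could then be supplied as covariates in the association tests.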
To speed up calculations, data from different chromosomes can be processed simultaneously and the PCA input matrix can be computed in parallel. Statistics that are used repeatedly (e.g. means in the entire sample) are calculated only once and stored to further increase efficiency. MethylPCA consists of separate components that can be run individually or as a pipeline. A user-friendly interface is provided where a parameter file controls which and how procedures are performed. The software is described in the paper:
– Wenan Chen, Guimin Gao, Karolina A Aberg, Srilaxmi Nerella, Swedish Schizophrenia Consortium, Christina M Hultman, Patrik KE Magnusson, Patrick F Sullivan, Edwin JCG van den Oord (2013). MethylPCA: A toolkit to control for confounders in methylome-wide association studies. BMC Bioinformatics, In press.
The computationally and I/O intensive parts of MethylPCA are implemented in C++, and the R package serves as the user interface. The documentation, source code, executables, and an example can be downloaded for Windows (WinZip format), Mac OS X (Zip format), and Linux (tar.gz format).
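One way to see why chromosomes can be processed simultaneously is that an inner product matrix over all sites is the sum of the inner product matrices of each chromosome. The sketch below illustrates this with Python threads. It is an assumption about how such a computation can be decomposed, not a description of MethylPCA's C++ code, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chromosome_inner_product(block):
    """Inner product matrix (samples x samples) for one chromosome's sites."""
    centered = block - block.mean(axis=0)   # site means are computed once
    return centered @ centered.T


def inner_product_matrix(blocks_by_chromosome, max_workers=4):
    """Sum the per-chromosome contributions, computing them concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial = list(pool.map(chromosome_inner_product,
                                blocks_by_chromosome.values()))
    return sum(partial)


# Simulated data: 5 samples measured on three chromosomes of different size.
rng = np.random.default_rng(0)
blocks = {name: rng.normal(size=(5, n_sites))
          for name, n_sites in [("chr1", 30), ("chr2", 20), ("chr3", 10)]}
G = inner_product_matrix(blocks)
print(G.shape)
```

Because centering is done per site, centering each chromosome separately gives the same result as centering the full data set, so the summed matrix equals the one computed from all sites at once.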
Because of the assay costs and the large sample sizes that are required to discover effects while controlling false discoveries, large-scale genetic association studies can be very expensive. Two-stage designs can be used to design these studies in the most cost-effective way. In two-stage designs, all the markers are assayed and tested in a first stage. Only the promising markers are subsequently assayed in the second stage using additional samples. Compared to single-stage studies, optimized multistage designs can achieve the same goals in terms of true and false discoveries with a 50-70% saving in the amount of genotyping. Furthermore, rather than relying on arbitrary rules (e.g. P-values smaller than 0.05 suggest a replication), multistage designs can provide statistically motivated decision rules for declaring significance.
lga972 is a cross-platform application with a graphical interface that uses a genetic algorithm to determine the design features of two-stage genetic association studies that minimize the genotyping burden. The user can choose among a variety of case-control and family-based tests, where the outcome may be scored as present versus absent or as a continuous variable. The text-based output can easily be exported to other programs such as word processors and spreadsheets.
lga972 is described in:
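The genotyping burden of a two-stage design can be illustrated with a small calculation. The sketch below only evaluates one given design. It does not search for an optimal design and it ignores the effect of the design on overall power. The function name, its parameters, and the example numbers are illustrative assumptions.

```python
def two_stage_genotyping_saving(n_markers, n_true, n_samples,
                                stage1_fraction, stage1_alpha, stage1_power):
    """Fraction of genotyping saved relative to a single-stage study.

    n_markers       -- markers assayed in stage 1
    n_true          -- markers assumed to have a real effect
    n_samples       -- total samples across both stages
    stage1_fraction -- share of the samples used in stage 1
    stage1_alpha    -- P-value threshold for carrying a marker to stage 2
    stage1_power    -- assumed chance that a true effect passes stage 1
    """
    n1 = round(n_samples * stage1_fraction)
    n2 = n_samples - n1
    n_null = n_markers - n_true
    # Markers expected to reach stage 2: false positives plus true effects.
    expected_stage2_markers = n_null * stage1_alpha + n_true * stage1_power
    two_stage = n1 * n_markers + n2 * expected_stage2_markers
    single_stage = n_samples * n_markers
    return 1.0 - two_stage / single_stage


saving = two_stage_genotyping_saving(
    n_markers=100_000, n_true=10, n_samples=2_000,
    stage1_fraction=0.4, stage1_alpha=0.05, stage1_power=0.9)
print(f"{saving:.1%}")
```

With these made-up inputs the saving is about 57%. Which stage-one sample fraction and threshold to use is the kind of design choice that an optimization procedure has to settle.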
– Robles, J & Van den Oord, EJCG (2004). lga972: A cross-platform application for optimizing LD studies via the genetic algorithm. Bioinformatics, 20, 3244-3245.
– Van den Oord, EJCG & Sullivan, PF (2003). False discoveries and models for gene discovery. Trends in Genetics, 19, 537-542.
– Van den Oord, EJCG & Sullivan, PF (2003). A framework for controlling false discovery rates and minimizing the amount of genotyping in the search for disease mutations. Human Heredity, 188-199.
– Van den Oord, EJCG (2005). Controlling false discoveries in candidate gene studies. Molecular Psychiatry, 10, 230-231.
– Bukszár, J & Van den Oord, EJCG (2006). Optimization of two-stage genetic designs where data are combined using an accurate and efficient approximation for Pearson's statistic, Biometrics 62, 1132-1137.
Download lga972 in tar.gz format
lga972 is distributed as freeware. To install, download the lga972 distribution and extract the archive on your system. The lga972 distribution includes the program (Java jar file), the program manual (PDF), and the user license (text). You need the Java Runtime Environment (or Java Development Kit) Standard Edition, version 1.3.1 or later. Check your system for an existing installation, or download the JRE and follow the installation instructions at the Sun Microsystems Java site.
Updated August 2015