Fast and accurate imputation of summary statistics enhances evidence of functional enrichment

Imputation using external reference panels is a widely used approach for increasing power in GWAS and meta-analysis. Existing HMM-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation fr…

Authors: Bogdan Pasaniuc, Noah Zaitlen, Huwenbo Shi

July 16, 2018

Bogdan Pasaniuc 1,2, Noah Zaitlen 3, Huwenbo Shi 2, Gaurav Bhatia 4,5,6, Alexander Gusev 4,5,6, Joseph Pickrell 6,7, Joel Hirschhorn 6, David P Strachan 8, Nick Patterson 6, Alkes L. Price 4,5,6

1. Department of Pathology and Laboratory Medicine, Geffen School of Medicine, University of California Los Angeles
2. Bioinformatics Interdepartmental Program, University of California Los Angeles
3. Department of Medicine, Lung Biology Center, University of California San Francisco
4. Program in Molecular and Genetic Epidemiology, Harvard School of Public Health
5. Departments of Epidemiology and Biostatistics, Harvard School of Public Health
6. Broad Institute of Harvard and MIT, Cambridge, MA, USA
7. Harvard Medical School, MA, USA
8. Division of Population Health Sciences and Education, St George's, University of London, UK

Correspondence: bpasaniuc@mednet.ucla.edu and aprice@hsph.harvard.edu

Abstract

Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in GWAS and meta-analysis. Existing HMM-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1-5%) variants (increasing to 87% (60%) when summary LD information is available from target samples) versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and is computationally very fast.
As an empirical demonstration, we apply our method to 7 case-control phenotypes from the WTCCC data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of $\chi^2$ association statistics) compared to HMM-based imputation from individual-level genotypes at the 227 (176) published SNPs in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of 4 lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic vs. non-genic loci for these traits, as compared to an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses.

Author Summary

Public resources of haplotypic diversity such as the 1000 Genomes are routinely leveraged to increase the number of variants tested for association in genome-wide studies through genotype imputation. Existing approaches require individual-level data and are computationally demanding. We present here approaches for imputation that do not require individual-level data but work directly on summary association statistics, a type of data that is widely available. Through extensive simulations we show that summary association imputation captures the same signal as standard individual-level imputation for common and, to a lesser extent, low-frequency variants.
In addition to association studies, we show the utility of imputation of summary statistics in enrichment analyses by enhancing the signal of enrichment for functionally relevant classes of variants as compared to analyses over non-imputed data.

Introduction

Genome-wide association studies (GWAS) are the prevailing approach for finding disease risk loci, having successfully identified thousands of variants associated with complex phenotypes [1, 2]. An important component of the GWAS analysis toolkit is genotype imputation, an approach that leverages publicly available data (e.g. 1000 Genomes [3]) to estimate genotypes at markers untyped in the study to increase power for finding new risk loci [4-7]. In addition to GWAS, genotype imputation is a key component of meta-analysis of studies that use different genotyping platforms, where SNPs that were genotyped in one study can be imputed in the other studies, thus increasing the association power [8-11].

Many approaches for genotype imputation have been proposed, with methods based on Hidden Markov Models (HMM) showing the highest accuracy in simulations and empirical data [4-7]. However, privacy and logistic constraints often prohibit access to individual-level genotype data, thus precluding HMM-based imputation, whereas summary association statistics are becoming widely available. For example, summary statistics are required to be publicly released for any GWAS published in Nature Genetics, and have been publicly released for many traits [12, 13].

In this work we propose methods for testing for association at SNPs untyped in the study when only summary association statistics are available at the typed SNPs. Unlike HMM-based imputation from individual-level genotypes, our proposed approach requires as input only the association statistics at typed variants.
To accomplish this we approximate the distribution of association statistics at a given locus using a multivariate Gaussian. Previous studies have shown that a Gaussian approximation of linkage disequilibrium (LD) leads to accurate inference across a wide range of problems [14-18]. In particular, ref. [17] highlighted the potential utility of Gaussian imputation methods for individual-level and pooled data, but that study did not provide methods or software for imputation from summary association statistics (see Discussion).

Through extensive simulations based on 1000 Genomes data, we show that our approach is almost as powerful as the gold standard of HMM-based imputation from individual-level genotypes, and is able to avoid an increase in false-positive associations by accounting for the limited size of the reference panel. Our approach recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1-5%) variants, versus 89% (67%) for HMM-based imputation, with a reduction in running time of several orders of magnitude. When summary information on the pairwise LD structure among typed variants in GWAS samples is made available (as is also recommended in other contexts [19]), our method recovers 87% (60%) of the effective sample size, again with no increase in false-positive associations.

We validate our approach using real GWAS data from WTCCC across 7 phenotypes as well as a height GWAS from the 1958 Birth Cohort (1958BC), where we show that Gaussian imputation from summary statistics recovers the same signal as HMM-based imputation from individual-level genotypes, with no increase in false-positive rate. For example, we attain an average $\chi^2$ association statistic of 18.28 as compared to 19.17 for HMM-based imputation at the 227 published SNPs in the WTCCC data and 4.76 (vs. 4.55 for HMM-based imputation) at the 176 published SNPs in the 1958BC height data.
For publicly available summary statistics from large meta-analyses of 4 lipid traits (triglycerides (TG), total cholesterol (TC), high density lipoprotein (HDL), low density lipoprotein (LDL)), we publicly release imputed summary statistics at 1000 Genomes SNPs, which could not have been obtained using previously published methods. We validate the accuracy of the imputed statistics across the 4 studies using a masking approach and show that we attain a correlation of 0.98 (0.95) to masked summary statistics for common (low-frequency) variants, consistent with simulations. Finally, we explore the utility of imputed association statistics to functional enrichment analysis [13]. For the 4 lipid traits, we find that imputed data increases the magnitude and statistical evidence of enrichment at genic vs. non-genic loci, as compared to an analysis without 1000 Genomes imputation [13].

Methods

Overview of Gaussian imputation of summary statistics

We assume that summary association statistics consist of z-scores known to be normally distributed with mean 0 and variance 1 under the null model of no association. LD between SNPs $i$ and $j$ induces a covariance between their observed z-scores, according to the correlation $r_{ij}$ between the two SNPs. Thus, under null data the vector $Z$ of z-scores at all SNPs in a locus follows a Gaussian distribution, $Z \sim N(0, \Sigma)$, with $\Sigma$ being the correlation matrix among all pairs of SNPs induced by LD ($\Sigma_{ij} = r_{ij}$). In a given study we only observe z-scores at the typed SNPs ($Z_t$), with no information about untyped SNPs. We estimate $\Sigma$ using reference panels of haplotypes (e.g. 1000 Genomes) and analytically derive the posterior mean of z-scores at unobserved SNPs ($Z_i$) given $Z_t$ and $\Sigma$ (ImpG-Summary). We use the conditional variance to estimate the imputation accuracy ($r^2_{pred}$), in a manner similar to the $r^2_{hat}$ estimator in HMM-based imputation [7].
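
As a rough illustration of the overview above, the following NumPy sketch computes the posterior mean of the untyped z-scores and an accuracy estimate equal to one minus the conditional variance. The function name `posterior_mean_and_r2` and the toy LD matrix are hypothetical and are not taken from the released software; the sketch also omits the windowing and the noise adjustment described later.

```python
import numpy as np

def posterior_mean_and_r2(sigma, typed, untyped, z_typed):
    """Conditional-Gaussian imputation under Z ~ N(0, sigma).

    sigma   : full LD (correlation) matrix over typed and untyped SNPs
    typed   : indices of typed SNPs; untyped: indices of SNPs to impute
    z_typed : observed z-scores at the typed SNPs
    Returns the posterior mean at untyped SNPs and 1 - conditional variance.
    """
    sigma = np.asarray(sigma, dtype=float)
    s_tt = sigma[np.ix_(typed, typed)]      # LD among typed SNPs
    s_it = sigma[np.ix_(untyped, typed)]    # LD between untyped and typed SNPs
    s_ii = sigma[np.ix_(untyped, untyped)]  # LD among untyped SNPs
    # weights = s_it @ inv(s_tt), computed via a linear solve (s_tt is symmetric)
    weights = np.linalg.solve(s_tt, s_it.T).T
    mean = weights @ np.asarray(z_typed, dtype=float)
    cond_var = np.diag(s_ii - weights @ s_it.T)
    return mean, 1.0 - cond_var

# Toy example (made-up LD values): SNPs 0 and 1 are typed, SNP 2 is imputed.
sigma = np.array([[1.0, 0.5, 0.8],
                  [0.5, 1.0, 0.6],
                  [0.8, 0.6, 1.0]])
mean, r2_pred = posterior_mean_and_r2(sigma, [0, 1], [2], [3.0, 1.0])
```

Using a linear solve rather than an explicit matrix inverse is a numerical convenience chosen for this sketch; it gives the same weights.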
The finite sample size of the reference panel adds statistical noise to the estimate of $\Sigma$. We account for noisy estimates at distant SNPs by employing a windowing strategy that models distant SNPs as uncorrelated. This strategy also leads to efficient computational runtime (a smaller matrix needs to be inverted for each window in the genome). In particular, we partition the genome into non-overlapping windows (e.g. 1Mb) and, for each window independently, we estimate the LD matrix $\Sigma$ using the reference panel of haplotypes. To account for SNPs at the boundaries of these windows we include SNPs within a buffer around the window in the computation (e.g. 250Kb on either side). We also account for statistical noise in estimates at proximal SNPs by adding $\lambda I$ to the LD matrix $\Sigma$ estimated from the data. This procedure is similar to ridge regression [20] and can also be interpreted in a Bayesian context as adding a prior of $N(0, \lambda I)$ to the typed SNP coefficients [21]. Accounting for this statistical noise is necessary to eliminate false-positive associations (see Results).

The imputed z-scores (from the conditional Gaussian distribution) can be viewed as a linear combination of typed z-scores with weights pre-computed from the reference panel. Therefore the variance of the imputed z-score can be estimated on the basis of the weights and the LD among typed SNPs. We estimate the LD structure among typed SNPs using the reference panel as above (accounting for statistical noise in the reference panel) and normalize the imputed z-scores such that their theoretical variance under the null is 1 (in a real scan the observed variance may be greater than 1 due to polygenic effects [22]).
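
The windowing step can be sketched as follows. This is a minimal illustration that assumes sorted base-pair positions on a single chromosome; the function name `iter_windows` and the choice to return index arrays are hypothetical. It also assumes that SNPs in the buffered range serve as predictors while only SNPs in the core window are imputed, which is one reading of the description above rather than a statement about the released software.

```python
import numpy as np

def iter_windows(positions, window=1_000_000, buffer=250_000):
    """Yield (core_idx, buffered_idx) for each non-empty window.

    positions : sorted base-pair coordinates of SNPs on one chromosome
    core_idx  : indices of SNPs inside the non-overlapping window
    buffered_idx : indices of SNPs inside the window plus the buffer on either side
    """
    positions = np.asarray(positions)
    if positions.size == 0:
        return
    first = int(positions[0]) // window
    last = int(positions[-1]) // window
    for k in range(first, last + 1):
        lo, hi = k * window, (k + 1) * window
        core = np.arange(np.searchsorted(positions, lo, side="left"),
                         np.searchsorted(positions, hi, side="left"))
        if core.size == 0:
            continue  # no SNPs fall in this window
        buffered = np.arange(
            np.searchsorted(positions, lo - buffer, side="left"),
            np.searchsorted(positions, hi + buffer, side="left"))
        yield core, buffered

# Toy positions (made up) spanning three 1Mb windows.
positions = [100, 900_000, 1_100_000, 1_300_000, 2_500_000]
result = [(c.tolist(), b.tolist()) for c, b in iter_windows(positions)]
```

Because each window is processed independently, the windows can also be handled in parallel.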
Since we use fixed window sizes, the computation time of our procedure scales linearly with the total number of SNPs, as the computations at each window can be performed in constant time (proportional to the square of the fixed number of SNPs in the window). The computation time can be further reduced by precomputing the inverse of the covariance matrix for each genotyping array platform at each window, although special care is required for typed SNPs that are removed by QC. In particular, matrix inversion should be repeated at windows where typed SNPs used in the imputation are removed by QC.

If summary LD statistics (pairwise LD among typed SNPs within a window in GWAS samples) are also available, they can be directly used to estimate the variance of each imputed z-score (which can be viewed as a linear combination of typed z-scores). This produces an accurate estimate of the expected variance under the null for the imputed z-scores with no need of adjustment for the statistical noise in the reference panel (ImpG-SummaryLD). This leads to well-calibrated association statistics under the null with increased power relative to ImpG-Summary.

Association statistics in GWAS

A standard test for association in GWAS is the normalized difference in frequencies between cases and controls (z-score $z$), defined as:

$$z = \sqrt{N}\,\frac{f^{+} - f^{-}}{\sqrt{2 f (1 - f)}}$$

where $f^{+}$ ($f^{-}$) denotes the frequency in cases (controls), $f$ is the overall frequency and $N$ the number of samples. This statistic extends to continuous phenotypes by considering $\sqrt{N}$ times the correlation between the vector of genotypes (0, 1, 2) and phenotype. In the case of imputed data, this statistic extends by using genotype dosages in the computation of the correlation of dosages to phenotype.
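
For concreteness, the two statistics above can be written as a short sketch. The function names and the example frequencies are hypothetical. Note that with $N$ total samples the case-control formula has unit variance under the null when cases and controls are equal in number, as in the simulations reported here.

```python
import math
import numpy as np

def case_control_z(f_cases, f_controls, f_overall, n_samples):
    """z = sqrt(N) * (f+ - f-) / sqrt(2 f (1 - f)) for allele frequencies."""
    denom = math.sqrt(2.0 * f_overall * (1.0 - f_overall))
    return math.sqrt(n_samples) * (f_cases - f_controls) / denom

def quantitative_z(genotypes, phenotype):
    """sqrt(N) times the correlation between genotypes (or dosages) and phenotype."""
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    return math.sqrt(len(g)) * np.corrcoef(g, y)[0, 1]

# Made-up example: 1,000 cases and 1,000 controls, allele frequency 0.32 vs 0.28.
z_cc = case_control_z(0.32, 0.28, 0.30, 2000)

# Made-up example: a phenotype perfectly correlated with genotype at 4 samples.
z_qt = quantitative_z([0, 1, 2, 1], [0.0, 1.0, 2.0, 1.0])
```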
Linkage Disequilibrium (LD) between pairs of SNPs $s$ and $s'$ induces a correlation among the observed z-scores at these SNPs, which can be expressed through the standard correlation coefficient $r(s, s')$.

Multivariate Gaussian approximation

Similar to other works [14-17], we approximate the full distribution of association statistics $Z$ at $n$ SNPs in LD using a multivariate Gaussian distribution with probability density function depending on the mean $\mu$ and variance-covariance matrix $\Sigma$. Let the vector $Z$ be partitioned into two components $Z_t$ and $Z_i$ corresponding to the typed and imputed SNPs, where $Z_t$ is a vector of size $m$ (assuming $m$ SNPs have been typed) and $Z_i$ has $n - m$ elements. Similarly, we partition the mean vector into $(\mu_t, \mu_i)^T$, corresponding to the means at typed and imputed SNPs, and the variance-covariance matrix into covariances among imputed SNPs ($\Sigma_{i,i}$, $[(n-m) \times (n-m)]$), covariances between imputed and typed SNPs ($\Sigma_{i,t}$, $[(n-m) \times m]$) and covariances among typed SNPs ($\Sigma_{t,t}$, $[m \times m]$). Then the conditional random variable $Z_i | Z_t$ follows a Gaussian distribution with mean

$$\mu_{Z_i | Z_t} = \mu_i + \Sigma_{i,t} \Sigma_{t,t}^{-1} (Z_t - \mu_t)$$

and covariance

$$\Sigma_{i|t} = \Sigma_{i,i} - \Sigma_{i,t} \Sigma_{t,t}^{-1} \Sigma_{i,t}'.$$

Gaussian imputation of association statistics (ImpG-Summary)

When modeling the variance-covariance matrix $\Sigma$, we adopt a windowing strategy aimed at decreasing runtime (a smaller matrix needs to be inverted for each window in the genome) and at reducing statistical noise that can make distant SNPs appear correlated when in truth there is no LD between them. In particular, we partition the genome into non-overlapping windows of 1Mb (with a buffer of 250Kb on either side to account for LD at boundaries). Let $\Sigma$ denote an estimate of the variance-covariance matrix in the GWAS sample across both typed and imputed SNPs.
For each window independently, we estimate $\Sigma$ from the reference panel of haplotypes, with an adjustment for sampling noise (see below). Let $z_t$ be the set of observed z-scores restricted to the current window. We impute $z_i$ as

$$z_{i|t} = \Sigma_{i,t} \Sigma_{t,t}^{-1} z_t.$$

To speed up computation, we can precompute $\Sigma_{t,t}^{-1}$ for all genotyping array platforms such that runtime is quadratic in the number of SNPs in the window. For windows where QC has removed part of the typed SNPs used in imputation, $\Sigma_{t,t}^{-1}$ needs to be re-estimated. Since the window length is fixed across the genome, the overall computational runtime can be thought of as linear in the number of SNPs (when $\Sigma_{t,t}^{-1}$ has been precomputed already).

The imputed z-score $z_{i|t}$ at imputed SNP $i$ can be viewed as a linear combination of typed z-scores $z_t$ with weights $W = \Sigma_{i,t} \Sigma_{t,t}^{-1}$ pre-computed from the reference panel. Let $A$ denote the variance-covariance matrix among typed SNPs in the population. Since $z_t$ follows a Gaussian distribution $N(0, A)$, it follows that $z_{i|t}$ has variance $W A W'$. Therefore, we use

$$\frac{z_{i|t}}{\sqrt{W A W'}}$$

as the imputation z-score at imputed SNP $i$. To account for the statistical noise while also making sure that $\Sigma$ is invertible, we adopt a procedure similar to ridge regression [20] and use $\Sigma = \Sigma_{unadj} + \lambda I$ in both $\Sigma_{i,t}$ and $\Sigma_{t,t}$ in the estimation of $W$ (we use $\lambda = 0.1$ as default; see Tables S1, S2, S3 for results across other values of $\lambda$). We approximate $A$ with $\Sigma_{t,t}$ using LD information from the reference panel (ImpG-Summary). An alternative is to use the true $A$, i.e. the summary LD statistics from the GWAS sample, if they are available; in this case, a more substantial adjustment for statistical noise in $\Sigma$ is not needed because $A$ is derived from the GWAS sample, and we set $\lambda = 0.001$ to make sure that $\Sigma$ is invertible in the estimation of $W$ (ImpG-SummaryLD).
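
The procedure in this subsection can be sketched for a single window as follows. This is an illustrative NumPy sketch under stated assumptions, not the released implementation: the function name `impute_window` and its arguments are hypothetical, `sigma_ref` stands for the unadjusted reference-panel LD matrix of the window, and passing `ld_typed` (the GWAS-sample LD among typed SNPs) switches the normalization to the ImpG-SummaryLD variant.

```python
import numpy as np

def impute_window(sigma_ref, typed, untyped, z_typed, lam=0.1, ld_typed=None):
    """Impute normalized z-scores at untyped SNPs within one window.

    sigma_ref : unadjusted LD matrix from the reference panel (hypothetical input)
    lam       : ridge-like adjustment added to the diagonal (0.1 by default;
                the text uses 0.001 when ld_typed is supplied)
    ld_typed  : optional LD among typed SNPs from the GWAS sample (matrix A)
    """
    sigma_ref = np.asarray(sigma_ref, dtype=float)
    sigma = sigma_ref + lam * np.eye(sigma_ref.shape[0])  # Sigma_unadj + lambda*I
    s_tt = sigma[np.ix_(typed, typed)]
    s_it = sigma[np.ix_(untyped, typed)]
    # W = s_it @ inv(s_tt), via a linear solve (s_tt is symmetric)
    weights = np.linalg.solve(s_tt, s_it.T).T
    raw = weights @ np.asarray(z_typed, dtype=float)
    # A: reference-panel estimate (ImpG-Summary) or GWAS-sample LD (ImpG-SummaryLD)
    a = s_tt if ld_typed is None else np.asarray(ld_typed, dtype=float)
    var = np.einsum("ij,jk,ik->i", weights, a, weights)  # diag(W A W')
    return raw / np.sqrt(var)

# Toy window (made-up LD): SNP 0 is typed with z = 2.0, SNP 1 is imputed.
sigma_ref = np.array([[1.0, 0.8],
                      [0.8, 1.0]])
z_summary = impute_window(sigma_ref, [0], [1], [2.0], lam=0.1)
z_summary_ld = impute_window(sigma_ref, [0], [1], [2.0], lam=0.001,
                             ld_typed=[[1.0]])
```

In this single-predictor toy case the adjusted statistic is shrunk slightly relative to the typed z-score, while the variant that uses GWAS-sample LD reproduces it, which mirrors the slight deflation and improved calibration described in the text.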
We do not use the summary LD statistics across typed SNPs in the sample for estimation of $W$ in ImpG-SummaryLD, to maintain consistency among pairwise LD statistics between typed and imputed SNPs. Software implementing the ImpG-Summary and ImpG-SummaryLD methods has been made publicly available (see Web Resources).

We propose a metric for imputation accuracy based on the variance of the conditional random variable $Z_i | Z_t$: we define $r^2_{pred} = 1 - \Sigma_{i|t}$. Figure S1 shows that $r^2_{pred}$ behaves very similarly to the standard imputation accuracy metric $r^2_{hat}$ [7] (correlation of 0.90 to the true $r^2$ accuracy as compared to 0.92 for $r^2_{hat}$).

Simulation framework

We simulated data starting from the 381 diploid European individuals from the phase 2 release of the 1000 Genomes Project (June 2011) [3]. The 381 individuals include 87 CEPH individuals of North European ancestry (CEU), 93 Finnish individuals from Finland (FIN), 89 British individuals from England and Scotland (GBR), 98 Tuscan individuals (TSI), and 14 individuals from the Iberian peninsula (IBS). Genotype calls and haplotypic phase had been previously inferred from low-coverage sequencing (4x) using an imputation strategy that borrowed information across samples and loci [3]. The set of haplotypes was split at random into a set of 178 haplotypes (number chosen to match the 89 samples of British ancestry) used to build simulated data, and the remaining haplotypes, which were used as an imputation reference panel. Starting from the simulation panel of haplotypes, we used hapgen [23] to simulate 10,000 diploid individuals. All simulation results were generated over 50 distinct 1Mb regions (total of 50Mb) randomly chosen across Chromosome 1, totaling 321,226 SNPs.
For each of the SNPs with MAF greater than 1% in the reference panel (133,025 in total), we simulated case-control data sets by randomly choosing a subset of 1,000 controls, and then choosing 1,000 cases from the remaining samples so that samples with 0:1:2 reference alleles have relative probabilities $1 : R : R^2$ of being chosen (for a given odds ratio $R$). For null simulations we randomly selected 1,000 samples as cases and 1,000 samples as controls.

WTCCC data set

We examined data from the Wellcome Trust Case Control Consortium (WTCCC) phase I comprising GWAS studies of 7 diseases: Bipolar disorder (BD), Coronary artery disease (CAD), Crohn's disease (CD), Hypertension (HT), Rheumatoid arthritis (RA), Type 1 diabetes (T1D), Type 2 diabetes (T2D) (see Table S4 for detailed sample sizes) [24]. We removed all SNPs that had overall deviation from Hardy-Weinberg equilibrium at a p-value below 0.01. Then, we removed any SNP that had differential missingness (p-value < 0.01) in any of the case-control cohorts, overall missingness over 0.001, or minor allele frequency below 0.01. This yielded a total of 325,553 SNPs. We performed HMM-based imputation using the pre-phasing approach of [5]; we used HAPI-UR [25] to infer haplotypes from genotypes and then ran IMPUTE2 [5] using default parameters on the inferred haplotypes (see Web Resources). Unless otherwise noted we filtered out imputed SNPs using an imputation accuracy cutoff of 0.6, as well as SNPs that had more than 5% of the individual imputed calls missing at a posterior probability level of 0.9. This procedure yielded approximately 4.7M SNPs for the considered phenotypes (Table S4).

1958 Birth Cohort data

The British 1958 birth cohort is an ongoing follow-up of all persons born in England, Scotland and Wales during one week in 1958.
At the age of 44-45 years, the cohort were followed up with a biomedical examination and blood sampling [26], from which a DNA collection was established as a nationally representative reference panel (http://www.b58cgene.sgul.ac.uk/). Non-overlapping subsets of the DNA collection were genotyped by the Wellcome Trust Case-Control Consortium (WTCCC) [24], the Type 1 Diabetes Genetics Consortium (T1DGC) [27] and the GABRIEL consortium [28]. Genotyping by the WTCCC used both the Affymetrix 500K array and the Illumina 550K (version 1) array. Since the T1DGC used the Illumina 550K (version 3) array and GABRIEL used the Illumina 610K array, a combined dataset was created of SNPs in common across these three panels. SNPs were excluded from subsequent imputation if they had MAF < 1%, call-rate < 95%, HWE p-value < 0.0001, or differences in allele frequency across the three deposits (p < 0.0001 on pairwise comparisons). Pre-imputation phasing was performed using MACH [7]. Imputations against the March 2012 release of 1000-genomes all-ethnicities reference haplotypes were performed using Minimac [5]. Associations of imputed allele dosages with standing height, as measured at the 44-45-year follow-up, were analysed using ProbAbel [29].

Publicly available summary statistics for 4 lipid traits

Publicly available GWAS summary data across four blood lipid phenotypes (triglycerides (TG), total cholesterol (TC), high density lipoprotein (HDL), low density lipoprotein (LDL)) were downloaded from public access websites [30]. These data have recently been used in a study of the overlap of GWAS findings and functional data [13]; all QC steps are described elsewhere [30]. The data comprised roughly 2.7M summary statistics based on roughly 100,000 samples for each of the four phenotypes.
To remove strand ambiguity we removed all A/T and C/G SNPs (roughly 15.4% of all SNPs); we also removed all SNPs with meta-analysis sample sizes under 80,000, leaving approximately 2.0M SNPs for each of the phenotypes. We imputed to 1000 Genomes using ImpG-Summary under three scenarios. In the first scenario, we removed 10% of the SNPs at random. In the second scenario, we removed all SNPs not present on the Illumina 610 genotyping platform (approximately 600k in total). In both of these scenarios, we imputed from the remaining SNPs and assessed accuracy using the previously masked SNPs. As a metric of accuracy we computed the correlation between imputed and previously masked association statistics. In the third scenario, we imputed from all 2.0M SNPs to obtain the summary statistics at 7.3M SNPs that we publicly release.

Enrichment analysis for 4 lipid traits

We used an analysis similar to [13] to quantify enrichment per class of SNPs. We categorized each SNP according to its distance to genes using the all SNPs track (snp137) from the UCSC genome browser (http://genome.ucsc.edu/cgi-bin/hgTables). All SNPs within an exon, up to 5Kb up-stream, up to 5Kb down-stream, located in the 3' UTR or in the 5' UTR were labeled as genic. SNPs with no annotation in the data were considered as being Intergenic. For each data set, we normalized the association statistics using genomic control attained only over the Intergenic SNPs [13], followed by computation of average variance across SNPs within each functional class. We estimate the variance as the average of the squared association z-scores minus 1 [13]. We compared the magnitude of enrichment in association statistics across different functional classes within the same data set (either the public data or the imputed one) using the median of the Kolmogorov-Smirnov (KS) test statistic at 100 random draws each of 10,000 random SNPs across the genome.
This conservative computation avoids correlations due to LD and does not account for the larger number of SNPs in the 1000G imputed data, which would further increase statistical significance.

Gaussian imputation of individual genotypes

Although we focus primarily on imputation of summary statistics, for completeness we also discuss Gaussian imputation when individual-level data is available. We compare two different approaches. The first approach is to apply ImpG-SummaryLD as described above, relying only on summary association statistics and summary LD statistics. The second approach (which attains slightly worse results) is very similar to the approach proposed by [17]. As described in that study, we can impute allele frequencies and treat each genotype as a sample of size 2. Following [17], we set $\mu$ to be the observed allele frequency in the reference panel and $\Sigma[i, j]$ to be the covariance between SNPs $i$ and $j$. Next we apply the same windowing approach as above to each sample independently to impute individual-level genotypes. Although rare, in practice Gaussian imputation can output values less than 0 or greater than 2; we adjust these values to 0 and 2, respectively. As association statistic we use the 1-degree-of-freedom $\chi^2$ statistic $N \rho^2(G', \phi)$, where $N$ is the number of samples and $\rho^2(G', \phi)$ is the squared correlation between the vectors of imputed genotypes and the phenotype.

Results

Simulations

To explore the effectiveness of Gaussian imputation using summary statistics (ImpG-Summary and ImpG-SummaryLD), we simulated case-control data sets at various effect sizes across a wide range of SNPs (see Methods). We used the 762 European haplotypes from the 1000 Genomes Project (phase 1, June 2011 release), and restricted the analysis to 50 distinct 1Mb regions (total of 50 Mb, containing 133,025 SNPs with MAF > 1%) randomly chosen across Chromosome 1.
We randomly selected 178 haplotypes for the simulation panel data and the remaining haplotypes for the imputation reference panel. Starting with the 178 haplotypes of the simulation panel, we used hapgen [23] to simulate GWAS data sets. For each of the SNPs with MAF greater than 1% in the reference panel we simulated GWAS data over 1,000 cases and 1,000 controls at different effect sizes. To assess the performance of imputation at recovering the true association signal when present, we used the relative effective sample size, defined as the ratio of average imputed $\chi^2$ statistics at untyped SNPs vs. $\chi^2$ statistics computed from true genotypes. Here $\chi^2$ statistics refer to the squared z-score, which has a $\chi^2$ distribution with 1 degree of freedom under the null hypothesis. We envision that real scans will restrict their analyses to variants with high estimated imputation accuracy ($r^2_{pred} > 0.6$, analogous to the $r^2_{hat}$ estimator in HMM-based imputation [7], Figure S1), but we computed the relative effective sample size with all values of $r^2_{pred}$ included in order to provide an appropriate assessment of power. However, we restricted most of our analyses of false positives to accurately imputed variants ($r^2_{pred} > 0.6$), as these are the variants that would be analyzed in a real scan.

We first explored the robustness of imputation from summary statistics. ImpG-Summary attains genomic control $\lambda_{GC}$ of 0.94 (see Figure S2) with no increase in false-positive rate at the tail of the distribution (see Tables S1, S2, S3). Although ImpG-Summary attains a slight deflation (due to the adjustment procedure that has the effect of shrinking the predictor weights), this is necessary to avoid false positives. As expected from the conditional distribution, Gaussian imputation with no variance normalization is deflated ($\lambda_{GC} = 0.86$), while the naive normalization that does not account for the statistical noise in the LD matrix is also susceptible to false positives (we observe a near 4-fold increase in p-values smaller than $10^{-4}$ as compared to a well-calibrated statistic in null data simulations; see Tables S1, S2 and Figure S2). Recent work in parallel to ours has also investigated the use of Gaussian models for summary association imputation but does not propose an adjustment for the statistical noise in the reference panel [31]. We caution that adjustment for the statistical noise in the reference panel is required to avoid false positives when using our method (see Figure S2). Likewise, our simulations indicate that the method of [31], which does not adjust for statistical noise in the reference panel, is susceptible to false positives at these reference panel sizes (see Figure S3). However, when pairwise correlations among typed SNPs are available from the GWAS data, the expected variance under the null of the imputed statistics can be accurately estimated and used for normalization (ImpG-SummaryLD). This removes the need for adjusting the LD matrix estimated from the reference panel, leading to well-calibrated association statistics with no susceptibility to false positives ($\lambda_{GC} = 1.00$, Figure S2).

We next assessed the ability of ImpG-Summary to identify true positive associations by measuring the decrease in effective sample size. Table 1 shows the relative effective sample size in 1000 Genomes simulations with target and reference haplotypes randomly sampled from 762 European haplotypes (i.e. roughly matched for ancestry). As a gold standard for imputation accuracy, we used Beagle, an HMM-based method that requires individual-level data [6, 32]. Beagle has previously been shown to achieve similar accuracy as other HMM-based methods, with far superior accuracy compared to tagging-based imputation [32, 33].
At an odds ratio of 1.5, ImpG-Summary recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1-5%) variants, versus 89% (67%) for Beagle imputation. Interestingly, when LD information among the typed variants from the GWAS is available, ImpG-SummaryLD recovers 87% (60%) of the effective sample size, nearly as high as Beagle. Table 1 also shows the decrease in effective sample size across a wide array of odds ratios, showing that the results are robust to different effect sizes. Thus, imputation from summary statistics can recover most of the association power available from GWAS with individual-level data.

We also tested the effect of a mismatch in ancestry between the reference haplotype panel and the target population. We simulated case-control GWAS using the GBR haplotypes for target samples and the remaining 1000 Genomes European haplotypes as reference haplotypes. Table 2 shows only marginal decreases in performance for each of HMM, ImpG-Summary and ImpG-SummaryLD as compared to previous results, with no excess of false positives (Table S2).

We note that both ImpG-Summary and ImpG-SummaryLD are computationally very fast, with running times several orders of magnitude lower than HMM-based methods for imputation from individual-level genotypes. Table 3 shows a reduction in running time of several orders of magnitude for ImpG-Summary as compared to HMM-based approaches. For example, for a GWAS with 10,000 samples, IMPUTE2 with pre-phasing [5] takes > 200 CPU days (> 40 CPU days if pre-phasing time is not included), whereas ImpG-Summary takes less than one CPU day (this can be further sped up using weights precomputed from the 1000 Genomes reference panel), as does ImpG-SummaryLD.
The magnitude of the difference in running time will only increase with larger studies (such as the N=100,000 studies analyzed below), as the running time of ImpG-Summary is independent of the number of target samples while the running time of HMM-based imputation is linear in this quantity. However, we note that all of the methods listed in Table 3 can be parallelized across regions of the genome for faster wall-clock running time.

Although our work focuses on imputation from summary statistics, for completeness we also investigate Gaussian imputation when individual-level data is available. This has been proposed in [17] and shown to achieve similar accuracy as HMM-based imputation in the context of HapMap 3 data. In simulations from 1000 Genomes we find that our implementation of the [17] approach (see Methods) achieves slightly but significantly lower accuracy than ImpG-SummaryLD (see Tables S7, S8) across a wide range of effect sizes. The slight improvement of ImpG-SummaryLD over Gaussian imputation using individual-level genotypes suggests that there is an advantage to phenotype-aware imputation. When individual-level genotypes are available, HMM-based imputation remains the approach of choice due to its slightly higher accuracy, but we recommend the use of ImpG-SummaryLD in preference to previous methods for performing Gaussian imputation when rapidly prioritizing regions for HMM-based analysis.

Application to WTCCC and height data sets

We explored whether similar results could be attained in real empirical GWAS data. We validated our approach using a WTCCC study spanning 7 diseases [24] (roughly 2,000 cases for each disease and 3,000 shared controls genotyped on the Affymetrix 500K array (see Table S4)) and a study of height involving 6,500 individuals from the British 1958 birth cohort (1958BC) genotyped on the Affymetrix 6.0 array (see Methods).
Starting from the real genotype data, we used as reference all 758 European haplotypes of the 1000 Genomes phase 2 data to accurately impute approximately 4.3 million SNPs with minor allele frequency of at least 1%, using either an HMM-based method (IMPUTE2 with pre-phasing [5]) or ImpG-Summary (see Table S4). We compared association statistics at accurately imputed SNPs obtained with either the HMM-based method or ImpG-Summary (with the latter assuming no access to individual-level data). We observed an average correlation of 0.94 between the two sets of association statistics for both WTCCC and 1958BC phenotypes (Figure 1), showing high similarity between the two approaches (see Figure S4 for each WTCCC phenotype). In general we observe that the QQ and Manhattan plots show similar behavior for HMM-based association as compared to ImpG-Summary imputation, emphasizing no excess of false positives when only summary data is used in imputation (Figures S5-S13). In some instances we observe differences that we hypothesize represent false positives for the HMM-based imputation, likely due to insufficient QC filtering for the HMM-based approach (see Figures S10-S12). Importantly, statistics at known associated SNPs from the NHGRI GWAS catalog for each of the considered phenotypes [1] (Table 4) show similar association power across the two compared methods (e.g. an average χ² of 19.17 for HMM-based imputation versus 18.28 for ImpG-Summary across the WTCCC data, and 4.55 versus 4.76 for the height phenotype) (Figure 2, Table S5).

Application to publicly available summary statistics for 4 lipid traits

We investigated the performance of ImpG-Summary on publicly available summary association statistic data sets of 4 blood lipid traits [30]. This data has been imputed using HMM-based imputation to an average of 2.0M markers (see Methods). We imputed this data to 7.3M 1000 Genomes markers.
We randomly masked 10% of the data, re-imputed using ImpG-Summary and assessed accuracy by comparing ImpG-Summary with the masked data. As expected, we observe a high correlation between the two sets of summary statistics (correlation r=0.98 (0.95) at common (low-frequency) variants; see Figure 3, Figure S14). To quantify the expected accuracy when imputation is performed from array-based association statistics, we also masked all association statistics not present on a standard genotyping array and re-imputed using ImpG-Summary. We again observe a high correlation (r=0.97 (0.91) at common (low-frequency) variants; see Figure 3, Figure S14), thus showing that our approach recovers association statistics similar to those obtained by HMM-based imputation requiring individual-level genotypes. We have publicly released imputed summary association statistics obtained using the full set of 2.0M markers, without masking (see Web Resources). As expected, we observed lower λ_GC for ImpG-Summary data as compared to original data (e.g. 0.92 versus 0.98 for HDL phenotype, see Table S9).

Enrichment analysis for 4 lipid traits

We categorized each SNP according to functional classes (see Methods). We performed genomic control correction using λ_GC estimated from only the Intergenic SNPs, as in [13]. After normalization, we estimated the average excess variance for each functional class as the average square of the association z-score minus 1 [13]. We observe that 1000 Genomes imputation using ImpG-Summary increases the average variance for each functional class, with Genic SNPs (and in some cases Intronic SNPs) showing larger increases than Intergenic SNPs (see Figure 4, Figure S15). The increase in λ_mean for each functional class, even after normalization by λ_GC, indicates that 1000 Genomes imputation increases the ratio λ_mean / λ_GC, i.e. causes true signals to be more concentrated at the tail of the distribution.
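
The excess-variance computation described above can be sketched as follows. This is a minimal illustration and not the code used for the analysis: it assumes that λ_GC is the usual median-based genomic control factor (median χ² divided by the median of a 1-d.o.f. χ² distribution), and the function name `excess_variance_by_class` and the class labels are hypothetical.

```python
from statistics import mean, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def excess_variance_by_class(z_scores, classes, null_class="Intergenic"):
    """Return (lambda_gc, {class: average corrected z^2 minus 1}).

    lambda_gc is estimated from the SNPs labelled `null_class` only
    (median-based definition, an assumption of this sketch), and every
    chi-square statistic is divided by it before averaging per class.
    """
    chi2 = [z * z for z in z_scores]
    null_chi2 = [c for c, k in zip(chi2, classes) if k == null_class]
    lambda_gc = median(null_chi2) / CHI2_1DF_MEDIAN

    excess = {}
    for label in set(classes):
        corrected = [c / lambda_gc for c, k in zip(chi2, classes) if k == label]
        excess[label] = mean(corrected) - 1.0
    return lambda_gc, excess


# Toy input: three Intergenic and two Genic SNPs (made-up z-scores).
z = [0.4549364 ** 0.5, 0.9098728 ** 0.5, 1.8197456 ** 0.5, 2.0, 8.0 ** 0.5]
labels = ["Intergenic"] * 3 + ["Genic"] * 2
lambda_gc, excess = excess_variance_by_class(z, labels)
```

Because the correction factor comes from Intergenic SNPs alone, any excess remaining in the other classes is measured relative to the Intergenic baseline rather than to the genome-wide distribution.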
1000 Genomes imputation using ImpG-Summary increases statistical evidence of enrichment at Genic vs. Intergenic SNPs, both because the magnitude of the enrichment is larger and because of the increased number of SNPs. We focus here on just the former effect by computing KS test statistics at random subsets of 10,000 SNPs, a conservative computation that avoids correlations due to LD (see Methods). Across all 4 phenotypes, median KS test statistics were more significant in the 1000G imputed data vs. the original data set (e.g. 4.75E-08 versus 7.63E-05 for HDL; see Table S10 for all phenotypes). This highlights the increased utility of the 1000G imputed summary statistics that we have publicly released for analyses of functional enrichment.

Discussion

We have introduced an approach for imputation of association statistics at untyped variants directly from summary association statistics using publicly available reference panels of haplotypes such as 1000 Genomes [3], in contrast to widely used HMM-based approaches that require individual-level genotypes [5]. Through extensive simulations and real data analyses we show that our approach is almost as powerful as imputation from individual-level genotypes (for both common and low-frequency variants) with no excess of false positives. We have described a method that uses summary association statistics (ImpG-Summary), as well as a method that uses summary association statistics and summary LD statistics (ImpG-SummaryLD). Because summary LD statistics are not currently widely shared, we expect that ImpG-Summary will be of greatest practical value in the immediate future. However, the slightly higher power attained by ImpG-SummaryLD provides a motivation for sharing of summary LD statistics to become a widely accepted practice. This is likely to also prove valuable in other settings, such as conditional analysis or rare variant testing [19, 34].
It is often the case that privacy and logistic constraints prohibit the sharing of individual-level data. On the other hand, summary association statistics from large scale association studies are often readily available [13, 30], despite the fact that privacy concerns may extend to summary data [35, 36]. For example, a recent study used publicly available summary association statistics over a wide range of phenotypes to draw inferences about the enrichment of disease-associated variants in several functional categories [13]. Using the methods introduced here, such analyses can be expanded to the set of all 1000 Genomes variants. In particular, we have publicly released imputed association statistics at 1000 Genomes variants for 4 lipid traits. We show that for these 4 lipid traits, 1000 Genomes imputed summary statistics show a consistently larger and more statistically significant signal of enrichment in genic vs. non-genic regions as compared to the original publicly available data. Thus, 1000 Genomes imputed summary statistics can be used to increase power in studies of functional enrichment.

The Gaussian approximation for LD among SNPs has previously been employed in a wide range of problems [14-18]. We showed that an adjustment similar to ridge regression removed the false-positive associations in imputed summary statistics that occurred when unadjusted estimates of LD were used. As reference panels become larger, we expect a smaller adjustment factor to be needed, thus increasing accuracy. Large reference panels of typed SNPs could potentially also be employed to reduce the adjustment factor needed for avoiding false positives [34]. Other recent works have proposed to reduce the computational burden of imputation using a technique similar to matrix completion; however, that approach does not extend to imputation from summary statistics [37].
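
To make the ridge-like adjustment discussed above concrete, the following is a minimal sketch and not the released ImpG implementation. It assumes the standard conditional-mean form of Gaussian imputation, in which the imputed z-score is a weighted sum of the typed z-scores with weights (Σ_tt + λI)^-1 Σ_ti, divided by its standard deviation under the null. The names `solve` and `impute_z`, the argument names, and the default `ridge=0.1` are illustrative assumptions; the exact estimator is the one defined in Methods.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def impute_z(ld_tt, ld_it, z_t, ridge=0.1, ld_tt_null=None):
    """Impute one z-score from typed z-scores under a Gaussian model.

    ld_tt      : LD (correlation) matrix among typed SNPs from the reference panel
    ld_it      : LD between the untyped SNP and each typed SNP
    z_t        : observed z-scores at the typed SNPs
    ridge      : value added to the diagonal of ld_tt (hypothetical default)
    ld_tt_null : LD matrix used for the null variance; defaults to the
                 ridge-adjusted matrix, or pass study LD if it is available
    Returns (normalized z-score, variance of the raw imputed score under the null).
    """
    n = len(z_t)
    adjusted = [[ld_tt[i][j] + (ridge if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    weights = solve(adjusted, ld_it)
    raw = sum(w * z for w, z in zip(weights, z_t))
    null_ld = adjusted if ld_tt_null is None else ld_tt_null
    null_var = sum(weights[i] * null_ld[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    return raw / math.sqrt(null_var), null_var


# Toy window: the untyped SNP is in perfect LD with the second typed SNP.
z_hat, null_var = impute_z([[1.0, 0.5], [0.5, 1.0]], [0.5, 1.0], [1.0, 2.5], ridge=0.0)
```

In this toy window with `ridge=0.0` the sketch simply returns the z-score of the perfectly tagging SNP. With a positive ridge and the adjusted matrix used for normalization, the statistic shrinks slightly; passing a separate `ld_tt_null` illustrates how study LD could replace the adjusted matrix in the normalization. For realistic window sizes a linear algebra library would replace the hand-written solver.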
The work of [17] presented methods for Gaussian imputation from allele frequencies in cases and controls or from individual-level genotypes. There are many key differences between that work and the current study. First, we impute association statistics (i.e. z-scores) rather than allele frequencies. For case-control traits, it is unclear how to use imputed allele frequencies in cases and controls [17] to obtain association statistics that are robust to false positives; for quantitative traits, imputation of allele frequencies does not apply. Thus, the methods and software of [17] cannot be used to impute association statistics, as we have done here. Second, we evaluate our approach in simulations based on 1000 Genomes data [3], assessing both power and false-positive associations. Third, we validate our approach using real empirical data across several GWAS involving both discrete and continuous phenotypes, including the 4 lipid traits for which we have publicly released imputed association statistics at 1000 Genomes variants. We note that recent parallel work has also proposed to use summary statistics with reference panels of haplotypes for imputation ([31]; a related approach is proposed in [38]), but that work does not provide a strategy to address false-positive associations arising from the limited size of the reference panel, as we do here.

We conclude with several limitations for the approaches we presented here. First, when summary LD statistics from the study are not available, our adjustment procedure leads to a slight deflation of association statistics under null data. This could hamper efforts to assess confounding due to population stratification or cryptic relatedness via genomic control [39]. However, it is now widely recognized that genomic control is not an effective approach for assessing confounding in large studies, due to the expected inflation from polygenic effects [22].
Second, application of our approach to summary statistics with inappropriate levels of QC is a potential concern due to the possibility of introducing false positives. Therefore, we caution that appropriate QC should be performed on typed variants prior to estimation of summary association statistics, as is standard procedure in any GWAS. Third, recent work has shown that low-coverage sequencing is a more powerful alternative to genotyping arrays per unit of cost invested [40-43]. The extension of Gaussian imputation to low-coverage sequencing data remains a direction for future work. Finally, as with all imputation approaches, the methods presented here are more accurate for common variants than for low-frequency variants. Accuracy will be even lower at very rare variants, although GWAS involving single-variant associations are generally focused on common and low-frequency variants.

Supplemental Data

Supplemental data includes 10 Tables and 15 Figures.

Acknowledgments

This research was supported by NIH grant R01 HG006399 (B.P., N.Z., N.P., A.L.P.), R03 CA162200 (B.P.) and R01 GM053275 (B.P.). We are grateful to J. Yang and D. Reich for helpful discussions, and to P. de Bakker and the International HIV Controllers Study for sharing data from ref. [40]. We acknowledge use of phenotype and genotype data from the British 1958 Birth Cohort DNA collection, funded by the Medical Research Council grant G0000934 and the Wellcome Trust grant 068545/Z/02 (http://www.b58cgene.sgul.ac.uk/). Genotyping for the B58C-WTCCC subset was funded by the Wellcome Trust grant 076113/B/04/Z.
The B58C-T1DGC genotyping utilized resources provided by the Type 1 Diabetes Genetics Consortium, a collaborative clinical study sponsored by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of Allergy and Infectious Diseases (NIAID), National Human Genome Research Institute (NHGRI), National Institute of Child Health and Human Development (NICHD), and Juvenile Diabetes Research Foundation International (JDRF) and supported by U01 DK062418. B58C-T1DGC GWAS data were deposited by the Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research (CIMR), University of Cambridge, which is funded by Juvenile Diabetes Research Foundation International, the Wellcome Trust and the National Institute for Health Research Cambridge Biomedical Research Centre; the CIMR is in receipt of a Wellcome Trust Strategic Award (079895). The B58C-GABRIEL genotyping was supported by a contract from the European Commission Framework Programme 6 (018996) and grants from the French Ministry of Research.

Web Resources

ImpG-Summary, ImpG-SummaryLD, precomputed weights from the 1000 Genomes reference panel, and imputed summary association statistics at 1000G SNPs for 4 lipid traits: http://bogdanlab.pathology.ucla.edu
Beagle imputation: http://faculty.washington.edu/browning/beagle/beagle.html
HAPI-UR: https://code.google.com/p/hapi-ur/

References

1. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences 106: 9362-9367.
2. Visscher PM, Brown MA, McCarthy MI, Yang J (2012) Five years of GWAS discovery. American Journal of Human Genetics.
3. The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature.
4. Marchini J, Howie B (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11: 499-511.
5. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 44: 955-959.
6. Browning SR, Browning BL (2007) Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. American Journal of Human Genetics 81: 1084-1097.
7. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology 34: 816-834.
8. Estrada K, Styrkarsdottir U, Evangelou E, Hsu Y, Duncan E, et al. (2012) Genome-wide meta-analysis identifies 56 bone mineral density loci and reveals 14 loci associated with risk of fracture. Nat Genet 44: 491-501.
9. Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, et al. (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467: 832-8.
10. Morris A, Voight B, Teslovich T, Ferreira T, Segre A, et al. (2012) Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet 44: 981-990.
11. Liu X, Invernizzi P, Lu Y, Kosoy R, Lu Y, et al. (2010) Genome-wide meta-analyses identify three loci associated with primary biliary cirrhosis. Nat Genet 42: 658-60.
12. Nature Genetics (2012) Asking for more. Nature Genetics 44: 733-733.
13. Schork AJ, Thompson WK, Pham P, Torkamani A, Roddey JC, et al. (2013) All SNPs are not created equal: Genome-wide association studies reveal a consistent pattern of enrichment among functionally annotated SNPs. PLoS Genet 9: e1003449.
14. Zaitlen N, Pasaniuc B, Gur T, Ziv E, Halperin E (2010) Leveraging genetic variability across populations for the identification of causal variants. American Journal of Human Genetics 86: 23-33.
15. Han B, Kang HM, Eskin E (2009) Rapid and accurate multiple testing correction and power estimation for millions of correlated markers. PLoS Genetics 5: e1000456.
16. Conneely KN, Boehnke M (2007) So many correlated tests, so little time! Rapid adjustment of P values for multiple correlated tests. American Journal of Human Genetics 81: 1158-1168.
17. Wen X, Stephens M (2010) Using linear predictors to impute allele frequencies from summary or pooled genotype data. Ann Appl Stat 4: 1158-1182.
18. McPeek MS (2012) BLUP genotype imputation for case-control association testing with related individuals and missing data. Journal of Computational Biology 19: 756-765.
19. Lee S, Teslovich TM, Boehnke M, Lin X (2013) General framework for meta-analysis of rare variants in sequencing association studies. The American Journal of Human Genetics.
20. Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12: 55-67.
21. Hastie T, Tibshirani R, Friedman JJH (2001) The elements of statistical learning, volume 1. Springer New York.
22. Yang J, Weedon MN, Purcell S, Lettre G, Estrada K, et al. (2011) Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics 19: 807-812.
23. Su Z, Marchini J, Donnelly P (2011) HAPGEN2: simulation of multiple disease SNPs. Bioinformatics.
24. Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661-678.
25. Williams AL, Patterson N, Glessner J, Hakonarson H, Reich D (2012) Phasing of many thousands of genotyped samples. The American Journal of Human Genetics 91: 238-251.
26. Strachan DP, Rudnicka AR, Power C, Shepherd P, Fuller E, et al. (2007) Lifecourse influences on health among British adults: Effects of region of residence in childhood and adulthood. International Journal of Epidemiology 36: 522-531.
27. Barrett JC, Clayton DG, Concannon P, Akolkar B, Cooper JD, et al. (2009) Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat Genet 41: 703-7.
28. Moffatt MF, Gut IG, Demenais F, Strachan DP, Bouzigon E, et al. (2010) A large-scale, consortium-based genomewide association study of asthma. New England Journal of Medicine 363: 1211-1221.
29. Aulchenko Y, Struchalin MV, van Duijn CM (2010) ProbABEL package for genome-wide association analysis of imputed data. BMC Bioinformatics 11.
30. Teslovich TM, et al. (2010) Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466: 707-713.
31. Lee D, Bigdeli TB, Riley B, Fanous A, Bacanu SA (2013) DIST: Direct imputation of summary statistics for unmeasured SNPs. Bioinformatics: btt500.
32. Browning BL, Browning SR (2009) A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. The American Journal of Human Genetics 84: 210-223.
33. Marchini J, Howie B (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11: 499-511.
34. Yang J, Ferreira T, Morris AP, Medland SE, Madden PA, et al. (2012) Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nature Genetics 44: 369-375.
35. Sankararaman S, Obozinski G, Jordan MI, Halperin E (2009) Genomic privacy and limits of individual detection in a pool. Nature Genetics 41: 965-967.
36. Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al. (2008) Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 4: e1000167.
37. Chi EC, Zhou H, Chen GK, Del Vecchyo DO, Lange K (2013) Genotype imputation via matrix completion. Genome Research 23: 509-518.
38. Hu YJ, Berndt SI, Gustafsson S, Ganna A, Hirschhorn J, et al. (2013) Meta-analysis of gene-level associations for rare variants based on single-variant statistics. The American Journal of Human Genetics.
39. Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55: 997-1004.
40. Pasaniuc B, Rohland N, McLaren PJ, Garimella K, Zaitlen N, et al. (2012) Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet 44: 631-635.
41. Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR (2011) Low-coverage sequencing: implications for design of complex trait association studies. Genome Research 21: 940-951.
42. Flannick J, Korn JM, Fontanillas P, Grant GB, Banks E, et al. (2012) Efficiency and power as a function of sequence coverage, SNP array density, and imputation. PLoS Comput Biol 8: e1002604.
43. Nielsen R, Paul JS, Albrechtsen A, Song YS (2011) Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics 12: 443-451.

Figure Titles and Legends

Figure 1. HMM-imputed (x-axis) versus ImpG-Summary (y-axis) association statistics (z-scores) for the BD phenotype in WTCCC Data (left) and for the height phenotype in 1958 Birth Cohort Data (right). Results for all other WTCCC phenotypes can be found in Figure S4.

Figure 2. HMM-imputed (x-axis) versus ImpG-Summary (y-axis) association statistics (z-scores) at known associated SNPs from NHGRI GWAS Catalog in WTCCC (left) and height in 1958 Birth Cohort Data (right).

Figure 3. HMM-imputed (x-axis) versus ImpG-Summary (y-axis) association statistics (z-scores) for the TG phenotype in the blood lipids data. Left denotes imputation of 10% of the z-scores using the remaining 90%, while right shows imputation results starting from all variants present on the Illumina 610 array. Results for all blood lipids phenotypes can be found in Figure S13. ImpG-Summary took 4 CPU days for the 10% data and under 10 CPU hours for the array-based imputation.

Figure 4. Average variance per SNP (average association z² - 1) binned by different functional classes for all 4 blood phenotypes. Top displays the absolute numbers attained across the original data and the ImpG-Summary imputation to 1000 Genomes (r2pred > 0.8). Bottom figure shows the absolute difference between original data and 1000 Genomes imputed association statistics.

Tables

Method                          Odds Ratio
                                1.0     1.2     1.5     1.7     2.0
All SNPs
  Beagle                        0.999   0.892   0.872   0.870   0.868
  ImpG-Summary                  0.937   0.835   0.823   0.827   0.836
  ImpG-SummaryLD                0.999   0.872   0.851   0.852   0.855
Common SNPs (over 5%)
  Beagle                        0.999   0.900   0.885   0.883   0.881
  ImpG-Summary                  0.956   0.850   0.841   0.845   0.855
  ImpG-SummaryLD                0.999   0.882   0.867   0.868   0.872
Low frequency SNPs (1 to 5%)
  Beagle                        0.997   0.808   0.667   0.640   0.620
  ImpG-Summary                  0.881   0.685   0.539   0.512   0.491
  ImpG-SummaryLD                0.997   0.768   0.597   0.565   0.542

Table 1. Relative effective sample size at imputed SNPs (ratio of the average χ² association statistics attained at imputed versus typed SNPs) in simulated case-control studies at different effect sizes (R > 1). The column corresponding to R=1 shows the average χ² association statistic under the null model of no association.

Method                          Rand    Great Britain
All SNPs
  Beagle                        0.872   0.868
  ImpG-Summary                  0.823   0.818
  ImpG-SummaryLD                0.851   0.845
Common SNPs (over 5%)
  Beagle                        0.885   0.880
  ImpG-Summary                  0.841   0.835
  ImpG-SummaryLD                0.867   0.860
Low frequency SNPs (1 to 5%)
  Beagle                        0.667   0.671
  ImpG-Summary                  0.539   0.549
  ImpG-SummaryLD                0.597   0.603

Table 2. Relative effective sample size at imputed SNPs (ratio of the average χ² association statistics attained at imputed versus typed SNPs) when imputation is performed in a random subsample of the 1000 Genomes European data or over only Great Britain haplotypes (odds ratio is set to 1.5).

Method                          N=1,000   N=10,000   N=50,000
IMPUTE1                         893.8     8,937.5    44,687.5
IMPUTE2 (sampling)              100       1,000      5,000
IMPUTE2 (pre-phasing)           4.2       41.7       208.3
IMPUTE2 (pre-phasing)*          21.5      215.3      1,076.4
Beagle                          250       2,500      12,500
ImpG-Summary                    0.4       0.4        0.4
ImpG-SummaryLD                  0.4       0.4        0.4

Table 3. Estimated runtimes for 1000 Genomes imputation for various numbers of individuals (N) in imputation. Runtimes are given in central processing unit (CPU) days needed to impute across the whole genome (11.6 million SNPs polymorphic in Europeans). Runtimes for all versions of IMPUTE are extrapolated from Howie et al. 2012 [5]. IMPUTE2 (pre-phasing)* includes GWAS phasing time of 25 minutes per individual [5]. Beagle runtime is extrapolated from an average of 3 CPU hours runtime for N=300 samples across a 5Mb window in the genome. ImpG-Summary takes under 10 hours for imputation starting from 600k typed variants and under 4 CPU days for imputation from 2M typed variants with no pre-computation.

Phenotype                       Number of SNPs   HMM χ²   ImpG-Summary χ²   Ratio
Bipolar disorder (BD)           9                7.02     6.66              0.95
Coronary heart disease (CAD)    32               13.39    13.11             0.98
Crohn's disease (CD)            70               20.78    19.74             0.95
Hypertension (HT)               7                4.47     3.95              0.88
Rheumatoid arthritis (RA)       22               20.36    19.00             0.93
Type 1 diabetes (T1D)           36               36.39    34.98             0.96
Type 2 diabetes (T2D)           51               12.08    11.45             0.95
All WTCCC                       227              19.17    18.28             0.95
Height 1958 Birth Cohort        176              4.55     4.76              1.05

Table 4. Average association statistics (χ²) over known associated SNPs from NHGRI GWAS Catalog for the 8 studied phenotypes. Excluding the HLA region (chr6:20-35Mb), the average across the remaining 216 SNPs in the WTCCC data is 16.02 for HMM versus 15.29 for ImpG-Summary.
