Partitioning heritability using GWAS summary statistics with LD Score regression
Hilary Finucane, ASHG 2014 talk
Partitioning heritability by functional annotation using summary statistics
Hilary Finucane
MIT Department of Mathematics
HSPH Department of Epidemiology
October 21, 2014
Acknowledgements
• Brendan Bulik-Sullivan
• Alkes Price
• Ben Neale
• Alexander Gusev
• Nick Patterson
• Po-Ru Loh
• Gosia Trynka
• Han Xu
• Verneri Anttila
• Yakir Reshef
• Chongzhi Zang
• Stephan Ripke
• Schizophrenia Working Group of the PGC
• Shaun Purcell
• Mark Daly
• Eli Stahl
• Soumya Raychaudhuri
• Sara Lindstrom
Partitioning heritability by functional annotation is an important goal
• Learn about genetic architecture of disease
  – Where does the heritability lie?
• Learn about disease biology
  – What are the relevant cell types?
• Learn about the functional annotations
  – Which functional annotations show the highest enrichments?
• Downstream applications
  – Fine mapping
  – Risk prediction
  – GWAS priors
Maurano et al. 2012 Science; Trynka et al. 2013 Nat Genet; Pickrell 2014 AJHG
What is partitioned heritability?
• Our model is
  Y = \sum_j X_j \beta_j + \varepsilon
  where
  – Y is an individual’s phenotype,
  – X_j is an individual’s genotype at the j-th SNP (normalized to mean 0 and variance 1),
  – \beta_j is the effect of SNP j, and
  – \varepsilon is noise and random environmental effects.
• We define heritability as
  h^2 = \sum_j \beta_j^2
  and the heritability of a category C as
  h^2(C) = \sum_{j \in C} \beta_j^2
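As a small illustration of these definitions, the sketch below takes heritability to be the sum of squared per-SNP effects and the heritability of a category to be the same sum restricted to the SNPs in that category. The SNP count, the category, and the effect sizes are all hypothetical values chosen for the example; none of them come from the talk.

```python
import random

random.seed(0)

M = 1000                      # number of SNPs (hypothetical)
category = set(range(100))    # hypothetical category C: the first 10% of SNPs

# Hypothetical effects: larger standard deviation inside the category.
beta = [random.gauss(0.0, 0.04 if j in category else 0.02) for j in range(M)]

def heritability(beta, snps=None):
    """Sum of squared effects over all SNPs, or over the index set `snps`."""
    indices = range(len(beta)) if snps is None else snps
    return sum(beta[j] ** 2 for j in indices)

h2 = heritability(beta)
h2_C = heritability(beta, category)
h2_rest = heritability(beta, set(range(M)) - category)

print(f"h2 = {h2:.3f}, h2(C) = {h2_C:.3f}, share in C = {h2_C / h2:.1%}")
```

Because the sum runs over disjoint sets of SNPs, h2(C) and the heritability of the remaining SNPs add up to the total.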
Partitioning heritability using variance components has yielded many insights
• 31% of schizophrenia SNP-heritability lies in CNS+ gene regions spanning 20% of the genome [1].
• 28% of Tourette syndrome SNP-heritability and 29% of OCD SNP-heritability lie in parietal lobe eQTLs spanning 5% of the genome [2].
• 79% of SNP-heritability, averaged across WTCCC and WTCCC2 traits, lies in DHS regions spanning 16% of the genome [3].
[1] Lee et al. 2012 Nat Genet
[2] Davis et al. 2013 PLoS Genet
[3] Gusev et al. in press AJHG
A method for partitioning heritability from summary statistics is needed
• Variance components methods are intractable at very large sample sizes.
• There is lots of information in large meta-analyses.
• Lots of publicly available summary statistics allow us to compare many phenotypes and many annotations to get a big picture.
Our method partitions heritability from summary statistics
• Input:
  – Sample size and p-value for every SNP tested in a large GWAS of a quantitative or case-control trait
  – LD information from a reference panel like 1000G
  – Genome annotation of interest
  – Other genome annotations to include in the model.
• Output:
  – Estimated proportion of heritability that falls within the annotation of interest.
  – Enrichment = (% of heritability) / (% of SNPs)
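The enrichment ratio is simple arithmetic, and a short sketch makes the scale concrete. The function name `enrichment` is a hypothetical helper, and the two examples reuse proportions quoted elsewhere in this talk (79% of heritability in DHS regions covering 16% of the genome, and 30% of heritability in conserved regions covering 2.6% of SNPs).

```python
def enrichment(prop_h2, prop_snps):
    """Enrichment = (proportion of heritability) / (proportion of SNPs)."""
    if not 0 < prop_snps <= 1:
        raise ValueError("prop_snps must be in (0, 1]")
    return prop_h2 / prop_snps

# Proportions quoted in the talk, written as fractions.
dhs = enrichment(0.79, 0.16)
conserved = enrichment(0.30, 0.026)

print(f"DHS enrichment: {dhs:.1f}x")
print(f"Conserved enrichment: {conserved:.1f}x")
```

An enrichment of 1 means the category explains exactly its share of SNPs; values above 1 indicate enrichment and values below 1 indicate depletion.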
Outline
• Description of method
• Validation on simulated data
• Results on real data
LD is important for summary statistics-based methods
• Some SNPs have a lot of LD to other SNPs in the same category.
• Some SNPs have a lot of LD to SNPs in other categories.
• Some SNPs do not have a lot of LD to other SNPs.
Our solution: LD Score Regression.
See Bulik-Sullivan et al. bioRxiv (under revision, Nat Genet) and ASHG 2014 poster 1787T (Bulik-Sullivan).
LD Score Regression: basic intuition
[Figure: chi-square statistics of SNPs in a high LD region and in a low LD region]
• Polygenicity causes more chi-square statistic inflation in high LD regions than in low LD regions.
• Mean chi-square for high LD region: high
• Mean chi-square for low LD region: low
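The intuition can be checked with a toy simulation. The sketch below is only an illustration under stated assumptions: Gaussian variables stand in for genotypes, a shared latent factor per block creates LD in one region, every SNP is causal with the same effect variance, and all sizes (N, block size, number of blocks, the 0.8 correlation) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, block, n_blocks = 2000, 20, 25   # individuals, SNPs per block, blocks per region
h2 = 0.5                            # heritability targeted by the simulation

def make_region(rho):
    """Gaussian stand-ins for genotypes; SNPs in a block correlate at about rho."""
    shared = np.repeat(rng.standard_normal((N, n_blocks)), block, axis=1)
    noise = rng.standard_normal((N, n_blocks * block))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise

# First half of the columns: high LD region. Second half: low LD region.
X = np.hstack([make_region(0.8), make_region(0.0)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
M = X.shape[1]

# Polygenic architecture: every SNP is causal, with equal effect variance.
beta = rng.normal(0.0, np.sqrt(h2 / M), size=M)
y = X @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=N)
y = (y - y.mean()) / y.std()

# Marginal association statistic per SNP: N times the squared correlation.
chi2 = N * (X.T @ y / N) ** 2

mean_high = chi2[: M // 2].mean()
mean_low = chi2[M // 2 :].mean()
print(f"mean chi-square, high LD region: {mean_high:.2f}")
print(f"mean chi-square, low LD region:  {mean_low:.2f}")
```

Each SNP in the high LD region tags the effects of its correlated neighbours as well as its own, which is why its statistics are inflated relative to the low LD region.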
Multivariate LD Score Regression: basic intuition
• Enriched category: BIG difference between lots of LD vs little LD to the category (high chi-square with lots of LD, low chi-square with little LD).
• Depleted category: SMALL difference between lots of LD vs little LD to the category (low chi-square in both cases).
Multivariate LD Score regression allows us to partition SNP heritability
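• Multivariate LD Score: for SNP j and category C, \ell(j, C) = \sum_{k \in C} r^2_{jk}, the sum over all SNPs k in the category of r^2 with SNP j.
• Derivations based on a polygenic model give:
  E[\chi^2_j] = N \sum_C \tau_C \ell(j, C) + N a + 1
  where N is the sample size, \tau_C is the per-SNP contribution of category C to heritability, and a measures confounding bias.
• Easily extends to overlapping categories.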
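A minimal sketch of the multivariate LD Score computation follows, assuming a small reference-panel genotype matrix that fits in memory. The function name `ld_scores` and the 0/1 membership matrix are hypothetical conveniences, and the sketch sums r^2 over all SNP pairs with the plain sample r^2; it makes no attempt at the windowing or bias correction that a production tool would need.

```python
import numpy as np

def ld_scores(genotypes, membership):
    """Multivariate LD Scores.

    genotypes:  (N_ref, M) array, one column per SNP.
    membership: (M, C) 0/1 array; membership[k, c] = 1 if SNP k is in category c.
                A SNP may belong to several categories (overlap is allowed).
    Returns an (M, C) array whose entry [j, c] is the sum over SNPs k in
    category c of r^2 between SNP j and SNP k.
    """
    X = np.asarray(genotypes, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    r = X.T @ X / X.shape[0]          # (M, M) correlation matrix
    return (r ** 2) @ np.asarray(membership, dtype=float)

# Toy reference panel: SNPs 0 and 1 are identical, SNP 2 is uncorrelated with both.
genotypes = np.array([[1, 1, 1],
                      [-1, -1, 1],
                      [1, 1, -1],
                      [-1, -1, -1]])
# Two overlapping categories: {SNP 0, SNP 1} and {SNP 1, SNP 2}.
membership = np.array([[1, 0],
                       [1, 1],
                       [0, 1]])

ell = ld_scores(genotypes, membership)
print(ell)
```

Because category membership enters only through the 0/1 matrix, overlapping categories need no special handling: a SNP in two categories simply contributes to both columns.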
Multivariate LD Score regression allows us to partition SNP heritability
To estimate partitioned heritability:
• Estimate LD Scores from a reference panel.
• Regress chi-square statistics on LD Scores.
• The slopes give the partitioned heritability.
• For best results, use many categories!
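The regression step can be sketched on synthetic data. The example assumes that the expected chi-square of SNP j is 1 + N times the sum over categories of tau_C times the LD Score of SNP j to category C, with tau_C the per-SNP heritability of category C and no confounding term. The LD Scores and chi-square statistics here are drawn from arbitrary distributions rather than computed from genotypes, the two categories are disjoint, and plain least squares replaces whatever weighting and standard-error machinery a real analysis would use.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 100_000, 50_000                  # SNPs and GWAS sample size (hypothetical)

in_A = np.arange(M) < M // 5            # category A: 20% of SNPs; B: the rest
counts = np.array([in_A.sum(), (~in_A).sum()])

h2_true = np.array([0.2, 0.3])          # hypothetical h2(A), h2(B)
tau_true = h2_true / counts             # per-SNP heritability in each category

# Synthetic LD Scores to each category (arbitrary gamma draws).
ell = np.column_stack([rng.gamma(2.0, 2.5, M), rng.gamma(4.0, 5.0, M)])

# Chi-square statistics with the assumed expectation 1 + N * sum_C tau_C * ell(j, C).
expected = 1.0 + N * ell @ tau_true
chi2 = expected * rng.chisquare(1, M)

# Regress chi-square on LD Scores; slopes divided by N estimate tau_C.
design = np.column_stack([np.ones(M), ell])
coef, *_ = np.linalg.lstsq(design, chi2, rcond=None)
tau_hat = coef[1:] / N

# With disjoint categories, h2(C) is the number of SNPs in C times tau_C.
h2_hat = counts * tau_hat
prop_h2 = h2_hat / h2_hat.sum()
enrichment = prop_h2 / (counts / M)

print("estimated h2 per category:", np.round(h2_hat, 3))
print("estimated enrichment:     ", np.round(enrichment, 2))
```

In this setup category A holds 40% of the heritability in 20% of the SNPs, so its estimated enrichment should land near 2.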
Multivariate LD Score regression works in simulations
• Standard errors are over 100 simulations.
• Simulated quantitative phenotype with h2 = 0.5.
• M = 110,444, N = 2,713

Simulation | True h2(DHS) | REML (2 cat) | LD Score
Null simulations | 0.092 | 0.089 (0.006) | 0.086 (0.012), 27 cat
DHS 3x enriched | 0.276 | 0.281 (0.006) | 0.278 (0.013), 27 cat
FANTOM5 Enhancer* causal | 0.379 | 0.531 (0.007) | 0.361 (0.015), 27 cat
FANTOM5 Enhancer* causal, excluded from the model | 0.379 | 0.531 (0.007) | 0.318 (0.014), 26 cat

* Andersson et al. 2014 Nature
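To make the size of the differences easier to read, the sketch below computes the relative bias, (estimate - truth) / truth, for the FANTOM5 Enhancer causal simulation. The numbers are copied from the slide; the dictionary keys are labels chosen for this example.

```python
true_h2_dhs = 0.379   # true h2(DHS) in the FANTOM5 Enhancer causal simulation

estimates = {
    "REML (2 cat)": 0.531,
    "LD Score (27 cat)": 0.361,
    "LD Score (26 cat, FANTOM5 Enhancer excluded)": 0.318,
}

# Relative bias of each estimate with respect to the simulated truth.
rel_bias = {name: (est - true_h2_dhs) / true_h2_dhs
            for name, est in estimates.items()}

for name, bias in rel_bias.items():
    print(f"{name}: {bias:+.1%}")
```

The two-category REML estimate overshoots by a wide margin when the causal category is not modelled, while the LD Score estimates stay closer to the truth, more so when the causal category is included in the model.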
Datasets analyzed
Phenotype | Citation | Sample size
Schizophrenia | SCZ working grp of the PGC, 2014 Nature | 70,100
Bipolar Disorder | Bip working grp of the PGC, 2011 Nat Genet | 16,731
Rheumatoid Arthritis* | Okada et al., 2014 Nature | 38,242
Crohn’s Disease* | Jostins et al., 2012 Nature | 20,883
Ulcerative Colitis* | Jostins et al., 2012 Nature | 27,432
Height | Wood et al., 2014 Nature Genetics | 253,280
BMI | Speliotes et al., 2010 Nature Genetics | 123,865
Coronary Artery Disease | Schunkert et al., 2011 Nature Genetics | 86,995
College (yes/no) | Rietveld et al., 2013 Science | 126,559
Type 2 Diabetes | Morris et al., 2012 Nature Genetics | 69,033
* HLA locus excluded from all analyses for autoimmune traits
Annotations used
Mark | Source/reference
Coding, 3’ UTR, 5’ UTR, Promoter, Intron | UCSC; Gusev et al., in press AJHG
Digital Genomic Footprint, TFBS | ENCODE; Gusev et al., in press AJHG
CTCF binding site, Promoter Flanking, Repressed, Transcribed, TSS, Enhancer, Weak Enhancer | ENCODE; Hoffman et al., 2012 Nucleic Acids Research
DHS, fetal DHS, H3K4me1, H3K4me3, H3K9ac | Trynka et al., 2013 Nature Genetics*
Conserved | Lindblad-Toh et al., 2011 Nature
FANTOM5 Enhancer | Andersson et al., 2014 Nature
lincRNAs | Cabili et al., 2011 Genes Dev
DHS and DHS promoter | Maurano et al., 2012 Science
H3K27ac | Roadmap; PGC2 2014 Nature
* Post-processed from ENCODE and Roadmap data by S. Raychaudhuri and X. Liu labs
Coding, Intergenic, Enhancer, H3K4me3, and DHS enrichments in six phenotypes
(Bars indicate 95% confidence intervals)
Coding, Intergenic, Enhancer, H3K4me3, DHS, and Conserved* enrichments in six phenotypes
(Bars indicate 95% confidence intervals)
* Lindblad-Toh et al., 2011 Nature
Coding, Intergenic, Enhancer, H3K4me3, DHS, and FANTOM5 Enhancer* enrichments in six phenotypes
(Bars indicate 95% confidence intervals)
* Andersson et al., 2014 Nature
Cell-type specific H3K27ac enrichments inform trait biology
• We group 56 cell types into 7 basic categories.
• For each trait (10 traits)
  – For each category (7 categories)
    • We assess the significance of improvement to the model from adding that category.
Conclusions
• Many annotations are enriched in many phenotypes.
• Conserved regions, 2.6% of SNPs, are estimated to explain 30% of heritability on average.
• FANTOM5 Enhancers are extremely enriched in autoimmune traits.
• H3K27ac cell-type enrichment matches and extends our understanding of disease biology.