1
EstimationofgeneticvariationandSNP-heritabilityfromGWASdata
• DenseSNPpanelsallowtheestimationoftheexpectedgeneticcovariancebetweendistantrelatives
• AmodelbaseduponestimatedrelationshipsfromSNPsisequivalenttoamodelfittingallSNPssimultaneously
• ThetotalgeneticvarianceduetoLDbetweencommonSNPsand(unknown)causalvariantscanbeestimated
• GeneticvariancecapturedbycommonSNPscanbepartitionedacrossthegenome
• DifferentmethodstoestimaterelatednessfromSNPsassumedifferentgenetictraitarchitectures
2
Keyconcepts
EstimationofSNP-heritabilityfromGWASdata
Background– 2008:GWASwasperceivedbymanytohavefailedasanexperimentaldesign
– Missingheritability:discrepancybetweenpedigreeheritabilityandvariancecapturedbyassociatedSNPs
3
Disease Number of loci
Percent of Heritability Measure Explained
Heritability Measure
Age-related macular degeneration
5 50% Sibling recurrence risk
Crohn’s disease 32 20% Genetic risk (liability)
Systemic lupus erythematosus
6 15% Sibling recurrence risk
Type 2 diabetes 18 6% Sibling recurrence risk
HDL cholesterol 7 5.2% Phenotypic variance
Height 40 5% Phenotypic variance
Early onset myocardial infarction
9 2.8% Phenotypic variance
Fasting glucose 4 1.5% Phenotypic variance
WhereistheDarkMatter?
4
Hypothesistestingvs.Estimation
GWAS=hypothesistesting– Stringentp-valuethreshold– Estimatesofeffectsbiased(“Winner’sCurse”)
Canweestimate thetotalproportionofvariationaccountedforbyallSNPs?
5
Amodelforasinglecausalvariant
AA AB BBfrequency (1-p)2 2p(1-p) p2
x 0 1 2effect 0 b 2bw =[x-E(x)]/sx -2p/√{2p(1-p)} (1-2p)/√{2p(1-p)} 2(1-p)/√{2p(1-p)}
yj = µ’ +xijbi +ej x=0,1,2{standardassociationmodel}
yj = µ +wijuj +ej u=bsx;µ =µ’+bsx
6
yj = µ +Swijuj +ej
= µ +gj +ej
y = µ1 +g +e
= µ1 +Wu +e
7
Weighting scheme 1
Multiple(M)causalvariants
Letubearandomvariable,u~N(0,su2)
Thensg2 =Msu
2
var(y)=WW’ su2 +Ise
2
=WW’(sg2/M)+Ise
2
=Gsg2 +Ise
2
8
Model with individual genome-wide additive values using relationships (G) at the causal variants is equivalent to a model
fitting all causal variants
We can estimate genetic variance just as if we would do using pedigree relationships
Equivalence
IfweestimateG fromSNPs:– loseinformationduetoimperfectLDbetweenSNPsandcausalvariants
– howmuchwelosedependson• densityofSNPs• allelefrequencyspectrumofSNPsvs.causalvariants
– estimateofvarianceàmissingheritability
9
G fromMSNPs:
Gjk =(1/M)S {xij – 2pi)(xik – 2pi)/{2pi(1-pi)}
=(1/M)S wijwik
Butwedon’thavethecausalvariants
• EstimaterealisedrelationshipmatrixfromSNPs• Estimateadditivegeneticvariance
y =Xb +e =Wu +e,var(y)=Gsg2 +Ise
2
Gjk =(1/M)S {xij – 2pi)(xik – 2pi)/{2pi(1-pi)}=(1/M)S wijwik
• Basepopulation=currentpopulation• Weightingscheme1 10
Methods(Yangetal.2010)
var(y) =V =Gσ g2 + Iσ e
2
y standardised~N(0,1)
Nofixedeffectsotherthanmean
G estimatedfromSNPs
Residualmaximumlikelihood(REML)
11
Statisticalanalysis
h2 ~ 0.5 (SE 0.1)
12
Results
[Yang et al. 2010, Nature Genetics]
13[Visscher et al. 2010, Twin Research and Human Genetics]
Checkingforpopulationstructure
GeneticvarianceassociatedwithallSNPscanbeestimatedfromGWASdata
– useSNPstoestimateG– usephenotypeson“unrelated”individualsandGtoestimategeneticvariance
Empiricalresults:mostadditivegeneticvariationforheightiscapturedbycommonSNPs
– little‘missing’heritability– GWASworksfine
14
ConclusionsYangetal.2010
y = mean + g1 + g2 + g3 + g4 + g5 + evar(gi) = (WiWi’/Mi)σi
2 for SNPs in group i
Examples of groupings:• chromosome• genome annotation• MAF• LD
15
Partitioning of genetic variation
IfwecanestimatethevariancecapturedbySNPsgenome-wide,weshouldbeabletopartitionitandattributevariancetoregionsofthegenome
“Populationbasedlinkageanalysis”
Application(2):partitioningvariation
16
17
Exampleonquantitativetraits
[Yang et al. 2011, Nature Genetics]
1
2
3
4
5
6
789
101112
13
14
15
16
17
1819
20
21
22
0
0.01
0.02
0.03
0.04
0.05
0 50 100 150 200 250
Varia
nceexplainedbyeachchromosom
e
Chromosomelength(Mb)
Slope=1.6×10-4
P =1.4×10-6
R2 =0.695
Height(combined)
12
34
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
2122
0
0.005
0.01
0.015
0.02
0.025
0 50 100 150 200 250
Varia
nceexplainedbyeachchromosom
e
Chromosomelength(Mb)
Slope=2.3×10-5
P =0.214R2 =0.076
BMI(combined)
18
longerchromosomesexplainmorevariation
Partitioning onchromosomes
[Yang et al. 2011, Nature Genetics]
1
234
5
6
7
8
9
10
11
12
13 14
15
16
17
1819
20
21 22
R²=0.511
0.000
0.002
0.004
0.006
0.008
0.010
0.00 0.01 0.02 0.03 0.04 0.05
Varia
nceexplainedbyGIANTheightSNPson
eachch
romosom
e
Varianceexplainedbyeachchromosome
Height (11,586 unrelated)
1
2
34
5
6
7
8
9
10
11
12
13
14
1516
17
18
19
20
2122
0.000
0.004
0.008
0.012
0.016
0.020
0.000 0.004 0.008 0.012 0.016 0.020
Varia
nceexplainedbychrom
osom
e(adjustedfortheFTO
SNP)
Varianceexplainedbychromosome(noadjustment)
BMI(11,586unrelated)
FTO
19
ResultsareconsistentwithreportedGWAS
[Yang et al. 2011, Nature Genetics]
1
234
5
6
78
9
1011
12
13
14
151617
18
192021220.00
0.02
0.04
0.06
0.08
0.10
0.12
0.14
0.16
0 50 100 150 200 250
Varia
nceexplainedbyeachchromosom
e
Chromosomelength(Mb)
Slope=6.9×10-5
P =0.524R2 =0.021
vWF(ARIC)
20
Inferencerobustwithrespecttogeneticarchitecture
1
2
3
4
5
6
7
8
9
1011
12
13
14
151617
18
192021
22
0.00
0.03
0.06
0.09
0.12
0.15
0.00 0.03 0.06 0.09 0.12 0.15
Varia
nceexplainedbychrom
osom
e(adjustedfortheABO
SNP)
Varianceexplainedbychromosome(noadjustment)
vWF(6,662unrelated)
ABO
[Yang et al. 2011, Nature Genetics]
0
0.01
0.02
0.03
0.04
0.05
0.06
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Varia
nceexplained
Chromosome
intergenic(± 20Kb)
genic(± 20Kb)
Height(combined)17,277proteincoding geneshGg
2 =0.328(s.e.=0.024)hGi
2 =0.126(s.e.=0.022)Coverageofgenicregions=49.4%P(observedvs.expected)=2.1x10-10
0
0.005
0.01
0.015
0.02
0.025
0.03
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Varia
nceexplained
Chromosome
intergenic(± 20Kb)
genic(± 20Kb)
BMI(combined)17,277proteincoding geneshGg
2 =0.117(s.e.=0.023)hGi
2 =0.047(s.e.=0.022)Coverageofgenicregions=49.4%P(observedvs.expected)=0.022
21
Genic regionsexplainvariationdisproportionately
[Yang et al. 2011, Nature Genetics]
Partitioning ongenomeannotation
Application(3):Usingimputedsequencedata
HowmuchinformationisgainedbyusingSNParraydataimputedtoafullysequencedreference?
Howmuchislostrelativetowholegenomesequencing?
PartitionvariationaccordingtoMAFandLD
22[Yang et al. 2015 Nature Genetics]
AccountingforLDandMAFspectrumallowsunbiasedestimationofgeneticvariance
23
0.0#
0.2#
0.4#
0.6#
0.8#
1.0#
1.2#
Random# More#common# Rarer# Rarer#&#DHS#
Herita
bility*es,m
ate*
GREML:SC#GREML:MS#LDAK#LDAK:MS#LDres#LDres:MS#GREML:LDMS#
[Yang et al. 2015 Nature Genetics]
0.0#
0.2#
0.4#
0.6#
0.8#
1.0#
Random# More#common# Rarer# Rarer#&#DHS#
Heritab
ility*es
,mate*
7MAF_4LD#
7MAF_3LD#
7MAF_2LD#
2MAF_2LD#
Verylittledifferencein“taggability”betweenSNPchips
24
Genetic variation captured after imputation:96% due to common variants73% due to rare variants
0.0#
0.2#
0.4#
0.6#
0.8#
1.0#
0# 0.1# 0.2# 0.3# 0.4# 0.5# 0.6# 0.7# 0.8# 0.9#
Prop
or%o
n'of'varia%o
n'captured
'
Imputa%on'R2'threshold'
Common#1#Affymetrix#6#
Common#1#Affymetrix#Axiom#
Common#1#Illumina#OmniExpress#
Common#1#Illumina#Omni2.5#
Common#1#Illumina#CoreExome#
Rare#1#Affymetrix#6#
Rare#1#Affymetrix#Axiom#
Rare#1#Illumina#OmniExpress#
Rare#1#Illumina#Omni2.5#
Rare#1#Illumina#CoreExome#
[Yang et al. 2015 Nature Genetics]
n=45kdataonheightandBMI
25
Totals~60% for height~30% for BMI
0.00#
0.05#
0.10#
0.15#
0.20#
0.25#
<#0.1# 0.1#~#0.2# 0.2#~#0.3# 0.3#~#0.4# 0.4#~#0.5#
Varia
nce(explaine
d(
MAF(stra2fied(variant(group(
Height# BMI#
[Yang et al. 2015 Nature Genetics]
0.00#
0.02#
0.04#
0.06#
0.08#
0.10#
0.12#
0.14#
2.5e+5#~#0.001#
0.001#~#0.01#
0.01#~#0.1#
0.1#~#0.2#
0.2#~#0.3#
0.3#~#0.4#
0.4#~#0.5#
Varia
nce(explaine
d(
MAF(
1st#quar4le#(low#LD)#
2nd#quar4le#
3rd#quar4le#
4th#quar4le#(high#LD)#
0.00#
0.01#
0.02#
0.03#
0.04#
0.05#
0.06#
0.07#
2.5e,5#~#0.001#
0.001#~#0.01#
0.01#~#0.1#
0.1#~#0.2#
0.2#~#0.3#
0.3#~#0.4#
0.4#~#0.5#
Varia
nce(explaine
d(
MAF(
100%
80%
45%
16%
SlidebyRobertMaier 26
h2 overestimation?untaggedrarevariants?
better tagging of ungenotyped variants
samplesize/power
Partitioningvarianceofheight
TotalvarianceHeritability (based on Twin or family studies)SNP heritability from imputation to sequenced referenceSNP-heritability (variance explained by all genotyped SNPs ontheChip)VarianceexplainedbygenomewidesignificantSNPs
missingheritability60%
Estimated relatedness and trait architecture
𝐺"#$ = (1
∑ 2𝑝+ 1 − 𝑝+ -./0+1-
)3 𝑧+#𝑧+$ 2𝑝+ 1 − 𝑝+ /0
+1-
If G describes the genetic covariance between individuals [var(g) = Gsg
2], then what is the equivalent linear model in terms of SNP effects?
27
Equivalent models
𝐲 = 𝟏𝜇 + ∑ 𝐗#𝛽#;# + e
𝛽#~𝑁(0, 2𝑝#(1 − 𝑝#)/ 𝜎AB)
ℎ#B = 2𝑝# 1 − 𝑝# 𝐸 𝛽B = 2𝑝#(1 − 𝑝#)-./𝜎AB
28
S = -1
𝛽#~𝑁(0, 2𝑝#(1 − 𝑝#F- 𝜎AB)
ℎ#B = 2𝑝#(1 − 𝑝#)G𝜎AB = 𝜎AB
• Weighting scheme 1• All SNPs contribute equally to heritability• Rare variants have bigger effects• “Purifying selection model”
29
S = 0
𝛽#~𝑁 0, 2𝑝# 1 − 𝑝#G 𝜎AB ~𝑁(0, 𝜎AB)
ℎ#B = 2𝑝#(1 − 𝑝#)-𝜎AB
• Weighting scheme 2• Common SNPs contribute more to heritability• Rare and common variants have same effects• “Neutral model”
30
Weighting scheme and genetic architecture
• Weighting schemes 1 and 2 can be justified in two ways:– IBD vs IBS– A priori assumption about the relationship
between allele frequency of effect size (natural selection)
• Can we estimate genetic architecture from the data?
31
Bayesian mixture model (BayesS)
𝒚 = 𝟏𝜇 +3 𝑿#𝛽#�
#+ 𝒆
𝛽#~𝑁 0, 2𝑝#𝑞#/𝜎AB 𝜋 + 𝜙 1 − 𝜋
• S measures the relationship between effect size and MAF 𝑝#– S = 0: independence– S < 0: negatively related (rare variant tends to have large effect)– S > 0: positively related (common variant tends to have large effect)– GCTAdefault:S=-1
• 𝜋 is the polygenicity (proportion of SNPs with non-zero effects)
• ℎ/_`B = 𝑉𝑎𝑟 𝑔 /𝜎fB where g = ∑ 𝑿#𝛽#�#
• Simultaneously estimate SNP effects and genetic architecture parameters using MCMC
• Account for LD between SNPs
Zeng …. Yang 2017 (BioRxiv)
Direction of S distinguishes stabilising selection from directional and disruptive selection
S = 0 S = 0.14 S = 0.13 S = −0.19 S = 0.09
S = 3.70
Neutral Directional (+) Directional (−) Stabilizing Disruptive
0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.5
1.0
1.5
2.0
2.5
MAF Bin
Varia
nce
of C
oded
Alle
le E
ffect
s
𝛽#~𝑁 0, 1 + 2𝑝#𝑞#/𝜎AB
Zeng …. Yang 2017 (BioRxiv)
Height vs. BMI
Polygenicity
Heritability
S
0.04 0.06 0.08 0.10
0.25 0.30 0.35 0.40 0.45 0.50 0.55
−0.5 −0.4 −0.3 −0.20
5
10
15
20
0
30
60
90
0
100
200
300
Value
Density
Height
BMI
• Both height and BMI have
been under selection
• Selection has been stronger for
height than BMI-associated
SNPs
• Height is more heritable than
BMI.
• BMI is more polygenic than
height.
Posterior distribution of genetic architecture parameters
Zeng …. Yang 2017 (BioRxiv)
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
S Heritability Polygenicity
−0.75 −0.50 −0.25 0.00 0.25 0.1 0.2 0.3 0.4 0.5 0.00 0.05 0.10 0.15Fluid intelligence score
Neuroticism scoreMDD
Birth weightBody fat percentage
Diastolic blood pressureBMI
WeightT2D
BaldnessSystolic blood pressure
Peak expiratory flowEducational attainment
Forced expiratory volumeBasal metabolic rate
Hand grip strength leftAge menarche
Age at first live birthHeel BMD T score
Heel QUIHand grip strength right
Forced vital capacityHeight
HCadjBMIMean time to correctly identify matches
WHRadjBMIWCadjBMIPulse rate
Age at menopause
Highest Probability Density
40000
60000
80000
100000
120000N
29 traits in UKB
Zeng …. Yang 2017 (BioRxiv)
24 traits with significant 𝑆h
0.00
0.25
0.50
0.75
1.00
0.0 0.1 0.2 0.3 0.4 0.5MAF
Cum
ulat
ive g
enet
ic v
aria
nce
expl
aine
d
Age at menopause
Pulse rate
Mean time to correctly identify matches
WCadjBMI
Age at first live birth
WHRadjBMI
Hand grip strength right
HCadjBMI
Hand grip strength left
Forced vital capacity
Age menarche
Educational attainment
Forced expiratory volume
Basal metabolic rate
Systolic blood pressure
Heel QUI
Peak expiratory flow
Heel BMD T score
Height
Weight
Diastolic blood pressure
BMI
Body fat percentage
Baldness
●
●
●
●
●
●
●
●
●●
●●
●
●
●● ●
●
●
●
●
●
●
●
0.52
0.54
0.56
0.58
0.60
0.3 0.4 0.5 0.6
Absolute value of S
AUC
Zeng …. Yang 2017 (BioRxiv)
0
5
10
15
−0.6 −0.3 0.0 0.3 0.6Posterior Mode
Cou
nt
S
Heritability
Polygenicity
●
●
●●● ● ●●● ●●
●●● ●●●●●
●● ●
●●
●
●
●●
●
r = 0.048
0.0
0.2
0.4
0.6
0.1 0.2 0.3 0.4 0.5Heritability
Abso
lute
val
ue o
f S
●
●
●● ●● ●●●●●
●● ●●●●●●
● ● ●
●●
●
●
●●
●
r = −0.359
0.0
0.2
0.4
0.6
0.00 0.05 0.10Polygenicity
Abso
lute
val
ue o
f S
● ●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●
●
●
●
● ●●
●
r = −0.021
0.1
0.2
0.3
0.4
0.5
0.00 0.05 0.10Polygenicity
Her
itabi
lity
Mean Median
S -0.350 -0.367
h2SNP 0.223 0.222
𝜋 5.8% 5.1%
Summarize over 29 traits
Zeng …. Yang 2017 (BioRxiv)
Multiplemethodstoestimateadditivegeneticvariance
Individual-leveldata- GREML- Haseman-Elston regression
(yjyk)=mean +bGjk +e
Summarydata- LDscore regression
Consideration:- dataavailability- modelassumptions- computation
38
• Dense SNP panels allow the estimation of the expected genetic covariance between distant relatives
• A model based upon estimated relationships from SNPs is equivalent to a model fitting all SNPs simultaneously
• The total genetic variance due to LD between common SNPs and (unknown) causal variants can be estimated
• Genetic variance captured by common SNPs can be partitioned across the genome
• Different methods to estimate relatedness from SNPs assume different genetic trait architectures
39
Key concepts