Population Stratification
Qunyuan Zhang, Division of Statistical Genomics
GEMS Course M21-621 Computational Statistical Genetics
Mar. 24, 2011
https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx
What is Population Stratification (PS)?
In the narrow sense, PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure.
In the broad sense, PS can be regarded as the presence of differences in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure and/or cryptic relatedness.
PS & False Positives
False Positives (inflation)
An association could be due to the underlying structure of the population, even when there is no disease-locus association.
An Example of PS-caused False Positive
Sub-population 1
         case   control   total   risk (case:control)
A          72         8      80   9/1
a          18         2      20   9/1
total      90        10     100   9/1
Sub-population 2
         case   control   total   risk (case:control)
A           3        27      30   1/9
a           7        63      70   1/9
total      10        90     100   1/9
Mixed population
         case   control   total   risk (case:control)
A          75        35     110   2.14
a          25        65      90   0.38
total     100       100     200   1.00
• No disease-locus association.
• Risk difference between sub-populations.
• Allele Frequency difference between sub-populations.
• False disease-locus association in mixed population. (any allele with higher frequency in higher-risk sub-population seems to be risk allele)
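The arithmetic behind the three tables can be checked with a short sketch. It pools the two sub-population tables and recomputes the case:control ratios; the function and variable names are illustrative and not part of the original slides.

```python
# Counts per sub-population, copied from the tables: allele -> (cases, controls).
sub_pop_1 = {"A": (72, 8), "a": (18, 2)}
sub_pop_2 = {"A": (3, 27), "a": (7, 63)}


def pool(*tables):
    """Add up case and control counts per allele across sub-populations."""
    pooled = {}
    for table in tables:
        for allele, (cases, controls) in table.items():
            old_cases, old_controls = pooled.get(allele, (0, 0))
            pooled[allele] = (old_cases + cases, old_controls + controls)
    return pooled


def ratios(table):
    """Case:control ratio for every allele in a table."""
    return {allele: cases / controls for allele, (cases, controls) in table.items()}


mixed = pool(sub_pop_1, sub_pop_2)
ratios_1 = ratios(sub_pop_1)
ratios_2 = ratios(sub_pop_2)
ratios_mixed = ratios(mixed)

for name, r in [("sub-population 1", ratios_1),
                ("sub-population 2", ratios_2),
                ("mixed population", ratios_mixed)]:
    print(name, {allele: round(value, 2) for allele, value in r.items()})
```

Within each sub-population both alleles have the same ratio, yet after pooling allele A looks like a risk allele, purely because it is more frequent in the higher-risk sub-population.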
Mantel-Haenszel Test for Stratification
(1) Adjusted RR
(2) Standard error
(3) Chi-square test
An Example
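The formulas on this slide did not survive extraction, so the following is only a sketch using the textbook Mantel-Haenszel pooled risk ratio and the Cochran-Mantel-Haenszel chi-square (without continuity correction), applied to the two sub-population tables from the earlier example. Here "risk" means cases divided by the row total, and the function names are illustrative assumptions rather than the slide's notation.

```python
# Each stratum: (cases with A, controls with A, cases with a, controls with a).
strata = [(72, 8, 18, 2), (3, 27, 7, 63)]


def crude_rr(strata):
    """Risk ratio of A versus a after pooling all strata (ignores stratification)."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    return (a / (a + b)) / (c / (c + d))


def mh_rr(strata):
    """Mantel-Haenszel risk ratio adjusted for stratum."""
    numerator = denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * (c + d) / n
        denominator += c * (a + b) / n
    return numerator / denominator


def cmh_chi2(strata):
    """Cochran-Mantel-Haenszel chi-square (1 df), no continuity correction."""
    diff = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n          # observed minus expected count
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return diff * diff / variance


print("crude RR   :", round(crude_rr(strata), 3))
print("adjusted RR:", round(mh_rr(strata), 3))
print("CMH chi2   :", round(cmh_chi2(strata), 3))
```

For these tables the crude risk ratio is well above 1, while the stratum-adjusted ratio is 1 and the chi-square is 0, which matches the "no disease-locus association" conclusion within each sub-population.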
Linear Model
Marker data
Q: population structure variable, also called genetic background variable, membership variable, subgroup/sub-population variable, or ancestry/admixture proportion variable
Usually Q is unknown, needs to be estimated
Estimating Q by Eigen-analysis
References: Patterson et al. 2006, Price et al. 2006 (software EIGENSTRAT)
$X = U S V^{T}$
X (genotypes):
        idv1   idv2   idv3
snp1       0      2      1
snp2       1      2      2
snp3       0      0      1
snp4       0      1      0
snp5       2      0      0
U:
-0.55    0.33    0.34
-0.78   -0.10   -0.27
-0.16    0.04   -0.71
-0.20    0.14    0.52
-0.15   -0.93    0.20
S (singular values):
 3.81    0.00    0.00
 0.00    2.05    0.00
 0.00    0.00    1.13
S^2 (eigenvalues):
14.51    0.00    0.00
 0.00    4.21    0.00
 0.00    0.00    1.28
V (columns Q1, Q2, Q3: eigenvectors of COV(X)):
   Q1      Q2      Q3
-0.28   -0.95    0.11
-0.75    0.29    0.59
-0.60    0.08   -0.80
Or SAS Proc PRINCOMP; R svd() and eigen()
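As a rough check of the numbers on this slide, the leading eigenvalue and eigenvector (Q1) can be recovered with plain power iteration on X^T X. This is a minimal sketch, not how EIGENSTRAT works internally: it uses the genotype matrix exactly as printed (no centering or scaling, which appears to be how the slide's values were obtained), and power iteration is swapped in here only to avoid external libraries.

```python
import math

# Genotype matrix from the slide: rows are SNPs, columns are individuals.
X = [
    [0, 2, 1],
    [1, 2, 2],
    [0, 0, 1],
    [0, 1, 0],
    [2, 0, 0],
]


def gram(X):
    """Return X^T X, an individuals-by-individuals matrix."""
    n = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]


def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


def top_eigenpair(M, iterations=200):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    v = [1.0] * len(M)
    for _ in range(iterations):
        w = mat_vec(M, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(M, v)))
    return eigenvalue, v


eigenvalue, q1 = top_eigenpair(gram(X))
singular_value = math.sqrt(eigenvalue)

print("largest eigenvalue    :", round(eigenvalue, 2))
print("largest singular value:", round(singular_value, 2))
print("Q1 (up to sign)       :", [round(x, 2) for x in q1])
```

The sign of an eigenvector is arbitrary, so Q1 here comes out with the opposite sign to the slide. In practice a full decomposition routine (for example R `svd()` as named on the slide) would be used instead of power iteration.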
Eigen-analysis of HapMap Populations
[Figure: HapMap samples plotted on axes Q1 and Q2]
Estimating Q by MLE (for admixed population)
G: Observed genotypes of admixed [and parental populations]
Q: Allelic frequencies in parental populations
P: Individual membership to be estimated (in this slide's notation, P plays the role of the structure variable Q in the linear model)
Goal: obtain P that maximizes Pr(G|P,Q)
1. Assign initial values for Q (randomly or estimated from parental population genotype data) & P (randomly)
2. Compute P(i) by solving $\partial \Pr(G \mid Q, P) / \partial P = 0$
3. Compute Q(i) by solving $\partial \Pr(G \mid Q, P) / \partial Q = 0$
4. Iterate Steps 2 and 3 until convergence.
Tang et al. Genetic Epidemiology, 2005(28): 289–301
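The alternating scheme above can be illustrated with a small sketch. It does not solve the score equations directly; instead it uses standard EM updates for a binomial admixture model, which is an assumption and not necessarily the exact procedure of Tang et al. The toy genotypes, K = 2, and the starting values are all hypothetical. To avoid clashing with the slide's letters, the code calls the individual memberships `member` (the slide's P) and the allele frequencies `freq` (the slide's Q).

```python
import math

# Hypothetical toy data: rows are individuals, columns are SNPs,
# entries count copies of allele 1 (0, 1 or 2).
G = [
    [2, 2, 1, 2, 0, 0],
    [2, 1, 2, 2, 0, 1],
    [0, 0, 1, 0, 2, 2],
    [0, 1, 0, 0, 2, 1],
    [1, 1, 1, 1, 1, 1],
]
K = 2  # assumed number of parental populations


def mix(member_i, freq, j, allele1):
    """Probability that one allele copy at SNP j is allele 1 (or allele 0)."""
    return sum(m * (f[j] if allele1 else 1.0 - f[j]) for m, f in zip(member_i, freq))


def log_likelihood(G, member, freq):
    """Binomial log-likelihood of the genotypes (constant terms dropped)."""
    total = 0.0
    for i, row in enumerate(G):
        for j, g in enumerate(row):
            if g > 0:
                total += g * math.log(mix(member[i], freq, j, True))
            if g < 2:
                total += (2 - g) * math.log(mix(member[i], freq, j, False))
    return total


def em_step(G, member, freq):
    """One EM update of memberships and allele frequencies."""
    n, n_snps = len(G), len(G[0])
    ones = [[0.0] * n_snps for _ in range(K)]   # expected allele-1 copies per (population, SNP)
    zeros = [[0.0] * n_snps for _ in range(K)]  # expected allele-0 copies per (population, SNP)
    new_member = [[0.0] * K for _ in range(n)]
    for i in range(n):
        for j in range(n_snps):
            g = G[i][j]
            p1 = mix(member[i], freq, j, True)
            p0 = mix(member[i], freq, j, False)
            for k in range(K):
                w1 = g * member[i][k] * freq[k][j] / p1 if g > 0 else 0.0
                w0 = (2 - g) * member[i][k] * (1.0 - freq[k][j]) / p0 if g < 2 else 0.0
                ones[k][j] += w1
                zeros[k][j] += w0
                new_member[i][k] += (w1 + w0) / (2 * n_snps)
    new_freq = [[ones[k][j] / (ones[k][j] + zeros[k][j]) for j in range(n_snps)]
                for k in range(K)]
    return new_member, new_freq


# Step 1: starting values (arbitrary, chosen so the two populations differ).
member = [[1.0 / K] * K for _ in G]
freq = [[0.3] * len(G[0]), [0.7] * len(G[0])]

# Steps 2-4: alternate the updates and track the log-likelihood.
history = [log_likelihood(G, member, freq)]
for _ in range(30):
    member, freq = em_step(G, member, freq)
    history.append(log_likelihood(G, member, freq))

for i, row in enumerate(member):
    print("individual", i + 1, [round(m, 3) for m in row])
```

Each individual's memberships sum to 1, and the log-likelihood should not decrease from one iteration to the next, which is a convenient sanity check when adapting the sketch.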
Estimating Q by MCMC (for admixed population)
Observed G: genotypes of admixed [and parental populations]
Unknown Z: admixed individuals’ membership from ancestral populations
Problem: How to estimate Z?
Bayesian and Markov Chain Monte Carlo (MCMC) methods
1. Assume ancestral population number K (see next slide)
2. Define prior distribution Pr(Z) under K
3. Use MCMC to sample from posterior distribution Pr(Z|G) ∝ Pr(Z) ∙ Pr(G|Z)
4. Average over large number of MCMC samples to obtain estimate of Z
Falush et al. Genetics, 2003(164):1567–1587. Software: STRUCTURE
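Steps 2 to 4 can be sketched for a deliberately simplified case: one admixed individual, K = 2, allele frequencies in the ancestral populations treated as known, and a uniform prior on the ancestry proportion z. This is far simpler than the model in STRUCTURE, and every number below is hypothetical. A random-walk Metropolis sampler is used, and its average is compared with a direct grid evaluation of the posterior mean.

```python
import math
import random

# Hypothetical inputs: allele-1 frequencies in two ancestral populations and
# one admixed individual's genotypes (copies of allele 1) at 10 SNPs.
freq_pop1 = [0.9, 0.8, 0.9, 0.7, 0.8, 0.2, 0.1, 0.3, 0.9, 0.8]
freq_pop2 = [0.1, 0.2, 0.2, 0.1, 0.3, 0.8, 0.9, 0.7, 0.2, 0.1]
genotypes = [2, 1, 2, 1, 1, 1, 0, 1, 2, 1]


def log_posterior(z):
    """Unnormalised log Pr(z | G): uniform prior on (0, 1) times binomial likelihood."""
    if not 0.0 < z < 1.0:
        return -math.inf
    total = 0.0
    for g, f1, f2 in zip(genotypes, freq_pop1, freq_pop2):
        p = z * f1 + (1.0 - z) * f2
        total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total


def metropolis(n_samples=20000, burn_in=2000, step=0.2, seed=1):
    """Random-walk Metropolis sampler for z; returns the retained samples."""
    rng = random.Random(seed)
    z = 0.5
    current = log_posterior(z)
    samples = []
    for iteration in range(n_samples + burn_in):
        proposal = z + rng.uniform(-step, step)
        candidate = log_posterior(proposal)
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            z, current = proposal, candidate
        if iteration >= burn_in:
            samples.append(z)
    return samples


def grid_posterior_mean(n_grid=2000):
    """Posterior mean of z by brute-force evaluation on a grid (for comparison)."""
    zs = [(i + 0.5) / n_grid for i in range(n_grid)]
    logs = [log_posterior(z) for z in zs]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    return sum(z * w for z, w in zip(zs, weights)) / sum(weights)


samples = metropolis()
z_hat = sum(samples) / len(samples)   # step 4: average over MCMC samples
z_grid = grid_posterior_mean()

print("MCMC estimate of z:", round(z_hat, 3))
print("grid estimate of z:", round(z_grid, 3))
```

The grid check is only feasible because this toy problem has a single unknown; with many individuals and unknown allele frequencies, sampling is the practical route.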
Infer Population Number (K)
Linear Model (an example including m Q-variables)
$y = a + bx + b_1 Q_1 + b_2 Q_2 + \dots + b_m Q_m + e$
$y = a + bx + \sum_{i=1}^{m} b_i Q_i + e$
SAS Proc REG, Proc GENMOD; R lm(), glm()
Generalized (Proc GENMOD, glm()): can fit binary/categorical y
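A small least-squares sketch shows what the Q term does. The data are constructed by hand (hypothetical, not from the course): the trait depends only on a single 0/1 sub-population indicator Q, and the SNP genotype x is more common in the sub-population with the higher trait mean. In practice one would call `lm()` or `glm()` as listed above; the solver here is written out only to keep the example free of external libraries.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def ols(columns, y):
    """Least-squares coefficients for y ~ 1 + columns, via the normal equations."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data: 6 individuals per sub-population.
x = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]      # SNP genotype
Q = [0] * 6 + [1] * 6                          # sub-population membership
noise = [0.5, -0.5] * 6                        # balanced within every (x, Q) group
y = [5.0 + 3.0 * q + e for q, e in zip(Q, noise)]   # trait does NOT depend on x

naive = ols([x], y)          # y = a + b x + e
adjusted = ols([x, Q], y)    # y = a + b x + b1 Q1 + e

print("SNP effect without Q:", round(naive[1], 3))
print("SNP effect with Q   :", round(adjusted[1], 3))
```

Without Q the SNP appears to have a sizeable effect; once Q enters the model the SNP coefficient drops to zero and the effect is attributed to sub-population membership.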
Unified Mixed Model (more general)
SNP(s)
Inferred population membership
ID matrix
Covariate(s)
V = Z G Z ' + R
Modeling the resemblance among individuals
Multi-Variate Normal Distribution (MVN) & Likelihood of Mixed Model
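Based on MVN, the likelihood of trait (y) in a matrix form is:
$$L = (2\pi)^{-n/2}\,|V|^{-1/2}\exp\left[-\tfrac{1}{2}(y-\mu)'\,V^{-1}(y-\mu)\right]$$
n: no. of individuals (in a pedigree)
V: n × n variance-covariance matrix
y: phenotype vector
μ: mean phenotype vector
$V = Z G Z' + R$
$V = 2\Phi\sigma_a^2 + I\sigma_e^2$
Φ: Kinship (IBD) matrix (n × n)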
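To make the likelihood concrete, the sketch below evaluates the multivariate normal log-likelihood for a three-person pedigree (two unrelated parents and their child), with the covariance built as twice the kinship matrix times an additive variance plus a residual variance on the diagonal. The phenotypes, mean, and variance components are hypothetical, and a hand-written Cholesky factorisation stands in for a linear algebra library.

```python
import math


def cholesky(V):
    """Lower-triangular L with L L' = V (V must be symmetric positive definite)."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = V[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def mvn_loglik(y, mu, V):
    """Log of the MVN density of y with mean mu and covariance V."""
    n = len(y)
    L = cholesky(V)
    z = []  # solves L z = (y - mu) by forward substitution
    for i in range(n):
        z.append((y[i] - mu[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(v * v for v in z)  # equals (y - mu)' V^-1 (y - mu)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


def make_v(phi, var_a, var_e):
    """V = 2 * Phi * var_a + I * var_e."""
    n = len(phi)
    return [[2.0 * phi[i][j] * var_a + (var_e if i == j else 0.0) for j in range(n)]
            for i in range(n)]


# Kinship matrix for father, mother, child (founders unrelated and not inbred).
phi_trio = [
    [0.50, 0.00, 0.25],
    [0.00, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]
y = [1.2, 0.4, 1.0]     # hypothetical phenotypes
mu = [0.8, 0.8, 0.8]    # hypothetical common mean

V = make_v(phi_trio, var_a=0.5, var_e=0.5)
loglik = mvn_loglik(y, mu, V)

# With no additive variance, V is the identity and individuals are independent.
loglik_independent = mvn_loglik(y, mu, make_v(phi_trio, var_a=0.0, var_e=1.0))

print("log-likelihood with kinship   :", round(loglik, 4))
print("log-likelihood, independent y :", round(loglik_independent, 4))
```

Fitting the mixed model amounts to searching for the mean and variance components that maximise this quantity, which is what the procedures listed later (for example `lmekin()`) do internally in some form.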
Kinship
Inbreeding Coefficient
The inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are Identical By Descent (IBD).
Identical By Descent (IBD)
Two alleles are IBD if they are copies of the same ancestral allele.
Kinship/Coancestry
The inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then
inbreeding coefficient of Z = coancestry between X and Y
Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc. (need pedigree and/or marker data)
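The relation between inbreeding and coancestry leads directly to the usual recursive calculation of kinship from a pedigree. The sketch below assumes a hypothetical five-person pedigree in which two full sibs (C and D) have a child E, and it assumes founders are unrelated and not inbred; it is an illustration, not the algorithm of any package listed above.

```python
from functools import lru_cache

# Hypothetical pedigree: id -> (father, mother); None marks an unknown parent.
# Parents must be listed before their children.
pedigree = {
    "A": (None, None),
    "B": (None, None),
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", "D"),
}
order = {name: position for position, name in enumerate(pedigree)}


@lru_cache(maxsize=None)
def kinship(i, j):
    """Probability that a random allele from i and one from j are IBD."""
    if i is None or j is None:
        return 0.0                      # unknown parents are treated as unrelated
    if i == j:
        father, mother = pedigree[i]
        return 0.5 * (1.0 + kinship(father, mother))
    if order[i] < order[j]:
        i, j = j, i                     # always recurse through the younger individual
    father, mother = pedigree[i]
    return 0.5 * (kinship(father, j) + kinship(mother, j))


def inbreeding(i):
    """Inbreeding coefficient of i = coancestry between its parents."""
    father, mother = pedigree[i]
    return kinship(father, mother)


names = list(pedigree)
print("     " + "  ".join(f"{n:>5}" for n in names))
for a in names:
    print(f"{a:>5}" + "  ".join(f"{kinship(a, b):5.3f}" for b in names))
print("inbreeding coefficient of E:", inbreeding("E"))
```

The printed table is a kinship matrix of the kind used in the mixed model; E's inbreeding coefficient equals the coancestry between its parents C and D.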
Kinship Matrix (expected probability of allele sharing among relatives)
Resources for Mixed Model with Kinship Matrix
Software     Kinship          Mixed Model      Data
SAS          Proc INBREED     Proc MIXED       Quantitative trait; pedigree data
SAS          Proc INBREED     Proc GLIMMIX     Quantitative/qualitative trait; pedigree data
R: kinship   makekinship()    lmekin()         Quantitative trait; pedigree data
R: emma      emma.kinship()   emma.REML.t()    Quantitative trait; marker data used to calculate kinship
EMMAX        emmax-kin        emmax
Diagnosis of Inflation of False Positives
• Inflation: more false positives than expected under the null
• In GWAS, usually due to PS
• Can be caused by inappropriate statistical methods even with no PS
• May (not necessarily) indicate PS
Theoretical Basis of Diagnosis
Uniform distribution [0,1] of p-values under the null
[Figure panels: histogram of p-values; Q-Q plot of -log10(p), with inflation and with no inflation]
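The coordinates of such a Q-Q plot are easy to compute: sort the observed p-values and pair each with the quantile expected under a uniform distribution. The sketch below assumes the common plotting position (rank - 0.5)/n, which is one convention among several, and uses made-up p-values (the "inflated" set is simply the square of the null set).

```python
import math


def qq_points(p_values):
    """Pairs of (expected, observed) -log10(p), from the smallest p-value upward."""
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected = (rank - 0.5) / n          # uniform quantile under the null
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Hypothetical p-values: an evenly spread "null" set and an inflated set.
null_p = [(i + 0.5) / 100 for i in range(100)]
inflated_p = [p * p for p in null_p]

null_points = qq_points(null_p)
inflated_points = qq_points(inflated_p)

for label, points in [("no inflation", null_points), ("inflation", inflated_points)]:
    expected, observed = points[0]
    print(f"{label:>12}: smallest p -> expected {expected:.2f}, observed {observed:.2f}")
```

With no inflation the points fall on the diagonal; with inflation the observed values sit above it, most visibly for the smallest p-values.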
Inflation Rate (IR)
For Binary Trait
For Continuous Trait
Amin, van Duijn, Aulchenko, 2007
Devlin et al. 2004
Genomic Control (by IR)
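For Binary Trait: $Y_i^2 = \chi_i^2$
For Continuous Trait: $Y_i^2 = (t_i)^2$
Or based on p-value: $Y_i^2 = \chi^2_{(1-p_i,\ df=1)}$
$\tilde{Y}_i^2 = Y_i^2 / \hat{\lambda} \sim \chi^2_{df=1}$, where $\hat{\lambda}$ is the estimated inflation rate (IR)
$\tilde{p}_i = \mathrm{Prob}(\chi^2_{df=1} \ge \tilde{Y}_i^2)$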
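The p-value route can be sketched as follows. Because the slide's own IR formulas were lost in extraction, this example assumes the widely used median-based estimate (median of the observed statistics divided by the median of a chi-square with 1 df); treat that choice, and the function names, as assumptions. The chi-square (df = 1) quantile and tail probability are obtained from the standard normal, using only the Python standard library.

```python
import math
from statistics import NormalDist, median

_std_normal = NormalDist()


def chi2_from_p(p):
    """Chi-square (df=1) statistic whose upper-tail probability is p (0 < p < 1)."""
    return _std_normal.inv_cdf(1.0 - p / 2.0) ** 2


def p_from_chi2(stat):
    """Upper-tail probability of a chi-square (df=1) statistic."""
    return math.erfc(math.sqrt(stat / 2.0))


CHI2_DF1_MEDIAN = chi2_from_p(0.5)   # about 0.455


def inflation_rate(stats):
    """Median-based inflation estimate (an assumed, commonly used definition)."""
    return median(stats) / CHI2_DF1_MEDIAN


def genomic_control(p_values):
    """Return (inflation estimate, corrected p-values)."""
    stats = [chi2_from_p(p) for p in p_values]
    lam = inflation_rate(stats)
    corrected = [p_from_chi2(s / lam) for s in stats]
    return lam, corrected


# Hypothetical example: evenly spread null p-values whose statistics are doubled.
null_p = [(i + 0.5) / 999 for i in range(999)]
inflated_p = [p_from_chi2(2.0 * chi2_from_p(p)) for p in null_p]

lam, corrected_p = genomic_control(inflated_p)
print("estimated inflation rate:", round(lam, 3))
print("smallest p before / after correction:",
      f"{min(inflated_p):.2e}", f"{min(corrected_p):.2e}")
```

Dividing every statistic by the same factor rescales the whole distribution, so it removes uniform inflation but cannot correct SNPs that are more strongly affected by stratification than the average.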
Practice
• Download and unzip the data from dsgweb.wustl.edu/qunyuan/data/popstra2011hw.zip
• Ignore pedigree.csv; test each SNP in snp.csv for association (with the trait in trait.csv).
• Investigate the p-values to see if there is any inflation.
• Try to explain why.
• List some possible methods to reduce or control the inflation.
• Choose one method and apply it to the data.
• Does it work?
• Try to explain why.
• Clearly document each step of your analysis.
There is no standard answer; feel free to try anything you like!
Report back to [email protected] and [email protected] in one week. Thanks!