
Population Stratification

Qunyuan Zhang

Division of Statistical Genomics

GEMS Course M21-621 Computational Statistical Genetics

Mar. 24, 2011

https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx


What is Population Stratification (PS)?

In the narrow sense, PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure.

In the broad sense, PS can be regarded as the presence of differences in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure, and/or cryptic relatedness.

PS & False Positives

False positives (inflation): an association can arise from the underlying structure of the population even when there is no disease-locus association.

An Example of PS-caused False Positive

Sub-population 1
         case   control   total   risk (case:control)
A          72         8      80   9/1
a          18         2      20   9/1
total      90        10     100   9/1

Sub-population 2
         case   control   total   risk (case:control)
A           3        27      30   1/9
a           7        63      70   1/9
total      10        90     100   1/9

Mixed population
         case   control   total   risk (case:control)
A          75        35     110   2.14
a          25        65      90   0.38
total     100       100     200   1.00

• No disease-locus association.

• Risk difference between sub-populations.

• Allele Frequency difference between sub-populations.

• False disease-locus association in the mixed population (any allele with a higher frequency in the higher-risk sub-population appears to be a risk allele).
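The arithmetic behind the example can be checked with a few lines of Python. This is only a sketch: the counts are the ones from the example, while the helper names `pool` and `odds_ratio` are made up for illustration. It compares the case:control odds of allele A with those of allele a, first within each sub-population and then in the pooled sample.

```python
# Allele counts from the example: {allele: (cases, controls)}.
sub_population_1 = {"A": (72, 8), "a": (18, 2)}
sub_population_2 = {"A": (3, 27), "a": (7, 63)}


def pool(*tables):
    """Add the case/control counts of several strata, allele by allele."""
    pooled = {}
    for table in tables:
        for allele, (cases, controls) in table.items():
            old_cases, old_controls = pooled.get(allele, (0, 0))
            pooled[allele] = (old_cases + cases, old_controls + controls)
    return pooled


def odds_ratio(table):
    """Case:control odds of allele A divided by the odds of allele a."""
    cases_A, controls_A = table["A"]
    cases_a, controls_a = table["a"]
    return (cases_A * controls_a) / (controls_A * cases_a)


mixed = pool(sub_population_1, sub_population_2)

or_sub1 = odds_ratio(sub_population_1)   # no association within stratum 1
or_sub2 = odds_ratio(sub_population_2)   # no association within stratum 2
or_mixed = odds_ratio(mixed)             # spurious association after pooling

print(f"OR sub-population 1: {or_sub1:.2f}")
print(f"OR sub-population 2: {or_sub2:.2f}")
print(f"OR mixed population: {or_mixed:.2f}")
```

Within each sub-population the odds ratio is 1, while the pooled table gives an odds ratio well above 1, although nothing about the locus changed.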


Mantel-Haenszel Test for Stratification

Adjusted RR

Standard error

Chi-square test

An Example

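As a hedged illustration of the stratified analysis, the sketch below applies the textbook Mantel-Haenszel risk ratio and the Cochran-Mantel-Haenszel chi-square (without continuity correction) to the two sub-populations of the earlier false-positive example. The formulas are the standard ones and are an assumption here, since the slide's own formulas and example are not recoverable from the text; the standard error is not covered. Allele A is treated as the "exposure", and the function names are illustrative.

```python
import math

# Each stratum is (a, b, c, d):
#   a = cases with allele A,   b = controls with allele A,
#   c = cases with allele a,   d = controls with allele a.
strata = [(72, 8, 18, 2), (3, 27, 7, 63)]


def mantel_haenszel_rr(strata):
    """Stratum-adjusted risk ratio of being a case, allele A versus allele a."""
    numerator = denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * (c + d) / n
        denominator += c * (a + b) / n
    return numerator / denominator


def cmh_chi_square(strata):
    """Cochran-Mantel-Haenszel statistic (1 df, no continuity correction)."""
    difference = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        expected_a = (a + b) * (a + c) / n
        difference += a - expected_a
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    return difference ** 2 / variance


def crude_rr(strata):
    """Risk ratio from the pooled table, ignoring the strata."""
    a, b, c, d = (sum(column) for column in zip(*strata))
    return (a / (a + b)) / (c / (c + d))


rr_crude = crude_rr(strata)
rr_adjusted = mantel_haenszel_rr(strata)
chi_square = cmh_chi_square(strata)
# Upper tail of a chi-square with 1 df, via the complementary error function.
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print(f"crude RR = {rr_crude:.2f}, adjusted RR = {rr_adjusted:.2f}")
print(f"CMH chi-square = {chi_square:.2f}, p = {p_value:.2f}")
```

The crude risk ratio is far from 1, whereas the stratum-adjusted ratio is 1 and the stratified test statistic is 0, in line with the absence of a true disease-locus association.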

Linear Model

x: marker data

Q: population structure variable, also called the genetic background variable, membership variable, subgroup/sub-population variable, or ancestry/admixture proportion variable

Usually Q is unknown and needs to be estimated.

Estimating Q by Eigen-analysis

References: Patterson et al. 2006, Price et al. 2006 (software EIGENSTRAT)

X = U S V^T

X (marker data, SNPs × individuals):

        idv1   idv2   idv3
snp1       0      2      1
snp2       1      2      2
snp3       0      0      1
snp4       0      1      0
snp5       2      0      0

U:

-0.55    0.33    0.34
-0.78   -0.10   -0.27
-0.16    0.04   -0.71
-0.20    0.14    0.52
-0.15   -0.93    0.20

S (singular values):

 3.81    0.00    0.00
 0.00    2.05    0.00
 0.00    0.00    1.13

S^2 (eigenvalues):

14.51    0.00    0.00
 0.00    4.21    0.00
 0.00    0.00    1.28

V (columns Q1, Q2, Q3 are the eigenvectors of COV(X)):

          Q1      Q2      Q3
idv1   -0.28   -0.95    0.11
idv2   -0.75    0.29    0.59
idv3   -0.60    0.08   -0.80

Or SAS PROC PRINCOMP; R svd() and eigen()
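The decomposition of the small 5-SNP by 3-individual matrix can be reproduced with NumPy. This is a minimal sketch that assumes NumPy is available and, like the numbers on the slide, decomposes the raw genotype matrix without centering or scaling (EIGENSTRAT normalizes the genotypes first). The signs of singular vectors are arbitrary, so they may differ from the slide.

```python
import numpy as np

# Genotype matrix from the slide: rows are snp1..snp5, columns are idv1..idv3.
X = np.array(
    [
        [0, 2, 1],
        [1, 2, 2],
        [0, 0, 1],
        [0, 1, 0],
        [2, 0, 0],
    ],
    dtype=float,
)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

eigenvalues = s ** 2   # eigenvalues of X^T X
Q = Vt.T               # columns are Q1, Q2, Q3; rows are individuals

# The same eigenvalues come from an eigen-decomposition of X^T X.
eigenvalues_check = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]

np.set_printoptions(precision=2, suppress=True)
print("singular values:", s)
print("eigenvalues:    ", eigenvalues)
print("Q (columns Q1..Q3):")
print(Q)
```

Each column of `Q` can then be used as a covariate for the individuals in an association model.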

Eigen-analysis of HapMap Populations

[Figure: HapMap individuals plotted on the first two eigenvectors; axes Q1 and Q2]

Estimating Q by MLE (for admixed population)

G: observed genotypes of admixed [and parental] populations

Q: allele frequencies in parental populations

P: individual membership, to be estimated

Goal: obtain the P that maximizes Pr(G|P,Q)

1. Assign starting values for Q (randomly, or estimated from parental population genotype data) and for P (randomly).

2. Compute P(i) by solving $\partial \Pr(G \mid Q, P) / \partial P = 0$

3. Compute Q(i) by solving $\partial \Pr(G \mid Q, P) / \partial Q = 0$

4. Iterate Steps 2 and 3 until convergence.

Tang et al. Genetic Epidemiology, 2005(28): 289–301
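To make the "compute P with Q held fixed" step concrete, here is a deliberately simplified sketch. It is not the algorithm of Tang et al.: it assumes one individual, two parental populations, bi-allelic markers, and parental allele frequencies that are known and fixed, and it uses an EM update (a swapped-in technique) to climb the likelihood. All names and all numbers are hypothetical.

```python
import math


def log_likelihood(p, genotypes, freq1, freq2):
    """Log-likelihood (up to a constant) of one individual's genotypes.

    p      : proportion of the genome from parental population 1
    freq1/2: frequency of allele A in parental populations 1 and 2
    """
    total = 0.0
    for g, f1, f2 in zip(genotypes, freq1, freq2):
        f = p * f1 + (1.0 - p) * f2   # individual-specific frequency of A
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total


def estimate_membership(genotypes, freq1, freq2, p=0.5, tol=1e-10, max_iter=1000):
    """EM estimate of p, with the parental allele frequencies held fixed."""
    for _ in range(max_iter):
        from_pop1 = 0.0
        for g, f1, f2 in zip(genotypes, freq1, freq2):
            # Posterior probability that an A (or a) allele came from population 1.
            w_A = p * f1 / (p * f1 + (1.0 - p) * f2)
            w_a = p * (1.0 - f1) / (p * (1.0 - f1) + (1.0 - p) * (1.0 - f2))
            from_pop1 += g * w_A + (2 - g) * w_a
        p_new = from_pop1 / (2 * len(genotypes))
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p


# Hypothetical data: copies of allele A (0, 1 or 2) at six markers.
genotypes = [2, 1, 1, 2, 1, 1]
freq1 = [0.9, 0.8, 0.7, 0.9, 0.6, 0.8]   # assumed frequencies, population 1
freq2 = [0.1, 0.2, 0.3, 0.2, 0.1, 0.3]   # assumed frequencies, population 2

p_hat = estimate_membership(genotypes, freq1, freq2)
print(f"estimated membership in population 1: {p_hat:.3f}")
```

A full implementation would alternate this step with re-estimating the allele frequencies and would handle more than two parental populations; frequencies of exactly 0 or 1 would also need special care because of the logarithms.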

Estimating Q by MCMC (for admixed population)

Observed G: genotypes of admixed [and parental] populations

Unknown Z: admixed individuals' membership in ancestral populations

Problem: how to estimate Z?

Bayesian and Markov Chain Monte Carlo (MCMC) methods

1. Assume the number of ancestral populations, K (see next slide).

2. Define the prior distribution Pr(Z) under K.

3. Use MCMC to sample from the posterior distribution Pr(Z|G) ∝ Pr(Z) ∙ Pr(G|Z).

4. Average over a large number of MCMC samples to obtain an estimate of Z.

Falush et al. Genetics, 2003(164):1567–1587. Software: STRUCTURE

Infer Population Number (K)


Linear Model (an example including m Q-variables)

$y = a + bx + b_1 Q_1 + b_2 Q_2 + \dots + b_m Q_m + e$

$y = a + bx + \sum_{i=1}^{m} b_i Q_i + e$

SAS PROC REG, PROC GENMOD; R lm(), glm()

The generalized versions (PROC GENMOD, glm()) can also fit a binary/categorical y.
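The effect of adding a Q variable to the linear model can be seen on simulated data. The sketch below assumes NumPy and a made-up scenario: two sub-populations with different allele frequencies (0.2 and 0.8) and different trait means, and a SNP with no effect of its own. It fits the model by ordinary least squares with and without Q; the helper `fit` and all simulation settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2011)   # arbitrary seed, for reproducibility

n_per_group = 200
# Q: sub-population membership (0 or 1), assumed known here.
Q = np.repeat([0.0, 1.0], n_per_group)
# x: genotype (copies of an allele); its frequency differs between groups.
allele_freq = np.where(Q == 1.0, 0.8, 0.2)
x = rng.binomial(2, allele_freq).astype(float)
# y: the trait depends on the sub-population only, not on the SNP.
y = 1.0 + 2.0 * Q + rng.normal(0.0, 1.0, size=Q.size)


def fit(columns, y):
    """Ordinary least squares; returns [intercept, coefficient of each column]."""
    design = np.column_stack([np.ones_like(y)] + columns)
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients


b_unadjusted = fit([x], y)[1]     # model: y = a + b x + e
b_adjusted = fit([x, Q], y)[1]    # model: y = a + b x + b1 Q + e

print(f"SNP effect without Q: {b_unadjusted:.2f}")
print(f"SNP effect with Q:    {b_adjusted:.2f}")
```

Without Q the SNP picks up the trait difference between the sub-populations; with Q in the model its estimated effect shrinks toward zero.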

Unified Mixed Model (more general)

Model terms: SNP(s), inferred population membership, ID matrix, covariate(s)

V = Z G Z' + R

Modeling the resemblance among individuals

Multi-Variate Normal Distribution (MVN) & Likelihood of Mixed Model

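Based on MVN, the likelihood of the trait (y) in matrix form is:

$L = (2\pi)^{-n/2} \, |V|^{-1/2} \exp\left\{ -\tfrac{1}{2} (y - \mu)' V^{-1} (y - \mu) \right\}$

n: no. of individuals (in a pedigree)

V: n×n variance-covariance matrix

y: phenotype vector

$\mu$: mean phenotype vector

V = Z G Z' + R

$V = 2\Phi\sigma_a^2 + I\sigma_e^2$, where $\Phi$ is the kinship (IBD) matrix (n×n)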
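A small numerical sketch of this likelihood, assuming NumPy and assuming the usual variance-component form V = 2 × kinship × (additive variance) + I × (residual variance). The function names, the two-sib kinship matrix and the variance values are illustrative only.

```python
import numpy as np


def mvn_log_likelihood(y, mu, V):
    """Log of the multivariate normal density of y with mean mu and covariance V."""
    n = y.size
    residual = y - mu
    _, log_det = np.linalg.slogdet(V)            # log |V|, numerically stable
    quadratic = residual @ np.linalg.solve(V, residual)
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + quadratic)


def trait_covariance(kinship, var_a, var_e):
    """V = 2 * kinship * var_a + I * var_e (assumed variance-component form)."""
    return 2.0 * kinship * var_a + np.eye(kinship.shape[0]) * var_e


# Two non-inbred full sibs: self-kinship 0.5, kinship between the sibs 0.25.
kinship = np.array([[0.50, 0.25],
                    [0.25, 0.50]])
y = np.array([1.2, 0.7])     # hypothetical phenotypes
mu = np.array([1.0, 1.0])    # hypothetical mean phenotype vector

V = trait_covariance(kinship, var_a=0.6, var_e=0.4)
log_lik = mvn_log_likelihood(y, mu, V)
print(V)
print(f"log-likelihood = {log_lik:.4f}")
```

In practice the variance components are not fixed as here but chosen to maximize this (restricted) likelihood, which is what mixed-model software does.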

Kinship

Inbreeding Coefficient

The inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are Identical By Descent (IBD).

Identical By Descent (IBD)

Two alleles are IBD if they are copies of the same ancestral allele.

Kinship/Coancestry

The inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then

inbreeding coefficient of Z = coancestry between X and Y

Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc. (need pedigree and/or marker data)

Kinship Matrix (expected probability of allele sharing among relatives)
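For pedigree data, a kinship matrix can be built with the standard recursion: an individual's kinship with anyone listed earlier is the average of its parents' kinships with that person, and its self-kinship is (1 + kinship between its parents) / 2. The sketch below is a plain-Python illustration with a hypothetical input format (a list of `(id, father, mother)` tuples, parents listed before their children); it is not the code of any of the packages named above.

```python
def kinship_matrix(pedigree):
    """Kinship coefficients from a pedigree.

    pedigree: list of (id, father, mother) with parents listed before their
    children. Individuals with a missing (None) or unlisted parent are
    treated as unrelated, non-inbred founders.
    Returns (ids, phi) where phi[i][j] is the kinship between ids[i] and ids[j].
    """
    ids = [individual for individual, _, _ in pedigree]
    index = {individual: k for k, individual in enumerate(ids)}
    n = len(ids)
    phi = [[0.0] * n for _ in range(n)]

    for j, (_, father, mother) in enumerate(pedigree):
        f, m = index.get(father), index.get(mother)
        if f is None or m is None:
            phi[j][j] = 0.5          # founder: kinship 0 with everyone earlier
            continue
        for i in range(j):
            # Average of the parents' kinships with the earlier individual i.
            phi[i][j] = phi[j][i] = 0.5 * (phi[i][f] + phi[i][m])
        # Self-kinship is (1 + inbreeding coefficient) / 2.
        phi[j][j] = 0.5 * (1.0 + phi[f][m])
    return ids, phi


# Hypothetical nuclear family, plus a child of the two sibs.
pedigree = [
    ("father", None, None),
    ("mother", None, None),
    ("sib1", "father", "mother"),
    ("sib2", "father", "mother"),
    ("inbred_child", "sib1", "sib2"),
]

ids, phi = kinship_matrix(pedigree)
for name, row in zip(ids, phi):
    print(f"{name:>12}", " ".join(f"{value:.4f}" for value in row))
```

The inbreeding coefficient of the last individual equals the kinship between its parents (the two sibs), which matches the coancestry relationship stated on the Kinship slide.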

Resources for Mixed Model with Kinship Matrix

Software      Kinship          Mixed Model      Data
SAS           PROC INBREED     PROC MIXED       Quantitative trait; pedigree data
SAS           PROC INBREED     PROC GLIMMIX     Quantitative/qualitative trait; pedigree data
R: kinship    makekinship()    lmekin()         Quantitative trait; pedigree data
R: emma       emma.kinship()   emma.REML.t()    Quantitative trait; uses marker data to calculate kinship
EMMAX         emmax-kin        emmax

Diagnosis of Inflation of False Positives

• Inflation: more false positives than expected under the null

• In GWAS, usually due to PS

• Can be caused by inappropriate statistical methods even with no PS

• May, but does not necessarily, indicate PS

Theoretical Basis of Diagnosis

Under the null, p-values follow a uniform distribution on [0,1].

[Figures: histogram of p-values; Q-Q plot of -log10(p); panels labelled "inflation" and "no inflation"]
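The coordinates of such a Q-Q plot follow directly from the uniform distribution of p-values under the null. A minimal sketch in plain Python: the plotting position (rank − 0.5) / n is one common convention and an assumption here, and `qq_points` is an illustrative name. Plotting itself is left to any graphics library.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a Q-Q plot.

    Under the null the sorted p-values should lie close to the uniform
    quantiles (rank - 0.5) / n, so the points fall on the diagonal.
    All p-values must be greater than 0.
    """
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected_p = (rank - 0.5) / n
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


# Hypothetical p-values: exactly uniform, and an "inflated" version of them.
uniform_p = [(k - 0.5) / 100 for k in range(1, 101)]
inflated_p = [p ** 2 for p in uniform_p]   # systematically too small

null_points = qq_points(uniform_p)
inflated_points = qq_points(inflated_p)

print(null_points[0])       # on the diagonal: expected equals observed
print(inflated_points[0])   # above the diagonal: observed exceeds expected
```

Points that rise above the diagonal over most of the range, not only in the extreme tail, are the typical signature of inflation.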

Inflation Rate (IR)

For Binary Trait

For Continuous Trait

Amin, Duijn, Aulchenko, 2007

Devlin et al. 2004

Genomic Control (by IR)

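For Binary Trait: $Y_i^2 = \chi_i^2$

For Continuous Trait: $Y_i^2 = (t_i)^2$

Or based on p-value: $Y_i^2 = \chi^2_{(1 - p_i,\ df=1)}$

Adjusted statistic: $\tilde{Y}_i^2 = Y_i^2 / \hat{\lambda} \sim \chi^2_{df=1}$, where $\hat{\lambda}$ is the inflation rate (IR)

Adjusted p-value: $\tilde{p}_i = \mathrm{Prob}(\chi^2_{df=1} > \tilde{Y}_i^2)$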
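A p-value-based genomic control correction can be sketched with the Python standard library alone. The inflation rate is estimated here as the median of the observed 1-df chi-square statistics divided by the median of the 1-df chi-square distribution (about 0.455); this median-based estimator is a common choice and an assumption, since the slide's own IR formulas are not recoverable. Function names are illustrative.

```python
import math
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def chi2_1df_quantile(prob):
    """Value x with P(chi-square, 1 df <= x) = prob, for 0 <= prob < 1."""
    z = _STD_NORMAL.inv_cdf((1.0 + prob) / 2.0)
    return z * z


def chi2_1df_upper_tail(x):
    """P(chi-square, 1 df > x)."""
    return math.erfc(math.sqrt(x / 2.0))


def genomic_control(p_values):
    """Return (inflation rate, adjusted p-values) for p-values in (0, 1)."""
    statistics_1df = [chi2_1df_quantile(1.0 - p) for p in p_values]
    inflation = median(statistics_1df) / chi2_1df_quantile(0.5)
    adjusted = [chi2_1df_upper_tail(y / inflation) for y in statistics_1df]
    return inflation, adjusted


# Hypothetical input: null p-values whose test statistics were doubled.
null_p = [k / 1000 for k in range(1, 1000)]
inflated_p = [chi2_1df_upper_tail(2.0 * chi2_1df_quantile(1.0 - p)) for p in null_p]

inflation, adjusted_p = genomic_control(inflated_p)
print(f"estimated inflation rate: {inflation:.3f}")
print(f"median p before: {median(inflated_p):.3f}, after: {median(adjusted_p):.3f}")
```

Extremely small p-values can round 1 − p/2 to exactly 1 in floating point, which `inv_cdf` rejects; a production version would work with the test statistics directly rather than going through p-values. Many analysts also apply the correction only when the estimated inflation rate exceeds 1.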

Practice

• Download and unzip the data from dsgweb.wustl.edu/qunyuan/data/popstra2011hw.zip

• Ignore pedigree.csv; test each SNP in snp.csv for association (with the trait in trait.csv);

• Investigate the p-values to see if there is any inflation;

• Try to explain why;

• List some possible methods to reduce or control the inflation;

• Choose one method and apply it to the data;

• Does it work?

• Try to explain why.

• Clearly document each step of your analysis.

There is no standard answer; feel free to try anything you like!

Report back to [email protected] and [email protected] in one week. Thanks!
