lecture 8: advanced topics of gwas

33
Lecture 8: Advanced Topics of GWAS

Upload: others

Post on 03-Oct-2021

20 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 8: Advanced Topics of GWAS

Lecture 8: Advanced Topics of GWAS

Page 2: Lecture 8: Advanced Topics of GWAS

Advanced Topics of GWAS — 1/32 —

• Account for population stratification

– Genomic control factor

– Structured association study

– PCA

• Multiple testing

• Conditional analysis

• Meta-analysis

• Rare variant association study

Page 3: Lecture 8: Advanced Topics of GWAS

Population Stratification — 2/32 —

Population stratification is a confounding factor that must be accounted for inGWAS.

If not accounted for⇒ the number of false-positive findings can be notably higher than would beexpected (i.e., inflated type I error).

QQ plot of -log10(p-value)

Page 4: Lecture 8: Advanced Topics of GWAS

Adjust for Population Stratification — 3/32 —

• Family-based association analysis: TDT, FBAT (previous lecture)

• Genomic control factor, Devlin and Roeder, 1999, Biometrics

• Principle component analysis (PCA), Price et al. 2006, Nature Genetics

Page 5: Lecture 8: Advanced Topics of GWAS

Genomic Control Factor — 4/32 —

Genomic Control Factor is used to control for systematic inflation of type I error.

The idea is that the statistic T is inflated by an inflation factor λ (i.e., genomiccontrol factor) so that

T ∼ λχ21

where λ can be estimated by

λ = median(T1,T2, . . . ,TM)/0.456

• M is the number of independent tests, though in practice all tests are included.

• The denominator is the median of χ21 distribution.

• λ should be 1 under H0.

Page 6: Lecture 8: Advanced Topics of GWAS

Genomic Control Factor — 5/32 —

λ = 1.08

Divide all chi-square test statistics T by the estimated inflation (GC) factor to getcorrected test statistics

T/λ ∼ χ21

under H0 of no association.

Page 7: Lecture 8: Advanced Topics of GWAS

Genomic Control Factor — 6/32 —

Limitation:

• Genomic control corrects for stratification by adjusting association statistics ateach marker by a uniform overall inflation factor.

• However, some markers differ in their allele frequencies across ancestralpopulations more than others.

• Thus, the uniform adjustment applied by genomic control may be insufficient atmarkers having unusually strong differentiation across ancestral populationsand may be superfluous at markers devoid of such differentiation, leading to aloss in power

Page 8: Lecture 8: Advanced Topics of GWAS

Structured Association Study — 7/32 —

• Software: STRUCTURE

• Idea: stratified analysis of association by ethnicity to account for populationstratification

• Method: Use genotype markers to infer the latent population structure, assignthe samples to discrete subpopulation clusters, and then aggregates evidenceof association within each cluster

• Limitation:

– Discrete subpopulation clusters are not clear in admixed population

– Not work well for fractional membership

– Assignments of individuals to clusters are highly sensitive to the number ofclusters, which is not well defined

Page 9: Lecture 8: Advanced Topics of GWAS

Principal Component Analysis (PCA) — 8/32 —

The principal component analysis (PCA) has become one of the standard ways toadjust for population stratification in population-based GWAS.

• Apply PCA to genotype data to obtain top principal components (PCs) thatexplain most genotype data variation

• PLINK can be used to generate top PCs

• Top PCs will reflect sample ancestry

• Include top PCs as covariates in GWAS

Page 10: Lecture 8: Advanced Topics of GWAS

PCA — 9/32 —

Top PCs often reflect geographic distribution (e.g, PC1 - PC4 as follows)

Page 11: Lecture 8: Advanced Topics of GWAS

PCA — 10/32 —

PC1 vs. PC2 among European samples

Heath et al. 2008.

Page 12: Lecture 8: Advanced Topics of GWAS

PCA Algorithm — 11/32 —

Notation:– M : Number of SNPs– N : Number of subjects– Z = (zi j) : an M × N matrix of standardized genotyped coded for the additive model for

the ith SNP in the jthe subject, i.e.,

zi j = (Xi j − Xi.)/√

2 pi(1 − pi)

where pi denotes the MAF of the ith SNP.

Algorithm:– Compute the N × N variance-covariance matrix as Σ = ZTZ/(N − 1).– Compute the eigenvalue decomposition of Σ: e.g., using R function eigen– Select the top K eigenvalues that are significantly large (K = 5 or 10) by a scree plot.– Include the K eigenvectors (PCs) as additional covariates in the generalized linear

regression models that are used for GWAS.

Page 13: Lecture 8: Advanced Topics of GWAS

Multiple Testing — 12/32 —

For GWAS with millions of single variant tests, the problem of multiple testing mustbe accounted for.

How does the problem of multiple testing arise?

– M: the number of markers for testing– H(m)

0 : no association between the mth SNP and disease, m = 1, . . . ,M

In testing single marker, we set our significance level, α′:

α′ = P(reject null hypothesis H(m)0 | H(m)

0 is true)

But in testing multiple SNPs, our interest is in the family-wise error rate (FWER):

α = P(reject at least one H(m)0 | H(m)

0 is true for all m)

Assume 1) M = 500, 000, 2) all markers independent, 3) α′ = 0.05. Consequenctly,

α = 1 − P(not reject any H(m)0 | H(m)

0 is true for all m)

= 1 −M∏

m=1

P(not reject H(m)0 | H(m)

0 is true for all m)

= 1 − (1 − α′)M

= 1

Page 14: Lecture 8: Advanced Topics of GWAS

Multiple Testing: Controlling for FWER — 13/32 —

Bonferroni correction: An easy and popular approach to adjust thesignificance-level of each test so as to preserve the overall error rate

It follows from Boole’s inequality:

α = P(reject at least one H(m)0 | H(m)

0 is true for all m)

= P

⋃m

{reject H(m)0 | H(m)

0 is true}

∑m

P(reject H(m)0 | H(m)

0 is true) = Mα′

FWER can be kept less than α if each individual test has significance level α/M.

• If α = 0.01 and M = 500, 000, then α′ = 2 × 10−8.

• Bonferroni correction is conservative when tests are not independent.

• Use α = 5 × 10−8 to account for independent tests in GWAS.

Page 15: Lecture 8: Advanced Topics of GWAS

Recall the Example Manhattan Plot of GWAS Result — 14/32 —

GWAS  Results  

18  known  AMD  loci    and  16  novel  AMD  loci  

Page 16: Lecture 8: Advanced Topics of GWAS

Conditional Analysis — 15/32 —

From Fritsche L and Pasaniuc B & Price AL, Nat. Rev. 2017

Page 17: Lecture 8: Advanced Topics of GWAS

Conditional Analysis — 16/32 —

Page 18: Lecture 8: Advanced Topics of GWAS

Meta-Analysis — 17/32 —

• Combine summary statistics (e.g., p-values, odds ratios, effect-sizes) acrossmultiple studies for the same phenotype

• Improve power for increasing total sample size

• Address between study variances (due to population stratification, study design)

• Avoid the hassle of sharing individual-level genotype/phenotype/covariate data(e.g., privacy protocols)

• Lin and Zeng (2010) showed that meta-analysis with summary results isstatistically as efficient as using individual-level data, under an idea situation:

– Phenotype distributions are the same across all studies

– Same covariates were included across all studies

– Equal case/control ratios across all studies

Page 19: Lecture 8: Advanced Topics of GWAS

Why Meta-Analysis? — 18/32 —

Additive model, N cases, N controls, MAF = .3, α = 5 × 10−8

Page 20: Lecture 8: Advanced Topics of GWAS

Why Meta-Analysis? — 19/32 —

Page 21: Lecture 8: Advanced Topics of GWAS

Meta-Analysis Methods — 20/32 —

• Fisher’s Method: combining p-values

• Stouffer’s Z-score method

• Fixed Effect Model: combining standardized effect-sizes

• Software: METALhttps://genome.sph.umich.edu/wiki/METAL_Documentation

Page 22: Lecture 8: Advanced Topics of GWAS

Fisher’s method — 21/32 —

Given summary statistics from individual studies of the same genetic variant

• pk: p-value from the kth study, k = 1, . . . ,K

The test statistic−2

∑k

log(pk) ∼ χ2(2K)

Derivation:

• Under the null, each pk follows U[0, 1]

• The − log of a uniformly distributed value follows an exponential distribution

• Scaling a value that follows an exponential distribution by a factor of two yields aquantity that follows a χ2 distribution with 2 df

• The sum of K independent χ2 values follows a χ2 distribution with 2K df

Page 23: Lecture 8: Advanced Topics of GWAS

Stouffer’s Z-score method — 22/32 —

Given summary statistics from individual studies

• nk: sample size of the kth study

• pk: p-value from the kth study

• βk: effect-size for the kth study

Then, we obtain

• Zk = sign(βk)Φ−1(1 − pk/2), where Φ is standard normal CDF.

• wk =√

nk: weight

Stouffer’s Z statistic is given by ∑k wkZk√∑

k w2k

∼ N(0, 1)

Page 24: Lecture 8: Advanced Topics of GWAS

Fixed Effect Model — 23/32 —

Inverse-variance estimator

Given summary statistics from individual studies

• βk: genetic effect-size from the kth study

• vk: variance of βk from the kth study

Then, consider

• βmeta =∑

k wkβk∑k wk

, wk = 1/vk

• Vbeta = 1∑k wk

• Inverse-variance weighting

The Wald test statistic is given by

βmeta√

Vmeta∼ N(0, 1)

Page 25: Lecture 8: Advanced Topics of GWAS

Rare Variants — 24/32 —

Genetic variants identified in the TOPMed project (2015 - Present)

Page 26: Lecture 8: Advanced Topics of GWAS

Why Study Rare Variants? — 25/32 —

• Most genetic variants are rare

• Functional variants tend to be rare

• Number of samples N needed to observe a rare variant with at least 99.9%probability :

MAF 0.1 0.01 0.001 0.001N 33 344 3453 34537

• Number of samples needed to achieve 80% power by single variant test(underpowered for rare variants):

Page 27: Lecture 8: Advanced Topics of GWAS

Region (or Gene) Based Test — 26/32 —

Test the joint effect of rare/common variants within a defined genome region (e.g.,gene, regulatory region)

• Consider a total of p variants within a test genome region

• Genotype data: Xi = (xi1, xi2, · · · , xip)′, xi j = 0, 1, 2, for individual i

• Genetic effect-size vector β1×p

• Consider covariate vector Zi for individual i, with extended intercept term

• General linear regression model

E[ f (µi)] = Z′iα + X′iβ ,

where f (·) is a link function, e.g., logistic function for dichotomous traits, identityfunction for quantitative traits.

• Test region-based association

H0 : β = (β1, · · · , βp) = 0

Page 28: Lecture 8: Advanced Topics of GWAS

Region (or Gene) Based Test — 27/32 —

Burden (collapsing test)

• Assume there is a shared genetic effect across all variants, i.e., β j = w jβburden

• One commonly used variant weight is based on MAF,w j = 1/

√MAF j(1 − MAF j)

• Equivalent to consider a Burden genotype score

Ci =∑

j

w jxi j

• Model is equivalent toE[ f (µi)] = Z′iα + Ciβburden

• Region-based test is equivalent to test

H0 : βburden = 0

Page 29: Lecture 8: Advanced Topics of GWAS

Region (or Gene) Based Test — 28/32 —

Burden (collapsing test)

• Score test statistic can be used

Tscore =

n∑i=1

Ci(Yi − µ0i), var(Tscore) = C′(P − PZ(Z′PZ)−1Z′P

)C;

Tscore√

var(Tscore)∼ N(0, 1);

– Let α denote the covariate coefficients estimated under the NULL modelE[ f (µi)] = Ziα

– µ0i = Z′i α and P = σ2ε I, with error variance estimate σ2

ε under NULL model,for standard linear regression model

– µ0i = logit−1(Ziα) and P = diag(µ01(1 − µ01), · · · , µ0n(1 − µ0n)

)for logistic

regression model

• Underpowered when genetic variants have opposite effect-sizes within a testedregion

Page 30: Lecture 8: Advanced Topics of GWAS

Region (or Gene) Based Test — 29/32 —

Variance component test

• Consider the general linear regression model

E[ f (µi)] = Z′iα + X′iβ ,

• Assume β j ∼ N(0,w2jτ), sharing a common variance component τ

• Then region-based test is equivalent to test

H0 : τ = 0

Page 31: Lecture 8: Advanced Topics of GWAS

Region (or Gene) Based Test — 30/32 —

SKAT: Sequence Kernel Association Test, Wu et al. (2011) AJHG

• Test statistic (i.e., weighted sum of single variant score test statistics) :

QS KAT = (Y − µ0)XWWX′(Y − µ0) =

p∑j=1

w2j[X′j(Y − µ0)]

• Let K = XWWX′ denote the kernel matrix, W = diag(w1, · · · ,wp)

• µ0 is estimated under the NULL hypothesis

• QS KAT asymptotically follows a mixture of χ2(1) distribution under the NULL

hypothesis

QS KAT ≈

p∑j=1

λ jχ2(1)

• λ j are eigenvalues of P1/20 KP1/2

0 , with projection matrix P0 = P − PZ(Z′PZ)−1Z′P.

• The mixture of χ2(1) distribution can be approximated with the computationally

efficient Davies method, which will be used to calculate p-value.

Page 32: Lecture 8: Advanced Topics of GWAS

Post-GWAS Analysis — 31/32 —

• Replication study with independent datasets

• Fine-mapping GWAS loci while accounting for functional annotation

• Biological interpretation

• Biological replication (e.g., CRISPER-CAS9)

Page 33: Lecture 8: Advanced Topics of GWAS

References — 32/32 —

• Price A.L. Principal components analysis corrects for stratification ingenome-wide association studies. Nature Genetics volume 38, pages 904-909(2006).

• Purcell, S. et. al. PLINK: A tool set for whole-genome association andpopulation-based linkage analyses. Am. J. Hum. Genet.81, 559-575. 2007.

• Cristen J. Willer, Yun Li, Goncalo R. Abecasis; METAL: fast and efficientmeta-analysis of genome-wide association scans, Bioinformatics, Volume 26,Issue 17, 1 September 2010, Pages 2190-2191.

• Li B, Leal SM. Methods for detecting associations with rare variants for commondiseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311-21.

• Wu M.C. et. al. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variantassociation testing for sequencing data with the sequence kernel associationtest. Am J Hum Genet. 2011; 89(1):82-93.