
Population Structure and Association Analysis

02-715 Advanced Topics in Computational Genomics

Population Structure and Association Analysis

• Population structure in data causes false positives
– Samples in the case population are usually more closely related to one another
– Any SNPs more prevalent in the case population will be found significantly associated with the trait.

Accounting for Population Structure in Association Analysis

• Association mapping needs to account for population structure.

• Careful study design with each population represented in case/control groups in a balanced way.
– Can be hard to control
– The effect of cryptic population structure

Family-based Design vs. Population-based Design

• Family-based studies
– The effect of population structure can be controlled by the use of parents' genotypes.
– In practice, collecting genotypes from multiple individuals in a family can be hard (e.g., late-onset diseases).

• Population-based design
– Data collection is easier for a large number of unrelated individuals than for a large number of families.
– The control samples can be reused in different studies.

Accounting for Population Structure in Association Analysis

• Family-based method
– Transmission disequilibrium test (TDT)

• Population-based methods
– Genomic control (Devlin & Roeder, Biometrics 1999)
– Structured association (Pritchard et al., AJHG 2000)
– EigenStrat: principal component analysis (Price et al., Nature Genetics 2006)

Transmission Disequilibrium Test (TDT)

                      Non-transmitted allele
Transmitted allele    M       m       Total
M                     a       b       a+b
m                     c       d       c+d
Total                 a+c     b+d     2N

• Genotype affected individuals and their parents (trio)

• Null hypothesis: (b/(b+c), c/(b+c)) is compatible with (0.5, 0.5)

• Test statistic is given as (b - c)^2 / (b + c)

• The non-transmitted alleles play the role of controls
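As a minimal sketch, the TDT statistic can be computed directly from the two off-diagonal counts of the table. The function name `tdt` and the example counts are hypothetical, and the p-value assumes the statistic is compared against a chi-square distribution with 1 degree of freedom (the usual McNemar-type reference distribution, not stated on the slide).

```python
import math


def tdt(b, c):
    """Return (statistic, p_value) for the transmission disequilibrium test.

    b: heterozygous parents that transmitted M and did not transmit m
    c: heterozygous parents that transmitted m and did not transmit M
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    stat = (b - c) ** 2 / (b + c)
    # Assumption: chi-square reference with 1 d.f., whose upper tail is
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, for illustration only.
stat, p_value = tdt(b=60, c=40)
print(stat, p_value)
```

Only b and c enter the statistic, because homozygous parents (cells a and d) transmit the same allele either way and carry no information about preferential transmission.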

Genomic Control (GC)

• Idea: Use the SNPs that are not associated with the trait to remove the effect of population stratification

• Genotype data consist of
– Candidate genes to be tested
– L supplementary loci (null loci) for estimating the inflation factor λ

• GC uses the inflation factor λ to correct the association statistic of the SNP in the candidate gene

• Limitation: the inflation factor λ is assumed to be the same across the genome, ignoring population admixture

Devlin & Roeder, Biometrics 1999
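The correction can be sketched as follows. The slide does not say how λ is estimated. This example assumes a commonly used median-based estimator (median of the null-locus statistics divided by the median of a chi-square distribution with 1 degree of freedom) and assumes λ is floored at 1. The function names and toy statistics are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 d.f. (approximately 0.4549).
CHI2_1DF_MEDIAN = 0.4549


def inflation_factor(null_stats):
    """Estimate lambda from association statistics at the L null loci.

    Assumption: median-based estimator, floored at 1 so that the
    correction never makes a statistic larger.
    """
    lam = statistics.median(null_stats) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)


def gc_correct(candidate_stat, lam):
    """Deflate the candidate-locus statistic by the inflation factor."""
    return candidate_stat / lam


# Toy statistics at five null loci (made up for illustration).
null_stats = [0.2, 0.5, 0.91, 1.3, 2.0]
lam = inflation_factor(null_stats)
corrected = gc_correct(8.0, lam)
print(lam, corrected)
```

Because a single λ is applied to every candidate SNP, the sketch also makes the stated limitation visible: loci that differ in how strongly they are affected by stratification all receive the same correction.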

STRAT: Structured Association (Pritchard et al., AJHG 2000)

• Idea: Within each subpopulation, an association between a genetic marker and the trait is a true association.

• Two-stage method
– Step 1: Using Structure (Pritchard et al., Genetics 2000) and unlinked genetic markers,
  • estimate the population structure
  • assign sampled individuals to putative subpopulations
– Step 2:
  • Test for association within the subpopulations inferred in Step 1

• Limitation
– Running Structure is computationally demanding

Pritchard et al., AJHG 2000

STRAT: Step 2

• Given ancestry proportions q_k(i) for population k and individual i, estimated by STRUCTURE

• H0: The probability model for the genotypes c under the null hypothesis of no association

• H1: The probability model for the genotypes c under the alternative hypothesis of association

STRAT: Step 2

• Likelihood ratio test:

– Large values indicate that the alternative hypothesis explains the data better.
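A hedged sketch of how the two models and the test statistic can be written, following the description of Step 2. The symbols p_{kj} (frequency of allele j in subpopulation k), \phi_i (phenotype of individual i), and the hatted estimates are notation assumed for this illustration rather than taken from the slide.

```latex
% Assumed notation: c_i is an allele copy observed in individual i
% at the candidate locus; q_k(i) are the ancestry proportions.
H_0:\quad \Pr(c_i = j) = \sum_k q_k(i)\, p_{kj}
\qquad
H_1:\quad \Pr(c_i = j \mid \phi_i) = \sum_k q_k(i)\, p_{kj}^{(\phi_i)}

\Lambda = \frac{\Pr_1(C;\ \hat{P}_1, \hat{Q})}{\Pr_0(C;\ \hat{P}_0, \hat{Q})}
```

Under H0 the subpopulation allele frequencies are shared by cases and controls, while under H1 they are allowed to depend on the phenotype.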

Simulation Studies: No Admixture

• Assume two discrete populations

• Simulate genotypes of 150 affected and 150 control individuals at 100 unlinked loci
– With sample size N, we have 2N chromosomes
– Assume the two populations split 0.05N generations ago without migration
– Controls: half of the controls came from each of the two subpopulations
– Affected group: 100 from population 1, 50 from population 2

STRAT: Simulation Results

• Rejection rates under the null hypothesis of no association

• p1, p2: allele frequencies for populations 1 and 2 at the given locus

Simulation Studies: With Admixture

• Assume two discrete populations

• Simulate genotypes of 500 affected and 500 control individuals at 150 unlinked microsatellite loci
– With sample size N, we have 2N chromosomes
– Assume the two populations split 0.15N generations ago, followed by two generations of admixing
– Controls: random draws from the whole population
– Affected group: random draws from the whole population, assuming a disease risk model for grandparents

Structure: Simulation Results

• Learning population structure using genotypes from two recently admixed populations
– Dashed line: case group

STRAT: Simulation Results

• Rejection rates under the null hypothesis

• p1, p2: allele frequencies for populations 1 and 2 at the given locus

TDT vs. STRAT

• TDT
– Requires genotyping parents of the affected offspring

• STRAT
– Requires genotypes for additional loci to infer population structure with STRUCTURE

EigenStrat

• Structured association approach

• Step 1: Run PCA on genotype data to infer the population structure

• Step 2: Perform association analysis after correcting for the population effects in genotype/phenotype data

• Advantage: low computational cost compared to STRAT

EigenStrat: Structured Association with PCA

• Step 1: (Inferring Ancestry) PCA is applied to genotype data to infer continuous axes of genetic variation

Price et al., Nature Genetics 2006
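A simplified sketch of Step 1 using only the Python standard library. It centers and scales each SNP, builds the individual-by-individual covariance matrix, and extracts the leading axis by power iteration. The normalization by sqrt(p(1-p)), the use of power iteration instead of a full eigendecomposition, the function name, and the toy genotypes are all assumptions made for illustration, not details from the slide.

```python
import math
import random


def top_ancestry_axis(genotypes, n_iter=200, seed=0):
    """Return the leading axis of variation, one coordinate per individual.

    genotypes[s][i] is the allele count (0, 1, or 2) of SNP s in individual i.
    """
    n = len(genotypes[0])
    rows = []
    for snp in genotypes:
        mean = sum(snp) / n
        p = mean / 2  # sample allele frequency
        if p <= 0 or p >= 1:
            continue  # a monomorphic SNP carries no ancestry information
        scale = math.sqrt(p * (1 - p))
        rows.append([(g - mean) / scale for g in snp])
    if not rows:
        raise ValueError("no polymorphic SNPs")
    # Covariance between individuals, averaged over SNPs.
    cov = [[sum(r[i] * r[j] for r in rows) / len(rows) for j in range(n)]
           for i in range(n)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data: individuals 0-1 and 2-3 come from two different populations.
toy = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 2, 1],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
]
axis = top_ancestry_axis(toy)
print(axis)
```

In the toy data the coordinates of individuals 0 and 1 share one sign and those of individuals 2 and 3 the other, so the axis separates the two groups. A real analysis would use a numerical library and keep several axes.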

What are the new axes?

[Figure: data plotted against Original Variable A and Original Variable B, with PC 1 and PC 2 drawn as the new axes]

• Orthogonal directions of greatest variance in data

• Projections along PC1 discriminate the data most along any one axis

EigenStrat: Structured Association with PCA

• Step 2: (Removing Ancestry Effects) Genotype at a candidate SNP and phenotype are continuously adjusted by amounts attributable to ancestry along each axis

• Step 3: (Association test)
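Steps 2 and 3 can be sketched as a regression of both genotype and phenotype on each ancestry axis, followed by a correlation-based test on the residuals. The statistic used here, (N - K - 1) times the squared correlation of the adjusted values for N individuals and K axes, is assumed from the description in the EigenStrat paper rather than taken from the slide. The function names and toy data are hypothetical.

```python
import math


def residualize(values, axes):
    """Subtract from `values` the part attributable to each ancestry axis.

    Assumes the axes are mutually orthogonal, as principal components are.
    """
    out = list(values)
    for axis in axes:
        gamma = sum(a * v for a, v in zip(axis, out)) / sum(a * a for a in axis)
        out = [v - gamma * a for v, a in zip(out, axis)]
    return out


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def eigenstrat_stat(genotype, phenotype, axes):
    """(N - K - 1) * r^2 on ancestry-adjusted genotype and phenotype."""
    g = residualize(genotype, axes)
    y = residualize(phenotype, axes)
    r = pearson(g, y)
    return (len(g) - len(axes) - 1) * r * r


# Toy example: one axis separating two populations of four individuals each.
axis = [1, 1, 1, 1, -1, -1, -1, -1]
genotype = [2, 2, 1, 1, 0, 0, 1, 1]
phenotype = [1, 0, 1, 0, 0, 0, 0, 0]
unadjusted = eigenstrat_stat(genotype, phenotype, axes=[])
adjusted = eigenstrat_stat(genotype, phenotype, axes=[axis])
print(unadjusted, adjusted)
```

In this toy example genotype and phenotype are correlated only because both track ancestry, so the statistic drops to zero once the axis is removed.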

Simulation Procedure

• Given FST, for each SNP
– Draw an ancestral population allele frequency p from the uniform distribution on [0.1, 0.9]
– Allele frequencies for populations 1 and 2, p1 and p2, are drawn from Beta(p(1-FST)/FST, (1-p)(1-FST)/FST)
– Draw SNPs using population allele frequencies p1 and p2
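The procedure above can be sketched with the standard library's random number generator. Drawing each diploid genotype as two independent allele draws (Hardy-Weinberg proportions within each population) is an assumption, since the slide only says that SNPs are drawn using p1 and p2. The function name and sample sizes are hypothetical.

```python
import random


def simulate_snp(fst, n1, n2, rng):
    """Simulate one SNP for n1 individuals from population 1 and n2 from 2.

    Returns (p1, p2, genotypes1, genotypes2), with genotypes coded 0/1/2.
    """
    # Ancestral allele frequency.
    p = rng.uniform(0.1, 0.9)
    # Population-specific frequencies from Beta(p(1-FST)/FST, (1-p)(1-FST)/FST).
    alpha = p * (1 - fst) / fst
    beta = (1 - p) * (1 - fst) / fst
    p1 = rng.betavariate(alpha, beta)
    p2 = rng.betavariate(alpha, beta)

    def draw(freq, n):
        # Assumption: two independent allele draws per individual.
        return [int(rng.random() < freq) + int(rng.random() < freq)
                for _ in range(n)]

    return p1, p2, draw(p1, n1), draw(p2, n2)


rng = random.Random(0)
p1, p2, g1, g2 = simulate_snp(fst=0.01, n1=50, n2=50, rng=rng)
print(p1, p2, g1[:5], g2[:5])
```

Smaller FST values concentrate the Beta distribution around p, so p1 and p2 end up closer together and the two populations are harder to tell apart.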

Simulation Study

• Discrete populations vs. admixed populations

• Moderate vs. extreme differences in ancestry between cases and controls
– Moderate: control (40% population 1, 60% population 2), case (60% population 1, 40% population 2)
– Extreme: control (0% population 1, 100% population 2), case (50% population 1, 50% population 2)

• Datasets with candidate loci selected as follows
– Random SNPs (no associations)
– Differentiated SNPs (a large difference in allele frequencies between populations, but no associations)
  • Allele frequency 0.8 for population 1, 0.2 for population 2
– Causal SNPs

Simulation Results

# SNPs required    FST
20,000             0.005
50,000             0.002
100,000            0.001

• To correct for population stratification, a greater number of SNPs is required for less differentiated populations

PCA for Population Structure Discovery

[Figure: PCA plot. Horizontal axis: genetic variation between northwest and southeast Europe. Vertical axis: genetic variation between two southeast European populations.]

European American Dataset

• 488 European Americans genotyped at 116,204 SNPs

• A mutation in the LCT gene is 100% associated with the lactase persistence phenotype
– This mutation was not included in this dataset
– Look for an indirect association between the phenotype and a nearby SNP, rs3769005, which is in 90% LD with the LCT mutation based on HapMap data

• The region of chromosome 2 surrounding the LCT gene is highly associated with the phenotype due to the strong selective sweep in that region.

Association Results for SNPs Outside of Chromosome 2 (LCT gene)

Summary

• Genomic Control
– Cannot handle the effect of admixed populations

• STRAT: structured association with STRUCTURE
– Uses a generative model that explicitly models admixture
– Computationally demanding

• EigenStrat
– Does not provide intuition behind the admixing process
– Significantly lower computational cost than STRAT
