Population Structure and Association Analysis
02-715 Advanced Topics in Computational Genomics
Population Structure and Association Analysis
• Population structure in data causes false positives
  – Samples in the case population are usually more related
  – Any SNPs more prevalent in the case population will be found significantly associated with the trait.
Accounting for Population Structure in Association Analysis
• Needs to account for population structure in association mapping.
• Careful study design with each population represented in case/control groups in a balanced way.
  – Can be hard to control
  – The effect of cryptic population structure
Family-based Design vs. Population-based Design
• Family-based studies
  – The effect of population structure can be controlled by the use of parents' genotypes.
  – In practice, collecting genotypes from multiple individuals in a family can be hard (e.g., late-onset diseases).
• Population-based design
  – Data collection is easier for a large number of unrelated individuals than for a large number of families.
  – The control samples can be reused in different studies.
Accounting for Population Structure in Association Analysis
• Family-based method
  – Transmission disequilibrium test (TDT)
• Population-based methods
  – Genomic control (Devlin & Roeder, Biometrics 1999)
  – Structured association (Pritchard et al., AJHG 2000)
  – EigenStrat: principal component analysis (Price et al., Nature Genetics 2006)
Transmission Disequilibrium Test (TDT)
                       Non-transmitted alleles
Transmitted alleles    M      m      Total
M                      a      b      a+b
m                      c      d      c+d
Total                  a+c    b+d    2N

• Genotype affected individuals and their parents (trio)
• Null hypothesis: (b/(b+c), c/(b+c)) is compatible with (0.5, 0.5)
• Test statistic is given as (b - c)^2 / (b + c)
• The non-transmitted alleles play the role of controls
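The TDT statistic above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `tdt` and the example counts are hypothetical, and the p-value assumes the statistic is compared with a chi-square distribution with 1 degree of freedom (the usual large-sample approximation for this McNemar-type test).

```python
import math

def tdt(b, c):
    """TDT statistic from the two discordant cells of the transmission table.

    b: parents who transmitted M and did not transmit m
    c: parents who transmitted m and did not transmit M
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    stat = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts, for illustration only.
stat, p_value = tdt(b=60, c=40)
print(stat, p_value)
```

Only the discordant cells b and c enter the statistic; the cells a and d come from parents whose two transmitted/non-transmitted alleles are identical and carry no information about preferential transmission.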
Genomic Control (GC)
• Idea: Use the SNPs that are not associated with the trait to remove the effect of population stratification
• Genotype data consist of
  – Candidate genes to be tested
  – L supplementary loci (null loci) for estimating the inflation factor λ
• GC uses the inflation factor λ to correct the association statistic of the SNP in the candidate gene
• Limitation: the inflation factor λ is assumed to be the same across the genome, ignoring population admixture
Devlin & Roeder, Biometrics 1999
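A minimal sketch of the genomic control correction follows. It rests on assumptions not stated in the slides: the association statistics are 1-df chi-square statistics, λ is estimated with the commonly used median estimator (median of the null-locus statistics divided by the median of a chi-square with 1 df, about 0.4549), and λ is floored at 1 so that statistics are never inflated by the correction. The function names and the example values are hypothetical.

```python
import statistics

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549

def inflation_factor(null_stats):
    """Estimate lambda from chi-square statistics at the L null loci."""
    lam = statistics.median(null_stats) / CHI2_1DF_MEDIAN
    # Assumed convention: never correct in the anti-conservative direction.
    return max(lam, 1.0)

def gc_correct(candidate_stat, lam):
    """Deflate the candidate SNP's statistic by the inflation factor."""
    return candidate_stat / lam

# Hypothetical null-locus statistics, for illustration only.
null_stats = [0.2, 0.5, 0.9098, 1.5, 3.0]
lam = inflation_factor(null_stats)
corrected = gc_correct(10.0, lam)
print(lam, corrected)
```

Because a single λ rescales every statistic, the sketch also makes the stated limitation visible: a SNP whose allele frequencies differ strongly between populations is corrected by the same factor as a SNP with no differentiation at all.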
STRAT: Structured Association (Pritchard et al., AJHG 2000)
• Idea: Within each subpopulation, an association between a genetic marker and the trait is a true association.
• Two-stage method
  – Step 1: Using Structure (Pritchard et al., Genetics 2000) and unlinked genetic markers,
    • estimate the population structure
    • assign sampled individuals to putative subpopulations
  – Step 2:
    • Test for association within the subpopulations inferred in Step 1
• Limitation
  – Running Structure is computationally demanding
Pritchard et al., AJHG 2000
STRAT: Step 2
• Given ancestry proportions q_k(i) for population k and individual i, as estimated by STRUCTURE
• H0: The probability model for the genotypes c under the null hypothesis of no association
• H1: The probability model for the genotypes c under the alternative hypothesis of association
STRAT: Step 2
• Likelihood ratio test:
  – Large values indicate that the alternative hypothesis explains the data better.
Simulation Studies: No Admixture
• Assume two discrete populations
• Simulate genotypes of 150 affected and 150 control individuals at 100 unlinked loci
  – With sample size N, we have 2N chromosomes
  – Assume two populations have split 0.05N generations ago without migration
  – Controls: half of the controls came from each of the two subpopulations
  – Affected group: 100 from population 1, 50 from population 2
STRAT: Simulation Results
• Rejection rates under the null hypothesis of no association
• p1, p2: allele frequencies for populations 1 and 2 at the given locus
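To make the within-subpopulation test concrete, here is a deliberately simplified sketch on the no-admixture design above (100 + 50 cases, 75 + 75 controls, two alleles per individual). It is not the STRAT procedure itself: it assumes every individual is assigned to exactly one subpopulation instead of using the ancestry proportions q_k(i), it handles a single biallelic locus, and it does not compute a p-value. The allele frequencies 0.8 and 0.2, the function names, and the data layout are hypothetical.

```python
import math

def binom_loglik(n_m, n):
    """Maximized log-likelihood of n_m copies of allele M among n alleles."""
    ll = 0.0
    for count in (n_m, n - n_m):
        if count > 0:
            ll += count * math.log(count / n)
    return ll

def log_lr(counts):
    """Log likelihood ratio, summed over subpopulations.

    counts maps subpopulation -> {"case": (n_M, n), "control": (n_M, n)}.
    H0: one allele frequency per subpopulation.
    H1: separate frequencies for cases and controls in each subpopulation.
    """
    total = 0.0
    for groups in counts.values():
        (m1, n1), (m0, n0) = groups["case"], groups["control"]
        ll_h1 = binom_loglik(m1, n1) + binom_loglik(m0, n0)
        ll_h0 = binom_loglik(m1 + m0, n1 + n0)
        total += ll_h1 - ll_h0
    return total

# Allele M has frequency 0.8 in population 1 and 0.2 in population 2,
# identical in cases and controls within each population (no association).
null_counts = {
    1: {"case": (160, 200), "control": (120, 150)},
    2: {"case": (20, 100), "control": (30, 150)},
}
pooled_case = (160 + 20) / 300      # frequency of M among all case alleles
pooled_control = (120 + 30) / 300   # frequency of M among all control alleles
print(pooled_case, pooled_control, log_lr(null_counts))
```

Pooling across populations makes allele M look more common in cases than in controls even though the frequencies agree within each population, while the stratified log likelihood ratio stays at zero. Large values of the ratio would instead indicate a frequency difference between cases and controls inside a subpopulation.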
Simulation Studies: With Admixture
• Assume two discrete populations
• Simulate genotypes of 500 affected and 500 control individuals at 150 unlinked microsatellite loci
  – With sample size N, we have 2N chromosomes
  – Assume two populations have split 0.15N generations ago, followed by two generations of admixing
  – Controls: random draws from the whole population
  – Affected group: random draws from the whole population, assuming a disease risk model for grandparents
Structure: Simulation Results
• Learning population structure using genotypes from two recently admixed populations
  – Dashed line: case group
STRAT: Simulation Results
• Rejection rates under the null hypothesis
• p1, p2: allele frequencies for populations 1 and 2 at the given locus
TDT vs. STRAT
• TDT
  – Requires genotyping parents of the affected offspring
• STRAT
  – Requires genotypes for additional loci to infer population structure with STRUCTURE
EigenStrat
• Structured association approach
• Step 1: Run PCA on genotype data to infer the population structure
• Step 2: Perform association analysis after correcting for the population effects in genotype/phenotype data
• Advantage: lower computational cost than STRAT
EigenStrat: Structured Association with PCA
• Step 1: (Inferring Ancestry) PCA is applied to genotype data to infer continuous axes of genetic variation
Price et al., Nature Genetics 2006
What are the new axes?
[Figure: scatter plot of Original Variable A against Original Variable B, with the axes PC 1 and PC 2 overlaid]
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis
EigenStrat: Structured Association with PCA
• Step 2: (Removing Ancestry Effects) Genotype at a candidate SNP and phenotype are continuously adjusted by amounts attributable to ancestry along each axis
• Step 3: (Association test)
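The three steps can be sketched end to end on toy data. This is an illustration under several assumptions, not the EIGENSTRAT software: only one axis is used (K = 1), the axis is found by power iteration in place of a full eigendecomposition, each SNP is normalized by sqrt(p(1 - p)) with p taken as the sample allele frequency, and the Step 3 statistic is taken to be (N - K - 1) times the squared correlation between adjusted genotype and adjusted phenotype, since the formula did not survive in the slide text. All function names and the simulated data are hypothetical.

```python
import math
import random

def top_axis(genotypes, iters=100):
    """Step 1: leading axis of variation across individuals.

    genotypes[j][i] is the allele count (0/1/2) of individual i at SNP j.
    """
    n = len(genotypes[0])
    rows = []
    for g in genotypes:
        mean = sum(g) / n
        p = mean / 2.0
        if 0.0 < p < 1.0:  # skip SNPs that are monomorphic in the sample
            scale = math.sqrt(p * (1.0 - p))
            rows.append([(x - mean) / scale for x in g])
    if not rows:
        raise ValueError("no polymorphic SNPs")
    v = [math.sin(i + 1.0) for i in range(n)]  # arbitrary start vector
    for _ in range(iters):
        # Apply X^T X to v without forming the N x N matrix explicitly.
        w = [0.0] * n
        for r in rows:
            s = sum(ri * vi for ri, vi in zip(r, v))
            for i in range(n):
                w[i] += s * r[i]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def residualize(x, axis):
    """Step 2: center x and remove the part attributable to the axis."""
    mean = sum(x) / len(x)
    centered = [xi - mean for xi in x]
    gamma = sum(a * xi for a, xi in zip(axis, centered)) / sum(a * a for a in axis)
    return [xi - gamma * a for xi, a in zip(centered, axis)]

def adjusted_stat(genotype, phenotype, axis, k=1):
    """Step 3 (assumed form): (N - K - 1) * squared correlation."""
    g = residualize(genotype, axis)
    y = residualize(phenotype, axis)
    sgg = sum(x * x for x in g)
    syy = sum(x * x for x in y)
    if sgg == 0.0 or syy == 0.0:
        return 0.0
    sgy = sum(a * b for a, b in zip(g, y))
    return (len(g) - k - 1) * sgy * sgy / (sgg * syy)

def draw(rng, p, n):
    """Diploid genotypes for n individuals at allele frequency p."""
    return [sum(rng.random() < p for _ in range(2)) for _ in range(n)]

# Toy data: two populations of 20 individuals, 100 differentiated SNPs.
rng = random.Random(1)
genotypes = []
for _ in range(100):
    p1, p2 = rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9)
    genotypes.append(draw(rng, p1, 20) + draw(rng, p2, 20))
candidate = draw(rng, 0.8, 20) + draw(rng, 0.2, 20)  # differentiated, not causal
phenotype = [1.0] * 15 + [0.0] * 5 + [1.0] * 5 + [0.0] * 15

axis = top_axis(genotypes)
stat = adjusted_stat(candidate, phenotype, axis)
print(round(stat, 3))
```

The adjustment is an ordinary least-squares projection: whatever part of the genotype or phenotype is linearly explained by the ancestry axis is removed before the two are correlated, which is why the adjusted genotype ends up orthogonal to the axis.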
Simulation Procedure
• Given FST, for each SNP
  – Draw an ancestral population allele frequency p from the uniform distribution on [0.1, 0.9]
  – Allele frequencies for populations 1 and 2, p1 and p2, are drawn from Beta(p(1-FST)/FST, (1-p)(1-FST)/FST)
  – Draw SNPs using population allele frequencies p1 and p2
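The simulation procedure above can be sketched with the standard library alone. The sketch assumes diploid individuals whose two alleles are drawn independently at the population allele frequency, which the slide does not state; the function names, sample sizes, and the FST value in the usage lines are hypothetical.

```python
import random

def draw_genotypes(rng, freq, n):
    """Allele counts (0/1/2) for n diploid individuals, alleles drawn independently."""
    return [sum(rng.random() < freq for _ in range(2)) for _ in range(n)]

def simulate_snp(fst, n1, n2, rng):
    """Simulate one SNP in two populations that diverged with the given FST."""
    if not 0.0 < fst < 1.0:
        raise ValueError("fst must lie strictly between 0 and 1")
    # Ancestral allele frequency.
    p = rng.uniform(0.1, 0.9)
    # Beta parameters from the slide: p(1-FST)/FST and (1-p)(1-FST)/FST.
    a = p * (1.0 - fst) / fst
    b = (1.0 - p) * (1.0 - fst) / fst
    p1 = rng.betavariate(a, b)
    p2 = rng.betavariate(a, b)
    return p, p1, p2, draw_genotypes(rng, p1, n1), draw_genotypes(rng, p2, n2)

rng = random.Random(0)
p, p1, p2, g1, g2 = simulate_snp(fst=0.01, n1=500, n2=500, rng=rng)
print(p, p1, p2)
```

The Beta distribution here has mean p and variance FST * p * (1 - p), so a smaller FST keeps p1 and p2 closer to the ancestral frequency and to each other.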
Simulation Study
• Discrete populations vs. admixed populations
• Moderate vs. extreme ancestry differences between cases and controls
  – Moderate: control (40% population 1, 60% population 2), case (60% population 1, 40% population 2)
  – Extreme: control (0% population 1, 100% population 2), case (50% population 1, 50% population 2)
• Datasets with candidate loci selected as follows
  – Random SNPs (no associations)
  – Differentiated SNPs (a large difference in allele frequencies between populations, but no associations)
    • Allele frequency 0.8 for population 1, 0.2 for population 2
  – Causal SNPs
Simulation Results
# SNPs required    FST
20,000             0.005
50,000             0.002
100,000            0.001

• To correct for population stratification, a greater number of SNPs is required for less differentiated populations
PCA for Population Structure Discovery
[Figure: PCA plot. Axis labels: "Genetic variation between northwest and southeast Europe" and "Genetic variation between two southeast European populations"]
European American Dataset
• 488 European Americans genotyped at 116,204 SNPs
• A mutation in the LCT gene is 100% associated with the lactase persistence phenotype
  – This mutation was not included in this dataset
  – Look for an indirect association between the phenotype and a nearby SNP, rs3769005, which is in 90% LD with the LCT mutation based on HapMap data
• The region on chromosome 2 surrounding the LCT gene is highly associated with the phenotype due to the strong selective sweep in that region.
Association Results for SNPs Outside of Chromosome 2 (LCT gene)
Summary
• Genomic Control
  – Cannot handle the effect of admixed populations
• STRAT: structured association with STRUCTURE
  – Uses a generative model that explicitly models admixture
  – Computationally demanding
• EigenStrat
  – Does not provide intuition behind the admixing process
  – Significantly lower computational cost than STRAT