Population Structure and Association Analysis · sssykim/teaching/s13/slides/lecture_sa.pdf ·...


Upload: ngominh

Post on 05-Jul-2018


Page 1

Population Structure and Association Analysis

02-715 Advanced Topics in Computational Genomics

Page 2

Population Structure and Association Analysis

• Population structure in data causes false positives
  – Samples in the case population are usually more closely related to one another
  – Any SNP that is more prevalent in the case population will be found significantly associated with the trait, whether or not it affects the trait.

Page 3

Accounting for Population Structure in Association Analysis

• Association mapping needs to account for population structure.
• Careful study design, with each population represented in the case and control groups in a balanced way.
  – Can be hard to control
  – Cryptic population structure can still have an effect

Page 4

Family-based Design vs. Population-based Design

• Family-based studies
  – The effect of population structure can be controlled by the use of parents' genotypes.
  – In practice, collecting genotypes from multiple individuals in a family can be hard (e.g., late-onset diseases).
• Population-based design
  – Data collection is easier for a large number of unrelated individuals than for a large number of families.
  – The control samples can be reused in different studies.

Page 5

Accounting for Population Structure in Association Analysis

• Family-based method
  – Transmission disequilibrium test (TDT)
• Population-based methods
  – Genomic control (Devlin & Roeder, Biometrics 1999)
  – Structured association (Pritchard et al., AJHG 2000)
  – EigenStrat: principal component analysis (Price et al., Nature Genetics 2006)

Page 6

Transmission Disequilibrium Test (TDT)

                       Non-transmitted allele
Transmitted allele     M       m       Total
M                      a       b       a+b
m                      c       d       c+d
Total                  a+c     b+d     2N

• Genotype affected individuals and their parents (trios)
• Null hypothesis: (b/(b+c), c/(b+c)) is compatible with (0.5, 0.5)
• Test statistic is given as (b-c)^2/(b+c)
• The non-transmitted alleles play the role of controls
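The test statistic above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function name `tdt_statistic` and the counts are hypothetical, and the p-value assumes the usual large-sample result that the statistic is approximately chi-square distributed with 1 degree of freedom under the null hypothesis.

```python
import math

def tdt_statistic(b, c):
    """Return the TDT statistic (b - c)^2 / (b + c) and an approximate p-value.

    b: heterozygous parents who transmitted M and did not transmit m
    c: heterozygous parents who transmitted m and did not transmit M
    """
    if b + c == 0:
        raise ValueError("no informative transmissions (b + c == 0)")
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts, chosen only to illustrate the arithmetic.
stat, p_value = tdt_statistic(b=60, c=40)
print(f"TDT statistic = {stat:.2f}, approximate p-value = {p_value:.4f}")
```

Only the off-diagonal cells b and c enter the statistic, because homozygous parents transmit the same allele either way and carry no information about preferential transmission.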

Page 7

Genomic Control (GC)

• Idea: Use the SNPs that are not associated with the trait to remove the effect of population stratification
• Genotype data consist of
  – Candidate genes to be tested
  – L supplementary loci (null loci) for estimating the inflation factor λ
• GC uses the inflation factor λ to correct the association statistic of the SNP in the candidate gene
• Limitation: the inflation factor λ is assumed to be the same across the genome, ignoring population admixture

Devlin & Roeder, Biometrics 1999
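A rough sketch of the correction described on this slide follows. It rests on several assumptions that are not stated in the slides: λ is estimated as the median of the null-locus statistics divided by the median of a chi-square distribution with 1 degree of freedom (approximately 0.4549), λ is floored at 1 so that statistics are never inflated by the correction, and the statistic values themselves are made up.

```python
import statistics

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549

def inflation_factor(null_stats):
    """Estimate lambda from association statistics at the L null loci."""
    lam = statistics.median(null_stats) / CHI2_1DF_MEDIAN
    # Assumed convention: never deflate below 1.
    return max(lam, 1.0)

def gc_correct(candidate_stat, lam):
    """Divide a candidate-locus statistic by the inflation factor."""
    return candidate_stat / lam

# Hypothetical statistics at seven null loci and one candidate SNP.
null_stats = [0.1, 0.3, 0.5, 0.7, 0.9, 1.2, 2.0]
lam = inflation_factor(null_stats)
corrected = gc_correct(6.0, lam)
print(f"lambda = {lam:.3f}, corrected statistic = {corrected:.3f}")
```

Because the same λ divides every statistic, the ranking of SNPs is unchanged; this is the uniform-inflation assumption listed as the method's limitation.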

Page 8

STRAT: Structured Association (Pritchard et al., AJHG 2000)

• Idea: Within each subpopulation, an association between a genetic marker and the trait is a true association.
• Two-stage method
  – Step 1: Using Structure (Pritchard et al., Genetics 2000) and unlinked genetic markers,
    • estimate the population structure
    • assign sampled individuals to putative subpopulations
  – Step 2:
    • Test for association within the subpopulations inferred in Step 1
• Limitation
  – Running Structure is computationally demanding

Pritchard et al., AJHG 2000

Page 9

STRAT: Step 2

• Given ancestry proportions q_k^(i) for population k and individual i, estimated by STRUCTURE
• H0: The probability model for the genotypes c under the null hypothesis of no association
• H1: The probability model for the genotypes c under the alternative hypothesis of association

Page 10

STRAT: Step 2

• Likelihood ratio test:
  – Large values indicate that the alternative hypothesis explains the data better.
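The formulas for the two models and the ratio did not survive in this transcript. One way to write them, offered as a sketch rather than as the slide's own notation, is the following. The symbols are assumptions: c^{(i,a)} denotes allele copy a of individual i at the candidate locus, \phi^{(i)} is that individual's phenotype, p_{k,j} is the frequency of allele j in subpopulation k, and hats denote estimates. Under H0 the subpopulation allele frequencies do not depend on phenotype; under H1 they do.

```latex
% H0: allele frequencies within each subpopulation are independent of phenotype
\Pr\nolimits_0(C; P_0, Q) = \prod_{i} \prod_{a=1}^{2} \sum_{k} q_k^{(i)} \, p_{k,\,c^{(i,a)}}

% H1: allele frequencies within each subpopulation depend on the phenotype \phi^{(i)}
\Pr\nolimits_1(C; P_1, Q) = \prod_{i} \prod_{a=1}^{2} \sum_{k} q_k^{(i)} \, p^{(\phi^{(i)})}_{k,\,c^{(i,a)}}

% Likelihood ratio statistic evaluated at the estimated parameters
\Lambda = \frac{\Pr\nolimits_1(C; \hat{P}_1, \hat{Q})}{\Pr\nolimits_0(C; \hat{P}_0, \hat{Q})}
```

In this form each allele copy is a mixture over subpopulations weighted by the individual's ancestry proportions, so an individual with mixed ancestry contributes to the frequency estimates of more than one subpopulation.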

Page 11

Simulation Studies: No Admixture

• Assume two discrete populations
• Simulate genotypes of 150 affected and 150 control individuals at 100 unlinked loci
  – With sample size N, we have 2N chromosomes
  – Assume the two populations split 0.05N generations ago, without migration
  – Controls: half of the controls came from each of the two subpopulations
  – Affected group: 100 from population 1, 50 from population 2
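A short worked example shows why this sampling scheme produces false positives. The group sizes come from the slide; the allele frequencies p1 and p2 are hypothetical values chosen for illustration, and the helper name `pooled_allele_frequency` is not from the lecture.

```python
def pooled_allele_frequency(counts, freqs):
    """Expected allele frequency in a group drawn from several populations.

    counts: number of individuals taken from each population
    freqs:  allele frequency in each population
    """
    total = sum(counts)
    return sum(n * p for n, p in zip(counts, freqs)) / total

# Hypothetical allele frequencies at a locus with no effect on the trait.
p1, p2 = 0.8, 0.2

# Group composition from the simulation design above.
case_freq = pooled_allele_frequency((100, 50), (p1, p2))
control_freq = pooled_allele_frequency((75, 75), (p1, p2))
print(f"expected frequency in cases = {case_freq:.2f}, in controls = {control_freq:.2f}")
```

The expected frequencies differ between cases and controls even though the locus has no effect on the trait, so a test that ignores population membership would tend to reject the null hypothesis.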

Page 12

STRAT: Simulation Results

• Rejection rates under the null hypothesis of no association
• p1, p2: allele frequencies for populations 1 and 2 at the given locus

Page 13

Simulation Studies: With Admixture

• Assume two discrete populations
• Simulate genotypes of 500 affected and 500 control individuals at 150 unlinked microsatellite loci
  – With sample size N, we have 2N chromosomes
  – Assume the two populations split 0.15N generations ago, followed by two generations of admixing
  – Controls: random draws from the whole population
  – Affected group: random draws from the whole population, assuming a disease risk model for grandparents

Page 14

Structure: Simulation Results

• Learning population structure using genotypes from two recently admixed populations
  – Dashed line: case group

Page 15

STRAT: Simulation Results

• Rejection rates under the null hypothesis
• p1, p2: allele frequencies for populations 1 and 2 at the given locus

Page 16

TDT vs. STRAT

• TDT
  – Requires genotyping parents of the affected offspring
• STRAT
  – Requires genotypes for additional loci to infer population structure with STRUCTURE

Page 17

EigenStrat

• Structured association approach
• Step 1: Run PCA on genotype data to infer the population structure
• Step 2: Perform association analysis after correcting for the population effects in genotype/phenotype data
• Advantage: low computational cost compared to STRAT

Page 18

EigenStrat: Structured Association with PCA

• Step 1: (Inferring Ancestry) PCA is applied to genotype data to infer continuous axes of genetic variation

Price et al., Nature Genetics 2006

Page 19

What are the new axes?

[Figure: data plotted against Original Variable A and Original Variable B, with the directions PC 1 and PC 2 overlaid]

• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis

Page 20

EigenStrat: Structured Association with PCA

• Step 2: (Removing Ancestry Effects) Genotype at a candidate SNP and phenotype are continuously adjusted by amounts attributable to ancestry along each axis
• Step 3: (Association test)
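The three steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the EigenStrat software: only the top axis is found (by power iteration rather than a full eigendecomposition), each SNP is centered and scaled by sqrt(p(1-p)) before PCA, the adjustment is a least-squares regression on each axis, and the statistic is taken to be (N - K - 1) times the squared correlation of the adjusted genotype and phenotype, where K is the number of axes. All function names and the toy data are hypothetical.

```python
import math
import random

def normalize_snp(row):
    """Center one SNP's genotypes (0/1/2) and scale by sqrt(p(1-p))."""
    mean = sum(row) / len(row)
    p = mean / 2.0
    if p <= 0.0 or p >= 1.0:
        return None  # a monomorphic SNP carries no ancestry information
    scale = math.sqrt(p * (1.0 - p))
    return [(g - mean) / scale for g in row]

def top_axis(genotypes, iterations=200, seed=0):
    """Step 1: leading axis of variation across individuals (power iteration)."""
    rows = [r for r in map(normalize_snp, genotypes) if r is not None]
    n = len(rows[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        # Multiply v by X^T X without forming the n-by-n matrix.
        w = [0.0] * n
        for row in rows:
            dot = sum(x * y for x, y in zip(row, v))
            for j in range(n):
                w[j] += dot * row[j]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def adjust(values, axis):
    """Step 2: subtract the part of `values` explained by ancestry along `axis`."""
    gamma = sum(a * x for a, x in zip(axis, values)) / sum(a * a for a in axis)
    return [x - gamma * a for x, a in zip(values, axis)]

def association_statistic(genotype, phenotype, axes):
    """Step 3: (N - K - 1) * squared correlation of adjusted genotype and phenotype."""
    g, y = list(genotype), list(phenotype)
    for axis in axes:  # axes are assumed to be mutually orthogonal
        g, y = adjust(g, axis), adjust(y, axis)
    n, k = len(g), len(axes)
    g_mean, y_mean = sum(g) / n, sum(y) / n
    cov = sum((a - g_mean) * (b - y_mean) for a, b in zip(g, y))
    var_g = sum((a - g_mean) ** 2 for a in g)
    var_y = sum((b - y_mean) ** 2 for b in y)
    return (n - k - 1) * cov * cov / (var_g * var_y)

# Toy data: 20 individuals from each of two strongly differentiated populations.
rng = random.Random(1)

def draw(freq, n):
    """Draw n diploid genotypes (allele counts) at allele frequency `freq`."""
    return [sum(rng.random() < freq for _ in range(2)) for _ in range(n)]

genotypes = [draw(0.9, 20) + draw(0.1, 20) for _ in range(100)]
candidate = draw(0.9, 20) + draw(0.1, 20)              # differentiated, not causal
phenotype = [1] * 15 + [0] * 5 + [1] * 5 + [0] * 15    # cases enriched in population 1

axis = top_axis(genotypes)
naive = association_statistic(candidate, phenotype, [])
adjusted = association_statistic(candidate, phenotype, [axis])
print(f"unadjusted statistic = {naive:.2f}, ancestry-adjusted statistic = {adjusted:.2f}")
```

With data like these, the unadjusted statistic would typically be inflated by the ancestry imbalance between cases and controls, while the adjusted statistic should behave more like a draw from the null distribution. In practice several axes are used and a linear algebra library replaces the power iteration.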

Page 21

Simulation Procedure

• Given FST, for each SNP
  – Draw an ancestral population allele frequency p from the uniform distribution on [0.1, 0.9]
  – Allele frequencies for populations 1 and 2, p1 and p2, are drawn from Beta(p(1-FST)/FST, (1-p)(1-FST)/FST)
  – Draw SNPs using population allele frequencies p1 and p2
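The procedure above maps directly onto the standard library's random number generators. The sketch below is illustrative: the function name `simulate_snp` is hypothetical, and it assumes diploid individuals whose two allele copies are drawn independently at the population frequency, which the slide does not spell out.

```python
import random

def simulate_snp(fst, n1, n2, rng):
    """Simulate one SNP for n1 individuals from population 1 and n2 from population 2."""
    # Ancestral allele frequency, uniform on [0.1, 0.9].
    p = rng.uniform(0.1, 0.9)
    # Beta parameters from the slide: p(1-FST)/FST and (1-p)(1-FST)/FST.
    alpha = p * (1.0 - fst) / fst
    beta = (1.0 - p) * (1.0 - fst) / fst
    p1 = rng.betavariate(alpha, beta)
    p2 = rng.betavariate(alpha, beta)

    def draw(n, freq):
        # Assumption: genotype = number of copies of the allele among two independent draws.
        return [sum(rng.random() < freq for _ in range(2)) for _ in range(n)]

    return p, p1, p2, draw(n1, p1), draw(n2, p2)

rng = random.Random(0)
p, p1, p2, pop1, pop2 = simulate_snp(fst=0.01, n1=50, n2=50, rng=rng)
print(f"ancestral p = {p:.3f}, p1 = {p1:.3f}, p2 = {p2:.3f}")
```

The Beta distribution used here has mean p and variance FST * p * (1 - p), so smaller FST values give population frequencies that stay closer to the ancestral frequency.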

Page 22

Simulation Study

• Discrete populations vs. admixed populations
• Moderate vs. extreme ancestry differences between cases and controls
  – Moderate: control (40% population 1, 60% population 2), case (60% population 1, 40% population 2)
  – Extreme: control (0% population 1, 100% population 2), case (50% population 1, 50% population 2)
• Datasets with candidate loci selected as follows
  – Random SNPs (no associations)
  – Differentiated SNPs (a large difference in allele frequencies between populations, but no associations)
    • Allele frequency 0.8 for population 1, 0.2 for population 2
  – Causal SNPs

Page 23

Simulation Results

Page 24

Simulation Results

# SNPs required     FST
20,000              0.005
50,000              0.002
100,000             0.001

• To correct for population stratification, a greater number of SNPs is required for less differentiated populations

Page 25

PCA for Population Structure Discovery

[Figure: PCA plot. Axis labels: "Genetic variation between northwest and southeast Europe" and "Genetic variation between two southeast European populations"]

Page 26

European American Dataset

• 488 European Americans genotyped at 116,204 SNPs
• A mutation in the LCT gene is 100% associated with the lactase persistence phenotype
  – This mutation was not included in this dataset
  – Look for an indirect association between the phenotype and a nearby SNP, rs3769005, which is in 90% LD with the LCT mutation based on HapMap data
• The region of chromosome 2 surrounding the LCT gene is highly associated with the phenotype due to the strong selective sweep in that region.

Page 27

Association Results for SNPs Outside of Chromosome 2 (LCT gene)

Page 28

Summary

• Genomic Control
  – Cannot handle the effect of admixed populations
• STRAT: structured association with STRUCTURE
  – Uses a generative model that explicitly models admixture
  – Computationally demanding
• EigenStrat
  – Does not provide intuition behind the admixing process
  – Significantly lower computational cost than STRAT