pooled sequence haplotype estimator


Upload: devin-petersohn

Post on 14-Apr-2017


• Evaluation of estimate using the D statistic:

D = [equation not preserved in this transcript]

Accuracy Optimization

Estimating haplotype frequencies of Drosophila melanogaster from pooled sequence data

Devin Petersohn*, Aniqa Rahman* and Elizabeth King
* co-first authors

Abstract  

Goals and Significance

• Selection and population studies
• Genotype/phenotype mapping
• Big data processing
• Cost-effective data collection

Acknowledgments  

Results

Methods

Results Overview

• Increasing pool size to 15 founders does not decrease the accuracy of the algorithm
• Increased marker density improves the accuracy of the algorithm
• Window sizes based on genetic location are the most accurate
• Increasing window size improves accuracy up to a breaking point, beyond which %D begins to rise again

References

1. Burke MK et al. 2013. Genome-wide association study of extreme longevity in Drosophila melanogaster. Genome Biology and Evolution 6(1):1–11.
2. King EG, Macdonald SJ, Long AD. 2012. Properties and power of the Drosophila Synthetic Population Resource for the routine dissection of complex traits. Genetics 191:935–949.

DSPR

Conclusions  

       

This project was funded by the NSF, the NIH (F32GM099382), and the University of Missouri Office of Undergraduate Research.

Figure 1. Expected and estimated haplotype frequencies of the A1 (above) and AB8 (below) founders for pools 1 and 4 across the genome. Chromosome arms are displayed in varying colors; HMM-inferred frequencies appear in a darker shade and estimated values appear lighter.

Fly  Prep  

Pool   Min %D   Chromosome (min)   Max %D   Chromosome (max)   Mean %D   Avg. coverage
1      0.24     X                  24.51    X                  4.24      59.90
2      0.55     2L                 27.31    X                  3.97      51.68
3      0.93     2L                 20.69    X                  5.68      28.75
4      0.47     2R                 10.65    2L                 2.54      70.12

Figure 2. Percent difference between estimated and HMM-inferred haplotype frequencies in Pool 1 (blue) and Pool 4 (green) across the genome. Pool 4 displayed consistently lower D values than pools 1–3.

Figure 3. Average percent difference observed in haplotype estimates as a result of varying marker density in chromosome arm 2R, Pool 1. SNP density was down-sampled by randomly selecting SNPs from the pooled genomic data, from 1K to 140K SNPs in increments of 1K. Accuracy of the estimator suffers below 1K SNPs/Mb but reaches a stable, low %D after this point.
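The down-sampling described in the Figure 3 caption can be illustrated with a short sketch. This is not the authors' code: the function names, the toy SNP positions, and the 20 Mb arm length are hypothetical, and the only idea taken from the poster is drawing a random subset of SNPs at increasing sizes in steps of 1K.

```python
import random


def downsample_snps(snp_positions, n_keep, seed=None):
    """Return a sorted random subset of n_keep SNP positions (no replacement)."""
    if n_keep > len(snp_positions):
        raise ValueError("n_keep exceeds the number of available SNPs")
    rng = random.Random(seed)
    return sorted(rng.sample(snp_positions, n_keep))


def snp_density_per_mb(snp_positions, arm_length_bp):
    """Marker density in SNPs per megabase for one chromosome arm."""
    return len(snp_positions) / (arm_length_bp / 1_000_000)


# Toy data (hypothetical): 5000 SNP positions on a 20 Mb arm.
ARM_LENGTH_BP = 20_000_000
toy_rng = random.Random(0)
all_snps = sorted(toy_rng.sample(range(1, ARM_LENGTH_BP + 1), 5000))

# Down-sample in increments of 1K SNPs, as in the caption.
for n_keep in range(1000, 5001, 1000):
    subset = downsample_snps(all_snps, n_keep, seed=n_keep)
    print(n_keep, snp_density_per_mb(subset, ARM_LENGTH_BP))
```

In a full experiment, each subset would be passed to the haplotype estimator and the resulting average %D recorded against the marker density.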

Algorithm  

The founder ancestry at any given position in each recombinant inbred line (RIL) is determined with a high degree of certainty using the genome sequences of the founders and genotype data for the RILs in a hidden Markov model (HMM) [2]. In this study, HMM inferences are used as the expected haplotype frequencies in the different pools.
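One way to turn per-RIL HMM inferences into an expected pool frequency is sketched below. It assumes, beyond what the poster states, that each RIL contributes equally to the pool and that the HMM output has been reduced to a single founder label per RIL at the position of interest. The function name and the toy pool are hypothetical; A1 and AB8 are founder names taken from Figure 1.

```python
from collections import Counter


def expected_haplotype_frequencies(ril_founders, founders):
    """Expected founder haplotype frequencies at one position in a pool.

    ril_founders: one inferred founder label per RIL in the pool.
    founders: all founder labels, so absent founders are reported as 0.0.
    Assumes every RIL contributes equally to the pool.
    """
    if not ril_founders:
        raise ValueError("the pool must contain at least one RIL")
    counts = Counter(ril_founders)
    pool_size = len(ril_founders)
    return {founder: counts.get(founder, 0) / pool_size for founder in founders}


# Toy pool of four RILs at a single position (hypothetical labels).
pool = ["A1", "A1", "AB8", "A2"]
expected = expected_haplotype_frequencies(pool, ["A1", "A2", "A3", "AB8"])
print(expected)
```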

Table 1. Summary statistics for pools 1–4. The lowest mean D values are observed in pool 4, likely due to greater average coverage.

Question

Setting precedents for optimal configurations for haplotype estimation from pooled samples to minimize cost and maximize quantity and accuracy of results.

What are the optimal algorithm settings for estimating haplotype frequencies from pooled sequence data?

 

[Figure 3 plot: average %D (y-axis, 4–12) versus SNP density in SNPs per Mb (x-axis, 0–5000).]

As the cost of genome sequencing decreases, studies that were previously impossible are becoming more feasible. For population geneticists, however, sequencing every individual in a population is often cost prohibitive. Pooled sequencing is a commonly used, cheaper alternative to individual-level sequencing. However, accurately estimating the haplotype frequencies of a population from pooled sequence data remains a challenge. To address this problem, we have developed and refined an algorithm to estimate haplotype frequencies from pooled data. To experimentally validate our method, we used genomic data collected from pooled sets of recombinant inbred lines with a completely known haplotype structure. These lines were derived from a 50-generation controlled cross of 15 homozygous founder lines of Drosophila melanogaster. We validated the predictive accuracy of our haplotype estimator by comparing the haplotype frequency estimates obtained by our method with the known haplotype composition of the pool. We present a study in which the accuracy of the haplotype estimator is tested against variability in raw sequence coverage, SNP density, and the procedure of the algorithm. This algorithm, which can accurately estimate the haplotype frequency of a population from pooled sequence data, has the potential to significantly advance the field of genotype-phenotype mapping, a major goal of modern biology and bioinformatics.

[Figure 2 plot: %D (y-axis, 0–15) versus position in Mb across chromosome arms X, 2L, 2R, 3L, and 3R.]

Application

These plots demonstrate varying haplotype frequencies between young and old populations of Drosophila melanogaster in a longevity study [1]. For this region on chromosome 2R there is a significant difference between haplotype frequencies in the two populations. Different colors represent the 8 different haplotypes.

The algorithm takes as input the SNP alleles at each position, coded numerically (e.g., 0 = A, 1 = T), and refines an initial haplotype frequency guess to minimize the difference between the observed allele counts and the estimated allele counts weighted by haplotype frequency.
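The idea in the paragraph above can be sketched as a constrained least-squares fit. This is a minimal illustration under stated assumptions, not the authors' implementation: the poster does not specify the objective or the optimizer, so squared error on allele frequencies (counts divided by depth) and projected gradient descent are substitutions made here, and all names and toy data are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {h : h_k >= 0, sum(h) = 1}."""
    sorted_desc = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, value in enumerate(sorted_desc, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / i
        if value - candidate > 0:
            theta = candidate  # last index that passes gives the shift
    return [max(x - theta, 0.0) for x in v]


def estimate_haplotype_frequencies(founder_alleles, alt_counts, depths, n_iter=2000):
    """Estimate founder haplotype frequencies from pooled allele counts.

    founder_alleles[j][k]: allele (0 or 1) carried by founder k at SNP j.
    alt_counts[j], depths[j]: reads with allele 1 and total reads at SNP j
    (depths are assumed to be greater than zero).
    Minimizes sum_j (observed_j - sum_k founder_alleles[j][k] * h_k)^2
    subject to h_k >= 0 and sum(h) = 1.
    """
    n_snps = len(founder_alleles)
    n_founders = len(founder_alleles[0])
    observed = [a / d for a, d in zip(alt_counts, depths)]
    # Conservative step size: the gradient's Lipschitz constant is at most
    # 2 * n_snps * n_founders when all alleles are coded 0 or 1.
    step = 1.0 / (2.0 * n_snps * n_founders)
    h = [1.0 / n_founders] * n_founders  # initial guess: equal frequencies
    for _ in range(n_iter):
        residuals = [
            sum(g * hk for g, hk in zip(row, h)) - y
            for row, y in zip(founder_alleles, observed)
        ]
        gradient = [
            2.0 * sum(r * row[k] for r, row in zip(residuals, founder_alleles))
            for k in range(n_founders)
        ]
        h = project_to_simplex([hk - step * gk for hk, gk in zip(h, gradient)])
    return h


# Toy example (hypothetical): 3 founders, 6 SNPs, true frequencies 0.5/0.3/0.2.
founder_alleles = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1]]
alt_counts = [50, 30, 20, 80, 50, 70]
depths = [100] * 6
estimate = estimate_haplotype_frequencies(founder_alleles, alt_counts, depths)
print([round(x, 3) for x in estimate])
```

The simplex projection keeps every intermediate guess a valid set of frequencies (non-negative and summing to one), which is why the fit is written as projected gradient descent rather than an unconstrained least-squares solve.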

[Figure 4c plot: average %D (y-axis, 3.2–4.4) versus window size in cM (x-axis, 0–6).]

Figure 4. The effect of window size on accuracy using (a) SNPs, (b) chromosomal position (Kb), and (c) genetic position (cM). The optimal window size is marked on each plot. Genetic position has the lowest %D, and is therefore the optimal window metric when window size is between 0.8 and 3.5 cM (%D: 3.05–3.13).
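The three window metrics compared in Figure 4 can be sketched as three ways of choosing which SNPs surround a focal SNP. The record layout (`bp` and `cM` keys), the function name, and the toy coordinates are assumptions for illustration; the poster does not describe how windows are built.

```python
def snps_in_window(snps, focal_index, size, metric):
    """Return indices of the SNPs in a window around snps[focal_index].

    snps: list of dicts with 'bp' (physical position) and 'cM' (genetic
          position), sorted along the chromosome arm.
    metric: 'snp' -> size is a number of SNPs,
            'kb'  -> size is a physical width in kilobases,
            'cM'  -> size is a genetic width in centimorgans.
    """
    if metric == "snp":
        start = max(0, focal_index - size // 2)
        stop = min(len(snps), start + size)
        return list(range(start, stop))
    if metric == "kb":
        key, half_width = "bp", size * 1000 / 2
    elif metric == "cM":
        key, half_width = "cM", size / 2
    else:
        raise ValueError("metric must be 'snp', 'kb', or 'cM'")
    center = snps[focal_index][key]
    return [i for i, s in enumerate(snps) if abs(s[key] - center) <= half_width]


# Toy arm (hypothetical): evenly spaced SNPs with an uneven genetic map.
genetic_map = [0.0, 0.1, 0.2, 0.3, 1.0, 2.0, 3.0, 3.1, 3.2, 3.3]
toy_snps = [{"bp": i * 100_000, "cM": cm} for i, cm in enumerate(genetic_map)]

print(snps_in_window(toy_snps, 4, 3, "snp"))
print(snps_in_window(toy_snps, 4, 400, "kb"))
print(snps_in_window(toy_snps, 4, 2, "cM"))
```

Because recombination rate varies along a chromosome arm, a fixed genetic width covers different numbers of SNPs and different physical lengths at different positions, which is the distinction the three panels explore.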

[Figure 4a plot: average %D (y-axis, 3.5–6.5) versus window size in SNPs (x-axis, 0–15000). Annotations: optimum = 3.38 %D at 2500 bp; 200 SNP window.]

[Figure 4b plot: average %D (y-axis, 3.5–6.5) versus window size in Kb (x-axis, 0–2000). Annotation: optimum = 3.37 %D at 500 Kb.]

[Figure 4c annotation: optimum = 3.05 %D at 2 cM.]

[Plot labels: (ho), (hY)]

[Figure 1 plots: haplotype frequency versus position in Mb across chromosome arms X, 2L, 2R, 3L, and 3R; four panels, Pool 1 and Pool 4 for each of the two founders.]