pooled sequence haplotype estimator
TRANSCRIPT
• Evaluation of Estimate using D Statistic:
D=
Accuracy Optimization
Estimating haplotype frequencies of Drosophila melanogaster from pooled sequence data
Devin Petersohn*, Aniqa Rahman*, and Elizabeth King (* co-first authors)
Abstract
Goals and Significance
• Selection and Population Studies
• Genotype/Phenotype Mapping
• Big data processing
• Cost effective data collection
Acknowledgments
Results
Results Overview
Methods
• Increasing pool size to 15 founders does not decrease accuracy of algorithm
• Increased marker density improves accuracy of algorithm
• Window sizes based on genetic location are most accurate
• Increasing window size improves accuracy up to a breaking point, beyond which %D begins to rise again
References
1. Burke MK et al. 2013. Genome-wide association study of extreme longevity in Drosophila melanogaster. Genome Biology and Evolution 6(1):1–11.
2. King EG, Macdonald SJ, Long AD. 2012. Properties and power of the Drosophila Synthetic Population Resource for the routine dissection of complex traits. Genetics 191:935–949.
DSPR
Conclusions
This project was funded by the NSF, the NIH (F32GM099382), and the University of Missouri Office of
Undergraduate Research.
Figure 1. Expected and estimated haplotype frequencies of A1 (above) and AB8 (below) founders for pools 1 and 4 across the genome. Chromosome arms are displayed in varying colors, while HMM-inferred frequencies appear in a darker shade and estimated values appear lighter.
Fly Prep
Pool  Min %D  Arm (min)  Max %D  Arm (max)  Mean %D  Avg. coverage
1     0.24    X          24.51   X          4.24     59.90
2     0.55    2L         27.31   X          3.97     51.68
3     0.93    2L         20.69   X          5.68     28.75
4     0.47    2R         10.65   2L         2.54     70.12
Figure 2. Percent difference between estimated and HMM-inferred haplotype frequencies in Pool 1 (blue) and Pool 4 (green) across the genome. Pool 4 displayed consistently lower D values than pools 1-3.
Figure 3. Average percent difference observed in haplotype estimates as a result of varying marker density in chromosome arm 2R, Pool 1. SNP density was down-sampled by randomly selecting SNPs from the pooled genomic data from 1K-140K SNPs in increments of 1K. Accuracy of the estimator suffers below 1K SNPs/Mb but reaches a stable low %D after this point.
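The down-sampling described in the Figure 3 caption can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function names, the seed handling, and the choice to draw each subset independently are assumptions, and the 1K-140K range in steps of 1K is taken from the caption.

```python
import random


def downsample_snps(snp_positions, n_keep, seed=None):
    """Randomly keep n_keep SNP positions (hypothetical helper).

    snp_positions is a list of SNP coordinates; the subset is returned
    sorted so downstream window code can assume ordered positions.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(snp_positions, n_keep))


def downsampling_series(snp_positions, start=1000, stop=140000, step=1000, seed=0):
    """Yield (n_keep, subset) pairs for increasing SNP counts.

    Assumption: each subset is drawn independently from the full SNP set,
    and the series stops early if fewer SNPs are available than `stop`.
    """
    upper = min(stop, len(snp_positions))
    for n_keep in range(start, upper + 1, step):
        yield n_keep, downsample_snps(snp_positions, n_keep, seed=seed + n_keep)
```

Each subset would then be passed to the haplotype estimator and its average %D recorded against the resulting SNP density.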
Algorithm
The founder ancestry at any given position in each RIL is determined with a high degree of certainty using the genome sequences of the founders and genotype data for the RILs in a hidden Markov model (HMM) [2]. In this study, HMM inferences are used as expected haplotype frequencies in the different pools.
Table 1. Summary statistics for pools 1-4. Lowest mean D values are observed in pool 4, likely due to greater average coverage.
Question
Setting precedents for optimal configurations for haplotype estimation from pooled samples to minimize cost and maximize quantity and accuracy of results.
What are the optimal algorithm settings for estimating haplotype frequencies from pooled sequence data?
[Figure 3 plot: Average %D vs. SNP Density (SNPs per Mb)]
As the cost of genome sequencing decreases, studies that were previously impossible are becoming more feasible. For population geneticists, however, sequencing every individual in a population is often cost prohibitive. Pooled sequencing is a commonly used, cheaper alternative to individual-level sequencing. However, accurately estimating the haplotype frequencies of a population from pooled sequence data remains a challenge. In order to address this problem, we have developed and refined an algorithm to estimate haplotype frequencies from pooled data. To experimentally validate our method, we used genomic data collected from pooled sets of recombinant inbred lines with a completely known haplotype structure. These lines were derived from a 50-generation controlled cross of 15 homozygous founder lines of Drosophila melanogaster. We validated the predictive accuracy of our haplotype estimator by comparing the haplotype frequency estimates obtained by our method with the known haplotype composition of the pool. We present a study in which the accuracy of the haplotype estimator is tested against variability in raw sequence coverage, SNP density, and the procedure of the algorithm. This algorithm, which can accurately estimate the haplotype frequency of a population from pooled sequence data, has the potential to significantly progress the field of genotype-phenotype mapping, a major goal of modern biology and bioinformatics.
[Figure 2 plot: %D vs. Position (Mb) across chromosome arms X, 2L, 2R, 3L, 3R]
Application
These plots demonstrate varying haplotype frequencies between young and old populations of Drosophila melanogaster in a longevity study [1]. For this region on chromosome 2R there is a significant difference between haplotype frequencies in the two populations. Different colors represent the 8 different haplotypes.
The algorithm takes the allele state of each SNP at each position (e.g., 0 = A, 1 = T) and refines a guess of the haplotype frequencies to minimize the difference between the observed allele counts and the estimated allele counts, where the estimate weights each founder's allele by its haplotype frequency.
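A minimal sketch of this kind of fit is shown below. It is not the poster's implementation: the squared-error objective, the use of allele frequencies rather than raw counts, the exponentiated-gradient update, and every name and default (`estimate_haplotype_freqs`, `n_iter`, `step`) are assumptions chosen for illustration. The update keeps the frequencies non-negative and summing to 1.

```python
import math


def estimate_haplotype_freqs(founder_alleles, observed_freq, n_iter=2000, step=0.5):
    """Estimate founder haplotype frequencies in one window (illustrative).

    founder_alleles: one row per SNP, one 0/1 allele code per founder.
    observed_freq:   observed frequency of allele 1 at each SNP in the pool.
    Assumption: minimize the mean squared difference between observed and
    predicted allele frequencies, where the prediction is the founder
    alleles weighted by the current haplotype frequency guess.
    """
    n_snps = len(founder_alleles)
    n_founders = len(founder_alleles[0])
    h = [1.0 / n_founders] * n_founders  # start from equal frequencies
    for _ in range(n_iter):
        # Residual at each SNP: predicted minus observed allele frequency.
        residuals = [
            sum(a * f for a, f in zip(row, h)) - obs
            for row, obs in zip(founder_alleles, observed_freq)
        ]
        # Gradient of the mean squared error with respect to each frequency.
        grad = [
            2.0 / n_snps * sum(r * row[k] for r, row in zip(residuals, founder_alleles))
            for k in range(n_founders)
        ]
        # Multiplicative update, then renormalize so frequencies sum to 1.
        h = [f * math.exp(-step * g) for f, g in zip(h, grad)]
        total = sum(h)
        h = [f / total for f in h]
    return h
```

With noise-free toy data, for example three founders mixed at 0.5, 0.3, and 0.2, the returned frequencies should approach the true mixture. Real pooled data would add sampling noise that depends on coverage.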
[Figure 4(c) plot: Average %D vs. Window Size (cM)]
Figure 4. The effect of window size on accuracy using (a) SNPs, (b) chromosomal position (Kb), and (c) genetic position (cM). The optimal window size is marked on each plot. Genetic position has the lowest %D, and is therefore the optimal window metric when window size is between 0.8 and 3.5 cM (%D: 3.05-3.13).
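One way to select the SNPs that fall in a window is sketched below. The poster does not say how windows are placed, so the centered window and the function name `snps_in_window` are assumptions. The same function works whether positions are in Kb or cM, as long as the window width uses the same unit.

```python
from bisect import bisect_left, bisect_right


def snps_in_window(positions, center, width):
    """Return indices of SNPs within a window centered on `center`.

    positions: sorted SNP coordinates (Kb or cM; units must match `width`).
    The window is [center - width / 2, center + width / 2], inclusive.
    Assumption: windows are centered on the position being estimated.
    """
    lo = bisect_left(positions, center - width / 2)
    hi = bisect_right(positions, center + width / 2)
    return range(lo, hi)
```

A window defined by SNP count would instead take a fixed number of neighboring indices, so its physical and genetic span would vary with local marker density.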
[Figure 4(a) plot: Average %D vs. Window Size (SNP). Annotations: Optimum = 3.38 %D at 2500 bp; ← 200 SNP window]
[Figure 4(b) plot: Average %D vs. Window Size (Kb). Annotation: Optimum = 3.37 %D at 500 Kb]
[Figure 4(c) annotation: Optimum = 3.05 %D at 2 cM]
(ho)
(hY)
[Figure 1 plots: Frequency vs. Position (Mb) across chromosome arms X, 2L, 2R, 3L, 3R; panels for Pool 1 and Pool 4, shown once for each founder]