Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity
Justin Kennedy
Dissertation Defense for the Degree of Doctorate in Philosophy
Computer Science & Engineering Department, University of Connecticut
Outline
Introduction
Hidden Markov Models of Haplotype Diversity
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
Imputation-based Local Ancestry Inference in Admixed Populations
Single Individual Genotyping from Low-Coverage Sequencing Data
Conclusion
Introduction-Single Nucleotide Polymorphisms
Main form of variation between individual genomes: Single Nucleotide Polymorphisms (SNPs)
High density in the human genome: 1.3x10^7 out of 3x10^9 base pairs
Vast majority bi-allelic: 0/1 encoding (major/minor resp.)
Haplotype: description of SNP alleles on a chromosome
0/1 vector: 0 for major allele, 1 for minor
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcGgtatacacgggTctata …
Genotype Error Detection-SNP Genotypes
Diploid: two haplotypes for each chromosome
One inherited from mother and one from father
Multilocus Genotype: description of alleles on both chromosomes
0/1/2 vector: 0 (2) - both chromosomes contain the major (minor) allele; 1 - the chromosomes contain different alleles
SNP Genotypes are critical to Disease-Gene Mapping
011000110 + 001100010 (two haplotypes per individual)
012100120 (genotype)
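The 0/1/2 genotype encoding can be illustrated with a minimal Python sketch. The function name `genotype_from_haplotypes` is hypothetical; the two haplotype strings are the ones shown on the slide.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string.

    At each locus the genotype is the number of minor (1) alleles
    carried by the two haplotypes.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same loci")
    return "".join(str(int(a) + int(b)) for a, b in zip(h1, h2))


genotype = genotype_from_haplotypes("011000110", "001100010")
print(genotype)
```

Note that the mapping is many-to-one: swapping the alleles of the two haplotypes at any heterozygous locus yields the same genotype, which is why phasing is a separate problem.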
Introduction- Why SNP Genotypes?
SNPs are the genetic marker of choice for genome wide association studies (GWASs)
GWAS: Method for discovering disease associated genes by typing a dense set of markers in large numbers of cases and controls followed by a statistical test of association.
Ongoing GWASs generate a deluge of genotype data
Genetic Association Information Network (GAIN): 6 studies totaling 19,000 individuals typed at 500,000 to 940,000 SNP loci
Wellcome Trust Case-Control Consortium (WTCCC): 7 studies totaling 17,000 individuals typed at 500,000 SNPs
WTCCC2: hundreds of thousands of individuals covering over a million SNPs!
Introduction-Computational Challenges to Disease Gene Mapping
Genotype error detection: Genotyping errors can decrease statistical power and invalidate statistical tests for disease association based on haplotypes.
Local ancestry Inference: Accurate estimates of local ancestry surrounding disease-associated loci are a critical step in admixture mapping.
Accurate SNP Genotyping from new sequencing technologies: Accurate determination of both alleles at variable loci is essential, and is limited by coverage depth due to random nature of shotgun sequencing.
Outline
Introduction
Hidden Markov Models of Haplotype Diversity
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
Imputation-based Local Ancestry Inference in Admixed Populations
Single Individual Genotyping from Low-Coverage Sequencing Data
Conclusion
Haplotype structure in panmictic populations
8
HMM of haplotype diversity
Similar models proposed in [Schwartz 04, Rastas et al. 08, Kennedy et al. 07, Kimmel&Shamir 05, Scheet&Stephens 06,…]
Captures Linkage Disequilibrium (LD)
k = 4 (# founders)
n = 5 (# SNPs)
Graphical model representation
Random variables for each locus i (i=1..n)
F_i = founder haplotype at locus i; values between 1 and k
H_i = observed allele at locus i; values: 0 (major) or 1 (minor)
Model training, based on Baum-Welch algorithm, using:
Reference haplotypes from population panel (e.g. Hapmap), or
Haplotypes from phased genotype using ENT software
Given haplotype h, P(H=h|M) can be computed in O(nk^2) using a forward algorithm.
[Figure: HMM chain F_1, F_2, …, F_n, where each founder state F_i emits the observed allele H_i]
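A minimal sketch of the O(nk^2) forward computation of P(H=h|M) is given below. The parameter layout (`init`, `trans`, `emit1`) and the toy numbers are assumptions made for illustration, not the trained model from the talk.

```python
from itertools import product


def haplotype_probability(h, init, trans, emit1):
    """Forward algorithm for P(h | M) in O(n k^2).

    h      : list of observed alleles (0 or 1), one per locus
    init   : init[f] = P(F_1 = f)
    trans  : trans[i][a][b] = P(F_{i+2} = b | F_{i+1} = a) (0-based locus i)
    emit1  : emit1[i][f] = P(H = 1 | F = f) at locus i
    """
    n, k = len(h), len(init)

    def emit(i, f):
        return emit1[i][f] if h[i] == 1 else 1.0 - emit1[i][f]

    fwd = [init[f] * emit(0, f) for f in range(k)]
    for i in range(1, n):
        fwd = [
            emit(i, b) * sum(fwd[a] * trans[i - 1][a][b] for a in range(k))
            for b in range(k)
        ]
    return sum(fwd)


# Toy model with k = 2 founders and n = 3 SNPs (hypothetical numbers).
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]], [[0.9, 0.1], [0.2, 0.8]]]
emit1 = [[0.1, 0.8], [0.3, 0.7], [0.05, 0.9]]

# Probabilities of all 2^n haplotypes should sum to 1.
total = sum(
    haplotype_probability(list(h), init, trans, emit1)
    for h in product((0, 1), repeat=3)
)
print(total)
```

Each locus costs k^2 multiplications (every pair of consecutive founder states), which gives the O(nk^2) bound.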
Factorial HMM for genotype data
[Figure: two HMM chains, F_1 … F_n emitting H_1 … H_n and F'_1 … F'_n emitting H'_1 … H'_n; each genotype G_i is determined by H_i and H'_i]
Random variable for each locus i (i=1..n)
G_i = genotype at locus i; values: 0/1/2 (major hom./het./minor hom.)
Given multilocus genotype g, P(g|M) can be computed in O(nk^4) using a forward algorithm.
HMM Based Genotype Imputation
Probability of observing genotype x at locus i given the known multilocus genotype with missing data at i:
$P(G_i = x \mid [G_1,\ldots,G_{i-1},G_{i+1},\ldots,G_n], M) \propto P([G_1,\ldots,G_{i-1},x,G_{i+1},\ldots,G_n] \mid M)$
G_i is imputed as:
$\arg\max_{x \in \{0,1,2\}} P([G_1,\ldots,G_{i-1},x,G_{i+1},\ldots,G_n] \mid M)$
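The imputation rule can be sketched as follows, assuming a callable `likelihood` that returns P(g|M) for a complete multilocus genotype. Both `impute_genotype` and `toy_likelihood` are hypothetical names; the toy likelihood is a stand-in, not the HMM of the talk.

```python
def impute_genotype(g, i, likelihood):
    """Impute the genotype at locus i as the value x in {0, 1, 2}
    that maximizes the likelihood of the completed genotype."""
    scores = {x: likelihood(g[:i] + [x] + g[i + 1:]) for x in (0, 1, 2)}
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    posterior = {x: s / total for x, s in scores.items()}
    return best, posterior


def toy_likelihood(g):
    """Stand-in for P(g | M): rewards equal neighbouring genotypes."""
    p = 1.0
    for a, b in zip(g, g[1:]):
        p *= 0.8 if a == b else 0.1
    return p


best, posterior = impute_genotype([2, None, 2], 1, toy_likelihood)
print(best, posterior)
```

Normalizing the three scores gives the posterior of each candidate value, which can also serve as a confidence for the imputed call.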
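Forward-backward computation
[Figure: factorial HMM at locus i, with founder states F_i, F'_i, haplotype alleles H_i, H'_i and genotype G_i]
$P(g \mid M) = \sum_{F_i=1}^{k} \sum_{F'_i=1}^{k} f^i_{F_i,F'_i}\, E^i_{F_i,F'_i}(G_i)\, b^i_{F_i,F'_i}$
with f the forward probabilities, b the backward probabilities, and E the genotype emission probabilities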
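Runtime
Direct recurrences for computing forward probabilities take O(nk^4):
$f^1_{F_1,F'_1} = P(F_1)\,P(F'_1)$
$f^i_{F_i,F'_i} = \sum_{F_{i-1}=1}^{k} \sum_{F'_{i-1}=1}^{k} f^{i-1}_{F_{i-1},F'_{i-1}}\, E^{i-1}_{F_{i-1},F'_{i-1}}(G_{i-1})\, P(F_i \mid F_{i-1})\, P(F'_i \mid F'_{i-1})$
Runtime reduced to O(nk^3) by reusing common terms:
$f^i_{F_i,F'_i} = \sum_{F'_{i-1}=1}^{k} P(F'_i \mid F'_{i-1})\, x^i_{F_i,F'_{i-1}}$
where
$x^i_{F_i,F'_{i-1}} = \sum_{F_{i-1}=1}^{k} f^{i-1}_{F_{i-1},F'_{i-1}}\, E^{i-1}_{F_{i-1},F'_{i-1}}(G_{i-1})\, P(F_i \mid F_{i-1})$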
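The reuse of common terms can be sketched in Python: advancing one founder chain at a time turns the O(k^4) double sum per locus into two O(k^3) passes. The parameter layout, the toy model, and the brute-force checker are assumptions for illustration; both chains are assumed to share one HMM.

```python
from itertools import product


def emission(g, p_a, p_b):
    """P(genotype g | two founders); p_a, p_b = P(allele 1 | founder)."""
    if g == 0:
        return (1 - p_a) * (1 - p_b)
    if g == 2:
        return p_a * p_b
    return p_a * (1 - p_b) + (1 - p_a) * p_b


def genotype_probability(g, init, trans, emit1):
    """P(g | M) for the factorial HMM in O(n k^3)."""
    n, k = len(g), len(init)
    f = [[init[a] * init[b] for b in range(k)] for a in range(k)]
    for i in range(1, n):
        t = trans[i - 1]
        # Weight forward values by the emission at the previous locus.
        w = [[f[a][b] * emission(g[i - 1], emit1[i - 1][a], emit1[i - 1][b])
              for b in range(k)] for a in range(k)]
        # Advance the first chain only: x[a2][b], O(k^3).
        x = [[sum(w[a][b] * t[a][a2] for a in range(k)) for b in range(k)]
             for a2 in range(k)]
        # Advance the second chain: f[a2][b2], O(k^3).
        f = [[sum(x[a2][b] * t[b][b2] for b in range(k)) for b2 in range(k)]
             for a2 in range(k)]
    return sum(f[a][b] * emission(g[-1], emit1[-1][a], emit1[-1][b])
               for a in range(k) for b in range(k))


def haplotype_probability(h, init, trans, emit1):
    """Brute-force P(h | M) over all founder paths (checking only)."""
    total = 0.0
    for path in product(range(len(init)), repeat=len(h)):
        p = init[path[0]]
        for i, s in enumerate(path):
            if i > 0:
                p *= trans[i - 1][path[i - 1]][s]
            p *= emit1[i][s] if h[i] == 1 else 1 - emit1[i][s]
        total += p
    return total


def genotype_probability_bruteforce(g, init, trans, emit1):
    """Sum P(h1) P(h2) over all ordered haplotype pairs explaining g."""
    total = 0.0
    for h1 in product((0, 1), repeat=len(g)):
        for h2 in product((0, 1), repeat=len(g)):
            if all(a + b == x for a, b, x in zip(h1, h2, g)):
                total += (haplotype_probability(h1, init, trans, emit1)
                          * haplotype_probability(h2, init, trans, emit1))
    return total


# Toy model, k = 2 founders and n = 3 SNPs (hypothetical numbers).
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.4, 0.6]]]
emit1 = [[0.1, 0.8], [0.3, 0.7], [0.05, 0.9]]
max_diff = max(
    abs(genotype_probability(list(g), init, trans, emit1)
        - genotype_probability_bruteforce(g, init, trans, emit1))
    for g in product((0, 1, 2), repeat=3)
)
print(max_diff)
```

The intermediate table `x` plays the role of the reused common term: it is computed once per locus and shared by all values of the second chain's next state.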
Speed-up: PopTree Trie
18
Outline
Introduction
Hidden Markov Models of Haplotype Diversity
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
  Motivation
  Likelihood Sensitivity Approach to Error Detection
  Hidden Markov Model of Haplotype Diversity
  Efficiently Computable Likelihood functions
  Experimental Results
Imputation-based Local Ancestry Inference in Admixed Populations
Single Individual Genotyping from Low-Coverage Sequencing Data
Conclusion
Genotype Errors- Motivation
A real problem despite advances in genotyping technology
[Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP genotypes typed multiple times
1% errors decrease power by 10-50% for linkage, and by 5-20% for association
Error types
Easily detectable errors
Systematic errors (e.g., assay failure) detected by HWE test [Hosking et al. 2004]
For pedigree data some errors detected as Mendelian Inconsistencies (MIs)
E.g. Only ~30% detectable as MIs for trios [Gordon et al. 1999]
Undetected errors
Methods for handling undetected errors:
Improved genotype calling algorithms
Improved modeling in analysis methods
Separate error detection step
Detected errors can be retyped, imputed, or ignored in downstream analysis
Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
0 1 2 1 0 2
0 2 2 1 0 2
0 2 2 1 0 2
Mother Father
Child
Likelihood of best phasing for original trio T:
0 1 1 1 0 0 H1
0 1 0 1 0 1 H2
0 0 0 1 0 1 H3
0 1 1 1 0 0 H4
$L(T) = \max\, p(H_1)\,p(H_2)\,p(H_3)\,p(H_4)$
Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
0 1 2 1 0 2
0 2 2 1 0 2
0 2 2 1 0 2
Mother Father
Child
Likelihood of best phasing for original trio T:
$L(T) = \max\, p(H_1)\,p(H_2)\,p(H_3)\,p(H_4)$
0 1 0 1 0 1 H’1
0 1 1 1 0 0 H’2
0 0 0 1 0 0 H’3
0 1 1 1 0 1 H’4
Likelihood of best phasing for modified trio T’:
$L(T') = \max\, p(H'_1)\,p(H'_2)\,p(H'_3)\,p(H'_4)$
Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
[Figure: the original trio and the modified trio (Mother, Father, Child), in which the tested genotype, marked "?", is changed]
Large change in likelihood suggests likely error
Flag genotype as an error if L(T’)/L(T) > R, where R is the detection threshold (e.g., R=10^4)
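The likelihood sensitivity test can be sketched as a loop over single-genotype modifications. The `likelihood` argument is a hypothetical callable standing in for L(T); the toy likelihood below treats genotypes as independent and is only meant to exercise the loop, not to model trios.

```python
def flag_errors(trio, likelihood, R=1e4):
    """Return (member, locus) pairs whose genotype is flagged as an error.

    trio       : three genotype lists (mother, father, child), values 0/1/2
    likelihood : callable returning L(T) for a trio (assumed, not FAMHAP)
    R          : detection threshold on L(T') / L(T)
    """
    base = likelihood(trio)
    flagged = []
    for m, genotypes in enumerate(trio):
        for i, g in enumerate(genotypes):
            best = 0.0
            for x in (0, 1, 2):
                if x == g:
                    continue
                modified = [list(row) for row in trio]
                modified[m][i] = x
                best = max(best, likelihood(modified))
            # Equivalent to best / base > R, without dividing by zero.
            if best > R * base:
                flagged.append((m, i))
    return flagged


def toy_likelihood(trio):
    """Stand-in likelihood: independent genotypes with a very rare 2."""
    freq = {0: 0.9, 1: 0.09999, 2: 0.00001}
    p = 1.0
    for row in trio:
        for g in row:
            p *= freq[g]
    return p


trio = [[0, 1, 0], [0, 0, 2], [0, 1, 0]]
flagged = flag_errors(trio, toy_likelihood)
print(flagged)
```

Only the father's genotype at the last locus is flagged here, because changing it raises the toy likelihood by a factor of about 90,000, well above R.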
Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
Implementation in FAMHAP Software
Window-based algorithm
For each window including the SNP under test, generate list of H most frequent haplotypes (default H=50)
Find most likely trio phasings by pruned search over the H^4 quadruples of frequent haplotypes
Flag genotype as an error if L(T’)/L(T) > R for at least one window
Mother …201012 1 02210...
Father …201202 2 10211...
Child  …000120 2 21021...
Genotype Error Detection- Limitations of FAMHAP
Unbounded list of haplotypes (H=4n) is hard to compute
Truncating H may lead to sub-optimal phasings and inaccurate L(T) values
False positives caused by nearby errors (due to the use of multiple short windows)
Our approach:
HMM of haplotype frequencies: all haplotypes represented + no need for short windows
Alternate likelihood functions: scalable runtime
Trio-Based HMM of haplotype diversity
[Figure: trio-based HMM with founder chains F_1 … F_n and F'_1 … F'_n emitting haplotypes H_1 … H_n and H'_1 … H'_n for each parent, and genotype nodes GM_1 … GM_n (mother), GF_1 … GF_n (father) and GC_1 … GC_n (child)]
Genotype Error Detection- Alternate Likelihood Functions
Viterbi probability (ViterbiProb): Maximum prob. of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio.
Probability of Viterbi Haplotypes (ViterbiHaps): Obtain the path of the 4 Viterbi haplotypes, then take the product of these individual haplotype probabilities, each computed with the forward algorithm.
Total Trio Probability (TotalProb): Total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths.
27
Genotype Error Detection- Speed Ups from reuse of common terms
Straight-forward approach run time:
For a fixed trio, ViterbiProb/TotalProb paths can be found using a 4-path version of Viterbi’s/Forward algorithm in time $O(nk^8)$
For ViterbiHaps, additional traceback to compute probabilities: $O(nk^8 + n^2 k)$
K^3 speed-up by reuse of common terms: $O(nk^5)$ per trio
Likelihoods of all 3n modified trios computed using forward-backward algorithm
ViterbiProb/TotalProb for m trios: $O(mnk^5)$
ViterbiHaps: $O(m(nk^5 + n^2 k))$
Genotype Error Detection- Comparison of Likelihood Functions
[Figure: ROC curves, Sensitivity (0 to 1) vs. FP rate (-0.005 to 0.015), for VitHaps-P, VitProb-P, TotalProb-P, VitHaps-C, VitProb-C, TotalProb-C]
Sensitivity = TP/(TP+FN)
False Positive rate = 1 - TN/(FP+TN)
35 SNPs, 551 trios [Becker 06], 1% err. rate
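For reference, the two ROC quantities can be computed from confusion-matrix counts as in the short sketch below, using the standard definitions sensitivity = TP/(TP+FN) and FP rate = FP/(FP+TN) = 1 - TN/(FP+TN). The counts are made up for illustration.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true errors that are flagged: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of correct genotypes that are flagged: FP / (FP + TN)."""
    return fp / (fp + tn)


# Hypothetical counts: 100 true errors, 1000 correct genotypes.
sens = sensitivity(tp=90, fn=10)
fpr = false_positive_rate(fp=5, tn=995)
print(sens, fpr)
```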
Genotype Error Detection-“Combined” Detection Method
Compute 4 likelihood ratios
Trio
Mother-child duo
Father-child duo
Child (unrelated)
Flag as error if all ratios are above detection threshold
Genotype Error Detection- Comparison with FAMHAP
[Figure: two ROC panels, Children and Parents, Sensitivity (0.5 to 1) vs. FP rate (0 to 0.015), comparing TotalProb-TRIO, TotalProb-COMBINED and FAMHAP, with the #FP=#FN line]
35 SNPs, 551 trios [Becker 06], 1% err. rate
Outline
Introduction
Hidden Markov Models of Haplotype Diversity
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
Imputation-based Local Ancestry Inference in Admixed Populations
  Motivation
  Factorial HMM of genotype data
  Algorithms for genotype imputation and ancestry inference
  Experimental results
Single Individual Genotyping from Low-Coverage Sequencing Data
Conclusion
Introduction- Population admixture
http://www.garlandscience.co.uk/textbooks/0815341857.asp?type=resources
Introduction- Motivation: Admixture mapping
Patterson et al, AJHG 74:979-1000, 2004
Introduction- Local ancestry inference problem
Given:
Reference haplotypes for all ancestral populations to be studied
Whole-genome SNP genotype data for extant individual
Find:
Allele ancestries at each SNP locus
SNP genotypes:
rs11095710 T T
rs11117179 C T
rs11800791 G G
rs11578310 G G
rs1187611 G G
rs11804808 C C
rs17471518 A G
...
Inferred local ancestry:
rs11095710 P1 P1
rs11117179 P1 P1
rs11800791 P1 P1
rs11578310 P1 P2
rs1187611 P1 P2
rs11804808 P1 P2
rs17471518 P1 P2
...
[Figure: three panels of reference haplotypes, shown as 0/1 strings with "?" for missing alleles]
Introduction- Previous work
Two main classes of methods for SNP Ancestry Inference
HMM-based (exploit LD): SABER [Tang et al 06], SWITCH [Sankararaman et al 08a], HAPAA [Sundquist et al. 08], …
Window-based (unlinked SNP Data): LAMP [Sankararaman et al 08b], WINPOP [Pasaniuc et al. 09]
Limitations
Poor accuracy when ancestral populations are closely related (e.g. Japanese and Chinese)
Methods based on unlinked SNPs outperform methods that model LD!
Factorial HMM for genotype data in a window with known local ancestry
[Figure: factorial HMM $M_{kl}$ with chains F_1 … F_n and F'_1 … F'_n emitting H_1 … H_n and H'_1 … H'_n, which together determine the genotypes G_1 … G_n]
Imputation-based ancestry inference
Fixed-window version: pick ancestry that maximizes the average posterior probability of the SNP genotypes within a fixed-size window centered at the locus
Observations:
The local ancestry of a SNP locus is typically shared with neighboring loci.
Small window sizes may not provide enough information
Large window sizes may violate local ancestry property for neighboring loci
Window size effect
[Figure: window size effect for models $M_{11}$, $M_{12}$, $M_{22}$; N=2,000, g=7, =0.2, n=38,864, r=10^-8]
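A minimal sketch of the fixed-window rule follows. It assumes the per-locus posterior probabilities of the observed genotypes have already been computed under each ancestry model; the dictionary layout, the model labels, and the numbers are hypothetical.

```python
def infer_ancestry_fixed_window(post, w):
    """Pick, at each locus, the ancestry model with the highest average
    posterior probability over a window of half-size w.

    post : dict mapping a model label (e.g. "11", "12", "22") to a list
           whose entry j is the posterior probability of the observed
           genotype at locus j under that model
    w    : window half-size (the window is truncated at both ends)
    """
    models = list(post)
    n = len(post[models[0]])
    ancestry = []
    for i in range(n):
        lo, hi = max(0, i - w), min(n - 1, i + w)
        averages = {m: sum(post[m][lo:hi + 1]) / (hi - lo + 1) for m in models}
        ancestry.append(max(models, key=averages.get))
    return ancestry


# Toy posteriors for 6 loci with an ancestry switch in the middle.
post = {
    "11": [0.9, 0.9, 0.8, 0.2, 0.1, 0.1],
    "12": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    "22": [0.1, 0.1, 0.2, 0.8, 0.9, 0.9],
}
ancestry = infer_ancestry_fixed_window(post, w=1)
print(ancestry)
```

Averaging over the window smooths out single-locus noise, at the price of blurring true ancestry switches when the window is large.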
Imputation-based ancestry inference
Multi-window version: Weighted voting over window sizes between 200-3000, with window weights proportional to average posterior probabilities
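One possible reading of the weighted voting scheme is sketched below: for every window size the winning model casts a vote weighted by its average posterior probability, and the model with the largest total wins. This interpretation, the function name, and the toy data are assumptions; the slide does not spell out the exact voting rule.

```python
def infer_ancestry_multi_window(post, half_sizes):
    """Weighted voting over several window half-sizes.

    post       : dict model label -> per-locus posterior probabilities of
                 the observed genotypes under that model
    half_sizes : iterable of window half-sizes to combine
    """
    models = list(post)
    n = len(post[models[0]])
    ancestry = []
    for i in range(n):
        votes = {m: 0.0 for m in models}
        for w in half_sizes:
            lo, hi = max(0, i - w), min(n - 1, i + w)
            averages = {m: sum(post[m][lo:hi + 1]) / (hi - lo + 1)
                        for m in models}
            winner = max(models, key=averages.get)
            # The vote weight is the winner's average posterior.
            votes[winner] += averages[winner]
        ancestry.append(max(models, key=votes.get))
    return ancestry


post = {
    "11": [0.9, 0.9, 0.8, 0.2, 0.1, 0.1],
    "12": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    "22": [0.1, 0.1, 0.2, 0.8, 0.9, 0.9],
}
ancestry = infer_ancestry_multi_window(post, half_sizes=[1, 2])
print(ancestry)
```

Combining window sizes avoids committing to a single size, which is useful because the best size depends on the local density of ancestry switches.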
Comparison with other methods
% of correctly recovered SNP ancestries
[Figure: N=2,000, g=7, =0.2, n=38,864, r=10^-8]
Untyped SNP imputation error rate in admixed individuals
[Figure: N=2,000, g=7, =0.5, n=38,864, r=10^-8]
Genotype Imputation- Accuracy with number founders/runtime
5835 SNPs, 2502 unrelated (CEU), [IMAGE], 9% imputed (535 SNPs)
[Figure: Imputation Error Rate (85% to 99%) and CPU seconds (0 to 4,000) vs. # Founders (3 to 15)]
Number of founders effect on Ancestry inference
[Figure: CEU-JPT, N=2,000, g=7, =0.2, n=38,864, r=10^-8]
Outline
Introduction
Hidden Markov Models of Haplotype Diversity
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
Imputation-based Local Ancestry Inference in Admixed Populations
Single Individual Genotyping from Low-Coverage Sequencing Data
  Motivation
  Single SNP Calling Algorithms
  HF-HMM Overview
  Multilocus HMM Calling Algorithm
  Experimental Results
Conclusion
Low Coverage Genotyping-Next Generation Sequencing (NGS)
NGS delivers sequencing-read throughput several orders of magnitude higher than older technologies (e.g. Sanger sequencing)
Roche/454 FLX Titanium: ~1M reads, 400bp avg., 400-600Mb / run (10h)
ABI SOLiD 3 plus: ~500M reads/pairs, 35-50bp, 25-60Gb / run (3.5-14 days)
Illumina Genome Analyzer IIx: ~100-300M reads/pairs, 35-100bp, 4.5-33 Gb / run (2-10 days)
Low Coverage Genotyping-NGS Applications and Challenges
NGS is enabling many applications, including personal genomics
~$100 million for the Sanger-sequenced Venter genome [Levy et al 07]
~$1 million for sequencing James Watson genome [Wheeler et al 08] using 454 technology.
~$50K human sequencing now available
Thousands more individual genomes to be sequenced as part of 1000 Genomes Project
Challenges:
Sequencing requires accurate determination of genetic variation (e.g. SNPs)
Accuracy is limited by coverage depth due to random nature of shotgun sequencing
For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), only 75-80% accuracy achieved for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08].
[Wheeler et al 08] use hypothesis testing based on binomial distribution
Low Coverage Genotyping-Do Heuristic Inputs Help?
[Wendl&Wilson 08] predict that 21x coverage is required for sequencing of samples based on the assumption that “neglects any heuristic inputs”
We propose methods incorporating two additional sources of information:
Quality scores reflecting uncertainty in sequencing data
Linkage disequilibrium (LD) information and allele frequencies extracted from reference panels such as Hapmap
48
Low Coverage Genotyping-Pipeline for Single Genotype Calling
49
Single SNP Genotyping- Basic Notations
Biallelic SNPs: 0 = major allele, 1 = minor allele (reads with non-reference alleles are discarded)
SNP genotypes: 0/2 = homozygous major/minor, 1=heterozygous
Read set r_i describes the mapped reads for each SNP i
[Figure: mapped reads with allele 0 and with allele 1, sequencing errors, and the inferred genotypes 012100120]
Single SNP Genotype Calling
Applying Bayes’ formula:
$P(G_i \mid \mathbf{r}_i) = \dfrac{P(G_i)\,P(\mathbf{r}_i \mid G_i)}{\sum_{x \in \{0,1,2\}} P(G_i = x)\,P(\mathbf{r}_i \mid G_i = x)}$
Where:
$P(\mathbf{r}_i \mid G_i)$ is the conditional probability for the read set at locus i
$P(G_i)$ are genotype frequencies at locus i inferred from a representative panel
Low Coverage Genotyping-Pipeline for Multilocus Genotyping
Reference haplotypes
52
Multilocus Genotyping-HF-HMM
[Figure: HF-HMM with chains F_1 … F_n and F'_1 … F'_n emitting H_1 … H_n and H'_1 … H'_n, genotypes G_1 … G_n, and reads R_{i,1} … R_{i,c_i} observed at each locus i]
HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al 08, Kennedy et al 08]
Multilocus Genotyping-HF-HMM Training
Training HMM based on Baum-Welch algorithm from haplotypes inferred from populations of origin for mother/father
Use haplotype reference panel (e.g. HAPMAP) for training Haplotypes
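Conditional probabilities for read sets are given by the formulas derived for the single SNP case:
$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\ r(i)=0} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=1} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\ r(i)=1} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\ r(i)=0} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 1) = \left(\tfrac{1}{2}\right)^{c_i}$
where $\varepsilon_{r(i)}$ is the error probability of read r at locus i (from its quality score) and $c_i$ is the number of reads covering locus i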
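A minimal sketch of single SNP calling with Bayes' formula follows. It assumes each read contributes an observed allele (0 or 1) and an error probability derived from its quality score, uses a homozygous likelihood that multiplies (1 - error) for matching reads and error for mismatching reads, and a heterozygous likelihood of (1/2)^c for c reads. The function name, prior, and read data are hypothetical.

```python
def genotype_posterior(alleles, errors, prior):
    """Posterior P(G | reads) for G in {0, 1, 2} at a single SNP.

    alleles : observed allele (0 or 1) of each read covering the SNP
    errors  : per-read error probabilities, each strictly between 0 and 1
    prior   : [P(G=0), P(G=1), P(G=2)], e.g. panel genotype frequencies
    """
    like0 = like2 = 1.0
    for allele, eps in zip(alleles, errors):
        like0 *= (1 - eps) if allele == 0 else eps
        like2 *= (1 - eps) if allele == 1 else eps
    like1 = 0.5 ** len(alleles)
    joint = [prior[0] * like0, prior[1] * like1, prior[2] * like2]
    norm = sum(joint)
    return [p / norm for p in joint]


prior = [0.25, 0.5, 0.25]
# Two reads of each allele: a heterozygous call is expected.
het = genotype_posterior([0, 0, 1, 1], [0.01] * 4, prior)
# Five reads all showing allele 0: homozygous major is expected.
hom = genotype_posterior([0] * 5, [0.01] * 5, prior)
print(het, hom)
```

With few reads the prior matters: five identical reads still leave a few percent of posterior mass on the heterozygous genotype, which is the low-coverage effect the talk addresses.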
Multilocus Genotyping Problem
GIVEN:
• Shotgun read sets r=(r1, r2, … , rn)
• Quality scores
• Trained HMMs representing LD in populations of origin for mother/father
FIND:
• Multilocus genotype (G*1,G*2,…,G*n) with maximum posterior probability, i.e., (G*1,G*2,…,G*n) = argmax_{G1..Gn} P(G1…Gn | r)
Remark: max_g P((G*1,G*2,…,G*n) | r) is hard to approximate within $O(n^{1-\varepsilon})$ unless ZPP=NP, and thus the multilocus genotyping problem is NP-hard
Multilocus Genotyping-HMM-Posterior Decoding Algorithm
1. For each i = 1..n, compute $G^*_i = \arg\max_{G_i} P(G_i \mid \mathbf{r}, M)$
2. Return $(G^*_1, \ldots, G^*_n)$
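Forward-Backward Computation of Posterior Probabilities
[Figure: HF-HMM at locus i, with founder states F_i, F'_i, haplotype alleles H_i, H'_i, genotype G_i and reads R_{1,1} … R_{1,c_1}, …, R_{i,1} … R_{i,c_i}, …, R_{n,1} … R_{n,c_n}]
$P(G_i \mid \mathbf{r}, M) \propto \sum_{F_i=1}^{k} \sum_{F'_i=1}^{k} f^i_{F_i,F'_i}\, E^i_{F_i,F'_i}(G_i)\, b^i_{F_i,F'_i}$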
Multilocus Genotyping-Runtime
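Direct implementation gives O(m+nk^4) time:
m = number of reads
n = number of SNPs
k = number of founder haplotypes in HMMs
Runtime reduced to O(m+nk^3) by reusing common terms:
$f^i_{F_i,F'_i} = \sum_{F'_{i-1}=1}^{k} P(F'_i \mid F'_{i-1})\, x^i_{F_i,F'_{i-1}}$
where
$x^i_{F_i,F'_{i-1}} = \sum_{F_{i-1}=1}^{k} f^{i-1}_{F_{i-1},F'_{i-1}}\, E^{i-1}_{F_{i-1},F'_{i-1}}(G_{i-1})\, P(F_i \mid F_{i-1})$
$E^i_{F_i,F'_i}(G_i) = \sum_{H_i,H'_i \in \{0,1\}} P(H_i \mid F_i)\, P(H'_i \mid F'_i)\, P(R_i \mid G_i = H_i + H'_i)$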
Low Coverage Genotyping-Experimental Results- Setup
Subset of James Watson’s 454 reads
74.4M of 106.5M reads
265 bp/read
avg coverage: 5.64X
Quality scores included
Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]
Estimated mapping error rates: FP rate: 0.37%, FN rate: 21.16%
Haplotype reference panel used to train HMM generated from Hapmap CEU genotypes (release 23a)
Accuracy Comparison (Heterozygous Genotypes)
[Figure: % correctly called (80 to 100) vs. % uncalled (0 to 100) for 1SNP-Posterior, Binomial, HMM-Posterior]
Accuracy Comparison (All Genotypes)
[Figure: % correctly called (93 to 100) vs. % uncalled (0 to 100) for 1SNP-Posterior, Binomial, HMM-Posterior]
Accuracy at Varying Coverages (All Genotypes)
[Figure: % correctly called (70 to 100) vs. % uncalled (0 to 100) for Posterior and Binomial at coverage fractions 1/16, 1/8, 1/4, 1/2, 1/1]
Outline
Introduction
Hidden Markov Models of Haplotype Diversity
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
Imputation-based Local Ancestry Inference in Admixed Populations
Single Individual Genotyping from Low-Coverage Sequencing Data
Conclusion
Conclusion
Genotype Error Detection
Contributions:
Proposed efficient methods for error detection in trio genotype data based on an HMM of haplotype diversity
Can exploit available pedigree info
Yield improved detection accuracy compared to FAMHAP
Runtime grows linearly in #SNPs and #individuals
Papers:
J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. Journal of Computational Biology (to appear).
J. Kennedy, I.I. Mandoiu and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. Proc. WABI 2007, R. Giancarlo and S. Hannenhalli (eds.), LNBI 4645:73-84, 2007
Software: GEDI (Genotype Error Detection and Imputation):
J. Kennedy, I.I. Mandoiu and B. Pasaniuc. GEDI: Scalable Algorithms for Genotype Error Detection and Imputation. ARXIV Report, 2009
Best poster award: J. Kennedy, I.I. Mandoiu and B. Pasaniuc. Genotype Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity. ISBRA 2008.
Conclusion
Imputation-based local ancestry inference in admixed populations
Contributions:
Imputation-based local ancestry inference achieves significant improvement over previous methods for admixtures between close ancestral populations
Future work:
Evaluating accuracy under more realistic admixture scenarios (multiple ancestral populations/gene flow/drift in ancestral populations)
Extension to pedigree data
Extensions to sequencing data
Exploiting inferred local ancestry for phasing of admixed individuals
Inference of ancestral haplotypes from extant admixed populations
Papers:
B. Pasaniuc, J. Kennedy and I.I. Mandoiu. Imputation-based local ancestry inference in admixed populations. Proc. 5th International Symposium on Bioinformatics Research and Applications/2nd Workshop on Computational Issues in Genetic Epidemiology, pp. 221-233, 2009
Software: GEDI-ADMX
Unix and Windows version (geneAdmixViewer on Windows)
Conclusion
Genotyping from low coverage sequencing reads
Contributions:
Exploiting “heuristic inputs” such as quality scores and population allele frequency and LD information yields significant improvements in genotype calling accuracy from low-coverage sequencing data
LD information extracted from a reference panel gives highest benefit
Relatively small gain from incorporating quality scores may be due in part to the poor calibration of 454 quality scores [Brockman et al 08, Quinlan et al 08]
Although our evaluation is on 454 reads, the methods are well-suited for short read technologies
Future Work: Population-based Genotyping from Low-Coverage Sequencing Data
Extending the single individual genotyping methods to population sequencing data (removing the need for reference panels)
Use the same HF-HMM as before, but train the model with the EM algorithm on the population-level data provided.
Papers:
J. Duitama, S. Dinakar, Y. Hernandez, J. Kennedy, I. Mandoiu, and Y. Wu. Single individual genotyping from low-coverage sequencing data (In Preparation)
Presentation: J. Kennedy. Linkage Disequilibrium Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads. DIMACS workshop on Computational Issues in Genetic Epidemiology.
Software: Gene-Seq
Questions?
71
Acknowledgments
This work was supported in part by NSF (awards IIS-0546457, DBI-0543365, and CCF-0755373) and by the University of Connecticut Research Foundation
72
Click again for helper slides
73
HAPMAP: The International HAPMAP Project is an organization
whose goal is to develop a haplotype map of the human genome (the HapMap), which will describe the common patterns of human genetic variation.
HAPMAP is expected to be a key resource for researchers to find genetic variants affecting health, disease and responses to drugs and environmental factors. The information produced by the project is made freely available to researchers around the world.
The International HapMap Project is a collaboration among researchers at academic centers, non-profit biomedical research groups and private companies in Canada, China, Japan, Nigeria, the United Kingdom, and the United States.
74
Baum-Welch overview
The algorithm has two steps:
Calculating the forward and backward probabilities for each HMM state;
Determining the frequency of the transition-emission pair values and dividing it by the probability of the entire sequence. This amounts to calculating the expected count of the particular transition-emission pair. Each time a particular transition is found, the value of the quotient of the transition divided by the probability of the entire sequence goes up, and this value can then be made the new value of the transition.
75
GEDI Speed-up: PopTree Trie
Due to limited genotype variation across individuals of the same population, additional re-use of forward and backward probability computations corresponding to genotype prefixes (respectively suffixes) shared by multiple genotypes is possible.
GEDI builds PopTree, which is a trie (prefix tree), from the given multilocus genotypes and then computes probabilities by performing a preorder traversal of the trie.
Specifically, the PopTree data structure for unrelated individuals in a population consists of:
Up to n levels,
Each node has up to 3 child edges, one for each possible genotype value (0, 1, 2).
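The prefix sharing can be illustrated with a small trie built from genotype strings. This is a sketch of the data structure only, using nested dictionaries; it is not GEDI's implementation, and the function names and genotypes are hypothetical.

```python
def build_poptree(genotypes):
    """Build a trie (nested dicts) from multilocus genotype strings.

    Each level corresponds to one locus and each node has at most three
    children, keyed by the genotype values '0', '1' and '2'.
    """
    root = {}
    for g in genotypes:
        node = root
        for value in g:
            node = node.setdefault(value, {})
    return root


def count_nodes(node):
    """Number of trie nodes below `node`, i.e. the number of distinct
    genotype prefixes for which forward values must be computed."""
    return sum(1 + count_nodes(child) for child in node.values())


genotypes = ["0120", "0121", "0200"]
tree = build_poptree(genotypes)
shared = count_nodes(tree)
unshared = sum(len(g) for g in genotypes)
print(shared, unshared)
```

In a preorder traversal, the forward probabilities computed at a node are reused by every genotype passing through it, so the work is proportional to the number of trie nodes rather than to the total genotype length.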
GEDI Speed-up: PopTree Speed-up
[Figure: Time (Seconds) (0 to 7,000) vs. # Flanking (3 to 100) for 7 Pop-Tree, 13 Pop-Tree, 7 Slow, 13 Slow]
Genotype Imputation- Accuracy with varying parameters
5835 SNPs, 2502 unrelated, [IMAGE], 9% imputed (535 SNPs)
[Figure: % Error Rate (5% to 17%) vs. # Training Haplotypes (30 to 520) and #Founders (3 to 15)]
Genotype Error Detection- Experimental Results (Setup)
Real dataset [Becker et al. 2006]
35 SNP loci covering a region of 91kb
551 trios
Synthetic datasets
35 SNPs, 551 trios
Preserved missing data pattern of real dataset
Haplotypes assigned to trios based on frequencies inferred from real dataset
1% error rate using random allele insertion model
Genotype Error Detection- Effect of Distance between SNPs
Error Detection Accuracy on Unrelated Genotype Data
[Figure: ROC curves, Sensitivity (0 to 1) vs. FP rate (0 to 0.015), for Len=10Kb (U), Len=100Kb (U), Len=1Mb (U), Len=10Mb (U), with the #FP=#FN line]
551 unrelated individuals
Recombination & mutation rates of 10^-8 per generation per bp
35 SNPs within a region of 10kb-10Mb
Genotype Error Detection- TrioProb-Combined Results on Real Dataset
[Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3
26 SNP genotypes in 23 trios were identified as true errors
41*3-26=97 resequenced SNP genotypes agree with original calls (or are unknown)

           Total Signals     True Positives    False Positives   Unknown
FP Rate    1%    .5%   .1%   1%    .5%   .1%   1%    .5%   .1%   1%    .5%   .1%
Parents    218   127   69    9     9     8     1     0     0     208   118   61
Children   104   74    24    11    11    11    3     3     2     90    60    11
Total      322   201   93    20    20    19    4     3     2     298   178   72
Genotype Error Detection- Error Model Comparison
[Figure: ROC curves, Sensitivity (0 to 1) vs. FP rate (0 to 0.015), for the error models random allele, random geno, hetero-to-homo and homo-to-hetero, each for (P) and (C)]
Genotype Error Detection- Distribution of LLR for Total Trio Prob.
[Figure: two histograms, Parents-TRIOS and Children-TRIOS, of the log-likelihood ratio (0 to 5.94) with counts on a log scale (1 to 1,000,000), for NO_ERR and ERR genotypes; annotation: Same-locus errors in parents]
35 SNPs, 551 trios [Becker 06], 1% err. rate
Genotype Error Detection-LLR for Combined Method
[Figure: two histograms, Parents-COMBINED and Children-COMBINED, of the log-likelihood ratio (0 to about 5.9) with counts on a log scale (1 to 1,000,000), for NO_ERR and ERR genotypes]
35 SNPs, 551 trios [Becker 06], 1% err. rate
Genotype Error Detection- Effect of Population Size
[Figure: ROC curves, Sensitivity (0 to 1) vs. FP rate (0 to 0.015), for n=551, n=129 and n=30, each for (P) and (C)]
Genotype Imputation-Effect of flanking size
[Figure: Error Rate (0% to 12%) and Time (Seconds) (0 to 50,000) vs. # Flanking (3 to 100), for 7 Founders and 13 Founders]
Genotype Imputation-Effect of pedigree data/haplotypes
[Figure: Error Rate (5% to 14%) vs. #Training Haplotypes (30 to 520) for Trios 7-founders, Unrelated 7-founders, Unrelated 13-founders, Trios 13-founders]
Introduction- Single Nucleotide Polymorphisms
Main form of variation between individual genomes: Single Nucleotide Polymorphisms (SNPs)
Human genome density: 1.2x10^7 out of 3x10^9 base pairs
Vast majority bi-allelic: 0/1 encoding (major/minor resp.)
SNP Genotypes are critical to Disease-Gene Mapping
One Method: Admixture Mapping
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcGgtatacacgggTctata …
011000110 + 001100010 (two haplotypes per individual)
012100120 (genotype)
Helper Slide-Software Overview
Only require information about ancestral allele frequencies:
LAMP (No recombination assumption)
WINPOP (a more refined model of recombination events coupled with an adaptive window size computation to achieve increased accuracy)
SWITCH (HMM-Based)
Only require ancestral allele frequencies & genotypes:
SABER (HMM-Based)
Additionally use ancestral haplotype information:
HAPAA (HMM-Based)
GEDI-ADMX (HMM-Based)
Helper Slide- Probabilities
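$P_{M_{kl}}(G_i = g \mid \mathbf{g}_{-i}) \propto P_{M_{kl}}(\mathbf{g}[g_i \leftarrow g])$
$G_i \in \{0,1,2\}$: Random variable genotype at SNP i
$\mathbf{g}[g_i \leftarrow g] := (g_1,\ldots,g_{i-1},g,g_{i+1},\ldots,g_n)$: Multilocus genotype with i set to g
$\mathbf{g}_{-i} := (g_1,\ldots,g_{i-1},g_{i+1},\ldots,g_n)$: Multilocus genotype without i
$g \in \{0,1,2\}$: Genotype variable taken at SNP i
$M_{kl}$: HMM with ancestral pair k,l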
Helper Slide- Emission Details
$E^i_{F^k_i,F^l_i}(g_i) = \sum_{h^k_i,h^l_i \in \{0,1\},\ h^k_i + h^l_i = g_i} P(h^k_i \mid F^k_i)\, P(h^l_i \mid F^l_i)$
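The emission term can be written out as a short Python sketch: the probability of a genotype given the two founder states is the sum, over all allele pairs adding up to that genotype, of the product of the two allele emission probabilities. The function name and the parameterization by P(allele 1 | founder) are assumptions for illustration.

```python
def genotype_emission(g, p_k, p_l):
    """P(genotype g | founders F^k, F^l).

    p_k, p_l : probability that founder F^k (resp. F^l) emits allele 1
    """
    total = 0.0
    for h_k in (0, 1):
        for h_l in (0, 1):
            if h_k + h_l == g:
                prob_k = p_k if h_k == 1 else 1 - p_k
                prob_l = p_l if h_l == 1 else 1 - p_l
                total += prob_k * prob_l
    return total


emissions = [genotype_emission(g, 0.2, 0.7) for g in (0, 1, 2)]
print(emissions)
```

The heterozygous case sums two terms (allele 1 from either founder), while each homozygous case has a single term, so the three values always sum to 1.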
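Window-based local ancestry inference
Input
For every window half-size w: $P_{M_{kl}}(G_i = g \mid \mathbf{g}_{-i})$ for all $i$, $M_{kl}$, $g \in \{0,1,2\}$
Output (Single Window method)
For every i=1..n:
$\hat{a}_i = \arg\max_{kl \in A} \dfrac{1}{|W_i|} \sum_{j \in W_i} P_{M_{kl}}(G_j = g_j \mid \mathbf{g}_{-j})$, with $\hat{a}_i \in A$
Where:
$A = \{kl \mid 1 \le k \le l \le N\}$ for N ancestral populations
$W_i = \{\max\{1, i-w\}, \ldots, \min\{i+w, n\}\}$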
6a4a1a 2a 3a 5a 7a 8a ia na…………………………………………………..….
…….…………………………..……
k
l
YORP :1 CEUP :2
???????????????????????????????????????????????????????????????????????????????????????
???????????????????????????????????????????????????????????????????????????????????????
i
klWj
iiMi
Akli gGPW
a )|(||
1maxargˆ 1ig
Window-based local ancestry inference
93
Window-based local ancestry inference
[Figure: SNPs a_1, a_2, ..., a_i, ..., a_n along the two haplotypes k and l; ancestral populations P_1: YOR, P_2: CEU; local ancestries not yet inferred are shown as ?; ancestry called so far: SNP 1: (1, 1)]
$w = 10$, $W_1 = \{1, \ldots, 11\}$
$\hat{a}_1 = \arg\max_{kl \in A} \frac{1}{11} \sum_{j \in \{1..11\}} P_{M_{kl}}(G_j = g_j \mid \mathbf{g}_{-j})$
Candidates: $\frac{1}{11} \sum_{j \in \{1..11\}} P_{M_{11}}$, $\frac{1}{11} \sum_{j \in \{1..11\}} P_{M_{12}}$, $\frac{1}{11} \sum_{j \in \{1..11\}} P_{M_{22}}$
94
Window-based local ancestry inference
[Figure: SNPs a_1, a_2, ..., a_i, ..., a_n along the two haplotypes k and l; ancestral populations P_1: YOR, P_2: CEU; local ancestries not yet inferred are shown as ?; ancestry called so far: SNPs 1-2: (1, 1)]
$w = 10$, $W_2 = \{1, \ldots, 12\}$
$\hat{a}_2 = \arg\max_{kl \in A} \frac{1}{12} \sum_{j \in \{1..12\}} P_{M_{kl}}(G_j = g_j \mid \mathbf{g}_{-j})$
Candidates: $\frac{1}{12} \sum_{j \in \{1..12\}} P_{M_{11}}$, $\frac{1}{12} \sum_{j \in \{1..12\}} P_{M_{12}}$, $\frac{1}{12} \sum_{j \in \{1..12\}} P_{M_{22}}$
95
Window-based local ancestry inference
[Figure: SNPs a_1, a_2, ..., a_i, ..., a_n along the two haplotypes k and l; ancestral populations P_1: YOR, P_2: CEU; local ancestries not yet inferred are shown as ?; ancestry called so far: SNPs 1-3: (1, 1)]
$w = 10$, $W_3 = \{1, \ldots, 13\}$
$\hat{a}_3 = \arg\max_{kl \in A} \frac{1}{13} \sum_{j \in \{1..13\}} P_{M_{kl}}(G_j = g_j \mid \mathbf{g}_{-j})$
Candidates: $\frac{1}{13} \sum_{j \in \{1..13\}} P_{M_{11}}$, $\frac{1}{13} \sum_{j \in \{1..13\}} P_{M_{12}}$, $\frac{1}{13} \sum_{j \in \{1..13\}} P_{M_{22}}$
96
Window-based local ancestry inference
[Figure: SNPs a_1, a_2, ..., a_i, ..., a_n along the two haplotypes k and l; ancestral populations P_1: YOR, P_2: CEU; local ancestries not yet inferred are shown as ?; ancestry called so far: SNPs 1-3: (1, 1), SNP 4: (1, 2)]
$w = 10$, $W_4 = \{1, \ldots, 14\}$
$\hat{a}_4 = \arg\max_{kl \in A} \frac{1}{14} \sum_{j \in \{1..14\}} P_{M_{kl}}(G_j = g_j \mid \mathbf{g}_{-j})$
Candidates: $\frac{1}{14} \sum_{j \in \{1..14\}} P_{M_{12}}$, $\frac{1}{14} \sum_{j \in \{1..14\}} P_{M_{11}}$, $\frac{1}{14} \sum_{j \in \{1..14\}} P_{M_{22}}$
97
Window-based local ancestry inference
[Figure: inferred local ancestry (1: YOR, 2: CEU) along the two haplotypes k and l for SNPs a_1, a_2, ..., a_i, ..., a_n]
1112222222222222221111111111111122222222222222222222222221111111111111222222
1111111112222222211111111111111112222222222222222111111111111111111111111111
$w = 10$, $W_i = \{i-10, \ldots, i+10\}$
$\hat{a}_i = \arg\max_{kl \in A} \frac{1}{21} \sum_{j \in W_i} P_{M_{kl}}(G_j = g_j \mid \mathbf{g}_{-j})$
Candidates: $\frac{1}{21} \sum_{j \in W_i} P_{M_{12}}$, $\frac{1}{21} \sum_{j \in W_i} P_{M_{11}}$, $\frac{1}{21} \sum_{j \in W_i} P_{M_{22}}$
98
Experimental Results- GEDI 1-pop Imputation
1,444 individuals trained on HAPMAP
CEU haplotype reference panel
Imputed (after masking) 1% of SNPs on chromosome 22
99
Binomial Distribution
n: # of successive trials (reads)
p: the probability of a success (correct map)
1-p = q: the probability of a failure (incorrect map)
To call a heterozygous genotype, each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01
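The heterozygous-call rule above can be illustrated with a small Python sketch. It makes two assumptions that the slide does not spell out: the "binomial probability" is read as the point probability of the observed allele counts, and each allele is expected with probability p = 0.5 at a heterozygous site. The function names are hypothetical.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def call_heterozygous(n0: int, n1: int, threshold: float = 0.01) -> bool:
    """Apply the coverage rule: n0 and n1 are read counts for alleles 0 and 1.

    Assumption: the test uses the binomial point probability with p = 0.5.
    """
    if n0 < 1 or n1 < 1:          # each allele must be covered by at least one read
        return False
    return binomial_pmf(n1, n0 + n1) >= threshold


if __name__ == "__main__":
    for counts in [(3, 3), (5, 0), (8, 1), (9, 1)]:
        print(counts, call_heterozygous(*counts))
```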
100
To determine quality scores, Phred first calculates several parameters related to peak shape and peak resolution at each base. Phred then uses these parameters to look up a corresponding quality score in huge lookup tables. These lookup tables were generated from sequence traces where the correct sequence was known, and are hard coded in Phred; different lookup tables are used for different sequencing chemistries and machines
Phred Score
101
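Conditional Probability for Heterozygous Genotypes
$P(\mathbf{r}_i \mid G_i = 1) = \prod_{r \in \mathbf{r}_i} \left[ \tfrac{1}{2}(1 - \varepsilon_{r(i)}) + \tfrac{1}{2}\varepsilon_{r(i)} \right] = \left(\tfrac{1}{2}\right)^{c_i}$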
102
Baum-Welch overview
The algorithm has two steps:
Calculating the forward and backward probabilities for each HMM state;
Determining the frequency of the transition-emission pair values and dividing it by the probability of the entire sequence. This amounts to calculating the expected count of the particular transition-emission pair. Each time a particular transition is found, the value of the quotient of the transition divided by the probability of the entire sequence goes up, and this value can then be made the new value of the transition.
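As an illustration of the first step only, here is a minimal forward pass for a single-haplotype HMM with K founder states. The array layout (`init`, `trans`, `emit`) is an assumption made for this sketch, not the layout used in the actual implementation, and no re-estimation step is shown.

```python
from typing import List, Sequence


def forward_probability(
    init: Sequence[float],
    trans: Sequence[Sequence[Sequence[float]]],
    emit: Sequence[Sequence[Sequence[float]]],
    hap: Sequence[int],
) -> float:
    """Return P(hap) under a founder HMM using the forward recurrence.

    Assumed layout (hypothetical):
      init[f]         = P(f_1 = f)
      trans[i][f][f2] = P(f_{i+1} = f2 | f_i = f)
      emit[i][f][a]   = P(h_i = a | f_i = f), a in {0, 1}
    """
    K = len(init)
    # alpha[f] = P(h_1..h_i, f_i = f), initialised at the first SNP
    alpha: List[float] = [init[f] * emit[0][f][hap[0]] for f in range(K)]
    for i in range(1, len(hap)):
        alpha = [
            emit[i][f2][hap[i]] * sum(alpha[f] * trans[i - 1][f][f2] for f in range(K))
            for f2 in range(K)
        ]
    return sum(alpha)


if __name__ == "__main__":
    init = [0.5, 0.5]
    trans = [[[1.0, 0.0], [0.0, 1.0]]]           # no founder switching in this toy model
    emit = [[[0.9, 0.1], [0.1, 0.9]]] * 2        # founder 0 mostly emits 0, founder 1 mostly 1
    print(forward_probability(init, trans, emit, [0, 0]))
```

Baum-Welch would combine these forward values with the analogous backward values to obtain expected transition and emission counts.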
103
Model Training- Details
Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father
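P(gi|hi,h’i) set to 1 if hi+h’i=gi and to 0 otherwise
$P(R_{i,j} = r \mid G_i = g_i) = \left(1 - \tfrac{g_i}{2}\right) \varepsilon_{r(i)}^{\,r(i)} (1 - \varepsilon_{r(i)})^{1 - r(i)} + \tfrac{g_i}{2}\, \varepsilon_{r(i)}^{\,1 - r(i)} (1 - \varepsilon_{r(i)})^{r(i)}$
This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:
$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i)=0} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=1} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i)=1} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=0} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 1) = \left(\tfrac{1}{2}\right)^{c_i}$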
104
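Implementation Details
$\mathcal{E}^{i}_{f_i,f'_i}(g_i) = \sum_{h_i,h'_i \in \{0,1\},\; h_i + h'_i = g_i} P(h_i \mid f_i)\, P(h'_i \mid f'_i)$
Forward recurrences:
$\alpha^{1}_{f_1,f'_1} = P(f_1)\, P(f'_1)$
$\alpha^{i+1}_{f_{i+1},f'_{i+1}} = \sum_{f_i=1}^{K} P(f_{i+1} \mid f_i) \sum_{f'_i=1}^{K} P(f'_{i+1} \mid f'_i)\, \alpha^{i}_{f_i,f'_i}\, \mathcal{E}^{i}_{f_i,f'_i}(g_i)$
Backward recurrences are similar
105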
[Figure: Allele coverage for heterozygous SNPs (Watson 454 @ 5.85x avg. coverage); x-axis: reference allele coverage; y-axis: variant allele coverage]
[Figure: Allele coverage for heterozygous SNPs (Watson 454 @ 2.93x avg. coverage); x-axis: reference allele coverage; y-axis: variant allele coverage]
[Figure: Allele coverage for heterozygous SNPs (Watson 454 @ 1.46x avg. coverage); x-axis: reference allele coverage; y-axis: variant allele coverage]
Single SNP Genotyping- Incorporating Base Call Uncertainty
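Let ri denote the set of mapped reads covering SNP locus i and ci=| ri |
For a read r in ri, r(i) denotes the allele observed at locus i
If qr(i) is the phred quality score of r(i), the probability that r(i) is incorrect is given by
$\varepsilon_{r(i)} = 10^{-q_{r(i)}/10}$
The probability of observing read set ri conditional on having genotype gi is then given by:
$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i)=0} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=1} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i)=1} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=0} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 1) = \left(\tfrac{1}{2}\right)^{c_i}$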
111
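The three read-set likelihoods above are straightforward to compute. The following Python sketch assumes each read covering the locus is summarized as a pair (observed allele, phred quality); the function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def phred_to_error(q: float) -> float:
    """Convert a phred quality score q into an error probability 10^(-q/10)."""
    return 10 ** (-q / 10)


def read_set_likelihoods(reads: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Return P(r_i | G_i = g) for g in {0, 1, 2}.

    reads: (allele, phred quality) pairs for the reads covering one SNP locus,
    with allele in {0, 1}. Genotype 0 / 2 is homozygous for allele 0 / 1 and
    genotype 1 is heterozygous, as in the formulas above.
    """
    lik = {0: 1.0, 1: 1.0, 2: 1.0}
    for allele, q in reads:
        eps = phred_to_error(q)
        lik[0] *= (1 - eps) if allele == 0 else eps  # allele 1 is an error under G = 0
        lik[2] *= (1 - eps) if allele == 1 else eps  # allele 0 is an error under G = 2
        lik[1] *= 0.5                                # (1/2)(1 - eps) + (1/2) eps
    return lik


if __name__ == "__main__":
    print(read_set_likelihoods([(0, 20), (1, 20)]))
```

For long read sets, working with log-likelihoods would avoid underflow; the plain products are kept here to mirror the formulas.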
Experimental Results-Read Data
Subset of James Watson’s 454 reads 74.4 million reads with quality scores (of 106.5 million
reads used in [Wheeler et al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/
Average read length ~265 bp
[Figure: Read length distribution; x-axis: read length from 0 to 1200 bp]
112
Experimental Results-Read Data
Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]
Default nucmer parameters (MUM size 20, min cluster size 65, max gap between adjacent matches 90)
Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)
Reads meeting above conditions at multiple genome positions (likely coming from genomic repeats) were discarded
Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:
FP rate: 0.37% FN rate: 21.16%
113
Experimental Results-Read Data
Average coverage by mapped reads of Hapmap SNPs was 5.64x
Lower than [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints
[Figure: SNP coverage by mapped reads; x-axis: coverage from 0 to 30]
114
Experimental Results-Read Data
CEU genotypes from latest Hapmap release (23a) were downloaded from http://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/
Genotypes were phased using the ENT algorithm [Gusev et al 08] and inferred haplotypes were used to train the parent HMMs using Baum-Welch
Duplicate Affymetrix 500k SNP genotypes were downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz
We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations had more than 5% difference in CEU same-strand allele frequency
115
[Figure: Accuracy Comparison (Homozygous Genotypes); x-axis: % uncalled; y-axis: % correctly called; series: 1SNP-Posterior, Binomial0.01, HMM-Posterior]
116
Gene-Seq Algorithms
Posterior Decoding (see presentation)
Greedy:
Markov Approximation:
Composite 2-SNP Viterbi
Posterior Decoding, version 2
117
118
Introduction Helper
Linkage Analysis: Study aimed at establishing linkage between genes. Today linkage analysis serves as a way of gene-hunting and genetic testing.
Genetic linkage is the tendency for genes and other genetic markers to be inherited together because of their location near one another on the same chromosome.
Relative Risk is the risk of an event (or of developing a disease) relative to exposure.
Relative risk is a ratio of the probability of the event occurring in the exposed group versus a non-exposed group.
For example, if the probability of developing lung cancer among smokers was 20% and among non-smokers 1%, then the relative risk of cancer associated with smoking would be 20. Smokers would be twenty times as likely as non-smokers to develop lung cancer.
Association Analysis: Test for association between a genetic variation (e.g. SNP) and one or more quantitative traits
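The relative-risk example on this slide is a single ratio. A minimal Python sketch, with a hypothetical function name, reproduces the smoking arithmetic:

```python
def relative_risk(p_exposed: float, p_unexposed: float) -> float:
    """Ratio of the event probability in the exposed group to the non-exposed group."""
    if p_unexposed <= 0:
        raise ValueError("probability in the non-exposed group must be positive")
    return p_exposed / p_unexposed


if __name__ == "__main__":
    # 20% of smokers vs 1% of non-smokers develop lung cancer: relative risk of about 20
    print(relative_risk(0.20, 0.01))
```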
119
Association analysis
Genome-wide scans made possible by recent progress in Single Nucleotide Polymorphism (SNP) genotyping technologies
Introduction-Disease Gene Mapping
Linkage analysis
Very successful for Mendelian diseases (cystic fibrosis, Huntington’s,…)
Low power to detect genes with small relative risk in complex diseases [RischMerikangas’96]
Cases Controls
120
Genotype Error Detection-Motivation
Even low error levels can have large effects for some study designs (e.g. rare alleles, haplotype-based)
Errors as low as 0.1% can increase Type I error rates in haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp&Becker04]
1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]
121
Improved genotype calling algorithms [Marchini et al. 07, Nicolae et al. 06, Rabbee&Speed 05, Xiao et al. 07]
Explicit modeling in analysis methods [Cheng 07, Hao & Wang 04, Liu et al. 07]
Computationally complex
Separate error detection step
Detected errors can be retyped, imputed, or ignored in downstream analyses
Common approach in pedigree genotype data analysis [Abecasis et al. 02, Douglas et al. 00, Sobel et al. 02]
Genotype Error Detection-Motivation
122
Complexity of Computing Maximum Phasing Probability
• For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of $O(f^{1/2-\varepsilon})$ unless ZPP=NP, where f is the number of founders
• For trios, hard to approximate within $O(f^{1/4-\varepsilon})$
• Reductions from the clique problem
123
NGS Applications
Besides reducing costs of de novo genome sequencing, NGS has found many more apps:
Resequencing, transcriptomics (RNA-Seq), gene regulation (non-coding RNAs, transcription factor binding sites using ChIP-Seq), epigenetics (methylation, nucleosome modifications), metagenomics, paleogenomics, …
NGS is enabling personal genomics
James Watson genome [Wheeler et al 08] sequenced using 454 technology for ~$1 million compared to ~$100 million for the Sanger-sequenced Venter genome [Levy et al 07]
Thousands more individual genomes to be sequenced as part of 1000 Genomes Project
124
Challenges in Medical Applications of Sequencing
Medical sequencing focuses on genetic variation (SNPs, CNVs, genome rearrangements)
Requires accurate determination of both alleles at variable loci
Accuracy is limited by coverage depth due to random nature of shotgun sequencing
For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), only 75-80% accuracy achieved for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08].
[Wheeler et al 08] use hypothesis testing based on binomial distribution
To call a heterozygous genotype each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01
[Wendl&Wilson 08] predict that 21x coverage will be required for sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs”
125
Prior Methods for Calling SNP Genotypes from Read Data
Prior methods are all based on allele coverage
[Levy et al 07] require that each allele be covered by at least 2 reads in order to be called
[Wheeler et al 08] use hypothesis testing based on the binomial distribution
To call a heterozygous genotype, each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01
[Wendl&Wilson 08] generalize these methods by allowing an arbitrary minimum allele coverage k
126
Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads
S. Dinakar1, Y. Hernandez2, J. Kennedy1, I. Mandoiu1, and Y. Wu1
1CSE Department, University of Connecticut 2Department of Computer Science, Hunter College
Motivation
• Medical sequencing focuses on genetic variation (SNPs, CNVs,…)
– Requires determination of both alleles at variable loci; however, this is limited by coverage depth due to random nature of shotgun sequencing
– For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips shows ~75% accuracy for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08]
• [Wendl&Wilson 08] estimate that 21x coverage will be required for medical sequencing of normal tissue samples
– Based on idealized theory that “neglects heuristic inputs”
• We propose genotype calling methods that exploit the following heuristic inputs
– Quality scores reflecting uncertainty in sequencing data
– Allele frequency and linkage disequilibrium info extracted from reference panels such as Hapmap
• Do heuristic inputs help?
– Experiments on a subset of James Watson 454 reads show that accuracy achieved by the binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by our methods using less than 1/4 of the reads
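Single SNP Genotype Calling
• Let ri denote the arbitrarily ordered set of mapped reads covering SNP locus i and ci=| ri |
– For a read r in ri, r(i) denotes the allele observed at locus i
– If qr(i) is the phred quality score of r(i), the probability that r(i) is incorrect is given by $\varepsilon_{r(i)} = 10^{-q_{r(i)}/10}$
• The probability of observing read set ri conditional on having genotype Gi:
$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i)=0} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=1} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i)=1} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=0} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 1) = \left(\tfrac{1}{2}\right)^{c_i}$
• Applying Bayes’ formula:
$P(G_i = g_i \mid \mathbf{r}_i) = \dfrac{P(G_i = g_i)\, P(\mathbf{r}_i \mid G_i = g_i)}{\sum_{g \in \{0,1,2\}} P(G_i = g)\, P(\mathbf{r}_i \mid G_i = g)}$
– Where $P(G_i = g_i)$ are allele frequencies inferred from a representative panel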
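The Bayes step for a single SNP amounts to normalizing prior times likelihood over the three genotypes. The sketch below assumes the genotype priors and read-set likelihoods are supplied as dictionaries keyed by genotype; the function name and the example numbers are hypothetical.

```python
from typing import Dict


def genotype_posterior(prior: Dict[int, float], lik: Dict[int, float]) -> Dict[int, float]:
    """Return P(G_i = g | r_i) for g in {0, 1, 2} via Bayes' formula.

    prior[g] = P(G_i = g), e.g. estimated from a reference panel
    lik[g]   = P(r_i | G_i = g), the read-set likelihood
    """
    joint = {g: prior[g] * lik[g] for g in (0, 1, 2)}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("all genotypes have zero probability given the reads")
    return {g: joint[g] / total for g in joint}


if __name__ == "__main__":
    prior = {0: 0.25, 1: 0.5, 2: 0.25}      # illustrative values only
    lik = {0: 0.0099, 1: 0.25, 2: 0.0099}   # one read per allele, both at phred 20
    post = genotype_posterior(prior, lik)
    print(max(post, key=post.get), post)
```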
Multilocus Model
[Figure: graphical model with founder states F1, F2, …, Fn and F'1, F'2, …, F'n, haplotype alleles H1, …, Hn and H'1, …, H'n, genotypes G1, …, Gn, and reads Ri,1, …, Ri,ci at each locus i]
HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al. 08, Kennedy et al. 08]
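Model Training
• Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father
• P(gi|hi,h’i) set to 1 if hi+h’i=gi and to 0 otherwise
• $P(R_{i,j} = r \mid G_i = g_i) = \left(1 - \tfrac{g_i}{2}\right) \varepsilon_{r(i)}^{\,r(i)} (1 - \varepsilon_{r(i)})^{1 - r(i)} + \tfrac{g_i}{2}\, \varepsilon_{r(i)}^{\,1 - r(i)} (1 - \varepsilon_{r(i)})^{r(i)}$
– This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:
$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i)=0} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=1} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i)=1} (1 - \varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=0} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 1) = \left(\tfrac{1}{2}\right)^{c_i}$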
• Joint probabilities can be computed using a forward-backward algorithm:
$P(g_i, \mathbf{r}) = \sum_{f_i=1}^{K} \sum_{f'_i=1}^{K} \alpha^{i}_{f_i,f'_i}\, \beta^{i}_{f_i,f'_i}\, \mathcal{E}^{i}_{f_i,f'_i}(g_i)\, P(\mathbf{r}_i \mid g_i)$
• Direct implementation gives $O(m + nK^4)$ time, where
– m = number of reads
– n = number of SNPs
– K = number of founder haplotypes in HMMs
• Runtime reduced to $O(m + nK^3)$ using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]
Posterior Decoding Algorithm
1. For each i = 1..n, compute $g_i^* = \arg\max_{g_i} P(g_i \mid \mathbf{r}) = \arg\max_{g_i} P(g_i, \mathbf{r})$
2. Return $\mathbf{g}^* = (g_1^*, \ldots, g_n^*)$
Experimental Setup
• Read data
– 74.4 million reads with quality scores (of 106.5 million reads used in [Wheeler et al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/
– Average read length ~265 bp
• Reference population data
– CEU genotypes from latest Hapmap release (23a) from http://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/
– Genotypes were phased using the ENT algorithm [Gusev et al. 08] and inferred haplotypes were used to train the parent HMMs using the Baum-Welch algorithm
• Genotype data
– Duplicate Affymetrix 500k SNP genotypes downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz
– We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations had more than 5% difference in CEU same-strand allele frequency
Read Mapping Procedure
• Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]
– Default nucmer parameters:
• MUM size 20, min cluster size 65, max gap between adjacent matches 90
– Additional filtering
• At least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)
– Reads meeting above conditions at multiple genome positions were discarded
• These reads likely come from genomic repeats and are difficult to map accurately
• Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:
– Estimated mapping FP rate is 0.37%
Read Mapping Results
[Figure: Hapmap SNP coverage by mapped reads; x-axis: coverage from 0 to 30]
• Average coverage by mapped reads of Hapmap SNPs was 5.64x
– Lower than [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints
Genotyping Accuracy at 5.64x Read Coverage
[Figure: Homozygous Genotypes; x-axis: % uncalled; y-axis: % correctly called; series: 1SNP-Posterior, Binomial0.01, HMM-Posterior]
[Figure: Heterozygous Genotypes; x-axis: % uncalled; y-axis: % correctly called; series: 1SNP-Posterior, Binomial0.01, HMM-Posterior]
Conclusions & Ongoing Work
• Exploiting “heuristic inputs” such as quality scores, population allele frequency, and LD information yields significant improvements in genotype calling accuracy from low-coverage sequencing data
– Improvement depends on the coverage depth (higher at lower coverage)
• Accuracy achieved by binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by HMM-based posterior decoding using less than 1/4 of the reads; accuracy achieved by the binomial test for 2.8x coverage is achieved using less than 1/8 of the reads
– Small gain from incorporating quality scores may be due to poor calibration of 454 quality scores [Brockman et al. 08, Quinlan et al. 08]
– Although the evaluation is on 454 reads, our methods are well-suited for sequencing technologies with shorter reads
• Ongoing work includes modeling ambiguities in read mapping and extending the methods to population sequencing data (removing the need for reference panels)
ACKNOWLEDGEMENTS
This work was supported in part by NSF under awards IIS-0546457 and DBI-0543365 to IM and IIS-0803440 to YW. SD and YH performed this research as part of the Summer REU program “Bio-Grid Initiatives for Interdisciplinary Research and Education” funded by NSF under award CCF-0755373.
Genotyping Accuracy at Lower Read Coverage
[Figure: x-axis: % uncalled; y-axis: % correctly called; series: HMM-Posterior and Binomial0.01, each at 1/16, 1/8, 1/4, 1/2, and 1/1 of the reads]
Proposed Pipeline for Genotype Calling from Short Reads
[Figure: pipeline diagram. Inputs: reference genome sequence (FASTA, e.g. >gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assembly), read sequences and quality scores (e.g. >gnl|ti|1779718824 name:EI1W3PE02ILQXT), and Hapmap genotypes. Intermediate data: mapped reads. Output: SNP genotype calls, one per SNP with rsID, two alleles, and a numeric score (e.g. rs12095710 T T 9.988139e-01)]
ORIGINAL PROPOSAL PRESENTATION NEXT
Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity
Justin Kennedy
Dissertation Proposal for the Degree of Doctorate in Philosophy
Computer Science & Engineering Department
University of Connecticut
129
Outline
Introduction
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
Single Individual Genotyping from Low-Coverage Sequencing Data
Conclusion
130
Introduction-Single Nucleotide Polymorphisms
Main form of variation between individual genomes: Single Nucleotide Polymorphisms (SNPs)
High density in the human genome: 1.2x10^7 out of 3x10^9 base pairs
Vast majority bi-allelic 0/1 encoding (major/minor resp.)
SNP Genotypes are critical to Disease-Gene Mapping
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcGgtatacacgggTctata …
131
Introduction- Why SNP Genotypes?
Single Nucleotide Polymorphisms (SNPs) have become the genetic marker of choice for genome wide association studies (GWASs)
GWAS: Method for mapping disease associated genes by typing a dense set of markers in large numbers of cases and controls followed by a statistical test of association. Provides higher statistical power compared to other gene mapping methods such as linkage for uncovering genetic basis of complex diseases
Ongoing GWASs generate a deluge of genotype data
Genetic Association Information Network (GAIN): 6 studies totaling 18,000 individuals typed at 500,000 to 940,000 SNP loci
Wellcome Trust Case-Control Consortium (WTCCC): 7 studies totaling 17,000 individuals typed at 500,000 SNPs
Major concern: quality of genotype data
132
Introduction-Computational Challenges to Disease Gene Mapping
Genotype error detection: Low levels of genotyping errors can decrease statistical power and invalidate statistical tests for disease association based on haplotypes
Handling structural variation data provided by new sequencing technologies: Accurate determination of both alleles at variable loci is essential, and is limited by coverage depth due to random nature of shotgun sequencing
133
Outline
Introduction
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
Motivation
Likelihood Sensitivity Approach to Error Detection
Hidden Markov Model of Haplotype Diversity
Efficiently Computable Likelihood functions
Experimental Results
Single Individual Genotyping from Low-Coverage Sequencing Data
Conclusion
134
Genotype Error Detection-Motivation
A real problem despite advances in genotyping technology
[Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP genotypes typed multiple times
1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]
Error types
Systematic errors (e.g., assay failure) detected by HWE test [Hosking et al. 2004]
For pedigree data some errors detected as Mendelian Inconsistencies (MIs)
E.g. only ~30% detectable as MIs for trios [Gordon et al. 1999]
Undetected errors
Methods for handling undetected errors:
Improved genotype calling algorithms [Marchini et al. 07, Xiao et al. 07]
Explicit modeling in analysis methods [Cheng 07, Hao & Wang 04, Liu et al. 07]
Separate error detection step
Detected errors can be retyped, imputed, or ignored in downstream analyses
135
Genotype Error Detection-Haplotypes and Genotypes
Haplotype: description of SNP alleles on a chromosome
0/1 vector: 0 for major allele, 1 for minor
Diploids: two homologous copies of each autosomal chromosome
One inherited from mother and one from father
Genotype: description of alleles on both chromosomes
0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles
011000110 + 001100010 (two haplotypes per individual)
021200210 (genotype)
136
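The encoding on this slide (0 or 1 when both chromosomes carry the same allele, 2 when they differ) can be checked against the slide's own example with a short Python sketch; the function name is hypothetical.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string.

    Encoding used on this slide: 0 = both major, 1 = both minor,
    2 = the two chromosomes carry different alleles.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


if __name__ == "__main__":
    print(genotype_from_haplotypes("011000110", "001100010"))
```

With the slide's two haplotypes this yields 021200210, the genotype shown in the figure.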
Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
0 1 2 1 0 2
0 2 2 1 0 2
0 2 2 1 0 2
Mother Father
Child
Likelihood of best phasing for original trio T
0 1 1 1 0 0 h1
0 0 0 1 0 1 h3
0 1 1 1 0 0 h1
0 1 0 1 0 1 h2
0 0 0 1 0 1 h3
0 1 1 1 0 0 h4
L(T) = MAX p(h1) p(h2) p(h3) p(h4)
137
Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
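Mother 0 1 2 1 0 2
Father 0 2 2 1 0 2
Child 0 2 2 1 0 2
Likelihood of best phasing for original trio T
$L(T) = \max_{h_1,h_2,h_3,h_4} p(h_1)\,p(h_2)\,p(h_3)\,p(h_4)$
Modified trio T' (genotype under test, marked ?, is replaced)
Mother: h'1 = 0 1 0 1 0 1, h'2 = 0 1 1 1 0 0
Father: h'3 = 0 0 0 1 0 0, h'4 = 0 1 1 1 0 1
Child: h'1 = 0 1 0 1 0 1, h'3 = 0 0 0 1 0 0
Likelihood of best phasing for modified trio T'
$L(T') = \max_{h'_1,h'_2,h'_3,h'_4} p(h'_1)\,p(h'_2)\,p(h'_3)\,p(h'_4)$
138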
Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
Original trio T and modified trio T' (genotype under test marked ?)
Mother 0 1 2 1 0 2
Father 0 2 2 1 0 2
Child 0 2 2 1 0 2
Large change in likelihood suggests likely error
Flag genotype as an error if L(T')/L(T) > R, where R is the detection threshold (e.g., R=10^4)
139
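The likelihood-ratio test can be illustrated on a simplified case. The sketch below is not the FAMHAP or GEDI implementation: it assumes a single unrelated genotype rather than a trio, a small explicit haplotype frequency table (the hypothetical `hap_freq`), and the encoding used on these slides (0/1 = homozygous major/minor, 2 = heterozygous). The function names are illustrative only.

```python
from itertools import combinations_with_replacement

def genotype_of(h1, h2):
    # 0/1 = homozygous major/minor, 2 = heterozygous
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

def best_phasing_likelihood(genotype, hap_freq):
    # L(G): maximum of p(h1) * p(h2) over haplotype pairs compatible with G
    best = 0.0
    for h1, h2 in combinations_with_replacement(hap_freq, 2):
        if genotype_of(h1, h2) == tuple(genotype):
            best = max(best, hap_freq[h1] * hap_freq[h2])
    return best

def flag_errors(genotype, hap_freq, R=1e4):
    # Flag locus j if replacing its genotype raises the likelihood by more than R
    base = best_phasing_likelihood(genotype, hap_freq)
    flagged = []
    for locus, g in enumerate(genotype):
        alt_best = 0.0
        for alt in (0, 1, 2):
            if alt == g:
                continue
            modified = list(genotype)
            modified[locus] = alt
            alt_best = max(alt_best, best_phasing_likelihood(modified, hap_freq))
        if alt_best > 0 and (base == 0 or alt_best / base > R):
            flagged.append(locus)
    return flagged

# Hypothetical frequencies: two common haplotypes and one very rare one
hap_freq = {(0, 0, 0): 0.7, (1, 1, 1): 0.299999, (0, 1, 0): 0.000001}
suspicious = flag_errors((0, 2, 0), hap_freq)
```

In this toy table the genotype (0, 2, 0) can only be explained with the rare haplotype, so changing its middle locus gives a much more likely phasing and that locus is flagged.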
Genotype Error Detection- Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
Implementation in FAMHAP software
Window-based algorithm
For each window including the SNP under test, generate list of H most frequent haplotypes (default H=50)
Find most likely trio phasings by pruned search over the H^4 quadruples of frequent haplotypes
Flag genotype as an error if L(T’)/L(T) > R for at least one window
Mother …201012 1 02210...
Father …201202 2 10211...
Child …000120 2 21021...
140
Genotype Error Detection- Limitations of FAMHAP
Truncating the list of haplotypes to size H may lead to sub-optimal phasings and inaccurate L(T) values
False positives caused by nearby errors (due to the use of multiple short windows)
Our approach:
HMM of haplotype frequencies: all haplotypes represented + no need for short windows
Alternate likelihood functions: scalable runtime
141
Genotype Error Detection-Hidden Markov Model of Haplotype Diversity
Similar HMMs proposed by [Kimmel & Shamir 05, Rastas et al. 05, Schwartz 04]
Paths with high transition probability correspond to “founder” haplotypes
Haplotype sequence/paths computed using Viterbi and forward algorithms
[Figure: HMM with K founder states at each SNP locus; K = #Founders (e.g. K=4), N = #SNPs (e.g. N=5)]
Transition probability: $\gamma_j(q', q)$
Emission probability: $E(j; q)$
142
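As an illustration of how the forward algorithm gives the total probability of a haplotype under such an HMM, here is a minimal sketch. The array layout is an assumption made for this example: `init[q]` is the initial state probability, `trans[j][q_prev, q]` the transition probability between loci j and j+1, and `emit1[j][q]` the probability that state q at locus j emits allele 1.

```python
import itertools
import numpy as np

def haplotype_probability(hap, init, trans, emit1):
    # Forward algorithm: sum over all founder paths that emit `hap`
    e = emit1[0] if hap[0] == 1 else 1.0 - emit1[0]
    alpha = init * e
    for j in range(1, len(hap)):
        e = emit1[j] if hap[j] == 1 else 1.0 - emit1[j]
        alpha = (alpha @ trans[j - 1]) * e
    return float(alpha.sum())

# Hypothetical model with K=4 founders and N=5 SNPs
rng = np.random.default_rng(1)
K, N = 4, 5
init = rng.random(K)
init /= init.sum()
trans = rng.random((N - 1, K, K))
trans /= trans.sum(axis=2, keepdims=True)
emit1 = rng.random((N, K))

# The probabilities of all 2^N haplotypes should sum to 1
total = sum(haplotype_probability(h, init, trans, emit1)
            for h in itertools.product([0, 1], repeat=N))
```

Each step costs O(K^2), so one haplotype probability takes O(NK^2) time.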
Genotype Error Detection-Hidden Markov Model of Haplotype Diversity
Training: 2-step algorithm that exploits pedigree info
Step 1: Obtain haplotypes using either:
ENT: a pedigree-aware haplotype phasing algorithm based on entropy minimization
Haplotype reference panel (e.g. HAPMAP)
Step 2: Train HMM based on inferred haplotypes, using Baum-Welch
[Figure: HMM with K founder states at each SNP locus; K = #Founders (e.g. K=4), N = #SNPs (e.g. N=5)]
Transition probability: $\gamma_j(q', q)$
Emission probability: $E(j; q)$
143
Genotype Error Detection- Alternate Likelihood Functions
• Viterbi probability (ViterbiProb): Maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio
• Probability of Viterbi Haplotypes (ViterbiHaps): Product of total probabilities of the 4 Viterbi haplotypes
• Total Trio Probability (TotalProb): Total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths
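ViterbiProb recurrence, with $q$ and $q'$ denoting 4-tuples of HMM states:
$V(j; q) = E(j; q) \max_{q' \in Q_{j-1}^4} \{ V(j-1; q')\, \gamma_j(q', q) \}$
TotalProb recurrence:
$T(j; q) = E(j; q) \sum_{q' \in Q_{j-1}^4} \{ T(j-1; q')\, \gamma_j(q', q) \}$
144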
Genotype Error Detection- Speed Up of Viterbi Probability
For a fixed trio, Viterbi paths can be found using a 4-path version of Viterbi’s algorithm in $O(NK^8)$ time
$K^3$ speed-up by reuse of common terms (similar to [Rastas et al. 05]):
$V(j; q_1,q_2,q_3,q_4) = E(j; q_1,q_2,q_3,q_4) \max_{q'_4 \in Q_{j-1}} \{ S_3(j; q_1,q_2,q_3,q'_4)\, \gamma_j(q'_4, q_4) \}$
$S_3(j; q_1,q_2,q_3,q'_4) = \max_{q'_3 \in Q_{j-1}} \{ S_2(j; q_1,q_2,q'_3,q'_4)\, \gamma_j(q'_3, q_3) \}$
$S_2(j; q_1,q_2,q'_3,q'_4) = \max_{q'_2 \in Q_{j-1}} \{ S_1(j; q_1,q'_2,q'_3,q'_4)\, \gamma_j(q'_2, q_2) \}$
$S_1(j; q_1,q'_2,q'_3,q'_4) = \max_{q'_1 \in Q_{j-1}} \{ V(j-1; q'_1,q'_2,q'_3,q'_4)\, \gamma_j(q'_1, q_1) \}$
145
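The effect of reusing common terms can be checked numerically. The sketch below compares one step of the 4-path maximization done by brute force (O(K^8)) with the same step done one path at a time (O(K^5)). It assumes, for illustration only, a single K x K transition matrix `gamma` shared by the four paths and leaves out the emission factor, which multiplies both results equally.

```python
import itertools
import numpy as np

def step_bruteforce(V_prev, gamma):
    # max over all 4-tuples q' of V_prev[q'] * prod_i gamma[q'_i, q_i]
    K = gamma.shape[0]
    out = np.zeros((K,) * 4)
    for q in itertools.product(range(K), repeat=4):
        best = 0.0
        for qp in itertools.product(range(K), repeat=4):
            val = V_prev[qp]
            for a, b in zip(qp, q):
                val *= gamma[a, b]
            best = max(best, val)
        out[q] = best
    return out

def step_factored(V_prev, gamma):
    # Maximize over one previous state q'_i at a time, reusing partial results
    S = V_prev
    for axis in range(4):
        S = np.moveaxis(S, axis, 0)                       # S[q'_axis, ...]
        S = (S[:, None, ...] * gamma[:, :, None, None, None]).max(axis=0)
        S = np.moveaxis(S, 0, axis)                       # new state back in place
    return S

rng = np.random.default_rng(0)
K = 3
V_prev = rng.random((K,) * 4)
gamma = rng.random((K, K))
same = np.allclose(step_bruteforce(V_prev, gamma), step_factored(V_prev, gamma))
```

The factorization is valid because all factors are non-negative, so each maximization can be pushed inside the product.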
Genotype Error Detection- Overall Function Runtimes
Viterbi probability
Likelihoods of all 3N modified trios can be computed within $O(NK^5)$ time using forward-backward algorithm
Overall runtime for M trios: $O(MNK^5)$
Probability of Viterbi haplotypes
Obtain haplotypes from standard traceback, then compute haplotype probabilities using forward algorithms
Overall runtime: $O(MN(K^5 + NK^2))$
Total trio probability
Similar pre-computation speed-up & forward-backward algorithm
Overall runtime: $O(MNK^5)$
146
Genotype Error Detection- Experimental Results (Setup)
Real dataset [Becker et al. 2006]
35 SNP loci covering a region of 91kb
551 trios
Synthetic datasets
35 SNPs, 551 trios
Preserved missing data pattern of real dataset
Haplotypes assigned to trios based on frequencies inferred from real dataset
1% error rate using random allele insertion model
147
Genotype Error Detection- Comparison of Likelihood Functions
[Figure: sensitivity (0 to 1) vs. FP rate (-0.005 to 0.015); series: VitHaps-P, VitProb-P, TotalProb-P, VitHaps-C, VitProb-C, TotalProb-C]
Sensitivity = TP/(TP+FN)
False Positive rate = 1 - TN/(FP+TN)
148
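As a small worked example of the two accuracy measures, the following sketch computes them from confusion counts. The function names and the counts are hypothetical.

```python
def sensitivity(tp, fn):
    # Fraction of true errors that are flagged
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    # Fraction of correct genotypes that are flagged: 1 - TN/(FP+TN)
    return 1 - tn / (fp + tn)

sens = sensitivity(tp=8, fn=2)
fpr = false_positive_rate(fp=5, tn=995)
```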
Distribution of Log-Likelihood Ratios for TotalTrioProb
[Figure: two histograms, Parents-TRIOS and Children-TRIOS; x-axis: log-likelihood ratio (0 to 5.94); y-axis: count, log scale (1 to 1,000,000); series: NO_ERR, ERR]
Same-locus errors in parents
149
Genotype Error Detection-“Combined” Detection Method
Compute 4 likelihood ratios
Trio
Mother-child duo
Father-child duo
Child (unrelated)
Flag as error if all ratios are above detection threshold
150
Distribution of Log-Likelihood Ratios for Combined Method
[Figure: two histograms, Parents-COMBINED and Children-COMBINED; x-axis: log-likelihood ratio (0 to 5.94 for parents, 0 to 5.88 for children); y-axis: count, log scale (1 to 1,000,000); series: NO_ERR, ERR]
151
Comparison with FAMHAP (Children)
[Figure: sensitivity (0 to 1) vs. FP rate (0 to 0.015); series: TotalProb-TRIO, TotalProb-COMBINED, FAMHAP; reference line: #FP=#FN]
152
Comparison with FAMHAP (Parents)
[Figure: sensitivity (0 to 1) vs. FP rate (0 to 0.015); series: TotalProb-TRIO, TotalProb-COMBINED, FAMHAP; reference line: #FP=#FN]
153
Outline
Introduction
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
Single Individual Genotyping from Low-Coverage Sequencing Data
Motivation
Single SNP Calling Algorithm
Multilocus HMM Calling Algorithm
Experimental Results
Conclusion
154
Low Coverage Genotyping-Next Generation Sequencing (NGS)
Illumina / Solexa Genetic Analyzer 1G: 1000 Mb/run, 35bp reads
Roche / 454 Genome Sequencer FLX: 100 Mb/run, 400bp reads
Applied Biosystems SOLiD: 3000 Mb/run, 25-35bp reads
NGS delivers sequencing read throughput several orders of magnitude higher than older technologies (e.g. Sanger sequencing)
More improvements expected in quest for $1,000 genome
155
Low Coverage Genotyping-NGS Applications and Challenges
NGS is enabling many applications, including personal genomics
~$1 million for sequencing James Watson genome [Wheeler et al 08] using 454 technology
~$100 million for the Sanger-sequenced Venter genome [Levy et al 07]
Thousands more individual genomes to be sequenced as part of 1000 Genomes Project
Challenges:
Sequencing requires accurate determination of genetic variation (e.g. SNPs)
Accuracy is limited by coverage depth due to random nature of shotgun sequencing
For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), only 75-80% accuracy achieved for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08].
[Wheeler et al 08] use hypothesis testing based on binomial distribution
[Wendl&Wilson 08] predict that 21x coverage will be required for sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs”
156
Low Coverage Genotyping-Do Heuristic Inputs Help?
We propose methods incorporating two additional sources of information:
Quality scores reflecting uncertainty in sequencing data
Linkage disequilibrium (LD) information and allele frequencies extracted from reference panels such as Hapmap
Experiments on a subset of the James Watson 454 reads show that our methods yield improved genotyping accuracy
Improvement depends on the coverage depth (higher at lower coverage), e.g., accuracy achieved by the binomial test of [Wheeler et al. 08] for 5.6-fold mapped read coverage is achieved by our methods using less than 1/4 of the reads
157
Low Coverage Genotyping-Pipeline for Single Genotype Calling
158
Single SNP Genotyping- Basic Notations
Biallelic SNPs: 0 = major allele, 1 = minor allele (reads with non-reference alleles are discarded)
SNP genotypes: 0/2 = homozygous major/minor, 1=heterozygous
[Figure: mapped reads with allele 0, mapped reads with allele 1, and sequencing errors, shown above the inferred genotypes 012100120]
159
Single SNP Genotyping- Incorporating Base Call Uncertainty
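Let r_i denote the set of mapped reads covering SNP locus i and c_i = |r_i|
For a read r in r_i, r(i) denotes the allele observed at locus i
If q_{r(i)} is the phred quality score of r(i), the probability that r(i) is incorrect is given by
$\varepsilon_{r(i)} = 10^{-q_{r(i)}/10}$
The probability of observing read set r_i conditional on having genotype G_i is then given by:
$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i)=0} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=1} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 1) = \left(\frac{1}{2}\right)^{c_i}$
$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i)=1} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=0} \varepsilon_{r(i)}$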
160
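The three conditional probabilities can be computed directly from the reads covering a locus. This is a minimal sketch that assumes each read is given as an (allele, phred quality) pair; the function name and the sample reads are hypothetical.

```python
def read_set_likelihoods(reads):
    """Return [P(r_i | G_i=0), P(r_i | G_i=1), P(r_i | G_i=2)].

    reads: list of (allele, phred_quality) pairs covering one SNP locus,
    with allele 0 = major and 1 = minor.
    """
    lik0 = lik2 = 1.0
    for allele, q in reads:
        eps = 10 ** (-q / 10)          # probability that the base call is wrong
        if allele == 0:
            lik0 *= 1 - eps            # correct call under homozygous major
            lik2 *= eps                # miscall under homozygous minor
        else:
            lik0 *= eps
            lik2 *= 1 - eps
    lik1 = 0.5 ** len(reads)           # heterozygous: (1/2)^c_i
    return [lik0, lik1, lik2]

# Two reads with phred quality 20 (error probability 0.01), one per allele
liks = read_set_likelihoods([(0, 20), (1, 20)])
```

With one read supporting each allele, the heterozygous likelihood (0.25) dominates the two homozygous likelihoods (0.0099 each).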
Single SNP Genotype Calling
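Applying Bayes’ formula:
$P(G_i = g_i \mid \mathbf{r}_i) = \frac{P(G_i = g_i)\, P(\mathbf{r}_i \mid G_i = g_i)}{\sum_{g \in \{0,1,2\}} P(G_i = g)\, P(\mathbf{r}_i \mid G_i = g)}$
where the priors $P(G_i = g_i)$ are obtained from allele frequencies inferred from a representative panel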
161
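A minimal sketch of the single-SNP posterior computation follows. It takes the three read-set likelihoods and three genotype priors as plain lists; the example priors are hypothetical and, as an assumption of this sketch, correspond to Hardy-Weinberg proportions for a minor allele frequency of 0.5.

```python
def genotype_posteriors(likelihoods, priors):
    # Bayes' formula: posterior is proportional to prior * likelihood
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [x / total for x in joint]

def call_genotype(likelihoods, priors):
    # Return the genotype (0, 1 or 2) with maximum posterior probability
    post = genotype_posteriors(likelihoods, priors)
    best = max(range(3), key=lambda g: post[g])
    return best, post[best]

likelihoods = [0.0099, 0.25, 0.0099]   # P(r_i | G_i = 0, 1, 2)
priors = [0.25, 0.5, 0.25]             # hypothetical P(G_i = 0, 1, 2)
genotype, confidence = call_genotype(likelihoods, priors)
```

Leaving a genotype uncalled when `confidence` falls below a chosen cutoff is one way to trade call rate for accuracy.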
Low Coverage Genotyping-Pipeline for Multilocus Genotyping
162
Multilocus Genotyping-HF-HMM
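[Figure: HF-HMM graphical model. Two founder chains F_1 ... F_n and F'_1 ... F'_n emit haplotype alleles H_i and H'_i, which together determine genotype G_i; each G_i emits the reads R_{i,1} ... R_{i,c_i} covering locus i]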
HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al 08, Kennedy et al 08]
163
Multilocus Genotyping-HF-HMM Training
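Training HMM based on Baum-Welch algorithm from haplotypes inferred from populations of origin for mother/father
Use haplotype reference panel (e.g. HAPMAP) for training haplotypes
Conditional probabilities for read sets are given by the formulas derived for the single SNP case:
$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i)=0} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=1} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 1) = \left(\frac{1}{2}\right)^{c_i}$
$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i)=1} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=0} \varepsilon_{r(i)}$
164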
Multilocus Genotyping Problem
GIVEN:
• Shotgun read sets r=(r1, r2, … , rn)
• Quality scores
• Trained HMMs representing LD in populations of origin for mother/father
FIND:
• Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum posterior probability, i.e., g*=argmaxg P(g | r)
Remark: max_g P(g | r) is hard to approximate within $O(n^{1-\epsilon})$ unless ZPP=NP, and thus the multilocus genotyping problem is NP-hard
165
Multilocus Genotyping-HMM-Posterior Decoding Algorithm
1. For each i = 1..n, compute $g^*_i = \arg\max_{g_i} P(g_i \mid \mathbf{r}) = \arg\max_{g_i} P(g_i, \mathbf{r})$
2. Return $g^* = (g^*_1, \ldots, g^*_n)$
166
Forward-Backward Computation of Posterior Probabilities
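$P(g_i, \mathbf{r}) = \sum_{f_i=1}^{K} \sum_{f'_i=1}^{K} \alpha^i_{f_i,f'_i}\, E^i_{f_i,f'_i}(g_i)\, P(\mathbf{r}_i \mid g_i)\, \beta^i_{f_i,f'_i}$
where $\alpha^i$ and $\beta^i$ denote the forward and backward probabilities for founder pair $(f_i, f'_i)$ at locus i, and $E^i_{f_i,f'_i}(g_i)$ is the probability of genotype $g_i$ given that founder pair
[Figure: HF-HMM with locus i highlighted: founder states f_i, f'_i, haplotype alleles h_i, h'_i, genotype g_i, and reads r_{1,1} ... R_{n,c_n}]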
167
Multilocus Genotyping-Runtime
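Direct implementation gives O(m+nK^4) time:
m = number of reads
n = number of SNPs
K = number of founder haplotypes in HMMs
Runtime reduced to O(m+nK^3) by reusing common terms: each forward step first sums over $f'_i = 1..K$ with weights $P(f'_{i+1} \mid f'_i)$, then sums the stored partial results over $f_i = 1..K$ with weights $P(f_{i+1} \mid f_i)$
where the emission term for founder pair $(f_i, f'_i)$ is
$\sum_{h_i,h'_i \in \{0,1\}} P(h_i \mid f_i)\, P(h'_i \mid f'_i)\, P(\mathbf{r}_i \mid G_i = h_i + h'_i)$
172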
Low Coverage Genotyping-Experimental Results- Setup
Subset of James Watson’s 454 reads
74.4M of 106.5M reads
265 bp/read
avg coverage: 5.64X
Quality scores included
Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]
Estimated mapping error rates:
FP rate: 0.37%
FN rate: 21.16%
Haplotype reference panel used to train HMM generated from Hapmap CEU genotypes (release 23a)
173
Accuracy Comparison (Heterozygous Genotypes)
[Figure: % correctly called (80 to 100) vs. % uncalled (0 to 100); series: 1SNP-Posterior, Binomial0.01, HMM-Posterior]
174
Accuracy Comparison (All Genotypes)
[Figure: % correctly called (93 to 100) vs. % uncalled (0 to 100); series: 1SNP-Posterior, Binomial0.01, HMM-Posterior]
175
Accuracy at Varying Coverages (All Genotypes)
[Figure: % correctly called (70 to 100) vs. % uncalled (0 to 100); series: Posterior and Binomial, each at 1/16, 1/8, 1/4, 1/2, and 1/1 of the reads]
176
Outline
Introduction
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
Single Individual Genotyping from Low-Coverage Sequencing Data
Conclusion
177
Conclusion Genotype Error Detection
Contributions:
Proposed efficient methods for error detection in trio genotype data based on an HMM of haplotype diversity
Can exploit available pedigree info
Yield improved detection accuracy compared to FAMHAP
Runtime grows linearly in #SNPs and #individuals
Papers/Software:
J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. Journal of Computational Biology (to appear).
J. Kennedy, I.I. Mandoiu and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. Proc. WABI 2007, R. Giancarlo and S. Hannenhalli (eds.), LNBI 4645:73-84, 2007
J. Kennedy, I.I. Mandoiu and B. Pasaniuc. Genotype Error Detection using Hidden Markov Models of Haplotype Diversity. In 3rd RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotypes, 2007
Software: GEDI (Genotype Error Detection and Imputation)
Best poster award: J. Kennedy, I.I. Mandoiu and B. Pasaniuc. Genotype Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity. ISBRA 2008.
178
Conclusion Genotyping from low coverage sequencing reads
Contributions:
Exploiting “heuristic inputs” such as quality scores and population allele frequency and LD information yields significant improvements in genotype calling accuracy from low-coverage sequencing data
LD information extracted from a reference panel gives highest benefit
Relatively small gain from incorporating quality scores may be due in part to the poor calibration of 454 quality scores [Brockman et al 08, Quinlan et al 08]
Although our evaluation is on 454 reads, the methods are well-suited for short read technologies
Papers/Software:
S. Dinakar, Y. Hernandez, J. Kennedy, I. Mandoiu, and Y. Wu. Single individual genotyping from low-coverage sequencing data (in preparation)
Presentation: J. Kennedy. Linkage Disequilibrium Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads. DIMACS workshop on Computational Issues in Genetic Epidemiology.
Software: Gene-Seq
179
Conclusion Future Work: Population-based Genotyping from Low-Coverage Sequencing Data
Extending the single individual genotyping methods to population sequencing data (removing the need for reference panels)
Use the same HF-HMM as before, but train the model with an EM algorithm on the provided population-level data
180
Questions?
181
Acknowledgments
This work was supported in part by NSF (awards IIS-0546457, DBI-0543365, and CCF-0755373) and by the University of Connecticut Research Foundation
182
Click again for helper slides
183
HAPMAP: The International HAPMAP Project is an organization whose goal is to develop a haplotype map of the human genome (the HapMap), which will describe the common patterns of human genetic variation.
HAPMAP is expected to be a key resource for researchers to find genetic variants affecting health, disease and responses to drugs and environmental factors. The information produced by the project is made freely available to researchers around the world.
The International HapMap Project is a collaboration among researchers at academic centers, non-profit biomedical research groups and private companies in Canada, China, Japan, Nigeria, the United Kingdom, and the United States.
184
Baum-Welch overview
The algorithm has two steps:
Calculating the forward probability and the backward probability for each HMM state;
Determining the frequency of the transition-emission pair values and dividing it by the probability of the entire sequence. This amounts to calculating the expected count of the particular transition-emission pair. Each time a particular transition is found, the value of the quotient of the transition divided by the probability of the entire sequence goes up, and this value can then be made the new value of the transition.
185
Error Detection Accuracy on Unrelated Genotype Data
[Figure: sensitivity (0 to 1) vs. FP rate (0 to 0.015); series: Len=10Kb (U), Len=100Kb (U), Len=1Mb (U), Len=10Mb (U); reference line: #FP=#FN]
551 unrelated individuals
Recombination & mutation rates of 10^-8 per generation per bp
35 SNPs within a region of 10kb-10Mb
186
TrioProb-Combined Results on Real Dataset
[Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3
26 SNP genotypes in 23 trios were identified as true errors
41*3-26=97 resequenced SNP genotypes agree with original calls (or are unknown)
            Total Signals      True Positives     False Positives    Unknown
FP Rate     1%    .5%   .1%    1%    .5%   .1%    1%    .5%   .1%    1%    .5%   .1%
Parents     218   127   69     9     9     8      1     0     0      208   118   61
Children    104   74    24     11    11    11     3     3     2      90    60    11
Total       322   201   93     20    20    19     4     3     2      298   178   72
187
Error Model Comparison
[Figure: sensitivity (0 to 1) vs. FP rate (0 to 0.015); series: random allele, random geno, hetero-to-homo, homo-to-hetero, each for (P) and (C)]
188
Effect of Population Size
[Figure: sensitivity (0 to 1) vs. FP rate (0 to 0.015); series: n=551, n=129, n=30, each for (P) and (C)]
189
Binomial Distribution
n: # of successive trials (reads)
p: the probability of a success (correct map)
q = 1-p: the probability of a failure (incorrect map)
To call a heterozygous genotype, each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01
190
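The heterozygous calling rule can be sketched as follows. The slide does not say which binomial probability is compared with 0.01; this example assumes it is the probability mass of the observed allele counts under equal sampling of the two alleles (p = 1/2). The function names are illustrative and this is not the exact test of [Wheeler et al 08].

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    # Probability of exactly k successes in n independent trials
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def call_heterozygous(n0, n1, threshold=0.01):
    # n0, n1: number of reads showing allele 0 and allele 1 at the locus
    if n0 < 1 or n1 < 1:
        return False                    # each allele needs at least one read
    return binomial_pmf(n1, n0 + n1) >= threshold

balanced = call_heterozygous(3, 3)      # pmf = 20/64
skewed = call_heterozygous(11, 1)       # pmf = 12/4096
```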
Conditional Probability for Heterozygous Genotypes
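$P(\mathbf{r}_i \mid G_i = 1) = \prod_{r \in \mathbf{r}_i} \left[ \frac{1}{2}(1-\varepsilon_{r(i)}) + \frac{1}{2}\varepsilon_{r(i)} \right] = \left(\frac{1}{2}\right)^{c_i}$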
191
Model Training- Details
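Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father
P(gi|hi,h’i) set to 1 if hi+h’i=gi and to 0 otherwise
$P(R_{i,j} = r \mid G_i = g_i) = \frac{2-g_i}{2}\,\varepsilon_{r(i)}^{\,r(i)}\,(1-\varepsilon_{r(i)})^{1-r(i)} + \frac{g_i}{2}\,\varepsilon_{r(i)}^{\,1-r(i)}\,(1-\varepsilon_{r(i)})^{r(i)}$
This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:
$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i)=0} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=1} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 1) = \left(\frac{1}{2}\right)^{c_i}$
$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i)=1} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=0} \varepsilon_{r(i)}$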
192
Implementation Details
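$E^i_{f_i,f'_i}(g_i) = \sum_{h_i,h'_i \in \{0,1\},\; h_i+h'_i=g_i} P(h_i \mid f_i)\, P(h'_i \mid f'_i)$
Forward recurrences:
Initial value for founder pair $(f_1, f'_1)$: $P(f_1)\,P(f'_1)$
Each later value for $(f_i, f'_i)$ sums, over $f_{i-1}, f'_{i-1} = 1..K$, the previous forward value times the locus $i-1$ emission term and the transition probabilities $P(f_i \mid f_{i-1})\,P(f'_i \mid f'_{i-1})$
Backward recurrences are similar
193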
Experimental Results-Read Data
Subset of James Watson’s 454 reads
74.4 million reads with quality scores (of 106.5 million reads used in [Wheeler et al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/
Average read length ~265 bp
[Figure: read length distribution; x-axis: read length (0 to 1200 bp); y-axis: number of reads (0 to 2,000,000)]
194
Experimental Results-Read Data
Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]
Default nucmer parameters (MUM size 20, min cluster size 65, max gap between adjacent matches 90)
Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)
Reads meeting above conditions at multiple genome positions (likely coming from genomic repeats) were discarded
Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:
FP rate: 0.37% FN rate: 21.16%
195
Experimental Results-Read Data
Average coverage by mapped reads of Hapmap SNPs was 5.64x
Lower than [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints
[Figure: SNP coverage by mapped reads; x-axis: coverage (0 to 30); y-axis: number of SNPs (0 to 600,000)]
196
Experimental Results-Read Data
CEU genotypes from latest Hapmap release (23a) were downloaded from http://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/
Genotypes were phased using the ENT algorithm [Gusev et al 08] and inferred haplotypes were used to train the parent HMMs using Baum-Welch
Duplicate Affymetrix 500k SNP genotypes were downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz
We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations had more than 5% difference in CEU same-strand allele frequency
197
Accuracy Comparison (Homozygous Genotypes)
[Figure: % correctly called (97 to 100) vs. % uncalled (0 to 100); series: 1SNP-Posterior, Binomial0.01, HMM-Posterior]
198
Gene-Seq Algorithms
Posterior Decoding (see presentation)
Greedy:
Markov Approximation:
Composite 2-SNP Viterbi Posterior Decoding, version 2
199
200
Introduction Helper
Linkage Analysis: Study aimed at establishing linkage between genes. Today linkage analysis serves as a way of gene-hunting and genetic testing.
Genetic linkage is the tendency for genes and other genetic markers to be inherited together because of their location near one another on the same chromosome.
Relative Risk is the risk of an event (or of developing a disease) relative to exposure.
Relative risk is a ratio of the probability of the event occurring in the exposed group versus a non-exposed group.
For example, if the probability of developing lung cancer among smokers was 20% and among non-smokers 1%, then the relative risk of cancer associated with smoking would be 20. Smokers would be twenty times as likely as non-smokers to develop lung cancer.
Association Analysis: Test for association between a genetic variation (e.g. SNP) and one or more quantitative traits
201
Introduction-Disease Gene Mapping
Linkage analysis
Very successful for Mendelian diseases (cystic fibrosis, Huntington’s,…)
Low power to detect genes with small relative risk in complex diseases [Risch & Merikangas 96]
Association analysis
Genome-wide scans made possible by recent progress in Single Nucleotide Polymorphism (SNP) genotyping technologies
[Figure: cases vs. controls]
202
Genotype Error Detection-Motivation
Even low error levels can have large effects for some study designs (e.g. rare alleles, haplotype-based)
Errors as low as .1% can increase Type I error rates in haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp&Becker04]
1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]
203
Genotype Error Detection-Motivation
Improved genotype calling algorithms [Marchini et al. 07, Nicolae et al. 06, Rabbee & Speed 05, Xiao et al. 07]
Explicit modeling in analysis methods [Cheng 07, Hao & Wang 04, Liu et al. 07]
Computationally complex
Separate error detection step
Detected errors can be retyped, imputed, or ignored in downstream analyses
Common approach in pedigree genotype data analysis [Abecasis et al. 02, Douglas et al. 00, Sobel et al. 02]
204
Complexity of Computing Maximum Phasing Probability
• For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of $O(f^{1/2-\epsilon})$ unless ZPP=NP, where f is the number of founders
• For trios, hard to approximate within $O(f^{1/4-\epsilon})$
• Reductions from the clique problem
205
NGS Applications
Besides reducing costs of de novo genome sequencing, NGS has found many more apps:
Resequencing, transcriptomics (RNA-Seq), gene regulation (non-coding RNAs, transcription factor binding sites using ChIP-Seq), epigenetics (methylation, nucleosome modifications), metagenomics, paleogenomics, …
NGS is enabling personal genomics
James Watson genome [Wheeler et al 08] sequenced using 454 technology for ~$1 million compared to ~$100 million for the Sanger-sequenced Venter genome [Levy et al 07]
Thousands more individual genomes to be sequenced as part of 1000 Genomes Project
206
Challenges in Medical Applications of Sequencing
Medical sequencing focuses on genetic variation (SNPs, CNVs, genome rearrangements)
Requires accurate determination of both alleles at variable loci
Accuracy is limited by coverage depth due to random nature of shotgun sequencing
For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), only 75-80% accuracy achieved for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08].
[Wheeler et al 08] use hypothesis testing based on binomial distribution
To call a heterozygous genotype each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01
[Wendl&Wilson 08] predict that 21x coverage will be required for sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs”
207
Prior Methods for Calling SNP Genotypes from Read Data
Prior methods are all based on allele coverage
[Levy et al 07] require that each allele be covered by at least 2 reads in order to be called
[Wheeler et al 08] use hypothesis testing based on the binomial distribution
To call a heterozygous genotype, each allele must be covered by at least one read and the binomial probability for the observed number of 0 and 1 alleles must be at least 0.01
[Wendl&Wilson 08] generalize these methods by allowing an arbitrary minimum allele coverage k
208
Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads
S. Dinakar (1), Y. Hernandez (2), J. Kennedy (1), I. Mandoiu (1), and Y. Wu (1)
(1) CSE Department, University of Connecticut; (2) Department of Computer Science, Hunter College
Motivation
• Medical sequencing focuses on genetic variation (SNPs, CNVs,…)
– Requires determination of both alleles at variable loci; however, this is limited by coverage depth due to random nature of shotgun sequencing
– For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips shows ~75% accuracy for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08]
• [Wendl&Wilson 08] estimate that 21x coverage will be required for medical sequencing of normal tissue samples
– Based on idealized theory that “neglects heuristic inputs”
• We propose genotype calling methods that exploit the following heuristic inputs
– Quality scores reflecting uncertainty in sequencing data
– Allele frequency and linkage disequilibrium info extracted from reference panels such as Hapmap
• Do heuristic inputs help?
– Experiments on a subset of James Watson 454 reads show that accuracy achieved by the binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by our methods using less than 1/4 of the reads
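Single SNP Genotype Calling
• Let r_i denote the arbitrarily ordered set of mapped reads covering SNP locus i and c_i = |r_i|
– For a read r in r_i, r(i) denotes the allele observed at locus i
– If q_{r(i)} is the phred quality score of r(i), the probability that r(i) is incorrect is given by $\varepsilon_{r(i)} = 10^{-q_{r(i)}/10}$
• The probability of observing read set r_i conditional on having genotype G_i:
$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i)=0} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=1} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 1) = \left(\frac{1}{2}\right)^{c_i}$
$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i)=1} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=0} \varepsilon_{r(i)}$
• Applying Bayes’ formula:
$P(G_i = g_i \mid \mathbf{r}_i) = \frac{P(G_i = g_i)\, P(\mathbf{r}_i \mid G_i = g_i)}{\sum_{g \in \{0,1,2\}} P(G_i = g)\, P(\mathbf{r}_i \mid G_i = g)}$
– Where the priors $P(G_i = g_i)$ are obtained from allele frequencies inferred from a representative panel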
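Multilocus Model
[Figure: HF-HMM graphical model. Two founder chains F_1 ... F_n and F'_1 ... F'_n emit haplotype alleles H_i and H'_i, which together determine genotype G_i; each G_i emits the reads R_{i,1} ... R_{i,c_i} covering locus i]
HMMs representing LD in populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al. 08, Kennedy et al. 08]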
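Model Training
• Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father
• P(gi|hi,h’i) set to 1 if hi+h’i=gi and to 0 otherwise
• $P(R_{i,j} = r \mid G_i = g_i) = \frac{2-g_i}{2}\,\varepsilon_{r(i)}^{\,r(i)}\,(1-\varepsilon_{r(i)})^{1-r(i)} + \frac{g_i}{2}\,\varepsilon_{r(i)}^{\,1-r(i)}\,(1-\varepsilon_{r(i)})^{r(i)}$
– This implies that conditional probabilities for sets of reads are given by the formulas derived for the single SNP case:
$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\, r(i)=0} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=1} \varepsilon_{r(i)}$
$P(\mathbf{r}_i \mid G_i = 1) = \left(\frac{1}{2}\right)^{c_i}$
$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\, r(i)=1} (1-\varepsilon_{r(i)}) \prod_{r \in \mathbf{r}_i,\, r(i)=0} \varepsilon_{r(i)}$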
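• Joint probabilities can be computed using a forward-backward algorithm:
$P(g_i, \mathbf{r}) = \sum_{f_i=1}^{K} \sum_{f'_i=1}^{K} \alpha^i_{f_i,f'_i}\, E^i_{f_i,f'_i}(g_i)\, P(\mathbf{r}_i \mid g_i)\, \beta^i_{f_i,f'_i}$
where $\alpha^i$ and $\beta^i$ denote the forward and backward probabilities for founder pair $(f_i, f'_i)$ at locus i, and $E^i_{f_i,f'_i}(g_i)$ is the probability of genotype $g_i$ given that founder pair
• Direct implementation gives O(m+nK^4) time, where
– m = number of reads
– n = number of SNPs
– K = number of founder haplotypes in HMMs
• Runtime reduced to O(m+nK^3) using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]
Posterior Decoding Algorithm
1. For each i = 1..n, compute $g^*_i = \arg\max_{g_i} P(g_i \mid \mathbf{r}) = \arg\max_{g_i} P(g_i, \mathbf{r})$
2. Return $g^* = (g^*_1, \ldots, g^*_n)$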
Experimental Setup
• Read data
– 74.4 million reads with quality scores (of 106.5 million reads used in [Wheeler et al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/
– Average read length ~265 bp
• Reference population data
– CEU genotypes from latest Hapmap release (23a) from http://ftp.hapmap.org/genotypes/latest_ncbi_build36/forward/non-redundant/
– Genotypes were phased using the ENT algorithm [Gusev et al. 08] and inferred haplotypes were used to train the parent HMMs using the Baum-Welch algorithm
• Genotype data
– Duplicate Affymetrix 500k SNP genotypes downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz
– We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations had more than 5% difference in CEU same-strand allele frequency
Read Mapping Procedure
• Reads mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]
– Default nucmer parameters:
• MUM size 20, min cluster size 65, max gap between adjacent matches 90
– Additional filtering
• At least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)
– Reads meeting above conditions at multiple genome positions were discarded
• These reads likely come from genomic repeats and are difficult to map accurately
• Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:
– Estimated mapping FP rate is 0.37%
Read Mapping Results
[Figure: Hapmap SNP coverage by mapped reads; x-axis: coverage (0 to 30); y-axis: number of SNPs (0 to 600,000)]
• Average coverage by mapped reads of Hapmap SNPs was 5.64x
– Lower than [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints
Genotyping Accuracy at 5.64x Read Coverage
[Figure, Homozygous Genotypes: % correctly called (97 to 100) vs. % uncalled (0 to 50); series: 1SNP-Posterior, Binomial0.01, HMM-Posterior]
[Figure, Heterozygous Genotypes: % correctly called (80 to 100) vs. % uncalled (0 to 50); series: 1SNP-Posterior, Binomial0.01, HMM-Posterior]
Conclusions & Ongoing Work
• Exploiting “heuristic inputs” such as quality scores, population allele frequency, and LD information yields significant improvements in genotype calling accuracy from low-coverage sequencing data
– Improvement depends on the coverage depth (higher at lower coverage)
• Accuracy achieved by binomial test of [Wheeler et al. 08] for 5.6x mapped read coverage is achieved by HMM-based posterior decoding using less than 1/4 of the reads; accuracy achieved by the binomial test for 2.8x coverage is achieved using less than 1/8 of the reads
– Small gain from incorporating quality scores may be due to poor calibration of 454 quality scores [Brockman et al. 08, Quinlan et al. 08]
– Although the evaluation is on 454 reads, our methods are well-suited for sequencing technologies with shorter reads
• Ongoing work includes modeling ambiguities in read mapping and extending the methods to population sequencing data (removing the need for reference panels)
ACKNOWLEDGEMENTS
This work was supported in part by NSF under awards IIS-0546457 and DBI-0543365 to IM and IIS-0803440 to YW. SD and YH performed this research as part of the Summer REU program “Bio-Grid Initiatives for Interdisciplinary Research and Education" funded by NSF under award CCF-0755373.
Genotyping Accuracy at Lower Read Coverage
[Figure: x-axis: % uncalled (0 to 100); y-axis: % correctly called (70 to 100); series: HMM-Posterior and Binomial0.01, each using 1/16, 1/8, 1/4, 1/2, and 1/1 of the reads]
Proposed Pipeline for Genotype Calling from Short Reads
[Figure: pipeline diagram with sample data. Inputs: read sequences and quality scores (e.g. read gnl|ti|1779718824, name EI1W3PE02ILQXT), the reference genome sequence (e.g. contig NT_113796.1, Homo sapiens chromosome 1), and Hapmap genotypes. Intermediate: mapped reads. Output: SNP genotype calls, listed as SNP ID, two alleles, and a confidence value, e.g. "rs12095710 T T 9.988139e-01".]
GEDI-ADMX presentation from earlier this year (Ft. Lauderdale)
Imputation-based local ancestry inference in admixed populations
Justin Kennedy
Computer Science and Engineering Department
University of Connecticut
Joint work with I. Mandoiu and B. Pasaniuc
211
Outline
Introduction
Factorial HMM of genotype data
Algorithms for genotype imputation and ancestry inference
Preliminary experimental results
Conclusion
212
Introduction- Motivation: Admixture mapping
[Figure from Patterson et al, AJHG 74:979-1000, 2004]
213
Introduction- Local ancestry inference problem
Given:
• Reference haplotypes for ancestral populations P1,…,PN
• Whole-genome SNP genotype data for extant individual
Find: Allele ancestries at each SNP locus
[Figure: reference haplotype panels shown as rows of 0/1/? alleles]
SNP          SNP genotype   Inferred local ancestry
rs11095710   T T            P1 P1
rs11117179   C T            P1 P1
rs11800791   G G            P1 P1
rs11578310   G G            P1 P2
rs1187611    G G            P1 P2
rs11804808   C C            P1 P2
rs17471518   A G            P1 P2
...
214
Introduction- Previous work
• MANY methods
– Ancestry inference at different granularities, assuming different kinds/amounts of info about genetic makeup of ancestral populations
• Two main classes of methods
– HMM-based (exploit LD): SABER [Tang et al 06], SWITCH [Sankararaman et al 08a], HAPAA [Sundquist et al. 08], …
– Window-based (unlinked SNP data): LAMP [Sankararaman et al 08b], WINPOP [Pasaniuc et al. 09]
• Poor accuracy when ancestral populations are closely related (e.g. Japanese and Chinese)
– Methods based on unlinked SNPs outperform methods that model LD!
215
Outline
Introduction
Factorial HMM of genotype data
Algorithms for genotype imputation and ancestry inference
Preliminary experimental results
Conclusion
216
Haplotype structure in panmictic populations
217
HMM of haplotype frequencies
Similar models proposed in [Schwartz 04, Rastas et al. 05, Kennedy et al. 07, Kimmel&Shamir 05, Scheet&Stephens 06,…]
K = 4 (# founders)
n = 5 (# SNPs)
218
Graphical model representation
Random variables for each locus i (i=1..n)
– Fi = founder haplotype at locus i; values between 1 and K
– Hi = observed allele at locus i; values: 0 (major) or 1 (minor)
Model training
– Based on reference haplotypes using Baum-Welch alg, or
– Based on unphased genotypes using EM [Rastas et al. 05]
Given haplotype h, P(H=h|M) can be computed in O(nK^2) using a forward algorithm, where n=#SNPs, K=#founders
[Figure: chain of founder states F1, F2, …, Fn, each emitting the observed allele H1, H2, …, Hn]
219
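The O(nK^2) forward computation of P(H=h|M) mentioned on this slide can be sketched as below. The parameter layout (`init`, `trans`, `emit`) and the toy numbers are assumptions made for illustration; they are not the parameterization of any particular software package.

```python
def haplotype_likelihood(h, init, trans, emit):
    """P(H = h | M) by the forward algorithm in O(n K^2) time.

    Assumed (hypothetical) parameter layout, with 0-based loci:
      init[f]        = P(F_0 = f)
      trans[i][a][b] = P(F_{i+1} = b | F_i = a)
      emit[i][f]     = P(H_i = 1 | F_i = f)   (minor-allele probability)
    """
    n, K = len(h), len(init)

    def e(i, f):
        return emit[i][f] if h[i] == 1 else 1.0 - emit[i][f]

    # fwd[f] = P(h_0..h_i, F_i = f)
    fwd = [init[f] * e(0, f) for f in range(K)]
    for i in range(1, n):
        fwd = [
            e(i, b) * sum(fwd[a] * trans[i - 1][a][b] for a in range(K))
            for b in range(K)
        ]
    return sum(fwd)


# Toy model with K = 2 founders and n = 2 SNPs (made-up numbers).
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.2, 0.8]]]
emit = [[0.1, 0.9], [0.3, 0.6]]
print(haplotype_likelihood([1, 0], init, trans, emit))
```

Each locus costs K^2 multiplications (every previous founder against every current founder), which gives the O(nK^2) bound.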
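Factorial HMM for genotype data in a window with known local ancestry
[Figure: factorial HMM $M_{kl}$ with two founder chains F1, F2, …, Fn and F'1, F'2, …, F'n emitting alleles H1, H2, …, Hn and H'1, H'2, …, H'n, which jointly determine the genotypes G1, G2, …, Gn]
Random variable for each locus i (i=1..n)
– Gi = genotype at locus i; values: 0/1/2 (major hom./het./minor hom.)
220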
Outline
Introduction
Factorial HMM of genotype data
Algorithms for genotype imputation and ancestry inference
Preliminary experimental results
Conclusion
221
HMM Based Genotype Imputation
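Probability of observing genotype x at locus i given the known multilocus genotype g with missing data at i:
$$P(g_i = x \mid g, M) \propto P(g[g_i \leftarrow x] \mid M)$$
where $g[g_i \leftarrow x]$ denotes g with the genotype at locus i set to x.
$g_i$ is imputed as
$$\arg\max_{x \in \{0,1,2\}} P(g[g_i \leftarrow x] \mid M)$$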
222
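The imputation rule can be sketched independently of the model: evaluate the multilocus likelihood with each of the three candidate genotypes substituted at locus i, normalize, and take the maximizer. In the sketch below `toy_likelihood` is a hypothetical stand-in (independent loci in Hardy-Weinberg proportions) used only to make the example runnable; in the setting of the slides the likelihood would come from the factorial HMM.

```python
def impute_genotype(g, i, likelihood):
    """Return (best genotype, posterior over {0,1,2}) for locus i.

    likelihood(candidate) must return P(candidate | M) for a complete
    multilocus genotype; the posterior is obtained by normalizing the
    three values P(g[g_i <- x] | M), x in {0, 1, 2}.
    """
    scores = []
    for x in (0, 1, 2):
        candidate = list(g)
        candidate[i] = x
        scores.append(likelihood(candidate))
    total = sum(scores)
    posterior = [s / total for s in scores]
    best = max((0, 1, 2), key=lambda x: posterior[x])
    return best, posterior


def toy_likelihood(g, maf=(0.1, 0.5, 0.3)):
    """Hypothetical stand-in model: independent loci, Hardy-Weinberg proportions."""
    p = 1.0
    for gj, q in zip(g, maf):
        p *= ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)[gj]
    return p


g = [None, 1, 0]  # genotype at locus 0 is missing
best, posterior = impute_genotype(g, 0, toy_likelihood)
print(best, posterior)
```

The normalized values also give the posterior probability used later to score how well a model re-imputes a typed SNP.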
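Forward-backward computation
[Figure: slice of the factorial HMM at locus i, with founder states $f_i$ and $f'_i$, alleles $h_i$ and $h'_i$, and genotype $g_i$]
$$P(g \mid M) = \sum_{f_i=1}^{K} \sum_{f'_i=1}^{K} \alpha^{i}_{f_i,f'_i}\, E^{i}_{f_i,f'_i}(g_i)\, \beta^{i}_{f_i,f'_i}$$
where $\alpha^{i}$ and $\beta^{i}$ denote the forward and backward probabilities at locus i and $E^{i}$ the genotype emission probability.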
223
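Runtime
Direct recurrences for computing forward probabilities $\alpha$ take O(nK^4) time:
$$\alpha^{1}_{f_1,f'_1} = P(f_1)\,P(f'_1)$$
$$\alpha^{i}_{f_i,f'_i} = \sum_{f_{i-1}=1}^{K} \sum_{f'_{i-1}=1}^{K} \alpha^{i-1}_{f_{i-1},f'_{i-1}}\, E^{i-1}_{f_{i-1},f'_{i-1}}(g_{i-1})\, P(f_i \mid f_{i-1})\, P(f'_i \mid f'_{i-1})$$
Runtime reduced to O(nK^3) by reusing common terms:
$$\alpha^{i}_{f_i,f'_i} = \sum_{f_{i-1}=1}^{K} \gamma^{i}_{f_{i-1},f'_i}\, P(f_i \mid f_{i-1})$$
where
$$\gamma^{i}_{f_{i-1},f'_i} = \sum_{f'_{i-1}=1}^{K} \alpha^{i-1}_{f_{i-1},f'_{i-1}}\, E^{i-1}_{f_{i-1},f'_{i-1}}(g_{i-1})\, P(f'_i \mid f'_{i-1})$$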
227
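The runtime reduction can be checked numerically with a small sketch that implements both the direct O(nK^4) recurrence and the O(nK^3) variant that first sums out one chain's previous state. The parameter layout, the assumption that both chains have the same K, and the random toy models are illustrative choices, not details taken from the slides.

```python
import itertools
import random


def emission(g, p1, p2):
    """P(G = g | founder pair), given each founder's minor-allele probability."""
    if g == 0:
        return (1 - p1) * (1 - p2)
    if g == 1:
        return p1 * (1 - p2) + (1 - p1) * p2
    return p1 * p2


def _weighted(alpha, g_prev, e1, e2, K):
    # alpha^{i-1} multiplied by the emission of g_{i-1}
    return [[alpha[a][b] * emission(g_prev, e1[a], e2[b]) for b in range(K)]
            for a in range(K)]


def _likelihood(g, hmm1, hmm2, fast):
    (init1, trans1, emit1), (init2, trans2, emit2) = hmm1, hmm2
    n, K = len(g), len(init1)  # assumes both chains have K founders
    alpha = [[init1[a] * init2[b] for b in range(K)] for a in range(K)]
    for i in range(1, n):
        w = _weighted(alpha, g[i - 1], emit1[i - 1], emit2[i - 1], K)
        t1, t2 = trans1[i - 1], trans2[i - 1]
        if fast:
            # gamma[a][d]: sum out the second chain's previous state (K^3 work)
            gamma = [[sum(w[a][b] * t2[b][d] for b in range(K)) for d in range(K)]
                     for a in range(K)]
            alpha = [[sum(gamma[a][d] * t1[a][c] for a in range(K)) for d in range(K)]
                     for c in range(K)]
        else:
            # direct double sum for each of the K^2 target pairs (K^4 work)
            alpha = [[sum(w[a][b] * t1[a][c] * t2[b][d]
                          for a in range(K) for b in range(K))
                      for d in range(K)] for c in range(K)]
    w = _weighted(alpha, g[-1], emit1[-1], emit2[-1], K)
    return sum(map(sum, w))


def likelihood_direct(g, hmm1, hmm2):
    return _likelihood(g, hmm1, hmm2, fast=False)


def likelihood_fast(g, hmm1, hmm2):
    return _likelihood(g, hmm1, hmm2, fast=True)


def random_hmm(n, K, rng):
    """Random toy parameters: (init, trans, emit)."""
    def dist():
        v = [rng.random() + 0.1 for _ in range(K)]
        s = sum(v)
        return [x / s for x in v]
    return (dist(),
            [[dist() for _ in range(K)] for _ in range(n - 1)],
            [[rng.random() for _ in range(K)] for _ in range(n)])


rng = random.Random(0)
g_obs = [0, 1, 2, 1]
hmm_k = random_hmm(len(g_obs), 3, rng)
hmm_l = random_hmm(len(g_obs), 3, rng)
print(likelihood_direct(g_obs, hmm_k, hmm_l), likelihood_fast(g_obs, hmm_k, hmm_l))
```

Both functions return the same likelihood up to rounding; only the amount of work per locus differs.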
Imputation-based ancestry inference
• View local ancestry inference as a model selection problem
– Each possible local ancestry defines a factorial HMM $M_{kl}$ ($M_{11}$, $M_{12}$, $M_{22}$ in the case of two ancestral populations)
– Compute $P(g_i = x \mid g, M_{k,l})$ for all possible k, l, i, x values
– Pick the model that re-imputes SNPs most accurately around the locus i
• Fixed-window version: pick ancestry that maximizes the average posterior probability of the SNP genotypes within a fixed-size window centered at the locus
• Multi-window version: weighted voting over window sizes between 200-3000, with window weights proportional to average posterior probabilities
228
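Both versions can be sketched on top of precomputed per-locus posteriors. In the sketch below, `post[(k, l)][j]` is assumed to hold the posterior probability of the observed genotype at locus j under model M_kl. The multi-window rule shown (each window size votes for its best pair, weighted by that pair's average posterior) is one plausible reading of "weighted voting" and may differ from the actual implementation; the toy numbers are made up.

```python
def window_bounds(i, w, n):
    """Inclusive 0-based bounds of the window of half-size w around locus i."""
    return max(0, i - w), min(n - 1, i + w)


def best_pair(post, i, w):
    """Return (ancestry pair, average posterior) maximizing the window average."""
    n = len(next(iter(post.values())))
    lo, hi = window_bounds(i, w, n)
    size = hi - lo + 1
    averages = {pair: sum(p[lo:hi + 1]) / size for pair, p in post.items()}
    pair = max(sorted(averages), key=averages.get)
    return pair, averages[pair]


def fixed_window_ancestry(post, w):
    """Fixed-window version: one window half-size w for every locus."""
    n = len(next(iter(post.values())))
    return [best_pair(post, i, w)[0] for i in range(n)]


def multi_window_ancestry(post, half_sizes):
    """Multi-window version (assumed rule): weighted vote over window sizes."""
    n = len(next(iter(post.values())))
    calls = []
    for i in range(n):
        votes = {}
        for w in half_sizes:
            pair, avg = best_pair(post, i, w)
            votes[pair] = votes.get(pair, 0.0) + avg
        calls.append(max(sorted(votes), key=votes.get))
    return calls


# Made-up posteriors for 8 loci: the first half fits M_11, the second half M_12.
post = {
    (1, 1): [0.9] * 4 + [0.5] * 4,
    (1, 2): [0.6] * 4 + [0.9] * 4,
    (2, 2): [0.3] * 8,
}
print(fixed_window_ancestry(post, 1))
print(multi_window_ancestry(post, (1, 2)))
```

Averaging over a window smooths single-locus noise, at the cost of blurring the exact position of an ancestry switch when the window is large.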
Imputation-based ancestry inference
Local ancestry at a locus is an unordered pair of (not necessarily distinct) ancestral populations.
Observations:
• The local ancestry of a SNP locus is typically shared with neighboring loci.
– Small window sizes may not provide enough information
– Large window sizes may violate the local ancestry property for neighboring loci
• When using the true values of k,l in $M_{kl}$, the accuracy of SNP genotype imputation within such a neighborhood is typically higher than when using a mis-specified model.
229
Outline
Introduction
Factorial HMM of genotype data
Algorithms for genotype imputation and ancestry inference
Preliminary experimental results
Conclusion
230
HMM imputation accuracy
Missing data rate and accuracy for imputed genotypes at different thresholds (WTCCC 58BC/Hapmap CEU)
231
Window size effect
[Figure: window size effect. Parameters as printed: N=2,000, g=7, [symbol missing]=0.2, n=38,864, r=10^-8]
232
Number of founders effect
[Figure: number of founders effect, CEU-JPT. Parameters as printed: N=2,000, g=7, [symbol missing]=0.2, n=38,864, r=10^-8]
233
Comparison with other methods
[Figure: % of correctly recovered SNP ancestries. Parameters as printed: N=2,000, g=7, [symbol missing]=0.2, n=38,864, r=10^-8]
234
Untyped SNP imputation error rate in admixed individuals
[Figure: untyped SNP imputation error rate. Parameters as printed: N=2,000, g=7, [symbol missing]=0.5, n=38,864, r=10^-8]
235
Outline
Introduction
Factorial HMM of genotype data
Algorithms for genotype imputation and ancestry inference
Preliminary experimental results
Conclusion
236
Conclusion-Summary and ongoing work
• Imputation-based local ancestry inference achieves significant improvement over previous methods for admixtures between close ancestral populations
– Code at http://dna.engr.uconn.edu/software/
• Ongoing work
– Evaluating accuracy under more realistic admixture scenarios (multiple ancestral populations/gene flow/drift in ancestral populations)
– Extension to pedigree data
– Exploiting inferred local ancestry for more accurate untyped SNP imputation and phasing of admixed individuals
– Extensions to sequencing data
– Inference of ancestral haplotypes from extant admixed populations
237
Questions?
238
1. L.E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164-171, 1970.
2. The Wellcome Trust Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661-678, 2007.
3. Z. Ghahramani and M.I. Jordan. Factorial hidden Markov models. Mach. Learn., 29(2-3):245-273, 1997.
4. J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype error detection using hidden Markov models of haplotype diversity. Journal of Computational Biology, 15(9):1155-1171, 2008.
5. J. Kennedy, B. Pasaniuc, and I.I. Mandoiu. GEDI: Genotype error detection and imputation using hidden Markov models of haplotype diversity, manuscript in preparation. Software available at http://dna.engr.uconn.edu/software/gedi/ .
6. G. Kimmel and R. Shamir. A block-free hidden Markov model for genotypes and its application to disease association. Journal of Computational Biology, 12:1243-1260, 2005.
7. Y. Li and G. R. Abecasis. Mach 1.0: Rapid haplotype reconstruction and missing genotype inference. American Journal of Human Genetics, 79:2290, 2006.
8. J. Marchini, C. Spencer, Y.Y. Teo, and P. Donnelly. A Bayesian hierarchical mixture model for genotype calling in a multi-cohort study. In preparation, 2007.
9. B. Pasaniuc, S. Sankararaman, G. Kimmel, and E. Halperin. Inference of locus-specific ancestry in closely related populations (under review).
10. E. J. Parra, A. Marcini, J. Akey, J. Martinson, M. A. Batzer, R. Cooper, T. Forrester, D. B. Allison, R. Deka, R. E. Ferrell, et al. Estimating African American admixture proportions by use of population-specific alleles. Am J Hum Genet, 63(6):1839-1851, December 1998.
11. P. Rastas, M. Koivisto, H. Mannila, and E. Ukkonen. Phasing genotypes using a hidden Markov model. In I.I. Mandoiu and A. Zelikovsky, editors, Bioinformatics Algorithms: Techniques and Applications, pages 355-372. Wiley, 2008.
12. D. Reich and N. Patterson. Will admixture mapping work to find disease genes? Philos Trans R Soc Lond B Biol Sci, 360:1605-1607, 2005.
13. S. Sankararaman, G. Kimmel, E. Halperin, and M.I. Jordan. On the inference of ancestries in admixed populations. Genome Research, 18:668-675, 2008.
14. S. Sankararaman, S. Sridhar, G. Kimmel, and E. Halperin. Estimating local ancestry in admixed populations. American Journal of Human Genetics, 82(2):290-303, 2008.
15. P. Scheet and M. Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics, 78:629-644, 2006.
16. R. Schwartz. Algorithms for association study design using a generalized model of haplotype conservation. In Proc. CSB, pages 90-97, 2004.
17. M. W. Smith, N. Patterson, J. A. Lautenberger, A. L. Truelove, G. J. McDonald, A. Waliszewska, B. D. Kessing, M. J. Malasky, C. Scafe, E. Le, et al. A high-density admixture map for disease gene discovery in African Americans. Am J Hum Genet, 74(5):1001-1013, May 2004.
18. A. Sundquist, E. Fratkin, C.B. Do, and S. Batzoglou. Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome Research, 18(4):676-682, 2008.
19. H. Tang, M. Coram, P. Wang, X. Zhu, and N. Risch. Reconstructing genetic ancestry blocks in admixed individuals. Am J Hum Genet, 79:1-12, 2006.
20. H. Tang, J. Peng, P. Wang, and N.J. Risch. Estimation of individual admixture: Analytical and study design considerations. Genetic Epidemiology, 28:289-301, 2005.
21. C. Tian, D. A. Hinds, R. Shigeta, R. Kittles, D. G. Ballinger, and M. F. Seldin. A genomewide single-nucleotide-polymorphism panel with high ancestry information for African American admixture mapping. Am J Hum Genet, 79:640-649, 2006.
22. http://www.hapmap.org/.
239
Acknowledgments
Work supported in part by NSF awards IIS-0546457 and DBI-0543365.
240
HELPER SLIDES-STARTS NEXT PAGE
241
Introduction- Single Nucleotide Polymorphisms
• Main form of variation between individual genomes: Single Nucleotide Polymorphisms (SNPs)
– Human genome density: 1.2x10^7 out of 3x10^9 base pairs
– Vast majority bi-allelic; 0/1 encoding (major/minor resp.)
• SNP Genotypes are critical to Disease-Gene Mapping
– One Method: Admixture Mapping
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcGgtatacacgggTctata …
two haplotypes per individual: 011000110 + 001100010
genotype: 012100120
242
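The relation between the two haplotypes and the genotype on this slide is a position-wise sum of minor-allele counts. A minimal sketch (the function name is illustrative):

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string.

    At each SNP the genotype is the number of minor alleles carried by
    the two haplotypes: 0 or 2 for homozygous, 1 for heterozygous.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    return "".join(str(int(a) + int(b)) for a, b in zip(h1, h2))


print(genotype_from_haplotypes("011000110", "001100010"))  # 012100120
```

The mapping is not invertible: at heterozygous positions the genotype does not say which haplotype carries the minor allele, which is why phasing is a separate inference problem.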
Helper Slide- Other Software
1) HMM-based methods:
  a) SABER
  b) SWITCH
  c) HAPAA
2) Window-based majority vote:
  a) LAMP (no recombination assumption)
  b) WINPOP (a more refined model of recombination events coupled with an adaptive window size computation to achieve increased accuracy)
243
Helper Slide- Probabilities
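$$P_{M_{kl}}(G_i = g \mid \mathbf{g}_{-i}) \propto P_{M_{kl}}(\mathbf{g}[g_i \leftarrow g])$$
– $G_i \in \{0,1,2\}$: random variable for the genotype at SNP i
– $\mathbf{g}[g_i \leftarrow g] = (g_1, \ldots, g_{i-1}, g, g_{i+1}, \ldots, g_n)$: multilocus genotype with i set to g
– $\mathbf{g}_{-i} = (g_1, \ldots, g_{i-1}, g_{i+1}, \ldots, g_n)$: multilocus genotype without i
– $g \in \{0,1,2\}$: genotype value taken at SNP i
– $M_{kl}$: HMM with ancestral pair k,l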
Helper Slide- Emission Details
$$E_{F^k_i, F^l_i}(g_i) = \sum_{\substack{h^k_i,\, h^l_i \in \{0,1\} \\ h^k_i + h^l_i = g_i}} P(h^k_i \mid F^k_i)\, P(h^l_i \mid F^l_i)$$
245
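The emission sum enumerates the allele pairs that are compatible with the observed genotype. A minimal sketch, assuming each founder state is summarized by its minor-allele emission probability (the parameter names and the numbers are illustrative):

```python
def genotype_emission(g, p_minor_k, p_minor_l):
    """E(g): sum over allele pairs (h_k, h_l) in {0,1}^2 with h_k + h_l = g
    of P(h_k | F_k) * P(h_l | F_l).

    p_minor_k and p_minor_l are the probabilities that the current founder
    state of each chain emits the minor allele (hypothetical parameter names).
    """
    def allele_prob(h, p_minor):
        return p_minor if h == 1 else 1.0 - p_minor

    return sum(
        allele_prob(hk, p_minor_k) * allele_prob(hl, p_minor_l)
        for hk in (0, 1)
        for hl in (0, 1)
        if hk + hl == g
    )


for g in (0, 1, 2):
    print(g, genotype_emission(g, 0.2, 0.7))
```

Only the heterozygous genotype has two compatible allele pairs, so E(1) is the only value with two terms, and the three values sum to 1.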
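Window-based local ancestry inference
Input
• For every $i$, $M_{kl}$, and $g \in \{0,1,2\}$: $P_{M_{kl}}(G_i = g \mid \mathbf{g}_{-i})$
• Window half-size $w$
Output (Single Window method)
• For every i=1..n: $\hat{a}_i \in A$, with
$$\hat{a}_i = \arg\max_{kl \in A} \frac{1}{|W_i|} \sum_{j \in W_i} P_{M_{kl}}(G_j = g_j \mid \mathbf{g}_{-j})$$
Where:
$$A = \{kl \mid 1 \le k \le l \le N\}$$
$$W_i = \{\max\{1, i-w\}, \ldots, \min\{n, i+w\}\}$$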
246
Window-based local ancestry inference (worked example, slides 247-252)
[Figure sequence: two ancestry tracks (k and l) over loci $a_1, \ldots, a_n$ for ancestral populations P1: YOR and P2: CEU, with window half-size w = 10. The frames step through loci i = 1, 2, 3, 4 with windows $W_1 = \{1,..,11\}$, $W_2 = \{1,..,12\}$, $W_3 = \{1,..,13\}$, $W_4 = \{1,..,14\}$, and then a general locus i with $W_i = \{i-10,..,i+10\}$. In each frame the window averages under $M_{11}$, $M_{12}$, and $M_{22}$ are compared and the best-scoring pair is recorded as $\hat{a}_i$; positions not yet inferred are drawn as "?".]
$$\hat{a}_i = \arg\max_{kl \in A} \frac{1}{|W_i|} \sum_{j \in W_i} P_{M_{kl}}(G_j = g_j \mid \mathbf{g}_{-j})$$
252
Experimental Results- GEDI 1-pop Imputation
1,444 individuals trained on HAPMAP
CEU haplotype reference panel
Imputed (after masking) 1% of SNPs on chromosome 22
253
Helper Slide-Software Overview
• Only require information about ancestral allele frequencies:
– LAMP
– WINPOP
– SWITCH (HMM-Based)
• Only require ancestral allele frequencies & genotypes:
– SABER (HMM-Based)
• Additionally use ancestral haplotype information:
– HAPAA (HMM-Based)
– GEDI-ADMX (HMM-Based)
254
Genotype Error Detection-Hidden Markov Model of Haplotype Diversity
Similar HMMs proposed by [Kimmel &Shamir 05, Rastas et al. 05, Schwartz 04]
Paths with high transition probability correspond to “founder” haplotypes
Haplotype sequence/paths computed using Viterbi and forward algorithms
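[Figure: HMM with K = #founders (e.g. K=4) and n = #SNPs (e.g. n=5), annotated with transition and emission probabilities, shown next to its graphical model representation: founder states F1, F2, …, Fi, …, Fn, each emitting the observed allele H1, H2, …, Hi, …, Hn]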
255
Genotype Error Detection-Hidden Markov Model of Haplotype Diversity
[Figure: graphical model representation with founder states F1, F2, …, Fi, …, Fn emitting alleles H1, H2, …, Hi, …, Hn; n = #SNPs]
Random variables for each locus i (i=1..n)
– Fi = founder haplotype at locus i; values between 1 and K
– Hi = observed allele at locus i; values: 0 (major) or 1 (minor)
Given haplotype h, P(H=h|M) can be computed in O(nK^2) using a forward algorithm, where n=#SNPs, K=#founders
256
Genotype Error Detection-Hidden Markov Model of Haplotype Diversity
[Figure: graphical model representation with founder states F1, F2, …, Fi, …, Fn emitting alleles H1, H2, …, Hi, …, Hn; n = #SNPs]
Training: 2-step algorithm that exploits pedigree info
– Step 1: Obtain haplotypes using either:
  • ENT: a pedigree-aware haplotype phasing algorithm based on entropy minimization
  • Haplotype reference panel (e.g. HAPMAP)
– Step 2: Train HMM based on inferred haplotypes, using Baum-Welch
257
Genotype Error Detection- Factorial HMM for multilocus genotype data
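[Figure: factorial HMM with two founder chains F1, F2, …, Fi, …, Fn, each emitting alleles H1, H2, …, Hi, …, Hn, which jointly determine the genotypes G1, G2, …, Gi, …, Gn; n = #SNPs]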
258
Genotype Error Detection- Factorial HMM for multilocus trio data
[Figure: factorial HMM for a trio with four founder chains F1, …, Fi, …, Fn, each emitting alleles H1, …, Hi, …, Hn, and three genotype rows labeled M1 … Mn, F1 … Fn, and G1 … Gn; n = #SNPs]
259