February 3, 2012
Statistical Machine learning from HIV genomic data using HMMs
Jedidiah Francis Twitter: @jedidiahfrancis Email: [email protected] Blog: jedidiahfrancis.com Mobile: 07917184089
Talk outline
§ Primer on Hidden Markov Models (HMMs)
§ Inference in HIV genomic data
§ Conclusion
Practical uses
Uses include:
§ finance (time series modeling), speech recognition, handwriting recognition, medicine (heart attack prediction), genomics (sequence analysis and alignment), robotics, meteorology (weather forecasting and modeling)
Introduction to HMMs
1st order Markov chain:
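$\Pr(X_t \mid X_1, X_2, \ldots, X_{t-1}) = \Pr(X_t \mid X_{t-1})$
[Figure: two-state Markov chain over states S and R, with transition probabilities 0.4, 0.6, 0.8 and 0.2, unrolled into a trellis of S and R states. The HMM version adds emissions T and W to each state, with probabilities 0.2, 0.8, 0.9 and 0.1.]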
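To make the generative view concrete, here is a minimal sketch that samples a hidden path and its observations from a two-state HMM. The assignment of the diagram's numbers to particular arrows, the uniform start distribution `START`, and the names `draw` and `sample_hmm` are all assumptions made for illustration; they are not taken from the slides.

```python
import random

STATES = ("S", "R")
# Assumed reading of the diagram: which number sits on which arrow is a guess.
TRANS = {"S": {"S": 0.8, "R": 0.2}, "R": {"S": 0.4, "R": 0.6}}
EMIT = {"S": {"T": 0.2, "W": 0.8}, "R": {"T": 0.9, "W": 0.1}}
START = {"S": 0.5, "R": 0.5}  # assumed; not given on the slide


def draw(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_hmm(length, seed=0):
    """Sample a hidden state path and the observations it emits."""
    rng = random.Random(seed)
    path, obs = [], []
    state = draw(START, rng)
    for _ in range(length):
        path.append(state)
        obs.append(draw(EMIT[state], rng))
        # First-order Markov property: the next state depends only on the current one.
        state = draw(TRANS[state], rng)
    return path, obs


path, obs = sample_hmm(20)
print("Path:       ", " ".join(path))
print("Observation:", " ".join(obs))
```

Only the observation row would be visible in practice; the path row is what the later problems try to recover or sum over.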
Problem 1
Given some model and parameters $\lambda = (A, B, \pi)$ and a sequence of observations $D$, compute $\Pr(D \mid \lambda)$.
Observation: W T T W T W W W T W T T T T W T T W T T
§ Naïve approach: sum over all possible paths ($2^{21} \approx 2.1$ million paths).
§ Luckily, we can use dynamic programming (the forward algorithm) to reduce this from $m^n$ operations to $mn$ (42).
§ A similar algorithm (the backward algorithm) does the same thing but in reverse order.
Solution 1
Algorithm: forward algorithm
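[Figure: trellis of states S and R at two consecutive positions, with an emitted observation T.]
Model: $\lambda = (A, B, \pi)$, with $\Pr(X_t \mid X_1, X_2, \ldots, X_{t-1}) = \Pr(X_t \mid X_{t-1})$
Emission probability: $\epsilon_S(X_i)$
Transition probability: $q_{ij}$
Initialisation ($i = 0$): $f_0(0) = 1,\; f_k(0) = 0 \;\; \forall\, k > 0$
Recursion ($i = 1, \ldots, L$): $f_S(i+1) = [f_S(i)\, q_{SS} + f_R(i)\, q_{RS}] \times \epsilon_S(X_{i+1})$
Termination: $\Pr(D \mid \lambda) = \sum_k f_k(L)$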
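The recursion above can be sketched in a few lines of Python. This is an illustrative sketch, not the speaker's code: the parameter values are one assumed reading of the earlier diagram, the silent begin state 0 is folded into an assumed uniform start distribution `START`, and `brute_force` is a hypothetical helper that sums over every path so the two results can be compared on a short prefix.

```python
from itertools import product

STATES = ("S", "R")
# Assumed parameter values; the slides do not say which number belongs to which arrow.
TRANS = {"S": {"S": 0.8, "R": 0.2}, "R": {"S": 0.4, "R": 0.6}}
EMIT = {"S": {"T": 0.2, "W": 0.8}, "R": {"T": 0.9, "W": 0.1}}
START = {"S": 0.5, "R": 0.5}  # assumed stand-in for the begin state


def forward(obs):
    """Return Pr(D | lambda) using the forward recursion."""
    # f[k] is the probability of the observations so far, ending in state k.
    f = {k: START[k] * EMIT[k][obs[0]] for k in STATES}
    for x in obs[1:]:
        f = {k: sum(f[j] * TRANS[j][k] for j in STATES) * EMIT[k][x]
             for k in STATES}
    return sum(f.values())  # termination: sum over final states


def brute_force(obs):
    """Naive approach: sum the joint probability over every possible path."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = START[path[0]] * EMIT[path[0]][obs[0]]
        for prev, cur, x in zip(path, path[1:], obs[1:]):
            p *= TRANS[prev][cur] * EMIT[cur][x]
        total += p
    return total


obs = "W T T W T W W W T W T T T T W T T W T T".split()
print("forward, full sequence:", forward(obs))
print("forward vs brute force on first 8:", forward(obs[:8]), brute_force(obs[:8]))
```

The brute-force sum visits every path, so it is only practical for short prefixes; the forward pass touches each position once. For much longer sequences the products underflow, and a log-space or scaled variant would be needed.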
Problem 2
Given some model $\lambda = (A, B, \pi)$ and a sequence of observations $D$, find the most probable sequence of the underlying states.
Observation: W T T W T W W W T W T T T T W T T W T T
Path: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
§ Use the Viterbi algorithm.
§ A traceback matrix keeps track of which path is the most likely.
§ The most likely path can be found from:
$V_k(i+1) = \max_j [V_j(i)\, q_{jk}] \times \epsilon_k(X_{i+1})$
$t_k(i+1) = \arg\max_j [V_j(i)\, q_{jk}]$
$\max_k [V_k(L)]$
Solution 2
Observation: W T T W T W W W T W T T T T W T T W T T
Path: S R R R R S S S S S R R R R R R S S S R
[Figure: trellis of states S and R over positions $X_{i-1}$, $X_i$, $X_{i+1}$.]
$V_S(i+1) = \max_j [V_j(i)\, q_{jS}] \times \epsilon_S(X_{i+1})$
$t_S(i+1) = \arg\max_j [V_j(i)\, q_{jS}]$
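A minimal sketch of the Viterbi recursion and traceback follows. As before, the parameter values and the uniform `START` distribution are assumptions rather than values confirmed by the slides, so the decoded path need not match the one shown above; `joint_probability` is a hypothetical helper for checking the result.

```python
STATES = ("S", "R")
# Assumed parameter values; the mapping of numbers to arrows is a guess.
TRANS = {"S": {"S": 0.8, "R": 0.2}, "R": {"S": 0.4, "R": 0.6}}
EMIT = {"S": {"T": 0.2, "W": 0.8}, "R": {"T": 0.9, "W": 0.1}}
START = {"S": 0.5, "R": 0.5}  # assumed; not given on the slide


def viterbi(obs):
    """Return the most probable state path and its joint probability."""
    V = {k: START[k] * EMIT[k][obs[0]] for k in STATES}
    trace = []  # trace[i][k] is the best predecessor of state k at step i + 1
    for x in obs[1:]:
        new_V, ptr = {}, {}
        for k in STATES:
            best = max(STATES, key=lambda j: V[j] * TRANS[j][k])
            ptr[k] = best
            new_V[k] = V[best] * TRANS[best][k] * EMIT[k][x]
        V = new_V
        trace.append(ptr)
    # Termination: pick the best final state, then follow the pointers back.
    last = max(STATES, key=lambda k: V[k])
    path = [last]
    for ptr in reversed(trace):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V[last]


def joint_probability(path, obs):
    """Probability of one specific path together with the observations."""
    p = START[path[0]] * EMIT[path[0]][obs[0]]
    for prev, cur, x in zip(path, path[1:], obs[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][x]
    return p


obs = "W T T W T W W W T W T T T T W T T W T T".split()
path, prob = viterbi(obs)
print("Observation:", " ".join(obs))
print("Path:       ", " ".join(path))
print("Probability:", prob)
```

Compared with the forward sketch, the only structural change is that the sum over predecessor states becomes a max, with the argmax stored for the traceback.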
HIV recombination
[Figure: "Sequences compared to master" (MASTER: 1003-102301p-8). Each row is one sequence; x-axis: base number (0 to 1000). Colour key: A: Green, T: Red, G: Orange, C: Light blue, IUPAC: Dark blue, Gaps: Gray.]
Generating estimates for 𝜌
Builds $h_{k+1}$ as an imperfect mosaic of $h_1, \ldots, h_k$: an imperfect copying process.
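One way to picture an imperfect mosaic is to simulate it. The sketch below is only an illustration under stated assumptions: the function `copy_mosaic`, the parameters `switch_prob` (standing in for recombination between source sequences) and `error_prob` (standing in for miscopying), and the toy sequences are all hypothetical and are not taken from the talk.

```python
import random


def copy_mosaic(sequences, switch_prob, error_prob, alphabet="ACGT", seed=0):
    """Build a new sequence by copying, position by position, from existing ones.

    Returns the new sequence and the index of the source copied at each position.
    """
    rng = random.Random(seed)
    length = len(sequences[0])
    source = rng.randrange(len(sequences))  # hidden state: which sequence is copied
    new_seq, sources = [], []
    for pos in range(length):
        # With probability switch_prob, jump to a (possibly identical) source.
        if pos > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(sequences))
        base = sequences[source][pos]
        # With probability error_prob, the copy is imperfect.
        if rng.random() < error_prob:
            base = rng.choice([b for b in alphabet if b != base])
        new_seq.append(base)
        sources.append(source)
    return "".join(new_seq), sources


existing = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "GGGGGGGGGGGG"]  # toy h_1..h_k
new_seq, sources = copy_mosaic(existing, switch_prob=0.2, error_prob=0.05)
print(new_seq)
print(sources)
```

In HMM terms, the source index at each position plays the role of the hidden state and the copied base is the emission, which is why the forward and Viterbi machinery from the primer applies.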
Modeling the copy process
[Figure: two panels, "Single time point" and "Two time points"; labels K, K+1, t1, t2, Δt.]
Viterbi most likely path
Statistical inference
Closing remarks
Advantages of HMMs
§ Easy enough to implement, and allow tractable computation
§ Rich enough to model very complex biological processes
Disadvantages
§ States are supposed to be conditionally independent; this is sometimes not true.
§ Local maxima
  § Model may not converge to a truly global parameter maximum
§ Speed
  § Almost everything one does in an HMM involves enumerating all possible paths through the model
  § Can be sped up in various ways, but can still be relatively slow.