Parameter Estimation for HMM, Lecture #7
Parameter Estimation For HMM Lecture #7
Background Readings: Chapter 3.3 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.
Reminder: Hidden Markov Model
Markov chain transition probabilities: $p(S_{i+1} = t \mid S_i = s) = a_{st}$

Emission probabilities: $p(X_i = b \mid S_i = s) = e_s(b)$
[Figure: the HMM chain S1 → S2 → … → SL, each state Si emitting symbol xi]
$$p(s, x) = p(s_1,\dots,s_L;\, x_1,\dots,x_L) = \prod_{i=1}^{L} a_{s_{i-1} s_i}\, e_{s_i}(x_i)$$
For each integer L > 0, this defines a probability space in which the simple events are the HMM sequences of length L.
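As a concrete reading of this formula, here is a minimal Python sketch (the names `pi`, `a`, `e` are our own; the slides fold the begin-state transition $a_{s_0 s_1}$ into the product, which we represent with an initial distribution `pi`):

```python
def joint_prob(s, x, pi, a, e):
    """p(s, x) = prod_{i=1}^{L} a_{s_{i-1} s_i} e_{s_i}(x_i), with the begin-state
    transition represented by an initial distribution pi over states.
    s and x are index sequences; a and e are row-indexed matrices."""
    p = pi[s[0]] * e[s[0]][x[0]]
    for i in range(1, len(x)):
        p *= a[s[i - 1]][s[i]] * e[s[i]][x[i]]
    return p
```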
Hidden Markov Model for CpG Islands
The states: Domain(S_i) = {+, −} × {A, C, T, G} (8 values).
In this representation, p(x_i | s_i) = 0 or 1, depending on whether x_i is consistent with s_i. E.g., x_i = G is consistent with s_i = (+, G) and with s_i = (−, G), but not with any other state of s_i.
[Figure: the CpG HMM states A±, C±, G±, T±, each emitting its own letter]
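As a small illustration (our own sketch, not slide code), the 8-state representation and its 0/1 emission rule can be written directly; the `(sign, letter)` encoding is a hypothetical choice:

```python
# Sketch of the 8-state CpG model's emission rule: e_{(sign, c)}(b) = 1 iff b == c.
states = [(sign, c) for sign in "+-" for c in "ACGT"]  # the 8 states

def emission(state, symbol):
    """p(x_i = symbol | s_i = state): 1 iff the symbol matches the state's letter."""
    _, letter = state
    return 1.0 if symbol == letter else 0.0
```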
Reminder: Most Probable State Path
Given an output sequence x = (x1,…,xL),
a most probable path $s^* = (s^*_1,\dots,s^*_L)$ is one which maximizes $p(s \mid x)$:

$$s^* = (s^*_1,\dots,s^*_L) = \operatorname*{argmax}_{(s_1,\dots,s_L)} p(s_1,\dots,s_L \mid x_1,\dots,x_L)$$
[Figure: an example state path through the ± states over an output sequence]
Predicting CpG islands via most probable path:
Output symbols: A, C, G, T (4 letters). Markov chain states: 4 "−" states and 4 "+" states, two for each letter (8 states total). The most probable path (found by Viterbi's algorithm) predicts CpG islands.
Experiments (Durbin et al., p. 60-61) show that the predicted islands are shorter than the assumed ones. In addition, quite a few "false negatives" are found.
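For reference, a compact Viterbi sketch in Python (in log space to avoid underflow on long sequences; `pi`, `a`, `e` are the same assumed index-based parameters as in the earlier sketch, not names from the slides):

```python
import math

def _log(p):
    return math.log(p) if p > 0 else float("-inf")  # guard against zero parameters

def viterbi(x, pi, a, e):
    """Most probable state path s* = argmax_s p(s | x) = argmax_s p(s, x)."""
    K, L = len(pi), len(x)
    V = [[_log(pi[k]) + _log(e[k][x[0]]) for k in range(K)]]  # V[i][k]: best log-prob ending in k at i
    ptr = []                                                   # back-pointers
    for i in range(1, L):
        row, back = [], []
        for l in range(K):
            k_best = max(range(K), key=lambda k: V[-1][k] + _log(a[k][l]))
            row.append(V[-1][k_best] + _log(a[k_best][l]) + _log(e[l][x[i]]))
            back.append(k_best)
        V.append(row)
        ptr.append(back)
    path = [max(range(K), key=lambda k: V[-1][k])]  # traceback from the best final state
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return path[::-1]
```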
Reminder: Most probable state
Given an output sequence x = (x1,…,xL), si is a most probable state (at location i) if:
$$s_i = \operatorname*{argmax}_{k}\, p(S_i = k \mid x).$$
$$p(S_i = s_i \mid x_1,\dots,x_L) = \frac{p(S_i = s_i,\, x_1,\dots,x_L)}{p(x_1,\dots,x_L)}$$
Finding the probability that a letter is in a CpG island via the algorithm for most probable state:
The probability that the occurrence of G in the i-th location is in a CpG island (+ state) is:
$$\sum_{s^+} p(S_i = s^+ \mid x) = \frac{1}{p(x)} \sum_{s^+} F(S_i = s^+)\, B(S_i = s^+)$$
where the summation is formally over the 4 "+" states, but actually only the state G+ needs to be considered (why?).
Parameter Estimation for HMM
An HMM model is defined by the parameters: akl and ek(b), for all states k,l and all symbols b. Let θ denote the collection of these parameters.
Parameter Estimation for HMM
To determine the values of (the parameters in) θ, use a training set {x1,...,xn}, where each xj is a sequence which is assumed to fit the model. Given the parameters θ, each sequence xj has an assigned probability p(xj|θ).
Maximum Likelihood Parameter Estimation for HMM
The elements of the training set {x1,...,xn} are assumed to be independent, so $p(x^1,\dots,x^n \mid \theta) = \prod_j p(x^j \mid \theta)$.
ML parameter estimation looks for θ which maximizes the above.
The exact method for finding or approximating this θ depends on the nature of the training set used.
Data for HMM
Possible properties of (the sequences in) the training set:
1. For each xj, the information on the states s^j_i (the symbols x^j_i are usually known).
2. The size (number of sequences) of the training set.
Case 1: ML when Sequences are fully known
We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate akl and ek(b) for all pairs of states k, l and symbols b.
By the ML method, we look for parameters θ* which maximize the probability of the sample set: $p(x^1,\dots,x^n \mid \theta^*) = \max_\theta\, p(x^1,\dots,x^n \mid \theta)$.
Case 1: Sequences are fully known
For each $x^j$ we have:

$$p(x^j \mid \theta) = \prod_{i=1}^{L} a_{s_{i-1} s_i}\, e_{s_i}(x^j_i)$$

Let $m_{kl} = |\{i : s_{i-1} = k,\ s_i = l\}|$ (in $x^j$) and $m_k(b) = |\{i : s_i = k,\ x_i = b\}|$ (in $x^j$). Then:

$$p(x^j \mid \theta) = \prod_{(k,l)} a_{kl}^{m_{kl}} \prod_{(k,b)} [e_k(b)]^{m_k(b)}$$
Case 1 (cont)
By the independence of the $x^j$'s, $p(x^1,\dots,x^n \mid \theta) = \prod_j p(x^j \mid \theta)$.
Thus, if Akl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set, we have:
$$p(x^1,\dots,x^n \mid \theta) = \prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$$
Case 1 (cont)
So we need to find $a_{kl}$'s and $e_k(b)$'s which maximize:

$$\prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$$

Subject to: for all states $k$,

$$\sum_l a_{kl} = 1 \quad\text{and}\quad \sum_b e_k(b) = 1 \qquad [\,a_{kl} \ge 0,\ e_k(b) \ge 0\,]$$
Case 1 (cont)
Rewriting, we need to maximize:

$$F = \prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)} = \Big[\prod_{(k,l)} a_{kl}^{A_{kl}}\Big]\,\Big[\prod_{(k,b)} [e_k(b)]^{E_k(b)}\Big]$$

Subject to: for all $k$, $\sum_l a_{kl} = 1$ and $\sum_b e_k(b) = 1$.
Case 1 (cont)
If we maximize, for each $k$,

$$\prod_l a_{kl}^{A_{kl}} \ \ \text{s.t.}\ \sum_l a_{kl} = 1, \qquad \text{and also} \qquad \prod_b [e_k(b)]^{E_k(b)} \ \ \text{s.t.}\ \sum_b e_k(b) = 1,$$

then we will also maximize F. Each of the above is a simpler ML problem, similar to ML parameter estimation for a die, treated next.
ML parameter estimation for a die
Let X be a random variable with 6 values x1,…,x6 denoting the six outcomes of a die. Here the parameters are θ = {θ1, θ2, θ3, θ4, θ5, θ6}, with ∑θi = 1. Assume that the data is one sequence:

Data = (x6, x1, x1, x3, x2, x2, x3, x4, x5, x2, x6)

So we have to maximize

$$P(Data \mid \theta) = \theta_1^2\, \theta_2^3\, \theta_3^2\, \theta_4\, \theta_5\, \theta_6^2$$

Subject to: θ1 + θ2 + θ3 + θ4 + θ5 + θ6 = 1 [and θi ≥ 0], i.e.,

$$P(Data \mid \theta) = \theta_1^2\, \theta_2^3\, \theta_3^2\, \theta_4\, \theta_5 \Big(1 - \sum_{i=1}^{5}\theta_i\Big)^2$$
Maximum Likelihood Estimate

By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). We will find the parameters by considering the corresponding log-likelihood function:
$$\log P(Data \mid \theta) = \log\!\Big[\theta_1^{N_1}\theta_2^{N_2}\theta_3^{N_3}\theta_4^{N_4}\theta_5^{N_5}\Big(1 - \sum_{i=1}^{5}\theta_i\Big)^{N_6}\Big] = \sum_{i=1}^{5} N_i \log \theta_i + N_6 \log\!\Big(1 - \sum_{i=1}^{5}\theta_i\Big)$$

A necessary condition for a (local) maximum is:

$$\frac{\partial \log P(Data \mid \theta)}{\partial \theta_j} = \frac{N_j}{\theta_j} - \frac{N_6}{1 - \sum_{i=1}^{5}\theta_i} = 0, \qquad j = 1,\dots,5$$
Finding the Maximum
Rearranging terms:

$$\frac{N_j}{\theta_j} = \frac{N_6}{1 - \sum_{i=1}^{5}\theta_i} = \frac{N_6}{\theta_6}, \qquad j = 1,\dots,5$$

Divide the jth equation by the ith equation:

$$\frac{\theta_j}{\theta_i} = \frac{N_j}{N_i}$$

Sum from j = 1 to 6:

$$\frac{1}{\theta_i} = \sum_{j=1}^{6}\frac{\theta_j}{\theta_i} = \sum_{j=1}^{6}\frac{N_j}{N_i} = \frac{N}{N_i}, \qquad N = N_1 + \dots + N_6$$

So there is only one local (and hence global) maximum. Hence the MLE is given by:

$$\theta_i = \frac{N_i}{N}, \qquad i = 1,\dots,6$$
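A quick numeric check of this result on the slide's data (a sketch; the die faces are encoded as the integers 1-6):

```python
from collections import Counter

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]   # the Data sequence above
counts = Counter(data)
N = len(data)                               # N = 11
mle = {face: counts[face] / N for face in range(1, 7)}
# theta_i = N_i / N: {1: 2/11, 2: 3/11, 3: 2/11, 4: 1/11, 5: 1/11, 6: 2/11}
```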
Generalization to a distribution with any number k of outcomes
Let X be a random variable with k values x1,…,xk denoting the k outcomes of independent and identically distributed experiments, with parameters θ = {θ1, θ2, ..., θk} (θi is the probability of xi). Again, the data is one sequence, of length n:

Data = (x_{i_1}, x_{i_2}, ..., x_{i_n})

Then we have to maximize

$$P(Data \mid \theta) = \theta_1^{n_1}\, \theta_2^{n_2} \cdots \theta_k^{n_k}, \qquad (n_1 + \dots + n_k = n)$$

Subject to: θ1 + θ2 + ... + θk = 1, i.e.,

$$P(Data \mid \theta) = \theta_1^{n_1} \cdots \theta_{k-1}^{n_{k-1}} \Big(1 - \sum_{i=1}^{k-1}\theta_i\Big)^{n_k}$$
Generalization for k outcomes (cont)
By treatment identical to the die case, the maximum is obtained when, for all i:

$$\frac{\theta_i}{\theta_k} = \frac{n_i}{n_k}$$

Hence the MLE is given by the relative frequencies:

$$\theta_i = \frac{n_i}{n}, \qquad i = 1,\dots,k$$
Fractional Exponents
Some models allow the n_i's to be fractions (e.g., if we are uncertain of a die outcome, we may consider it "6" with 20% confidence and "5" with 80%). We still can have

$$P(Data \mid \theta) = \theta_1^{n_1}\, \theta_2^{n_2} \cdots \theta_k^{n_k}, \qquad (n_1 + \dots + n_k = n)$$

and the same analysis yields:

$$\theta_i = \frac{n_i}{n}, \qquad i = 1,\dots,k$$
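The same computation works with fractional counts (a sketch; the weighted counts below are hypothetical, with the uncertain roll contributing 0.8 to "5" and 0.2 to "6"):

```python
# Counts may be fractional; the MLE is still theta_i = n_i / n.
weighted = {1: 2.0, 2: 3.0, 3: 2.0, 4: 1.0, 5: 0.8, 6: 1.2}  # hypothetical n_i values
n = sum(weighted.values())
mle = {face: ni / n for face, ni in weighted.items()}
```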
Apply the ML method to HMM
Let Akl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We need to:
Maximize:

$$\prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$$

Subject to: for all states $k$, $\sum_l a_{kl} = 1$, $\sum_b e_k(b) = 1$, and $a_{kl},\, e_k(b) \ge 0$.
Apply to HMM (cont.)
We apply the previous technique to get, for each k, the parameters {a_kl | l = 1,..,m} and {e_k(b) | b ∈ Σ}:
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')},$$
which gives the optimal ML parameters.
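In code, Case 1 is just counting and normalizing (a sketch; `paths` and `outputs` are hypothetical names for the fully-labeled training sequences):

```python
from collections import defaultdict

def ml_estimate(paths, outputs):
    """Fully observed case: count A_kl and E_k(b), then normalize:
    a_kl = A_kl / sum_l' A_kl', e_k(b) = E_k(b) / sum_b' E_k(b')."""
    A = defaultdict(lambda: defaultdict(float))
    E = defaultdict(lambda: defaultdict(float))
    for s, x in zip(paths, outputs):
        for i in range(1, len(s)):
            A[s[i - 1]][s[i]] += 1            # one more k -> l transition
        for k, b in zip(s, x):
            E[k][b] += 1                      # one more emission of b from k
    a = {k: {l: c / sum(row.values()) for l, c in row.items()} for k, row in A.items()}
    e = {k: {b: c / sum(row.values()) for b, c in row.items()} for k, row in E.items()}
    return a, e
```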
Summary of Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate akl and ek(b) for all pairs of states k, l and symbols b.
Summary: when everything is known, we can find the (unique set of) parameters θ* which maximizes $p(x^1,\dots,x^n \mid \theta^*) = \max_\theta\, p(x^1,\dots,x^n \mid \theta)$.
Adding pseudo counts in HMM
If the sample set is too small, we may get a biased result. In this case we modify the actual counts by our prior knowledge/belief: r_kl is our prior belief on transitions from k to l, and r_k(b) is our prior belief on emissions of b from state k.
Then:

$$a_{kl} = \frac{A_{kl} + r_{kl}}{\sum_{l'} (A_{kl'} + r_{kl'})}, \qquad e_k(b) = \frac{E_k(b) + r_k(b)}{\sum_{b'} \big(E_k(b') + r_k(b')\big)}$$
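The pseudocount correction is a one-line change to the normalization (a sketch for one state k; `counts` and `r` are our own names for the actual and prior counts):

```python
def normalize_with_pseudocounts(counts, r):
    """For one state k: a_kl = (A_kl + r_kl) / sum_l' (A_kl' + r_kl')."""
    total = sum(counts[l] + r[l] for l in counts)
    return {l: (counts[l] + r[l]) / total for l in counts}
```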
Case 2: State paths are unknown
For a given θ we have: $p(x^1,\dots,x^n \mid \theta) = p(x^1 \mid \theta) \cdots p(x^n \mid \theta)$ (since the $x^j$ are independent).
For each sequence x, $p(x \mid \theta) = \sum_s p(x, s \mid \theta)$, the sum taken over all state paths s which emit x.
Case 2: State paths are unknown
Thus, for the n sequences (x^1,..., x^n) we have:

$$p(x^1,\dots,x^n \mid \theta) = \sum_{(s^1,\dots,s^n)} p(x^1,\dots,x^n,\, s^1,\dots,s^n \mid \theta),$$

where the summation is taken over all tuples of n state paths (s^1,..., s^n) which generate (x^1,..., x^n). For simplicity, we will assume that n = 1.
Case 2: State paths are unknown
So we need to maximize $p(x \mid \theta) = \sum_s p(x, s \mid \theta)$, where the summation is over all state paths s which produce the output sequence x. Finding θ* which maximizes $\sum_s p(x, s \mid \theta)$ is hard. [Unlike finding θ* which maximizes $p(x, s \mid \theta)$ for a single pair (x, s).]
ML Parameter Estimation for HMM
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that p(x|θ') > p(x|θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
Baum-Welch training
We start with some values of akl and ek(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* such that p(x|θ*) > p(x|θ). This is done by "imitating" the algorithm for Case 1, where all states are known:
Baum-Welch training
In case 1 we computed the optimal values of akl and ek(b), (for the optimal θ) by simply counting the number Akl of transitions from state k to state l, and the number Ek(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.
[Figure: known adjacent states s_{i-1} = k and s_i = l, emitting x_{i-1} = b and x_i = c]
Baum-Welch training
When the states are unknown, the counting process is replaced by an averaging process: for each edge s_{i-1} → s_i we compute the average number of "k to l" transitions, for all possible pairs (k, l), over this edge. Then, for each k and l, we take Akl to be the sum over all edges.
[Figure: unknown adjacent states s_{i-1} = ? and s_i = ?, emitting x_{i-1} = b and x_i = c]
Baum-Welch training
Similarly, for each emission s_i → x_i = b and each state k, we compute the average number of times that s_i = k, which is the expected number of emissions of b from k on this edge. Then we take Ek(b) to be the sum over all such edges. These expected values are computed assuming the current parameters θ:
[Figure: unknown state s_i = ? emitting the observed symbol b]
Akl and Ek(b) when states are unknown
Akl and Ek(b) are computed according to the current distribution θ, that is:
$$A_{kl} = \sum_s A^s_{kl}\; p(s \mid x, \theta), \qquad E_k(b) = \sum_s E^s_k(b)\; p(s \mid x, \theta),$$

where $A^s_{kl}$ is the number of k to l transitions in the state path s, and $E^s_k(b)$ is the number of times k emits b in the path s with output x.
Baum-Welch: Step 1a. Count expected number of state transitions
For each i, k, l, compute the state transition probabilities under the current θ:
$$P(s_{i-1} = k,\, s_i = l \mid x, \theta)$$

For this, we use the forward and backward algorithms.
Baum-Welch: Step 1a (cont)
Claim: by the probability distribution of the HMM,
$$p(s_{i-1} = k,\, s_i = l \mid x, \theta) = \frac{F_k(i-1)\, a_{kl}\, e_l(x_i)\, B_l(i)}{p(x \mid \theta)}$$
(akl and el(xi) are the parameters defined by θ, and Fk(i-1), Bl(i) are computed by the forward and backward algorithms.)
Step 1a: Computing P(si-1=k, si=l | x,θ)
$$P(x_1,\dots,x_L,\, s_{i-1} = k,\, s_i = l \mid \theta) = \underbrace{P(x_1,\dots,x_{i-1},\, s_{i-1} = k \mid \theta)}_{\text{via the forward algorithm}}\; a_{kl}\, e_l(x_i)\; \underbrace{P(x_{i+1},\dots,x_L \mid s_i = l, \theta)}_{\text{via the backward algorithm}} = F_k(i-1)\, a_{kl}\, e_l(x_i)\, B_l(i)$$

Dividing by $p(x \mid \theta)$ gives:

$$p(s_{i-1} = k,\, s_i = l \mid x, \theta) = \frac{F_k(i-1)\, a_{kl}\, e_l(x_i)\, B_l(i)}{p(x \mid \theta)}$$
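A direct transcription into Python (a sketch; 0-based indexing, so `F[i][k]` here corresponds to $F_k(i+1)$ on the slides, and `pi` is again an assumed initial distribution):

```python
def forward(x, pi, a, e):
    """F[i][k] = p(x_0..x_i, s_i = k); also returns p(x) = sum_k F[L-1][k]."""
    K, L = len(pi), len(x)
    F = [[0.0] * K for _ in range(L)]
    for k in range(K):
        F[0][k] = pi[k] * e[k][x[0]]
    for i in range(1, L):
        for l in range(K):
            F[i][l] = sum(F[i - 1][k] * a[k][l] for k in range(K)) * e[l][x[i]]
    return F, sum(F[L - 1])

def backward(x, a, e, K):
    """B[i][k] = p(x_{i+1}..x_{L-1} | s_i = k); B at the last position is 1."""
    L = len(x)
    B = [[1.0] * K if i == L - 1 else [0.0] * K for i in range(L)]
    for i in range(L - 2, -1, -1):
        for k in range(K):
            B[i][k] = sum(a[k][l] * e[l][x[i + 1]] * B[i + 1][l] for l in range(K))
    return B

def transition_posterior(i, k, l, x, F, B, a, e, px):
    """p(s_{i-1} = k, s_i = l | x, theta) = F_k(i-1) a_kl e_l(x_i) B_l(i) / p(x)."""
    return F[i - 1][k] * a[k][l] * e[l][x[i]] * B[i][l] / px
```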
Step 1a (end)
For each pair (k, l), compute the expected number of state transitions from k to l, as the sum of the expected number of k to l transitions over all L edges:
$$A_{kl} = \sum_{i=1}^{L} p(s_{i-1} = k,\, s_i = l \mid x, \theta) = \frac{1}{p(x \mid \theta)} \sum_{i=1}^{L} F_k(i-1)\, a_{kl}\, e_l(x_i)\, B_l(i)$$
Step 1a for many sequences:
Exercise: Prove that when we have n independent input sequences (x1,..., xn ), then Akl is given by:
$$A_{kl} = \sum_{j=1}^{n} \sum_{i=1}^{L} p(s_{i-1} = k,\, s_i = l \mid x^j, \theta) = \sum_{j=1}^{n} \frac{1}{p(x^j \mid \theta)} \sum_{i=1}^{L} f^j_k(i-1)\, a_{kl}\, e_l(x^j_i)\, b^j_l(i),$$

where $f^j_k$ and $b^j_l$ are computed by the forward and backward algorithms for $x^j$ under $\theta$.
Baum-Welch: Step 1b. Count expected number of symbol emissions
For each state k and each symbol b, for each i where x_i = b, compute the expected number of times that s_i = k:
$$p(s_i = k \mid x_1,\dots,x_L) = \frac{p(s_i = k,\, x_1,\dots,x_L)}{p(x_1,\dots,x_L)} = \frac{F_k(i)\, B_k(i)}{p(x_1,\dots,x_L)}$$
Baum-Welch: Step 1b
For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that si = k, over all i’s for which xi = b.
$$E_k(b) = \frac{1}{p(x \mid \theta)} \sum_{i\,:\, x_i = b} F_k(i)\, B_k(i)$$
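Putting Steps 1a and 1b together for a single sequence, using the `forward`/`backward` sketches above (our own decomposition of the E-step, with 0-based indices and symbols encoded as integers):

```python
def expected_counts(x, pi, a, e, n_symbols):
    """E-step: A[k][l] = sum_i F_k(i-1) a_kl e_l(x_i) B_l(i) / p(x),
    E[k][b] = sum_{i: x_i = b} F_k(i) B_k(i) / p(x)."""
    K, L = len(pi), len(x)
    F, px = forward(x, pi, a, e)
    B = backward(x, a, e, K)
    A = [[0.0] * K for _ in range(K)]
    E = [[0.0] * n_symbols for _ in range(K)]
    for i in range(1, L):                       # sum over the L-1 internal edges
        for k in range(K):
            for l in range(K):
                A[k][l] += F[i - 1][k] * a[k][l] * e[l][x[i]] * B[i][l] / px
    for i in range(L):                          # sum over all emission positions
        for k in range(K):
            E[k][x[i]] += F[i][k] * B[i][k] / px
    return A, E
```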
Step 1b for many sequences
Exercise: when we have n sequences (x1,..., xn ), the expected number of emissions of b from k is given by:
$$E_k(b) = \sum_{j=1}^{n} \frac{1}{p(x^j \mid \theta)} \sum_{i\,:\, x^j_i = b} F^j_k(i)\, B^j_k(i)$$
Summary of Steps 1a and 1b: the E part of the Baum-Welch training
These steps compute the expected numbers Akl of k, l transitions for all pairs of states k and l, and the expected numbers Ek(b) of emissions of symbol b from state k, for all states k and symbols b.
The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.
Baum-Welch: Step 2

Use the Akl's and Ek(b)'s to compute the new values of akl and ek(b):

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

These values define θ*.
The correctness of the EM algorithm implies that $p(x^1,\dots,x^n \mid \theta^*) \ge p(x^1,\dots,x^n \mid \theta)$, i.e., θ* increases (or at least does not decrease) the probability of the data.
This procedure is iterated until some convergence criterion is met.
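A full training loop combining the E-step sketch above with this M-step (a sketch under the same assumptions; for simplicity it keeps `pi` fixed, uses a fixed iteration count as the convergence criterion, and omits pseudocounts, which would be needed if any state receives zero expected counts):

```python
def baum_welch(xs, pi, a, e, n_symbols, iters=50):
    """Iterate: the E-step accumulates A_kl and E_k(b) over all training sequences,
    the M-step renormalizes them into new a_kl and e_k(b)."""
    K = len(pi)
    for _ in range(iters):
        A = [[0.0] * K for _ in range(K)]
        E = [[0.0] * n_symbols for _ in range(K)]
        for x in xs:                                   # sum expected counts over sequences
            Ax, Ex = expected_counts(x, pi, a, e, n_symbols)
            for k in range(K):
                for l in range(K):
                    A[k][l] += Ax[k][l]
                for b in range(n_symbols):
                    E[k][b] += Ex[k][b]
        a = [[A[k][l] / sum(A[k]) for l in range(K)] for k in range(K)]           # Step 2
        e = [[E[k][b] / sum(E[k]) for b in range(n_symbols)] for k in range(K)]
    return a, e
```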