www.bioalgorithms.info An Introduction to Bioinformatics Algorithms
Slides revised and adapted to
Computational Biology
IST
2015/2016
Ana Teresa Freitas
Hidden Markov Models
Three classic HMM problems
• Evaluation: Given a model and an output sequence, what is the probability that the model generated that output?
• Decoding: Given a model and an output sequence, what is the most likely state sequence (path) through the model that produced that sequence?
• Learning: Given a model and a set of observed sequences, what should the model parameters be so that the model has a high probability of generating those sequences?
Evaluation Problem
HMM Parameters
• Σ: set of emission characters.
  Ex.: Σ = {H, T} or {1, 0} for coin tossing; Σ = {A, C, T, G} for DNA
• Q: set of hidden states, each emitting symbols from Σ.
  Ex.: Q = {F, B} for coin tossing
HMM Parameters (cont)
• A = (a_kl): a |Q| × |Q| matrix of the probability of changing from state k to state l.
  Ex.: a_FF = 0.9, a_FB = 0.1, a_BF = 0.1, a_BB = 0.9
• E = (e_k(b)): a |Q| × |Σ| matrix of the probability of emitting symbol b while in state k.
  Ex.: e_F(0) = ½, e_F(1) = ½, e_B(0) = ¼, e_B(1) = ¾
P(x, π) Calculation
• P(x, π): probability that sequence x was generated by the path π:

  P(x, π) = P(π0 → π1) · Π_{i=1..n} P(xi | πi) · P(πi → πi+1)
          = a_{π0,π1} · Π_{i=1..n} e_{πi}(xi) · a_{πi,πi+1}

• π0 and πn+1 represent the fictitious initial and terminal states (begin and end).
Decoding Problem
• Goal: Find an optimal path of hidden states given the observations.
• Input: Sequence of observations x = x1…xn generated by an HMM M(Σ, Q, A, E).
• Output: A path π* that maximizes P(x, π) over all possible paths π.
Edit Graph for Decoding Problem
Building Manhattan for Decoding Problem
• Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem.
• Every choice of π = π1… πn corresponds to a path in the graph.
• The only valid direction in the graph is eastward.
• This graph has |Q|²(n−1) edges.
Decoding Problem vs. Alignment Problem
Valid directions in the alignment problem.
Valid directions in the decoding problem.
Decoding Problem as Finding a Longest Path in a DAG
• The Decoding Problem is reduced to finding a longest path in the directed acyclic graph (DAG) above.
• Note: the length of the path is defined as the product of its edges’ weights, not the sum.
Decoding Problem (cont’d)
• Every path π in the graph has probability P(x, π).
• The Viterbi algorithm finds the path that maximizes P(x, π) among all possible paths.
• The Viterbi algorithm runs in O(n|Q|²) time.
Decoding Problem: weights of edges

  (k, i) --w--> (l, i+1)

Since

  P(x, π) = Π_{i=0..n−1} e_{πi+1}(xi+1) · a_{πi,πi+1}

the i-th term of the product is e_{πi+1}(xi+1) · a_{πi,πi+1}. Hence the weight of the edge from (k, i) to (l, i+1) is

  w = e_l(xi+1) · a_kl   for πi = k, πi+1 = l
Decoding Problem and Dynamic Programming

s_{k,i} is the probability of the most probable path for the prefix (x1,...,xi) that ends in state k.

  s_{l,i+1} = max_{k∈Q} { s_{k,i} · weight of the edge between (k,i) and (l,i+1) }
            = max_{k∈Q} { s_{k,i} · a_kl · e_l(xi+1) }
            = e_l(xi+1) · max_{k∈Q} { s_{k,i} · a_kl }
Decoding Problem (cont’d)
• Initialization:
  • s_{begin,0} = 1
  • s_{k,0} = 0 for k ≠ begin
• For each i = 0,...,n−1 and for each l ∈ Q recursively calculate:
  s_{l,i+1} = e_l(xi+1) · max_{k∈Q} { s_{k,i} · a_kl }
• Let π* be the optimal path. Then
  P(x, π*) = max_{k∈Q} { s_{k,n} · a_{k,end} }
Viterbi Algorithm
• The value of the product can become extremely small, which leads to underflow.
• To avoid underflow, use log values instead:
  s_{l,i+1} = log e_l(xi+1) + max_{k∈Q} { s_{k,i} + log a_kl }
• Initialization:
  • s_{begin,0} = 0
  • s_{k,0} = −∞ for k ≠ begin
• The score of the best path:
  Score(x, π*) = max_{k∈Q} { s_{k,n} + log a_{k,end} }
Viterbi Algorithm
• Complexity:
  • We calculate the values of O(|Q|·L) cells of the matrix V
  • We spend O(|Q|) operations per cell
  • Overall time complexity: O(L·|Q|²)
  • Space complexity: O(L·|Q|)
Example
• Consider an HMM that includes two hidden states (S1 and S2).
• The initial probabilities for both S1 and S2 are equal to 0.5, while the transition probabilities are:
  a_{S1S1} = 0.6; a_{S2S2} = 0.5; a_{S1S2} = 0.4; a_{S2S1} = 0.5
• Nucleotides T, C, G, A are emitted with probabilities eA = eT = 0.3 and eC = eG = 0.2 from state S1, and eA = eT = 0.2 and eC = eG = 0.3 from state S2.
Problem 1 – Evaluation
• Given the HMM described in the example and the output sequence x = GGCACTGAA, what is the probability that the HMM generated x using the following path (sequence of hidden states)?
  π = S1 S1 S2 S2 S2 S2 S1 S1 S1
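As a sanity check, this probability can be computed directly from the definition of P(x, π). The sketch below is not part of the slides; it hard-codes the example's parameters in Python (all names are illustrative):

```python
# Parameters from the example HMM (states S1, S2; no explicit end state).
init = {"S1": 0.5, "S2": 0.5}
trans = {("S1", "S1"): 0.6, ("S1", "S2"): 0.4,
         ("S2", "S1"): 0.5, ("S2", "S2"): 0.5}
emit = {"S1": {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2},
        "S2": {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}}

def path_probability(x, pi):
    """P(x, pi) = init[pi_1] * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}."""
    p = init[pi[0]]
    for i, (sym, state) in enumerate(zip(x, pi)):
        p *= emit[state][sym]          # emission at position i
        if i + 1 < len(pi):
            p *= trans[(state, pi[i + 1])]  # transition to the next state
    return p

x = "GGCACTGAA"
pi = ["S1", "S1", "S2", "S2", "S2", "S2", "S1", "S1", "S1"]
p = path_probability(x, pi)  # about 7.0e-9
```

Multiplying the nine emissions (2.592e-6) by the initial probability and eight transitions (0.0027) gives P(x, π) ≈ 6.9984 × 10⁻⁹.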
Problem 2 – Decoding
• Use the Viterbi algorithm to compute the most likely path generating the sequence x = GGCACTGAA using the HMM in the example.
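A minimal Viterbi sketch (not from the slides) for this example, with a brute-force check over all 2⁹ paths to confirm the maximum; names and structure are illustrative assumptions:

```python
from itertools import product

init = {"S1": 0.5, "S2": 0.5}
trans = {("S1", "S1"): 0.6, ("S1", "S2"): 0.4,
         ("S2", "S1"): 0.5, ("S2", "S2"): 0.5}
emit = {"S1": {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2},
        "S2": {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}}
states = ["S1", "S2"]

def viterbi(x):
    """Most probable path and its joint probability P(x, pi*)."""
    # s[k]: best probability over paths for the prefix ending in state k.
    s = {k: init[k] * emit[k][x[0]] for k in states}
    back = []  # back[i][l] = best predecessor of state l at position i+1
    for sym in x[1:]:
        ptr, new_s = {}, {}
        for l in states:
            best = max(states, key=lambda k: s[k] * trans[(k, l)])
            ptr[l] = best
            new_s[l] = s[best] * trans[(best, l)] * emit[l][sym]
        back.append(ptr)
        s = new_s
    last = max(states, key=lambda k: s[k])
    path = [last]
    for ptr in reversed(back):   # trace back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1], s[last]

def joint(x, pi):
    p = init[pi[0]] * emit[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[(pi[i - 1], pi[i])] * emit[pi[i]][x[i]]
    return p

x = "GGCACTGAA"
path, p_star = viterbi(x)
# Brute force over all |Q|^n = 2^9 paths confirms the maximum.
brute = max(joint(x, list(pi)) for pi in product(states, repeat=len(x)))
```

The brute-force comparison is only feasible because the example is tiny; Viterbi does the same job in O(n·|Q|²).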
Forward Algorithm
• For Markov chains we can calculate the probability of a sequence, P(x).
• How do we calculate this probability for an HMM?
• We must sum the probabilities over all possible paths:

  P(x) = Σ_π P(x, π)
Forward Algorithm
• Define f_{k,i} (forward probability) as the probability of emitting the prefix x1…xi and reaching the state πi = k:
  f_{k,i} = P(x1…xi , πi = k)
• The recurrence for the forward algorithm:
  f_{l,i+1} = e_l(xi+1) · Σ_{k∈Q} f_{k,i} · a_kl
Forward algorithm
• Initialization:
  • f_{begin,0} = 1
  • f_{k,0} = 0 for k ≠ begin
• For each i = 1,...,L calculate:
  f_{l,i} = e_l(xi) · Σ_k f_{k,i−1} · a_kl
• Termination:
  P(x) = Σ_k f_{k,L} · a_{k,end}
Example
• Consider an HMM that includes two hidden states (S1 and S2).
• The initial probabilities for both S1 and S2 are equal to 0.5, while the transition probabilities are:
  a_{S1S1} = 0.6; a_{S2S2} = 0.5; a_{S1S2} = 0.4; a_{S2S1} = 0.5
• Nucleotides T, C, G, A are emitted with probabilities eA = eT = 0.3 and eC = eG = 0.2 from state S1, and eA = eT = 0.2 and eC = eG = 0.3 from state S2.
Problem 3 – Compute P(x)
• Use the Forward algorithm to compute the probability of a sequence x, P(x), using the HMM in the example.
  x = GGC
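A possible Python sketch of the forward algorithm for this example (there is no explicit end state here, so P(x) is taken as the sum of the final forward values); the parameter layout is an assumption, not from the slides:

```python
from itertools import product

init = {"S1": 0.5, "S2": 0.5}
trans = {("S1", "S1"): 0.6, ("S1", "S2"): 0.4,
         ("S2", "S1"): 0.5, ("S2", "S2"): 0.5}
emit = {"S1": {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2},
        "S2": {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}}
states = ["S1", "S2"]

def forward(x):
    """P(x) summed over all paths, in O(n * |Q|^2) instead of O(|Q|^n)."""
    f = {k: init[k] * emit[k][x[0]] for k in states}
    for sym in x[1:]:
        f = {l: emit[l][sym] * sum(f[k] * trans[(k, l)] for k in states)
             for l in states}
    return sum(f.values())  # no end state: P(x) = sum_k f_{k,L}

def path_prob(x, pi):
    """P(x, pi) for a single path, for a brute-force cross-check."""
    p = init[pi[0]] * emit[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[(pi[i - 1], pi[i])] * emit[pi[i]][x[i]]
    return p

px = forward("GGC")  # 0.015105
```

Working the recurrence by hand gives f at the last position of (0.00669, 0.008415), so P(GGC) = 0.015105, and summing P(x, π) over all 2³ paths agrees.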
Forward-Backward Problem
Given: a sequence of coin tosses generated by an HMM. Goal: find the probability that the dealer was using a biased coin at a particular time.
Backward Algorithm • However, forward probability is not the only
factor affecting P(πi = k|x).
• The sequence of transitions and emissions that the HMM undergoes between πi+1 and πn also affects P(πi = k|x).
[diagram: forward probabilities cover x1…xi; backward probabilities cover xi+1…xn]
Backward Algorithm (cont’d)
• Define the backward probability b_{k,i} as the probability of emitting the suffix xi+1…xn given that πi = k.
• The recurrence for the backward algorithm:
  b_{k,i} = Σ_{l∈Q} e_l(xi+1) · b_{l,i+1} · a_kl
Backward-Forward Algorithm
• The probability that the dealer used a biased coin at any moment i:
  P(πi = k | x) = P(x, πi = k) / P(x) = f_{k,i} · b_{k,i} / P(x)

P(x) is the sum of P(x, πi = k) over all k (the same for every position i).
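The forward-backward computation above can be sketched as follows, reusing the earlier example HMM's parameters (the implementation details are illustrative, not from the slides):

```python
init = {"S1": 0.5, "S2": 0.5}
trans = {("S1", "S1"): 0.6, ("S1", "S2"): 0.4,
         ("S2", "S1"): 0.5, ("S2", "S2"): 0.5}
emit = {"S1": {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2},
        "S2": {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}}
states = ["S1", "S2"]

def forward_table(x):
    """f[i][k] = P(x_1..x_{i+1}, state k at position i) (0-based i)."""
    f = [{k: init[k] * emit[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({l: emit[l][sym] * sum(f[-1][k] * trans[(k, l)]
                                        for k in states) for l in states})
    return f

def backward_table(x):
    """b[i][k] = P(suffix after position i | state k at position i)."""
    b = [{k: 1.0 for k in states}]  # b = 1 at the last position
    for i in range(len(x) - 2, -1, -1):
        b.insert(0, {k: sum(trans[(k, l)] * emit[l][x[i + 1]] * b[0][l]
                            for l in states) for k in states})
    return b

def posterior(x):
    """P(pi_i = k | x) = f_{k,i} * b_{k,i} / P(x) at every position i."""
    f, b = forward_table(x), backward_table(x)
    px = sum(f[-1][k] for k in states)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]
```

At each position the posteriors over the states must sum to 1, which is a useful built-in sanity check.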
Example
• Consider an HMM with two hidden states, S1 and S2. The starting probabilities are 0.4 for S1 and 0.6 for S2, and the transition (T) and emission (E) probability matrices are as given on the slide (not reproduced in this transcript).
• Compute the most likely path generating the sequence X = TAC.
• What is the probability of generating the sequence X = TACG and being at state S2 when generating the symbol C?
HMM Parameter Estimation
• So far, we have assumed that the transition and emission probabilities are known.
• However, in most HMM applications the probabilities are not known, and they are very hard to estimate.
HMM Parameter Estimation Problem
• Given
  • HMM with states and alphabet (emission characters)
  • Independent training sequences x1, …, xm
• Find HMM parameters Θ (that is, a_kl, e_k(b)) that maximize P(x1, …, xm | Θ), the joint probability of the training sequences.
Maximize the likelihood
• P(x1, …, xm | Θ) as a function of Θ is called the likelihood of the model.
• The training sequences are assumed independent, therefore
  P(x1, …, xm | Θ) = Π_i P(xi | Θ)
• The parameter estimation problem seeks the Θ that realizes
  max_Θ Π_i P(xi | Θ)
• In practice the log likelihood is computed to avoid underflow errors.
Two situations
• Known paths for the training sequences
  • CpG islands marked on the training sequences
  • One evening the casino dealer allows us to see when he changes dice
• Unknown paths
  • CpG islands are not marked
  • We do not see when the casino dealer changes dice
Known paths
• A_kl = number of times the transition k → l is taken in the training sequences
• E_k(b) = number of times b is emitted from state k in the training sequences
• Compute a_kl and e_k(b) as maximum likelihood estimators:

  a_kl = A_kl / Σ_{l'} A_{kl'}
  e_k(b) = E_k(b) / Σ_{b'} E_k(b')
Pseudocounts
• Some state k may not appear in any of the training sequences. This means A_kl = 0 for every state l, and a_kl cannot be computed with the given equation.
• To avoid this overfitting, use predetermined pseudocounts r_kl and r_k(b):
  A_kl = (# of transitions k → l) + r_kl
  E_k(b) = (# of emissions of b from k) + r_k(b)
• The pseudocounts reflect our prior biases about the probability values.
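A possible sketch of maximum likelihood estimation with pseudocounts, given known paths; the function signature, pseudocount values, and alphabet here are illustrative assumptions:

```python
from collections import defaultdict

states = ["S1", "S2"]
alphabet = "ACGT"

def estimate(training, r_trans=1.0, r_emit=1.0):
    """ML estimates of a_kl and e_k(b) from (sequence, path) pairs,
    smoothed with pseudocounts r_trans = r_kl and r_emit = r_k(b)."""
    A = defaultdict(float)
    E = defaultdict(float)
    for x, pi in training:
        for i in range(len(x)):
            E[(pi[i], x[i])] += 1              # emission count
            if i + 1 < len(x):
                A[(pi[i], pi[i + 1])] += 1     # transition count
    # Pseudocounts keep states unseen in training from getting a_kl = 0/0.
    a, e = {}, {}
    for k in states:
        row = {l: A[(k, l)] + r_trans for l in states}
        total = sum(row.values())
        for l in states:
            a[(k, l)] = row[l] / total
        erow = {b: E[(k, b)] + r_emit for b in alphabet}
        etotal = sum(erow.values())
        for b in alphabet:
            e[(k, b)] = erow[b] / etotal
    return a, e
```

Each row of the resulting transition and emission tables sums to 1, even for a state that never appears in the training data.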
Unknown paths: Viterbi training
Idea: use Viterbi decoding to compute the most probable path for each training sequence x.
Start with some guess for the initial parameters and compute π*, the most probable path for x under those parameters.
Iterate until π* no longer changes:
1. Determine A_kl and E_k(b) as before
2. Compute new parameters a_kl and e_k(b) using the same formulas as before
3. Compute a new π* for x and the current parameters
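The three steps above can be sketched as a small Viterbi-training loop; everything below (initial guesses, pseudocount value, stopping rule) is an illustrative assumption, not prescribed by the slides:

```python
states = ["S1", "S2"]
alphabet = "ACGT"

def viterbi_path(x, init, trans, emit):
    """Most probable state path for x under the current parameters."""
    s = {k: init[k] * emit[k][x[0]] for k in states}
    back = []
    for sym in x[1:]:
        ptr, ns = {}, {}
        for l in states:
            best = max(states, key=lambda k: s[k] * trans[(k, l)])
            ptr[l] = best
            ns[l] = s[best] * trans[(best, l)] * emit[l][sym]
        back.append(ptr)
        s = ns
    path = [max(states, key=lambda k: s[k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def reestimate(pairs, r=1.0):
    """Step 1 + 2: count A_kl, E_k(b) from decoded paths, pseudocount r."""
    A = {(k, l): r for k in states for l in states}
    E = {(k, b): r for k in states for b in alphabet}
    for x, pi in pairs:
        for i in range(len(x)):
            E[(pi[i], x[i])] += 1
            if i + 1 < len(x):
                A[(pi[i], pi[i + 1])] += 1
    trans = {(k, l): A[(k, l)] / sum(A[(k, m)] for m in states)
             for k in states for l in states}
    emit = {k: {b: E[(k, b)] / sum(E[(k, c)] for c in alphabet)
                for b in alphabet} for k in states}
    return trans, emit

def viterbi_training(seqs, max_iter=100):
    # Uniform initial guess ("art and experience, not science").
    init = {k: 1.0 / len(states) for k in states}
    trans = {(k, l): 0.5 for k in states for l in states}
    emit = {k: {b: 0.25 for b in alphabet} for k in states}
    paths = None
    for _ in range(max_iter):
        new_paths = [viterbi_path(x, init, trans, emit) for x in seqs]
        if new_paths == paths:   # step 3: stop when pi* no longer changes
            break
        paths = new_paths
        trans, emit = reestimate(list(zip(seqs, paths)))
    return trans, emit, paths
```

Because there are finitely many paths and the parameters are determined by the current π*, the loop reaches a fixed point (here also capped by max_iter).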
Viterbi training analysis
• The algorithm converges precisely: there are finitely many possible paths, and the new parameters are uniquely determined by the current π*. There may be several paths for x with the same probability, so the new π* must be compared with all previous paths having the highest probability.
• It does not maximize the likelihood Π_x P(x | Θ), but rather the contribution to the likelihood of the most probable path, Π_x P(x | Θ, π*).
• In general it performs less well than Baum-Welch.
Unknown paths: Baum-Welch
Idea:
1. Guess initial values for the parameters. (art and experience, not science)
2. Estimate new (better) values for the parameters. (how?)
3. Repeat until a stopping criterion is met. (what criterion?)
Better values for parameters
We would need the A_kl and E_k(b) values, but we cannot count them (the path is unknown), and we do not want to use a most probable path.
Instead, for all states k, l, each symbol b, and each training sequence x, compute A_kl and E_k(b) as expected values, given the current parameters.
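Computing A_kl and E_k(b) as expected values can be sketched with the forward and backward tables; this reuses the earlier example's parameters and is an illustration, not the slides' code:

```python
init = {"S1": 0.5, "S2": 0.5}
trans = {("S1", "S1"): 0.6, ("S1", "S2"): 0.4,
         ("S2", "S1"): 0.5, ("S2", "S2"): 0.5}
emit = {"S1": {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2},
        "S2": {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}}
states = ["S1", "S2"]

def forward_table(x):
    f = [{k: init[k] * emit[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({l: emit[l][sym] * sum(f[-1][k] * trans[(k, l)]
                                        for k in states) for l in states})
    return f

def backward_table(x):
    b = [{k: 1.0 for k in states}]
    for i in range(len(x) - 2, -1, -1):
        b.insert(0, {k: sum(trans[(k, l)] * emit[l][x[i + 1]] * b[0][l]
                            for l in states) for k in states})
    return b

def expected_counts(x):
    """Expected A_kl and E_k(b) under the posterior over paths."""
    f, b = forward_table(x), backward_table(x)
    px = sum(f[-1][k] for k in states)
    E = {(k, c): 0.0 for k in states for c in "ACGT"}
    A = {(k, l): 0.0 for k in states for l in states}
    for i in range(len(x)):
        for k in states:
            # P(pi_i = k | x): posterior state probability
            E[(k, x[i])] += f[i][k] * b[i][k] / px
    for i in range(len(x) - 1):
        for k in states:
            for l in states:
                # P(pi_i = k, pi_{i+1} = l | x): posterior transition prob.
                A[(k, l)] += (f[i][k] * trans[(k, l)]
                              * emit[l][x[i + 1]] * b[i + 1][l] / px)
    return A, E
```

A useful check: summed over all states and symbols, the expected emission counts equal the sequence length n, and the expected transition counts equal n − 1, because the posteriors at each position sum to 1.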