www.bioalgorithms.info An Introduction to Bioinformatics Algorithms
Slides revised and adapted to
Computational Biology
IST
2015/2016
Ana Teresa Freitas
Hidden Markov Models
Three classic HMM problems
• Evaluation: Given a model and an output sequence, what is the probability that the model generated that output?
• Decoding: Given a model and an output sequence, what is the most likely state sequence (path) through the model that produced that sequence?
• Learning: Given a model and a set of observed sequences, what should the model parameters be so that the model has a high probability of generating those sequences?
Evaluation Problem
HMM Parameters
• Σ: set of emission characters.
  Ex.: Σ = {H, T} or {1, 0} for coin tossing; Σ = {A, C, T, G} for DNA
• Q: set of hidden states, each emitting symbols from Σ.
  Ex.: Q = {F, B} for coin tossing
HMM Parameters (cont)
• A = (a_kl): a |Q| × |Q| matrix of the probability of changing from state k to state l.
  Ex.: a_FF = 0.9, a_FB = 0.1, a_BF = 0.1, a_BB = 0.9
• E = (e_k(b)): a |Q| × |Σ| matrix of the probability of emitting symbol b while in state k.
  Ex.: e_F(0) = ½, e_F(1) = ½, e_B(0) = ¼, e_B(1) = ¾
P(x, π) Calculation
• P(x, π): probability that sequence x was generated by the path π:

  P(x, π) = P(π0 → π1) · Π_{i=1..n} P(xi | πi) · P(πi → πi+1)
          = a_{π0,π1} · Π_{i=1..n} e_{πi}(xi) · a_{πi,πi+1}

• π0 and πn+1 represent the fictitious initial and terminal states (begin and end).
Decoding Problem
• Goal: Find an optimal path of hidden states given the observations.
• Input: Sequence of observations x = x1…xn generated by an HMM M(Σ, Q, A, E).
• Output: A path π* that maximizes P(x, π) over all possible paths π.
Edit Graph for Decoding Problem
Building Manhattan for Decoding Problem
• Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem.
• Every choice of π = π1… πn corresponds to a path in the graph.
• The only valid direction in the graph is eastward.
• This graph has |Q|²(n−1) edges.
Decoding Problem vs. Alignment Problem
Valid directions in the alignment problem.
Valid directions in the decoding problem.
Decoding Problem as Finding a Longest Path in a DAG
• The Decoding Problem is reduced to finding a longest path in the directed acyclic graph (DAG) above.
• Note: the length of the path is defined as the product of its edges’ weights, not the sum.
Decoding Problem (cont’d)
• Every path π in the graph has probability P(x, π).
• The Viterbi algorithm finds the path that maximizes P(x, π) among all possible paths.
• The Viterbi algorithm runs in O(n|Q|²) time.
Decoding Problem: weights of edges

  (k, i) --w--> (l, i+1)

Since

  P(x, π) = Π_{i=0..n−1} e_{πi+1}(xi+1) · a_{πi,πi+1}

the i-th term of the product is e_{πi+1}(xi+1) · a_{πi,πi+1}. Hence the weight of the edge from (k, i) to (l, i+1) is

  w = e_l(xi+1) · a_kl   for πi = k, πi+1 = l
Decoding Problem and Dynamic Programming

s_{k,i} is the probability of the most probable path for the prefix (x1,...,xi) that ends in state k.

  s_{l,i+1} = max_{k∈Q} { s_{k,i} · weight of the edge between (k,i) and (l,i+1) }
            = max_{k∈Q} { s_{k,i} · a_kl · e_l(xi+1) }
            = e_l(xi+1) · max_{k∈Q} { s_{k,i} · a_kl }
Decoding Problem (cont’d)
• Initialization:
  • s_{begin,0} = 1
  • s_{k,0} = 0 for k ≠ begin
• For each i = 0,...,n−1 and for each l ∈ Q recursively calculate:
  s_{l,i+1} = e_l(xi+1) · max_{k∈Q} { s_{k,i} · a_kl }
• Let π* be the optimal path. Then
  P(x, π*) = max_{k∈Q} { s_{k,n} · a_{k,end} }
Viterbi Algorithm
• The value of the product can become extremely small, which leads to underflow.
• To avoid underflow, use log values instead:
  s_{l,i+1} = log e_l(xi+1) + max_{k∈Q} { s_{k,i} + log a_kl }
• Initialization:
  • s_{begin,0} = 0
  • s_{k,0} = −∞ for k ≠ begin
• The score of the best path:
  Score(x, π*) = max_{k∈Q} { s_{k,n} + log a_{k,end} }
Viterbi Algorithm
• Complexity:
  • We calculate the values of O(|Q|·L) cells of the matrix V
  • We spend O(|Q|) operations per cell
  • Overall time complexity: O(L·|Q|²)
  • Space complexity: O(L·|Q|)
Example
• Consider an HMM that includes two hidden states (S1 and S2).
• The initial probabilities for both S1 and S2 are equal to 0.5, while the transition probabilities are:
  a_{S1S1} = 0.6; a_{S2S2} = 0.5; a_{S1S2} = 0.4; a_{S2S1} = 0.5
• Nucleotides T, C, G, A are emitted with probabilities eA = eT = 0.3 and eC = eG = 0.2 from state S1, and eA = eT = 0.2 and eC = eG = 0.3 from state S2.
Problem 1 – Evaluation
• Given the HMM described in the example and the output sequence x = GGCACTGAA, what is the probability that the HMM generated x using the following path (sequence of hidden states)?
  π = S1 S1 S2 S2 S2 S2 S1 S1 S1
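As a sanity check, this probability can be computed directly from the definition of P(x, π). The sketch below is not part of the slides; it hard-codes the example's parameters in Python (all names are illustrative):

```python
# Parameters from the example HMM (states S1, S2; no explicit end state).
init = {"S1": 0.5, "S2": 0.5}
trans = {("S1", "S1"): 0.6, ("S1", "S2"): 0.4,
         ("S2", "S1"): 0.5, ("S2", "S2"): 0.5}
emit = {"S1": {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2},
        "S2": {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}}

def path_probability(x, pi):
    """P(x, pi) = init[pi_1] * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}."""
    p = init[pi[0]]
    for i, (sym, state) in enumerate(zip(x, pi)):
        p *= emit[state][sym]          # emission at position i
        if i + 1 < len(pi):
            p *= trans[(state, pi[i + 1])]  # transition to the next state
    return p

x = "GGCACTGAA"
pi = ["S1", "S1", "S2", "S2", "S2", "S2", "S1", "S1", "S1"]
p = path_probability(x, pi)  # about 7.0e-9
```

Multiplying the nine emissions (2.592e-6) by the initial probability and eight transitions (0.0027) gives P(x, π) ≈ 6.9984 × 10⁻⁹.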
Problem 2 – Decoding
• Use the Viterbi algorithm to compute the most likely path generating the sequence x = GGCACTGAA using the HMM in the example.
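A minimal Viterbi sketch (not from the slides) for this example, with a brute-force check over all 2⁹ paths to confirm the maximum; names and structure are illustrative assumptions:

```python
from itertools import product

init = {"S1": 0.5, "S2": 0.5}
trans = {("S1", "S1"): 0.6, ("S1", "S2"): 0.4,
         ("S2", "S1"): 0.5, ("S2", "S2"): 0.5}
emit = {"S1": {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2},
        "S2": {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}}
states = ["S1", "S2"]

def viterbi(x):
    """Most probable path and its joint probability P(x, pi*)."""
    # s[k]: best probability over paths for the prefix ending in state k.
    s = {k: init[k] * emit[k][x[0]] for k in states}
    back = []  # back[i][l] = best predecessor of state l at position i+1
    for sym in x[1:]:
        ptr, new_s = {}, {}
        for l in states:
            best = max(states, key=lambda k: s[k] * trans[(k, l)])
            ptr[l] = best
            new_s[l] = s[best] * trans[(best, l)] * emit[l][sym]
        back.append(ptr)
        s = new_s
    last = max(states, key=lambda k: s[k])
    path = [last]
    for ptr in reversed(back):   # trace back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1], s[last]

def joint(x, pi):
    p = init[pi[0]] * emit[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[(pi[i - 1], pi[i])] * emit[pi[i]][x[i]]
    return p

x = "GGCACTGAA"
path, p_star = viterbi(x)
# Brute force over all |Q|^n = 2^9 paths confirms the maximum.
brute = max(joint(x, list(pi)) for pi in product(states, repeat=len(x)))
```

The brute-force comparison is only feasible because the example is tiny; Viterbi does the same job in O(n·|Q|²).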
Forward Algorithm
• For Markov chains we can calculate the probability of a sequence, P(x).
• How do we calculate this probability for an HMM?
• We must sum the probabilities over all possible paths:

  P(x) = Σ_π P(x, π)
Forward Algorithm
• Define f_{k,i} (forward probability) as the probability of emitting the prefix x1…xi and reaching the state πi = k:
  f_{k,i} = P(x1…xi , πi = k)
• The recurrence for the forward algorithm:
  f_{l,i+1} = e_l(xi+1) · Σ_{k∈Q} f_{k,i} · a_kl
Forward algorithm
• Initialization:
  • f_{begin,0} = 1
  • f_{k,0} = 0 for k ≠ begin
• For each i = 1,...,L calculate:
  f_{l,i} = e_l(xi) · Σ_k f_{k,i−1} · a_kl
• Termination:
  P(x) = Σ_k f_{k,L} · a_{k,end}
Example
• Consider an HMM that includes two hidden states (S1 and S2).
• The initial probabilities for both S1 and S2 are equal to 0.5, while the transition probabilities are:
  a_{S1S1} = 0.6; a_{S2S2} = 0.5; a_{S1S2} = 0.4; a_{S2S1} = 0.5
• Nucleotides T, C, G, A are emitted with probabilities eA = eT = 0.3 and eC = eG = 0.2 from state S1, and eA = eT = 0.2 and eC = eG = 0.3 from state S2.
Problem 3 – Compute P(x)
• Use the Forward algorithm to compute the probability of a sequence x, P(x), using the HMM in the example.
  x = GGC
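A possible Python sketch of the forward algorithm for this example (there is no explicit end state here, so P(x) is taken as the sum of the final forward values); the parameter layout is an assumption, not from the slides:

```python
from itertools import product

init = {"S1": 0.5, "S2": 0.5}
trans = {("S1", "S1"): 0.6, ("S1", "S2"): 0.4,
         ("S2", "S1"): 0.5, ("S2", "S2"): 0.5}
emit = {"S1": {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2},
        "S2": {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}}
states = ["S1", "S2"]

def forward(x):
    """P(x) summed over all paths, in O(n * |Q|^2) instead of O(|Q|^n)."""
    f = {k: init[k] * emit[k][x[0]] for k in states}
    for sym in x[1:]:
        f = {l: emit[l][sym] * sum(f[k] * trans[(k, l)] for k in states)
             for l in states}
    return sum(f.values())  # no end state: P(x) = sum_k f_{k,L}

def path_prob(x, pi):
    """P(x, pi) for a single path, for a brute-force cross-check."""
    p = init[pi[0]] * emit[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[(pi[i - 1], pi[i])] * emit[pi[i]][x[i]]
    return p

px = forward("GGC")  # 0.015105
```

Working the recurrence by hand gives f at the last position of (0.00669, 0.008415), so P(GGC) = 0.015105, and summing P(x, π) over all 2³ paths agrees.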
Forward-Backward Problem
Given: a sequence of coin tosses generated by an HMM. Goal: find the probability that the dealer was using a biased coin at a particular time.
Backward Algorithm • However, forward probability is not the only
factor affecting P(πi = k|x).
• The sequence of transitions and emissions that the HMM undergoes between πi+1 and πn also affects P(πi = k|x).
[diagram: forward probabilities cover x1…xi; backward probabilities cover xi+1…xn]
Backward Algorithm (cont’d)
• Define the backward probability b_{k,i} as the probability of emitting the suffix xi+1…xn given that πi = k.
• The recurrence for the backward algorithm:
  b_{k,i} = Σ_{l∈Q} e_l(xi+1) · b_{l,i+1} · a_kl
Backward-Forward Algorithm
• The probability that the dealer used a biased coin at any moment i:
  P(πi = k | x) = P(x, πi = k) / P(x) = f_{k,i} · b_{k,i} / P(x)

P(x) is the sum of P(x, πi = k) over all k (the same for every position i).
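The forward-backward computation above can be sketched as follows, reusing the earlier example HMM's parameters (the implementation details are illustrative, not from the slides):

```python
init = {"S1": 0.5, "S2": 0.5}
trans = {("S1", "S1"): 0.6, ("S1", "S2"): 0.4,
         ("S2", "S1"): 0.5, ("S2", "S2"): 0.5}
emit = {"S1": {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2},
        "S2": {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}}
states = ["S1", "S2"]

def forward_table(x):
    """f[i][k] = P(x_1..x_{i+1}, state k at position i) (0-based i)."""
    f = [{k: init[k] * emit[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({l: emit[l][sym] * sum(f[-1][k] * trans[(k, l)]
                                        for k in states) for l in states})
    return f

def backward_table(x):
    """b[i][k] = P(suffix after position i | state k at position i)."""
    b = [{k: 1.0 for k in states}]  # b = 1 at the last position
    for i in range(len(x) - 2, -1, -1):
        b.insert(0, {k: sum(trans[(k, l)] * emit[l][x[i + 1]] * b[0][l]
                            for l in states) for k in states})
    return b

def posterior(x):
    """P(pi_i = k | x) = f_{k,i} * b_{k,i} / P(x) at every position i."""
    f, b = forward_table(x), backward_table(x)
    px = sum(f[-1][k] for k in states)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]
```

At each position the posteriors over the states must sum to 1, which is a useful built-in sanity check.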
Example
• Consider an HMM with two hidden states, S1 and S2. The starting probabilities are 0.4 for S1 and 0.6 for S2, and the transition (T) and emission (E) probability matrices are as given on the slide (not reproduced in this transcript).
• Compute the most likely path generating the sequence X = TAC.
• What is the probability of generating the sequence X = TACG and being at state S2 when generating the symbol C?
HMM Parameter Estimation
• So far, we have assumed that the transition and emission probabilities are known.
• However, in most HMM applications the probabilities are not known, and they are very hard to estimate.
HMM Parameter Estimation Problem
• Given
  • HMM with states and alphabet (emission characters)
  • Independent training sequences x1, …, xm
• Find HMM parameters Θ (that is, a_kl, e_k(b)) that maximize P(x1, …, xm | Θ), the joint probability of the training sequences.
Maximize the likelihood
• P(x1, …, xm | Θ) as a function of Θ is called the likelihood of the model.
• The training sequences are assumed independent, therefore
  P(x1, …, xm | Θ) = Π_i P(xi | Θ)
• The parameter estimation problem seeks the Θ that realizes
  max_Θ Π_i P(xi | Θ)
• In practice the log likelihood is computed to avoid underflow errors.
Two situations
• Known paths for the training sequences
  • CpG islands marked on the training sequences
  • One evening the casino dealer allows us to see when he changes dice
• Unknown paths
  • CpG islands are not marked
  • We do not see when the casino dealer changes dice
Known paths
• A_kl = number of times the transition k → l is taken in the training sequences
• E_k(b) = number of times b is emitted from state k in the training sequences
• Compute a_kl and e_k(b) as maximum likelihood estimators:

  a_kl = A_kl / Σ_{l'} A_{kl'}
  e_k(b) = E_k(b) / Σ_{b'} E_k(b')
Pseudocounts
• Some state k may not appear in any of the training sequences. This means A_kl = 0 for every state l, and a_kl cannot be computed with the given equation.
• To avoid this overfitting, use predetermined pseudocounts r_kl and r_k(b):
  A_kl = (# of transitions k → l) + r_kl
  E_k(b) = (# of emissions of b from k) + r_k(b)
• The pseudocounts reflect our prior biases about the probability values.
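A possible sketch of maximum likelihood estimation with pseudocounts, given known paths; the function signature, pseudocount values, and alphabet here are illustrative assumptions:

```python
from collections import defaultdict

states = ["S1", "S2"]
alphabet = "ACGT"

def estimate(training, r_trans=1.0, r_emit=1.0):
    """ML estimates of a_kl and e_k(b) from (sequence, path) pairs,
    smoothed with pseudocounts r_trans = r_kl and r_emit = r_k(b)."""
    A = defaultdict(float)
    E = defaultdict(float)
    for x, pi in training:
        for i in range(len(x)):
            E[(pi[i], x[i])] += 1              # emission count
            if i + 1 < len(x):
                A[(pi[i], pi[i + 1])] += 1     # transition count
    # Pseudocounts keep states unseen in training from getting a_kl = 0/0.
    a, e = {}, {}
    for k in states:
        row = {l: A[(k, l)] + r_trans for l in states}
        total = sum(row.values())
        for l in states:
            a[(k, l)] = row[l] / total
        erow = {b: E[(k, b)] + r_emit for b in alphabet}
        etotal = sum(erow.values())
        for b in alphabet:
            e[(k, b)] = erow[b] / etotal
    return a, e
```

Each row of the resulting transition and emission tables sums to 1, even for a state that never appears in the training data.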
Unknown paths: Viterbi training
Idea: use Viterbi decoding to compute the most probable path for each training sequence x.
Start with some guess for the initial parameters and compute π*, the most probable path for x under those parameters.
Iterate until π* no longer changes:
1. Determine A_kl and E_k(b) as before
2. Compute new parameters a_kl and e_k(b) using the same formulas as before
3. Compute a new π* for x and the current parameters
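The three steps above can be sketched as a small Viterbi-training loop; everything below (initial guesses, pseudocount value, stopping rule) is an illustrative assumption, not prescribed by the slides:

```python
states = ["S1", "S2"]
alphabet = "ACGT"

def viterbi_path(x, init, trans, emit):
    """Most probable state path for x under the current parameters."""
    s = {k: init[k] * emit[k][x[0]] for k in states}
    back = []
    for sym in x[1:]:
        ptr, ns = {}, {}
        for l in states:
            best = max(states, key=lambda k: s[k] * trans[(k, l)])
            ptr[l] = best
            ns[l] = s[best] * trans[(best, l)] * emit[l][sym]
        back.append(ptr)
        s = ns
    path = [max(states, key=lambda k: s[k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def reestimate(pairs, r=1.0):
    """Step 1 + 2: count A_kl, E_k(b) from decoded paths, pseudocount r."""
    A = {(k, l): r for k in states for l in states}
    E = {(k, b): r for k in states for b in alphabet}
    for x, pi in pairs:
        for i in range(len(x)):
            E[(pi[i], x[i])] += 1
            if i + 1 < len(x):
                A[(pi[i], pi[i + 1])] += 1
    trans = {(k, l): A[(k, l)] / sum(A[(k, m)] for m in states)
             for k in states for l in states}
    emit = {k: {b: E[(k, b)] / sum(E[(k, c)] for c in alphabet)
                for b in alphabet} for k in states}
    return trans, emit

def viterbi_training(seqs, max_iter=100):
    # Uniform initial guess ("art and experience, not science").
    init = {k: 1.0 / len(states) for k in states}
    trans = {(k, l): 0.5 for k in states for l in states}
    emit = {k: {b: 0.25 for b in alphabet} for k in states}
    paths = None
    for _ in range(max_iter):
        new_paths = [viterbi_path(x, init, trans, emit) for x in seqs]
        if new_paths == paths:   # step 3: stop when pi* no longer changes
            break
        paths = new_paths
        trans, emit = reestimate(list(zip(seqs, paths)))
    return trans, emit, paths
```

Because there are finitely many paths and the parameters are determined by the current π*, the loop reaches a fixed point (here also capped by max_iter).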
Viterbi training analysis
• The algorithm converges precisely: there are finitely many possible paths, and the new parameters are uniquely determined by the current π*. There may be several paths for x with the same probability, so the new π* must be compared with all previous paths having the highest probability.
• It does not maximize the likelihood Π_x P(x | Θ), but rather the contribution to the likelihood of the most probable path, Π_x P(x | Θ, π*).
• In general it performs less well than Baum-Welch.
Unknown paths: Baum-Welch
Idea:
1. Guess initial values for the parameters. (art and experience, not science)
2. Estimate new (better) values for the parameters. (how?)
3. Repeat until a stopping criterion is met. (what criterion?)
Better values for parameters
We would need the A_kl and E_k(b) values, but we cannot count them (the path is unknown), and we do not want to use a most probable path.
Instead, for all states k, l, each symbol b, and each training sequence x, compute A_kl and E_k(b) as expected values, given the current parameters.
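Computing A_kl and E_k(b) as expected values can be sketched with the forward and backward tables; this reuses the earlier example's parameters and is an illustration, not the slides' code:

```python
init = {"S1": 0.5, "S2": 0.5}
trans = {("S1", "S1"): 0.6, ("S1", "S2"): 0.4,
         ("S2", "S1"): 0.5, ("S2", "S2"): 0.5}
emit = {"S1": {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2},
        "S2": {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}}
states = ["S1", "S2"]

def forward_table(x):
    f = [{k: init[k] * emit[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({l: emit[l][sym] * sum(f[-1][k] * trans[(k, l)]
                                        for k in states) for l in states})
    return f

def backward_table(x):
    b = [{k: 1.0 for k in states}]
    for i in range(len(x) - 2, -1, -1):
        b.insert(0, {k: sum(trans[(k, l)] * emit[l][x[i + 1]] * b[0][l]
                            for l in states) for k in states})
    return b

def expected_counts(x):
    """Expected A_kl and E_k(b) under the posterior over paths."""
    f, b = forward_table(x), backward_table(x)
    px = sum(f[-1][k] for k in states)
    E = {(k, c): 0.0 for k in states for c in "ACGT"}
    A = {(k, l): 0.0 for k in states for l in states}
    for i in range(len(x)):
        for k in states:
            # P(pi_i = k | x): posterior state probability
            E[(k, x[i])] += f[i][k] * b[i][k] / px
    for i in range(len(x) - 1):
        for k in states:
            for l in states:
                # P(pi_i = k, pi_{i+1} = l | x): posterior transition prob.
                A[(k, l)] += (f[i][k] * trans[(k, l)]
                              * emit[l][x[i + 1]] * b[i + 1][l] / px)
    return A, E
```

A useful check: summed over all states and symbols, the expected emission counts equal the sequence length n, and the expected transition counts equal n − 1, because the posteriors at each position sum to 1.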