Hidden Markov Models - Stanford University
Hidden Markov Models

[Figure: HMM trellis - states 1…K at each position; a path through the states generates x1, x2, x3, …]
Motivating Example
• CpG sites (-cytosine-phosphate-guanine-)
§ C in CpG can be methylated → 5-methylcytosine
§ 70%-80% of CpG cytosines are methylated
http://helicase.pbworks.com/w/page/17605615/DNA%20Methylation
Motivating Example
http://www.nature.com/scitable/topicpage/the-role-of-methylation-in-gene-expression-1070
TSG: Tumor Suppressor Gene
Motivating Example
https://commons.wikimedia.org/wiki/File:Cpg_islands.svg
[Figure: CpG islands near a gene promoter vs. a typical region of the genome]
CpG islands: regions with a high frequency of CpG sites
The three main questions on HMMs
1. Decoding
GIVEN a HMM M, and a sequence x,
FIND the sequence π of states that maximizes P[ x, π | M ]

2. Evaluation
GIVEN a HMM M, and a sequence x,
FIND Prob[ x | M ]

3. Learning
GIVEN a HMM M, with unspecified transition/emission probs.,
and a sequence x,
FIND parameters θ = (ei(.), aij) that maximize P[ x | θ ]
Problem 1: Decoding
Find the most likely parse of a sequence
Decoding - Review
GIVEN x = x1x2……xN
FIND π = π1, ……, πN to maximize P[ x, π ]

π* = argmaxπ P[ x, π ]
   = argmaxπ a0π1 eπ1(x1) aπ1π2 …… aπN-1πN eπN(xN)

Dynamic Programming!
Vk(i) = max{π1…πi-1} P[ x1…xi-1, π1, …, πi-1, xi, πi = k ]
      = Prob. of most likely sequence of states ending at state πi = k
[Figure: HMM trellis - states 1…K at each position emitting x1, x2, x3, …]
Given that we end up in state k at step i, maximize product to the left and right
Decoding – Review
Inductive assumption: Given that for all states k, and for a fixed position i,
Vk(i) = max{π1…πi-1} P[ x1…xi-1, π1, …, πi-1, xi, πi = k ]

What is Vl(i+1)?

From definition,
Vl(i+1) = max{π1…πi} P[ x1…xi, π1, …, πi, xi+1, πi+1 = l ]
        = max{π1…πi} P(xi+1, πi+1 = l | x1…xi, π1, …, πi) P[ x1…xi, π1, …, πi ]
        = max{π1…πi} P(xi+1, πi+1 = l | πi) P[ x1…xi-1, π1, …, πi-1, xi, πi ]
        = maxk [ P(xi+1, πi+1 = l | πi = k) max{π1…πi-1} P[ x1…xi-1, π1, …, πi-1, xi, πi = k ] ]
        = maxk [ P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) Vk(i) ]
        = el(xi+1) maxk akl Vk(i)
The Viterbi Algorithm - Review
Input: x = x1……xN

Initialization:
V0(0) = 1 (0 is the imaginary first position)
Vk(0) = 0, for all k > 0

Iteration:
Vj(i) = ej(xi) × maxk akj Vk(i – 1)
Ptrj(i) = argmaxk akj Vk(i – 1)

Termination:
P(x, π*) = maxk Vk(N)

Traceback:
πN* = argmaxk Vk(N)
πi-1* = Ptrπi*(i)
The Viterbi Algorithm - Review
Similar to “aligning” a set of states to a sequence

Time: O(K²N)
Space: O(KN)

[Figure: K × N dynamic-programming table of values Vj(i); columns x1 x2 x3 ……… xN, rows states 1…K]
Viterbi Algorithm – Review
Underflows are a significant problem

P[ x1, …, xi, π1, …, πi ] = a0π1 aπ1π2 …… aπi-1πi eπ1(x1) …… eπi(xi)

These numbers become extremely small – underflow
Solution: Take the logs of all values

Vl(i) = log el(xi) + maxk [ Vk(i-1) + log akl ]
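As a concrete sketch, the log-space recurrence can be implemented directly. The dishonest-casino parameters below (stay probability 0.95, uniform start, loaded die emitting 6 with probability 1/2) are assumed from the running example, not fixed by the slides:

```python
import math

# Assumed dishonest-casino HMM: fair die (F) vs loaded die (L)
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}                       # assumed uniform start
trans = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2}}

def viterbi(x):
    """Log-space Viterbi: V_l(i) = log e_l(x_i) + max_k [V_k(i-1) + log a_kl]."""
    V = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    ptr = []
    for c in x[1:]:
        row, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[-1][k] + math.log(trans[k][l]))
            back[l] = best
            row[l] = math.log(emit[l][c]) + V[-1][best] + math.log(trans[best][l])
        V.append(row)
        ptr.append(back)
    last = max(states, key=lambda k: V[-1][k])     # traceback from best final state
    path = [last]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return "".join(reversed(path)), max(V[-1].values())

path, logp = viterbi("123456" * 3 + "666666" * 3)
```

On a fair-looking prefix followed by a run of sixes, the traceback recovers a fair block followed by a loaded block, as the next example analyzes.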
Example - Review
Let x be a long sequence with a portion of ~1/6 6’s, followed by a portion of ~1/2 6’s…

x = 123456123456…12345 6626364656…1626364656

Then, it is not hard to show that the optimal parse is (exercise):

FFF…………………...F LLL………………………...L

Six characters “123456” parsed as F, contribute 0.95^6 × (1/6)^6 = 1.6×10^-5
                 parsed as L, contribute 0.95^6 × (1/2)^1 × (1/10)^5 = 0.4×10^-5
“162636” parsed as F, contribute 0.95^6 × (1/6)^6 = 1.6×10^-5
         parsed as L, contribute 0.95^6 × (1/2)^3 × (1/10)^3 = 9.0×10^-5
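These per-window contributions are easy to verify numerically (0.95 is the assumed stay probability; the values match the slide up to rounding):

```python
# Contribution of each six-character window under the two parses
stay = 0.95 ** 6                                 # six transitions that keep the same state

f_123456 = stay * (1 / 6) ** 6                   # "123456" parsed as FFFFFF
l_123456 = stay * (1 / 2) ** 1 * (1 / 10) ** 5   # one six, five non-sixes, parsed as L
f_162636 = stay * (1 / 6) ** 6                   # "162636" parsed as FFFFFF
l_162636 = stay * (1 / 2) ** 3 * (1 / 10) ** 3   # three sixes, three non-sixes, parsed as L
```

A window with a single 6 is best explained by F, while a window where every other roll is a 6 is best explained by L.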
Problem 2: Evaluation
Find the likelihood a sequence is generated by the model
Generating a sequence by the model
Given a HMM, we can generate a sequence of length n as follows:
1. Start at state π1 according to prob a0π1
2. Emit letter x1 according to prob eπ1(x1)
3. Go to state π2 according to prob aπ1π2
4. … until emitting xn
[Figure: HMM trellis - from start state 0, transition with prob a02 to state 2, emit x1 with prob e2(x1), and so on through x2, x3, …, xn]
A couple of questions
Given a sequence x,
• What is the probability that x was generated by the model?
• Given a position i, what is the most likely state that emitted xi?
Example: the dishonest casino
Say x = 12341…23162616364616234112…21341

Most likely path: π = FF……F (too “unlikely” to transition F → L → F)
However: marked letters more likely to be L than unmarked letters

P(box: FFFFFFFFFFF) = (1/6)^11 × 0.95^12 = 2.76×10^-9 × 0.54 = 1.49×10^-9
P(box: LLLLLLLLLLL) = [ (1/2)^6 × (1/10)^5 ] × 0.95^10 × 0.05^2
                    = 1.56×10^-7 × 1.5×10^-3 = 0.23×10^-9
Evaluation
We will develop algorithms that allow us to compute:

P(x)           Probability of x given the model
P(xi…xj)       Probability of a substring of x given the model
P(πi = k | x)  “Posterior” probability that the ith state is k, given x
               A more refined measure of which states x may be in
The Forward Algorithm
We want to calculate P(x) = probability of x, given the HMM

Sum over all possible ways of generating x:
P(x) = Σπ P(x, π) = Σπ P(x | π) P(π)

To avoid summing over an exponential number of paths π, define
fk(i) = P(x1…xi, πi = k) (the forward probability)
“generate i first characters of x and end up in state k”
The Forward Algorithm – derivation
Define the forward probability: fk(i) = P(x1…xi, πi = k)
= Σπ1…πi-1 P(x1…xi-1, π1,…, πi-1, πi = k) ek(xi)
= Σl Σπ1…πi-2 P(x1…xi-1, π1,…, πi-2, πi-1 = l) alk ek(xi)
= Σl P(x1…xi-1, πi-1 = l) alk ek(xi)
= ek(xi) Σl fl(i – 1 ) alk
The Forward Algorithm
We can compute fk(i) for all k, i, using dynamic programming!

Initialization:
f0(0) = 1
fk(0) = 0, for all k > 0

Iteration:
fk(i) = ek(xi) Σl fl(i – 1) alk

Termination:
P(x) = Σk fk(N)
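A direct implementation of the recurrence, again with the assumed casino parameters. It works in plain probabilities, so it only suits short sequences; long ones need logs or rescaling, discussed later:

```python
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}                       # assumed uniform start (a_0k)
trans = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2}}

def forward(x):
    """P(x) via f_k(i) = e_k(x_i) * sum_l f_l(i-1) * a_lk."""
    f = {k: start[k] * emit[k][x[0]] for k in states}
    for c in x[1:]:
        f = {k: emit[k][c] * sum(f[l] * trans[l][k] for l in states)
             for k in states}
    return sum(f.values())

# Sanity check: P(x) summed over every length-2 sequence must equal 1
total = sum(forward(a + b) for a in "123456" for b in "123456")
```

The normalization check is a useful smoke test for any forward implementation.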
Relation between Forward and Viterbi
VITERBI
Initialization:
V0(0) = 1
Vk(0) = 0, for all k > 0
Iteration:
Vj(i) = ej(xi) maxk Vk(i – 1) akj
Termination:
P(x, π*) = maxk Vk(N)

FORWARD
Initialization:
f0(0) = 1
fk(0) = 0, for all k > 0
Iteration:
fl(i) = el(xi) Σk fk(i – 1) akl
Termination:
P(x) = Σk fk(N)
Motivation for the Backward Algorithm
We want to compute P(πi = k | x),
the probability distribution on the ith position, given x

We start by computing
P(πi = k, x) = P(x1…xi, πi = k, xi+1…xN)
             = P(x1…xi, πi = k) P(xi+1…xN | x1…xi, πi = k)
             = P(x1…xi, πi = k) P(xi+1…xN | πi = k)

Then, P(πi = k | x) = P(πi = k, x) / P(x)

Forward: fk(i) = P(x1…xi, πi = k)    Backward: bk(i) = P(xi+1…xN | πi = k)
The Backward Algorithm – derivation
Define the backward probability:
bk(i) = P(xi+1…xN | πi = k) “starting from ith state = k, generate rest of x”
= Σπi+1…πN P(xi+1, xi+2, …, xN, πi+1, …, πN | πi = k)
= Σl Σπi+2…πN P(xi+1, xi+2, …, xN, πi+1 = l, πi+2, …, πN | πi = k)
= Σl el(xi+1) akl Σπi+2…πN P(xi+2, …, xN, πi+2, …, πN | πi+1 = l)
= Σl el(xi+1) akl bl(i+1)
The Backward Algorithm
We can compute bk(i) for all k, i, using dynamic programming

Initialization:
bk(N) = 1, for all k

Iteration:
bk(i) = Σl el(xi+1) akl bl(i+1)

Termination:
P(x) = Σl a0l el(x1) bl(1)
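A sketch of the backward pass (casino parameters assumed as before), checked against brute-force enumeration over all state paths, which it must agree with exactly:

```python
from itertools import product

states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2}}

def backward(x):
    """P(x) via b_k(i) = sum_l e_l(x_{i+1}) a_kl b_l(i+1)."""
    b = {k: 1.0 for k in states}                 # b_k(N) = 1
    for c in reversed(x[1:]):
        b = {k: sum(emit[l][c] * trans[k][l] * b[l] for l in states)
             for k in states}
    # Termination: P(x) = sum_l a_0l e_l(x_1) b_l(1)
    return sum(start[l] * emit[l][x[0]] * b[l] for l in states)

def brute_force(x):
    """Sum P(x, pi) over every possible state path pi (exponential; tiny x only)."""
    total = 0.0
    for pi in product(states, repeat=len(x)):
        p = start[pi[0]] * emit[pi[0]][x[0]]
        for i in range(1, len(x)):
            p *= trans[pi[i - 1]][pi[i]] * emit[pi[i]][x[i]]
        total += p
    return total
```

The brute-force check sums over 2^N paths, which is exactly what the dynamic program avoids.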
Computational Complexity
What is the running time, and space required, for Forward and Backward?

Time: O(K²N)
Space: O(KN)

Useful implementation technique to avoid underflows:
• Viterbi: sum of logs
• Forward/Backward: rescaling every few positions by multiplying by a constant
Posterior Decoding
We can now calculate
P(πi = k | x) = fk(i) bk(i) / P(x)
Then, we can ask
What is the most likely state at position i of sequence x: Define π^ by Posterior Decoding:
π^i = argmaxk P(πi = k | x)
P(πi = k | x) = P(πi = k, x) / P(x)
             = P(x1, …, xi, πi = k, xi+1, …, xn) / P(x)
             = P(x1, …, xi, πi = k) P(xi+1, …, xn | πi = k) / P(x)
             = fk(i) bk(i) / P(x)
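Putting the forward and backward tables together gives posterior decoding; a sketch with the assumed casino model:

```python
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2}}

def posterior_decode(x):
    """Return (pi_hat, posteriors), with pi_hat_i = argmax_k P(pi_i = k | x)."""
    n = len(x)
    f = [{k: start[k] * emit[k][x[0]] for k in states}]        # forward table
    for i in range(1, n):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][l] * trans[l][k]
                                         for l in states) for k in states})
    b = [None] * n                                             # backward table
    b[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(emit[l][x[i + 1]] * trans[k][l] * b[i + 1][l]
                       for l in states) for k in states}
    px = sum(f[n - 1][k] for k in states)
    post = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]
    return [max(states, key=lambda k: p[k]) for p in post], post
```

Each posterior row sums to 1; note that the argmax path is chosen position by position, so it need not be a valid state path of the model.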
Posterior Decoding
• For each state,
§ Posterior Decoding gives us a curve of likelihood of state for each position
§ That is sometimes more informative than Viterbi path π*
• Posterior Decoding may give an invalid sequence of states (of prob 0)
§ Why? Each position is maximized independently, so two consecutive states in π^ may be joined by a zero-probability transition
Posterior Decoding
• P(πi = k | x) = Σπ P(π | x) 1(πi = k)
               = Σ{π : πi = k} P(π | x)

where 1(ψ) = 1 if ψ is true, 0 otherwise

[Figure: posterior probability curve P(πi = l | x) for a state l across positions x1 x2 x3 …… xN]
Viterbi, Forward, Backward
VITERBI Initialization:
V0(0) = 1 Vk(0) = 0, for all k > 0
Iteration: Vl(i) = el(xi) maxk Vk(i-1) akl Termination: P(x, π*) = maxk Vk(N)
FORWARD Initialization:
f0(0) = 1 fk(0) = 0, for all k > 0
Iteration:
fl(i) = el(xi) Σk fk(i-1) akl
Termination:
P(x) = Σk fk(N)
BACKWARD Initialization:
bk(N) = 1, for all k Iteration:
bl(i) = Σk el(xi+1) akl bk(i+1) Termination:
P(x) = Σk a0k ek(x1) bk(1)
Variants of HMMs
Higher-order HMMs
• How do we model “memory” larger than one time point?
• First order: P(πi+1 = l | πi = k) = akl
• Second order: P(πi+1 = l | πi = k, πi-1 = j) = ajkl
• …
• A second order HMM with K states is equivalent to a first order HMM with K² states
[Figure: a two-state coin HMM with history-dependent transitions aHT(prev = H), aHT(prev = T), aTH(prev = H), aTH(prev = T), redrawn as a first-order HMM over pair-states HH, HT, TH, TT with transitions aHHT, aHTT, aHTH, aTHH, aTHT, aTTH]
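The equivalence can be made concrete: a hypothetical second-order table ajkl over states H, T (values below are arbitrary) collapses into a first-order table over pair-states (j, k), where a move (j, k) → (k′, l) is allowed only when the shared state matches, k = k′:

```python
from itertools import product

states = ["H", "T"]
# Hypothetical 2nd-order probabilities a_jkl = P(next = l | prev2 = j, prev = k)
a2 = {"H": {"H": {"H": 0.6, "T": 0.4}, "T": {"H": 0.3, "T": 0.7}},
      "T": {"H": {"H": 0.5, "T": 0.5}, "T": {"H": 0.2, "T": 0.8}}}

pairs = list(product(states, repeat=2))        # K^2 = 4 pair-states
a1 = {s: {} for s in pairs}
for (j, k) in pairs:
    for (k2, l) in pairs:
        # (j,k) -> (k2,l) is consistent only if the overlapping state matches
        a1[(j, k)][(k2, l)] = a2[j][k][l] if k == k2 else 0.0
```

Each pair-state row still sums to 1, so the result is a valid first-order transition matrix on K² states.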
Similar Algorithms to 1st Order
• P(πi+1 = l | πi = k, πi-1 = j)
§ Vlk(i) = maxj { Vkj(i – 1) + … }
§ Time? Space? (the table has K² entries per position, each maximized over j: O(K³N) time, O(K²N) space)
Modeling the Duration of States
Length distribution of region X:
P(lX = k) = (1 – p) p^(k-1)

• Geometric distribution, with mean E[lX] = 1/(1 – p)

This is a significant disadvantage of HMMs
Several solutions exist for modeling different length distributions

[Figure: two states X, Y with self-loops p, q and cross transitions 1 – p, 1 – q]
Example: exon lengths in genes
Solution 1: Chain several states
[Figure: a chain of several X states, each with self-loop p, followed by Y with self-loop q]

Disadvantage: Still very inflexible
lX = C + geometric with mean 1/(1 – p)
Solution 2: Negative binomial distribution
Duration in X: m turns, where § During first m – 1 turns, exactly n – 1 arrows to next state are followed § During mth turn, an arrow to next state is followed
P(lX = m) = C(m – 1, n – 1) (1 – p)^(n-1+1) p^((m-1)-(n-1)) = C(m – 1, n – 1) (1 – p)^n p^(m-n)

[Figure: chain X(n) → … → X(2) → X(1) → Y; each X(i) has self-loop p and exit transition 1 – p]
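The duration law of this n-stage chain is negative binomial; a quick numerical check (n = 3 and p = 0.8 are assumed values, chosen only for illustration) confirms it normalizes and has mean n/(1 – p):

```python
from math import comb

n, p = 3, 0.8            # n chained X states, self-loop probability p (assumed)

def p_duration(m):
    """P(l_X = m) = C(m-1, n-1) (1-p)^n p^(m-n), for m >= n."""
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

support = range(n, 1000)                 # tail beyond 1000 is negligible here
total = sum(p_duration(m) for m in support)
mean = sum(m * p_duration(m) for m in support)
```

With these values the mean duration is n/(1 – p) = 15 turns, versus a hard minimum of n turns.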
Example: genes in prokaryotes
• EasyGene: Prokaryotic gene-finder
Larsen TS, Krogh A
• Negative binomial with n = 3
Solution 3: Duration modeling
Upon entering a state:
1. Choose duration d, according to probability distribution
2. Generate d letters according to emission probs
3. Take a transition to next state according to transition probs

Disadvantage: Increase in complexity of Viterbi:
Time: ×O(D)
Space: ×O(1)
where D = maximum duration of state

[Figure: state F emits a block xi…xi+d-1 of duration d < DF, with duration distribution PF]

Warning: Rabiner’s tutorial claims O(D²) time & O(D) space increases
Viterbi with duration modeling
Recall original iteration:
Vl(i) = maxk Vk(i – 1) akl × el(xi)

New iteration:
Vl(i) = maxk maxd=1…Dl Vk(i – d) × Pl(d) × akl × Πj=i-d+1…i el(xj)

[Figure: states F, L with transitions between them; each emits a block xi…xi+d-1 of duration d < DF (resp. d < DL), with duration distributions PF, PL]

Precompute cumulative values (e.g. prefix sums of log el(xj)) so each emission product costs O(1)
Learning
Re-estimate the parameters of the model based on training data
Two learning scenarios
1. Estimation when the “right answer” is known
Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

2. Estimation when the “right answer” is unknown
Examples:
GIVEN: the porcupine genome; we don’t know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice

QUESTION: Update the parameters θ of the model to maximize P(x | θ)
1. When the states are known
Given x = x1…xN for which the true π = π1…πN is known,

Define:
Akl = # times k→l transition occurs in π
Ek(b) = # times state k in π emits b in x

We can show that the maximum likelihood parameters θ (maximizing P(x | θ)) are:

akl = Akl / Σl′ Akl′        ek(b) = Ek(b) / Σc Ek(c)
1. When the states are known
Intuition: When we know the underlying states,
the best estimate is the normalized frequency of transitions & emissions that occur in the training data

Drawback:
Given little data, there may be overfitting:
P(x | θ) is maximized, but θ is unreasonable
0 probabilities – BAD

Example:
Given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then:
aFF = 1; aFL = 0
eF(1) = eF(3) = eF(6) = .2; eF(2) = .3; eF(4) = 0; eF(5) = .1
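The counts for this example can be reproduced mechanically; a sketch (the all-zero rows for state L, never visited, illustrate exactly the 0-probability problem):

```python
from collections import Counter

x = "2156123623"       # the ten observed rolls
pi = "FFFFFFFFFF"      # the known state path
states, alphabet = ["F", "L"], "123456"

A = Counter(zip(pi, pi[1:]))            # A_kl: transition counts
E = {k: Counter() for k in states}      # E_k(b): emission counts
for s, c in zip(pi, x):
    E[s][c] += 1

a, e = {}, {}
for k in states:
    row = sum(A[(k, l)] for l in states)
    a[k] = {l: A[(k, l)] / row if row else 0.0 for l in states}
    tot = sum(E[k].values())
    e[k] = {c: E[k][c] / tot if tot else 0.0 for c in alphabet}
```

Note that 6 appears twice in the ten rolls, so eF(6) = 0.2, and the eF row sums to 1 as it must.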
Pseudocounts
Solution for small training sets:
Add pseudocounts:

Akl = # times k→l transition occurs in π + rkl
Ek(b) = # times state k in π emits b in x + rk(b)

rkl, rk(b) are pseudocounts representing our prior belief
Larger pseudocounts ⇒ strong prior belief
Small pseudocounts (ε < 1): just to avoid 0 probabilities
Pseudocounts
Example: dishonest casino
We will observe the player for one day: 600 rolls

Reasonable pseudocounts:
r0F = r0L = rF0 = rL0 = 1
rFL = rLF = rFF = rLL = 1
rF(1) = rF(2) = … = rF(6) = 20 (strong belief fair is fair)
rL(1) = rL(2) = … = rL(6) = 5 (wait and see for loaded)

The numbers above are arbitrary – assigning priors is an art
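Folding pseudocounts into the counting estimate is a one-line change per row. The sketch below reuses the ten-roll counts from the earlier example purely for illustration, with the r values suggested here:

```python
states, alphabet = ["F", "L"], "123456"
A = {("F", "F"): 9}                                   # observed transition counts
E = {"F": {"2": 3, "1": 2, "5": 1, "6": 2, "3": 2},   # observed emission counts
     "L": {}}                                         # L was never visited
r_trans = 1                                           # r_kl = 1 for all k, l
r_emit = {"F": 20, "L": 5}                            # r_F(b) = 20, r_L(b) = 5

a, e = {}, {}
for k in states:
    row = {l: A.get((k, l), 0) + r_trans for l in states}
    z = sum(row.values())
    a[k] = {l: row[l] / z for l in states}
    cnt = {b: E[k].get(b, 0) + r_emit[k] for b in alphabet}
    z = sum(cnt.values())
    e[k] = {b: cnt[b] / z for b in alphabet}
```

Every probability is now strictly positive; the never-visited state L falls back to its prior (uniform over transitions and emissions).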
2. When the states are hidden
We don’t know the true Akl, Ek(b)

Idea:
• We estimate our “best guess” on what Akl, Ek(b) are
§ Or, we start with random / uniform values
• We update the parameters of the model, based on our guess
• We repeat
2. When the states are hidden
Starting with our best guess of a model M, parameters θ:
Given x = x1…xN for which the true π = π1…πN is unknown,
We can get to a provably more likely parameter set θ,
i.e., θ that increases the probability P(x | θ)

Principle: EXPECTATION MAXIMIZATION
1. Estimate Akl, Ek(b) in the training data
2. Update θ according to Akl, Ek(b)
3. Repeat 1 & 2, until convergence
Estimating new parameters
To estimate Akl: (assume “| θCURRENT” in all formulas below)

At each position i of sequence x, find the probability that transition k→l is used:

P(πi = k, πi+1 = l | x) = [1/P(x)] × P(πi = k, πi+1 = l, x1…xN) = Q/P(x)

where Q = P(x1…xi, πi = k, πi+1 = l, xi+1…xN)
        = P(πi+1 = l, xi+1…xN | πi = k) P(x1…xi, πi = k)
        = P(πi+1 = l, xi+1xi+2…xN | πi = k) fk(i)
        = P(xi+2…xN | πi+1 = l) P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) fk(i)
        = bl(i+1) el(xi+1) akl fk(i)

So: P(πi = k, πi+1 = l | x, θ) = fk(i) akl el(xi+1) bl(i+1) / P(x | θCURRENT)
Estimating new parameters
• So, Akl is the E[# times transition k→l, given current θ]:

Akl = Σi P(πi = k, πi+1 = l | x, θ) = Σi fk(i) akl el(xi+1) bl(i+1) / P(x | θ)

• Similarly,

Ek(b) = [1/P(x | θ)] Σ{i | xi = b} fk(i) bk(i)
[Figure: transition k→l between positions i and i+1 - forward mass fk(i) over x1…xi, factor akl el(xi+1), backward mass bl(i+1) over xi+2…xN]
The Baum-Welch Algorithm
Initialization: Pick the best-guess for model parameters (or arbitrary)

Iteration:
1. Forward
2. Backward
3. Calculate Akl, Ek(b), given θCURRENT
4. Calculate new model parameters θNEW: akl, ek(b)
5. Calculate new log-likelihood P(x | θNEW)
   GUARANTEED TO BE HIGHER BY EXPECTATION-MAXIMIZATION

Until P(x | θ) does not change much
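A compact, unscaled Baum-Welch sketch for the two-state casino model. The initial parameters and training sequence below are assumed for illustration; plain probabilities are used, so this only suits short sequences (rescale or use logs in practice), and the start distribution is held fixed rather than re-estimated:

```python
states, alphabet = ["F", "L"], "123456"
start = {"F": 0.5, "L": 0.5}                  # held fixed in this sketch

def bw_step(x, trans, emit):
    """One EM iteration: returns (new_trans, new_emit, P(x | current params))."""
    n = len(x)
    f = [{k: start[k] * emit[k][x[0]] for k in states}]        # forward
    for i in range(1, n):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][l] * trans[l][k]
                                         for l in states) for k in states})
    b = [None] * n                                             # backward
    b[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(emit[l][x[i + 1]] * trans[k][l] * b[i + 1][l]
                       for l in states) for k in states}
    px = sum(f[n - 1][k] for k in states)
    # E-step: expected counts A_kl = sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
    A = {k: {l: sum(f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l]
                    for i in range(n - 1)) / px for l in states} for k in states}
    E = {k: {c: sum(f[i][k] * b[i][k] for i in range(n) if x[i] == c) / px
             for c in alphabet} for k in states}
    # M-step: normalize expected counts into probabilities
    new_t = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in states}
    return new_t, new_e, px

x = "123456" * 3 + "666666" * 2               # assumed training sequence
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.2, "L": 0.8}}
emit = {"F": {c: 1 / 6 for c in alphabet},
        "L": {**{c: 0.1 for c in "12345"}, "6": 0.5}}
likelihoods = []
for _ in range(5):
    trans, emit, px = bw_step(x, trans, emit)
    likelihoods.append(px)
```

Recording P(x | θ) at each iteration makes the EM guarantee visible: the sequence of likelihoods is non-decreasing.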
The Baum-Welch Algorithm
Time Complexity: # iterations × O(K²N)

• Guaranteed to increase the log likelihood P(x | θ)
• Not guaranteed to find globally best parameters:
converges to a local optimum, depending on initial conditions
• Too many parameters / too large model: overtraining
Alternative: Viterbi Training
Initialization: Same

Iteration:
1. Perform Viterbi, to find π*
2. Calculate Akl, Ek(b) according to π* + pseudocounts
3. Calculate the new parameters akl, ek(b)
Until convergence

Notes:
§ Not guaranteed to increase P(x | θ)
§ Guaranteed to increase P(x | θ, π*)
§ In general, worse performance than Baum-Welch