Hidden Markov Models

Two learning scenarios

1. Estimation when the “right answer” is known

Examples: GIVEN: a genomic region x = x1…x1,000,000 where we have good

(experimental) annotations of the CpG islands

GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

2. Estimation when the “right answer” is unknown

Examples: GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition

GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice

QUESTION: Update the parameters θ of the model to maximize P(x | θ)

1. When the right answer is known

Given x = x1…xN

for which the true path π = π1…πN is known,

Define:

Akl = # times the k→l transition occurs in π

Ek(b) = # times state k in π emits b in x

We can show that the maximum likelihood parameters are:

akl = Akl / Σi Aki        ek(b) = Ek(b) / Σc Ek(c)
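As a concrete illustration of these counts (a minimal sketch, not from the slides; the dice example and the pseudocount option are the only assumptions beyond the formulas above):

```python
from collections import defaultdict

def ml_parameters(x, pi, states, alphabet, pseudocount=0.0):
    """Maximum-likelihood akl and ek(b) from a sequence x with known path pi.

    Akl = # times the k->l transition occurs in pi
    Ek(b) = # times state k emits b in x
    Use pseudocount > 0 to avoid zero counts for rarely visited states.
    """
    A = defaultdict(float)
    E = defaultdict(float)
    for k in states:
        for l in states:
            A[(k, l)] = pseudocount
        for c in alphabet:
            E[(k, c)] = pseudocount
    for i in range(len(pi) - 1):                 # count transitions along pi
        A[(pi[i], pi[i + 1])] += 1
    for xi, ki in zip(x, pi):                    # count emissions
        E[(ki, xi)] += 1
    a = {(k, l): A[(k, l)] / sum(A[(k, m)] for m in states)
         for k in states for l in states}
    e = {(k, c): E[(k, c)] / sum(E[(k, d)] for d in alphabet)
         for k in states for c in alphabet}
    return a, e

# Toy usage: the dishonest-casino setting with Fair (F) and Loaded (L) dice
rolls = "315662616366"
path  = "FFFFLLLLLLFF"
a, e = ml_parameters(rolls, path, states="FL", alphabet="123456", pseudocount=1.0)
print(a[("F", "L")], e[("L", "6")])
```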

2. When the right answer is unknown

We don’t know the true Akl, Ek(b)

Idea:

• We estimate our “best guess” on what Akl, Ek(b) are

• We update the parameters of the model, based on our guess

• We repeat

2. When the right answer is unknown

Starting with our best guess of a model M, parameters θ:

Given x = x1…xN

for which the true path π = π1…πN is unknown,

We can get to a provably more likely parameter set

Principle: EXPECTATION MAXIMIZATION

1. Estimate Akl, Ek(b) in the training data

2. Update θ according to Akl, Ek(b)

3. Repeat 1 & 2, until convergence

Estimating new parameters

To estimate Akl:

At each position i of sequence x,

Find the probability that transition k→l is used:

P(πi = k, πi+1 = l | x) = [1/P(x)] P(πi = k, πi+1 = l, x1…xN) = Q/P(x)

where Q = P(x1…xi, πi = k, πi+1 = l, xi+1…xN)
= P(πi+1 = l, xi+1…xN | πi = k) P(x1…xi, πi = k)
= P(πi+1 = l, xi+1 xi+2…xN | πi = k) fk(i)
= P(xi+2…xN | πi+1 = l) P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) fk(i)
= bl(i+1) el(xi+1) akl fk(i)

So: P(πi = k, πi+1 = l | x, θ) = fk(i) akl el(xi+1) bl(i+1) / P(x | θ)

Estimating new parameters

• So,

Akl = Σi P(πi = k, πi+1 = l | x, θ) = Σi fk(i) akl el(xi+1) bl(i+1) / P(x | θ)

• Similarly,

Ek(b) = [1/P(x | θ)] Σ{i | xi = b} fk(i) bk(i)

[Figure: states k and l at positions i and i+1 of x1 … xi-1 xi xi+1 xi+2 … xN, labeled with the forward value fk(i), transition akl, emission el(xi+1), and backward value bl(i+1)]
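In code, these two expected counts can be computed directly from the forward and backward tables (a sketch; it assumes the forward values f[k][i], backward values b[k][i], P(x | θ) and the current parameters have already been computed, and the dictionary-based layout is my own choice):

```python
def expected_counts(x, f, b, a, e, px, states, alphabet):
    """Expected transition counts A_kl and emission counts E_k(b).

    Assumes f[k][i], b[k][i] and px = P(x | theta) were computed by the
    forward and backward algorithms with the current parameters a[(k, l)], e[(k, c)].
    """
    N = len(x)
    A = {(k, l): 0.0 for k in states for l in states}
    E = {(k, c): 0.0 for k in states for c in alphabet}
    for i in range(N - 1):
        for k in states:
            for l in states:
                # P(pi_i = k, pi_i+1 = l | x) = fk(i) akl el(x_i+1) bl(i+1) / P(x)
                A[(k, l)] += f[k][i] * a[(k, l)] * e[(l, x[i + 1])] * b[l][i + 1] / px
    for i in range(N):
        for k in states:
            # P(pi_i = k | x) = fk(i) bk(i) / P(x), accumulated where xi = b
            E[(k, x[i])] += f[k][i] * b[k][i] / px
    return A, E
```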

The Baum-Welch Algorithm

Initialization:

Pick the best-guess for model parameters

(or arbitrary)

Iteration: 1. Forward

2. Backward

3. Calculate Akl, Ek(b)

4. Calculate new model parameters akl, ek(b)

5. Calculate new log-likelihood P(x | θ)

GUARANTEED TO BE HIGHER BY EXPECTATION-MAXIMIZATION

Until P(x | θ) does not change much
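The loop can be sketched end-to-end as follows (a minimal, unscaled numpy sketch for a single training sequence; it keeps the initial-state probabilities fixed and is only suitable for short sequences, since real implementations rescale or work in log space):

```python
import numpy as np

def baum_welch(x, a, e, start, n_iter=100, tol=1e-6):
    """Unscaled Baum-Welch for one observation sequence.

    x     : sequence of integer symbols 0..M-1
    a     : K x K transition matrix (a[k, l] = a_kl)
    e     : K x M emission matrix  (e[k, c] = e_k(c))
    start : length-K initial state probabilities (kept fixed here)
    """
    x = np.asarray(x)
    N, K, M = len(x), a.shape[0], e.shape[1]
    prev_ll = -np.inf
    for _ in range(n_iter):
        # 1. Forward
        f = np.zeros((N, K))
        f[0] = start * e[:, x[0]]
        for i in range(1, N):
            f[i] = e[:, x[i]] * (f[i - 1] @ a)
        px = f[-1].sum()                       # P(x | current parameters)
        # 2. Backward
        b = np.zeros((N, K))
        b[-1] = 1.0
        for i in range(N - 2, -1, -1):
            b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
        # 3. Expected counts A_kl and E_k(b)
        A = np.zeros((K, K))
        for i in range(N - 1):
            A += np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) * a / px
        gamma = f * b / px                     # posterior P(pi_i = k | x)
        E = np.zeros((K, M))
        for c in range(M):
            E[:, c] = gamma[x == c].sum(axis=0)
        # 4. New parameters (assumes every state has nonzero expected counts)
        a = A / A.sum(axis=1, keepdims=True)
        e = E / E.sum(axis=1, keepdims=True)
        # 5. New log-likelihood; stop when it barely changes
        ll = np.log(px)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return a, e, ll

# Toy usage: two hidden states, binary alphabet
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=60)
a0 = np.array([[0.7, 0.3], [0.4, 0.6]])
e0 = np.array([[0.6, 0.4], [0.2, 0.8]])
a_hat, e_hat, ll = baum_welch(x, a0, e0, start=np.array([0.5, 0.5]))
```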

The Baum-Welch Algorithm

Time Complexity:

(# iterations) × O(K²N)

• Guaranteed to increase the log likelihood of the model

P(θ | x) = P(x, θ) / P(x) = P(x | θ) P(θ) / P(x)

• Not guaranteed to find globally best parameters

Converges to local optimum, depending on initial conditions

• Too many parameters / too large model: Overtraining

Alternative: Viterbi Training

Initialization: Same

Iteration: 1. Perform Viterbi, to find π*

2. Calculate Akl, Ek(b) according to π* + pseudocounts

3. Calculate the new parameters akl, ek(b)

Until convergence

Notes:

Convergence is guaranteed. Why? There are finitely many candidate paths π*, and each iteration cannot decrease P(x, π* | θ).

Does not maximize P(x | θ).

In general, worse performance than Baum-Welch.
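A sketch of one Viterbi-training iteration (my own illustration of the steps above; in practice steps 1 to 3 repeat until π* stops changing, and the matrices are assumed strictly positive so the logarithms are finite):

```python
import numpy as np

def viterbi_path(x, a, e, start):
    """Most likely state path pi* (log-space Viterbi); a, e, start as in Baum-Welch."""
    N, K = len(x), len(start)
    V = np.zeros((N, K))
    ptr = np.zeros((N, K), dtype=int)
    V[0] = np.log(start) + np.log(e[:, x[0]])
    for i in range(1, N):
        scores = V[i - 1][:, None] + np.log(a)   # scores[k, l] = V_k(i-1) + log a_kl
        ptr[i] = scores.argmax(axis=0)
        V[i] = scores.max(axis=0) + np.log(e[:, x[i]])
    path = np.zeros(N, dtype=int)
    path[-1] = V[-1].argmax()
    for i in range(N - 1, 0, -1):                # trace the pointers back
        path[i - 1] = ptr[i, path[i]]
    return path

def viterbi_training_step(x, a, e, start, pseudo=1.0):
    """One iteration: decode pi*, then count A_kl, E_k(b) along pi* with pseudocounts."""
    x = np.asarray(x)
    K, M = e.shape
    pi = viterbi_path(x, a, e, start)
    A = np.full((K, K), pseudo)
    E = np.full((K, M), pseudo)
    for i in range(len(x) - 1):
        A[pi[i], pi[i + 1]] += 1
    for i in range(len(x)):
        E[pi[i], x[i]] += 1
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)
```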

Variants of HMMs

Higher-order HMMs

The Genetic Code

3 nucleotides make 1 amino acid

Statistical dependencies in triplets

Question:

Recognize protein-coding segments with an HMM

One way to model protein regions

P(xi xi+1 xi+2 | xi-3 xi-2 xi-1)

Every state of the HMM emits 3 nucleotides

Transition probabilities:

Probability of one triplet, given the previous triplet: P(πi | πi-1)

Emission probabilities:

P(xi xi-1 xi-2 | πi) = 1/0

P(xi-1 xi-2 xi-3 | πi-1) = 1/0

(each state emits its own triplet deterministically, so these probabilities are 1 or 0)

[Figure: one state per codon: AAA, AAC, AAT, …, TTT]

A more elegant way

Every state of the HMM emits 1 nucleotide

Transition probabilities:

Probability of one state (nucleotide), given the previous 3 states:

P(πi | πi-1, πi-2, πi-3)

Emission probabilities:

P(xi | πi)

Algorithms extend with small modifications

[Figure: four states, one per nucleotide: A, C, G, T]
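One way to see why the algorithms extend with small modifications (a sketch of a standard construction, not spelled out on the slides; the function and variable names are mine): the order-3 dependence can be folded into an ordinary first-order HMM whose states are the 4³ = 64 possible 3-nucleotide contexts, so the usual forward/Viterbi recurrences apply unchanged.

```python
from itertools import product

NUCS = "ACGT"
CONTEXTS = ["".join(c) for c in product(NUCS, repeat=3)]   # 64 context states
STATE_INDEX = {ctx: i for i, ctx in enumerate(CONTEXTS)}

def expanded_transition(p_next_given_context):
    """Build a 64x64 first-order transition matrix.

    p_next_given_context: dict mapping (context, nucleotide) -> probability,
    i.e. P(pi_i = nucleotide | previous three states = context).
    Only transitions that shift the 3-nucleotide window by one letter are allowed;
    in the expanded HMM the emission is simply the last letter of the context.
    """
    A = [[0.0] * 64 for _ in range(64)]
    for ctx in CONTEXTS:
        for nuc in NUCS:
            new_ctx = ctx[1:] + nuc            # slide the context window
            A[STATE_INDEX[ctx]][STATE_INDEX[new_ctx]] = \
                p_next_given_context.get((ctx, nuc), 0.25)  # uniform fallback
    return A

# Toy usage: uniform model; each row sums to 1 over its 4 reachable contexts
A = expanded_transition({})
assert abs(sum(A[STATE_INDEX["ACG"]]) - 1.0) < 1e-9
```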

Modeling the Duration of States

Length distribution of region X:

E[lX] = 1/(1-p)

• Geometric distribution, with mean 1/(1-p)

This is a significant disadvantage of HMMs

Several solutions exist for modeling different length distributions

[Figure: two states X and Y; self-loop probabilities p and q, cross transitions 1-p and 1-q]
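As a quick check of the stated mean (standard algebra for the geometric distribution, not from the slides):

```latex
% Staying in X for exactly k steps: k-1 self-loops (prob. p each), then leave (prob. 1-p)
P(l_X = k) = p^{k-1}(1-p), \qquad k \ge 1
% Mean of this geometric distribution:
E[l_X] = \sum_{k \ge 1} k\, p^{k-1}(1-p)
       = (1-p)\,\frac{d}{dp}\!\left(\frac{1}{1-p}\right)
       = \frac{1-p}{(1-p)^2}
       = \frac{1}{1-p}
```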

Sol’n 1: Chain several states

[Figure: X split into a chain of copies (X X …) feeding into Y; self-loop probabilities p and q, forward transitions 1-p and 1-q]

Disadvantage: Still very inflexible; lX = C + geometric with mean 1/(1-p)

Sol’n 2: Negative binomial distribution

Duration in X: m turns, where:

• During the first m - 1 turns, exactly n - 1 arrows to the next state are followed

• During the m-th turn, an arrow to the next state is followed

P(lX = m) = C(m-1, n-1) (1-p)^((n-1)+1) p^((m-1)-(n-1)) = C(m-1, n-1) (1-p)^n p^(m-n)

(where C(m-1, n-1) is the binomial coefficient "m-1 choose n-1")

[Figure: a chain of n copies of X, each with self-loop probability p and forward arrow 1-p, followed by Y]
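A quick Monte Carlo sanity check of this formula (a toy sketch, not part of the slides): simulate the chain of n X states and compare empirical duration frequencies with the closed form.

```python
import math
import random

def simulate_duration(n, p, rng):
    """Total turns spent crossing a chain of n X states,
    each with self-loop probability p and forward probability 1 - p."""
    turns = 0
    for _ in range(n):                 # must leave each of the n copies once
        turns += 1                     # the turn on which the forward arrow is taken
        while rng.random() < p:        # self-loops before that forward arrow
            turns += 1
    return turns

def negbin_pmf(m, n, p):
    """P(l_X = m) = C(m-1, n-1) (1-p)^n p^(m-n)."""
    return math.comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

rng = random.Random(0)
n, p = 3, 0.8
samples = [simulate_duration(n, p, rng) for _ in range(100_000)]
for m in (3, 5, 10, 20):
    empirical = sum(s == m for s in samples) / len(samples)
    print(m, round(empirical, 4), round(negbin_pmf(m, n, p), 4))
```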

Example: genes in prokaryotes

• EasyGene: a prokaryotic gene finder (Larsen TS, Krogh A)

• Negative binomial with n = 3

Solution 3: Duration modeling

Upon entering a state:

1. Choose duration d, according to a probability distribution
2. Generate d letters according to the emission probabilities
3. Take a transition to the next state according to the transition probabilities

Disadvantage: Increase in complexity:

Time: O(D²), Space: O(D)

where D = maximum duration of state
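For concreteness, the Viterbi recurrence for such a duration-explicit ("generalized") HMM can be written as follows (standard formulation, not spelled out on the slide; Pl(d) denotes the duration distribution of state l):

```latex
V_l(i) \;=\; \max_{1 \le d \le D}\;
   P_l(d)\,
   \Bigl(\textstyle\prod_{j=i-d+1}^{i} e_l(x_j)\Bigr)\,
   \max_{k \neq l}\, a_{kl}\, V_k(i-d)
```

The maximization over d and the product of up to D emission terms per candidate duration are what give the extra O(D²) factor per cell noted above.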


Connection Between Alignment and HMMs

A state model for alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

[Figure: three states M(+1, +1), I(+1, 0), J(0, +1)]

Alignments correspond 1-to-1 with sequences of states M, I, J

Let’s score the transitions

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

[Figure: the same three states M(+1, +1), I(+1, 0), J(0, +1), now with transition scores: s(xi, yj) on transitions into M, -d on the gap-open transitions M→I and M→J, and -e on the gap-extend transitions I→I and J→J]

Alignments correspond 1-to-1 with sequences of states M, I, J

How do we find optimal alignment according to this model?

Dynamic Programming:

M(i, j): Optimal alignment of x1…xi to y1…yj ending in M

I(i, j): Optimal alignment of x1…xi to y1…yj ending in I

J(i, j): Optimal alignment of x1…xi to y1…yj ending in J

The score is additive; therefore we can apply the DP recurrence formulas

Needleman Wunsch with affine gaps – state version

Initialization: M(0, 0) = 0; M(i, 0) = M(0, j) = -∞, for i, j > 0; I(i, 0) = -d - (i - 1)×e; J(0, j) = -d - (j - 1)×e

Iteration:

M(i, j) = s(xi, yj) + max { M(i - 1, j - 1), I(i - 1, j - 1), J(i - 1, j - 1) }

I(i, j) = max { -d + M(i - 1, j), -e + I(i - 1, j) }

J(i, j) = max { -d + M(i, j - 1), -e + J(i, j - 1) }

Termination: Optimal alignment given by max { M(m, n), I(m, n), J(m, n) }
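A compact implementation of this three-matrix recurrence (a sketch; the match/mismatch scores and gap penalties are toy choices, and the example sequences are the two rows of the alignment above with the gaps removed):

```python
import numpy as np

NEG_INF = float("-inf")

def nw_affine(x, y, match=1.0, mismatch=-1.0, d=3.0, e=1.0):
    """Needleman-Wunsch with affine gaps, state version:
    M: xi aligned to yj, I: xi aligned to a gap, J: yj aligned to a gap."""
    def s(a, b):
        return match if a == b else mismatch

    m, n = len(x), len(y)
    M = np.full((m + 1, n + 1), NEG_INF)
    I = np.full((m + 1, n + 1), NEG_INF)
    J = np.full((m + 1, n + 1), NEG_INF)
    M[0, 0] = 0.0
    for i in range(1, m + 1):
        I[i, 0] = -d - (i - 1) * e       # x1..xi aligned against gaps
    for j in range(1, n + 1):
        J[0, j] = -d - (j - 1) * e       # y1..yj aligned against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i, j] = s(x[i - 1], y[j - 1]) + max(M[i - 1, j - 1],
                                                  I[i - 1, j - 1],
                                                  J[i - 1, j - 1])
            I[i, j] = max(-d + M[i - 1, j], -e + I[i - 1, j])
            J[i, j] = max(-d + M[i, j - 1], -e + J[i, j - 1])
    return max(M[m, n], I[m, n], J[m, n])

print(nw_affine("AGGCTATCACCTGACCTCCAGGCCGATGCCC",
                "TAGCTATCACGACCGCGGTCGATTTGCCCGACC"))
```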

Probabilistic interpretation of an alignment

An alignment is a hypothesis that the two sequences are related by evolution

Goal:

Produce the most likely alignment

Assess the likelihood that the sequences are indeed related