dishonest casino as an hmm - rice universitydevika/comp470/lectures/lec3.pdf(c) devika subramanian,...

33
(c) Devika Subramanian, 2009 63 Dishonest casino as an HMM N = 2, S={F,L} M=2, O = {h,t} A = B= . 0.90 0.10 0.05 0.95 F F L L 0.9 0.1 0.5 5 . 0 F L h t 0] 1 [

Upload: others

Post on 18-Mar-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 63

Dishonest casino as an HMM N = 2, S={F,L} M=2, O = {h,t} A =

B=

.

0.90 0.10

0.05 0.95F

F L

L

0.9 0.1

0.5 5.0F

L

h t

0] 1[

Page 2: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 64

A generative model for CpG islands

There are two hidden states: CpG and non-CpG. Each state is characterized by emission probabilities of the 4 bases. You can’t see which state the model is, only the emitted bases are visible.

CpG Non-CpG

A:C:G:T:

Hidden state

ObservablesA:C:G:T:

Page 3: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 65

Filtering or the forward computation

Given an HMM model (A,B,pi), and an observation sequence o1…ot, can we find the most likely hidden state at time t, St?

P(St|o1…ot): filtering

Observation sequence: h h t t t t

What is the hiddenstate here (F or L)?

Page 4: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 66

Filtering (contd.)

S1S0 S2 S3 S4 S5 S6

h h t t t t

s LF

0.05 0.9

0.1

0.95

h:0.1t:0.9

h:0.5t:0.5

[1 0]

What is the distribution of S1?Since, s0=F, we can say thatP(S1|S0)=[0.95 0.05], based on thetransition probabilities alone. But is that all we know?

Page 5: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 67

More filtering

S1S0 S2 S3 S4 S5 S6

h h t t t t

[1 0]

s LF

0.05 0.9

0.1

0.95

h:0.1t:0.9

h:0.5t:0.5

We have also observed h at time 1.How can we fold it in into the assessment of the distribution of S1?

Page 6: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 68

Filtering (contd.)

)(

)()|()|(

1

11111

oP

SPSoPoSP

)05.0)(1.0(05.0)|()|(

)95.0)(5.0(95.0)|()|(

11

11

LhPhoLSP

FhPhoFSP

1)05.0)(1.0()95.0)(5.0(

Therefore, P(S1)=[0.99 0.01]

Page 7: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 69

Filtering computation

StSt-1

[p 1-p]

F L

ot

)...|()|()|()...,|( 111111

1

tt

s

ttttttt oosPsSPSoPoooSPt

Recursivelycomputed

Page 8: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 70

Summary: filtering

nii

aiobj

nii

sSooPi

ooScPooSP

Ttnj

n

i

ijttjt

i

ittt

tttt

1 ),( :nTerminatio

,)()()( :Recursion

1 ,)( :Initialize

).,,...,()( Define

).,...,,(),...,|( Find

T

11,0

1

11

0

1

11

Time complexity O(n2T)

Page 9: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 71

Smoothing/posterior decoding

S1S0 S2 S3 S4 S5 S6

h h t t t t

Question: can we re-estimate the distribution at Sk where k < t, using information about the observed sequence uptotime t?

That is, what is P(Sk|o1…ot) ?

Page 10: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 72

Backward computation

),...,|()|,...,( ),...,|( 111 kkktktk ooSPSoocPooSP

Forward computation

),()()( :Recursion

.1 ,1)( :Initialize

).|,...,()( Define

11,111

1

1

kTNikkj

N

j

ijk

T

iktkk

jobaci

Nii

sSooPi

Backward computation

Time complexity: O(n2T)

Page 11: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 73

Posterior decoding

)()( ),...,|( 1 iicooiSP kktk

Page 12: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 74

Full Decoding

Given HMM model (A,B,pi), and an observation sequence o1…ot, can we find the most likely hidden state sequence s1…st?

argmax_{s1…st} P(s1…st| o1…ot)

Page 13: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 75

The Viterbi algorithm

njTt

tjijti

t

i

tttss

t

obaij

nii

ooiSssPit

1,11

11

0

111,...,

),()(max)(:Recursion

1,)( :Initialize

),...,,,,...,(max)(11

Computational complexity = O(Tn2)

Page 14: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 76

Learning an HMM: case 1

Given observation sequences, and the corresponding hidden state sequences, can we find the most likely model (A,B,pi) which generated it?

FF F L L F F

h h t t t t

Training data

Page 15: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 77

Parameter estimation

Initial state distribution Fraction of times state i is state 1 in training

data

Transition probabilities aij = (number of transitions from i to j)/(number

of transitions from i)

Emission probabilities bk(i) = (number of times k is emitted in state

i)/(number of times state i occurs)

Page 16: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 78

Learning an HMM: case 2

Given just the observation sequences, can we find the most likely model = (A,B,pi) which generated it?

)|...(argmax 1

tooP

Annotated training data is difficult to get; so we wouldlike to derive model parameters from observablesequences.

Page 17: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 79

The EM algorithm

1. Guess a model

2. Use observation sequence to estimate transition probabilities, emission probabilities, and initial state probabilities.

3. Update model

4. Repeat 2 and 3 till no change in model

Page 18: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 80

Re-estimating parameters

What is the probability of being in state i at time t and moving to state j, given the current model and the observation sequence O?

),| ,(),( 1 OjSiSPji ttt

Page 19: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 81

Using forward and backward computation

n

i

n

j

ttjijt

ttjijt

t

jobai

jobaiji

1 1

11

11

)()()(

)()()(),(

i j

OtOt+1

Page 20: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 82

Re-estimating aij

The transition probabilities aij can be re-estimated as follows

1

1 1'

1

1

)',(

),(

ˆT

t

n

j

t

T

t

t

ij

ji

ji

a

Page 21: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 83

Initial state probabilities

t (i) t (i, j)j1

N

Initial state probabilities are simply )(1 i

Expected numberof times instate i

Page 22: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 84

Emission probabilities

i statein timesofnumber expected

k symbol observe and i statein timesofnumber expected)(ˆ kbi

T

t

t

T

kot

t

i

i

i

kb t

1

1

)(

)(

)(ˆ

Page 23: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 85

The EM algorithm

)( and ),( iji tt

1. Guess a model

2. Use observation sequence to estimate

3. Use these estimates to recalculate

4. Repeat 2 and 3 till no change in model

),,( ba

)',','(' ba

Page 24: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 86

Summary of CpG island HMM

Given a DNA region x, Viterbi decoding predicts locations of CpG islands on it.

Given a nucleotide xi, Viterbi decoding tells whether xi is in a CpG island in the most likely sequence.

Posterior decoding can assign locally optimal predictions of CpG islands.

A fully annotated training data set can be used to estimate the generating HMM.

Even without annotations, we can use the EM procedure to derive model parameters.

Page 25: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 87

How to design an HMM for a new problem

Architecture/topology design: What are the states, observation symbols, and

the topology of the state transition graph?

Learning/Training: Fully annotated or partially annotated training

datasets Parameter estimation by maximum likelihood or by EM

Validation/Testing: Fully annotated testing datasets Performance evaluation (accuracy, specificity and

sensitivity)

Page 26: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 88

HMM model structure

Duration modeling

s LF

0.05 0.9

0.1

0.9

0.95

h:0.1t:0.9

h:0.5t:0.5

What is the probability of staying withthe fair coin for T time steps?

Page 27: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 89

Inherent limitation of HMMs

The duration in state F follows an exponentially decaying distribution called a geometric distribution.

The geometric distribution gives too much probability to short sequences of Fs and Ls and too little to medium and long sequences of Fs and Ls.

)05.0()95.0()( 1 TTFXP

Page 28: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 90

Duration modeling

To obtain non-geometric length distributions, we use an array of n F states, as follows:

Generated length distribution is a negative binomial.

F F FnnL pp

n

LLXP )1(

1

1)|(|

n=3

p p ppp p p

Page 29: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 91

Why does this matter?

IntronExon

Length of stay in “Exon” state determines length of predicted exons. Very short exons are rare.

Similarly for introns. Introns shorter than 30 bp do not exist.

L

Geometric dist

Exon length dist

Page 30: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 92

Length distributions of exons and introns

Page 31: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 93

Generalized HMMs (semi-Markov HMMs)

Each state has a specified length distribution.

Pick a state to start at t=1. Repeat

Pick the length of stay (d) in current state from distribution P.

Emit d symbols in current state. Pick a new state (according to a matrix) and transition to

it at time t+d

Exon Intron

PE PI No self-transitions togenerate extra symbols

Page 32: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 94

Example

Multiple symbols emitted in each state. One to one mapping betweensymbols and hidden states is lost in the generalized HMM.

Page 33: Dishonest casino as an HMM - Rice Universitydevika/comp470/lectures/lec3.pdf(c) Devika Subramanian, 2009 90 Duration modeling To obtain non-geometric length distributions, we use an

(c) Devika Subramanian, 2009 95

Viterbi algorithm for gHMMs

Just like Viterbi for HMMs, but we use the entire stay in state instead of a state at a given time.

tt-k

sj si

t-k-1

)()()(),(

),(maxmax)(

1

0

,

,1..0

jaklobjkf

jkfi

ktjii

k

r

rtiit

itijtk

t

Probability of most likely pathending at t with stay of k+1 in state i following a stay in state j