Speeding Up Algorithms for Hidden Markov Models by Exploiting Repetitions

Shay Mozes

Oren Weimann (MIT)

Michal Ziv-Ukelson (Tel-Aviv U.)

In short:
• Hidden Markov Models are extensively used to model processes in many fields
• The runtime of HMM algorithms is usually linear in the length of the input
• We show how to exploit repetitions to obtain speedup
• First provable speedup of Viterbi’s algorithm
• Can use different compression schemes
• Applies to several decoding and training algorithms

Markov Models

• states q_1, …, q_k
• transition probabilities P_{i←j}
• emission probabilities e_i(σ), σ ∈ Σ
• time independent, discrete, finite

Example with two states q_1, q_2 and Σ = {A, C, G, T}:

          A     C     G     T
e_1(σ)    0.3   0.2   0.2   0.3
e_2(σ)    0.2   0.3   0.3   0.2

P_{1←1} = 0.9    P_{2←1} = 0.1
P_{1←2} = 0.2    P_{2←2} = 0.8

Hidden Markov Models

[Figure: trellis of hidden states over the observed string x_1 x_2 x_3 … x_n]

• We are only given the description of the model and the observed string

• Decoding: find the hidden sequence of states that is most likely to have generated the observed string

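Decoding – Viterbi’s Algorithm

[Figure: trellis with states as rows and positions 1 2 3 4 5 6 7 8 9 … n as columns; observed string a a c g a c g g t …]

• v_5[2] = probability of best sequence of states that emits first 5 chars and ends in state 2
• v_5[j] = probability of best sequence of states that emits first 5 chars and ends in state j
• If the best path into state 4 comes from state 2: v_6[4] = e_4(c) · P_{4←2} · v_5[2]
• In general, maximize over the previous state: v_6[4] = max_j { e_4(c) · P_{4←j} · v_5[j] }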
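The recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `viterbi_values`, the 0-based state indices, the lowercase alphabet, and the uniform start vector `v0` are assumptions. The two-state parameters are the ones listed on the Markov Models slide.

```python
def viterbi_values(x, e, P, v0):
    """Return v_n, where v_t[i] is the probability of the best state
    sequence that emits x[:t] and ends in state i (states are 0-based)."""
    k = len(v0)
    v = list(v0)
    for ch in x:
        # v_t[i] = max_j e_i(x_t) * P_{i<-j} * v_{t-1}[j]
        v = [max(e[i][ch] * P[i][j] * v[j] for j in range(k)) for i in range(k)]
    return v


# Two-state example from the slides; P[i][j] stands for P_{i<-j}.
e = [{"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3},
     {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2}]
P = [[0.9, 0.2],
     [0.1, 0.8]]
v0 = [0.5, 0.5]  # assumption: the slides do not specify the start vector

v1 = viterbi_values("a", e, P, v0)
v9 = viterbi_values("aacgacggt", e, P, v0)
```

Each character costs k² multiplications, which gives the O(k²n) running time quoted later in the deck. A practical implementation would likely work with log-probabilities to avoid underflow on long inputs.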

Outline

• Overview
• Exploiting repetitions
• Using LZ78
• Using Run-Length Encoding
• Summary of results

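VA in Matrix Notation

Viterbi’s algorithm:

v_1[i] = max_j { e_i(x_1) · P_{i←j} · v_0[j] }

M_{ij}(σ) = e_i(σ) · P_{i←j}

v_1[i] = max_j { M_{ij}(x_1) · v_0[j] }

(A ⊗ B)_{ij} = max_k { A_{ik} · B_{kj} }

v_1 = M(x_1) ⊗ v_0

v_2 = M(x_2) ⊗ M(x_1) ⊗ v_0

v_n = M(x_n) ⊗ M(x_{n-1}) ⊗ ··· ⊗ M(x_1) ⊗ v_0

Evaluating right to left (matrix-vector products): O(k²n)

Multiplying the matrices first (matrix-matrix products): O(k³n)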
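The matrix form can be checked numerically. The sketch below assumes plain Python lists for matrices, a column vector stored as a k×1 matrix, and a uniform start vector; the helper names `otimes` and `M` are chosen here for illustration. It evaluates the same chain in both orders and relies on ⊗ being associative for non-negative entries.

```python
import math


def otimes(A, B):
    """Max-times product: (A ⊗ B)_ij = max_k A_ik * B_kj."""
    return [[max(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def M(ch, e, P):
    """M_ij(ch) = e_i(ch) * P_{i<-j}."""
    n = len(P)
    return [[e[i][ch] * P[i][j] for j in range(n)] for i in range(n)]


e = [{"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3},
     {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2}]
P = [[0.9, 0.2], [0.1, 0.8]]   # P[i][j] stands for P_{i<-j}
v0 = [[0.5], [0.5]]            # assumed uniform start, as a column vector
x = "aacgacggt"

# Right to left: n matrix-vector products, O(k^2 n).
v = v0
for ch in x:
    v = otimes(M(ch, e, P), v)

# Matrices first: n-1 matrix-matrix products, O(k^3 n), then one matrix-vector product.
prod = M(x[0], e, P)
for ch in x[1:]:
    prod = otimes(M(ch, e, P), prod)
w = otimes(prod, v0)

same = all(math.isclose(a[0], b[0], rel_tol=1e-9) for a, b in zip(v, w))
```

Both orders give the same vector up to rounding; only the cost differs.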

Exploiting Repetitions

X = c a t g a a c t g a a c

v_n = M(c)⊗M(a)⊗M(a)⊗M(g)⊗M(t)⊗M(c)⊗M(a)⊗M(a)⊗M(g)⊗M(t)⊗M(a)⊗M(c)⊗v_0    (12 steps)

• compute M(W) = M(c)⊗M(a)⊗M(a)⊗M(g) once, for W = g a a c
• use it twice!

v_n = M(W)⊗M(t)⊗M(W)⊗M(t)⊗M(a)⊗M(c)⊗v_0    (6 steps)
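The 12-step and 6-step evaluations on this slide can be compared directly. The following is a sketch under the same assumptions as the slide example (two states, the emission and transition values from the Markov Models slide) plus an assumed uniform start vector; `otimes` and `chain` are illustrative helper names.

```python
import math
from functools import reduce


def otimes(A, B):
    """Max-times product: (A ⊗ B)_ij = max_k A_ik * B_kj."""
    return [[max(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def chain(mats, v):
    """Apply the matrices to v one by one; mats[0] is applied first."""
    for Mx in mats:
        v = otimes(Mx, v)
    return v


e = [{"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3},
     {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2}]
P = [[0.9, 0.2], [0.1, 0.8]]   # P[i][j] stands for P_{i<-j}
v0 = [[0.5], [0.5]]            # assumed uniform start
Msym = {ch: [[e[i][ch] * P[i][j] for j in range(2)] for i in range(2)]
        for ch in "acgt"}

X = "catgaactgaac"
W = "gaac"

# M(W) = M(c) ⊗ M(a) ⊗ M(a) ⊗ M(g): later characters multiply from the left.
M_W = reduce(lambda acc, ch: otimes(Msym[ch], acc), W[1:], Msym[W[0]])

direct = chain([Msym[ch] for ch in X], v0)                      # 12 steps
reused = chain([Msym["c"], Msym["a"], Msym["t"], M_W,
                Msym["t"], M_W], v0)                            # 6 steps

same = all(math.isclose(a[0], b[0], rel_tol=1e-9)
           for a, b in zip(direct, reused))
```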

Exploiting repetitions

ℓ – length of repetition W

λ – number of times W repeats in string

computing M(W) costs (ℓ-1)k³ (matrix-matrix multiplication)

each time W appears we save (ℓ-1)k² (matrix-vector multiplication)

W is good if λ(ℓ-1)k² > (ℓ-1)k³, that is, if the number of repeats λ > k, the number of states
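The break-even condition is simple arithmetic and can be tabulated. The function name `net_saving` is hypothetical, and the count uses only the two cost terms stated on the slide, with no constant factors.

```python
def net_saving(ell, lam, k):
    """Multiplications saved by precomputing M(W) for a substring of
    length ell that repeats lam times, in a model with k states."""
    saved = lam * (ell - 1) * k ** 2   # (ell-1)*k^2 saved per occurrence
    spent = (ell - 1) * k ** 3         # one-off cost of computing M(W)
    return saved - spent


# W is good exactly when lam > k (for ell > 1).
good = {(lam, k): net_saving(4, lam, k) > 0
        for lam in range(1, 6) for k in range(1, 6)}
```

Under this count, a substring of length 4 that repeats twice in a two-state model only breaks even (`net_saving(4, 2, 2)` is 0), while a third repetition yields a net saving.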

General Scheme

I. dictionary selection: choose the set D = {W_i} of good substrings

II. encoding: compute M(W_i) for every W_i in D

III. parsing: partition the input X into good substrings, X = W_{i_1} W_{i_2} … W_{i_n'}, and set X' = i_1, i_2, …, i_n'

IV. propagation: run Viterbi’s Algorithm on X' using M(W_i)

[Slide annotation: Offline]

Using LZ78

• The next LZ-word is the longest LZ-word previously seen plus one character
• Use a trie
• Number of LZ-words is asymptotically < n ∕ log n

[Figure: LZ78 trie for the string aacgacg]

Step                                                               Cost
I.   dictionary selection: D = words in LZ parse of X              O(n)
II.  encoding: use incremental nature of LZ,
     M(Wσ) = M(σ) ⊗ M(W) (σ is emitted last, so M(σ) is leftmost)  O(k³n ∕ log n)
III. parsing: X' = LZ parse of X                                   O(n)
IV.  propagation: run VA on X' using M(W_i)                        O(k²n ∕ log n)

Speedup: k²n ∕ (k³n ∕ log n) = log n ∕ k
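The four steps can be put together in a short sketch. This is an illustration under assumptions, not the authors' implementation: words are stored as (parent index, character) pairs with index 0 for the empty word, a trailing incomplete word is mapped to the existing word it equals, probabilities are multiplied directly without log-scaling, and all function names are chosen here.

```python
import math


def otimes(A, B):
    """Max-times product: (A ⊗ B)_ij = max_k A_ik * B_kj."""
    return [[max(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def lz78_parse(x):
    """Steps I and III: return (words, parse). words[i-1] = (parent, ch)
    describes LZ-word i; parse lists the word indices that spell x."""
    trie, words, parse, node = {}, [], [], 0
    for ch in x:
        nxt = trie.get((node, ch))
        if nxt is None:                 # longest known word plus one character
            words.append((node, ch))
            trie[(node, ch)] = len(words)
            parse.append(len(words))
            node = 0
        else:
            node = nxt
    if node:                            # leftover suffix is an existing word
        parse.append(node)
    return words, parse


def lz_viterbi(x, e, P, v0):
    k = len(P)
    sym = {ch: [[e[i][ch] * P[i][j] for j in range(k)] for i in range(k)]
           for ch in set(x)}
    words, parse = lz78_parse(x)
    mats = [None]                       # step II; index 0 is the empty word
    for parent, ch in words:
        # M(W ch) = M(ch) ⊗ M(W): one matrix-matrix product per LZ-word
        mats.append(sym[ch] if parent == 0 else otimes(sym[ch], mats[parent]))
    v = [[p] for p in v0]
    for idx in parse:                   # step IV: one matrix-vector product per word
        v = otimes(mats[idx], v)
    return [row[0] for row in v]


def plain_viterbi(x, e, P, v0):
    k = len(P)
    v = list(v0)
    for ch in x:
        v = [max(e[i][ch] * P[i][j] * v[j] for j in range(k)) for i in range(k)]
    return v


e = [{"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3},
     {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2}]
P = [[0.9, 0.2], [0.1, 0.8]]            # P[i][j] stands for P_{i<-j}
v0 = [0.5, 0.5]                         # assumed uniform start

words, parse = lz78_parse("aacgacg")    # a | ac | g | acg
x = "aacgacggt" * 4 + "a"
fast, slow = lz_viterbi(x, e, P, v0), plain_viterbi(x, e, P, v0)
```

On inputs this short the compressed version does more work than plain VA; the asymptotic gain only appears once words repeat often enough.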

Improvement

• Remember speedup condition: λ > k
• Use just LZ-words that appear more than k times
• These words are represented by trie nodes with more than k descendants
• Now must parse X (step III) differently
• Ensures graceful degradation with increasing k:

Speedup: max(1, log n ∕ k)

Experimental results

• Short: 1.5 Mbp chromosome 4 of S. cerevisiae (yeast)
• Long: 22 Mbp human Y-chromosome

[Figure: ~5× faster]

Run Length Encoding

aaaccggggg → a3c2g5

aaaccggggg → a2a1c2g4g1
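Both encodings on this slide can be reproduced with a short sketch. The second form splits every run into powers of two, presumably so that the matrix of a run of length 2^i can be obtained by repeatedly squaring M(σ) under ⊗; that reading, and the function names below, are assumptions.

```python
from itertools import groupby


def run_length_encode(s):
    """'aaaccggggg' -> [('a', 3), ('c', 2), ('g', 5)]"""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def split_into_powers_of_two(runs):
    """Split each run length into its binary components, largest first."""
    out = []
    for ch, n in runs:
        bit = 1 << (n.bit_length() - 1)   # highest power of two <= n
        while bit:
            if n & bit:
                out.append((ch, bit))
            bit >>= 1
    return out


def show(runs):
    return "".join(f"{ch}{n}" for ch, n in runs)


runs = run_length_encode("aaaccggggg")
rle = show(runs)
rle_pow2 = show(split_into_powers_of_two(runs))
```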

Summary of results

• General framework
• LZ78: log(n) ∕ k
• RLE: r ∕ log(r)
• Byte-Pair Encoding: r
• Path reconstruction: O(n)
• F/B algorithms (standard matrix multiplication)
• Viterbi training: same speedups apply
• Baum-Welch training: speedup, many details
• Parallelization

Thank you!

Any questions?

Path traceback

• In VA, easy to do in O(n) time by keeping track of maximizing states during computation

• The problem: we run VA on X’, so we get the sequence of states for X’, not for X. We only get the states on the boundaries of good substrings of X

• Solution: keep track of maximizing states when computing the matrices M(w). Takes O(n) time and O(nk²) space
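For reference, the first bullet (traceback in ordinary VA) can be sketched as follows. This covers only the uncompressed case; the compressed variant would additionally have to store maximizing states for every entry of each M(w), which is not shown. The function name and the uniform start vector are assumptions.

```python
def viterbi_path(x, e, P, v0):
    """Return (path, prob): the most likely state sequence for x (0-based
    states, one per character) and its probability."""
    k = len(v0)
    v = list(v0)
    back = []   # back[t][i] = best predecessor of state i when emitting x[t]
    for ch in x:
        ptr = [max(range(k), key=lambda j: P[i][j] * v[j]) for i in range(k)]
        v = [e[i][ch] * P[i][ptr[i]] * v[ptr[i]] for i in range(k)]
        back.append(ptr)
    if not x:
        return [], max(v)
    state = max(range(k), key=lambda i: v[i])
    path = [state]
    # back[0] points into the start vector, so it is not part of the path.
    for ptr in reversed(back[1:]):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, max(v)


e = [{"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3},
     {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2}]
P = [[0.9, 0.2], [0.1, 0.8]]   # P[i][j] stands for P_{i<-j}
v0 = [0.5, 0.5]                # assumed uniform start

path_a, prob_a = viterbi_path("aaaa", e, P, v0)
path_c, prob_c = viterbi_path("cccc", e, P, v0)
```

The back-pointer table takes O(nk) space here, and the traceback itself is a single O(n) pass.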

Training

• Estimate unknown parameters P_{i←j}, e_i(σ)
• Use Expectation Maximization:
  1. Decoding
  2. Recalculate parameters
• Viterbi Training: each iteration costs O(VA + n + k²)
  VA: decoding (bottleneck), speedup!
  n + k²: path traceback + update P_{i←j}, e_i(σ)
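The "recalculate parameters" step is not spelled out on the slide. A common choice in Viterbi training is to re-estimate by counting along the decoded path; the sketch below assumes that choice, uses a hypothetical function name `reestimate`, and falls back to uniform values for states that never occur.

```python
from collections import Counter


def reestimate(x, path, k, alphabet):
    """Count-based update from one decoded path (0-based states).
    Returns (P, e) with P[i][j] standing for P_{i<-j}."""
    trans = Counter(zip(path[1:], path[:-1]))   # keys (i, j): move j -> i
    emit = Counter(zip(path, x))                # keys (i, ch)
    P = [[0.0] * k for _ in range(k)]
    e = [dict.fromkeys(alphabet, 0.0) for _ in range(k)]
    for j in range(k):
        total = sum(trans[(i, j)] for i in range(k))
        for i in range(k):
            P[i][j] = trans[(i, j)] / total if total else 1.0 / k
    for i in range(k):
        total = sum(emit[(i, ch)] for ch in alphabet)
        for ch in alphabet:
            e[i][ch] = emit[(i, ch)] / total if total else 1.0 / len(alphabet)
    return P, e


P_new, e_new = reestimate("aacc", [0, 0, 1, 1], k=2, alphabet="ac")
```

Counting takes one pass over the path, and filling the tables takes time proportional to their size.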

Baum Welch Training

• each iteration costs: O(FB + nk²)
  (slide annotations: decoding O(nk²); path traceback + update P_{i←j}, e_i(σ))

• If a substring w of length ℓ that repeats λ times satisfies: [condition missing from the source] then the entire process can be sped up by precalculation
