
Page 1

Hidden Markov Models in the context of genetic analysis

Vincent Plagnol

UCL Genetics Institute

November 22, 2012

Page 2

Outline

1 Introduction

2 Two basic problems
Forward/backward
Baum-Welch algorithm
Viterbi algorithm

3 When the parameters are unknown

4 Two applications
Gene prediction
CNV detection from SNP arrays

5 Two extensions to the basic HMM
Stochastic EM
Semi-Markov models


Page 4

The problem

Many applications of statistics can be seen as a categorisation.

We try to fit complex patterns into discrete boxes in order to apprehend them better.

Clustering approaches are typical of this:

Inference of an individual's ancestry as a mix of X and Y.
Separation between high risk and low risk disease groups . . .

Hidden Markov Models try to achieve exactly this purpose in a different context.

Page 5

Basic framework

Page 6

An example: gene discovery from DNA sequence

Page 7

An example: gene discovery from DNA sequence

We will first use this simple example.

We assume that the hidden chain X has two states: gene, or intergenic.

To be complete there should be a third state: gene on the reverse strand.

For now we assume that the emission probabilities P(Y_i | X_i) are independent conditionally on the hidden chain X.

This may not be good enough for most applications but it is a place to start.

Page 8

Notations

(Y_i)_{i=1}^n represents the sequence of observed data points.

The Y_i can be discrete or continuous, but we will assume discrete for now.

(X_i)_{i=1}^n is the sequence of hidden states.

∀i, X_i ∈ {1, . . . , S}: we have S discrete hidden states.

We also assume that we know the distribution P(Y | X), but this set of parameters may also be unknown.

Page 9

Basic description of Markov Chains (1)

A discrete stochastic process X is Markovian if

P(X_1^n | X_i) = P(X_1^{i−1} | X_i) × P(X_{i+1}^n | X_i)

Essentially the future and the past are independent conditionally on the present: the process is "memory-less".

One can easily make a continuous version of this.

If the Markov model has S states, then the process can be described using an S × S transition matrix.

The diagonal values p_ii describe the probability of staying in state i.
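As a small illustration (a sketch, not from the slides; the transition values are made up), here is a two-state transition matrix and a simulation of the chain in R, where each step depends only on the current state:

```r
set.seed(1)

## Made-up 2-state transition matrix (rows sum to 1):
## state 1 = intergenic, state 2 = gene.
P <- matrix(c(0.99, 0.01,
              0.05, 0.95), nrow = 2, byrow = TRUE)

## Simulate n steps: the next state depends only on the
## current one (the memory-less property).
simulate_chain <- function(P, n, start = 1) {
  x <- integer(n)
  x[1] <- start
  for (i in 2:n) {
    x[i] <- sample.int(nrow(P), size = 1, prob = P[x[i - 1], ])
  }
  x
}

x <- simulate_chain(P, n = 10000)
table(x)  # time spent in each state
```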

Page 10

Basic description of Markov Chains (2)

The probability of spending exactly k units of time in state i is:

P(X spends k units in i) = p_ii^k (1 − p_ii)

This is the definition of a geometric variable.

In continuous time it would be an exponential distribution.

The definition of the present can also be modified: X_i may for example depend on the previous k states instead of the last one.

This increases the size of the parameter space but makes the model richer.
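A quick numerical check of the stay-duration formula above (a sketch; p_ii is a made-up value). R's dgeom(k, prob) returns prob × (1 − prob)^k, so with prob = 1 − p_ii it reproduces p_ii^k (1 − p_ii) exactly:

```r
p_ii <- 0.95
k <- 0:5
## P(X spends exactly k units in state i) = p_ii^k * (1 - p_ii)
cbind(direct = p_ii^k * (1 - p_ii),
      dgeom  = dgeom(k, prob = 1 - p_ii))
```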

Page 11

Basics for hidden Markov Chains

The hidden Markov chain framework adds one layer (denoted Y) to the Markovian process described previously.

The conditional distribution P(Y_j | X_j = s) may be unknown, completely specified or partially specified.

Typically the number of hidden states S is relatively small (no more than a few hundred states).

But n may be very large, i.e. X and Y may be very long sequences (think DNA sequences).

Page 12

Slightly more general version

Without complicating anything, we can most of the time assume that P(Y_j | X_j) also varies with j.

Y could also be a Markov chain.

Non-Markovian stays can be, to some extent, mimicked by using a sequence of hidden states:

First part of the gene, middle of the gene, end of the gene.

Page 13

The set of parameters Θ

1 (P_st) is the transition matrix for the hidden states.

2 Q_sk = P(Y = k | X = s) is the probability distribution for the observed chain Y given X.

3 Lastly, we need a vector µ to initiate the hidden chain X.

Page 14

Two related problems

1 At a given point i in the sequence, what is the most likely hidden state X_i?

2 What is the most likely hidden sequence (X_i)_{i=1}^n?

The first question relates to marginal probabilities and the second to the joint likelihood.


Page 17

What we can compute at this stage

At this stage our tools are limited.

Given a sequence x = (x_1, . . . , x_n) we can compute

P(X = x, Y = y) = P(X = x) P(Y = y | X = x)

This is the full joint likelihood for (X, Y).
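A sketch of this computation in R, done on the log scale to avoid underflow (µ, P and Q are the parameters Θ defined earlier; the function name is illustrative):

```r
## Joint log-likelihood of a hidden path x and observations y,
## given initial distribution mu, transition matrix P and
## emission matrix Q, where Q[s, k] = P(Y = k | X = s).
loglik_joint <- function(x, y, mu, P, Q) {
  n <- length(x)
  ll <- log(mu[x[1]]) + log(Q[x[1], y[1]])
  for (i in 2:n) {
    ll <- ll + log(P[x[i - 1], x[i]]) + log(Q[x[i], y[i]])
  }
  ll
}
```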

Page 18

Why problem 1 is difficult

P(X_i = x_i | Y) = P(X_i = x_i, Y) / P(Y) = P(X_i = x_i, Y) / Σ_{s=1}^{S} P(X_i = s, Y)

So the problem amounts to estimating P(X_i = s, Y).

A direct computation would sum over all possible sequences:

P(X_i = s, Y) = Σ_{x : x_i = s} P(X = x, Y)

With S hidden states we need to sum over S^n terms, which is not practical.

We need to be smarter.

Page 19

We need to use the Markovian assumption

P(X_i = s, Y) = P(X_i = s) P(Y | X_i = s)

= P(X_i = s) Σ_x P(Y, X = x | X_i = s)

= P(X_i = s) [ Σ_{x_1^i} P(Y_1^i, X_1^i = x_1^i | X_i = s) ] × [ Σ_{x_{i+1}^n} P(Y_{i+1}^n, X_{i+1}^n = x_{i+1}^n | X_i = s) ]

= P(X_i = s) P(Y_1^i | X_i = s) × P(Y_{i+1}^n | X_i = s)

= P(Y_1^i, X_i = s) P(Y_{i+1}^n | X_i = s)

= α_s(i) × β_s(i)

Page 20

A new computation

We have shown that:

P(X_i = s | Y) = α_s(i) × β_s(i) / Σ_{t=1}^{S} α_t(i) × β_t(i)

where:

α_s(i) = P(Y_1^i, X_i = s)

β_s(i) = P(Y_{i+1}^n | X_i = s)

And it is actually possible to compute, recursively, the quantities α_s(i), β_s(i).

Page 21

Two recursive computations

The (forward) recursion for α is:

α_s(i + 1) = P(Y_{i+1} | X_{i+1} = s) × Σ_{t=1}^{S} α_t(i) P_ts

The (backward) recursion for β is:

β_s(i − 1) = Σ_t P_st β_t(i) P(Y_i | X_i = t)

Page 22

Proof for the first recursion

α_s(i + 1) = P(Y_1^{i+1}, X_{i+1} = s)

= Σ_t P(Y_1^{i+1}, X_{i+1} = s | X_i = t) P(X_i = t)

= Σ_t P(Y_1^{i+1} | X_{i+1} = s, X_i = t) P(X_{i+1} = s | X_i = t) P(X_i = t)

= P(Y_{i+1} | X_{i+1} = s) Σ_t P_ts P(Y_1^i | X_i = t, X_{i+1} = s) P(X_i = t)

= P(Y_{i+1} | X_{i+1} = s) Σ_t P_ts P(Y_1^i, X_i = t)

= P(Y_{i+1} | X_{i+1} = s) Σ_t P_ts α_t(i)

A similar proof is used for the backward recursion.

Page 23

Computational considerations

The algorithm requires storing n × S floats.

In terms of computation time, the requirements are in S² × n.

Linearity in n is the key feature because it enables the analysis of very long DNA sequences.

Note that probabilities rapidly become vanishingly small.

Everything needs to be done on the log scale (be careful when implementing it).

Various R packages are available for hidden Markov chains (google it!).
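A minimal log-scale implementation of the two recursions and of the posterior marginals (a sketch under the discrete-emission set-up above; dedicated R packages such as HiddenMarkov or depmixS4 provide production-quality versions):

```r
logsumexp <- function(v) { m <- max(v); m + log(sum(exp(v - m))) }

## Forward/backward on the log scale.
## y: observed sequence (integers in 1..K), mu: initial distribution,
## P: S x S transition matrix, Q: S x K emission matrix.
forward_backward <- function(y, mu, P, Q) {
  n <- length(y); S <- nrow(P)
  la <- matrix(-Inf, n, S)  # la[i, s] = log alpha_s(i)
  lb <- matrix(0,    n, S)  # lb[i, s] = log beta_s(i); beta_s(n) = 1
  la[1, ] <- log(mu) + log(Q[, y[1]])
  for (i in 2:n) {
    for (s in 1:S) {
      la[i, s] <- log(Q[s, y[i]]) + logsumexp(la[i - 1, ] + log(P[, s]))
    }
  }
  for (i in (n - 1):1) {
    for (s in 1:S) {
      lb[i, s] <- logsumexp(log(P[s, ]) + log(Q[, y[i + 1]]) + lb[i + 1, ])
    }
  }
  ## Posterior marginals P(X_i = s | Y), normalised per position.
  lg <- la + lb
  post <- exp(lg - apply(lg, 1, logsumexp))
  list(log_alpha = la, log_beta = lb, posterior = post)
}
```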


Page 25

Problem 2: Finding the most likely hidden sequence X̂

A different problem consists of finding the most likely hidden sequence X̂.

Indeed, the most likely X_i using the marginal distribution may be quite different from X̂_i.

An algorithm exists to achieve this maximisation: it is called the Viterbi algorithm.

Page 26

The Viterbi algorithm

Define

V_s(i) = max_{x_1^i} P(Y_1^i, X_1^i = x_1^i | X_i = s)

Similarly to the previous problem, a forward recursion can be defined for V_s(i + 1) as a function of the V_t(i).

Following this forward computation, a reverse parsing of the Markov chain can identify the most likely sequence.

Page 27

An exercise

Here is a table that shows the log-likelihood of the data for three states (one state per row, 6 points in the chain), given the position in the chain and the hidden state (which can be either 1, 2 or 3).

State   1  2  3  4  5  6
1       1  3  4  3  5  4
2       2  1  5  8  5  1
3       4  2  2  4  1  5

Assume that remaining in the same state costs no log-likelihood, but transitioning from one state to another costs one unit of log-likelihood. The initial distribution over the three states is uniform.

Compute

V_s(i) = max_{x_1^i} P(Y_1^i, X_1^i = x_1^i | X_i = s)

and estimate the most likely Viterbi path.
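A sketch of a solution in R, assuming the table entries are log-likelihoods to be maximised, that a state change subtracts one unit, and that the uniform start adds the same constant to every state (so it can be ignored):

```r
## Emission log-likelihoods from the exercise:
## rows = states, columns = positions in the chain.
ll <- rbind(c(1, 3, 4, 3, 5, 4),
            c(2, 1, 5, 8, 5, 1),
            c(4, 2, 2, 4, 1, 5))
S <- nrow(ll); n <- ncol(ll)
trans <- ifelse(diag(S) == 1, 0, -1)  # stay: 0, switch: -1 unit

V   <- matrix(-Inf, S, n)  # V[s, i] = best score of a path ending in s at i
ptr <- matrix(0L,   S, n)  # back-pointers for the reverse parsing
V[, 1] <- ll[, 1]          # uniform start: same constant for every state
for (i in 2:n) {
  for (s in 1:S) {
    cand <- V[, i - 1] + trans[, s]
    ptr[s, i] <- which.max(cand)
    V[s, i] <- max(cand) + ll[s, i]
  }
}
path <- integer(n)
path[n] <- which.max(V[, n])
for (i in (n - 1):1) path[i] <- ptr[path[i + 1], i + 1]
path  # one most likely sequence of hidden states
```

Under these assumptions the script returns the path 3 1 2 2 2 3 with score 27; ties exist, so other paths can score equally well.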

Page 28

A few words about Andrew Viterbi

Andrew James Viterbi (born in Bergamo in 1935) is an Italian-American electrical engineer and businessman.

In addition to his academic work he co-founded Qualcomm.

Viterbi made a very large donation to the University of Southern California, which named its engineering school the Viterbi School of Engineering.

Page 29

Computational considerations

Requirements are the same as before.

The algorithm requires storing n × S floats.

In terms of computation time, the requirements are in S² × n.

Linearity in n is the key feature because it enables the analysis of very long DNA sequences.

Easy to code (in C or R, see example and R libraries).


Page 31

Unknown parameters case

Often we do not know the distribution P(Y | X).

We may also not know the transition probabilities for the hidden Markov chain X.

If the parameters Θ are not known, how can we estimate them?

Page 32

What if we knew X?

If we know X, the problem becomes straightforward.

For example, a maximum likelihood estimate would be:

P(Y = k | X = s) = Σ_i 1_{Y_i = k, X_i = s} / Σ_i 1_{X_i = s}

More sophisticated (but still straightforward) versions of this could be used if Y were an nth-order Markov chain.
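In R this estimate is essentially a one-liner (a sketch; x and y are the aligned hidden and observed sequences):

```r
## Q_hat[s, k] = #{i : X_i = s, Y_i = k} / #{i : X_i = s}
Q_hat <- prop.table(table(x, y), margin = 1)
```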

Page 33

A typical missing data problem

In this missing data context, a widely used algorithm is the Expectation-Maximisation (EM) algorithm.

The EM algorithm is set up to find the parameters that maximise the likelihood of the observed data Y in the presence of missing data X.

At each step the likelihood is guaranteed to increase.

The algorithm can easily get stuck in a local maximum of the likelihood surface.

Page 34

The basic idea of the EM

This is a general iterative algorithm with multiple applications.

It first computes the expected value of the log-likelihood given the current parameters θ_n (essentially imputing the hidden chain X):

Q(θ, θ_n) = E_{X|Y,θ_n} (log L(X, Y, θ))

Then it maximises the quantity Q(θ, θ_n) as a function of θ:

θ_{n+1} = argmax_θ Q(θ, θ_n)

Page 35

EM in the context of HMM

P_st = Σ_i P(X_i = s, X_{i+1} = t | Y) / Σ_i P(X_i = s | Y)

Q_sk = Σ_i 1_{Y_i = k} P(X_i = s | Y) / Σ_i P(X_i = s | Y)

The updated probabilities can be estimated using the sequences α_s, β_s computed previously.

This special case of the EM for HMMs is called the Baum-Welch algorithm.
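A sketch of one full update in R, reusing the forward_backward function from the earlier sketch; the pairwise posteriors P(X_i = s, X_{i+1} = t | Y) are proportional to α_s(i) P_st Q[t, y_{i+1}] β_t(i+1):

```r
## One Baum-Welch update (E-step quantities + M-step re-estimation).
baum_welch_step <- function(y, mu, P, Q) {
  fb <- forward_backward(y, mu, P, Q)
  n <- length(y); S <- nrow(P); K <- ncol(Q)
  num <- matrix(0, S, S)
  for (i in 1:(n - 1)) {
    ## log of alpha_s(i) * P_st * Q[t, y_{i+1}] * beta_t(i+1)
    lx <- outer(fb$log_alpha[i, ],
                log(Q[, y[i + 1]]) + fb$log_beta[i + 1, ], "+") + log(P)
    w <- exp(lx - max(lx))
    num <- num + w / sum(w)  # normalised pairwise posterior
  }
  P_new <- num / rowSums(num)
  Q_new <- sapply(1:K, function(k)
    colSums(fb$posterior[y == k, , drop = FALSE])) / colSums(fb$posterior)
  list(mu = fb$posterior[1, ], P = P_new, Q = Q_new)
}
```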


Page 38

Gene prediction

[Figure from Zhang, Nat Rev Genetics, 2002]

Page 39

Some drawbacks of this approach

The number of hidden states can be very large.

Modelling codons takes three states, plus probably three states for the first and three states for the last codons.

So about nine states just for the exons.

One probably needs nine more states on the reverse strand.

Some alternatives exist (using semi-Markov models).


Page 41

Copy number variant detection from SNP arrays

[Scatter plot of SNP array signal intensities: Allele 1 on the x axis and Allele 2 on the y axis, both ranging from 0.0 to 1.5.]

Page 42

Copy number variant detection from SNP arrays

[Figure from Wang et al, Genome Research 2007]


Page 45

Stochastic EM (SEM)

The EM/Baum-Welch algorithm essentially uses the conditional distribution of X given Y.

Another way to compute this expectation is to use a Monte-Carlo approach, by simulating X given Y and taking an average.

This is a trade-off:

We of course do not retain the certainty that the likelihood is increasing (as provided by the EM).
However, the added randomness may avoid the pitfall of the estimator getting stuck in a local maximum (a major issue with the EM).

Page 46

Stochastic EM (SEM)

A simulation of X conditionally on Y would use the following decomposition:

P(X_1^N | Y_1^N) = P(X_1 | Y_1^N) P(X_2 | Y_1^N, X_1) · · · P(X_N | Y_1^N, X_1^{N−1})

This relies on being able to compute the marginal probabilities, but this is what Baum-Welch does.

Once the α, β have been computed, the simulation is linear in time and multiple sequences can be simulated rapidly.

Page 47

How to simulate in practice

The simulation uses the equality:

P(X_{i+1} = t | Y, X_i = s) = P_st P(Y_{i+1} | X_{i+1} = t) P(Y_{i+2}^n | X_{i+1} = t) / P(Y_{i+1}^n | X_i = s)

= P_st P(Y_{i+1} | X_{i+1} = t) β_t(i + 1) / β_s(i)

Note that this is a forward-backward algorithm as well, but the forward step is built into the simulation step, unlike the traditional Baum-Welch.
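A sketch of this simulation in R, reusing forward_backward from the earlier sketch (only the β and posterior quantities are needed):

```r
## Draw one hidden sequence from P(X | Y) by forward sampling.
sample_hidden <- function(y, mu, P, Q) {
  fb <- forward_backward(y, mu, P, Q)
  n <- length(y); S <- nrow(P)
  x <- integer(n)
  ## P(X_1 = s | Y) is the posterior marginal at position 1.
  x[1] <- sample.int(S, size = 1, prob = fb$posterior[1, ])
  for (i in 1:(n - 1)) {
    ## P(X_{i+1} = t | Y, X_i = s) is proportional to
    ## P_st * P(Y_{i+1} | X_{i+1} = t) * beta_t(i + 1).
    lw <- log(P[x[i], ]) + log(Q[, y[i + 1]]) + fb$log_beta[i + 1, ]
    x[i + 1] <- sample.int(S, size = 1, prob = exp(lw - max(lw)))
  }
  x
}
```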

Page 48

Estimation issues

Using a single simulated run for the hidden chain X is necessarily less efficient than relying on the expected probability.

The number of data points must be very large to make the estimation precise.

One could potentially take an average of multiple simulated runs.

With sufficient numbers of simulations one actually gets very close to the EM.

Like most practical estimation procedures, one has to find the right combination of tools, and there is not one answer.


Page 50

Semi-Markov models (HSMM)

In the context of gene prediction, using three states per codon is not satisfying.

We would like something that takes into account groups of 3 bp jointly.

Semi-Markov models do exactly this:

When entering a state s, a random variable T_s is drawn for the duration of the stay in state s.
Then the emission probability for Y can be defined for the entire duration of the stay.
So codons are naturally defined by groups of 3 bp instead of dealing with multiple hidden states.

Page 51

Backward recursion for SEM applied to semi-Markov hidden chains

We are interested in computing the quantities:

∀n ∈ [1, N − 1], ∀i ∈ [1, S], β_i(n) = P(Y_{n+1}^N | Y_1^n, X_n = i)

β_i(N) = 1

β_i(n) = P(Y_{n+1}^N | Y_1^n, X_n = i)

= Σ_j Σ_{l < N−n} P_ij P(T_j = l) P(Y_{n+1}^{n+l} | X_{n+1}^{n+l} = j) β_j(n + l)

Note that the complexity is now in N × S² × max(l), as opposed to N × S² before.
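A sketch of this recursion in R, kept on the probability scale for readability (a real implementation would work on the log scale). The names are illustrative assumptions: dur[j, l] = P(T_j = l), emis(j, a, b) is a user-supplied function returning P(Y_a^b | X_a^b = j), and P is the jump matrix with zero diagonal; the truncation of the final stay is handled crudely:

```r
## Backward recursion for a hidden semi-Markov chain.
## The four nested loops make the N * S^2 * max(l) cost explicit.
hsmm_backward <- function(y, P, dur, emis) {
  N <- length(y); S <- nrow(P); Lmax <- ncol(dur)
  beta <- matrix(0, N, S)
  beta[N, ] <- 1                       # beta_i(N) = 1
  for (n in (N - 1):1) {
    for (i in 1:S) {
      acc <- 0
      for (j in 1:S) {
        for (l in 1:min(Lmax, N - n)) {
          acc <- acc + P[i, j] * dur[j, l] *
            emis(j, n + 1, n + l) * beta[n + l, j]
        }
      }
      beta[n, i] <- acc
    }
  }
  beta
}
```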

Page 52

Forward simulations for SEM

One can simulate a new hidden sequence recursively with the formula:

P(X_{n+1}^{n+l} = j | Y_1^N, X_n = i) = P_ij P(T_j = l) P(Y_{n+1}^{n+l} | X_{n+1}^{n+l} = j) β_j(n + l) / β_i(n)

This is very much analogous to the basic HMM situation, with the extra complication generated by the variable state length.

Page 53

Estimation for semi-Markov models

It is possible to run a Viterbi algorithm using the same recursion derived for the Markovian case.

It is also possible to use a SEM algorithm to simulate the hidden sequence X and use it to estimate the parameters of the model.

A full EM is also possible but I have never implemented it.

The computational requirements may become challenging, but it all depends on the application.