
Hidden Markov Models in the context of genetic analysis

Vincent Plagnol

UCL Genetics Institute

November 22, 2012

Outline

1 Introduction

2 Two basic problems
    Forward/backward algorithm
    Viterbi algorithm

3 When the parameters are unknown
    Baum-Welch algorithm

4 Two applications
    Gene prediction
    CNV detection from SNP arrays

5 Two extensions to the basic HMM
    Stochastic EM
    Semi-Markov models


The problem

Many applications of statistics can be seen as a categorisation.

We try to fit complex patterns into discrete boxes in order to apprehend them better.

Clustering approaches are typical of this:

Inference of an individual's ancestry being a mix of X and Y.
Separation between high-risk and low-risk disease groups . . .

Hidden Markov Models try to achieve exactly this purpose in a different context.

Basic framework

An example: gene discovery from DNA sequence

We will first use this simple example.

We assume that the hidden chain X has two states: gene, or intergenic.

To be complete there should be a third state: gene on the reverse strand.

For now we assume that the emission probabilities P(Y_i | X_i) are independent conditionally on the hidden chain X.

This may not be good enough for most applications but this is a place to start.

Notations

(Y_i)_{i=1}^n represents the sequence of observed data points.

The Y_i can be discrete or continuous, but we will assume discrete for now.

(X_i)_{i=1}^n is the sequence of hidden states.

For all i, X_i ∈ {1, . . . , S}: there are S discrete hidden states.
We also assume that we know the distribution P(Y | X), but this set of parameters may also be unknown.

Basic description of Markov Chains (1)

A discrete stochastic process X is Markovian if

P(X_1^n | X_i) = P(X_1^{i-1} | X_i) P(X_{i+1}^n | X_i)

Essentially the future and the past are independent conditionally on the present: it is "memory-less".

One can easily make a continuous version of this.

If the Markov model has S states, then the process can be described using an S × S transition matrix.

The diagonal values p_ii describe the probability of staying in state i.

Basic description of Markov Chains (2)

The probability of spending exactly k units of time in state i is:

P(X spends k units in state i) = p_ii^k (1 − p_ii)

This is the definition of a geometric variable.

In a continuous-time setting it would be an exponential distribution.

The definition of the present can also be modified: X_i may for example depend on the previous k states instead of the last one.

This increases the size of the parameter space but makes the model richer.
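To make the geometric stay length concrete, here is a small illustrative R sketch (not from the slides; the transition values and names are made up): it simulates a two-state chain and compares the empirical distribution of the number of extra steps spent in state 1 with the law p_11^k (1 − p_11) given above.

## Illustrative sketch: simulate a two-state Markov chain and check that the
## number of extra steps spent in state 1 follows p_11^k (1 - p_11).
set.seed(1)
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), 2, 2, byrow = TRUE)   # rows sum to 1

simulate_chain <- function(n, P, start = 1) {
  x <- integer(n); x[1] <- start
  for (i in 2:n) x[i] <- sample.int(nrow(P), 1, prob = P[x[i - 1], ])
  x
}

x     <- simulate_chain(1e5, P)
runs  <- rle(x)                                  # consecutive stays
stays <- runs$lengths[runs$values == 1] - 1      # extra steps spent in state 1
rbind(empirical   = table(factor(stays, 0:4)) / length(stays),
      theoretical = dgeom(0:4, prob = 1 - P[1, 1]))

Ignoring edge effects at the start and end of the simulated chain, the two rows should agree closely.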

Basics for hidden Markov Chains

The hidden Markov chain framework adds one layer (denoted Y) to the Markovian process described previously.

The conditional distribution P(Y_j | X_j = s) may be unknown, completely specified or partially specified.

Typically the number of hidden states S is relatively small (no more than a few hundred states).

But n may be very large, i.e. X and Y may be very long sequences (think DNA sequences).

Slightly more general version

Without complicating anything, we can most of the time assume that P(Y_j | X_j) also varies with j.

Y could also be a Markov chain.

Non-Markovian stays can be, to some extent, mimicked by using a sequence of hidden states:

First part of the gene, middle of the gene, end of the gene.

The set of parameters Θ

1 (P_st) is the transition matrix for the hidden states.

2 Q_sk = P(Y = k | X = s) is the probability distribution for the observed chain Y given X.

3 Lastly, we need a vector µ to initiate the hidden chain X .
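As a concrete illustration of this parameter set, here is how Θ could be stored in R for the two-state gene/intergenic toy example (the object names and all numerical values are hypothetical, chosen only to show the shapes of the objects):

states  <- c("intergenic", "gene")
symbols <- c("A", "C", "G", "T")

P  <- matrix(c(0.999, 0.001,                 # transition matrix (P_st), rows sum to 1
               0.010, 0.990),
             2, 2, byrow = TRUE, dimnames = list(states, states))

Q  <- matrix(c(0.30, 0.20, 0.20, 0.30,       # emission matrix Q_sk = P(Y = k | X = s)
               0.20, 0.30, 0.30, 0.20),
             2, 4, byrow = TRUE, dimnames = list(states, symbols))

mu <- c(intergenic = 0.99, gene = 0.01)      # initial distribution of the hidden chain X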

Two related problems

1 At a given point i in the sequence, what is the most likely hidden state X_i?

2 What is the most likely hidden sequence (X_i)_{i=1}^n?

3 The first question relates to marginal probabilities and the second to the joint likelihood.


What we can compute at this stage

At this stage our tools are limited.

Given a sequence x = (x_1, . . . , x_n) we can compute

P(X = x, Y = y) = P(X = x) P(Y = y | X = x)

This is the full joint likelihood for (X, Y).

Why problem 1 is difficult

P(X_i = x_i | Y) = P(X_i = x_i, Y) / P(Y)
                 = P(X_i = x_i, Y) / Σ_{s=1,...,S} P(X_i = s, Y)

So the problem amounts to estimating P(X_i = s, Y).

A direct computation would sum over all possible sequences:

P(X_i = s, Y) = Σ_{x : x_i = s} P(X = x, Y)

With S hidden states we need to sum over S^n terms, which is not practical.

We need to be smarter.

We need to use the Markovian assumption

P(X_i = s, Y) = P(X_i = s) P(Y | X_i = s)
             = P(X_i = s) Σ_x P(Y, X = x | X_i = s)
             = P(X_i = s) Σ_{x_1^i} P(Y_1^i, X_1^i = x_1^i | X_i = s) × Σ_{x_{i+1}^n} P(Y_{i+1}^n, X_{i+1}^n = x_{i+1}^n | X_i = s)
             = P(X_i = s) P(Y_1^i | X_i = s) × P(Y_{i+1}^n | X_i = s)
             = P(Y_1^i, X_i = s) P(Y_{i+1}^n | X_i = s)
             = α_s(i) × β_s(i)

A new computation

We have shown that:

P(X_i = s | Y) = α_s(i) β_s(i) / Σ_{t=1}^S α_t(i) β_t(i)

where:
α_s(i) = P(Y_1^i, X_i = s)
β_s(i) = P(Y_{i+1}^n | X_i = s)

And it is actually possible to compute, recursively, the quantities α_s(i), β_s(i).

Two recursive computations

The (forward) recursion for α is:

α_s(i + 1) = P(Y_{i+1} | X_{i+1} = s) × Σ_{t=1}^S α_t(i) P_ts

The (backward) recursion for β is:

β_s(i − 1) = Σ_t P_st β_t(i) P(Y_i | X_i = t)

Proof for the first recursion

α_s(i + 1) = P(Y_1^{i+1}, X_{i+1} = s)
           = Σ_t P(Y_1^{i+1}, X_{i+1} = s | X_i = t) P(X_i = t)
           = Σ_t P(Y_1^{i+1} | X_{i+1} = s, X_i = t) P(X_{i+1} = s | X_i = t) P(X_i = t)
           = P(Y_{i+1} | X_{i+1} = s) Σ_t P_ts P(Y_1^i | X_i = t, X_{i+1} = s) P(X_i = t)
           = P(Y_{i+1} | X_{i+1} = s) Σ_t P_ts P(Y_1^i, X_i = t)
           = P(Y_{i+1} | X_{i+1} = s) Σ_t P_ts α_t(i)

A similar proof is used for the backward recursion.

Computational considerations

The algorithm requires storing n × S floats.

In terms of computation time, the requirements scale as S^2 × n.

Linearity in n is the key feature because it enables the analysis of very long DNA sequences.

Note that probabilities rapidly become vanishingly small.

Everything needs to be done at the log scale (be careful when implementing it).

Various R packages are available for hidden Markov chains (google it!).
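Since the slides insist on working at the log scale, here is a minimal log-space R sketch of the forward/backward recursions (the function and variable names are mine, not from the slides; P is the S × S transition matrix, Q the S × K emission matrix, mu the initial distribution, and y an observed sequence coded in 1..K, as in the hypothetical objects sketched earlier):

logsumexp <- function(v) { m <- max(v); m + log(sum(exp(v - m))) }

forward_backward <- function(y, P, Q, mu) {
  n <- length(y); S <- nrow(P)
  logP <- log(P); logQ <- log(Q)
  la <- matrix(-Inf, n, S)    # la[i, s] = log alpha_s(i) = log P(Y_1^i, X_i = s)
  lb <- matrix(0,    n, S)    # lb[i, s] = log beta_s(i)  = log P(Y_{i+1}^n | X_i = s)
  la[1, ] <- log(mu) + logQ[, y[1]]
  for (i in seq_len(n - 1)) {              # forward recursion
    for (s in 1:S)
      la[i + 1, s] <- logQ[s, y[i + 1]] + logsumexp(la[i, ] + logP[, s])
  }
  for (i in rev(seq_len(n - 1))) {         # backward recursion
    for (s in 1:S)
      lb[i, s] <- logsumexp(logP[s, ] + logQ[, y[i + 1]] + lb[i + 1, ])
  }
  lg   <- la + lb                          # log P(Y_1^n, X_i = s)
  post <- exp(lg - apply(lg, 1, logsumexp))  # marginal posteriors P(X_i = s | Y)
  list(log_alpha = la, log_beta = lb, posterior = post)
}

A call such as fb <- forward_backward(y, P, Q, mu) then gives the marginal posteriors of problem 1 in fb$posterior.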


Problem 2: Finding the most likely hidden sequence X̂

A different problem consists of finding the most likely hidden sequence X̂.

Indeed, the most likely X_i using the marginal distribution may be quite different from X̂_i.

An algorithm exists to achieve this maximisation and it is called the Viterbi algorithm.

The Viterbi algorithm

Define
V_s(i) = max_{x_1^i} P(Y_1^i, X_1^i = x_1^i | X_i = s)

Similarly to the previous problem, a forward recursion can be defined for V_s(i + 1) as a function of the V_t(i).

Following this forward computation, a reverse parsing (traceback) of the Markov chain can identify the most likely sequence.
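As a companion to the forward/backward sketch above, here is a minimal R sketch of this forward recursion plus traceback, using the same hypothetical (P, Q, mu) parameterisation (illustrative only, not a reference implementation):

viterbi <- function(y, P, Q, mu) {
  n <- length(y); S <- nrow(P)
  logP <- log(P); logQ <- log(Q)
  V   <- matrix(-Inf, n, S)   # V[i, s]: best log joint probability of a path ending in state s at position i
  ptr <- matrix(0L,   n, S)   # ptr[i, s]: argmax predecessor state
  V[1, ] <- log(mu) + logQ[, y[1]]
  for (i in seq_len(n - 1)) {
    for (s in 1:S) {
      cand <- V[i, ] + logP[, s]
      ptr[i + 1, s] <- which.max(cand)
      V[i + 1, s]   <- logQ[s, y[i + 1]] + max(cand)
    }
  }
  ## traceback: most likely hidden sequence X-hat
  xhat <- integer(n)
  xhat[n] <- which.max(V[n, ])
  for (i in rev(seq_len(n - 1))) xhat[i] <- ptr[i + 1, xhat[i + 1]]
  xhat
}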

An exercise

Here is a table that shows the probability of the data for three states (one state per row, 6 points in the chain). This matrix shows a log-likelihood of the data given the position in the chain and the hidden state (which can be either 1, 2 or 3).

State   1  2  3  4  5  6
1       1  3  4  3  5  4
2       2  1  5  8  5  1
3       4  2  2  4  1  5

Assume that remaining in the same state costs no log-likelihood, but transitioning from one state to another costs one unit of log-likelihood. The probability over the three states is uniform at the start of the chain. Compute

V_s(i) = max_{x_1^i} P(Y_1^i, X_1^i = x_1^i | X_i = s)

and estimate the most likely Viterbi path.

A few words about Andrew Viterbi

Andrew James Viterbi (born in Bergamo in 1935) is an Italian-American electrical engineer and businessman.

In addition to his academic work he co-founded Qualcomm.

Viterbi made a very large donation to the University of Southern California to name the school the Viterbi School of Engineering.

Computational considerations

Requirements are the same as before.

The algorithm requires storing n × S floats.

In terms of computation time, the requirements scale as S^2 × n.

Linearity in n is the key feature because it enables the analysis of very long DNA sequences.

Easy to code (in C or R, see example and R libraries).


Unknown parameters case

Often we do not know the distribution P(Y | X).

We may also not know the transition probabilities for the hidden Markov chain X.

If the parameters Θ are not known, how can we estimate them?

What if we knew X?

If we know X, the problem becomes straightforward.

For example a maximum likelihood estimate would be:

P(Y = k | X = s) = Σ_i 1{Y_i = k, X_i = s} / Σ_i 1{X_i = s}

More sophisticated (but still straightforward) versions of this could be used if Y was an nth-order Markov chain.
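If the hidden sequence x were observed alongside y, these maximum likelihood estimates reduce to simple cross-tabulations; a minimal R sketch (illustrative names, states coded 1..S and symbols 1..K):

estimate_known_x <- function(x, y, S, K) {
  n <- length(x)
  Q_hat <- table(factor(x, 1:S), factor(y, 1:K))           # counts of (state, symbol)
  Q_hat <- Q_hat / rowSums(Q_hat)                           # estimate of P(Y = k | X = s)
  P_hat <- table(factor(x[-n], 1:S), factor(x[-1], 1:S))    # counts of consecutive (X_i, X_{i+1})
  P_hat <- P_hat / rowSums(P_hat)                           # estimate of the transition matrix
  list(P = P_hat, Q = Q_hat)
}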

A typical missing data problem

In this missing data context, a widely used algorithm is the Expectation-Maximisation (EM) algorithm.

The EM algorithm is set up to find the parameters that maximise the likelihood of the observed data Y in the presence of missing data X.

At each step the likelihood is guaranteed to increase.

The algorithm can easily get stuck in a local maximum of the likelihood surface.

The basic idea of the EM

This is a general iterative algorithm with multiple applications.

It first computes the expected value of the log-likelihood given the current parameters (essentially imputing the hidden chain X):

Q(θ, θ_n) = E_{X | Y, θ_n} [ log L(X, Y; θ) ]

It then maximises the quantity Q(θ, θ_n) as a function of θ:

θ_{n+1} = argmax_θ Q(θ, θ_n)

EM in the context of HMM

P_st = Σ_i P(X_i = s, X_{i+1} = t | Y) / Σ_i P(X_i = s | Y)

Q_sk = Σ_i 1{Y_i = k} P(X_i = s | Y) / Σ_i P(X_i = s | Y)

The updated probabilities can be estimated using the sequences α_s, β_s estimated previously.

This special case of the EM for HMM is called the Baum-Welch algorithm.
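Putting these updates together with the earlier forward/backward sketch, one Baum-Welch iteration could look like the following R sketch (illustrative only; it reuses the hypothetical forward_backward() defined earlier and the same (P, Q, mu) conventions):

baum_welch_step <- function(y, P, Q, mu) {
  n <- length(y); S <- nrow(P); K <- ncol(Q)
  fb <- forward_backward(y, P, Q, mu)
  g  <- fb$posterior                        # g[i, s] = P(X_i = s | Y)
  xi_sum <- matrix(0, S, S)                 # sum over i of P(X_i = s, X_{i+1} = t | Y)
  for (i in seq_len(n - 1)) {
    ## P(X_i = s, X_{i+1} = t | Y) propto alpha_s(i) P_st Q[t, y_{i+1}] beta_t(i+1)
    lx <- log(P) + outer(fb$log_alpha[i, ],
                         log(Q[, y[i + 1]]) + fb$log_beta[i + 1, ], "+")
    xi <- exp(lx - max(lx)); xi_sum <- xi_sum + xi / sum(xi)
  }
  P_new <- xi_sum / rowSums(xi_sum)         # transition update P_st
  Q_new <- sapply(1:K, function(k) colSums(g[y == k, , drop = FALSE]))
  Q_new <- Q_new / rowSums(Q_new)           # emission update Q_sk
  list(P = P_new, Q = Q_new, mu = g[1, ])
}

Iterating baum_welch_step() until the parameters stabilise gives the EM estimate described above.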


Gene prediction

Zhang, Nat Rev Genetics, 2002

Some drawbacks of this approach

The number of hidden states can be very large.

Modelling codons takes three states, plus probably three states for the first and three states for the last codons.

So about nine states just for the exons.

One probably needs nine more states on the reverse strand.

Some alternatives exist (using semi-Markov models).


Copy number variant detection from SNP arrays

[Figure: scatter plot of SNP array signal intensities, Allele 1 vs Allele 2 (both axes from 0.0 to 1.5).]

Copy number variant detection from SNP arrays

Wang et al, Genome Research 2007


Stochastic EM (SEM)

The EM-Baum-Welch algorithm essentially uses the conditional distribution of X given Y.

Another way to compute this expectation is to use a Monte-Carlo approach, by simulating X given Y and taking an average.

This is a trade-off:

We of course do not retain the certainty that the likelihood is increasing (as provided by the EM).
However, the added randomness may avoid the pitfall of having the estimator stuck in a local maximum (a major issue with the EM).

Stochastic EM (SEM)

A simulation of X conditionally on Y would use the following decomposition:

P(X_1^N | Y_1^N) = P(X_1 | Y_1^N) P(X_2 | Y_1^N, X_1) · · · P(X_N | Y_1^N, X_1^{N−1})

This relies on being able to compute the marginal probabilities, but this is what Baum-Welch does.

Once the α, β have been computed, the simulation is linear in time and multiple sequences can be simulated rapidly.

How to simulate in practice

The simulation uses the equality:

P(X_{i+1} = t | Y, X_i = s) = P_st P(Y_{i+1} | X_{i+1} = t) P(Y_{i+2}^n | X_{i+1} = t) / P(Y_{i+1}^n | X_i = s)
                            = P_st P(Y_{i+1} | X_{i+1} = t) β_t(i + 1) / β_s(i)

Note that this is a forward-backward algorithm as well, but the forward step is built into the simulation step, unlike the traditional Baum-Welch.
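A minimal R sketch of this backward-recursion-then-forward-simulation step, assuming the log_beta matrix returned by the hypothetical forward_backward() sketch earlier (all names are illustrative):

simulate_hidden_chain <- function(y, P, Q, mu, log_beta) {
  n <- length(y); S <- nrow(P)
  x <- integer(n)
  lw   <- log(mu) + log(Q[, y[1]]) + log_beta[1, ]   # P(X_1 = s | Y) up to a constant
  x[1] <- sample.int(S, 1, prob = exp(lw - max(lw)))
  for (i in seq_len(n - 1)) {
    ## P(X_{i+1} = t | Y, X_i = s) propto P_st P(Y_{i+1} | X_{i+1} = t) beta_t(i+1)
    lw <- log(P[x[i], ]) + log(Q[, y[i + 1]]) + log_beta[i + 1, ]
    x[i + 1] <- sample.int(S, 1, prob = exp(lw - max(lw)))
  }
  x
}

Several independent calls give the multiple simulated sequences mentioned above.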

Estimation issues

Using a single simulated run for the hidden chain X is necessarily less efficient than relying on the expected probabilities.

The number of data points must be very large to make the estimation precise.

One could potentially take an average of multiple simulated runs.

With a sufficient number of simulations one actually gets very close to the EM.

Like with most practical estimation procedures, one has to find a good combination of tools, and there is not one answer.


Semi-Markov models (HSMM)

In the context of gene prediction, using three states per codon is not satisfying.

We would like something that takes into account groups of 3 bp jointly.

Semi-Markov models do exactly this.

When entering a state s, a random variable T_s is drawn for the duration in state s.
Then the emission probability for Y can be defined for the entire duration of the stay.
So codons are naturally defined by groups of 3 bp instead of dealing with multiple hidden states.
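To illustrate the idea, here is a small hypothetical R sketch that simulates the hidden state sequence of a semi-Markov chain with state-specific duration distributions, using a geometric-duration state and a fixed 3 bp, codon-like state (all names and numbers are made up for illustration):

simulate_hsmm_states <- function(n, P, rduration, mu) {
  x <- integer(0)
  s <- sample.int(nrow(P), 1, prob = mu)
  while (length(x) < n) {
    d <- rduration[[s]](1)                       # draw a duration T_s for state s
    x <- c(x, rep(s, d))                         # stay in state s for d positions
    s <- sample.int(nrow(P), 1, prob = P[s, ])   # then jump (self-transitions usually set to 0)
  }
  x[1:n]
}

## toy example: state 1 has geometric durations, state 2 lasts exactly 3 bp
P    <- matrix(c(0, 1,
                 1, 0), 2, 2, byrow = TRUE)
rdur <- list(function(k) rgeom(k, 0.1) + 1,
             function(k) rep(3L, k))
x    <- simulate_hsmm_states(200, P, rdur, mu = c(0.5, 0.5))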

Backward recursion for SEM applied to semi-Markov hidden chains

We are interested in computing the quantities:

For all n ∈ [1, N − 1] and all states i,  β_i(n) = P(Y_{n+1}^N | Y_1^n, X_n = i)

β_i(N) = 1

β_i(n) = P(Y_{n+1}^N | Y_1^n, X_n = i)
       = Σ_j Σ_{l < N−n} P_ij P(T_j = l) P(Y_{n+1}^{n+l} | X_{n+1}^{n+l} = j) β_j(n + l)

Note that the complexity is now in N S^2 × max(l), as opposed to N S^2 before.

Forward simulations for SEM

One can simulate a new hidden sequence recursively with the formula:

P(X_{n+1}^{n+l} = j | Y_1^N, X_n = i) = P_ij P(T_j = l) P(Y_{n+1}^{n+l} | X_{n+1}^{n+l} = j) β_j(n + l) / β_i(n)

This is very much analogous to the basic HMM situation, with the extra complication generated by the variable state length.

Estimation for semi-Markov models

It is possible to run a Viterbi algorithm using the same recursion derived for the Markovian case.

It is also possible to use a SEM algorithm to simulate the hidden sequence X and use it to estimate the parameters of the model.

A full EM is also possible but I never implemented it.

The computational requirements may become challenging but it all depends on the application.
