
Hidden Markov Models

Bioinfo I (Institut Pasteur Montevideo) HMM -class 14- August 29th, 2011 1 / 65

Outline

CG-islands
The "Fair Bet Casino"
Hidden Markov Model
Decoding Algorithm
Forward-Backward Algorithm

HMM Overview

Machine learning approach used in bioinformatics
Presented with training data → derive important insights about the (hidden) parameters
Once the algorithm has been "trained" → insights can be applied to test data
More training data → higher accuracy of the algorithm

HMM Overview

Parameters learned during training → KNOWLEDGE
Application of those parameters to new data → the algorithm's use of that knowledge
HMM: learns unknown probabilistic parameters from training samples and uses these parameters in the framework of dynamic programming (or others) to find the best explanation for the experimental data

CpG-islands

Double-stranded DNA:

...ApCpCpApTpGpApTpGpCpApGpGpApCpTpTpCpCpApTpCpGpTpTpCpGpCp...
...TpGpGpTpApCpTpApCpGpTpCpCpTpGpApApGpGpTpApGpCpApApGpCpGp...

(the two strands are base-paired position by position)

Given 4 nucleotides, assume equal probability of occurrence ⇒ p(N) ≈ 1/4 for each N ∈ {A, G, C, T} ⇒ the probability of occurrence of any dinucleotide is ≈ 1/16.

However, the frequencies of dinucleotides in DNA sequences are not equal. In particular, p(CG) is typically < 1/16.

CpG-islands

CG is the least frequent dinucleotide, because the C in a CpG pair is oftenmodified by methylation (that is, an H-atom is replaced by a CH3-group)and the methyl-C has the tendency to mutate into T afterwards.

Upstream of a gene, however, the methylation process is suppressed in short regions of the genome of length 100-5000 bp. These areas are called CpG-islands, and they are characterized by the fact that we see more CpG-pairs in them than elsewhere.

So, finding the CpG-islands in a genome is an important problem.


CpG-islands

Definition (Gardiner-Garden & Frommer, 1987):
A CpG island is a DNA sequence of length about 200 bp with a C+G content of more than 50% and a ratio of observed-to-expected number of CpGs that is above 0.6.

According to a recent study¹, human chromosomes 21 and 22 contain about 1100 CpG-islands and about 750 genes.

¹ D. Takai & P. A. Jones, "Comprehensive analysis of CpG islands in human chromosomes 21 and 22", PNAS, March 19, 2002
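The definition can be read directly as a test on a fixed window. A minimal sketch (the function names are ours, not from the slides), estimating the expected number of CpGs in a window as (#C × #G)/length, a common form of the observed/expected ratio:

```python
def gc_content(seq):
    """Fraction of C and G nucleotides in the window."""
    return (seq.count("C") + seq.count("G")) / len(seq)

def obs_exp_cpg(seq):
    """Observed-to-expected CpG ratio: #CG / (#C * #G / length)."""
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") / (c * g / len(seq))

def is_cpg_island_window(seq):
    """Apply the two thresholds of the definition to one window."""
    return gc_content(seq) > 0.5 and obs_exp_cpg(seq) > 0.6
```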

Questions

1 Discrimination problem: given a short segment of genomic sequence, how can we decide whether this segment comes from a CpG-island or not?

2 Localisation problem: given a long segment of genomic sequence, how can we find all contained CpG-islands?

CpG islands and the casino dealer

The CpG-island problem is similar to the "dishonest casino problem":
An occasionally dishonest casino uses two kinds of coins, a fair one and a loaded one. The game is to flip coins with two possible outcomes only: Head or Tail.

Fair coin: p(H|F) = p(T|F) = 1/2
Biased coin: p(H|B) = 3/4, p(T|B) = 1/4

The occasionally dishonest casino dealer changes between the two coins with 10% probability.

(Recall P(A|B) denotes the probability of an event A given B.)

The casino problem

The dealer produces a sequence of coin tosses, using one of the two coins at each toss:

π = π1π2π3...πn, πi ∈ {F, B}

However, one observes only the sequence of outcomes of the tosses:

x = x1x2x3...xn, xi ∈ {H, T}

The fair bet casino problem

Goal: Given a sequence of coin tosses, determine when the dealer used a fair coin and when a biased coin.

Input: A sequence x = x1x2x3...xn of coin tosses made by two possible coins (F or B).

Output: A sequence π = π1π2π3...πn, with each πi being either F or B, indicating that xi is the result of tossing the Fair or Biased coin respectively.

Problem...

Fair Bet Casino Problem: any observed outcome of coin tosses could have been generated by any sequence of states!

π = FFFF...FF is a valid answer for every observed sequence of tosses, and so is π = BBBB...BB.

Problem...

Fair Bet Casino Problem: any observed outcome of coin tosses could have been generated by any sequence of states!

We need to incorporate a way to grade different state sequences differently.

P(x |fair coin) vs. P(x |biased coin)

Suppose first that the dealer never changes coins. Some definitions:

P(x|fair coin): probability of the dealer using the F coin and generating the outcome x
P(x|biased coin): probability of the dealer using the B coin and generating the outcome x

P(x |fair coin) vs. P(x |biased coin)

Assume x is the sequence of n tosses:

P(x|fair coin) = P(x1...xn|fair coin) = ∏_{i=1}^{n} p(xi|fair coin) = (1/2)^n

P(x|biased coin) = P(x1...xn|biased coin) = ∏_{i=1}^{n} p(xi|biased coin) = (3/4)^k (1/4)^{n−k} = 3^k / 4^n

where k is the number of Heads in x.

P(x |fair coin) vs. P(x |biased coin)

If P(x|fair coin) > P(x|biased coin), then the dealer most likely used a fair coin.
If P(x|biased coin) > P(x|fair coin), then the dealer most likely used a biased coin.

When are these probabilities equal?


P(x |fair coin) vs. P(x |biased coin)

P(x|fair coin) = P(x|biased coin)

(1/2)^n = 3^k / 4^n

2^n = 3^k

n = k log2 3, i.e. the two probabilities are equal when k = n / log2 3 ≈ 0.63 n.

Hence, when k < n / log2 3 the dealer most likely used a fair coin.

When k > n / log2 3 he most likely used a biased coin.
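The crossover point can be checked numerically. A small sketch: at k = n/log2(3) the two likelihoods coincide, since 3^k = 2^n there (k is treated as a real number for the check):

```python
import math

def p_fair(n):
    """P(x | fair coin) = (1/2)^n, independent of the outcomes."""
    return 0.5 ** n

def p_biased(n, k):
    """P(x | biased coin) = (3/4)^k * (1/4)^(n-k) for k heads."""
    return 0.75 ** k * 0.25 ** (n - k)

n = 30
k_star = n / math.log2(3)   # crossover point, roughly 0.63 * n
```

For k below k_star the fair coin is more likely, e.g. p_fair(30) > p_biased(30, 10).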

Log-odds Ratio

We define the log-odds ratio as follows:

log2( P(x|fair coin) / P(x|biased coin) ) = Σ_{i=1}^{n} log2( p+(xi) / p−(xi) ) = n − k log2 3

log-odds < 0: biased coin
log-odds > 0: fair coin
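The log-odds sum is easy to compute per symbol. A sketch (encoding heads as 1 and tails as 0, as the later slides do); the per-symbol ratios are (1/2)/(3/4) for heads and (1/2)/(1/4) for tails, so the total collapses to n − k·log2(3):

```python
import math

# per-symbol log-odds contributions: log2(p_fair(x_i) / p_biased(x_i))
LOG_RATIO = {1: math.log2((1 / 2) / (3 / 4)),   # heads
             0: math.log2((1 / 2) / (1 / 4))}   # tails

def log_odds(x):
    """log2(P(x|fair)/P(x|biased)) for a 0/1 toss sequence; equals n - k*log2(3)."""
    return sum(LOG_RATIO[xi] for xi in x)
```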

Computing Log-odds Ratio in Sliding Windows

We know the dealer switches coins, albeit rarely...

Consider a sliding window of the outcome sequence. Find the log-odds forthis short window.


Computing Log-odds Ratio in Sliding Windows

If the log-odds ratio of a window is negative, then the dealer most likely used a biased coin; if it is positive, a fair coin.
The same approach can be used to find CG-islands in long DNA sequences: calculate log-odds ratios for a sliding window of some particular length, and declare windows with a positive score as potential CG-islands.

Disadvantages:
- the length of a CG-island (or, for the casino, the window length) is not known in advance
- different windows may classify the same position differently

HMMs represent a different probabilistic approach.
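The sliding-window idea for the casino can be sketched in a few lines (function names are ours). Windows straddling a coin switch get ambiguous scores, which is exactly the second disadvantage noted above:

```python
import math

LOG_RATIO = {1: math.log2((1 / 2) / (3 / 4)), 0: math.log2((1 / 2) / (1 / 4))}

def window_scores(x, w):
    """Log-odds score of every length-w window of the toss sequence x."""
    return [sum(LOG_RATIO[xi] for xi in x[i:i + w])
            for i in range(len(x) - w + 1)]

def suspicious_windows(x, w):
    """Start positions of windows whose negative score suggests the biased coin."""
    return [i for i, s in enumerate(window_scores(x, w)) if s < 0]
```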


Hidden Markov Models (HMM)

Motivation: the localisation problem, i.e. how to detect CpG-islands inside a long sequence x = x[1, L), L ≫ 0?

One idea is to use the Markov chain models derived above and apply them by calculating the log-odds ratio for a window of a fixed width w ≪ L that is moved along the sequence; the score S(x[k, k + w)) is plotted for each starting position k (1 ≤ k ≤ L − w).

Problems:
- how to determine the boundaries of CpG-islands?
- which window size w should one choose?

Approach: Merge the two models Model+ (Fair, CpG) and Model− (Biased, not CpG) to obtain a so-called Hidden Markov Model.

Hidden Markov Models

An HMM can be viewed as an abstract machine with k hidden states that emits symbols from an alphabet Σ.
Each state has its own emission probability distribution, and the machine switches between states according to transition probabilities.
While in a certain state, the machine makes 2 decisions:

- What state should I move to next?
- What symbol - from the alphabet Σ - should I emit?

Why “Hidden"?

Observers can see the emitted symbols of an HMM but have no way of knowing which state the HMM is currently in.
Thus, the goal is to infer the most likely sequence of hidden states of an HMM based on the given sequence of emitted symbols.

Hidden Markov Models

Definition
A hidden Markov model (HMM) is a system M = (Σ, Q, P, e) consisting of

- an alphabet Σ, often also called the emission alphabet,
- a set of states Q,
- a matrix P = {pkl} of transition probabilities pkl for k, l ∈ Q, and
- an emission probability ek(b) for every k ∈ Q and b ∈ Σ.
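The four-tuple definition maps directly onto a small container. A sketch (class and attribute names are ours), instantiated with the Fair Bet Casino parameters used throughout these slides:

```python
class HMM:
    """A hidden Markov model M = (Sigma, Q, P, e) stored as plain dictionaries."""

    def __init__(self, alphabet, states, trans, emit):
        self.alphabet = alphabet    # Sigma
        self.states = states        # Q
        self.trans = trans          # trans[k][l] = p_kl
        self.emit = emit            # emit[k][b]  = e_k(b)

    def is_consistent(self):
        """Every state's outgoing transitions and emissions each sum to 1."""
        return all(abs(sum(self.trans[k].values()) - 1) < 1e-9 and
                   abs(sum(self.emit[k].values()) - 1) < 1e-9
                   for k in self.states)

# the Fair Bet Casino as an instance
casino = HMM(alphabet={0, 1}, states={"F", "B"},
             trans={"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}},
             emit={"F": {0: 0.5, 1: 0.5}, "B": {0: 0.25, 1: 0.75}})
```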

HMM Parameters: examples

Σ: set of emission characters

Σ = {H, T} for coin tossing
Σ = {1, 2, 3, 4, 5, 6} for dice tossing

Q: set of hidden states, each emitting symbols from Σ

Q = {F, B} for coin tossing

HMM Parameters: examples

P = (pkl): a |Q| × |Q| matrix of probabilities of changing from state k to state l

pFF = 0.9, pFB = 0.1
pBF = 0.1, pBB = 0.9

E = (ek(b)): a |Q| × |Σ| matrix of probabilities of emitting symbol b while being in state k

eF(H) = 1/2, eF(T) = 1/2
eB(H) = 3/4, eB(T) = 1/4

HMM for Fair Bet Casino

The Fair Bet Casino in HMM terms:

Σ = {0, 1} (0 for Tails and 1 for Heads)

Q = {F, B} - F for the Fair and B for the Biased coin

(figure: transition probabilities P and emission probabilities e)


HMM for Fair Bet Casino

Formally, the corresponding HMM for the casino is defined as M = (Σ, Q, P, e):

Σ = {0, 1}, 0: tails and 1: heads
Q = {F, B}, F: fair and B: biased
P transition probability matrix: pFF = pBB = 0.9, pFB = pBF = 0.1
e emission matrix: eF(0) = eF(1) = 1/2, eB(0) = 1/4, eB(1) = 3/4
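This model can be run forwards to generate data. A sketch of sampling a hidden path and an observation sequence from the casino HMM (the start state and the seed are our assumptions; the slides do not fix them):

```python
import random

TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT  = {"F": {0: 0.5, 1: 0.5},    "B": {0: 0.25, 1: 0.75}}

def sample(n, start="F", seed=0):
    """Generate a hidden path pi and an observed toss sequence x of length n."""
    rng = random.Random(seed)
    state, path, x = start, [], []
    for _ in range(n):
        path.append(state)
        # emit 1 (heads) with probability e_state(1), else 0 (tails)
        x.append(1 if rng.random() < EMIT[state][1] else 0)
        # move to the next state; with only two states one comparison suffices
        state = "B" if rng.random() < TRANS[state]["B"] else "F"
    return path, x
```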

HMM for CpG-islands

In our case of the localisation of CpG-islands, we want both models (CpG+ and CpG−) integrated in one, with a small probability of switching from one chain to the other at each transition point.
It might seem like we should model the DNA sequence with two states (CpG+, CpG−). However, for a DNA sequence the emission probability of a nucleotide depends on the previously emitted nucleotide, not only on the previous state (CpG island / not a CpG island), and hence we need more than two states.
We resolve this by relabeling the states such that we have the states A+, C+, G+ and T+, which emit A, C, G and T within CpG-regions, and likewise states A−, C−, G− and T−, which emit A, C, G and T within non-CpG-regions.

HMM for CpG-islands

The graph of transitions between states connects the four states A+, C+, G+, T+ with the four states A−, C−, G−, T−.

(Additionally, we have all transitions between states within either of the two sets that carry over from the two Markov chains Model+ and Model−.)

HMM for CpG-islands

Thus, we have
Σ = {A, G, C, T} for the alphabet,
Q = {A+, C+, G+, T+, A−, C−, G−, T−}, and possibly start and end states, for the set of states.
The transition probability between the + and − states is small.
Generally, the chance of switching from + to − is larger than vice versa.
In the case of the CpG-islands the emission probabilities are all 0 or 1, since each state emits exactly one fixed nucleotide.


Paths

Definition
A path π = (π1, π2, . . . , πL) is a sequence of states in the model M. Each state emits a symbol with a certain probability, thereby generating a sequence of symbols.

Generally, a sequence can be generated from an HMM as follows: first a begin state π1 is chosen according to the probability p0,π1. In that state a symbol is emitted according to the emission probability eπ1 for that state. We then move to the next state π2 with probability pπ1,π2, and so on. We then report the sequence of emitted symbols.

Hidden Paths

A path π = π1...πn in the HMM is defined as a sequence of states.
Consider the path π = FFFBBBBBFFF and the sequence x = 01011101001.

P(x |π) Calculation

P(x |π): Probability that sequence x was generated by the path π:

P(x |π) = P(π0 → π1)× Πni=1P(xi |πi )× P(πi → πi+1)

= pπ0,π1 × Πni=1eπi (xi )× pπi ,πi+1

= Πni=0eπi+1(xi+1)× pπi ,πi+1

if we count from i = 0 instead of i = 1

Bioinfo I (Institut Pasteur Montevideo) HMM -class 14- August 29th, 2011 37 / 65
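The product above is straightforward to evaluate for the casino example. A sketch; the uniform begin distribution START is our assumption (the slides leave p_{0,k} implicit), and no end state is used:

```python
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT  = {"F": {0: 0.5, 1: 0.5},    "B": {0: 0.25, 1: 0.75}}
START = {"F": 0.5, "B": 0.5}   # assumed begin probabilities p_{0,k}

def path_prob(x, pi):
    """p_{pi0,pi1} * prod_i e_{pi_i}(x_i) * p_{pi_i,pi_{i+1}} (no end state)."""
    p = START[pi[0]] * EMIT[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= TRANS[pi[i - 1]][pi[i]] * EMIT[pi[i]][x[i]]
    return p

# the example from the slides
x  = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1]
pi = list("FFFBBBBBFFF")
```

For a run of heads, the all-biased path gets a higher probability than the all-fair one, as expected.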


P(x |π) Calculation

Calculation of P(x|π) in the example:

(figure: the product of emission and transition probabilities written out for π = FFFBBBBBFFF and x = 01011101001)

P(x |π) Calculation

We assumed that we knew π and observed x.
However, we do not know π (only the dealer knows) → π is hidden.
If you only observe x = 01011101001, you might ask whether or not π = FFFBBBBBFFF is the "best" explanation for x.
If it's not the best, can we reconstruct the best one?

Decoding Problem

Goal: Find an optimal hidden path of states given observations.

Input: Sequence of observations x = x1...xn generated by an HMM M = (Σ, Q, P, e).

Output: A path π* that maximizes P(x|π) over all possible paths π.

Building Manhattan for Decoding Problem

Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem.
Every choice of π = π1...πn corresponds to a path in the graph.
The only valid direction in the graph is eastward.
This graph has |Q|²(n − 1) edges.

Edit Graph for Decoding Problem


Decoding Problem vs. Alignment Problem


Decoding Problem as Finding a Longest Path in a DAG

The Decoding Problem is reduced to finding a longest path in the directed acyclic graph (DAG) above.
Note: the length of a path is defined as the product of its edges' weights, not the sum.

Decoding Problem

Every path in the graph has probability P(x|π).

The Viterbi algorithm finds the path that maximizes P(x|π) among all possible paths.
The Viterbi algorithm runs in O(n|Q|²) time.

Decoding Problem: weights of edges

The weight of the edge from vertex (k, i) to vertex (l, i + 1) is pkl · el(xi+1), so that the product of the edge weights along a path equals P(x|π).


Decoding Problem and Dynamic Programming

vl,i+1 = max_{k∈Q} { vk,i · (weight of the edge between (k, i) and (l, i + 1)) }
       = max_{k∈Q} { vk,i · pkl · el(xi+1) }
       = el(xi+1) · max_{k∈Q} { vk,i · pkl }

vk,i is called the Viterbi variable.

Decoding Problem

Initialization:
vbegin,0 = 1
vk,0 = 0 for k ≠ begin

Let π* be the optimal path. Then
P(x|π*) = max_{k∈Q} { vk,n · pk,end }

Viterbi Algorithm

The value of the product can become extremely small, which leads to underflow.
To avoid underflow, use log values instead:

vl,i+1 = log el(xi+1) + max_{k∈Q} { vk,i + log(pkl) }
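The log-space recurrence can be sketched for the casino HMM as follows (the uniform begin distribution is our assumption; the slides leave it implicit, and no end state is used):

```python
import math

TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT  = {"F": {0: 0.5, 1: 0.5},    "B": {0: 0.25, 1: 0.75}}
START = {"F": 0.5, "B": 0.5}   # assumed uniform begin distribution

def viterbi(x):
    """Most probable state path for observations x, computed in log space."""
    states = list(TRANS)
    v = {k: math.log(START[k]) + math.log(EMIT[k][x[0]]) for k in states}
    back = []                       # back[i][l] = best predecessor of l at step i+1
    for xi in x[1:]:
        ptr, nv = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] + math.log(TRANS[k][l]))
            ptr[l] = best
            nv[l] = math.log(EMIT[l][xi]) + v[best] + math.log(TRANS[best][l])
        back.append(ptr)
        v = nv
    state = max(states, key=lambda k: v[k])   # best final state
    path = [state]
    for ptr in reversed(back):                # trace the pointers backwards
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

On a long run of heads the decoded path is all B, and on a run of tails all F, as one would expect.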

Viterbi example -weather-

High (H) and Low (L) pressure fronts are related to the weather.
High fronts tend to provoke sunny days.
Cloudy days are related to Low fronts.
Having seen the weather in the last three days, we want to know which pressure fronts generated the weather.
Weather in the last 3 days: Sunny-Cloudy-Sunny


Viterbi example -DP-

(figures: the Viterbi dynamic-programming table for the weather example, filled in column by column)

Forward-Backward Problem

Viterbi → given a sequence of coin tosses, what is the most probable path of states?

Given: a sequence of coin tosses generated by an HMM.

Goal: find the probability that the dealer was using a biased coin at a particular time.

Forward Algorithm

A blind hermit can only sense the seaweed state (dry, damp, soggy), but needs to know the weather (sunny, cloudy, rainy), i.e. the hidden states. The HMM M = (Σ, Q, P, e) is known.

Viterbi would calculate the most probable sequence of states that generates x = dry-damp-soggy.
The forward algorithm calculates the probability of being in state k at time i.

Forward Algorithm

P(x, πi = k) = P(observation | hidden state is k) × P(all paths to state k at time i)

For example: P(x, π2 = cloudy) = P(damp | cloudy) × P(all paths to state cloudy at time i = 2)

Forward Algorithm

Define fk,i (the forward probability) as the probability of emitting the prefix x1...xi and reaching the state πi = k, i.e. fk,i = P(x1...xi, πi = k).

The recurrence for the forward algorithm:

fk,i = ek(xi) · Σ_{l∈Q} fl,i−1 · plk

It is calculated like Viterbi, with dynamic programming; but instead of taking the maximum over the predecessor states, sum over them.
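The forward recurrence for the casino HMM can be sketched directly (again assuming a uniform begin distribution, which the slides leave implicit). Summing the last column gives P(x):

```python
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT  = {"F": {0: 0.5, 1: 0.5},    "B": {0: 0.25, 1: 0.75}}
START = {"F": 0.5, "B": 0.5}   # assumed uniform begin distribution

def forward(x):
    """Forward table: row i holds f_{k,i+1} = P(x_1..x_{i+1}, state k)."""
    f = [{k: START[k] * EMIT[k][x[0]] for k in TRANS}]
    for xi in x[1:]:
        prev = f[-1]
        # f_{k,i} = e_k(x_i) * sum_l f_{l,i-1} * p_lk
        f.append({k: EMIT[k][xi] * sum(prev[l] * TRANS[l][k] for l in TRANS)
                  for k in TRANS})
    return f

def seq_prob(x):
    """P(x): sum of the last forward column over all states."""
    return sum(forward(x)[-1].values())
```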

Backward-algorithm

Using the Viterbi algorithm we can search for the most probable path, i.e. the state sequence that most likely generated an observed sequence. But this does not tell us how probable each individual state is at a given position.

Definition
The backward variable is defined as the probability of generating (emitting) the suffix sequence (xi+1, . . . , xL) given that the system is in state πi = k:

bk(i) = P(xi+1 . . . xL | πi = k)

The backward algorithm uses a very similar recursion formula:

bk(i) = Σ_{l∈Q} el(xi+1) · bl(i + 1) · pkl

Backward-algorithm

Input: HMM M = (Σ, Q, P, e) and a sequence of symbols x
Output: probability P(x | M)

Initialization (i = L): bk(L) = pk,0 for all k (the transition probability to the end state; if the model has no end state, set bk(L) = 1).

Recursion: for all i = L − 1, . . . , 1 and k ∈ Q:
bk(i) = Σ_{l∈Q} el(xi+1) · bl(i + 1) · pkl

Termination: P(x | M) = Σ_{l∈Q} p0l · el(x1) · bl(1)
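A sketch of the backward recursion for the casino HMM, again under our assumption of a uniform begin distribution and no explicit end state (so bk(L) = 1). The termination formula must give the same P(x) as the forward algorithm:

```python
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT  = {"F": {0: 0.5, 1: 0.5},    "B": {0: 0.25, 1: 0.75}}
START = {"F": 0.5, "B": 0.5}   # assumed uniform begin distribution

def backward(x):
    """Backward table: row i-1 holds b_k(i) = P(x_{i+1}..x_L | state k at i)."""
    b = [{k: 1.0 for k in TRANS}]       # no end state, so b_k(L) = 1
    for i in range(len(x) - 1, 0, -1):  # 0-based x[i] is x_{i+1} in 1-based indexing
        nxt = b[0]
        b.insert(0, {k: sum(EMIT[l][x[i]] * nxt[l] * TRANS[k][l] for l in TRANS)
                     for k in TRANS})
    return b

def seq_prob_backward(x):
    """Termination: P(x) = sum_l p_{0l} * e_l(x_1) * b_l(1)."""
    first = backward(x)[0]
    return sum(START[l] * EMIT[l][x[0]] * first[l] for l in TRANS)
```

Combined with the forward variables, fk(i) · bk(i) / P(x) gives the posterior probability of being in state k at position i, which answers the forward-backward question posed above.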

Comparison of the three variables

Viterbi vk(i): probability with which the most probable state path generates the sequence of symbols (x1, x2, . . . , xi) and the system is in state k at time i.

Forward fk(i): probability that the prefix sequence of symbols x1, . . . , xi is generated and the system is in state k at time i.

Backward bk(i): probability that the system, starting in state k at time i, then generates the sequence of symbols xi+1, . . . , xL.
