Hidden Markov Models (1)
Brief review of discrete time finite Markov Chain
Hidden Markov Model
Examples of HMM in Bioinformatics
Estimations
Basic Local Alignment Search Tool (BLAST)
The strategy
Important parameters
Example
Variations
BLAST
“Basic Local Alignment Search Tool”
Given a sequence, how do we find its ‘relatives’ in a database of millions of sequences?
An exhaustive search using pairwise alignment is computationally infeasible.
BLAST is a heuristic method that finds good, but not necessarily optimal, matches. It is fast.
Basis: good alignments should contain “words” that match each other very well (identical or with very high alignment score).
BLAST
Pandit et al. In Silico Biology 4, 0047
BLAST
The scoring matrix. Example: BLOSUM62
The expected score of two random sequences aligned to each other is negative. High scores of alignments between two random sequences follow the Extreme Value Distribution.
$\sum_{i,j} p_i\, p_j\, S_{ij} < 0$, where the $p_i$ are background residue frequencies and the $S_{ij}$ are substitution scores.
BLAST
The general idea of the original BLAST:
Find matches of words in the database. This is done efficiently using a pre-processed database and a hashing technique (see the sketch below).
Extend each match to find longer local alignments.
Report results exceeding a predefined threshold.
http://www.cbi.pku.edu.cn/images/blast_algorithm.gif
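To make the first step concrete, here is a minimal Python sketch of the word-index idea; the sequences and function names are invented for illustration, and unlike real BLAST it only handles exact word matches rather than scoring neighborhood words against the threshold t.

```python
from collections import defaultdict

def build_word_index(database_seq, w=3):
    """Pre-processing: map every length-w word to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(database_seq) - w + 1):
        index[database_seq[i:i + w]].append(i)
    return index

def find_seed_hits(query, index, w=3):
    """Return (query_pos, db_pos) pairs where a query word matches exactly."""
    hits = []
    for j in range(len(query) - w + 1):
        for i in index.get(query[j:j + w], []):
            hits.append((j, i))
    return hits

index = build_word_index("MKVLAAGMKVLA", w=3)
print(find_seed_hits("KVLA", index))   # word hits, to be extended in step 2
```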
BLAST
Two key parameters:
w: word size
Normally a word size of 3~5 is used for proteins; a word size of 3 yields $20^3 = 8000$ possible words. A word size of 11~12 is used for nucleotides.
t: threshold of word matching score
Higher t → fewer word matches → lower sensitivity / higher specificity.
BLAST
Threshold for extension
Threshold for reporting a significant hit
BLAST
How significant is a hit?
S: the raw alignment score.
S′: the bit score, a normalized score. Bit scores from different alignments or different scoring matrices are comparable.
$S' = (\lambda S - \ln K)/\ln 2$, where $K$ and $\lambda$ are constants adjusting for the scoring matrix.
E: the expected value, the number of chance alignments with scores equivalent to or better than S′.
$E = mN \cdot 2^{-S'}$
m: query size; N: size of the database.
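As a worked sketch of these two formulas, with made-up numbers ($\lambda$ and K below are only placeholders of roughly the magnitude used for ungapped BLOSUM62 searches; the real values come from the scoring system):

```python
import math

lam, K = 0.318, 0.134     # placeholder constants for the scoring matrix
m, N = 250, 5_000_000     # made-up query length and database size
S = 60                    # raw alignment score

S_bits = (lam * S - math.log(K)) / math.log(2)   # S' = (lambda*S - ln K)/ln 2
E = m * N * 2 ** (-S_bits)                       # E = m * N * 2^(-S')
print(f"bit score S' = {S_bits:.1f}, E-value = {E:.2e}")
```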
BLAST
Blasting a sample sequence.
BLAST
The “two-hit” method.
Two sequences with good similarity should share more than one word.
Matches are extended only when two hits occur on the same diagonal within a certain distance.
This reduces sensitivity, so it is paired with a lower per-word threshold t (see the sketch below).
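A minimal sketch of the diagonal bookkeeping behind the two-hit rule; the function and the window size of 40 are hypothetical, and hits are (query position, database position) pairs like those from the seeding sketch earlier.

```python
from collections import defaultdict

def two_hit_seeds(hits, window=40):
    """Keep only hit pairs on the same diagonal within `window` of each other."""
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[d - q].append(q)            # diagonal = db_pos - query_pos
    pairs = []
    for diag, positions in sorted(by_diag.items()):
        positions.sort()
        for a, b in zip(positions, positions[1:]):
            if b - a <= window:
                pairs.append((diag, a, b))  # worth extending
    return pairs

print(two_hit_seeds([(0, 100), (10, 110), (5, 400)]))   # [(100, 0, 10)]
```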
BLAST
There are many variants with different specialties, for example:
Gapped BLAST
Allows the detection of a gapped alignment.
PSI-BLAST
“Position Specific Iterated BLAST”
Detects weak relationships.
Constructs a Position Specific Score Matrix automatically.
Discrete time finite Markov Chain
Possible states: a finite discrete set S = {E1, E2, …, Es}
From time t to t+1, the process makes a stochastic movement from one state to another.
The Markov Property:
If the process is at Ej at time t, then the probability that it is at Ek at time t+1 depends only on Ej.
The temporally homogeneous transition probabilities property:
Prob(Ej → Ek) is independent of time t.
Discrete time finite Markov Chain
The transition probability matrix:
Pij=Prob(Ei → Ej)
An N-step transition: $P_{ij}(N) = \mathrm{Prob}(E_i \to \cdots \to E_j \text{ in } N \text{ steps})$
$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1s} \\ p_{21} & p_{22} & \cdots & p_{2s} \\ \vdots & \vdots & \ddots & \vdots \\ p_{s1} & p_{s2} & \cdots & p_{ss} \end{pmatrix}$
It can be shown that $P^{(N)} = P^N$.
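As a quick numerical illustration of this result, a minimal numpy sketch; the two-state matrix P is a made-up example (the same transition numbers reappear in the HMM example later in these slides):

```python
import numpy as np

# N-step transition probabilities are just the N-th matrix power of P.
P = np.array([[0.9, 0.1],
              [0.8, 0.2]])

P5 = np.linalg.matrix_power(P, 5)   # P^(5), the 5-step transition matrix
print(P5)
print(P5.sum(axis=1))               # each row still sums to 1
```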
Discrete time finite Markov Chain
Consider the two-step transition probability from Ei to Ej:
$p_{ij}^{(2)} = \sum_k p_{ik}\, p_{kj}$
This is the ij-th element of $P^2$. So the two-step transition matrix is $P^{(2)} = P^2$.
Extending this argument, we have $P^{(N)} = P^N$.
Absorbing states: $p_{ii} = 1$. Once the chain enters such a state, it stays there. We do not consider absorbing states here.
Discrete time finite Markov Chain
The stationary state:
The probability of being in each state stays constant over time.
$\pi = (\pi_1, \pi_2, \pi_3, \ldots, \pi_s)$ is the stationary state: $\pi_j = \sum_k \pi_k\, p_{kj}$ for all j (i.e. $\pi P = \pi$), with $\sum_i \pi_i = 1$.
For finite, aperiodic, irreducible Markov chains, $\pi$ exists and is unique.
Periodic: a state can only be returned to at times $t_0, 2t_0, 3t_0, \ldots$, with $t_0 > 1$.
Irreducible: any state can eventually be reached from any state.
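A minimal numerical sketch, assuming the same made-up two-state P as before: the stationary distribution is the left eigenvector of P for eigenvalue 1, normalized to sum to 1.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.8, 0.2]])

# pi solves pi = pi P, i.e. pi is a left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()                  # normalize to a probability vector
print(pi)                           # [8/9, 1/9] for this P
print(pi @ P)                       # unchanged: pi is stationary
```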
Discrete time finite Markov Chain
As $n \to \infty$, $P^{(n)} = P^n$ converges, and each row of $P^n$ tends to the stationary distribution $\pi$.
Note:
The Markov chain is an elegant model for many situations in bioinformatics, e.g. evolutionary processes and sequence analysis.
BUT the two Markov assumptions may not hold true.
Hidden Markov Model
More general than a Markov model.
A discrete time Markov Model with extra features.
[Diagram: a Markov chain over states E1, E2, E3, with an initiation step into the chain and emissions O1, O2 produced from the states.]
Hidden Markov Model
The emissions are probabilistic, state-dependent and time-independent.
Example:
[Diagram: two states E1 and E2. Transitions: E1→E1 = 0.9, E1→E2 = 0.1, E2→E1 = 0.8, E2→E2 = 0.2. Emissions: E1 emits A with 0.5 and B with 0.5; E2 emits A with 0.25 and B with 0.75.]
We don’t observe the states of the Markov Chain, but we observe the emissions.
If both E1 and E2 have the same chance to initiate the chain, and we observe sequence “BBB”, what is the most likely state sequence that the chain went through?
Hidden Markov Model
Let Q denote the state sequence, and O denote the observed emission. Our goal is to find:
$\hat{Q} = \arg\max_Q P(Q \mid O)$
$P(Q \mid O) = \dfrac{P(O \mid Q)\, P(Q)}{P(O)} \propto P(O \mid Q)\, P(Q)$
O = “BBB”. For each path Q, $P(O \mid Q)\, P(Q)$ is:
E1→E1→E1: 0.5*0.9*0.9*0.5*0.5*0.5 = 0.050625
E1→E1→E2: 0.5*0.9*0.1*0.5*0.5*0.75 = 0.0084375
E1→E2→E1: 0.5*0.1*0.8*0.5*0.75*0.5 = 0.0075
E1→E2→E2: 0.5*0.1*0.2*0.5*0.75*0.75 = 0.0028125
E2→E1→E1: 0.5*0.8*0.9*0.75*0.5*0.5 = 0.0675
E2→E1→E2: 0.5*0.8*0.1*0.75*0.5*0.75 = 0.01125
E2→E2→E1: 0.5*0.2*0.8*0.75*0.75*0.5 = 0.0225
E2→E2→E2: 0.5*0.2*0.2*0.75*0.75*0.75 = 0.0084375
The maximum, 0.0675, is attained by E2→E1→E1, so this is the most likely state sequence.
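The same enumeration can be done by brute force in a few lines of Python; the dictionaries simply encode the toy chain from the diagram above:

```python
from itertools import product

init  = {"E1": 0.5, "E2": 0.5}
trans = {("E1", "E1"): 0.9, ("E1", "E2"): 0.1,
         ("E2", "E1"): 0.8, ("E2", "E2"): 0.2}
emit  = {"E1": {"A": 0.5, "B": 0.5}, "E2": {"A": 0.25, "B": 0.75}}

def joint(Q, O):
    """P(O|Q) P(Q): initial probability, then transition*emission at each step."""
    p = init[Q[0]] * emit[Q[0]][O[0]]
    for t in range(1, len(O)):
        p *= trans[Q[t - 1], Q[t]] * emit[Q[t]][O[t]]
    return p

O = "BBB"
for Q in product(["E1", "E2"], repeat=len(O)):
    print(Q, joint(Q, O))   # the maximum, 0.0675, is at ('E2', 'E1', 'E1')
```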
Hidden Markov Model
Parameters: $\lambda = \{\pi, P, B\}$
$\pi$: the initial distribution
$P$: the transition probabilities
$B$: the emission probabilities
Common questions:
How to efficiently calculate the probability of the emissions: $P(O \mid \lambda)$
How to find the most likely hidden state sequence: $\arg\max_Q P(Q \mid O, \lambda)$
How to find the most likely parameters: $\arg\max_{\lambda} P(O \mid \lambda)$
Hidden Markov Models in Bioinformatics
A generative model for multiple sequence alignment.
Hidden Markov Models in Bioinformatics
Data segmentation in array-based copy number analysis.
R package aCGH in Bioconductor.
Hidden Markov Models in Bioinformatics
[Diagram: a chain over the hidden states Amplified, Normal, and Deleted.]
An over-simplified model.
The Forward and Backward Algorithm
$P(O \mid \lambda) = \sum_Q P(O \mid Q, \lambda)\, P(Q \mid \lambda)$
If there are a total of S states and we study an emission from T steps of the chain, then the number of all possible Q is $S^T$.
When either S or T is big, direct calculation is impossible.
Let $O^{(t)} = (O_1, O_2, \ldots, O_t)$ denote the emissions up to step t,
let $q_t$ denote the hidden state at time t,
and let $b_i$ denote the emission probabilities from state $E_i$.
Consider the “forward variables”:
$\alpha_t(i) = P(O^{(t)}, q_t = E_i)$
(emissions up to step t, chain at state i at step t)
The Forward and Backward Algorithm
At step 1: $\alpha_1(i) = \pi_i\, b_i(O_1)$
At step t+1: $\alpha_{t+1}(i) = \left[ \sum_{k=1}^{S} \alpha_t(k)\, p_{ki} \right] b_i(O_{t+1})$
… … …
By induction, we know all $\alpha_T(i)$.
Then P(O) can be found by $P(O) = \sum_{i=1}^{S} \alpha_T(i)$.
A total of $2TS^2$ computations are needed.
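A sketch of this recursion in numpy for the toy two-state chain (indices 0 and 1 stand for E1 and E2; symbols 0 and 1 stand for A and B):

```python
import numpy as np

init = np.array([0.5, 0.5])
P    = np.array([[0.9, 0.1],
                 [0.8, 0.2]])
b    = np.array([[0.5, 0.5],       # E1: P(A), P(B)
                 [0.25, 0.75]])    # E2: P(A), P(B)

def forward(obs):
    """alpha[t, i] = P(O_1..O_{t+1}, q_{t+1} = E_{i+1}), with 0-based rows."""
    alpha = np.zeros((len(obs), len(init)))
    alpha[0] = init * b[:, obs[0]]                    # step 1
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ P) * b[:, obs[t]]  # step t+1
    return alpha

alpha = forward([1, 1, 1])     # "BBB"
print(alpha)                   # rows are alpha_1, alpha_2, alpha_3
print(alpha[-1].sum())         # P("BBB") = 0.1790625
```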
The Forward and Backward Algorithm
Back to our simple example, find P(“BBB”):
$\alpha_1(1) = 0.5 \times 0.5 = 0.25$
$\alpha_1(2) = 0.5 \times 0.75 = 0.375$
$\alpha_2(1) = [\alpha_1(1) \times 0.9 + \alpha_1(2) \times 0.8] \times 0.5 = 0.2625$
$\alpha_2(2) = [\alpha_1(1) \times 0.1 + \alpha_1(2) \times 0.2] \times 0.75 = 0.075$
$\alpha_3(1) = [\alpha_2(1) \times 0.9 + \alpha_2(2) \times 0.8] \times 0.5 = 0.148125$
$\alpha_3(2) = [\alpha_2(1) \times 0.1 + \alpha_2(2) \times 0.2] \times 0.75 = 0.0309375$
$P(\text{“BBB”}) = \alpha_3(1) + \alpha_3(2) = 0.1790625$
Now, what saved us the computing time? The reuse of shared components. And what allowed us to do that?
The short memory of the Markov Chain.
The Forward and Backward Algorithm
The backward algorithm:
Define the “backward variables”:
$\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T \mid q_t = E_i)$
(emissions after step t, given the chain at state i at step t)
At step T: $\beta_T(i) = 1$ for all i.
From T-1 down to 1, we can iteratively calculate:
$\beta_t(i) = \sum_{k=1}^{S} p_{ik}\, b_k(O_{t+1})\, \beta_{t+1}(k)$
The Forward and Backward Algorithm
Back to our simple example, observing “BBB”, we have:
$\beta_3(1) = 1$
$\beta_3(2) = 1$
$\beta_2(1) = P(O_3 = \text{“B”} \mid q_2 = E_1) = p_{11}\, b_1(\text{“B”})\, \beta_3(1) + p_{12}\, b_2(\text{“B”})\, \beta_3(2) = 0.9 \times 0.5 \times 1 + 0.1 \times 0.75 \times 1 = 0.525$
… …
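A matching numpy sketch of the backward recursion under the same toy parameters; the printed value beta[1, 0] corresponds to $\beta_2(1) = 0.525$ above.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.8, 0.2]])
b = np.array([[0.5, 0.5],
              [0.25, 0.75]])

def backward(obs):
    """beta[t, i] = P(O_{t+2}..O_T | q_{t+1} = E_{i+1}), with 0-based rows."""
    beta = np.ones((len(obs), P.shape[0]))            # beta_T(i) = 1
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = P @ (b[:, obs[t + 1]] * beta[t + 1])
    return beta

beta = backward([1, 1, 1])     # "BBB"
print(beta)                    # beta[1, 0] is beta_2(1) = 0.525
```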
Posterior state probabilities
$P(O, q_t = E_i)$
$= P(O_1, O_2, \ldots, O_t, q_t = E_i)\; P(O_{t+1}, \ldots, O_T \mid O_1, O_2, \ldots, O_t, q_t = E_i)$
$= P(O^{(t)}, q_t = E_i)\; P(O_{t+1}, \ldots, O_T \mid q_t = E_i)$ (by the Markov property)
$= \alpha_t(i)\, \beta_t(i)$
With $P(O) = \sum_{i=1}^{S} \alpha_T(i)$, the posterior is
$P(q_t = E_i \mid O) = \dfrac{P(O, q_t = E_i)}{P(O)} = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{S} \alpha_T(i)}$
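Putting the two recursions together for the toy chain gives the posterior state probabilities; a compact self-contained sketch:

```python
import numpy as np

init = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1], [0.8, 0.2]])
b = np.array([[0.5, 0.5], [0.25, 0.75]])

obs = [1, 1, 1]                                     # "BBB"
T, S = len(obs), len(init)
alpha, beta = np.zeros((T, S)), np.ones((T, S))
alpha[0] = init * b[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ P) * b[:, obs[t]]    # forward pass
for t in range(T - 2, -1, -1):
    beta[t] = P @ (b[:, obs[t + 1]] * beta[t + 1])  # backward pass

gamma = alpha * beta / alpha[-1].sum()              # P(q_t = E_i | O)
print(gamma)                                        # each row sums to 1
```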
The Viterbi Algorithm
To find: $\arg\max_Q P(Q \mid O)$
Since $\max_Q P(Q \mid O) = \dfrac{\max_Q P(Q, O)}{P(O)}$ and $P(O)$ does not depend on Q,
$\arg\max_Q P(Q \mid O) = \arg\max_Q P(Q, O)$
So, to maximize the conditional probability, we can simply maximize the joint probability.
Define:
$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1, q_2, \ldots, q_{t-1}, q_t = E_i, O^{(t)})$
The Viterbi Algorithm
Then our goal becomes:
$\max_Q P(Q, O) = \max_i \delta_T(i)$
Initiation: $\delta_1(i) = \pi_i\, b_i(O_1)$
Induction: $\delta_t(i) = \left[ \max_k \delta_{t-1}(k)\, p_{ki} \right] b_i(O_t)$
So, this problem is solved by dynamic programming.
The Viterbi Algorithm
Time: 1 2 3
States: E1 0.5*0.5 0.15 0.0675
E2 0.5*0.75 0.05625 0.01125
The calculation runs from the left border to the right border, while in sequence alignment it runs from the upper-left corner toward the right and lower borders.
Again, why is this efficient calculation possible? The short memory of the Markov Chain!
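A numpy sketch of the Viterbi recursion with back-pointers for the toy chain; the delta table it prints matches the one above, and the traced path is E2 → E1 → E1:

```python
import numpy as np

init = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1], [0.8, 0.2]])
b = np.array([[0.5, 0.5], [0.25, 0.75]])

def viterbi(obs):
    T, S = len(obs), len(init)
    delta = np.zeros((T, S))                 # delta[t, i]: best joint prob
    psi = np.zeros((T, S), dtype=int)        # back-pointers (argmax states)
    delta[0] = init * b[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * P   # scores[k, i] = delta(k) * p_ki
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * b[:, obs[t]]
    path = [int(delta[-1].argmax())]         # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return delta, path[::-1]

delta, path = viterbi([1, 1, 1])             # "BBB"
print(delta)                                 # matches the table above
print([f"E{i + 1}" for i in path])           # ['E2', 'E1', 'E1']
```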
The Estimation of parameters
When the topology of an HMM is known, how do we find the initial distribution, the transition probabilities and the emission probabilities, from a set of emitted sequences?
Difficulties:
Usually the number of parameters is not small. Even the simple chain in our example has 5 parameters.
Normally the landscape of the likelihood function is bumpy. Finding the global optimum is hard.
$\hat{\lambda} = \arg\max_{\lambda} P(O^{(1)}, \ldots, O^{(d)} \mid \lambda)$, where $\{O^{(1)}, \ldots, O^{(d)}\}$ is the set of observed emission sequences.
The Estimation of parameters
The general idea of the Baum-Welch method:
(1) Start with some initial values.
(2) Re-estimate the parameters.
- The previously mentioned forward and backward probabilities are used.
- It is guaranteed that the likelihood of the data improves with each re-estimation.
(3) Repeat (2) until a local optimum is reached.
We will discuss the details next time.
To be continued