CSCE555 Bioinformatics CSCE555 Bioinformatics Lecture 6 Hidden Markov Models Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu . HAPPY CHINESE NEW YEAR



Roadmap

Probabilistic Models of Sequences
Introduction to HMM
Profile HMMs as MSA models
Measuring Similarity between Sequence and HMM Profile model
Summary

Multiple Sequence Alignment

Alignment containing multiple DNA / protein sequences
Look for conserved regions → similar function
Example:

#Rat     ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse   ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit  ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human   ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Opossum ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog    ---ATGGGTTTGACAGCACATGATCGT---CAGCT

Probabilistic Model: Position-specific scoring matrices (PSSM)

Difficulty in biological sequences

Variation in a family of sequences
◦ Gaps of variable lengths
◦ Segments conserved to different degrees
◦ PSSM cannot handle variable-length gaps
◦ Need a statistical sequence model

Regular Expressions Model

Regular expressions
◦ Protein spelling is much freer than English spelling
◦ [AT] [CG] [AC] [ACGT]* A [TG] [GC]
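The pattern on this slide can be tried directly with Python's standard `re` module. The sketch below is only an illustration: the function name `matches_motif` and the three test strings are made up for this example, and the spaces in the slide's pattern are treated as purely visual.

```python
import re

# The slide's pattern written as a Python regular expression.
# Spaces in the slide are assumed to be for readability only.
MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

def matches_motif(seq: str) -> bool:
    """Return True if the whole sequence can be generated by the pattern."""
    return MOTIF.fullmatch(seq) is not None

print(matches_motif("ACAATG"))      # shortest form: [ACGT]* matches nothing
print(matches_motif("TGCTTTTATC"))  # [ACGT]* absorbs the run "TTTT"
print(matches_motif("GCAATG"))      # fails: first base G is not in [AT]
```

A regular expression only answers yes or no. It cannot say that one matching sequence is more typical of the family than another, which is the gap a statistical sequence model fills.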


Hidden Markov Model (HMM)

HMM is:
◦ Statistical model
◦ Well suited for many tasks in molecular biology
Using HMM in molecular biology
◦ Probabilistic profile (profile HMM)
    Built from a family of proteins, for searching a database for other members of the family
    Resembles the profile and weight matrix methods
◦ Grammatical structure
    Gene finding
    Recognize signals
    Prediction (must follow the rules of a gene)

Detect Cheating in Coin Toss Game

Fair and biased coins could be used
Question: is it possible to determine whether a biased coin has been used, based on the output sequence of Heads and Tails?

HTTTHTHTHTTHHHHTHTHTHTHHHHTHT

Example: Fair Coin Toss

Consider the single coin scenario
We could model the process producing the sequence of H’s and T’s as a Markov model with two states and equal transition probabilities:

[Figure: two states, H and T; each of the four transitions (H→H, H→T, T→H, T→T) has probability 0.5]

Only one fair coin is used here
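As a rough illustration of the single-coin model, the sketch below scores a Head/Tail string under a two-state Markov chain whose transition probabilities are all 0.5, as on the slide. The uniform starting distribution and the names `START`, `TRANSITIONS`, and `markov_log2_prob` are assumptions made for this example.

```python
from math import log2

# Transition probabilities from the slide: every move between H and T is 0.5.
TRANSITIONS = {
    "H": {"H": 0.5, "T": 0.5},
    "T": {"H": 0.5, "T": 0.5},
}
# Assumption: the first toss is H or T with equal probability.
START = {"H": 0.5, "T": 0.5}

def markov_log2_prob(tosses: str) -> float:
    """Log2 probability of a toss sequence under the fair-coin Markov chain."""
    if not tosses:
        return 0.0
    logp = log2(START[tosses[0]])
    for prev, cur in zip(tosses, tosses[1:]):
        logp += log2(TRANSITIONS[prev][cur])
    return logp

observed = "HTTTHTHTHTTHHHHTHTHTHTHHHHTHT"
print(len(observed), markov_log2_prob(observed))
```

Under this model every string of n tosses has the same probability, 0.5 to the power n, so the fair-coin model alone cannot single out a suspicious sequence. It becomes useful once it is compared against a model that also allows a biased coin.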

Example: Fair and Biased Coins

Consider the scenario where there are two coins: a fair coin and a biased coin
Visible states do not correspond to hidden states
◦ Visible state: output of H or T
◦ Hidden state: which coin was tossed

HTTTHTHTHTTHHHHTHTHTHTHHHHTHT
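To make the split between hidden and visible states concrete, here is a small simulation sketch of the two-coin game. All numbers in it (`SWITCH_PROB`, `P_HEADS`) and the function name `toss_game` are hypothetical, since the slides do not give the bias or the switching rate.

```python
import random

# Hypothetical parameters, chosen only for illustration.
COINS = ("fair", "biased")
SWITCH_PROB = 0.1                        # chance of swapping coins after a toss
P_HEADS = {"fair": 0.5, "biased": 0.9}   # chance of H for each coin

def toss_game(n, seed=0):
    """Simulate n tosses; return (hidden coin list, visible H/T string)."""
    rng = random.Random(seed)
    coin = rng.choice(COINS)             # assumption: either coin may start
    hidden, visible = [], []
    for _ in range(n):
        hidden.append(coin)
        visible.append("H" if rng.random() < P_HEADS[coin] else "T")
        if rng.random() < SWITCH_PROB:
            coin = "biased" if coin == "fair" else "fair"
    return hidden, "".join(visible)

hidden, visible = toss_game(29)
print(visible)                           # what an observer sees
print("".join(c[0] for c in hidden))     # f/b labels the observer never sees
```

The observer only receives the first printed line. Recovering the second line from the first is the decoding question that the hidden Markov model is meant to answer.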

Hidden Markov Models

Ingredients of an HMM

Collection of states: $\{S_1, S_2, \ldots, S_N\}$
State transition probabilities (transition matrix): $A_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)$
Initial state distribution: $\pi_i = P(q_1 = S_i)$
Observations: $\{O_1, O_2, \ldots, O_M\}$
Observation probabilities: $B_j(k) = P(v_t = O_k \mid q_t = S_j)$
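One possible way to hold these ingredients in code is sketched below. The class name `HMM`, its field names, and the two-coin numbers are assumptions for illustration, not part of the lecture. Note that the dictionaries are indexed as `transition[current][next]`, so each inner dictionary is one probability distribution regardless of how the matrix subscripts are ordered on the slide.

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list        # S_1 ... S_N
    observations: list  # O_1 ... O_M
    initial: dict       # initial[s] = P(q_1 = s)
    transition: dict    # transition[cur][nxt] = P(q_{t+1} = nxt | q_t = cur)
    emission: dict      # emission[s][o] = P(v_t = o | q_t = s)

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError unless every distribution sums to 1."""
        def check(name, dist, keys):
            total = sum(dist[k] for k in keys)
            if abs(total - 1.0) > tol:
                raise ValueError(f"{name} sums to {total}, expected 1")

        check("initial", self.initial, self.states)
        for s in self.states:
            check(f"transition[{s}]", self.transition[s], self.states)
            check(f"emission[{s}]", self.emission[s], self.observations)

# Hypothetical two-coin model; the numbers are illustrative only.
coin_hmm = HMM(
    states=["fair", "biased"],
    observations=["H", "T"],
    initial={"fair": 0.5, "biased": 0.5},
    transition={"fair": {"fair": 0.9, "biased": 0.1},
                "biased": {"fair": 0.1, "biased": 0.9}},
    emission={"fair": {"H": 0.5, "T": 0.5},
              "biased": {"H": 0.9, "T": 0.1}},
)
coin_hmm.validate()  # passes silently when all rows are proper distributions
```

Checking that each distribution sums to 1 is a cheap guard against transcription mistakes in hand-entered matrices.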

Ingredients of Our HMM

States: $\{S_{sunny}, S_{rainy}, S_{snowy}\}$
State transition probabilities (transition matrix):
$A = \begin{pmatrix} .08 & .15 & .05 \\ .38 & .6 & .02 \\ .75 & .05 & .2 \end{pmatrix}$
Initial state distribution: $\pi = (.7 \;\; .25 \;\; .05)$
Observations: $\{O_1, O_2, \ldots, O_M\}$
Observation probabilities (emission matrix):
$B = \begin{pmatrix} .08 & .15 & .05 \\ .38 & .6 & .02 \\ .75 & .05 & .2 \end{pmatrix}$

Probability of a Sequence of Events

$P(O) = P(O_{gloves}, O_{gloves}, O_{umbrella}, \ldots, O_{umbrella})$
$= \sum_{\text{all } Q} P(O \mid Q)\,P(Q) = \sum_{q_1,\ldots,q_7} P(O \mid q_1,\ldots,q_7)\,P(q_1,\ldots,q_7)$
$= 0.7 \times 0.86 \times 0.32 \times 0.14 \times 0.6 + \ldots$
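The sum over all state paths can be written out literally for a small model, and compared with the forward recursion that computes the same quantity without enumerating paths. The sketch below uses a hypothetical two-coin model rather than the weather model, and the names `prob_by_enumeration` and `prob_by_forward` are made up for this example. Both functions assume a non-empty observation string.

```python
from itertools import product

# Hypothetical two-state model; the numbers are illustrative only.
STATES = ("fair", "biased")
INIT = {"fair": 0.5, "biased": 0.5}
TRANS = {"fair": {"fair": 0.9, "biased": 0.1},
         "biased": {"fair": 0.1, "biased": 0.9}}
EMIT = {"fair": {"H": 0.5, "T": 0.5},
        "biased": {"H": 0.9, "T": 0.1}}

def prob_by_enumeration(obs):
    """P(O) as the sum over every state path Q of P(O | Q) P(Q)."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = INIT[path[0]] * EMIT[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= TRANS[path[t - 1]][path[t]] * EMIT[path[t]][obs[t]]
        total += p
    return total

def prob_by_forward(obs):
    """Same quantity via the forward recursion, without listing paths."""
    alpha = {s: INIT[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {s: EMIT[s][o] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())

obs = "HHTHHHH"  # seven observations, matching the seven on the slide
print(prob_by_enumeration(obs), prob_by_forward(obs))
```

With N states and T observations the enumeration visits N^T paths, while the forward recursion needs on the order of N^2 T operations, which is why the latter is used in practice.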

Typical HMM Problems

Annotation: Given a model M and an observed string S, what is the most probable path through M generating S?
Classification: Given a model M and an observed string S, what is the total probability of S under M?
Consensus: Given a model M, what is the string having the highest probability under M?
Training: Given a set of strings and a model structure, find transition and emission probabilities assigning high probabilities to the strings
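The annotation problem is usually solved with the Viterbi algorithm, a dynamic program that keeps only the best path into each state at each step. The sketch below is a minimal version on a hypothetical two-coin model; the parameter values and the names `viterbi` and `path_prob` are assumptions for illustration, and a non-empty observation string is assumed.

```python
from itertools import product

# Hypothetical two-state model; the numbers are illustrative only.
STATES = ("fair", "biased")
INIT = {"fair": 0.5, "biased": 0.5}
TRANS = {"fair": {"fair": 0.9, "biased": 0.1},
         "biased": {"fair": 0.1, "biased": 0.9}}
EMIT = {"fair": {"H": 0.5, "T": 0.5},
        "biased": {"H": 0.9, "T": 0.1}}

def path_prob(path, obs):
    """Joint probability P(O, Q) of one state path and the observations."""
    p = INIT[path[0]] * EMIT[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= TRANS[path[t - 1]][path[t]] * EMIT[path[t]][obs[t]]
    return p

def viterbi(obs):
    """Return (most probable state path, its joint probability)."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {s: (INIT[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for o in obs[1:]:
        new_best = {}
        for s in STATES:
            prob, path = max(
                ((best[r][0] * TRANS[r][s], best[r][1]) for r in STATES),
                key=lambda pair: pair[0],
            )
            new_best[s] = (prob * EMIT[s][o], path + [s])
        best = new_best
    prob, path = max(best.values(), key=lambda pair: pair[0])
    return path, prob

obs = "HTTTHTHTHTTHHHHTHTHTHTHHHHTHT"
path, prob = viterbi(obs)
print("".join(s[0] for s in path))  # f = fair, b = biased
```

For real sequences the products underflow quickly, so implementations normally work with log probabilities and add instead of multiply.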


HMM Profiles as Sequence Models

Given the multiple alignment of sequences, we can use an HMM to model the sequences
Each column of the alignment may be represented by a hidden state that produced that column
Insertions and deletions may be represented by other states

Profile HMMs

HMM with a structure that in a natural way allows position-dependent gap penalties
◦ Main states model the columns of the alignment
◦ Insert states model highly variable regions
◦ Delete states jump over one or more columns, i.e. they model the situation when just a few of the sequences have a “-” in the multiple alignment at a position

HMM Sequences Continued

Profile HMM Example

Consider the six sequences shown below
A multiple sequence alignment of these sequences is the first step toward inducing the hidden Markov model

SEQ1 G C C C A
SEQ2 A G C
SEQ3 A A G C
SEQ4 A G A A
SEQ5 A A A C
SEQ6 A G C

Profile HMM Topology

The topology of the HMM is established using the consensus sequence
The structure of a profile HMM is shown below:
◦ Square boxes represent match states
◦ Diamonds represent insert states
◦ Circles represent delete states

[Figure: profile HMM topology with match, insert, and delete states]

Profile HMM Example Continued

The aligned columns correspond either to emissions from the match state or to emissions from the insert state
The consensus columns are used to define the match states M1, M2, M3 for the HMM
After defining the match states, the corresponding insert and delete states are used to define the complete HMM topology

Transition Probabilities

The values of the transition probabilities are computed using the frequency of the transitions as each sequence is considered
The model parameters are computed using the state transition sequences shown in the figure below:

[Figure: state transition sequences]

Transition Probabilities Continued

The frequency of each of the transitions and the corresponding emission probabilities are shown below

Transition   State 0   State 1   State 2   State 3
M→M          4         5         6         4
M→D          1         0         0         -
M→I          1         0         0         2
I→M          1         0         0         2
I→D          0         0         0         -
I→I          0         0         0         2
D→M          -         1         0         0
D→D          -         0         0         -
D→I          -         0         0         0
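The transition counts listed on this slide can be turned into probabilities by dividing each count by the total number of transitions leaving the same state. The sketch below does this under the assumption of plain maximum-likelihood estimation without pseudocounts; the names `COUNTS` and `transition_probs` are made up, and the slides' own worked figure is not available here to compare against.

```python
# Transition counts from the slide, indexed by position 0..3.
# None marks a "-" entry (a transition that does not exist at that position).
COUNTS = {
    "MM": [4, 5, 6, 4], "MD": [1, 0, 0, None], "MI": [1, 0, 0, 2],
    "IM": [1, 0, 0, 2], "ID": [0, 0, 0, None], "II": [0, 0, 0, 2],
    "DM": [None, 1, 0, 0], "DD": [None, 0, 0, None], "DI": [None, 0, 0, 0],
}

def transition_probs(position: int) -> dict:
    """Normalize counts so transitions leaving each state type sum to 1."""
    probs = {}
    for source in "MID":
        outgoing = {key: row[position] for key, row in COUNTS.items()
                    if key[0] == source and row[position] is not None}
        total = sum(outgoing.values())
        for key, count in outgoing.items():
            # Assumption: with no observed transitions the estimate is
            # left undefined (None) rather than guessed.
            probs[key] = count / total if total else None
    return probs

for k in range(4):
    print(k, transition_probs(k))
```

Because many counts are zero, practical profile HMM builders usually add pseudocounts before normalizing so that unseen transitions keep a small nonzero probability.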

Emission Probabilities

The emission probability is computed using the formula:

[Formula not reproduced in the transcript]

The emission probability specifies the probability of emitting each of the symbols of the alphabet Σ in state k
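The formula itself is not reproduced in this transcript. A common choice, assumed here rather than taken from the slide, is the maximum-likelihood estimate $e_k(a) = c_k(a) / \sum_{a'} c_k(a')$, where $c_k(a)$ is the number of times symbol $a$ is emitted in state $k$. The sketch below implements that estimate with an optional pseudocount; the function name `emission_probs` and the example column are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"

def emission_probs(column: str, pseudocount: float = 0.0) -> dict:
    """Estimate e_k(a) from the residues aligned to one match state.

    Gap characters ("-") are skipped, on the assumption that a gap in a
    match column is handled by a delete state and is not an emission.
    """
    counts = Counter(c for c in column if c != "-")
    total = sum(counts[a] + pseudocount for a in ALPHABET)
    if total == 0:
        raise ValueError("no residues to estimate from")
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}

# Hypothetical alignment column, not taken from the slides.
column = "AAGA-A"
print(emission_probs(column))                   # plain frequencies
print(emission_probs(column, pseudocount=1.0))  # add-one smoothing
```

The pseudocount version avoids assigning probability zero to symbols that simply did not appear in a small training alignment.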

Emission Probabilities Continued

The emission probability for each state is computed as shown below:

[Figure: emission probability computation for each state]

Searching the Profile HMM

Sequences can be searched against the HMM to detect whether or not they belong to the family of sequences described by the profile HMM
Using a global alignment, the probability of the most probable alignment of a sequence to the model can be determined using the Viterbi algorithm
The full probability of a sequence aligning to the profile HMM is determined using the forward algorithm

How Well Does a Sequence Fit a Model?

◦ Probability depends on the length of the sequence
◦ Not suitable to use as a score

Length-independent Score

Log-odds score
◦ The logarithm of the probability of the sequence under the model divided by its probability according to a null model
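Written out, the log-odds score of a sequence $x$ is $\log \frac{P(x \mid M)}{P(x \mid \text{null})}$. The sketch below computes it in bits, assuming a null model that emits each symbol independently from fixed background frequencies. The function name `log_odds_score`, the uniform DNA background, and the example probability are assumptions for illustration.

```python
from math import log2

def log_odds_score(log2_p_model: float, seq: str, background: dict) -> float:
    """log2 [ P(seq | model) / P(seq | null) ] for an i.i.d. null model.

    log2_p_model is the log2 probability of seq under the HMM, for example
    from the forward or Viterbi algorithm.
    """
    log2_p_null = sum(log2(background[c]) for c in seq)
    return log2_p_model - log2_p_null

# Assumed background: all four bases equally likely.
UNIFORM_DNA = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical model probability of 1e-3 for a six-base sequence.
score = log_odds_score(log2(1e-3), "ACGTAC", UNIFORM_DNA)
print(score)  # positive means the model explains the sequence better
```

Both probabilities shrink as the sequence gets longer, so taking their ratio cancels most of the length effect that makes the raw probability unsuitable as a score.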

Length-independent Score

HMM using log-odds

Summary

HMM
How to build a profile HMM model
Scoring the fit between a sequence and an HMM model

Next Lecture

Gene-finding
Reading:
◦ Textbook (CG) chapter 4
◦ Textbook (EB) chapter 8