A Practical Introduction to Hidden Markov Models for Bioinformaticians
Thomas R. Ioerger, Department of Computer Science, Texas A&M University


Page 1:

A Practical Introduction to Hidden Markov Models for Bioinformaticians
Thomas R. Ioerger
Department of Computer Science
Texas A&M University

Page 2:

HMM's - What are They?

• useful for modeling protein/DNA sequence patterns
• probabilistic state-transition diagrams
• Markov processes - independence from history (the next state depends only on the current state)
• "hidden" states

[Figure: a two-state Markov chain over coin flips (states heads and tails, transition probabilities 0.4 and 0.6) generating a sequence such as HTHTTHTTTHTTHTHHHTHT..., and a four-state chain over the nucleotides A, C, G, T generating a sequence such as AACATGGTACATGTTAG...]

Transition probabilities (row = current base, column = next base):

       A     T     G     C
  A   0.2   0.35  0.15  0.3
  T   0.25  0.15  0.4   0.2
  G   0.25  0.25  0.25  0.25
  C   0.3   0.25  0.2   0.25

[Figure: a two-state HMM with hidden states intron and exon; each state emits A, C, G, T with its own distribution, and switches between the states have small transition probabilities (e.g. 0.09 and 0.06)]
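The following is a minimal Python sketch (not from the original slides) that generates a random DNA sequence by walking the transition table above; the choice of starting base is an arbitrary assumption.

```python
import random

# Transition probabilities from the table above
# (row = current base, column = next base).
TRANS = {
    "A": {"A": 0.2,  "T": 0.35, "G": 0.15, "C": 0.3},
    "T": {"A": 0.25, "T": 0.15, "G": 0.4,  "C": 0.2},
    "G": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "C": {"A": 0.3,  "T": 0.25, "G": 0.2,  "C": 0.25},
}

def generate(length, start="A"):
    """Walk the Markov chain, emitting one base per step."""
    seq = [start]
    for _ in range(length - 1):
        nxt = TRANS[seq[-1]]
        seq.append(random.choices(list(nxt), weights=list(nxt.values()))[0])
    return "".join(seq)

print(generate(17))  # a random 17-mer, e.g. something like AACATGGTACATGTTAG
```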

Page 3:

Historical Context

• voice compression, information theory
  – how to fit the full range of the human voice into 8 kb/s of multiplexed/wireless bandwidth?
  – encode/decode into a sequence of "codebook vectors"
• speech = string of phonemes
  – probability of what follows what...
• statistical sampling, convergence properties ("mixing")

[Example: the phonetic transcription "lets 'gO tu the 'pär-tE" ("let's go to the party"); the phoneme string "p är tE" could decode to PARTY, POTTY, or PATTY]

Page 4:

Essential References

• Rabiner paper:
  – Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257-286.
• Durbin book:
  – Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis. Cambridge University Press.
• Anders Krogh:
  – Ch. 4 in Computational Biology: Pattern Analysis and Machine Learning Methods (1998; Salzberg, Searls, and Kasif, eds.)
  – Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. (1994). Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology, 235:1501-1531.
• David Haussler (UCSC):
  – Sjolander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I.S., and Haussler, D. (1996). Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Computer Applications in the Biosciences, 12:327-345.

Page 5:

Basic Mathematics

• Markov chains: probability of a sequence S = a1 a2 ... an
  – chain rule (no assumptions): P(S) = P(a1) P(a2|a1) P(a3|a1 a2) ... P(an|a1 ... an-1)
  – first-order Markov assumption: P(S) = P(a1) P(a2|a1) P(a3|a2) ... P(an|an-1)
  – i.e. P(S) = P(a1) ∏i P(ai|ai-1)
• HMMs: the probability depends on the states passed through
  – if the path is known (states = s1, s2 ... sn):
    • P(S) = P(s1) P(a1|s1) P(s2|s1) P(a2|s2) P(s3|s2) ... P(sn|sn-1) P(an|sn)
  – if unknown, sum over all possible paths
    • see the Forward algorithm below
• important for determining how consistent (probable) an observed sequence is, given a model
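As a small illustration of the first-order formula above, here is a sketch that scores a DNA sequence against the transition table from the earlier slide; the uniform initial distribution is an assumption made for illustration.

```python
import math

def markov_log_prob(seq, init, trans):
    """log P(S) = log P(a1) + sum over i of log P(ai | ai-1)."""
    logp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp

# Transition table from the earlier slide; the uniform initial
# distribution is an assumption for illustration.
init = {b: 0.25 for b in "ATGC"}
trans = {
    "A": {"A": 0.2,  "T": 0.35, "G": 0.15, "C": 0.3},
    "T": {"A": 0.25, "T": 0.15, "G": 0.4,  "C": 0.2},
    "G": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "C": {"A": 0.3,  "T": 0.25, "G": 0.2,  "C": 0.25},
}
print(markov_log_prob("AACATGGTACATGTT", init, trans))
```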

Page 6:

Most probable path: Viterbi algorithm

• if the path of states is unknown, it can be recovered
• the path that gives the highest probability to the sequence
• define p(i,j) as the probability of the most probable path ending in state j after emitting element i
• compute recursively:
  – suppose you knew p(i-1,j) for all states up to the previous character
  – update: p(i,k) = P(ai|sk) * maxj [ p(i-1,j) * P(sk|sj) ]
• dynamic programming and "traceback"
  – keep a table of state probabilities: start with the 1st character, assign a probability to each state, and iterate the updates (a worked sketch follows the table)

Example table of path probabilities p(i,j):

        a1    a2    a3    a4
  s1   0.6   0.9   0.9   0.2
  s2   0.1   0.1   0.05  0.8
  s3   0.3   0     0.05  0
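Below is a minimal sketch of the Viterbi recursion and traceback just described. The two-state exon/intron model and all of its numbers are hypothetical; log space is used, which assumes no zero probabilities (see the Dirichlet-prior slide later).

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs, in log space.

    start_p[s] = P(s1 = s); trans_p[s][t] = P(t | s); emit_p[s][a] = P(a | s).
    Returns (log probability of the best path, the path itself).
    """
    # V[i][s] = log prob of the best path emitting obs[:i+1] and ending in s
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[i][s] = predecessor of s on that best path
    for i in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(((p, V[i - 1][p] + math.log(trans_p[p][s]))
                               for p in states), key=lambda x: x[1])
            V[i][s] = score + math.log(emit_p[s][obs[i]])
            back[i][s] = prev
    # traceback from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return V[-1][last], path[::-1]

# Hypothetical two-state exon/intron model:
states = ["exon", "intron"]
start = {"exon": 0.5, "intron": 0.5}
trans = {"exon":   {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.2, "intron": 0.8}}
emit = {"exon":   {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
print(viterbi("GGCATTTA", states, start, trans, emit))
```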

Page 7:

Forward Algorithm

• the probability of a sequence, summed over all possible paths
• similar to Viterbi, except replace 'max' with a probabilistic 'sum'
  – let q(i,j) be the total probability of seeing a1..ai and ending in state j
  – q(i+1,k) = P(ai+1|sk) * Σj [ q(i,j) * P(sk|sj) ]
• at the end of the computation, q(n,j) gives the total probability of having generated the whole sequence and ended in state j
• the total probability of the sequence given the model is Σj q(n,j)
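A sketch of the Forward recursion, using the same model conventions as the Viterbi sketch above. For long sequences the repeated products underflow floating point, so real implementations rescale each column or work in log space.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """P(obs | model), summed over all state paths (the Forward algorithm)."""
    # q[s] = P(a1..ai, ending in state s), updated column by column
    q = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for a in obs[1:]:
        q = {s: emit_p[s][a] * sum(q[p] * trans_p[p][s] for p in states)
             for s in states}
    return sum(q.values())  # total probability = sum of q(n,j) over states j
```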

Page 8:

Backward Algorithm

• estimates the probability of going through state j at time i (i.e. observation i generated by state j):

  P(si=j | a1..an) = [ P(a1..ai, si=j) * P(ai+1..an | si=j) ] / P(a1..an)

  – prefix probability P(a1..ai, si=j): comes from the Forward algorithm
  – suffix probability P(ai+1..an | si=j): work backwards from the last character, summing probabilities over states
  – denominator P(a1..an): the total probability, used to normalize
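A sketch of the backward recursion and the resulting posterior P(si=j | a1..an), combining it with the forward values as in the formula above (same dict-based model conventions as the earlier sketches).

```python
def backward(obs, states, trans_p, emit_p):
    """b[i][s] = P(a_{i+1}..a_n | state at position i is s)."""
    n = len(obs)
    b = [None] * n
    b[n - 1] = {s: 1.0 for s in states}  # nothing left to emit after a_n
    for i in range(n - 2, -1, -1):
        b[i] = {s: sum(trans_p[s][t] * emit_p[t][obs[i + 1]] * b[i + 1][t]
                       for t in states) for s in states}
    return b

def posteriors(obs, states, start_p, trans_p, emit_p):
    """P(state i = j | a1..an) = forward(i,j) * backward(i,j) / P(a1..an)."""
    n = len(obs)
    f = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for i in range(1, n):
        f.append({s: emit_p[s][obs[i]] *
                     sum(f[i - 1][p] * trans_p[p][s] for p in states)
                  for s in states})
    b = backward(obs, states, trans_p, emit_p)
    total = sum(f[n - 1].values())  # P(a1..an), the normalizer
    return [{s: f[i][s] * b[i][s] / total for s in states} for i in range(n)]
```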

Page 9:

Training HMMs

• easy if you know the states (e.g. from a multiple alignment)
  – just compute the probability distribution over elements at each site
• problem: dealing with unobservable state information
  – for raw sequences: which element came from which state?
• EM (expectation maximization, aka 'Baum-Welch'; sketched below)
  – uses the forward and backward algorithms
  – start with random transition probabilities
  – compute the most likely path and the state probabilities
    • use the best current guesses to estimate the hidden information
  – update the state probabilities based on that; iterate
  – converges on the transition probabilities that maximize the likelihood of the observed input (training) sequences
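Here is a compact, illustrative single Baum-Welch iteration built from the forward and backward recursions; it omits the scaling and pseudocounts a real implementation would need, so treat it as a sketch of the E- and M-steps rather than production code.

```python
from collections import defaultdict

def baum_welch_step(seqs, states, alphabet, start_p, trans_p, emit_p):
    """One EM (Baum-Welch) iteration over a set of training sequences.

    E-step: forward/backward posteriors give expected start, transition,
    and emission counts.  M-step: renormalize the counts into new parameters.
    """
    A = defaultdict(float)  # expected transition counts A[s, t]
    E = defaultdict(float)  # expected emission counts E[s, a]
    S = defaultdict(float)  # expected counts of starting in each state
    for obs in seqs:
        n = len(obs)
        # forward pass: f[i][s] = P(obs[:i+1], state i = s)
        f = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
        for i in range(1, n):
            f.append({s: emit_p[s][obs[i]] *
                         sum(f[i - 1][p] * trans_p[p][s] for p in states)
                      for s in states})
        # backward pass: b[i][s] = P(obs[i+1:] | state i = s)
        b = [None] * n
        b[n - 1] = {s: 1.0 for s in states}
        for i in range(n - 2, -1, -1):
            b[i] = {s: sum(trans_p[s][t] * emit_p[t][obs[i + 1]] * b[i + 1][t]
                           for t in states) for s in states}
        total = sum(f[n - 1].values())  # P(obs | current model)
        for i in range(n):
            for s in states:
                gamma = f[i][s] * b[i][s] / total  # P(state i = s | obs)
                E[s, obs[i]] += gamma
                if i == 0:
                    S[s] += gamma
                if i < n - 1:
                    for t in states:  # expected use of transition s -> t
                        A[s, t] += (f[i][s] * trans_p[s][t] *
                                    emit_p[t][obs[i + 1]] * b[i + 1][t] / total)
    # M-step: renormalize the expected counts into probabilities
    new_start = {s: S[s] / sum(S.values()) for s in states}
    new_trans = {s: {t: A[s, t] / sum(A[s, u] for u in states) for t in states}
                 for s in states}
    new_emit = {s: {a: E[s, a] / sum(E[s, c] for c in alphabet)
                    for a in alphabet} for s in states}
    return new_start, new_trans, new_emit
```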

Page 10:

[Figure: four aligned training sequences

ACTVHLLRKMP
ASTIHILRKMA
ACSVHILKKQP
ACIVHMLKKMP

with per-site emission distributions read off the columns: site 1: A 1.0; site 2: C .75, S .25; site 3: S .25, T .5, I .25; site 4: V .75, I .25; and a set of unaligned raw sequences of varying lengths, e.g. ACTTTSPVVHLLRKMP, ASTIHILKMA, MNFIYPQSACSVHILKKQP, CIVLKKMP]

Which letters came from which states? Let the model tell you (using the Backward algorithm, based on the current parameters).

Page 11:

Advanced Issues

• topology searching
  – similar to structure learning in Bayesian networks? (Chickering, Heckerman, Buntine...)
  – "duration", length (# of states), and exit probabilities
• higher-order HMMs (probabilities depend on a window of several recent states)
  – e.g. P(ai | ai-1, ai-2, ai-3)
• Dirichlet priors (a minimal pseudo-count sketch follows below)
  – pseudo-counts (because an observed count of 0 doesn't necessarily mean the true probability is 0, but Prob = 0 kills the whole computation)
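A minimal sketch of the pseudo-count idea: adding a uniform Dirichlet parameter alpha to the observed counts. (The Sjolander et al. paper cited earlier uses richer Dirichlet mixtures; a single uniform alpha is the simplest case.)

```python
def emission_probs(counts, alphabet, alpha=1.0):
    """Turn observed counts into probabilities with uniform pseudocounts.

    alpha > 0 keeps every probability nonzero, so a residue never seen
    in training cannot zero out an entire path probability.
    """
    total = sum(counts.get(a, 0) for a in alphabet) + alpha * len(alphabet)
    return {a: (counts.get(a, 0) + alpha) / total for a in alphabet}

# A column where only A and C were observed still gives every residue
# a small nonzero probability:
print(emission_probs({"A": 3, "C": 1}, "ACDEFGHIKLMNPQRSTVWY"))
```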

Page 12:

Applications I: Protein Families

• PFAM database
  – http://pfam.wustl.edu, http://sanger.ac.uk/Software/Pfam
  – models for each family are available
  – the files consist of transition probabilities
  – 3 states per "site": match, insert, delete
  – constructed from multiple sequences (unaligned)
  – can be used for searching (more sensitive than pairwise homology detection) or for alignment

[Figure: profile HMM architecture: a start state followed by match states for site 1 through site 5; each site has an associated insert state and delete state, and the match and insert states emit amino acids (A, C, D, ...)]

Page 13:

HMM Software

• HMMER (Sean Eddy; Washington University, St. Louis)
  – http://hmmer.wustl.edu
  – commands: hmmalign, hmmsearch
  – ls/fs model variants (glocal vs. fully local alignment), calibration
• other software: SAM
  – Hughey, R. and Krogh, A. (1996). CABIOS, 12:95-107.
  – http://www.cse.ucsc.edu/research/compbio/sam.html

Page 14:

Applications II: DNA patterns

• Splice junctions: GeneSplicer (Salzberg)
  – Pertea, M., Lin, X., and Salzberg, S. (2001). GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Research, 29(5):1185-1190.
• Gene finding: GLIMMER (Salzberg)
  – Salzberg, S., Delcher, A., Kasif, S., and White, O. (1998). Microbial gene identification using interpolated Markov models. Nucleic Acids Research, 26(2):544-548.
  – use HMMs to refine the classification of ORFs (e.g. by GC content)
  – build 1 model for each of the 6 reading frames (different nucleotide probabilities)
  – interpolated 5th-order Markov chain (depends on the preceding 5-mer); see the sketch after the figure below
  – use the combined scores to indicate consistency and help pick the right frame

[Figure: gene structure ...<exon>........GT[..........<intron>..........]AG.......<exon>..., with an HMM-donor model at the GT donor site, an HMM-acceptor model at the AG acceptor site, and HMM-coding models over the coding regions]
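A simplified sketch of interpolated Markov model scoring. The fixed interpolation weights are an assumption made for illustration; GLIMMER actually derives the weights from how often each context was seen in training.

```python
import math

def imm_log_score(seq, models, weights):
    """Score seq with an interpolated Markov model.

    models[k] maps a length-k context string to a dict of next-base
    probabilities (models[0][""] is the background composition);
    weights[k] is the interpolation weight of the order-k chain, and
    the weights are assumed to sum to 1.
    """
    order = len(weights) - 1
    logp = 0.0
    for i in range(order, len(seq)):  # skip the first `order` bases
        # blend the predictions of the order-0 .. order-k chains
        p = sum(w * models[k][seq[i - k:i]][seq[i]]
                for k, w in enumerate(weights))
        logp += math.log(p)
    return logp
```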

Page 15:

Applications III: Secondary Structure

• Goldman, N., Thorne, J.L., and Jones, D.T. (1996). Using evolutionary trees in protein secondary structure prediction and other comparative analyses. JMB, 263:196-208.
• old method: amino acid biases; probability based on a window of local residues (Chou-Fasman)
• new method: HMM
  – 3 models: helix, strand, coil (each has a 20x20 table with transition frequencies between neighboring residues, ai => ai+1); a toy per-class scoring sketch follows
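A toy sketch of the per-class scoring idea: score a segment under each class's 20x20 neighbor-transition table and pick the best. The table layout is assumed for illustration, and the full method also incorporates the evolutionary corrections described on the next slide.

```python
import math

def classify_segment(segment, class_tables):
    """Assign a segment to the class (helix/strand/coil) whose
    neighbor-transition table gives it the highest likelihood.

    class_tables[c][a][b] = P(next residue b | current residue a, class c);
    entries are assumed to be nonzero.
    """
    def log_lik(tbl):
        return sum(math.log(tbl[a][b]) for a, b in zip(segment, segment[1:]))
    return max(class_tables, key=lambda c: log_lik(class_tables[c]))
```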

Page 16:

• problem: amino acid replacement probabilities depend on evolutionary distance
  – given a set of training sequences, the probabilities depend on how similar/redundant the sequences are
  – not an independent sample (homology weighting?)
  – correct the probabilities for the distance between sequences (% homology), which determines the instantaneous substitution "rates"
• using the HMMs to predict secondary structure
  – input: a multiple alignment
  – step 1: estimate the phylogeny by maximum likelihood
  – step 2: scale the probability tables for PAM distance
  – step 3: estimate Prob(class|site i) by summing over Prob(class|site i-1), Prob(ai|class), and Prob(ai|ai-1), and over site i+1 (for consistency)
  – takes a forward and a backward pass to determine P(class|site i)

Page 17:

Applications IV: Protein Fold Recognition

• Bienkowska, Yu, Zarakhovich, Rogers, and Smith (2000). Protein fold recognition by total alignment probability. Proteins: Structure, Function, and Genetics, 40(3):451-462.
• an improved approach to threading/family 3D profiles
• model proteins as segments of secondary structure
  – build an HMM with a set of states for each segment (from DSSP assignments)
  – calculate the amino acid probabilities in each segment (like a 3D profile), also as a function of exposure
• P(model|seq) = P(seq|model) * P(model) / P(seq)
  – P(model) is based on the SCOP distribution
  – P(seq|model):
    • method 1: most probable path (Viterbi algorithm): 64% accuracy
    • method 2: sum over all paths (Forward algorithm): 90% accuracy

[Figure: an HMM over secondary-structure segments: strand, helix, loop, helix, loop, strand]

Page 18:

Applications V: Fold Classification by Support Vector Machine

• Jaakkola, T., Diekhans, M., and Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. Proceedings of the Conf. on Intelligent Systems for Molecular Biology, 149-158.
• Support Vector Machines (SVMs)
  – classification based on maximum-margin hyperplanes between positive and negative examples
  – efficient computation uses only pairwise distances

[Figure: globins vs. non-globin sequences separated by a maximum-margin hyperplane; but what space are the sequences embedded in, and what distance metric is used?]

Page 19:

• HMM distance between sequences: the Fisher kernel
  – re-represent each sequence by a vector of "sufficient statistics" (probabilities of internal parameters in the model)
  – e.g. how likely was state i to have been used by sequence A?
  – the Fisher score is the derivative (sensitivity) of the log-likelihood of a sequence with respect to the internal parameters θ:
    • U_X = ∇θ log P(X | H1, θ)
    • the gradient is a vector of derivatives, one for each parameter
  – dist(X,X') = exp[ -(U_X - U_X')ᵀ(U_X - U_X') / (2σ²) ]
    • like a dot-product or Euclidean distance between two vectors
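The distance formula transcribes almost directly into code; the Fisher score vectors U_X are assumed to be given (they come from differentiating the HMM log-likelihood with respect to its parameters).

```python
import numpy as np

def fisher_kernel(u_x, u_y, sigma=1.0):
    """Gaussian kernel over Fisher score vectors U_X = grad_theta log P(X|H1, theta).

    dist(X, X') = exp[-(U_X - U_X')^T (U_X - U_X') / (2 sigma^2)]
    """
    d = np.asarray(u_x, dtype=float) - np.asarray(u_y, dtype=float)
    return float(np.exp(-(d @ d) / (2.0 * sigma ** 2)))
```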

Page 20:

Example I: KaiA, N-terminal domain

• a protein involved in regulating circadian rhythms in cyanobacteria
• 284 amino acids: 2 domains
  – N-terminal domain solved by NMR (LiWang, Vakonakis)
  – whole molecule solved by crystallography (Sacchettini)
• N-terminal domain
  – no apparent homology to known proteins a priori
    • BLAST didn't reveal a family relationship; none > 15% identity
  – when the structure was solved, it was found to be an α/β structure similar to response receiver domains (CheY, FixJ, AmiR...)
    • chemotaxis, nitrogen fixation, amidase activity...
    • the relationship was detected by structural homology search (DALI)

Page 21:

N-terminal domain structure

[Figure: the KaiA N-terminal domain structure]

Page 22:

Using HMMER on KaiA

• 1. Search PFAM with the sequence:
  – returns the response_reg family
• 2. Download the HMM model for the family
• 3. Run the KaiA sequence against the model with hmmsearch to determine significance
• 4. Run the KaiA sequence, together with representative members of the family, through hmmalign to build a multiple alignment

Page 23:

Input: FASTA format

> KaiA, domain 1
MLSQIAICIWVESTAILQDCQRALSADRYQLQVCESG...
> CheY - 3CHY
ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAED...
> DrrD - 1KGS
NVRVLVVEDERDLADLITEALKKEFTVDVCYDGEEGY...
> NarL - 1RNL
EPATILLIDDHPMLRTGVKQLISMAPDITVVGEASNG...
> FixJ - 1D5W
MQDYTVHIVDDEEPVRKSLAFMLTMNGFAVKMHQSAE...

Page 24:

The response_reg HMM model (states and transition probabilities):

HMMER2.0 [2.3.1]
NAME  response_reg
ACC   PF00072
DESC  Response regulator receiver domain
LENG  125
ALPH  Amino
RF    no
CS    no
MAP   yes
COM   hmmbuild -F HMM_ls.ann SEED.ann
COM   hmmcalibrate --seed 0 HMM_ls.ann
NSEQ  54
DATE  Mon Jun 23 20:37:43 2003
CKSUM 3355
GA    -17.0 -17.0
TC    -17.0 -17.0
NC    -17.1 -17.1
XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT  -4 -8455
NULE  595 -1558 85 338 -294 453 -1158 197
EVD   -88.885818 0.177915
HMM        A      C      D      E      F      G      H      I   ...
     1   -346     57  -5567  -1768    445   -964   1156    928
     2  -1511  -4502   -477  -2328  -4823  -4003   1597  -1632
     3    501    259   -117    908   1124   -390   -625   4024
   ...

Page 25:

> hmmsearch respregls.hmm combined.fa

hmmsearch - search a sequence database with a profile HMM
HMMER 2.2g (August 2001)
Copyright (C) 1992-2001 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM file:                   respregls.hmm [response_reg]
Sequence database:          combined.fa
per-sequence score cutoff:  [none]
per-domain score cutoff:    [none]
per-sequence Eval cutoff:   <= 10
per-domain Eval cutoff:     [none]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query HMM: response_reg
Accession: PF00072
Description: Response regulator receiver domain
[HMM has been calibrated; E-values are empirical estimates]

Scores for complete sequences (score includes all domains):
Sequence         Description    Score    E-value   N
--------         -----------    -----    -------   ---
3CHY:            CheY           156.2    3e-47     1
KaiA, domain 1                  -51.3    0.0037    1
1amx                            -97.6    3         1

(E-values are significant if < 0.1)

Page 26:

> hmmalign respregls.hmm combined3.fa

DrrD    YAL--NEP.-FDVVILDI-LPV.....HDGWE.ILKS-RESGVNTPV---
CheY    LNKLQAGG.-YGFVISDWNMPN.....MDGLE.LLKTIRADGAMSALPVL
NarL    IELAESLD.-PDLILLDLNMPG.....MNGLE.TLDKLREKSLSGRI--V
FixJ    LAFAPDVR.-NGVLVT-LRMPD.....MSGVE.LLRNLGDLKINIPS--I
Etr1    LRVVSHEH.--KVVFMDVCMPGvenyqIA-LR.IHEKFTQRHQRPLL--V
AmiR    FDV----P.-VDVVFTSI-FQN.....RHHDE.IAALLAAGTPRTTL--V
KaiA    LEYAQTHRdQIDCLILVAANPS.....---FRaVVQQLCFEGVVVPA--I
#=GC RF xxxxxxxx.xxxxxxxxxxxxx.....xxxxx.xxxxxxxxxxxxxxxxx

DrrD    LLTALSDVEYRVKGLN-GADDYLPKPFDLRELIARVRALIRRkSeskstk
CheY    MVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKlGm.....
NarL    VFSVSNHEEDVVTALKRGADGYLLKDMEPEDLLKALHQAAAGeMvlseal
FixJ    VITGHGDVPMAVEAMKAGAVDFIEKPFEDTVIIEAIERASEHlV......
Etr1    ALSGNTDKSTKEKCMSFGLDGVLLKPVSLDNIRDVLSDLLEPrVlye...
AmiR    ALVEYESPAVLSQIIELECHGVITQPLDAHRVLPVLVSARRIsEemaklk
KaiA    VVGDRDP---AKEQLYHSAELHLGIH-QLEQLPYQVDAALAEfLrlapve
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.x......

Multiple alignment with response_reg family members; the #=GC RF line marks the HMM (match) states.

Page 27:

Summary of KaiA

• the weak homology could have been detected using HMMs (through a PFAM search)
  – the HMM model is a more general representation of the family
  – allows more sensitive searches
• HMMs allow pairwise alignment to other fold-family members
  – no reasonable global alignment could be constructed using Smith-Waterman
  – tried various gap parameters and similarity matrices
  – produced "random" gap placement
  – yet aligning CheY and KaiA to the model gives a meaningful pairwise alignment