A Practical Introduction to Hidden Markov Models for Bioinformaticians
Thomas R. Ioerger, Department of Computer Science, Texas A&M University


Page 1:

A Practical Introduction to Hidden Markov Models for Bioinformaticians
Thomas R. Ioerger
Department of Computer Science
Texas A&M University

Page 2:

HMM's - What are They?

• useful for modeling protein/DNA sequence patterns
• probabilistic state-transition diagrams
• Markov processes - independence from history (the next state depends only on the current state)
• "hidden" states

[Figure: a two-state Markov chain over coin flips (states heads and tails, transition probabilities 0.4 and 0.6) generating a sequence such as HTHTTHTTTHTTHTHHHTHT..., and a four-state chain over the nucleotides A, C, G, T generating a sequence such as AACATGGTACATGTTAG...]

Transition probabilities (row = current base, column = next base):

       A     T     G     C
  A   0.2   0.35  0.15  0.3
  T   0.25  0.15  0.4   0.2
  G   0.25  0.25  0.25  0.25
  C   0.3   0.25  0.2   0.25

[Figure: a two-state HMM with hidden states intron and exon; each state emits A, C, G, T with its own distribution, and switches between the states have small transition probabilities (e.g. 0.09 and 0.06)]
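The following is a minimal Python sketch (not from the original slides) that generates a random DNA sequence by walking the transition table above; the choice of starting base is an arbitrary assumption.

```python
import random

# Transition probabilities from the table above
# (row = current base, column = next base).
TRANS = {
    "A": {"A": 0.2,  "T": 0.35, "G": 0.15, "C": 0.3},
    "T": {"A": 0.25, "T": 0.15, "G": 0.4,  "C": 0.2},
    "G": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "C": {"A": 0.3,  "T": 0.25, "G": 0.2,  "C": 0.25},
}

def generate(length, start="A"):
    """Walk the Markov chain, emitting one base per step."""
    seq = [start]
    for _ in range(length - 1):
        nxt = TRANS[seq[-1]]
        seq.append(random.choices(list(nxt), weights=list(nxt.values()))[0])
    return "".join(seq)

print(generate(17))  # a random 17-mer, e.g. something like AACATGGTACATGTTAG
```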

Page 3:

Historical Context

• voice compression, information theory
  – how to fit the full range of the human voice into 8 kb/s of multiplexed/wireless bandwidth?
  – encode/decode into a sequence of "codebook vectors"
• speech = string of phonemes
  – probability of what follows what...
• statistical sampling, convergence properties ("mixing")

[Example: the phonetic transcription "lets 'gO tu the 'pär-tE" ("let's go to the party"); the phoneme string "p är tE" could decode to PARTY, POTTY, or PATTY]

Page 4:

Essential References

• Rabiner paper:
  – Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257-286.
• Durbin book:
  – Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis. Cambridge University Press.
• Anders Krogh:
  – Ch. 4 in Computational Biology: Pattern Analysis and Machine Learning Methods (1998; Salzberg, Searls, and Kasif, eds.)
  – Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. (1994). Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology, 235:1501-1531.
• David Haussler (UCSC):
  – Sjolander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I.S., and Haussler, D. (1996). Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Computer Applications in the Biosciences, 12:327-345.

Page 5:

Basic Mathematics

• Markov chains: probability of a sequence S = a1 a2 ... an
  – chain rule (no assumptions): P(S) = P(a1) P(a2|a1) P(a3|a1 a2) ... P(an|a1 ... an-1)
  – first-order Markov assumption: P(S) = P(a1) P(a2|a1) P(a3|a2) ... P(an|an-1)
  – i.e. P(S) = P(a1) ∏i P(ai|ai-1)
• HMMs: the probability depends on the states passed through
  – if the path is known (states = s1, s2 ... sn):
    • P(S) = P(s1) P(a1|s1) P(s2|s1) P(a2|s2) P(s3|s2) ... P(sn|sn-1) P(an|sn)
  – if unknown, sum over all possible paths
    • see the Forward algorithm below
• important for determining how consistent (probable) an observed sequence is, given a model
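As a small illustration of the first-order formula above, here is a sketch that scores a DNA sequence against the transition table from the earlier slide; the uniform initial distribution is an assumption made for illustration.

```python
import math

def markov_log_prob(seq, init, trans):
    """log P(S) = log P(a1) + sum over i of log P(ai | ai-1)."""
    logp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp

# Transition table from the earlier slide; the uniform initial
# distribution is an assumption for illustration.
init = {b: 0.25 for b in "ATGC"}
trans = {
    "A": {"A": 0.2,  "T": 0.35, "G": 0.15, "C": 0.3},
    "T": {"A": 0.25, "T": 0.15, "G": 0.4,  "C": 0.2},
    "G": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "C": {"A": 0.3,  "T": 0.25, "G": 0.2,  "C": 0.25},
}
print(markov_log_prob("AACATGGTACATGTT", init, trans))
```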

Page 6:

Most probable path: Viterbi algorithm

• if the path of states is unknown, it can be recovered
• the path that gives the highest probability to the sequence
• define p(i,j) as the probability of the most probable path ending in state j after emitting element i
• compute recursively:
  – suppose you knew p(i-1,j) for all states up to the previous character
  – update: p(i,k) = P(ai|sk) * maxj [ p(i-1,j) * P(sk|sj) ]
• dynamic programming and "traceback"
  – keep a table of state probabilities: start with the 1st character, assign a probability to each state, and iterate the updates (a worked sketch follows the table)

Example table of path probabilities p(i,j):

        a1    a2    a3    a4
  s1   0.6   0.9   0.9   0.2
  s2   0.1   0.1   0.05  0.8
  s3   0.3   0     0.05  0
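Below is a minimal sketch of the Viterbi recursion and traceback just described. The two-state exon/intron model and all of its numbers are hypothetical; log space is used, which assumes no zero probabilities (see the Dirichlet-prior slide later).

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs, in log space.

    start_p[s] = P(s1 = s); trans_p[s][t] = P(t | s); emit_p[s][a] = P(a | s).
    Returns (log probability of the best path, the path itself).
    """
    # V[i][s] = log prob of the best path emitting obs[:i+1] and ending in s
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[i][s] = predecessor of s on that best path
    for i in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(((p, V[i - 1][p] + math.log(trans_p[p][s]))
                               for p in states), key=lambda x: x[1])
            V[i][s] = score + math.log(emit_p[s][obs[i]])
            back[i][s] = prev
    # traceback from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return V[-1][last], path[::-1]

# Hypothetical two-state exon/intron model:
states = ["exon", "intron"]
start = {"exon": 0.5, "intron": 0.5}
trans = {"exon":   {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.2, "intron": 0.8}}
emit = {"exon":   {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
print(viterbi("GGCATTTA", states, start, trans, emit))
```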

Page 7:

Forward Algorithm

• the probability of a sequence, summed over all possible paths
• similar to Viterbi, except replace 'max' with a probabilistic 'sum'
  – let q(i,j) be the total probability of seeing a1..ai and ending in state j
  – q(i+1,k) = P(ai+1|sk) * Σj [ q(i,j) * P(sk|sj) ]
• at the end of the computation, q(n,j) gives the total probability of having generated the whole sequence and ended in state j
• the total probability of the sequence given the model is Σj q(n,j)
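A sketch of the Forward recursion, using the same model conventions as the Viterbi sketch above. For long sequences the repeated products underflow floating point, so real implementations rescale each column or work in log space.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """P(obs | model), summed over all state paths (the Forward algorithm)."""
    # q[s] = P(a1..ai, ending in state s), updated column by column
    q = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for a in obs[1:]:
        q = {s: emit_p[s][a] * sum(q[p] * trans_p[p][s] for p in states)
             for s in states}
    return sum(q.values())  # total probability = sum of q(n,j) over states j
```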

Page 8:

Backward Algorithm

• estimates the probability of going through state j at time i (i.e. observation i generated by state j):

  P(si=j | a1..an) = [ P(a1..ai, si=j) * P(ai+1..an | si=j) ] / P(a1..an)

  – prefix probability P(a1..ai, si=j): comes from the Forward algorithm
  – suffix probability P(ai+1..an | si=j): work backwards from the last character, summing probabilities over states
  – denominator P(a1..an): the total probability, used to normalize
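A sketch of the backward recursion and the resulting posterior P(si=j | a1..an), combining it with the forward values as in the formula above (same dict-based model conventions as the earlier sketches).

```python
def backward(obs, states, trans_p, emit_p):
    """b[i][s] = P(a_{i+1}..a_n | state at position i is s)."""
    n = len(obs)
    b = [None] * n
    b[n - 1] = {s: 1.0 for s in states}  # nothing left to emit after a_n
    for i in range(n - 2, -1, -1):
        b[i] = {s: sum(trans_p[s][t] * emit_p[t][obs[i + 1]] * b[i + 1][t]
                       for t in states) for s in states}
    return b

def posteriors(obs, states, start_p, trans_p, emit_p):
    """P(state i = j | a1..an) = forward(i,j) * backward(i,j) / P(a1..an)."""
    n = len(obs)
    f = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for i in range(1, n):
        f.append({s: emit_p[s][obs[i]] *
                     sum(f[i - 1][p] * trans_p[p][s] for p in states)
                  for s in states})
    b = backward(obs, states, trans_p, emit_p)
    total = sum(f[n - 1].values())  # P(a1..an), the normalizer
    return [{s: f[i][s] * b[i][s] / total for s in states} for i in range(n)]
```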

Page 9:

Training HMMs

• easy if you know the states (e.g. from a multiple alignment)
  – just compute the probability distribution over elements at each site
• problem: dealing with unobservable state information
  – for raw sequences: which element came from which state?
• EM (expectation maximization, aka 'Baum-Welch'; sketched below)
  – uses the forward and backward algorithms
  – start with random transition probabilities
  – compute the most likely path and the state probabilities
    • use the best current guesses to estimate the hidden information
  – update the state probabilities based on that; iterate
  – converges on the transition probabilities that maximize the likelihood of the observed input (training) sequences
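Here is a compact, illustrative single Baum-Welch iteration built from the forward and backward recursions; it omits the scaling and pseudocounts a real implementation would need, so treat it as a sketch of the E- and M-steps rather than production code.

```python
from collections import defaultdict

def baum_welch_step(seqs, states, alphabet, start_p, trans_p, emit_p):
    """One EM (Baum-Welch) iteration over a set of training sequences.

    E-step: forward/backward posteriors give expected start, transition,
    and emission counts.  M-step: renormalize the counts into new parameters.
    """
    A = defaultdict(float)  # expected transition counts A[s, t]
    E = defaultdict(float)  # expected emission counts E[s, a]
    S = defaultdict(float)  # expected counts of starting in each state
    for obs in seqs:
        n = len(obs)
        # forward pass: f[i][s] = P(obs[:i+1], state i = s)
        f = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
        for i in range(1, n):
            f.append({s: emit_p[s][obs[i]] *
                         sum(f[i - 1][p] * trans_p[p][s] for p in states)
                      for s in states})
        # backward pass: b[i][s] = P(obs[i+1:] | state i = s)
        b = [None] * n
        b[n - 1] = {s: 1.0 for s in states}
        for i in range(n - 2, -1, -1):
            b[i] = {s: sum(trans_p[s][t] * emit_p[t][obs[i + 1]] * b[i + 1][t]
                           for t in states) for s in states}
        total = sum(f[n - 1].values())  # P(obs | current model)
        for i in range(n):
            for s in states:
                gamma = f[i][s] * b[i][s] / total  # P(state i = s | obs)
                E[s, obs[i]] += gamma
                if i == 0:
                    S[s] += gamma
                if i < n - 1:
                    for t in states:  # expected use of transition s -> t
                        A[s, t] += (f[i][s] * trans_p[s][t] *
                                    emit_p[t][obs[i + 1]] * b[i + 1][t] / total)
    # M-step: renormalize the expected counts into probabilities
    new_start = {s: S[s] / sum(S.values()) for s in states}
    new_trans = {s: {t: A[s, t] / sum(A[s, u] for u in states) for t in states}
                 for s in states}
    new_emit = {s: {a: E[s, a] / sum(E[s, c] for c in alphabet)
                    for a in alphabet} for s in states}
    return new_start, new_trans, new_emit
```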

Page 10:

[Figure: four aligned training sequences

ACTVHLLRKMP
ASTIHILRKMA
ACSVHILKKQP
ACIVHMLKKMP

with per-site emission distributions read off the columns: site 1: A 1.0; site 2: C .75, S .25; site 3: S .25, T .5, I .25; site 4: V .75, I .25; and a set of unaligned raw sequences of varying lengths, e.g. ACTTTSPVVHLLRKMP, ASTIHILKMA, MNFIYPQSACSVHILKKQP, CIVLKKMP]

Which letters came from which states? Let the model tell you (using the Backward algorithm, based on the current parameters).

Page 11:

Advanced Issues

• topology searching
  – similar to structure learning in Bayesian networks? (Chickering, Heckerman, Buntine...)
  – "duration", length (# of states), and exit probabilities
• higher-order HMMs (probabilities depend on a window of several recent states)
  – e.g. P(ai | ai-1, ai-2, ai-3)
• Dirichlet priors (a minimal pseudo-count sketch follows below)
  – pseudo-counts (because an observed count of 0 doesn't necessarily mean the true probability is 0, but Prob = 0 kills the whole computation)
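A minimal sketch of the pseudo-count idea: adding a uniform Dirichlet parameter alpha to the observed counts. (The Sjolander et al. paper cited earlier uses richer Dirichlet mixtures; a single uniform alpha is the simplest case.)

```python
def emission_probs(counts, alphabet, alpha=1.0):
    """Turn observed counts into probabilities with uniform pseudocounts.

    alpha > 0 keeps every probability nonzero, so a residue never seen
    in training cannot zero out an entire path probability.
    """
    total = sum(counts.get(a, 0) for a in alphabet) + alpha * len(alphabet)
    return {a: (counts.get(a, 0) + alpha) / total for a in alphabet}

# A column where only A and C were observed still gives every residue
# a small nonzero probability:
print(emission_probs({"A": 3, "C": 1}, "ACDEFGHIKLMNPQRSTVWY"))
```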

Page 12:

Applications I: Protein Families

• PFAM database
  – http://pfam.wustl.edu, http://sanger.ac.uk/Software/Pfam
  – models for each family are available
  – the files consist of transition probabilities
  – 3 states per "site": match, insert, delete
  – constructed from multiple sequences (unaligned)
  – can be used for searching (more sensitive than pairwise homology detection) or for alignment

[Figure: profile HMM architecture: a start state followed by match states for site 1 through site 5; each site has an associated insert state and delete state, and the match and insert states emit amino acids (A, C, D, ...)]

Page 13:

HMM Software

• HMMER (Sean Eddy; Washington University, St. Louis)
  – http://hmmer.wustl.edu
  – commands: hmmalign, hmmsearch
  – ls/fs model variants (glocal vs. fully local alignment), calibration
• other software: SAM
  – Hughey, R. and Krogh, A. (1996). CABIOS, 12:95-107.
  – http://www.cse.ucsc.edu/research/compbio/sam.html

Page 14:

Applications II: DNA patterns

• Splice junctions: GeneSplicer (Salzberg)
  – Pertea, M., Lin, X., and Salzberg, S. (2001). GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Research, 29(5):1185-1190.
• Gene finding: GLIMMER (Salzberg)
  – Salzberg, S., Delcher, A., Kasif, S., and White, O. (1998). Microbial gene identification using interpolated Markov models. Nucleic Acids Research, 26(2):544-548.
  – use HMMs to refine the classification of ORFs (e.g. by GC content)
  – build 1 model for each of the 6 reading frames (different nucleotide probabilities)
  – interpolated 5th-order Markov chain (depends on the preceding 5-mer); see the sketch after the figure below
  – use the combined scores to indicate consistency and help pick the right frame

[Figure: gene structure ...<exon>........GT[..........<intron>..........]AG.......<exon>..., with an HMM-donor model at the GT donor site, an HMM-acceptor model at the AG acceptor site, and HMM-coding models over the coding regions]
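A simplified sketch of interpolated Markov model scoring. The fixed interpolation weights are an assumption made for illustration; GLIMMER actually derives the weights from how often each context was seen in training.

```python
import math

def imm_log_score(seq, models, weights):
    """Score seq with an interpolated Markov model.

    models[k] maps a length-k context string to a dict of next-base
    probabilities (models[0][""] is the background composition);
    weights[k] is the interpolation weight of the order-k chain, and
    the weights are assumed to sum to 1.
    """
    order = len(weights) - 1
    logp = 0.0
    for i in range(order, len(seq)):  # skip the first `order` bases
        # blend the predictions of the order-0 .. order-k chains
        p = sum(w * models[k][seq[i - k:i]][seq[i]]
                for k, w in enumerate(weights))
        logp += math.log(p)
    return logp
```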

Page 15:

Applications III: Secondary Structure

• Goldman, N., Thorne, J.L., and Jones, D.T. (1996). Using evolutionary trees in protein secondary structure prediction and other comparative analyses. JMB, 263:196-208.
• old method: amino acid biases; probability based on a window of local residues (Chou-Fasman)
• new method: HMM
  – 3 models: helix, strand, coil (each has a 20x20 table with transition frequencies between neighboring residues, ai => ai+1); a toy per-class scoring sketch follows
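A toy sketch of the per-class scoring idea: score a segment under each class's 20x20 neighbor-transition table and pick the best. The table layout is assumed for illustration, and the full method also incorporates the evolutionary corrections described on the next slide.

```python
import math

def classify_segment(segment, class_tables):
    """Assign a segment to the class (helix/strand/coil) whose
    neighbor-transition table gives it the highest likelihood.

    class_tables[c][a][b] = P(next residue b | current residue a, class c);
    entries are assumed to be nonzero.
    """
    def log_lik(tbl):
        return sum(math.log(tbl[a][b]) for a, b in zip(segment, segment[1:]))
    return max(class_tables, key=lambda c: log_lik(class_tables[c]))
```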

Page 16:

• problem: amino acid replacement probabilities depend on evolutionary distance
  – given a set of training sequences, the probabilities depend on how similar/redundant the sequences are
  – not an independent sample (homology weighting?)
  – correct the probabilities for the distance between sequences (% homology), which determines the instantaneous substitution "rates"
• using the HMMs to predict secondary structure
  – input: a multiple alignment
  – step 1: estimate the phylogeny by maximum likelihood
  – step 2: scale the probability tables for PAM distance
  – step 3: estimate Prob(class|site i) by summing over Prob(class|site i-1), Prob(ai|class), and Prob(ai|ai-1), and over site i+1 (for consistency)
  – takes a forward and a backward pass to determine P(class|site i)

Page 17:

Applications IV: Protein Fold Recognition

• Bienkowska, Yu, Zarakhovich, Rogers, and Smith (2000). Protein fold recognition by total alignment probability. Proteins: Structure, Function, and Genetics, 40(3):451-462.
• an improved approach to threading/family 3D profiles
• model proteins as segments of secondary structure
  – build an HMM with a set of states for each segment (from DSSP assignments)
  – calculate the amino acid probabilities in each segment (like a 3D profile), also as a function of exposure
• P(model|seq) = P(seq|model) * P(model) / P(seq)
  – P(model) is based on the SCOP distribution
  – P(seq|model):
    • method 1: most probable path (Viterbi algorithm): 64% accuracy
    • method 2: sum over all paths (Forward algorithm): 90% accuracy

[Figure: an HMM over secondary-structure segments: strand, helix, loop, helix, loop, strand]

Page 18:

Applications V: Fold Classification by Support Vector Machine

• Jaakkola, T., Diekhans, M., and Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. Proceedings of the Conf. on Intelligent Systems for Molecular Biology, 149-158.
• Support Vector Machines (SVMs)
  – classification based on maximum-margin hyperplanes between positive and negative examples
  – efficient computation uses only pairwise distances

[Figure: globins vs. non-globin sequences separated by a maximum-margin hyperplane; but what space are the sequences embedded in, and what distance metric is used?]

Page 19:

• HMM distance between sequences: the Fisher kernel
  – re-represent each sequence by a vector of "sufficient statistics" (probabilities of internal parameters in the model)
  – e.g. how likely was state i to have been used by sequence A?
  – the Fisher score is the derivative (sensitivity) of the log-likelihood of a sequence with respect to the internal parameters θ:
    • U_X = ∇θ log P(X | H1, θ)
    • the gradient is a vector of derivatives, one for each parameter
  – dist(X,X') = exp[ -(U_X - U_X')ᵀ(U_X - U_X') / (2σ²) ]
    • like a dot-product or Euclidean distance between two vectors
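The distance formula transcribes almost directly into code; the Fisher score vectors U_X are assumed to be given (they come from differentiating the HMM log-likelihood with respect to its parameters).

```python
import numpy as np

def fisher_kernel(u_x, u_y, sigma=1.0):
    """Gaussian kernel over Fisher score vectors U_X = grad_theta log P(X|H1, theta).

    dist(X, X') = exp[-(U_X - U_X')^T (U_X - U_X') / (2 sigma^2)]
    """
    d = np.asarray(u_x, dtype=float) - np.asarray(u_y, dtype=float)
    return float(np.exp(-(d @ d) / (2.0 * sigma ** 2)))
```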

Page 20:

Example I: KaiA, N-terminal domain

• a protein involved in regulating circadian rhythms in cyanobacteria
• 284 amino acids: 2 domains
  – N-terminal domain solved by NMR (LiWang, Vakonakis)
  – whole molecule solved by crystallography (Sacchettini)
• N-terminal domain
  – no apparent homology to known proteins a priori
    • BLAST didn't reveal a family relationship; none > 15% identity
  – when the structure was solved, it was found to be an α/β structure similar to response receiver domains (CheY, FixJ, AmiR...)
    • chemotaxis, nitrogen fixation, amidase activity...
    • the relationship was detected by structural homology search (DALI)

Page 21:

N-terminal domain structure

[Figure: the KaiA N-terminal domain structure]

Page 22:

Using HMMER on KaiA

• 1. Search PFAM with the sequence:
  – returns the response_reg family
• 2. Download the HMM model for the family
• 3. Run the KaiA sequence against the model with hmmsearch to determine significance
• 4. Run the KaiA sequence, together with representative members of the family, through hmmalign to build a multiple alignment

Page 23:

Input: FASTA format

> KaiA, domain 1
MLSQIAICIWVESTAILQDCQRALSADRYQLQVCESG...
> CheY - 3CHY
ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAED...
> DrrD - 1KGS
NVRVLVVEDERDLADLITEALKKEFTVDVCYDGEEGY...
> NarL - 1RNL
EPATILLIDDHPMLRTGVKQLISMAPDITVVGEASNG...
> FixJ - 1D5W
MQDYTVHIVDDEEPVRKSLAFMLTMNGFAVKMHQSAE...

Page 24:

The response_reg HMM model (states and transition probabilities):

HMMER2.0 [2.3.1]
NAME  response_reg
ACC   PF00072
DESC  Response regulator receiver domain
LENG  125
ALPH  Amino
RF    no
CS    no
MAP   yes
COM   hmmbuild -F HMM_ls.ann SEED.ann
COM   hmmcalibrate --seed 0 HMM_ls.ann
NSEQ  54
DATE  Mon Jun 23 20:37:43 2003
CKSUM 3355
GA    -17.0 -17.0
TC    -17.0 -17.0
NC    -17.1 -17.1
XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT  -4 -8455
NULE  595 -1558 85 338 -294 453 -1158 197
EVD   -88.885818 0.177915
HMM        A      C      D      E      F      G      H      I   ...
     1   -346     57  -5567  -1768    445   -964   1156    928
     2  -1511  -4502   -477  -2328  -4823  -4003   1597  -1632
     3    501    259   -117    908   1124   -390   -625   4024
   ...

Page 25:

> hmmsearch respregls.hmm combined.fa

hmmsearch - search a sequence database with a profile HMM
HMMER 2.2g (August 2001)
Copyright (C) 1992-2001 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM file:                   respregls.hmm [response_reg]
Sequence database:          combined.fa
per-sequence score cutoff:  [none]
per-domain score cutoff:    [none]
per-sequence Eval cutoff:   <= 10
per-domain Eval cutoff:     [none]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query HMM: response_reg
Accession: PF00072
Description: Response regulator receiver domain
[HMM has been calibrated; E-values are empirical estimates]

Scores for complete sequences (score includes all domains):
Sequence         Description    Score    E-value   N
--------         -----------    -----    -------   ---
3CHY:            CheY           156.2    3e-47     1
KaiA, domain 1                  -51.3    0.0037    1
1amx                            -97.6    3         1

(E-values are significant if < 0.1)

Page 26:

> hmmalign respregls.hmm combined3.fa

DrrD    YAL--NEP.-FDVVILDI-LPV.....HDGWE.ILKS-RESGVNTPV---
CheY    LNKLQAGG.-YGFVISDWNMPN.....MDGLE.LLKTIRADGAMSALPVL
NarL    IELAESLD.-PDLILLDLNMPG.....MNGLE.TLDKLREKSLSGRI--V
FixJ    LAFAPDVR.-NGVLVT-LRMPD.....MSGVE.LLRNLGDLKINIPS--I
Etr1    LRVVSHEH.--KVVFMDVCMPGvenyqIA-LR.IHEKFTQRHQRPLL--V
AmiR    FDV----P.-VDVVFTSI-FQN.....RHHDE.IAALLAAGTPRTTL--V
KaiA    LEYAQTHRdQIDCLILVAANPS.....---FRaVVQQLCFEGVVVPA--I
#=GC RF xxxxxxxx.xxxxxxxxxxxxx.....xxxxx.xxxxxxxxxxxxxxxxx

DrrD    LLTALSDVEYRVKGLN-GADDYLPKPFDLRELIARVRALIRRkSeskstk
CheY    MVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKlGm.....
NarL    VFSVSNHEEDVVTALKRGADGYLLKDMEPEDLLKALHQAAAGeMvlseal
FixJ    VITGHGDVPMAVEAMKAGAVDFIEKPFEDTVIIEAIERASEHlV......
Etr1    ALSGNTDKSTKEKCMSFGLDGVLLKPVSLDNIRDVLSDLLEPrVlye...
AmiR    ALVEYESPAVLSQIIELECHGVITQPLDAHRVLPVLVSARRIsEemaklk
KaiA    VVGDRDP---AKEQLYHSAELHLGIH-QLEQLPYQVDAALAEfLrlapve
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.x......

Multiple alignment with response_reg family members; the #=GC RF line marks the HMM (match) states.

Page 27:

Summary of KaiA

• the weak homology could have been detected using HMMs (through a PFAM search)
  – the HMM model is a more general representation of the family
  – allows more sensitive searches
• HMMs allow pairwise alignment to other fold-family members
  – no reasonable global alignment could be constructed using Smith-Waterman
  – tried various gap parameters and similarity matrices
  – produced "random" gap placement
  – yet aligning CheY and KaiA to the model gives a meaningful pairwise alignment