CECS 694-04 Bioinformatics Journal Club, Eric Rouchka, D.Sc., September 10, 2003
Eric C. Rouchka, University of Louisville
SATCHMO: sequence alignment and tree construction using hidden Markov models
Edgar, R.C. and Sjolander, K. Bioinformatics. 19(11):1404-1411.
CECS 694-04 Bioinformatics Journal Club
Eric Rouchka, D.Sc.
September 10, 2003
What is Multiple Sequence Alignment (MSA)?
• Taking more than two sequences and aligning them based on similarity
Globin Example
>gamma_A
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVD
PENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVASALSSRYH
>alfa
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK
LLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVD
PENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>delta
VHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVD
PENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH
>epsilon
VHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDP
ENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
>gamma_G
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVD
PENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH
>myoglobin
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP
VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
>teta1
ALSAEDRALVRALWKKLGSNVGVYTTEALERTFLAFPATKTYFSHLDLSPGSSQVRAHGQKVADALSLAVERLDDLPHALSALSHLHACQLRVDPASFQLL
GHCLLVTLARHYPGDFSPALQASLDKFLSHVISALVSEYR
>zeta
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHC
LLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR
Globin Multiple Alignment
Why do MSA?
• Homology Searching
  – Important regions conserved across (or within) species
    • Genic Regions
    • Regulatory Elements
• Phylogenetic Classification
• Subfamily classification
• Identification of critical residues
MSA Approaches
• All columns alignable across all sequences
  – MSA
  – ClustalW
• Columns alignable throughout all sequences singled out (Profile HMM)
  – HMMER
  – SAM
MSA
• N-dimensional dynamic programming
• Time consuming
• High memory usage
• Guaranteed to yield the optimal (maximum-scoring) alignment
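To make the time and memory cost concrete, here is a rough sketch of how the full dynamic-programming lattice grows with the number of sequences. The function names are illustrative, and the count ignores the pruning heuristics that real implementations such as the MSA program apply.

```python
def dp_cells(lengths):
    """Cells in a full N-dimensional DP lattice for sequences of the given lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # one extra position per axis for the leading gap row
    return cells


def predecessors_per_cell(num_sequences):
    """Each interior cell looks back at 2^N - 1 neighbours (every non-empty subset of axes)."""
    return 2 ** num_sequences - 1


if __name__ == "__main__":
    for n in (2, 3, 5):
        lengths = [150] * n  # roughly globin-sized sequences
        print(n, dp_cells(lengths), predecessors_per_cell(n))
```

Both factors grow exponentially in the number of sequences, which is why exact N-dimensional alignment is practical only for a handful of short sequences.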
ClustalW
• Progressive Alignment
  – Sequences aligned in pair-wise fashion
  – Alignment scores produce phylogenetic tree
  – Enhanced dynamic programming approach
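The first two steps of a progressive aligner can be sketched as follows: score every pair with a global alignment, then derive a join order from those scores. This is only an illustration under simplifying assumptions. The match/mismatch/gap values are arbitrary, and the greedy average-linkage clustering is a stand-in for the tree-building method ClustalW actually uses.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def join_order(seqs):
    """Greedy average-linkage clustering on pairwise scores; returns the merge order."""
    clusters = {name: (name,) for name in seqs}
    score = {frozenset(p): nw_score(seqs[p[0]], seqs[p[1]])
             for p in combinations(seqs, 2)}

    def avg(x, y):
        # Mean pairwise score between the members of two clusters.
        pairs = [(i, j) for i in clusters[x] for j in clusters[y]]
        return sum(score[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        x, y = max(combinations(sorted(clusters), 2), key=lambda p: avg(*p))
        merges.append((x, y))
        clusters[x + "+" + y] = clusters.pop(x) + clusters.pop(y)
    return merges
```

For example, `join_order({"a": "ACGT", "b": "ACGT", "c": "TTTT"})` joins `a` and `b` first, then adds `c`.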
Hidden Markov Models
• Match State, Insert State, Delete State
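A minimal sketch of how the three state types connect in a profile HMM. It assumes the fully connected textbook architecture, in which every state of node k may reach the insert state of node k and the match and delete states of node k+1. Begin and end states are left out, and the tuple encoding of states is purely illustrative.

```python
def profile_hmm_edges(num_nodes):
    """Allowed (source, target) transitions between M/I/D states of a profile HMM."""
    edges = []
    for k in range(1, num_nodes + 1):
        for src in ("M", "I", "D"):
            # Enter (or stay in) the insert state attached to node k.
            edges.append(((src, k), ("I", k)))
            if k < num_nodes:
                # Advance to the next node, emitting (M) or silently (D).
                edges.append(((src, k), ("M", k + 1)))
                edges.append(((src, k), ("D", k + 1)))
    return edges


# Match and insert states emit a residue; delete states are silent.
EMITTING = {"M", "I"}
```

With five nodes this gives 39 transitions, so the transition parameters fit in a small table with one row per node.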
HMMs
• Models conserved regions
• Successful at detecting and aligning critical motifs and conserved core structure
• Difficulty in aligning sequence outside of these regions
SATCHMO
• Simultaneous Alignment and Tree Construction using Hidden Markov mOdels
www.lib.jmu.edu/music/composers/armstrong.htm
SATCHMO
• Progressive Alignment
  – Built iteratively in pairs
  – Profile HMMs used
• Alignments of the same sequences are not the same at each node
• Number of columns predicted to be alignable gets smaller as structures diverge
• Output is not represented by a single matrix
Why HMMs?
• Homologs ranked through scoring
• Accurate profiles from small numbers of sequences
• Accurately combines two alignments having low sequence similarity
Bits saved relative to background
• k = 1..M: HMM node number
• a: amino acid type
• P_k(a): emission probability of a in the kth match state
• P_0(a): approximation of the background probability of a
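The formula itself is not present in this transcript. One form consistent with the definitions above is the average relative entropy of the match states against the background, b = (1/M) Σ_k Σ_a P_k(a) log2(P_k(a) / P_0(a)). That form is an assumption made for this sketch, not a quotation from the slide, and the function name is illustrative.

```python
import math


def bits_saved(match_emissions, background):
    """Average relative entropy (in bits) of match-state emissions vs. background.

    match_emissions: list of dicts, one per match state, mapping residue -> P_k(a)
    background:      dict mapping residue -> P_0(a)
    """
    total = 0.0
    for pk in match_emissions:
        for a, p in pk.items():
            if p > 0:  # terms with P_k(a) = 0 contribute nothing
                total += p * math.log2(p / background[a])
    return total / len(match_emissions)
```

A match state identical to the background contributes 0 bits, while a state that always emits a single residue with background frequency 0.25 contributes 2 bits.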
Sequence weights
• Sequences weighted such that b converges on a desired value
• Weights compensate for correlation in sequences
HMM Construction
• Profile HMM constructed from multiple alignment
• Some columns alignable; others not
HMM Construction
• Given an alignment a, a profile HMM is generated
• Each column in a is assigned to an emitter state; emission probabilities are calculated from the observed amino acids, and transition probabilities from the observed pattern of residues and gaps
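As an illustration of turning one alignment column into the emission distribution of its state, the sketch below uses weighted residue counts with a flat add-one pseudocount. The pseudocount scheme is a stand-in chosen for brevity, since the slides do not state the estimator; the function and parameter names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_emissions(column, weights=None, pseudocount=1.0):
    """Estimate emission probabilities for one alignment column.

    column:  string of residues, one per sequence ('-' marks a gap)
    weights: optional per-sequence weights (defaults to 1.0 each)
    """
    if weights is None:
        weights = [1.0] * len(column)
    counts = Counter()
    for residue, w in zip(column, weights):
        if residue in AMINO_ACIDS:  # gaps and unknown symbols are skipped
            counts[residue] += w
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}
```

For the column `"AAA-"` with unit weights this gives P(A) = 4/23 and 1/23 for every other residue.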
Transition Probabilities
• If we have a total of five match states, the probabilities can be stored in the following table:
HMM Terminology
• $\pi$: path through an HMM to produce a sequence s
• $P(A \mid \pi) = \prod_{s \in A} P(s \mid \pi_s)$
• $\pi^{+}$: maximum probability path through the HMM
Aligning Two Alignments
• One alignment is converted to an HMM
• Second alignment is aligned to the HMM
  – Some columns remain alignable
  – Affinities (relative match scores) calculated
• New MSA results
• HMM constructed from new MSA
Aligning Two Alignments
SATCHMO Algorithm
• Step 1:
  – Create a cluster for each input sequence and construct an HMM from the sequence
• Step 2:
  – Calculate the similarity of all pairs of clusters and identify a pair with highest similarity
  – Align the target and template to produce a new node
SATCHMO Algorithm
• Repeat step 2 until:
  – All sequences are assigned to a single cluster
  – Highest similarity between clusters is below a threshold
  – No alignable positions are predicted
• Output: A set of binary trees
  – Leaf nodes are sequences
  – Each node contains an HMM aligning the sequences in the subtree
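The control flow of the two steps and the stopping conditions can be sketched as a greedy agglomerative loop. This shows only the loop structure, not the SATCHMO implementation. `make_model`, `similarity`, and `merge` are hypothetical callables standing in for HMM construction, cluster similarity scoring, and alignment-to-HMM alignment, and `merge` is assumed to return `None` when no alignable positions are predicted.

```python
from itertools import combinations


def build_trees(sequences, make_model, similarity, merge, threshold):
    """Greedy pairwise merging; returns the roots of the resulting binary trees."""
    # Step 1: one cluster (leaf node) per input sequence.
    roots = [{"model": make_model(s), "children": []} for s in sequences]

    # Step 2, repeated: merge the most similar pair of clusters.
    while len(roots) > 1:
        i, j = max(
            combinations(range(len(roots)), 2),
            key=lambda p: similarity(roots[p[0]]["model"], roots[p[1]]["model"]),
        )
        if similarity(roots[i]["model"], roots[j]["model"]) < threshold:
            break  # best remaining pair is too dissimilar
        merged = merge(roots[i]["model"], roots[j]["model"])
        if merged is None:
            break  # no alignable positions predicted
        node = {"model": merged, "children": [roots[i], roots[j]]}
        roots = [r for k, r in enumerate(roots) if k not in (i, j)] + [node]
    return roots
```

Because the loop can stop before everything is joined, the result is a list of trees rather than a single tree, matching the "set of binary trees" output described above.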
Graphical Interface for SATCHMO
Demonstration of SATCHMO
Validation Set
• BAliBASE benchmark alignment set used
  – Ref1: equidistant sequences
  – Ref2: distantly related sequences
  – Ref3: subgroups of sequences; < 25% similarity between groups
  – Ref4: alignments with long extensions on the ends
  – Ref5: alignments with long insertions
Comparison of Results
• SATCHMO compared to:
  – ClustalW (Progressive Pairwise Alignment)
  – SAM (HMM)
Discussion
• SATCHMO effective in identifying protein domains
• Comparison to T-Coffee and PRRP would be useful
  – Time and sensitivity
• Tree representation is unique, modeling structural similarity