Fast, sensitive homology detection using HMMER
Rob Finn, Sequence Families Team Lead
@robdfinn, [email protected]
Nov 2018
Making sense of sequence data
Sequence Data Information
Experimental Literature
Sporadic Literature
Similarity
Uncharacterized
Reference Proteomes
Complete Proteomes
Other Sequences & Metagenomics
Model Organisms
MGnify Protein Database
• >1 billion sequences, mean length of 205
• <1% match UniProtKB, but 58% match Pfam
[Figure: Length distribution. x-axis: length of sequence (0 to 2000); y-axis: frequency (millions); protein categories: partial, C-term truncated, N-term truncated, full length]
[Figure: Growth of MGnify compared with UniProt. x-axis: year (2002 to 2018); y-axis: number of sequences (up to 1,200,000,000); series: UniProt, MGnify]
undergone convergent evolution to form a stable 3-on-3 α-helical sandwich fold. Interestingly, it was subsequently discovered that phycocyanins can aggregate forming clusters that then adhere to the membrane forming the so-called phycobilisomes. Such a functional relationship may indeed point to convergent evolution from a distant common ancestor.

The second example, which is extracted from the work of one of our groups (Tsigelny et al., 2000), illustrated how the combination and integration of different sources of information, including structural alignments, could help to functionally characterize a protein. In our work, two new EF-hand motifs were identified in acetylcholinesterase (AChE) and related proteins by combining the results from a hidden Markov model sequence search, Prosite pattern extraction, and protein structure alignments by CE. It was also found that the α/β hydrolase fold family, including acetylcholinesterases, contains putative Ca2+ binding sites, indicative of an EF-hand motif, and which in some family members may be critical for heterologous cell associations. This putative finding represented the second characterization of an EF-hand motif within an extracellular protein, which previously had only been found in osteonectins. Thus, structure alignment had contributed to our understanding of an important family of proteins.

Finally, the third example, also from a previous work of one of our groups (McMahon et al., 2005), combined information from structural alignments deposited in the DBAli database and experiments to analyze the sequence and fold diversity of a C-type lectin domain. We demonstrated that the C-type lectin fold adopted by a major tropism determinant sequence, a retroelement-encoded receptor binding protein, provides a highly static structural scaffold in support of a diverse array of sequences. Immunoglobulins are known to fulfill the same role of a scaffold supporting a large variety of sequences necessary for an antigenic response. C-type lectins were shown to represent a different evolutionary solution taken by retroelements to balance diversity against stability.

MULTIPLE STRUCTURE ALIGNMENT

Our discussions thus far have involved only pair-wise structure comparison and alignment, or at best, alignment of multiple structures to a single representative in a pair-wise fashion (i.e., progressive pair-wise structure alignment). Most of the available methods for multiple structure alignment start by computing all pair-wise alignments between a set of structures but then use them to generate the optimal consensus alignment between all the structures.

Figure 16.2. Structure alignment for c-phycocyanin (1CPC:A) (black) and colicin A (1COL:A) (gray) as computed by SALIGN. The alignment extended over 86 residues with a 0.97 Å RMSD. The sequence identity of the superposed residues with respect to the shorter of the two structures was 11.9%.
Sequence And Structure Alignments
Fig. Adapted from Chap.16, Structural Bioinformatics, 2nd Ed., Marti-Renom et al
• Statistical inference, accounting for uncertainty
• Use more information
Profile hidden Markov models
$$\frac{P(t \mid \text{model of homology to } q)}{P(t \mid \text{model of nonhomology})} = \frac{P(t \mid H)}{P(t \mid R)}$$

$$S = \log \frac{P(t \mid H)}{P(t \mid R)}$$

With the optimal alignment $\pi_o$, the numerator is the joint probability of $t$ and the alignment:

$$S = \log \frac{P(t, \pi_o \mid H)}{P(t \mid R)}$$
Optimal alignment scores are only an approximation, and the approximation breaks down on remote homologs.
...GHRL...
...| |...
...GI-M...
According to inference theory, the correct score is a log-odds ratio summed over all alignments.

Optimal alignment score; HMMs: "Viterbi" score, V; BLAST (almost):

$$V = \log \frac{P(t, \pi_o \mid H)}{P(t \mid R)} = \log \frac{\max_{\pi} P(t, \pi \mid H)}{P(t \mid R)}$$

HMMs: "Forward" score, F; HMMER:

$$F = \log \frac{P(t \mid H)}{P(t \mid R)} = \log \frac{\sum_{\pi} P(t, \pi \mid H)}{P(t \mid R)}$$

Depends on:
- a probability model of alignment, not just scores
- algorithms fast enough to use in practice
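A minimal numeric sketch of the difference between the Viterbi and Forward scores. It assumes a made-up set of joint probabilities P(t, π | H) for a handful of alignments π and a made-up null probability P(t | R). The function names, the numbers, and the choice of log base 2 (bits) are illustrative assumptions, not taken from the slides or from the HMMER source code.

```python
import math


def viterbi_score(joint_probs, null_prob):
    """V: log-odds using only the single best alignment (max over pi)."""
    return math.log2(max(joint_probs) / null_prob)


def forward_score(joint_probs, null_prob):
    """F: log-odds summed over all alignments (sum over pi)."""
    return math.log2(sum(joint_probs) / null_prob)


# Hypothetical values of P(t, pi | H), one per candidate alignment pi.
close_homolog = [1e-6, 1e-9, 1e-9]            # one alignment dominates
remote_homolog = [1e-9, 8e-10, 7e-10, 6e-10]  # probability spread over several alignments
null_prob = 1e-10                             # hypothetical P(t | R)

for name, probs in [("close", close_homolog), ("remote", remote_homolog)]:
    v = viterbi_score(probs, null_prob)
    f = forward_score(probs, null_prob)
    print(f"{name}: V = {v:.2f} bits, F = {f:.2f} bits, F - V = {f - v:.2f} bits")
```

Because the sum includes the maximum term, F is never smaller than V. The gap grows when no single alignment dominates, which is the remote-homolog case where the optimal-alignment approximation breaks down.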
• Statistical inference, accounting for uncertainty
• Use more information
Profile hidden Markov models
Profile Hidden Markov Models - Encapsulate diversity

Input multiple alignment:
seq1 ACG-LD
seq2 SCG--E
seq3 NCGgFD
seq4 TCG-WQ

Consensus columns assigned, defining inserts and deletes:
     123-45

Plan7 core model:
B  M1 M2 M3 M4 M5  E
I0 I1 I2 I3 I4 I5
D1 D2 D3 D4 D5

Residues shown on the model: N T A S G; W F L Y; D E Q C
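A small sketch of how the slide's four-sequence alignment and its consensus annotation (123-45) map onto match, insert, and delete states. The function names are hypothetical, and only raw residue counts are computed; real profile HMM construction also applies sequence weighting and priors, which this sketch omits.

```python
from collections import Counter

# Alignment and consensus-column annotation as shown on the slide.
alignment = {
    "seq1": "ACG-LD",
    "seq2": "SCG--E",
    "seq3": "NCGgFD",
    "seq4": "TCG-WQ",
}
consensus = "123-45"  # digits mark consensus (match) columns, "-" marks an insert column


def match_counts(alignment, consensus):
    """Count residues observed in each consensus column (one Counter per match state)."""
    counts = []
    for col, mark in enumerate(consensus):
        if mark == "-":
            continue  # insert column: contributes no match-state emissions
        counts.append(Counter(seq[col].upper() for seq in alignment.values() if seq[col] != "-"))
    return counts


def state_path(seq, consensus):
    """Label each aligned position of one sequence as a match (M), insert (I) or delete (D) state."""
    path = []
    k = 0  # index of the most recent consensus column
    for col, mark in enumerate(consensus):
        if mark != "-":
            k += 1
            path.append(f"M{k}" if seq[col] != "-" else f"D{k}")
        elif seq[col] != "-":
            path.append(f"I{k}")  # residue in a non-consensus column
    return path


for k, column in enumerate(match_counts(alignment, consensus), start=1):
    print(f"M{k}: {dict(column)}")
for name, seq in alignment.items():
    print(name, " ".join(state_path(seq, consensus)))
```

In this reading, the lowercase g in seq3 is emitted by insert state I3, and the gap in seq2 under consensus column 4 corresponds to delete state D4.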
anecdotal search example: globin superfamily
query: alignment of three vertebrate hemoglobins and one myoglobin
target db: UniProt 7.0 (207K seqs; contains about 1060 known globins)
at E <= 0.01:
PSI-BLAST sees: 915 globins (9 sec)
HMMER3 sees: 1002 globins (8 sec)
Aplysia myoglobin (PDB 1mba)
Dates marked on the tree: ~300 Mya, ~550 Mya, ~600-700 Mya?, ~1000 Mya?, ~2500 Mya?

E-value (statistical significance)

Family                                Sequence     PSI-BLAST   HMMER
alpha hemoglobins                     HBA_HUMAN    4e-46       9e-62
alpha hemoglobins                     HBA_MOUSE    3e-42       4e-55
beta hemoglobins                      HBB_HUMAN    2e-57       4e-64
beta hemoglobins                      HBB1_MOUSE   9e-50       2e-57
myoglobins                            MYG_HUMAN    1e-45       2e-58
myoglobins                            MYG_MOUSE    2e-41       6e-54
neuroglobins                          NGB_HUMAN    -           1e-7
neuroglobins                          NGB_MOUSE    -           2e-7
plant leghaemoglobins                 LGB1_PEA     -           5e-5
plant leghaemoglobins                 LGB2_PEA     -           5e-6
bacterial nitric oxide dioxygenases   HMP_VIBCH    1.1         0.004
bacterial nitric oxide dioxygenases   HMP_ECOLI    0.45        -
Projecting profile HMMs back onto structures
generated using http://www.skylign.org/3DPatch/
Jakubec D et al, Bioinformatics, 2018
Different HMMER search methods
• phmmer: single protein sequence against protein sequence database.
• hmmscan: single protein sequence against profile HMM library (Pfam, CATH-Gene3D, PIRSF, Superfamily and TIGRFAMs).
• hmmsearch: either multiple sequence alignment or profile HMM against protein sequence database.
• jackhmmer: iterative searches. Initiated with a single sequence, a profile HMM or a multiple sequence alignment against a target sequence database.
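The four search types above also exist as command-line programs of the same names in the HMMER suite (http://hmmer.org). The sketch below only builds the argument list for each one; the helper name and all file names are hypothetical, and the programs must be installed separately before such a command could be run. Note that on the command line hmmsearch takes a profile HMM file, so an alignment would first need to be converted with hmmbuild.

```python
PROGRAMS = {"phmmer", "hmmscan", "hmmsearch", "jackhmmer"}


def hmmer_command(program, query, target, evalue=0.01, tblout=None):
    """Build an argv list for one HMMER search program (hypothetical helper)."""
    if program not in PROGRAMS:
        raise ValueError(f"unknown HMMER search program: {program}")
    argv = [program, "-E", str(evalue)]  # -E: report hits at or below this E-value
    if tblout is not None:
        argv += ["--tblout", tblout]     # --tblout: write a tabular per-target summary
    if program == "hmmscan":
        argv += [target, query]          # hmmscan: profile library first, query sequence second
    else:
        argv += [query, target]          # the others: query first, sequence database second
    return argv


# Hypothetical file names, for illustration only.
examples = [
    ("phmmer", "query.fasta", "proteins.fasta"),
    ("hmmscan", "query.fasta", "Pfam-A.hmm"),
    ("hmmsearch", "globins.hmm", "proteins.fasta"),
    ("jackhmmer", "query.fasta", "proteins.fasta"),
]
for program, query, target in examples:
    print(" ".join(hmmer_command(program, query, target)))
```

The returned list can be passed to `subprocess.run` once the programs and input files exist. Keeping the command construction separate from execution makes the argument order for each program easy to check.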
Find out more?
UNIT 3.15
The HMMER Web Server for Protein Sequence Similarity Search
Ananth Prakash,1 Matt Jeffryes,1 Alex Bateman,1 and Robert D. Finn1
1European Molecular Biology Laboratory, The European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom

Protein sequence similarity search is one of the most commonly used bioinformatics methods for identifying evolutionarily related proteins. In general, sequences that are evolutionarily related share some degree of similarity, and sequence-search algorithms use this principle to identify homologs. The requirement for a fast and sensitive sequence search method led to the development of the HMMER software, which in the latest version (v3.1) uses a combination of sophisticated acceleration heuristics and mathematical and computational optimizations to enable the use of profile hidden Markov models (HMMs) for sequence analysis. The HMMER Web server provides a common platform by linking the HMMER algorithms to databases, thereby enabling the search for homologs, as well as providing sequence and functional annotation by linking external databases. This unit describes three basic protocols and two alternate protocols that explain how to use the HMMER Web server using various input formats and user defined parameters. © 2017 by John Wiley & Sons, Inc.

Keywords: bioinformatics • homology • profile hidden Markov model • protein sequence analysis

How to cite this article:
Prakash, A., Jeffryes, M., Bateman, A., & Finn, R. D. (2017). The HMMER web server for protein sequence similarity search. Current Protocols in Bioinformatics, 60, 3.15.1–3.15.23. doi: 10.1002/cpbi.40

INTRODUCTION

The HMMER Web server (http://www.ebi.ac.uk/Tools/hmmer/) is an open-access protein sequence similarity search tool that hosts a suite of HMMER algorithms to identify evolutionarily related proteins and/or domains by employing profile hidden Markov models (HMMs; APPENDIX 3A, Schuster-Bockler & Bateman, 2007) for fast and efficient detection of close and remote homologs. The HMMER Web server provides four search interfaces to the corresponding algorithms in the HMMER suite (http://hmmer.org): phmmer, hmmscan, hmmsearch, and jackhmmer. The functionality of these algorithms are outlined in Table 3.15.1. The HMMER Web server can work with various input formats and user-defined parameters to provide results that are presented to help infer protein sequence conservation, function, and evolution. This article provides detailed protocols for using the Web versions of PHMMER, HMMSCAN, and JACKHMMER algorithms, and ways to navigate and interpret the output.

Basic Protocol 1 and Alternate Protocol 1 describe in detail how to use the basic and advanced search features, respectively, in PHMMER, and interpret the results using a protein sequence as the starting point. The logical organization and interpretation of the output described in Basic Protocol 1 is common to all other protocols and is therefore described in detail; user is referred back to this section in subsequent protocols.

Current Protocols in Bioinformatics 3.15.1–3.15.23, December 2017. Published online December 2017 in Wiley Online Library (wileyonlinelibrary.com). doi: 10.1002/cpbi.40. Copyright © 2017 John Wiley & Sons, Inc.
Acknowledgements
EMBL-EBI
The Sequence Families team: Matthias Blum, Hsin-Yu Chang, Sara El-Gebali, Matthew Fraser, Jaina Mistry, Alex Mitchell, Gift Nuka, Typhaine Paysan-Lafosse, Sebastien Pesseat, Simon Potter, Matloob Qureshi, Lorna Richardson, Gustavo Salazar-Orejuela, Amaia Sangrador
Collaborators:
Harvard University: Sean Eddy
University of Montana: Travis Wheeler
InterPro
hmmscan: single protein sequence against profile HMM library
hmmscan: search results
CFTR_RAT (P34158): ABC transporter, a chloride ion channel controlled by phosphorylation.
ABC transporter trans-membrane region
ABC transporter domain
jackhmmer: iterative searches. Initiated with a single sequence, a profile HMM or a multiple sequence alignment against a target sequence database
jackhmmer: iterative searches
phmmer: single protein sequence against protein sequence database
phmmer: search TRPA1_HUMAN (O75762)