Fast, sensitive homology detection using HMMER

Rob Finn, Sequence Families Team Lead (@robdfinn, [email protected]), Nov 2018

Making sense of sequence data

Sequence Data Information

Experimental Literature

Sporadic Literature

Similarity

Uncharacterized

Reference Proteomes

Complete Proteomes

Other Sequences & Metagenomics

Model Organisms

MGnify Protein Database

• >1 billion sequences, mean length of 205

• <1% match UniProtKB, but 58% match Pfam

[Figure: Length distribution. x-axis: length of sequence (0 to 2000); y-axis: frequency (millions, 0 to 5); legend (Protein): Partial, C-term truncated, N-term truncated, Full length.]

[Figure: Growth of MGnify compared with UniProt. x-axis: year (2002 to 2018); y-axis: number of sequences (up to 1,200,000,000); series: UniProt, MGnify.]

undergone convergent evolution to form a stable 3-on-3 α-helical sandwich fold. Interestingly, it was subsequently discovered that phycocyanins can aggregate forming clusters that then adhere to the membrane forming the so-called phycobilisomes. Such a functional relationship may indeed point to convergent evolution from a distant common ancestor.

The second example, which is extracted from the work of one of our groups (Tsigelny et al., 2000), illustrated how the combination and integration of different sources of information, including structural alignments, could help to functionally characterize a protein. In our work, two new EF-hand motifs were identified in acetylcholinesterase (AChE) and related proteins by combining the results from a hidden Markov model sequence search, Prosite pattern extraction, and protein structure alignments by CE. It was also found that the α–β hydrolase fold family, including acetylcholinesterases, contains putative Ca2+ binding sites, indicative of an EF-hand motif, and which in some family members may be critical for heterologous cell associations. This putative finding represented the second characterization of an EF-hand motif within an extracellular protein, which previously had only been found in osteonectins. Thus, structure alignment had contributed to our understanding of an important family of proteins.

Finally, the third example, also from a previous work of one of our groups (McMahon et al., 2005), combined information from structural alignments deposited in the DBAli database and experiments to analyze the sequence and fold diversity of a C-type lectin domain. We demonstrated that the C-type lectin fold adopted by a major tropism determinant sequence, a retroelement-encoded receptor binding protein, provides a highly static structural scaffold in support of a diverse array of sequences. Immunoglobulins are known to fulfill the same role of a scaffold supporting a large variety of sequences necessary for an antigenic response. C-type lectins were shown to represent a different evolutionary solution taken by retroelements to balance diversity against stability.

MULTIPLE STRUCTURE ALIGNMENT

Our discussions thus far have involved only pair-wise structure comparison and alignment, or at best, alignment of multiple structures to a single representative in a pair-wise fashion (i.e., progressive pair-wise structure alignment). Most of the available methods for multiple structure alignment start by computing all pair-wise alignments between a set of structures but then use them to generate the optimal consensus alignment between all the structures.

Figure 16.2. Structure alignment for c-phycocyanin (1CPC:A) (black) and colicin A (1COL:A) (gray) as computed by SALIGN. The alignment extended over 86 residues with a 0.97 Å RMSD. The sequence identity of the superposed residues with respect to the shorter of the two structures was 11.9%.

Sequence And Structure Alignments

Fig. Adapted from Chap.16, Structural Bioinformatics, 2nd Ed., Marti-Renom et al

• Statistical inference, accounting for uncertainty

• Use more information

Profile hidden Markov models

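Score a target sequence t by its probability under a model of homology to the query q, relative to a model of nonhomology:

$$\frac{P(t \mid \text{model of homology to } q)}{P(t \mid \text{model of nonhomology})} = \frac{P(t \mid H)}{P(t \mid R)}$$

$$S = \log \frac{P(t \mid H)}{P(t \mid R)}$$

An optimal alignment score instead uses the joint probability of t and the alignment π_o:

$$S = \log \frac{P(t, \pi_o \mid H)}{P(t \mid R)}$$

Optimal alignment scores are only an approximation, and the approximation breaks down on remote homologs.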

...GHRL...

...| |...

...GI-M...

Optimal alignment score (HMMs: "Viterbi" score, V), as used by BLAST (almost):

$$V = \log \frac{P(t, \pi_o \mid H)}{P(t \mid R)} = \log \frac{\max_{\pi} P(t, \pi \mid H)}{P(t \mid R)}$$

According to inference theory, the correct score is a log-odds ratio summed over all alignments (HMMs: "Forward" score, F), as used by HMMER:

$$F = \log \frac{P(t \mid H)}{P(t \mid R)} = \log \frac{\sum_{\pi} P(t, \pi \mid H)}{P(t \mid R)}$$

Depends on:
- a probability model of alignment, not just scores
- algorithms fast enough to use in practice
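To make the difference between the Viterbi and Forward scores concrete, here is a minimal Python sketch. It assumes a hypothetical target that has only four possible alignments, with made-up joint probabilities P(t, π | H), a made-up null probability P(t | R), and base-2 logarithms. A real profile HMM sums over all alignments by dynamic programming rather than by enumeration.

```python
import math

def viterbi_score(joint_probs, p_null):
    """Log-odds of the single best alignment: log2(max_pi P(t, pi | H) / P(t | R))."""
    return math.log2(max(joint_probs) / p_null)

def forward_score(joint_probs, p_null):
    """Log-odds summed over all alignments: log2(sum_pi P(t, pi | H) / P(t | R))."""
    return math.log2(sum(joint_probs) / p_null)

# Hypothetical values for illustration only (not taken from any real search).
joint_probs = [1e-6, 4e-7, 3e-7, 3e-7]  # P(t, pi | H) for four candidate alignments
p_null = 1e-9                           # P(t | R) under the nonhomology model

v = viterbi_score(joint_probs, p_null)
f = forward_score(joint_probs, p_null)
print(f"Viterbi score V = {v:.2f} bits")
print(f"Forward score F = {f:.2f} bits")
```

Because the sum includes the best alignment, F is never smaller than V. The gap grows when probability is spread over many plausible alignments, which is when the optimal-alignment approximation is weakest.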

Profile Hidden Markov Models - Encapsulate diversity

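Input multiple alignment (consensus columns assigned, defining inserts and deletes):

seq1  ACG-LD
seq2  SCG--E
seq3  NCGgFD
seq4  TCG-WQ
      123-45

[Figure: Plan7 core model, with begin (B) and end (E) states, match states M1 to M5, insert states I0 to I5, and delete states D1 to D5. Residue labels shown on the model: N T A S G, W F L Y, D E Q C.]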
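As an illustration of how consensus columns might be assigned from the toy alignment, the following Python sketch marks a column as a match (consensus) column when fewer than half of the sequences have a gap in it, then counts the residues observed in each match column. The 50% rule and the function names are assumptions made for this sketch. Real profile construction also applies sequence weighting and priors, which are omitted here.

```python
from collections import Counter

# Toy alignment from the slide; the lowercase 'g' marks an inserted residue.
alignment = {
    "seq1": "ACG-LD",
    "seq2": "SCG--E",
    "seq3": "NCGgFD",
    "seq4": "TCG-WQ",
}

def match_columns(seqs, max_gap_fraction=0.5):
    """Return indices of columns whose gap fraction is below the threshold (assumed rule)."""
    n = len(seqs)
    cols = []
    for i in range(len(seqs[0])):
        gaps = sum(1 for s in seqs if s[i] == "-")
        if gaps / n < max_gap_fraction:
            cols.append(i)
    return cols

def match_emission_counts(seqs, cols):
    """Count residues in each match column; a gap there corresponds to a delete state."""
    return [Counter(s[i].upper() for s in seqs if s[i] != "-") for i in cols]

seqs = list(alignment.values())
cols = match_columns(seqs)
counts = match_emission_counts(seqs, cols)
for k, (col, c) in enumerate(zip(cols, counts), start=1):
    print(f"M{k} (alignment column {col + 1}): {dict(c)}")
```

With this rule the fourth alignment column, where only seq3 has a residue, is treated as an insert, and the gap in seq2 at the following column is treated as a deletion, matching the "123-45" numbering on the slide.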

anecdotal search example: globin superfamily

query: alignment of three vertebrate hemoglobins and one myoglobin

target db: UniProt 7.0 (207K seqs; contains about 1060 known globins)

at E <= 0.01:
PSI-BLAST sees: 915 globins (9 sec)
HMMER3 sees: 1002 globins (8 sec)

[Figure: Aplysia myoglobin (PDB 1mba).]

[Figure: globin family tree with approximate divergence times marked at ~300 Mya, ~550 Mya, ~600-700 Mya?, ~1000 Mya? and ~2500 Mya?]

E-value (statistical significance):

Family                                Sequence     PSI-BLAST   HMMER
alpha hemoglobins                     HBA_HUMAN    4e-46       9e-62
                                      HBA_MOUSE    3e-42       4e-55
beta hemoglobins                      HBB_HUMAN    2e-57       4e-64
                                      HBB1_MOUSE   9e-50       2e-57
myoglobins                            MYG_HUMAN    1e-45       2e-58
                                      MYG_MOUSE    2e-41       6e-54
neuroglobins                          NGB_HUMAN    -           1e-7
                                      NGB_MOUSE    -           2e-7
plant leghaemoglobins                 LGB1_PEA     -           5e-5
                                      LGB2_PEA     -           5e-6
bacterial nitric oxide dioxygenases   HMP_VIBCH    1.1         0.004
                                      HMP_ECOLI    0.45        -

Projecting profile HMMs back onto structures

generated using http://www.skylign.org/3DPatch/

Jakubec D et al, Bioinformatics, 2018

Different HMMER search methods

• phmmer: single protein sequence against protein sequence database.

• hmmscan: single protein sequence against profile HMM library (Pfam, CATH-Gene3D, PIRSF, Superfamily and TIGRFAMs).

• hmmsearch: either multiple sequence alignment or profile HMM against protein sequence database.

• jackhmmer: iterative searches. Initiated with a single sequence, a profile HMM or a multiple sequence alignment against a target sequence database.
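For readers who run HMMER locally rather than through the web server, the sketch below shows one way to assemble the command line for each of the four programs in Python. It assumes HMMER 3 is installed and on the PATH, and all file names are hypothetical. Note that the command-line hmmsearch takes a profile HMM file (built from an alignment with hmmbuild), and that hmmscan expects the profile library first and the query sequence second.

```python
import subprocess

# Positional argument order per program; hmmscan takes the profile library before the query.
ARGUMENT_ORDER = {
    "phmmer": ("query", "target"),
    "hmmsearch": ("query", "target"),
    "jackhmmer": ("query", "target"),
    "hmmscan": ("target", "query"),
}

def build_command(program, query, target, evalue=0.01):
    """Build an argument list; -E sets the reporting E-value threshold."""
    if program not in ARGUMENT_ORDER:
        raise ValueError(f"unknown HMMER search program: {program}")
    files = {"query": query, "target": target}
    positional = [files[role] for role in ARGUMENT_ORDER[program]]
    return [program, "-E", str(evalue)] + positional

def run_search(program, query, target, evalue=0.01):
    """Run the search and return its text output (requires a local HMMER install)."""
    cmd = build_command(program, query, target, evalue)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Hypothetical file names, shown without executing anything.
print(" ".join(build_command("phmmer", "query.fasta", "uniprot.fasta")))
print(" ".join(build_command("hmmscan", "query.fasta", "Pfam-A.hmm")))
```

Keeping command construction separate from execution makes it possible to inspect or log the exact command before any search is launched.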

Find out more?

UNIT 3.15

The HMMER Web Server for Protein Sequence Similarity Search

Ananth Prakash,1 Matt Jeffryes,1 Alex Bateman,1 and Robert D. Finn1

1European Molecular Biology Laboratory, The European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom

Protein sequence similarity search is one of the most commonly used bioinformatics methods for identifying evolutionarily related proteins. In general, sequences that are evolutionarily related share some degree of similarity, and sequence-search algorithms use this principle to identify homologs. The requirement for a fast and sensitive sequence search method led to the development of the HMMER software, which in the latest version (v3.1) uses a combination of sophisticated acceleration heuristics and mathematical and computational optimizations to enable the use of profile hidden Markov models (HMMs) for sequence analysis. The HMMER Web server provides a common platform by linking the HMMER algorithms to databases, thereby enabling the search for homologs, as well as providing sequence and functional annotation by linking external databases. This unit describes three basic protocols and two alternate protocols that explain how to use the HMMER Web server using various input formats and user defined parameters. © 2017 by John Wiley & Sons, Inc.

Keywords: bioinformatics • homology • profile hidden Markov model • protein sequence analysis

How to cite this article: Prakash, A., Jeffryes, M., Bateman, A., & Finn, R. D. (2017). The HMMER web server for protein sequence similarity search. Current Protocols in Bioinformatics, 60, 3.15.1–3.15.23. doi: 10.1002/cpbi.40

INTRODUCTION

The HMMER Web server (http://www.ebi.ac.uk/Tools/hmmer/) is an open-access protein sequence similarity search tool that hosts a suite of HMMER algorithms to identify evolutionarily related proteins and/or domains by employing profile hidden Markov models (HMMs; APPENDIX 3A, Schuster-Bockler & Bateman, 2007) for fast and efficient detection of close and remote homologs. The HMMER Web server provides four search interfaces to the corresponding algorithms in the HMMER suite (http://hmmer.org): phmmer, hmmscan, hmmsearch, and jackhmmer. The functionality of these algorithms are outlined in Table 3.15.1. The HMMER Web server can work with various input formats and user-defined parameters to provide results that are presented to help infer protein sequence conservation, function, and evolution. This article provides detailed protocols for using the Web versions of PHMMER, HMMSCAN, and JACKHMMER algorithms, and ways to navigate and interpret the output.

Basic Protocol 1 and Alternate Protocol 1 describe in detail how to use the basic and advanced search features, respectively, in PHMMER, and interpret the results using a protein sequence as the starting point. The logical organization and interpretation of the output described in Basic Protocol 1 is common to all other protocols and is therefore described in detail; user is referred back to this section in subsequent protocols.

Current Protocols in Bioinformatics 3.15.1–3.15.23, December 2017. Published online December 2017 in Wiley Online Library (wileyonlinelibrary.com). doi: 10.1002/cpbi.40. Copyright © 2017 John Wiley & Sons, Inc.

Acknowledgements

EMBL-EBI, The Sequence Families team: Matthias Blum, Hsin-Yu Chang, Sara El-Gebali, Matthew Fraser, Jaina Mistry, Alex Mitchell, Gift Nuka, Typhaine Paysan-Lafosse, Sebastien Pesseat, Simon Potter, Matloob Qureshi, Lorna Richardson, Gustavo Salazar-Orejuela, Amaia Sangrador

Collaborators: Sean Eddy (Harvard University), Travis Wheeler (University of Montana)

InterPro

Demo

http://www.ebi.ac.uk/Tools/hmmer


hmmscan: single protein sequence against profile HMM library

hmmscan: search results

CFTR_RAT (P34158): ABC transporter, a chloride ion channel controlled by phosphorylation.

ABC transporter trans-membrane region

ABC transporter domain

jackhmmer: iterative searches. Initiated with a single sequence, a profile HMM or a multiple sequence alignment against a target sequence database

jackhmmer: iterative searches

phmmer: single protein sequence against protein sequence database

phmmer: search TRPA1_HUMAN (O75762)
