introduction to hmmer - a biosequence analysis tool with hidden markov models

INTRODUCTION TO HMMER Biosequence Analysis Using Profile Hidden Markov Models Anaxagoras Fotopoulos | 2014 Course: Algorithms in Molecular B

Upload: anax-fotopoulos

Post on 24-Apr-2015


DESCRIPTION

HMMER is used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs). Compared to BLAST, FASTA, and other sequence alignment and database search tools based on older scoring methodology, HMMER aims to be significantly more accurate and more able to detect remote homologs because of the strength of its underlying mathematical models. In the past, this strength came at significant computational expense, but in the new HMMER3 project, HMMER is now essentially as fast as BLAST. As part of this evolution in the HMMER software, we are committed to making the software available to as many scientists as possible. Earlier releases of HMMER were restricted to command line use. To make the software more accessible to the wide scientific community, we now provide servers that allow sequence searches to be performed interactively via the Web.

TRANSCRIPT

Page 1: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

INTRODUCTION TO HMMER

Biosequence Analysis Using Profile

Hidden Markov Models Anaxagoras Fotopoulos | 2014

Course: Algorithms in Molecular Biology

Page 2: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

A Brief History

Sean Eddy

HMMER 1.8, the first public release of HMMER, came in April 1995

“Far too much of HMMER was written in coffee shops, airport lounges, transoceanic flights, and Graeme Mitchison’s kitchen”

“If the world worked as I hoped, the combination of the book Biological Sequence Analysis and the existence of HMMER2 as a widely-used proof of principle should have motivated the widespread adoption of probabilistic modeling methods for sequence analysis.”

“BLAST continued to be the most widely used search program. HMMs widely considered as a mysterious and orthogonal black box.”

“NCBI, seemed to be slow to adopt or even understand HMM methods. This nagged at me; the revolution was unfinished!”

“In 2006 we moved the lab and I decided that we should aim to replace BLAST with an entirely new generation of software. The result is the HMMER3 project.”

Page 3: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Usage

HMMER searches sequence databases, or single sequences, for homologs of protein or DNA sequences by comparing them against a profile HMM.

It can also make sequence alignments.

Powerful when the query is an alignment of multiple instances of a sequence family.

Automated construction and maintenance of large multiple alignment databases. Useful to organize sequences into evolutionarily related families

Automated annotation of the domain structure of proteins by searching in protein family databases such as Pfam and InterPro 

Page 4: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

How it works

HMMER makes a profile HMM from a multiple sequence alignment.

A query is created that assigns a position-specific scoring system for substitutions, insertions and deletions.

Sequences that score significantly better against the profile HMM than against a null model are considered to be homologous.
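The scoring idea above can be sketched in a few lines of Python. This is only a toy illustration: the three-column profile, its emission probabilities and the uniform null model are made-up assumptions, and insert and delete states are left out entirely, so it is not how HMMER itself computes scores.

```python
import math

# Toy match-state emission probabilities for a 3-column profile over a
# reduced alphabet (hypothetical numbers, chosen only for illustration).
PROFILE = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

# Null model: every residue is equally likely at every position.
NULL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_bits(seq):
    """Sum of per-position log2(P(residue | profile) / P(residue | null))."""
    if len(seq) != len(PROFILE):
        raise ValueError("this toy model has no insert or delete states")
    return sum(math.log2(col[res] / NULL[res]) for col, res in zip(PROFILE, seq))


print(round(log_odds_bits("AGC"), 2))  # consensus-like sequence, positive score
print(round(log_odds_bits("TTT"), 2))  # unrelated sequence, negative score
```

A positive score means the profile explains the sequence better than the null model does; a negative score means the null model is the better explanation.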

HMMER3 also makes extensive use of SIMD vector instructions to increase computational speed. This builds on a significant acceleration of the Smith-Waterman algorithm for aligning two sequences (Farrar M, 2007).

Posterior probabilities of alignment are reported, enabling assessments on a residue-by-residue basis.

HMMER3 uses Forward scores rather than Viterbi scores, which improves sensitivity. Forward scores are better for detecting distant homologs.
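The difference between the two scores can be seen on a toy example. The sketch below assumes a made-up two-state HMM (the state names and all probabilities are hypothetical); Viterbi keeps only the single best state path, while Forward sums over every path, so the Forward probability can never be smaller.

```python
# Hypothetical two-state HMM used only to contrast Viterbi and Forward.
STATES = ("match", "background")
START = {"match": 0.5, "background": 0.5}
TRANS = {
    "match": {"match": 0.8, "background": 0.2},
    "background": {"match": 0.2, "background": 0.8},
}
EMIT = {
    "match": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}


def _run(seq, combine):
    """Shared dynamic programme; `combine` is max (Viterbi) or sum (Forward)."""
    prev = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for res in seq[1:]:
        prev = {
            s: EMIT[s][res] * combine(prev[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return combine(prev.values())


def viterbi_prob(seq):
    """Probability of the single most likely state path."""
    return _run(seq, max)


def forward_prob(seq):
    """Probability of the sequence summed over all state paths."""
    return _run(seq, sum)


seq = "AACGA"
print(viterbi_prob(seq), forward_prob(seq))
```

When many alternative alignments are almost equally good, as is typical for distant homologs, the best single path carries only a small share of the total probability, which is the intuition behind preferring Forward scores.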

Page 5: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Index of Commands (1/4)

Build models and align sequences (DNA or protein)

hmmbuild: Build a profile HMM from an input multiple alignment.

hmmalign: Make a multiple alignment of many sequences to a common profile HMM.

Page 6: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Index of Commands (2/4)

Search protein queries to protein databases

phmmer: Search a single protein sequence to a protein sequence database (like BLASTP).

jackhmmer: Iteratively search a protein sequence to a protein sequence database (like PSI-BLAST).

hmmsearch: Search a protein profile HMM against a protein sequence database.

hmmscan: Search a protein sequence against a protein profile HMM database.

hmmpgmd: Search daemon used for hmmer.org website.

Page 7: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Index of Commands (3/4)

Search DNA queries to DNA databases

nhmmer: Search DNA queries against DNA database (like BLASTN).

nhmmscan: Search a DNA sequence against a DNA profile HMM database.

Page 8: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Index of Commands (4/4)

Other Utilities

alimask: Modify alignment file to mask column ranges.

hmmconvert: Convert profile formats to/from HMMER3 format.

hmmemit: Generate (sample) sequences from a profile HMM.

hmmfetch: Get a profile HMM by name or accession from an HMM database.

hmmpress: Format an HMM database into a binary format for hmmscan.

hmmstat: Show summary statistics for each profile in an HMM database.

Page 9: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Basic Examples

with HMMER

hmmbuild [options] <hmmfile_out> <multiple sequence alignment file>

Most Used Options

-o <f> Direct the summary output to file <f>, rather than to stdout.

-O <f> Resave annotated, modified source alignments to a file <f> in Stockholm format.

--amino Specify that all sequences in msafile are proteins.

--dna Specify that all sequences in msafile are DNAs.

--rna Specify that all sequences in msafile are RNAs.

--pnone Don’t use any priors. Probability parameters will simply be the observed frequencies, after relative sequence weighting.

--plaplace Use a Laplace +1 prior in place of the default mixture Dirichlet prior.
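To make the two prior options concrete, here is a minimal Python sketch of what "observed frequencies" and a "Laplace +1 prior" mean for one alignment column. It assumes a four-letter DNA alphabet and ignores relative sequence weighting, so it illustrates the arithmetic only and is not hmmbuild's actual parameter estimation.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed alphabet for this illustration


def observed_frequencies(column):
    """No prior: probabilities are the raw observed frequencies."""
    counts = Counter(column)
    return {r: counts[r] / len(column) for r in ALPHABET}


def laplace_frequencies(column):
    """Laplace +1 prior: add one pseudocount per residue before normalising."""
    counts = Counter(column)
    total = len(column) + len(ALPHABET)
    return {r: (counts[r] + 1) / total for r in ALPHABET}


column = "AAAC"  # one alignment column from four toy sequences
print(observed_frequencies(column))
print(laplace_frequencies(column))
```

Without a prior, residues never seen in the column get probability zero, so any sequence carrying them at that position is ruled out; the pseudocounts keep every residue possible.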

> hmmbuild globins4.hmm tutorial/globins4.sto

Page 10: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Basic Examples

with HMMER

hmmbuild [options] <hmmfile_out> <multiple sequence alignment file>

> hmmbuild globins4.hmm tutorial/globins4.sto

Internal Use!

Page 11: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Basic Examples

with HMMER

hmmsearch [options] <hmmfile> <seqdb>

> hmmsearch globins4.hmm uniprot_sprot.fasta > globins4.out

Keynotes

hmmsearch accepts any FASTA file as target database input. It also accepts EMBL/UniProt text format.

-o <f> Direct the human-readable output to a file <f> instead of the default stdout.

-A <f> Save a multiple alignment of all significant hits (those satisfying inclusion thresholds) to the file <f>.

--tblout <f> Save a simple tabular (space-delimited) file summarizing the per-target output, with one data line per homologous target sequence found.

--domtblout <f> Save a simple tabular (space-delimited) file summarizing the per-domain output, with one data line per homologous domain detected in a query sequence for each homologous model.

• The most important number here is the sequence E-value.

• The lower the E-value, the more significant the hit.

• If both the full sequence E-value and the best single domain E-value are significant (<< 1), the sequence is likely to be homologous to your query.

• If the full sequence E-value is significant but the single best domain E-value is not, the target sequence is probably a multidomain remote homolog.
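The rules of thumb above can be applied mechanically to a per-target table. The sketch below assumes the usual space-delimited column order of the per-target table (target name, accession, query name, accession, then E-value, score and bias for the full sequence and for the best single domain). The sample line is made up for illustration and is not real hmmsearch output, and the 0.01 cut-off is an assumed stand-in for "<< 1".

```python
# Made-up data line in the assumed per-target table layout (not real output).
SAMPLE_LINE = (
    "toy_target_1 - globins4 - 1.2e-30 105.3 0.1 1.5e-30 105.0 0.1 "
    "1.0 1 0 0 1 1 1 1 made-up example protein"
)


def parse_tblout_line(line):
    """Parse one data line into a dict of the fields used below."""
    # The last column (description) may contain spaces, so split 18 times only.
    fields = line.split(None, 18)
    return {
        "target": fields[0],
        "query": fields[2],
        "full_evalue": float(fields[4]),
        "full_score": float(fields[5]),
        "dom_evalue": float(fields[7]),
        "dom_score": float(fields[8]),
    }


def read_tblout(lines):
    """Yield parsed hits, skipping comment lines (starting with '#') and blanks."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield parse_tblout_line(line)


def interpret(hit, threshold=0.01):
    """Apply the slide's rules of thumb; `threshold` is an assumed cut-off."""
    full_ok = hit["full_evalue"] < threshold
    dom_ok = hit["dom_evalue"] < threshold
    if full_ok and dom_ok:
        return "likely homolog"
    if full_ok:
        return "possible multidomain remote homolog"
    return "not significant"


for hit in read_tblout(["# comment line", SAMPLE_LINE]):
    print(hit["target"], interpret(hit))
```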

Search a protein profile HMM against a protein sequence database.

Page 12: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Basic Examples

with HMMER

phmmer [options] <seqfile> <seqdb>

> phmmer tutorial/HBB_HUMAN uniprot_sprot.fasta

Keynotes

• phmmer works essentially just like hmmsearch does, except you provide a query sequence instead of a query profile HMM.

• The default score matrix is BLOSUM62.

• Everything about the output is essentially as previously described for hmmsearch.

• jackhmmer searches a single sequence query iteratively against a sequence database (like PSI-BLAST).

Search protein sequence(s) against a protein sequence database.

jackhmmer [options] <seqfile> <seqdb>

> jackhmmer tutorial/HBB_HUMAN uniprot_sprot.fasta

Iterative protein searches

• The first round is identical to a phmmer search. All the matches that pass the inclusion thresholds are put in a multiple alignment.

• In the second (and subsequent) rounds, a profile is made from these results, and the database is searched again with the profile.

• Iterations continue either until no new sequences are detected or the maximum number of iterations is reached.
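The control flow of the rounds described above can be sketched as a simple loop. Everything here is hypothetical: `search_round` stands in for a real database search, the toy "database" is a small dictionary, and the included set is simply accumulated across rounds, which simplifies what jackhmmer actually does.

```python
def iterative_search(query, search_round, max_iterations=5):
    """Repeat searches until no new sequences appear or the round limit is hit.

    `search_round(members)` is a hypothetical callable returning the set of
    sequence names that pass the inclusion threshold for a model built from
    `members`. Returns the included set and the number of rounds run.
    """
    included = {query}
    for round_number in range(1, max_iterations + 1):
        hits = search_round(included)
        if hits <= included:  # nothing new was detected: converged
            return included, round_number
        included |= hits  # next round uses a model built from the larger set
    return included, max_iterations


# Toy stand-in for a database: each sequence "detects" only its neighbours.
NEIGHBOURS = {"q": {"a"}, "a": {"b"}, "b": {"c"}, "c": set()}


def toy_round(members):
    """A model built from `members` finds every neighbour of any member."""
    return set().union(*(NEIGHBOURS[m] for m in members))


print(iterative_search("q", toy_round))
print(iterative_search("q", toy_round, max_iterations=2))
```

In the toy run each round reaches one step further from the query, which mirrors how later rounds can pick up homologs that the single query sequence could not detect on its own.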

Page 13: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Basic Examples

with HMMER

> jackhmmer tutorial/HBB_HUMAN uniprot_sprot.fasta

jackhmmer [options] <seqfile> <seqdb>

Iterative protein searches

• This is telling you that the new alignment contains 936 sequences, your query plus 935 significant matches.

• For round two, it’s built a new model from this alignment.

• After round two, many more globin sequences have been found.

• After round five, the search ends because it has reached the default maximum of five iterations.

Page 14: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Basic Examples

with HMMER

> hmmalign globins4.hmm tutorial/globins45.fasta

hmmalign [options] <hmmfile> <seqfile>

Creating multiple alignments

A file with 45 unaligned globin sequences

Posterior Probability Estimate

Page 15: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Smart(Hmm)er

Create a tiny database

> hmmpress minifam

> hmmscan minifam tutorial/7LESS_DROME

> hmmsearch globins4.hmm uniprot_sprot.fasta

> cat globins4.hmm | hmmsearch - uniprot_sprot.fasta

> cat uniprot_sprot.fasta | hmmsearch globins4.hmm -

These three commands are identical.

> hmmfetch --index Pfam-A.hmm

> cat myqueries.list | hmmfetch -f Pfam.hmm - | hmmsearch - uniprot_sprot.fasta

This takes a list of query profile names/accessions in myqueries.list, fetches them one by one from Pfam, and does an hmmsearch with each of them against UniProt.

Page 16: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Latest Edition

Features

DNA sequence comparison. HMMER now includes tools that are specifically designed for DNA/DNA comparison: nhmmer and nhmmscan. The most notable improvement over using HMMER3’s tools is the ability to search long (e.g. chromosome length) target sequences.

More sequence input formats. HMMER now handles a wide variety of input sequence file formats, both aligned (Stockholm, Aligned FASTA, Clustal, NCBI PSI-BLAST, PHYLIP, Selex, UCSC SAM A2M) and unaligned (FASTA, EMBL, Genbank), usually with autodetection.

MSV stage of HMMER acceleration pipeline now even faster. Bjarne Knudsen, Chief Scientific Officer of CLC bio in Denmark, contributed an important optimization of the MSV filter (the first stage in the accelerated “filter pipeline”) that increases overall HMMER3 speed by about two-fold. This speed improvement has no impact on sensitivity.

Web implementation of hmmer

Page 17: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

http://hmmer.janelia.org/search/hmmsearch

Available Online

phmmer

hmmscan

hmmsearch

jackhmmer

Page 18: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

Advantages/Disadvantages

One limitation is that profile HMMs do not capture any higher-order correlations. An HMM assumes that the identity of a particular position is independent of the identity of all other positions.

Profile HMMs are often not good models of structural RNAs, for instance, because an HMM cannot describe base pairs.
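A small numerical example shows why the independence assumption hurts for base pairs. The data below are made up: two alignment columns that always form a complementary pair. A model that treats the columns independently, as a profile HMM does, gives a valid pair and an impossible pair exactly the same probability.

```python
from collections import Counter

# Hypothetical alignment of two paired columns: always complementary bases.
PAIRS = ["GC", "CG", "AU", "UA"] * 5


def column_frequencies(index):
    """Per-column residue frequencies, as a position-independent model sees them."""
    counts = Counter(pair[index] for pair in PAIRS)
    return {base: n / len(PAIRS) for base, n in counts.items()}


LEFT, RIGHT = column_frequencies(0), column_frequencies(1)


def independent_prob(pair):
    """Probability when each column is scored independently of the other."""
    return LEFT.get(pair[0], 0.0) * RIGHT.get(pair[1], 0.0)


def joint_prob(pair):
    """Empirical probability of the pair taken as a whole."""
    return PAIRS.count(pair) / len(PAIRS)


for pair in ("GC", "GG"):
    print(pair, independent_prob(pair), joint_prob(pair))
```

The independent model cannot tell the observed pair "GC" from the never-observed "GG", whereas the joint frequencies separate them clearly.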

The methods are consistent and therefore highly automatable, allowing us to make libraries of hundreds of profile HMMs and apply them on a very large scale to whole genome analysis.

HMMER can be used as a search tool for additional homologues.

Page 19: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models

More Information

http://hmmer.janelia.org

http://cryptogenomicon.org/

Page 20: Introduction to HMMER - A biosequence analysis tool with Hidden Markov Models


Thank you!

National & Kapodistrian University of Athens, Department of Informatics

Technological Education Institute of Athens, Department of Biomedical Engineering

Biomedical Research Foundation, Academy of Athens

Demokritos National Center for Scientific Research

Algorithms in Molecular Biology, Information Technologies in Medicine and Biology