interpro final print
Post on 04-Jun-2018
8/13/2019 InterPro Final Print
http://slidepdf.com/reader/full/interpro-final-print 1/9
Bioinformatics Practical No.
InterPro
Aim: Protein sequence analysis using InterPro.
1. Introduction:
InterPro is a resource that provides functional analysis of protein sequences by classifying them into
families and predicting the presence of domains and important sites. To classify proteins in this way,
InterPro uses predictive models, known as signatures, provided by several different databases
(referred to as member databases) that make up the InterPro consortium. The aim of InterPro is to
combine their individual strengths to provide a single resource through which scientists can access
comprehensive information about protein families, domains and functional sites.
The InterPro Consortium
The following databases make up the InterPro Consortium:
1) PROSITE is a database of protein families and domains. It consists of biologically significant
sites, patterns and profiles that help to reliably identify to which known protein family a new
sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva,
Switzerland.
2) HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP
profiles are manually created by expert curators. They identify proteins that are part of
well-conserved protein families or subfamilies. HAMAP is based at the Swiss Institute of
Bioinformatics (SIB), Geneva, Switzerland.
3) Pfam is a large collection of multiple sequence alignments and hidden Markov models covering
many common protein domains. Pfam is based at the Wellcome Trust Sanger Institute, Hinxton,
UK.
4) PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs
used to characterise a protein family or domain. PRINTS is based at the University of
Manchester, UK.
5) The ProDom protein domain database consists of an automatic compilation of homologous
domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-
BLAST searches. ProDom is based at PRABI Villeurbanne, France.
6) SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation
of genetically mobile domains and the analysis of domain architectures. SMART is based at
EMBL, Heidelberg, Germany.
7) TIGRFAMs is a collection of protein families, featuring curated multiple sequence alignments,
hidden Markov models (HMMs) and annotation, which provides a tool for identifying
functionally related proteins based on sequence homology. TIGRFAMs is based at the J. Craig
Venter Institute, Rockville, MD, US.
8) The PIRSF protein classification system is a network with multiple levels of sequence diversity
from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins
and domains. PIRSF is based at the Protein Information Resource, Georgetown University
Medical Centre, Washington DC, US.
9) SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of
known structure. The library is based on the SCOP classification of proteins: each model
corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the
domain belongs to. SUPERFAMILY is based at the University of Bristol, UK.
10) The CATH-Gene3D database describes protein families and domain architectures in complete
genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-
linkage clustering according to sequence identity. Mapping of predicted structure and sequence
domains is undertaken using hidden Markov model libraries representing CATH and Pfam
domains. CATH-Gene3D is based at University College, London, UK.
11) PANTHER is a large collection of protein families that have been subdivided into functionally
related subfamilies, using human expertise. These subfamilies model the divergence of specific
functions within protein families, allowing more accurate association with function, as well as
inference of amino acids important for functional specificity. Hidden Markov models (HMMs)
are built for each family and subfamily for classifying additional protein sequences. PANTHER is
based at the University of Southern California, CA, US.
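To make the idea of a "signature" more concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. This is only an illustration under stated assumptions: the function name prosite_to_regex, the example pattern and the example sequence are all hypothetical, only the basic pattern syntax is handled (residues, x, [..] sets, {..} exclusions, repeat counts and the < and > anchors at the ends of the pattern), and it is not how InterPro or PROSITE run their own scans. Profile- and HMM-based signatures such as those in Pfam work differently and are not covered.

```python
import re

# One pattern element: a residue, 'x', a [..] set or a {..} exclusion,
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a regular expression."""
    pattern = pattern.strip().rstrip(".")
    anchored_start = pattern.startswith("<")  # '<' pins the match to the N-terminus
    anchored_end = pattern.endswith(">")      # '>' pins the match to the C-terminus
    pattern = pattern.strip("<>")

    pieces = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            piece = "."                        # any residue
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"    # any residue except those listed
        else:
            piece = core                       # single residue or [..] set
        if low and high:
            piece += "{" + low + "," + high + "}"
        elif low:
            piece += "{" + low + "}"
        pieces.append(piece)

    return ("^" if anchored_start else "") + "".join(pieces) + ("$" if anchored_end else "")


# Hypothetical pattern and sequence, chosen only to exercise the converter.
regex = prosite_to_regex("C-x(2)-[ST]-{P}-C")
print(regex)                                  # C.{2}[ST][^P]C
print(bool(re.search(regex, "AACKLSACGG")))   # True
```

A pattern match of this kind is a yes/no test on a short stretch of sequence, which is why pattern databases are typically complemented by profile and HMM methods for more divergent family members.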
Contents and coverage of InterPro 42.0
InterPro protein matches are calculated for all UniProtKB and UniParc proteins. The following
statistics are for all UniProtKB proteins.
InterPro release 42.0 contains 24622 entries (last entry: IPR027636), representing:
1) Family (16547)
2) Domain (6972)
3) Repeat (273)
4) Sites:
- Active site (105)
- Binding site (71)
- Conserved site (639)
- PTM (15)
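As a quick consistency check, the per-type counts listed above can be added up and compared with the stated total of 24622 entries. The snippet below uses only the figures given in this section; the variable names are arbitrary.

```python
# Entry counts for InterPro release 42.0, copied from the list above.
entry_counts = {
    "Family": 16547,
    "Domain": 6972,
    "Repeat": 273,
    "Active site": 105,
    "Binding site": 71,
    "Conserved site": 639,
    "PTM": 15,
}

site_types = ("Active site", "Binding site", "Conserved site", "PTM")

total_entries = sum(entry_counts.values())
site_entries = sum(entry_counts[name] for name in site_types)

print(total_entries)  # 24622, matching the stated release total
print(site_entries)   # 830 entries across the four site types
```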
InterPro cites 37735 publications in PubMed.
Features:
1) InterProScan is the software package that allows sequences to be scanned against InterPro's
signatures. The software is available:
- as a web-based tool for the analysis of single protein sequences
- programmatically via Web services that allow up to 25 sequences to be analysed per request
(both SOAP and REST-based services are available)
- as a downloadable package for local installation from the EBI's FTP server.
InterProScan is run regularly against UniProtKB and the results are made available via the
InterPro website.
2) In July 2009, a BioMart was added to the InterPro suite of services. BioMart provides users with
the ability to retrieve large sets of data, based on sophisticated queries that may incorporate
multiple filters. Users are able to specify precisely which fields are included in the results
returned. The InterPro BioMart has been described previously, including a detailed explanation of
how to use the BioMart with several example queries. The most important benefit provided by
this feature is the ability to interrogate InterPro for multiple entries, proteins or member database
signatures in a single query, which is a feature not available from the main InterPro Web
interface.
3) Utopia: InterPro signature match data can be visualised on multiple sequence alignments and 3D
structures using Utopia tools.
4) InterPro Text-based search:
Text search, using InterPro entry names and identifiers, UniProt accessions, GO terms, PDB
identifiers, or free text, to find information in InterPro relating to your query.
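Because the Web services described under feature 1 accept up to 25 sequences per request, a larger set of sequences has to be split into batches before submission. The sketch below shows only that batching step. It does not contact any service, and the function name batch_records, the record layout of (header, sequence) tuples and the placeholder data are assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Tuple

Record = Tuple[str, str]  # (FASTA header, amino-acid sequence)

MAX_PER_REQUEST = 25  # per-request limit quoted in the text above


def batch_records(records: Sequence[Record],
                  batch_size: int = MAX_PER_REQUEST) -> Iterator[List[Record]]:
    """Yield consecutive groups of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield list(records[start:start + batch_size])


# Sixty placeholder records (hypothetical data) split into request-sized groups.
demo = [(f"seq{i}", "MKT") for i in range(60)]
print([len(batch) for batch in batch_records(demo)])  # [25, 25, 10]
```

Each batch would then be submitted as a separate request, and the results collected per batch.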
Protocol: Using InterPro sequence analysis:
1. Go to http://www.ebi.ac.uk/interpro/.
2. Get a protein sequence in FASTA format from the NCBI site and paste it as a query sequence in the
space provided.
3. Click “Search”.
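Before pasting a sequence into the search box in step 2, it can help to confirm that the text really is FASTA-formatted protein sequence. The following is a minimal sketch of such a check, not part of InterPro itself: the function names, the accepted residue alphabet and the example record are assumptions, and the example sequence is a made-up placeholder rather than the real query used in this practical.

```python
from typing import List, Tuple

# Standard amino-acid codes plus common ambiguity codes and '*' (assumed alphabet).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZUO*")


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header line")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence: str) -> List[str]:
    """List the characters in sequence that are not accepted residue codes."""
    return sorted(set(sequence) - VALID_RESIDUES)


# Placeholder record (hypothetical header and sequence).
example = ">demo hypothetical protein\nMKTAAC\nGGSW\n"
for name, seq in parse_fasta(example):
    print(name, len(seq), invalid_residues(seq))  # demo hypothetical protein 10 []
```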
Sample Result for Sequence analysis:
1. The insulin protein sequence from Crassostrea gigas (GenBank: EKC18433.1) was used as the
query sequence.
2. The results obtained display protein family membership, domains and repeats, detailed
signature matches and gene ontology predictions for the protein.
3. The gene ontology prediction includes:
- Biological Process in which the query protein is involved
- Molecular function of the query protein
- Cellular component in which the query protein is located
Annotations on the sample result screenshot:
- Hyperlinks to individual protein fingerprints from member databases
- Predicted molecular function of the protein
- Extracellular domain predicted
Using InterProScan:
InterProScan (v4.8) is a protein sequence analysis application that combines different
protein signature recognition methods into one resource.
Protocol:
1) Click on InterProScan on the home page.
2) Enter the query protein sequence in the space provided.
3) Select the databases to search the query protein sequence against.
4) Click Submit.
Interpretation
1) The graphical output shows the protein signatures, from each of the selected signature
databases, that the query protein sequence matched.
2) The source protein signature database is colour-coded according to the legend displayed below the
results.
3) The highlighted box on the left gives the InterPro accession number (e.g. IPR016179) and the
hyperlinks to the individual signature entries from the source databases.
4) Hence, in this case the query protein sequence can be said to have protein signature matches
from:
a) Gene3D: No description
b) Pfam: “Insulin”
c) SMART: “Insulin/Insulin-like growth factor”
d) Superfamily: “Insulin like”
e) PRINTS: “INSULINFAMLY”
f) Prosite: “INSULIN”
g) Panther: “Insulin/Insulin like growth factor”
h) PIR: “Signal peptide”
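The matches listed above can also be held in a small data structure, which makes it easy to see how many member databases agree on the same annotation. The sketch below uses only the database names and descriptions from the list. The variable and function names are arbitrary, and the case-insensitive keyword test is a deliberately crude assumption, not how InterPro integrates signatures.

```python
from typing import List, Optional, Tuple

# (member database, signature description) pairs copied from the list above;
# None marks the Gene3D match, which had no description.
signature_matches: List[Tuple[str, Optional[str]]] = [
    ("Gene3D", None),
    ("Pfam", "Insulin"),
    ("SMART", "Insulin/Insulin-like growth factor"),
    ("Superfamily", "Insulin like"),
    ("PRINTS", "INSULINFAMLY"),
    ("Prosite", "INSULIN"),
    ("Panther", "Insulin/Insulin like growth factor"),
    ("PIR", "Signal peptide"),
]


def databases_mentioning(keyword: str,
                         matches: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Return the databases whose signature description contains keyword."""
    keyword = keyword.lower()
    return [db for db, description in matches
            if description is not None and keyword in description.lower()]


insulin_dbs = databases_mentioning("insulin", signature_matches)
print(len(signature_matches))  # 8 member databases reported a match
print(insulin_dbs)
# ['Pfam', 'SMART', 'Superfamily', 'PRINTS', 'Prosite', 'Panther']
```

Six of the eight matching databases describe the signature as insulin-related, which supports assigning the query to the insulin family.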
Conclusion:
InterPro combines signatures from multiple, diverse databases into a single searchable
resource, reducing redundancy and helping users interpret their sequence analysis results. By
uniting the member databases, InterPro capitalises on their individual strengths, producing a
powerful diagnostic tool and integrated resource.
Application:
InterPro is used by research scientists interested in the large-scale analysis of whole proteomes, genomes
and metagenomes, as well as researchers seeking to characterise individual protein sequences. Within the
EBI, InterPro is used to help annotate protein sequences in UniProtKB. It is also used by the Gene
Ontology Annotation group to automatically assign Gene Ontology terms to protein sequences.
References:
1) Hunter et al., InterPro in 2011: new developments in the family and domain prediction database.
Nucleic Acids Research, 2012, Vol. 40, Database issue. doi:10.1093/nar/gkr948
2) www.ebi.ac.uk/interpro