interpro final print


Upload: venkatesh-mahadevan

Post on 04-Jun-2018


Page 1: InterPro Final Print

8/13/2019 InterPro Final Print

http://slidepdf.com/reader/full/interpro-final-print 1/9

Bioinformatics Practical No.

InterPro

Aim: Protein sequence analysis using InterPro.

1.  Introduction:

InterPro is a resource that provides functional analysis of protein sequences by classifying them into

families and predicting the presence of domains and important sites. To classify proteins in this way,

InterPro uses predictive models, known as signatures, provided by several different databases

(referred to as member databases) that make up the InterPro consortium. The aim of InterPro is to

combine their individual strengths to provide a single resource through which scientists can access

comprehensive information about protein families, domains and functional sites. 

The InterPro Consortium

The following databases make up the InterPro Consortium:

1)  PROSITE  is a database of protein families and domains. It consists of biologically significant

sites, patterns and profiles that help to reliably identify to which known protein family a new

sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva,

Switzerland.

2)  HAMAP  stands for High-quality Automated and Manual Annotation of Proteins. HAMAP

profiles are manually created by expert curators. They identify proteins that are part of

well-conserved protein families or subfamilies. HAMAP is based at the Swiss Institute of

Bioinformatics (SIB), Geneva, Switzerland.

3)  Pfam is a large collection of multiple sequence alignments and hidden Markov models covering

many common protein domains. Pfam is based at the Wellcome Trust Sanger Institute, Hinxton,

UK.

4)  PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs

used to characterise a protein family or domain. PRINTS is based at the University of

Manchester, UK.

5)  The ProDom  protein domain database consists of an automatic compilation of homologous

domains. Current versions of ProDom are built using a novel procedure based on recursive

PSI-BLAST searches. ProDom is based at PRABI Villeurbanne, France.


6)  SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation

of genetically mobile domains and the analysis of domain architectures. SMART is based at

EMBL, Heidelberg, Germany.

7)  TIGRFAMs  is a collection of protein families, featuring curated multiple sequence alignments,

hidden Markov models (HMMs) and annotation, which provides a tool for identifying

functionally related proteins based on sequence homology. TIGRFAMs is based at the J. Craig

Venter Institute, Rockville, MD, US.

8)  The PIRSF protein classification system is a network with multiple levels of sequence diversity

from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins

and domains. PIRSF is based at the Protein Information Resource, Georgetown University

Medical Centre, Washington DC, US.

9)  SUPERFAMILY  is a library of profile hidden Markov models that represent all proteins of

known structure. The library is based on the SCOP classification of proteins: each model

corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the

domain belongs to. SUPERFAMILY is based at the University of Bristol, UK.

10)  The CATH-Gene3D  database describes protein families and domain architectures in complete

genomes. Protein families are formed using a Markov clustering algorithm, followed by

multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence

domains is undertaken using hidden Markov model libraries representing CATH and Pfam

domains. CATH-Gene3D is based at University College London, UK.

11) PANTHER  is a large collection of protein families that have been subdivided into functionally

related subfamilies, using human expertise. These subfamilies model the divergence of specific

functions within protein families, allowing more accurate association with function, as well as

inference of amino acids important for functional specificity. Hidden Markov models (HMMs)

are built for each family and subfamily for classifying additional protein sequences. PANTHER is

based at the University of Southern California, CA, US.
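
Several of the member databases above describe families with sequence patterns. As a rough illustration of how a PROSITE-style pattern can be scanned against a sequence, the sketch below translates the pattern notation into a Python regular expression. The function name `prosite_to_regex`, the toy pattern and the toy sequence are all assumptions made for this example; they are not real PROSITE entries, and the translation covers only the basic notation (residues, `x`, `[...]`, `{...}`, repeat counts and the `<` / `>` anchors).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")  # patterns are conventionally ended with a period
    parts = []
    for element in pattern.split("-"):
        # '<' pins the pattern to the N-terminus, '>' to the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as (2) or (2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unrecognised pattern element: {element!r}")
        residues, repeat = match.groups()
        if residues == "x":
            core = "."  # any residue
        elif residues.startswith("{") and residues.endswith("}"):
            core = "[^" + residues[1:-1] + "]"  # any residue except those listed
        else:
            core = residues  # a single residue or an [ABC] alternative
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


# Hypothetical pattern and sequence, for illustration only.
toy_pattern = "C-x(2)-[ST]-{P}-C"
toy_sequence = "MKCAGSACW"

regex = prosite_to_regex(toy_pattern)
hit = re.search(regex, toy_sequence)
print(regex)                     # C.{2}[ST][^P]C
print(hit.group(), hit.start())  # CAGSAC 2
```

Profile and hidden Markov model signatures, as used by Pfam or SMART, score matches statistically and cannot be reduced to a regular expression in this way.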

Contents and coverage of InterPro 42.0

InterPro protein matches are calculated for all UniProtKB and UniParc proteins. The following

statistics are for all UniProtKB proteins.

InterPro release 42.0 contains 24622 entries (last entry: IPR027636), representing:

1)  Family (16547)

2)  Domain (6972)


3)  Repeat (273)

4)  Sites:

    a)  Active site (105)

    b)  Binding site (71)

    c)  Conserved site (639)

    d)  PTM (15)
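
As a quick consistency check on the figures above, the per-type counts can be summed and compared with the stated release total. The sketch below treats the four site types as subdivisions of "Sites"; the dictionary simply restates the numbers quoted in this section.

```python
# Entry counts per type, as quoted above for InterPro release 42.0.
counts = {
    "Family": 16547,
    "Domain": 6972,
    "Repeat": 273,
    "Active site": 105,
    "Binding site": 71,
    "Conserved site": 639,
    "PTM": 15,
}

site_types = ("Active site", "Binding site", "Conserved site", "PTM")

total = sum(counts.values())
sites = sum(counts[name] for name in site_types)

print(total)  # 24622, which agrees with the stated number of entries
print(sites)  # 830 entries describe sites of one kind or another
```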

InterPro cites 37735 publications in PubMed.

Features:

1)  InterProScan  is the software package that allows sequences to be scanned against InterPro's

signatures. The software is available:

-  as a web-based tool for the analysis of single protein sequences

-  programmatically via Web services that allow up to 25 sequences to be analysed per request

(both SOAP and REST-based services are available)

-  as a downloadable package for local installation from the EBI's FTP server.

InterProScan is run regularly against UniProtKB and the results are made available via the

InterPro website.

2)  In July 2009, a BioMart was added to the InterPro suite of services. BioMart provides users with

the ability to retrieve large sets of data, based on sophisticated queries that may incorporate

multiple filters. Users are able to specify precisely which fields are included in the results 

returned. The InterPro BioMart has been described previously, including a detailed explanation of

how to use the BioMart with several example queries. The most important benefit provided by

this feature is the ability to interrogate InterPro for multiple entries, proteins or member database

signatures in a single query, which is a feature not available from the main InterPro Web

interface.

3)  Utopia:  InterPro signature match data can be visualised on multiple sequence alignments and 3D

structures using Utopia tools.


4)  InterPro Text-based search:

Text search, using InterPro entry names and identifiers, UniProt accessions, GO terms, PDB

identifiers, or free text, to find information in InterPro relating to your query.
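
Because the Web services described above accept at most 25 sequences per request, a larger set has to be split before submission. The following is a minimal sketch of that batching step only; it does not contact any service. The names `Record`, `MAX_PER_REQUEST` and `split_into_requests`, and the toy records, are assumptions made for this example.

```python
from typing import Iterator, List, Sequence, Tuple

Record = Tuple[str, str]  # (identifier, amino-acid sequence)

MAX_PER_REQUEST = 25  # per-request limit quoted in the text above


def split_into_requests(
    records: Sequence[Record], limit: int = MAX_PER_REQUEST
) -> Iterator[List[Record]]:
    """Yield consecutive batches holding at most `limit` records each."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(records), limit):
        yield list(records[start:start + limit])


# Sixty placeholder records, for illustration only.
toy_records = [(f"seq{i}", "MKT") for i in range(60)]
batch_sizes = [len(batch) for batch in split_into_requests(toy_records)]
print(batch_sizes)  # [25, 25, 10]
```

For data sets much larger than this, the locally installed package or the BioMart mentioned above is likely to be the more practical route.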

Protocol: Using InterPro sequence analysis:

1.  Go to http://www.ebi.ac.uk/interpro/.

2.  Get a protein sequence in FASTA format from the NCBI site and paste it as a query sequence in the

space provided.

3.  Click “Search”.
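
Before pasting a sequence in step 2, it can help to confirm that the text really is a single FASTA record made up of amino-acid letters. The sketch below is one way to do that check locally. The function names, the toy record and the set of accepted letters (the 20 standard residues plus the ambiguity and rare-residue codes B, J, O, U, X and Z) are assumptions for this example and may not match exactly what the InterPro form accepts.

```python
from typing import Set, Tuple

# Assumed alphabet: 20 standard amino acids plus B, J, O, U, X, Z.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" + "BJOUXZ")


def parse_single_fasta(text: str) -> Tuple[str, str]:
    """Return (header, sequence) from text holding exactly one FASTA record."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    if any(line.startswith(">") for line in lines[1:]):
        raise ValueError("expected exactly one sequence record")
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("the record contains no sequence")
    return lines[0][1:], sequence


def unexpected_residues(sequence: str) -> Set[str]:
    """Return any characters that are not in the assumed protein alphabet."""
    return set(sequence) - ALLOWED_RESIDUES


# Hypothetical record, for illustration only.
toy_fasta = ">toy_protein hypothetical example\nMKTAYIAK\nQRQISFVK\n"
header, sequence = parse_single_fasta(toy_fasta)
print(header)                         # toy_protein hypothetical example
print(sequence)                       # MKTAYIAKQRQISFVK
print(unexpected_residues(sequence))  # set()
```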


Sample Result for Sequence analysis:

1.  Insulin [Crassostrea gigas] GenBank: EKC18433.1 protein sequence was used as query

sequence.

2.  The results obtained display protein family membership, domains and repeats, detailed

signature matches and gene ontology predictions for the protein.

3.  The gene ontology prediction includes:

-  Biological Process in which the query protein is involved

-  Molecular function of the query protein

-  Cellular component in which the query protein is located
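
The three Gene Ontology aspects listed above can be used to organise a set of predicted terms. The sketch below groups (GO identifier, term name, aspect) triples by aspect. The two input terms are real GO terms chosen for illustration; they are not copied from the result page for this query, and the function name `group_by_aspect` is an assumption for this example.

```python
from typing import Dict, Iterable, List, Tuple

ASPECTS = ("biological_process", "molecular_function", "cellular_component")

Annotation = Tuple[str, str, str]  # (GO identifier, term name, aspect)


def group_by_aspect(
    annotations: Iterable[Annotation],
) -> Dict[str, List[Tuple[str, str]]]:
    """Group GO annotations under the three GO aspects."""
    grouped: Dict[str, List[Tuple[str, str]]] = {aspect: [] for aspect in ASPECTS}
    for go_id, name, aspect in annotations:
        if aspect not in grouped:
            raise ValueError(f"unknown GO aspect: {aspect!r}")
        grouped[aspect].append((go_id, name))
    return grouped


# Illustrative input only; not the actual output for the query protein.
example_terms = [
    ("GO:0005179", "hormone activity", "molecular_function"),
    ("GO:0005576", "extracellular region", "cellular_component"),
]

grouped = group_by_aspect(example_terms)
for aspect in ASPECTS:
    print(aspect, grouped[aspect])
```

An aspect with an empty list, as for the biological process here, simply means that no term of that kind was supplied; not every protein receives predictions in all three aspects.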


 

[Figure: annotated InterPro result page. Callouts: hyperlinks to individual protein fingerprints from member databases; predicted molecular function of the protein; extracellular domain predicted.]


 

Using InterProScan: 

InterProScan (v4.8) is a protein sequence analysis application that combines different

protein signature recognition methods into one resource.

Protocol:

1)  Click on InterProScan on the home-page.

2)  Enter the query protein sequence in the space provided.

3)  Select the databases to search the query protein sequence against.

4)  Click Submit.


 

Interpretation

1)  The graphical output shows the protein signatures, from the selected signature databases,

that the query protein sequence matched.

2)  The source protein signature database is color coded, according to the legend displayed below the

results.

3)  The highlighted box on the left gives the InterPro accession number, e.g. IPR016179, and the

hyperlinks to the individual signature entries from the source databases.

4)  Hence, in this case, the query protein sequence can be said to have protein signature matches

from:


a)  Gene3D: No description

b)  Pfam: “Insulin”

c)  SMART: “Insulin/ Insulin-like growth factor”

d)  Superfamily: “Insulin like”

e)  PRINTS: “INSULINFAMLY”

f)  Prosite: “INSULIN”

g)  Panther: Insulin/ Insulin like growth factor

h)  PIR: Signal peptide
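
The list of matches above can be summarised by counting how many member databases independently point to the same family. The sketch below restates the matches from this section as data and reports which descriptions mention "insulin"; the function name `databases_mentioning` is an assumption for this example, and a simple keyword test like this is only a rough stand-in for reading the individual signature entries.

```python
from typing import List, Optional, Tuple

# (member database, signature description) as listed above;
# None marks a match that came without a description.
matches: List[Tuple[str, Optional[str]]] = [
    ("Gene3D", None),
    ("Pfam", "Insulin"),
    ("SMART", "Insulin/ Insulin-like growth factor"),
    ("Superfamily", "Insulin like"),
    ("PRINTS", "INSULINFAMLY"),
    ("Prosite", "INSULIN"),
    ("Panther", "Insulin/ Insulin like growth factor"),
    ("PIR", "Signal peptide"),
]


def databases_mentioning(
    keyword: str, hits: List[Tuple[str, Optional[str]]]
) -> List[str]:
    """Return the databases whose description contains the keyword."""
    keyword = keyword.lower()
    return [db for db, description in hits
            if description and keyword in description.lower()]


supporting = databases_mentioning("insulin", matches)
print(supporting)
print(len(supporting), "of", len(matches))  # 6 of 8
```

Agreement between several independently built signatures is the practical reason for scanning against more than one member database.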

Conclusion:

InterPro combines signatures from multiple, diverse databases into a single searchable

resource, reducing redundancy and helping users interpret their sequence analysis results. By

uniting the member databases, InterPro capitalises on their individual strengths, producing a

powerful diagnostic tool and integrated resource.

Application: 

InterPro is used by research scientists interested in the large-scale analysis of whole proteomes, genomes

and metagenomes, as well as researchers seeking to characterise individual protein sequences. Within the

EBI, InterPro is used to help annotate protein sequences in UniProtKB. It is also used by the Gene

Ontology Annotation group to automatically assign Gene Ontology terms to protein sequences.

References:

1) Hunter et al., InterPro in 2011: new developments in the family and domain prediction database.

Nucleic Acids Research, 2012, Vol. 40, Database issue, doi:10.1093/nar/gkr948

2) www.ebi.ac.uk/interpro