MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: use and pitfalls http://education.expasy.org/cours/CJSFEAP2010/

Post on 22-Dec-2015

TRANSCRIPT

Page 1: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

MCB September, 2010

Protein Sequence Databases

Marie-Claude.Blatter@isb-sib.ch

Swiss-Prot group, Geneva

SIB Swiss Institute of Bioinformatics

Protein sequence databases: use and pitfalls

http://education.expasy.org/cours/CJSFEAP2010/

Page 2:

Mr. Proteomics

Mr. Protein sequence databases

Page 3:

protein identification by database matching

[Figure: mass spectrometry analysis of samples with peptides. Two peak lists on a mass axis from 600 to 2200. First list, peak labels: 624.3, 769.8, 893.4, 1056.1, 1326.7, 1501.9, 1759.8, 2100.6. Second list, peak labels: 624.3, 769.8, 893.4, 994.5, 1056.1, 1326.7, 1501.9, 1759.8, 1923.4, 2100.6. Peptide sequences shown: TYGGAARGPGFK, PSTTGVEMFR, EHICLLGR, GANR.]

A large protein list is not the end point in proteomics

-> importance of protein sequence annotation
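The idea of identification by database matching can be sketched in a few lines: compute the theoretical mass of each candidate peptide and compare it with the observed peaks. This is a minimal sketch, not the algorithm of any particular search engine. It assumes unmodified peptides, singly charged [M+H]+ ions and a fixed tolerance in daltons; the function names `peptide_mh` and `match_peaks` are made up for the illustration.

```python
# Monoisotopic residue masses in daltons (standard values, 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water accounts for the free N- and C-termini
PROTON = 1.00728   # added for a singly charged [M+H]+ ion (assumption)


def peptide_mh(sequence):
    """Theoretical monoisotopic [M+H]+ value of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + PROTON


def match_peaks(peaks, peptides, tolerance=0.5):
    """Map each observed peak to the candidate peptides within `tolerance` Da."""
    matches = {}
    for peak in peaks:
        hits = [p for p in peptides if abs(peptide_mh(p) - peak) <= tolerance]
        if hits:
            matches[peak] = hits
    return matches


if __name__ == "__main__":
    # Peak labels and peptides as printed on the slide; the figure is
    # schematic, so no particular match between them is implied.
    peaks = [624.3, 769.8, 893.4, 994.5, 1056.1, 1326.7, 1501.9, 1759.8, 1923.4, 2100.6]
    peptides = ["TYGGAARGPGFK", "PSTTGVEMFR", "EHICLLGR", "GANR"]
    for p in peptides:
        print(p, round(peptide_mh(p), 3))
    print(match_peaks(peaks, peptides))
```

The candidate peptides come from the database that is searched, which is why the choice of database directly shapes the resulting protein list.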

Page 4:

New challenge

Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery

Page 5:

Many protein sequence databases…

• Which one contains the highest-quality data?
• Which one is comprehensive?
• Which one is up to date?
• Which one is redundant?
• Which one is indexed (allows complex queries)?
• Which Web server responds most quickly?
• Which one contains complete proteomes?
• …….??????

Page 6:

A HUPO test sample study reveals common problems in mass spectrometry–based proteomics

PubMed 19448641 (2009)

• A single mass spectrometry experiment can identify up to about 4000 proteins (15’000 peptides)

• Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (searching different protein databases can give different results).

• Only 7 labs (out of 27) were able to identify the 20 human proteins present in a sample, partly because the search engines used cannot distinguish among different identifiers for the same protein…

Page 7:

Awareness of the content and usage of knowledge resources is a pre-requisite to do any type of « serious » research in the field of molecular life sciences

(AMB, 2007)

Page 8:

Menu

Introduction

Nucleic acid sequence databases: ENA, GenBank, DDBJ

Protein sequence databases

UniProt databases (UniProtKB)

NCBI protein databases

Page 10:

Protein sequence origin

More than 99 % of the protein sequences are derived from the translation of nucleotide sequences (genomes and/or cDNAs)

-> Important to know where the protein sequence comes from… (sequencing & gene prediction quality) !

Page 11:

… ~ 2500 genomes sequenced (single organism, varying sizes, including viruses)

… ~ 5’000 ongoing genome sequencing projects

… cDNA sequencing projects (ESTs or cDNAs)

… metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms,

Page 12:

Metagenomics: study of genetic material recovered directly from environmental samples

• Global Ocean Sampling (C. Venter): 1 ml of sea water contains 1 million bacteria and 10 million viruses

• Whale fall (AAFZ00000000.1)

• Soil, sand beach, New York air, …

• Human fluids, mouse gut (millions of bacteria within human body)

• Water treatment industry…

• Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi

Venter’s Sorcerer II

Page 13:

… ~ 2500 genomes sequenced (single organism, varying sizes)

… ~ 5’000 ongoing genome sequencing projects

… cDNA sequencing projects (ESTs or cDNAs)

… metagenome sequencing projects

… personal human genomes

New-generation sequencers: Illumina: 25 billion bp/day

Page 14:

http://www.youtube.com/watch?v=mVZI7NBgcWM

2’000’000 $ (2007)

70’000’000 $ (diploid, 2007)

3’000’000’000 $ (public consortium, 2000)

300’000’000 $ (Celera, 2000)

2010

Page 15:

How many protein-coding genes in the end?

Page 16:

190'500'025'042

1st estimate: ~30 million species (1.8 million named)

2nd estimate:

20 million bacteria/archaea x 4'000 genes

1 million protists x 6'000 genes

5 million insects x 14'000 genes

2 million fungi x 6'000 genes

0.5 million plants x 20'000 genes

0.5 million molluscs, worms, arachnids, etc. x 20'000 genes

0.1 million vertebrates x 25'000 genes

The calculation: 2x10^7 x 4'000 + 1x10^6 x 6'000 + 5x10^6 x 14'000 + 2x10^6 x 6'000 + 5x10^5 x 20'000 + 5x10^5 x 20'000 + 1x10^5 x 25'000

+ 20'000 (Craig Venter) + 42 (Douglas Adams) + …
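The second estimate is plain arithmetic and can be checked with a short script. The sketch below only multiplies and sums the per-group figures given on the slide; the tongue-in-cheek addends on the last line are left out, and the names `ESTIMATE` and `total_genes` are made up for the illustration.

```python
# (number of species, assumed protein-coding genes per species), as on the slide
ESTIMATE = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (1_000_000, 6_000),
    "insects": (5_000_000, 14_000),
    "fungi": (2_000_000, 6_000),
    "plants": (500_000, 20_000),
    "molluscs, worms, arachnids, etc.": (500_000, 20_000),
    "vertebrates": (100_000, 25_000),
}


def total_genes(estimate):
    """Sum of species count times genes per species over all groups."""
    return sum(species * genes for species, genes in estimate.values())


total = total_genes(ESTIMATE)
species = sum(count for count, _ in ESTIMATE.values())
print(f"{species:,} species")        # 29,100,000 species
print(f"{total:,} predicted genes")  # 190,500,000,000 predicted genes
```

The group terms alone give about 190.5 billion, with bacteria/archaea (80 billion) and insects (70 billion) accounting for most of it.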

Page 17:

About 190 billion proteins (?)

About 12.0 million ‘known’ protein sequences in 2010 (from ~290’000 species)

More than 99 % of the protein sequences are derived from the translation of nucleotide sequences

Less than 1 % direct protein sequencing (Edman, MS/MS…)

-> It is important that users know where the protein sequence comes from… (sequencing & gene prediction quality) !

Page 18:

Menu

Introduction

Nucleic acid sequence databases: ENA/GenBank, DDBJ

Protein sequence databases

UniProt databases (UniProtKB)

NCBI protein databases

Page 19:

ENA (EMBL-Bank) GenBank

DDBJ

Page 20:

http://www.insdc.org/

ENA/GenBank/DDBJ

Page 21:

cDNAs, ESTs, genes, genomes, …

ENA, GenBank, DDBJ

Data not submitted to public databases, delayed or cancelled…

The hectic life of a sequence …

archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.

Page 22:

Journals do not (SHOULD NOT) accept a paper dealing with a nucleic acid sequence if the ENA/GenBank/DDBJ AC number is not available…

‘journal publishers generally require deposition prior to publication so that an accession number can be included in the paper.’

…not the case yet for protein sequences !!!

Page 23:

taxonomy

Cross-references

references

accession number

Page 24:

CDS annotation (prediction or experimentally determined)

sequence

CDS: CoDing Sequence (proposed by submitters)

annotation provided by the laboratories that did the sequencing

Page 25:

CONTIG --------------------------------------------------------------------------------------CGANGGCCTATCAACAATGAAAGGTCGAAACCTG

Genomic AGCTACAAACAGATCCTTGATAATTGTCGTTGATTTTACTTTATCCTAAATTTATCTCAAAAATGTTGAAATTCAGATTCGTCAAGCGAGGGCCTATCAACAATG-AAGGTCGAAACCTG *** ************ ** * **************

 CONTIG CGTTTACTCCGGATACAAGATCCACCCAGGACACGGNAAAGAGACTTGTCCGTACTGACGGAAAG-------------------------------------------------------Genomic CGTTTACTCCGGATACAAGATCCACCCAGGACACGG-AAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTA ************************************ **************************** CONTIG ------------------------------------------------------------------------------------------------------------------------Genomic TATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTT

CONTIG ----------------------------------------------GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGACGenomic GAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC ************************************************************************** CONTIG TGTCCTCTACAGAATCAAGAACAAGAAG---------------------------------------------GGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTCGenomic TGTCCTCTACAGAATCAAGAACAAGAAGGTACTTGAGATCCTTAAACGCAGTTGAAAATTGGTAATTTTACAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC

**************************** ***********************************************

CONTIG CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGAGenomic CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA ************************************************************************************************************************

CONTIG TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTNCCAACAAG-----------------------------------------------------------------------------Genomic TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTGCCAACAAGGTAAACTTTCTACAATATTTATTATAAACTTTAGCATGCTGTTAGAGCTTGTAAGGTATATGTGATTTTACGAGTGT ********************************** ******** CONTIG -------------------------------------------------------------------------------------------------------------------GNAAAGenomic GTTATTTGAAGCTGTAATATCAATAAGCATGTCTCGTGTGAAGTCCGACAATTTACCATATGCATGAAATTTAAAAACAAGTTAATTTTGTCAATTCTTTATCATTGGTTTTCAGGAAAA * ***

CONTIG GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATNTNAAGACTGCTGCTCCNCGTGTCGGNGGAAANCGATAAACGTTCTCGGNCCCGTTATTGTAATAAATTTTGTTGAC

Genomic GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATGTGAAGACTGCTGCTCCACGTGTCGGAGGAAAGCGATAAACGTTCTCGGTCCCGTTATTGTAATAAATTTTGTTGAC******************************************* * ************** ******** ***** **** * *********** ***************************

 CONTIG C-----------------------------------------------------------------------------------------------------------------------Genomic CGTTAAAGTTTTAATGCAAGACATCCAACAAGAAAAGTATTCTCAAATTATTATTTTAACAGAACTATCCGAATCTGTTCATTTGAGTTTGTTTAGAATGAGGACTCTTCGAATAGCCCA *  

CoDing Sequence

Alignment between an mRNA and a genomic sequence

[Figure labels on the alignment: exon, intron]
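The exon/intron structure can be read off such an alignment mechanically: long stretches where the mRNA line shows only gaps are introns, and the aligned blocks in between are exons. The function below is a minimal sketch of that reading, not the method used by any annotation pipeline. The name `exon_blocks` and the `min_intron` threshold (used to ignore single-base gaps caused by sequencing errors) are assumptions made for the illustration.

```python
def exon_blocks(aligned_mrna, aligned_genomic, min_intron=20, gap="-"):
    """Return exon blocks as (start, end) genomic coordinates, 0-based, end-exclusive.

    Both arguments are rows of a pairwise alignment and must have equal length.
    """
    if len(aligned_mrna) != len(aligned_genomic):
        raise ValueError("aligned sequences must have the same length")
    blocks = []
    pos = 0        # position in the ungapped genomic sequence
    start = None   # genomic start of the block currently being extended
    for m, g in zip(aligned_mrna, aligned_genomic):
        aligned = m != gap and g != gap
        if aligned and start is None:
            start = pos
        elif not aligned and start is not None:
            blocks.append([start, pos])
            start = None
        if g != gap:
            pos += 1
    if start is not None:
        blocks.append([start, pos])
    # Merge blocks separated by gaps too short to be introns (small indels).
    merged = []
    for block in blocks:
        if merged and block[0] - merged[-1][1] < min_intron:
            merged[-1][1] = block[1]
        else:
            merged.append(block)
    return [tuple(b) for b in merged]


if __name__ == "__main__":
    mrna    = "ACGTACGT" + "-" * 30 + "GGCCGGCC"
    genomic = "ACGTACGT" + "T" * 30 + "GGCCGGCC"
    print(exon_blocks(mrna, genomic))  # [(0, 8), (38, 46)]
```

The intervals between consecutive blocks are the inferred introns. Ambiguous bases (N) in the mRNA are treated here as aligned positions.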

Page 26:

CDS translation provided by ENA

CDS provided by the submitters

The first Met !
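Translating a CDS is itself simple; the hard part is choosing where it starts. The sketch below translates from the first ATG it finds with the standard genetic code, which is one naive convention and an assumption of this example, not necessarily what a submitter or ENA does. The function name `translate_from_first_met` is made up for the illustration; the test fragment is taken from the start of the mRNA line in the alignment slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from_first_met(mrna):
    """Translate from the first ATG up to the first stop codon (or the end)."""
    seq = mrna.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for codons with N etc.
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    fragment = "CAACAATGAAAGGTCGAAACCTG"
    print(translate_from_first_met(fragment))  # MKGRNL
```

The first ATG in a submitted sequence is not guaranteed to be the true initiation site, for example when the 5' end of the mRNA is incomplete, so a translation obtained this way is only as good as the CDS boundaries behind it.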

Page 27:

Very rarely done…

Pitfall no 1 – gene prediction ‘quality’

! ? !

Page 28:

Complete genome (submitted) but only ~ 2,000 CDS/proteins available !

Pitfall no 2 – CDS annotation and submission

! ? !

Page 29:

http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

…annotated CDS in UniProtKB (no gene prediction) (~290’000 species)

Page 30:

From nucleic acid to amino acid sequence databases…

Page 31:

The hectic life of a protein sequence …

cDNAs, ESTs, genomes, …

ENA, GenBank, DDBJ

Data not submitted to public databases, delayed or cancelled…

…if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries)

Protein sequence databases

Nucleic acid databases

Gene prediction: RefSeq, Ensembl

no CDS

Page 32:

Why do things in a simple way when you can do them in a very complex one?

Page 33:

The hectic life of a sequence …

TrEMBL Genpept

CoDing Sequences provided by submitters

cDNAs, ESTs, genomes, …

ENA, GenBank, DDBJ

Data not submitted to public databases, delayed or cancelled…

Swiss-Prot

RefSeq PRF

Scientific publications derived sequences

Ensembl

CCDS

UniParc

UniProtKB

PDB(PIR)

+ all ‘species’ specific databases (EcoGene, TAIR, …)

(IPI)

UniMES

CoDing Sequences provided by submitters

and gene prediction

TPA

Page 34:

Major protein sequence database ‘sources’

UniProtKB: Swiss-Prot + TrEMBL

NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA

PIR PDB PRF

UniProtKB/Swiss-Prot: manually annotated protein sequences (12’200 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundant with Swiss-Prot (290’000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (230’000 species ?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Databank: 3D data and associated sequences (PIR-NRL3D)

PRF: Protein Research Foundation : journal scan of ‘published’ peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)

TPA: Third Party Annotation Sequence Database (update of entries derived from GenBank primary data)

Integrated resources

‘cross-references’

Resources kept separated

Page 35:

Ensembl: UniProtKB + RefSeq + gene prediction (40 species)

Vega: Ensembl + gene prediction + manual annotation (5 species)

IPI: UniProtKB + RefSeq + Ensembl + TAIR (Arabidopsis db) + H-InvDB (human cDNAs manual annotation) + VEGA (Vertebrate Genome Annotation) (7 species)

Closure in 2010 !!!

CCDS: consensus between EBI, NCBI, Sanger, UCSC (3 species)

Others:

OWL: Swiss-Prot + PIR + PDB + GenPept (obsolete)

MSDB (Mascot): Swiss-Prot + PIR + PDB + TrEMBL + GenBank…

dbESTs: translated ESTs (in the 6 frames; no annotated CDSs, no gene prediction)

Major protein sequence database ‘composite’

Page 36:

Phenyx: UniProtKB, IPI, NCBInr

Mascot: NCBInr, Swiss-Prot, dbEST

Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept

Aldente: UniProtKB.

ProFound: NCBInr, Swiss-Prot, dbEST

OMSSA: NCBInr and RefSeq.

Different protein databases available for different online proteomic tools

! ? !

Page 37:

Databases used

Study done on 28 proteomic papers (from 2010):

• The majority of labs (~61%) use IPI

• ~18% use Swiss-Prot (mainly human, some bacteria)

• ~20% use other sources such as NCBI, SGD or in-house developed databases

Personal communication: Silvia Jimenez, Nov 2010

Page 38:

Menu

Introduction

Nucleic acid sequence databases: ENA/GenBank, DDBJ

Protein sequence databases

UniProt databases (UniProtKB)

NCBI protein databases

Page 39:

UniProt

SIB + EBI + PIR

Page 40:

UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~12 million entries)

UniParc: protein sequence archive (ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated (query, Blast, download) (~25 million entries)

UniRef: 3 sets of clusters of protein sequences at 100, 90 and 50 % sequence identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries)

UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)

Page 41:

UniProt databases

Page 42:

UniProtKB: an encyclopedia on proteins

composed of 2 sections: UniProtKB/TrEMBL and UniProtKB/Swiss-Prot

released every 4 weeks

Page 43:

UniProtKB

from ENA to TrEMBL

UniProtKB protein sequence data are mainly derived from ENA (CDS) but also from Ensembl and other sequence resources such as RefSeq or model organism databases (MODs).

Data from the PIR database have been integrated in UniProt since 2003.

Page 44:

TrEMBL

ENA

Automated extraction of protein sequence (translated CDS), gene name and references.

+ Automated annotation

Page 45:

The quality of UniProtKB/TrEMBL data, including the protein sequence, is directly dependent on the information provided by the submitter of the original nucleotide entry.

Automated annotation
• Redundancy check (100% merge)
• Family attribution (InterPro)
• Many other cross-references
• Rule-based automated annotation
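The redundancy check listed above ("100% merge") can be pictured as grouping entries whose sequences are strictly identical. The sketch below does only that: it keys on the sequence string alone and ignores the source species and any other criteria the real TrEMBL pipeline may apply. The function name `merge_identical` and the identifiers in the example are hypothetical.

```python
from collections import defaultdict


def merge_identical(entries):
    """Group (identifier, sequence) pairs that share exactly the same sequence.

    Returns a dict mapping each distinct sequence to the sorted list of
    identifiers that carry it.
    """
    groups = defaultdict(list)
    for identifier, sequence in entries:
        groups[sequence.upper()].append(identifier)
    return {seq: sorted(ids) for seq, ids in groups.items()}


if __name__ == "__main__":
    # Hypothetical identifiers and sequences, for illustration only.
    entries = [
        ("cds_1", "MKGRNLRLLR"),
        ("cds_2", "MKGRNLRLLR"),   # identical to cds_1: merged
        ("cds_3", "MKGRNLRLLK"),   # one residue differs: kept separate
    ]
    for seq, ids in merge_identical(entries).items():
        print(seq, ids)
```

A single differing residue keeps two entries apart, so a merge at 100% removes exact duplicates only; variants, fragments and conflicting CDS submissions remain as separate entries.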

! ? !

Page 46:

Page 47:

UniProtKB

from TrEMBL to Swiss-Prot

Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL

-> minimal redundancy

Page 48:

TrEMBL

ENA

Automated extraction of protein sequence (translated CDS), gene name and references.

+ Automated annotation

Manual annotation of the sequence and associated biological information

Swiss-Prot

Page 49:

Sequence

Sequence features

Ontologies

References

Nomenclature

Splice variants

Annotations

Page 50:

UniProtKB: from TrEMBL to Swiss-Prot

Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)

2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)

Page 51:

UniProtKB: from TrEMBL to Swiss-Prot

Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)

2. Biological information (extract literature information, ortholog data propagation, protein sequence analysis…)

Page 52:

The displayed sequence is the most prevalent protein sequence and/or the protein sequence which is also found in orthologous species.

The displayed sequence is generally derived from the translation of the genomic sequence (when available).

Sequence differences are documented.

1 entry <-> 1 gene (1 species) 1 displayed sequence

(annotation of alternative sequences, when available)

UniProtKB/Swiss-Prot: Protein sequence annotation

Page 53:

What is the current status?

• At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence.

• Typical problems
– unsolved conflicts;
– uncorrected initiation sites;
– frameshifts;
– other ‘problems’

Page 54:

… once upon a time, it was a gene on chromosome 11…

Page 55:

! ? !

… once upon a time, it was a gene on chromosome 11…

All these sequences are available in protein sequence databases (e.g. GenPept) !!!

Page 56:

Quality of protein information from genome projects

• Let's look at proteins originating from genome projects:

– Drosophila: the paradigm of what a curated genome should look like (thanks to FlyBase): only 1.8% of the gene models conflict with Swiss-Prot sequences;

– Arabidopsis: a typical example of a genome where a lot of annotation was done when it was sequenced, but with no update since then (at least in the public view): 20% of the gene models are erroneous;

– Tetraodon nigroviridis: the typical example of a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins.

– Bacteria and Archaea have almost no splicing, so predictions are “easier”; however, errors are still made: start codons, missed small proteins (<100 aa)…

Page 57:

UniProtKB/Swiss-Prot: Protein sequence annotation

Page 58:

Example of problem (derived from gene prediction pipeline)

Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences.

ID   URAD_HUMAN            Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;

In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.
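An entry like the one above is easy to inspect programmatically, because each line starts with a two-letter line code followed by three spaces. The sketch below groups lines by code and pulls out the accession, the cross-referenced databases and the protein existence level. It is a minimal reader written for this excerpt, not a complete parser of the UniProtKB flat-file format; the function name `parse_flat_entry` is made up for the illustration.

```python
from collections import defaultdict

# Excerpt of the entry shown on the slide (elided lines omitted).
ENTRY = """\
ID   URAD_HUMAN            Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
"""


def parse_flat_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        # A data line is: two letters, three spaces, then the content.
        if len(line) > 5 and line[:2].isalpha() and line[2:5] == "   ":
            fields[line[:2]].append(line[5:].strip())
    return dict(fields)


record = parse_flat_entry(ENTRY)
accession = record["AC"][0].rstrip(";")
pe_level = int(record["PE"][0].split(":")[0])
databases = [dr.split(";")[0] for dr in record["DR"]]

print(accession, pe_level, databases)  # A6NGE7 4 ['EMBL', 'Ensembl', 'HGNC']
```

Here the warning signs are all machine-readable: the entry is unreviewed, its EMBL cross-reference says NOT_ANNOTATED_CDS, and the protein existence level is 4 (predicted).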

Page 59:

• Producing a clean set of sequences is not a trivial task;

• It is not getting easier as more and more types of sequence data are submitted;

Page 60:

• The ‘Protein existence’ tag indicates the type of evidence for the existence of a given protein (it is not associated with sequence quality);

• Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location), …)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58 %)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)

‘Protein existence’ tag

Page 61:

Page 62:

To avoid ‘pseudogenes’ and most of the improbable protein sequences, you can filter your query to exclude sequences whose ‘protein existence’ tag is ‘Uncertain’
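The same filter can be applied to a local list of entries once the protein existence level of each one is known. The sketch below is an offline illustration only and does not reproduce the query syntax of the UniProt website. The record layout, the function name `drop_by_existence` and the accessions other than A6NGE7 are hypothetical.

```python
# The five 'protein existence' qualifiers, keyed by their level number.
PE_LEVELS = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}


def drop_by_existence(entries, excluded=("Uncertain",)):
    """Keep only the entries whose protein existence qualifier is not excluded."""
    return [e for e in entries if PE_LEVELS[e["pe"]] not in excluded]


if __name__ == "__main__":
    entries = [
        {"accession": "A6NGE7", "pe": 4},    # the predicted entry shown earlier
        {"accession": "EXAMPLE1", "pe": 1},  # hypothetical
        {"accession": "EXAMPLE2", "pe": 5},  # hypothetical
    ]
    kept = drop_by_existence(entries)
    print([e["accession"] for e in kept])  # ['A6NGE7', 'EXAMPLE1']
    stricter = drop_by_existence(entries, excluded=("Uncertain", "Predicted"))
    print([e["accession"] for e in stricter])  # ['EXAMPLE1']
```

Excluding only ‘Uncertain’ still keeps predicted entries such as A6NGE7; a stricter search database can be built by also excluding ‘Predicted’, at the cost of losing real proteins that simply lack evidence so far.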

Page 63:

The ‘alternative’ sequence(s)

Page 64:

(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).

Proteome complexity: example with human

Not predictable at the genome level! -> important post-genomic data!

~20’000

Page 65:

Multiple alignment of the end of the available GCR sequences

Annotation of the sequence differences (protein diversity)

1 entry <-> 1 gene (1 species)

…and natural variant

Page 66: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

P04150

www.uniprot.org

Page 67: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Available in separate files!

Important remark

> 30’000 additional sequences (total)

Page 68: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

The ‘alternative’ sequence(s)

Not ‘directly available’ to many tools, including protein identification tools and Blast, depending on the server!
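Since the additional isoform sequences are distributed in separate files, one possible workaround is to combine the canonical and isoform FASTA files into a single search database. The sketch below assumes isoform headers use identifiers of the form `P04150-2`; the sequences are toy strings, and the function names are illustrative.

```python
def read_fasta(text):
    """Parse FASTA text into a dict {header: sequence}, preserving order."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def merge_fasta(canonical_text, isoform_text):
    """Combine canonical and additional isoform records into one FASTA string."""
    merged = read_fasta(canonical_text)
    for header, seq in read_fasta(isoform_text).items():
        merged.setdefault(header, seq)  # keep the first copy of a duplicate header
    return "".join(f">{h}\n{s}\n" for h, s in merged.items())


# Toy sequences, not real protein data.
canonical = ">P04150\nAAAAKKKKLL\n"
isoforms = ">P04150-2\nAAAAKKKK\n>P04150-3\nAAAAK\n"
combined = merge_fasta(canonical, isoforms)
```

The combined text can then be written to a file and indexed by whichever search tool is in use.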

Page 69: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniProtKB: from TrEMBL to Swiss-Prot

Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)

2. Biological information (extract literature information, ortholog data propagation, protein sequence analysis…)

Page 70: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniProtKB/Swiss-Prot gathers data from multiple sources:

- publications (literature/PubMed)
- prediction programs (Prosite, Anabelle)
- contacts with experts
- other databases
- nomenclature committees

Maximum usage of controlled vocabulary: Keywords, Tissues, Post-translational modifications, Strains, Species, Subcellular location, Extracellular domains, Journals, …

Gene Ontology…

Extract literature information and protein sequence analysis

Page 71: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Protein nomenclature

Page 72: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

…enable researchers to obtain a summary of what is known about a protein…

General annotation

(Comments)

www.uniprot.org

Page 73: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Sequence annotation

(Features)

…enable researchers to obtain a summary of what is known about a protein…

www.uniprot.org

Page 74: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Ontologies

Swiss-Prot keywords

Gene Ontology (GO terms)

Page 75: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Human protein manual annotation: some statistics (Aug 2010)

Page 76: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Non-experimental qualifiers

UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between both.

Level | Type of evidence | Qualifier
1st | Strong experimental evidence | (no qualifier)
2nd | Light experimental evidence | Probable
3rd | Inferred by similarity with homologous protein (data of 1st or 2nd level) | By similarity
4th | Inferred by sequence prediction | Potential
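The four levels listed above can be sketched as a small lookup. This is only an illustration: it assumes the qualifier appears in parentheses at the end of an annotation text (for example `Nucleus (By similarity).`), and the function name is hypothetical.

```python
# Evidence levels as listed above; level 1 carries no qualifier.
QUALIFIER_LEVEL = {
    "Probable": 2,
    "By similarity": 3,
    "Potential": 4,
}


def evidence_level(annotation):
    """Return the evidence level (1-4) implied by a trailing '(Qualifier)'."""
    text = annotation.strip().rstrip(".").lower()
    for qualifier, level in QUALIFIER_LEVEL.items():
        if text.endswith(f"({qualifier.lower()})"):
            return level
    return 1  # no non-experimental qualifier found


level = evidence_level("Nucleus (By similarity).")
```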

Page 77: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Do proteomic analysis tools make use of sequence annotation?

Phenyx: UniProtKB**, IPI, NCBInr
Mascot: NCBInr, Swiss-Prot*, dbEST, OWL, MSDB
Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL
Aldente: UniProtKB**
ProFound: NCBInr, Swiss-Prot, dbEST
OMSSA: NCBInr and RefSeq

* the tool takes into account AP annotations
** the tool takes into account AP, PTM, … but not yet variant/conflict annotations

Page 78: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Identification of biologically active proteins: use Swiss-Prot annotations with Phenyx

• Sequence processing annotations
– Removal of signal peptides
– Removal of transit peptides
– Extraction of active chains

• Post-translational modifications

• Sequence variants
– Splicing variants
– Sequence mutations
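As an illustration of the sequence processing step, the sketch below cuts a region out of a precursor sequence from feature coordinates. It assumes 1-based, inclusive positions; the precursor and the feature positions are hypothetical, and this is not how Phenyx itself is implemented.

```python
def extract_region(sequence, start, end):
    """Return residues start..end (1-based, inclusive)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("feature coordinates fall outside the sequence")
    return sequence[start - 1:end]


# Toy precursor: hypothetical 5-residue signal peptide followed by the chain.
precursor = "MKLLVACDEFGHIK"
signal = (1, 5)   # hypothetical signal peptide feature
chain = (6, 14)   # hypothetical mature chain feature
mature = extract_region(precursor, *chain)
```

Searching the mature chain rather than the precursor means that peptides spanning the cleavage site are not expected, while the true N-terminal peptide of the active protein is.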

Page 79: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Access to UniProtKB

www.uniprot.org

Page 80: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

www.uniprot.org

Page 81: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

www.uniprot.org

Page 82: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Search

A very powerful text search tool, with autocompletion and refinement options, for finding UniProt entries and documentation by biological information

Page 83: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

The search interface guides users with helpful suggestions and hints

Page 84: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Result pages: highly customizable

Page 85: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

The URL (results) can be bookmarked and manually modified.
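Because a result URL is just a base address plus a query string, it can also be assembled programmatically. The sketch below rebuilds the 2010-era URL form used elsewhere in these slides (`http://www.uniprot.org/taxonomy/?query=complete:yes`); the helper name is hypothetical, and the site's URL scheme may have changed since.

```python
from urllib.parse import urlencode


def build_query_url(base_url, query):
    """Append a 'query' parameter to base_url."""
    # Keep ':' readable, as in the field:value queries shown on the slides.
    return base_url + "?" + urlencode({"query": query}, safe=":")


url = build_query_url("http://www.uniprot.org/taxonomy/", "complete:yes")
```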

Page 86: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Blast

A tool associated with the standard options to search sequences in UniProt databases

Page 87: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Blast results: customize display

Page 88: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Align

A ClustalW multiple alignment tool with amino-acid highlighting options and a feature annotation highlighting option

Page 89: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

ClustalW multiple alignment of insulin sequences

Page 90: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Retrieve

A UniProt-specific tool for retrieving a list of entries in several standard formats.

You can then query your ‘personal database’ with the UniProt search tool.

Page 91: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

AC

Large protein lists are not the end point in proteomics

-> importance of protein sequence annotation

Page 92: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Retrieve tool (UniProt)

Page 93: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Play with the customize display tool…

Page 94: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Human proteins functional distribution

Maybe

Potentially

Putative

Expected

Probably

Hopefully

~40 % of human proteins have no known function (experimental data)…but many more are associated with GO terms…(computer-assigned).

Page 95: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

ID Mapping

Maps the identifiers of a given protein between different databases

Page 96: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

These identifiers all point to TP53 (p53):

P04637, NP_000537, ENSG00000141510, CCDS11118, GC17M007512, UPI000002ED67, IPI00025087, etc.

Question: same protein sequence?

- Specific and unique for each database…
- Essential for retrieving your data and for citation
- Beware of their ‘stability’… (linked to a protein sequence or to a gene)
- ID mapping tools!?!
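Before mapping identifiers, it helps to know which database each one comes from. The sketch below guesses the source from the identifier's shape. The patterns are heuristics written for the TP53 examples above (the UniProtKB pattern, for instance, covers only 6-character accessions such as P04637), not official specifications.

```python
import re

# Heuristic patterns derived from the example identifiers; not exhaustive.
PATTERNS = [
    ("UniProtKB", re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")),
    ("RefSeq protein", re.compile(r"NP_\d+")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+")),
    ("GeneCards", re.compile(r"GC[0-9A-Z]{2}[PM]\d+")),
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("IPI", re.compile(r"IPI\d{8}")),
]


def classify(identifier):
    """Return the name of the first database whose pattern matches, else 'unknown'."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return name
    return "unknown"


ids = ["P04637", "NP_000537", "ENSG00000141510", "CCDS11118",
       "GC17M007512", "UPI000002ED67", "IPI00025087"]
sources = {identifier: classify(identifier) for identifier in ids}
```

Note that some of these identifiers designate a gene (Ensembl gene, GeneCards) and others a specific sequence (UniParc), which is why they do not necessarily point to the same protein sequence.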

Page 97: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 98: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Complete proteomes

• http://www.uniprot.org/taxonomy/?query=complete:yes

KW: Complete proteome

Page 99: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Download

Page 100: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Downloading UniProt: http://www.uniprot.org/downloads

Page 101: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniProtKB

Statistics

Page 102: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

520’000 + 11’60’000 12’000’000

Swiss-Prot & TrEMBL introduce a new arithmetical concept!

Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot

Swiss-Prot: 12’000 species
TrEMBL: 290’000 species

Page 103: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniProtKB/Swiss-Prot (manual annotation)

UniProtKB/TrEMBL (automatic annotation)

12’000 species, mainly model organisms

Page 104: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Not yet available

Page 105: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

~200 new entries / day; new release every 4 weeks

- Annotation is useful, good annotation is better, update is essential!

- Some entries have gone through more than 120 versions since their integration in UniProtKB/Swiss-Prot

Page 106: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniProtKB entry history

Always cite the primary accession number (AC) !

Page 107: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Menu

Introduction

Nucleic acid sequence databases
ENA/GenBank, DDBJ

Protein sequence databases
UniProt databases (UniProtKB)
NCBI protein databases

Page 108: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

NCBI protein databases

(Entrez protein, NCBI nr)

http://www.ncbi.nlm.nih.gov/protein

Page 109: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Major protein sequence database ‘sources’

UniProtKB: Swiss-Prot + TrEMBL

NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA

PIR PDB PRF

UniProtKB/Swiss-Prot: manually annotated protein sequences (12’200 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundant with Swiss-Prot (290’000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (230’000 species ?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank: 3D data and associated sequences

PRF: Protein Research Foundation : journal scan of ‘published’ peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)

Integrated resources

‘cross-references’

Resources kept separated

Page 110: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA

Page 111: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

RefSeq: produced by NCBI and NLM

Information: http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=ch1#GenBank_ASM

http://www.ncbi.nlm.nih.gov/RefSeq/

NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA

Page 112: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

RefSeq

• Reference Sequence (RefSeq)

• provides one example of each natural biological molecule (protein, mRNA, genomic DNA) for major organisms (11’000 species)

-> several entries for the same gene (if alternative splicing)

• gene prediction

• manual annotation (‘reviewed’ tagged entries)

• Annotations are mainly found in Entrez Gene (« interdependent curated resources »)

• Can be queried via the Entrez protein system

• Nice accession numbers: NP_, NM_, etc.

Page 113: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Query RefSeq

Page 114: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

KW

AC

Taxonomy

References

Page 115: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

GenBank source and status

Annotation and ontologies

Page 116: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 117: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniProtKB vs RefSeq

Page 118: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 119: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

RefSeq chooses one or several reference protein sequences for a given gene; it does not annotate the sequence differences.

- If there is an alternative splicing event, there will be several distinct entries for a given gene

Example: GCR_HUMAN (UniProtKB/Swiss-Prot)

1 UniProtKB entry cross-linked with 7 RefSeq entries

Page 120: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

GI number: ‘GenInfo identifier’ number

- In addition to an AC number specific to the original database, each protein sequence in the NCBInr database (Swiss-Prot entries included) has a GI number.

Page 121: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

AC

Page 122: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

GI number: ‘GenInfo identifier’ number

- If the sequence changes in any way, a new GI number will be assigned:

- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record:

http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
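The behaviour described above can be pictured with a toy model: the accession stays the same, while any sequence change produces a new GI number and a new version. Everything in this sketch (class name, accession, GI values starting at 1000) is hypothetical and serves only to illustrate the idea; it does not reproduce NCBI's actual system.

```python
from dataclasses import dataclass, field
from itertools import count

_next_gi = count(1000)  # hypothetical GI counter, for illustration only


@dataclass
class SequenceRecord:
    accession: str
    sequence: str
    version: int = 1
    gi: int = field(default_factory=lambda: next(_next_gi))
    history: list = field(default_factory=list)

    def update_sequence(self, new_sequence):
        """Assign a new GI and bump the version only if the sequence changed."""
        if new_sequence == self.sequence:
            return
        self.history.append((self.gi, self.version))
        self.sequence = new_sequence
        self.version += 1
        self.gi = next(_next_gi)


record = SequenceRecord("EXAMPLE_0001", "MKT")
first_gi = record.gi
record.update_sequence("MKTA")
```

The `history` list plays the role of the revision history: it keeps the earlier (GI, version) pairs of the same record.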

Page 123: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 124: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

ID/AC mapping

Page 125: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 126: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

http://www.ebi.ac.uk/Tools/picr/

Page 127: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Protein databases for

proteomic analysis…

Page 128: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

IPI
http://www.ebi.ac.uk/IPI/IPIhelp.html

IPI Closure in 2010 !

Page 129: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 130: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Automatic approach that builds clusters by combining knowledge already present in the primary data sources (UniProtKB, RefSeq, Ensembl) with sequence similarity.

IPI=UniProtKB + RefSeq + Ensembl (+ H-InvDB, TAIR and VEGA)

human, mouse, rat, zebrafish, Arabidopsis, chicken, and cow proteomes

Page 131: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

No annotation

Page 132: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Mascot

http://www.matrixscience.com/search_intro.html

Page 133: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Update November, the 11th….

Page 134: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

ENA/GenBank/DDBJ | RefSeq (www.ncbi.nlm.nih.gov/RefSeq/) | UniProt (www.uniprot.org)
Protein and nucleotide data | Genomic, RNA and protein data | Protein data only
Biological data added by the submitters (gene name, tissue…) | Biological data annotated by curators, also found in the corresponding Entrez Gene entry | Biological data annotated by curators (Swiss-Prot), within the entry
Not curated | Partially manually curated (‘reviewed’ entries) | Manually curated in Swiss-Prot, not in TrEMBL
Author submission | NCBI creates from existing data | UniProt creates from existing data
Only author can revise (except TPA) | NCBI revises as new data emerge | UniProt revises as new data emerge
Multiple records for same loci common | Single records for each molecule of major organisms | Single records for each protein of major organisms (in Swiss-Prot, TrEMBL is redundant)
Records can contradict each other | | Identification and annotation of discrepancy
No limit to species included | Limited to model organisms | Priority (but not limited) to model organisms
Data exchanged among INSDC members | NCBI database; collaboration with UniProt | UniProt database; collaboration with NCBI (RefSeq, CCDS)

Page 135: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

All documents are online

http://education.expasy.org/cours/CJSFEAP2010/

Page 136: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Additional material

Page 137: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

PIR

Page 138: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

PIR: the Protein Identification Resource

PIR-PSD is no longer updated, but exists as an archive

Page 139: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

PDB

Page 140: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

PDB

• PDB (Protein Data Bank), 3D structure

• Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies

• Also contains the corresponding protein sequences*

• Includes protein sequences which are mutated, chimeric, etc. (created specifically to study the effect of a mutation on the 3D structure)

* The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools

Page 141: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

PDB: Protein Data Bank
www.rcsb.org/pdb/

• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).

• Associated with specialized programs that allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).

• Currently there are ~67’000 structures for about 15’000 different proteins, but far fewer protein families (highly redundant)!

Page 142: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

PDB: example

Page 143: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Coordinates of each atom

Sequence

Page 144: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 145: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Visualisation with Jmol

Page 146: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

PRF

Protein Research Foundation

Page 147: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

http://www.genome.jp/dbget-bin/www_bfind?prf

Looks for peptide sequences described in publications (and not submitted to databases!)

Page 148: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 149: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Ensembl
http://www.ensembl.org/

Review
http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610

Annotation pipeline
http://www.genome.org/cgi/content/full/14/5/942

Page 150: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

- Ensembl aligns the genomic sequences with all the sequences found in ENA, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)

- It also performs gene prediction (-> novel genes)

Ensembl = UniProtKB + RefSeq + gene prediction

- DNA, RNA and protein sequences are available for several species.

- Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant, fungal and metazoan genomes.

Page 151: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 152: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 153: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Example of problem

Ensembl completes the human ‘proteome’ by annotating missing genes according to ortholog sequences.

ID   URAD_HUMAN            Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;

In primates the genes coding for the enzymes catalyzing the degradation of uric acid were inactivated and converted to pseudogenes.
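An entry such as URAD_HUMAN can be inspected programmatically. The sketch below groups the lines of a flat-file entry by their two-letter line code (ID, AC, DR, PE, …), assuming the code occupies the first two characters of each line. The function name is illustrative, and the sample repeats only a few lines of the entry shown above.

```python
from collections import defaultdict


def parse_flat_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2 or not line[:2].isalpha():
            continue  # skip blank lines, elision markers and sequence lines
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return dict(fields)


entry = """\
ID   URAD_HUMAN            Unreviewed;       171 AA.
AC   A6NGE7;
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
"""
fields = parse_flat_entry(entry)
```

With the lines grouped this way, a check such as "unreviewed entry, PE level 4, cross-referenced only to a gene prediction" becomes a simple lookup on `fields`.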

Page 154: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

CCDS

Page 155: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

http://www.ncbi.nlm.nih.gov/CCDS/

Page 156: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

CCDS (human)

Combining different approaches (ab initio, by similarity) and taking advantage of the expertise acquired by different institutes, including manual annotation…

Consensus between 4 institutions…

Page 157: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 158: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniParc

Page 159: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniParc

- non-redundant protein sequence archive, containing both active and inactive sequences (including sequences which are not in UniProtKB, e.g. immunoglobulins…)

- the equivalent of ENA/GenBank/DDBJ at the protein level

- species-merged: sequences from different species are merged into a single record when they are 100% identical over their whole length.

- no annotation (only taxonomy)

- can be searched only with database names, taxonomy, checksum (CRC64) and accession numbers (ACs) or UniProtKB, UniRef and UniParc IDs.

- Beware: contains wrong predictions, pseudogenes, etc.
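
The "species-merged" behaviour can be illustrated with a small sketch: every distinct sequence is stored once, and each source record is attached to it as a cross-reference. This is only an illustration under stated assumptions. MD5 from the Python standard library is used as a stand-in checksum (UniParc's CRC64 is not reproduced here), and the database names, accessions and sequences in `records` are hypothetical.

```python
import hashlib


def checksum(sequence):
    """Stand-in checksum (MD5), used only as a key for identical sequences."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def build_archive(records):
    """Store each distinct sequence once and collect its source records."""
    archive = {}
    for source_db, accession, taxon, sequence in records:
        entry = archive.setdefault(
            checksum(sequence), {"sequence": sequence, "xrefs": []}
        )
        entry["xrefs"].append((source_db, accession, taxon))
    return archive


# Hypothetical source records: (database, accession, species, sequence).
records = [
    ("DB_A", "X00001", "Homo sapiens", "MKTAYIAKQR"),
    ("DB_B", "X00002", "Pan troglodytes", "MKTAYIAKQR"),  # identical sequence
    ("DB_A", "X00003", "Mus musculus", "MKTAYIAKQK"),
]

archive = build_archive(records)
for key, entry in archive.items():
    print(key[:8], entry["sequence"], entry["xrefs"])
```

Because the key depends on the sequence alone, the human and chimpanzee records end up in one archive entry, which is also why no annotation other than the taxonomy of each source record can be attached to it.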

Page 160: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Query UniParc

Page 161: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniRef

Page 162: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

‘UniRef is useful for comprehensive BLAST similarity searches by providing sets of representative sequences’

Page 163: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

«Collapsing BLAST results»

Three collections of sequence clusters from UniProtKB and selected UniParc entries:

One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments are grouped in a single record) -> reduction of 12 %

One UniRef90 entry -> sequences that are at least 90 % identical -> reduction of 40 %

One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %

Based on sequence identity -> Independent of the species !
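
The idea of collapsing sequences into clusters at a given identity threshold can be sketched as follows. This is a toy illustration, not the procedure used to build UniRef: identity is computed by a naive ungapped comparison from the first residue, relative to the shorter sequence, and clusters are formed greedily around the longest sequences. All names and sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions over the shorter sequence (ungapped)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first representative it matches."""
    clusters = {}
    # Longest sequences first, so that fragments join a full-length record.
    for name in sorted(seqs, key=lambda k: (-len(seqs[k]), k)):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


seqs = {
    "human": "MKTAYIAKQR",
    "mouse": "MKTAYIAKQK",   # 9 of 10 residues identical to "human"
    "frag": "MKTAY",         # sub-fragment of "human"
    "other": "GGGGGGGGGG",
}

for threshold in (1.0, 0.9, 0.5):
    print(threshold, greedy_cluster(seqs, threshold))
```

At a threshold of 1.0 the fragment is grouped with the full-length sequence, and at 0.9 the mouse sequence joins the same cluster: species plays no role in the grouping.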

Page 164: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 165: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniRef90: independent of species and sequence length

Page 166: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniMES

Page 167: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental protein data (only Global Ocean Sampling (GOS) data for the moment).

Available for download only (but included in UniParc, and therefore searchable with BLAST).

- UniMES FASTA sequences

- UniMES matches to InterPro methods

ftp.uniprot.org/pub/databases/uniprot

Page 168: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 169: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

UniMES: sequences in fasta format
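
Since UniMES is distributed as a download only, the sequences have to be read locally. A minimal FASTA reader could look like the sketch below; it assumes nothing about the layout of the header line beyond the leading ">" and the example records are hypothetical, not actual UniMES entries.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example; sequence lines may be wrapped.
example = """\
>seq1 hypothetical environmental protein
MKTAYIAKQR
QISFVKSH
>seq2 another hypothetical protein
GANRTYGG
"""

records = list(read_fasta(example.splitlines()))
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

For a downloaded file, the same function can be given an open file handle instead of `example.splitlines()`.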

Page 170: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 171: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Phenyx: UniProtKB***, IPI, NCBInr

Mascot: NCBInr, Swiss-Prot*, dbEST, OWL, MSDB

Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL.

Aldente: UniProtKB***.

ProFound: NCBInr, Swiss-Prot, dbEST

OMSSA: NCBInr and RefSeq.

Translation of EST sequences in all 6 reading frames (ESTs are not associated with annotated CDSs!)

*** the tool takes into account AP (alternative products), PTM and variant/conflict annotations

* the tool takes into account AP annotations (but not in the public online version)
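
Because an EST carries no annotated CDS, search tools that use dbEST have to translate each sequence in all six reading frames. The sketch below shows what that means in practice, assuming the standard genetic code and plain A/C/G/T input; it does not reproduce the implementation of any of the tools listed above, and the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate both strands in the three possible reading frames each."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCC").items():
    print(frame, protein)
```

Only one of the six translations, at most, corresponds to the real protein, which is one reason why matches against translated ESTs need careful interpretation.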

Page 172: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Use of UniProtKB/Swiss-Prot annotation by Phenyx

Page 173: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Improve identification of biologically active proteins:

Use Swiss-Prot annotations with Phenyx

• Sequence processing annotations

– Removal of signal peptides

– Removal of transit peptides

– Extraction of active chains

• Post-translational modifications

• Sequence variants

– Splicing variants

– Sequence mutations
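
How such annotations change the sequences that are actually searched can be sketched as follows. This is not Phenyx's implementation: the function names, the precursor sequence, the signal peptide length and the variant are all hypothetical, and positions are 1-based as in sequence annotation.

```python
def mature_chain(sequence, signal_end):
    """Remove residues 1..signal_end (1-based, inclusive)."""
    return sequence[signal_end:]


def apply_variant(sequence, position, original, replacement):
    """Replace `original` at the 1-based `position` by `replacement`."""
    start = position - 1
    if sequence[start:start + len(original)] != original:
        raise ValueError("annotated residue(s) do not match the sequence")
    return sequence[:start] + replacement + sequence[start + len(original):]


def search_forms(sequence, signal_end=0, variants=()):
    """Collect the sequence forms a search could consider for one entry."""
    forms = {"precursor": sequence}
    if signal_end:
        forms["mature"] = mature_chain(sequence, signal_end)
    for position, original, replacement in variants:
        label = f"variant {original}{position}{replacement}"
        forms[label] = apply_variant(sequence, position, original, replacement)
    return forms


# Hypothetical entry: 6-residue signal peptide and one P -> L variant.
forms = search_forms("MKLLAAGPSTTK", signal_end=6, variants=[(8, "P", "L")])
for label, seq in forms.items():
    print(label, seq)
```

A peptide that spans the start of the mature chain, or that contains the variant residue, can only be matched if the corresponding form is part of the search space.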

Page 174: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Phenyx and alternative sequences

http://www.genebio.com/products/phenyx/features.html#section4

Page 175: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence

Page 176: MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence
