MCB, September 2010
Protein Sequence Databases
[email protected], Swiss-Prot group, Geneva
SIB Swiss Institute of Bioinformatics
Protein sequence databases: use and pitfalls
http://education.expasy.org/cours/CJSFEAP2010/
Mr. Proteomics
Mr. Protein sequence databases
Protein identification by database matching
Mass spectrometry analysis
[Figure: mass spectra (axis from 600 to 2200) with labelled peaks at 624.3, 769.8, 893.4, 994.5, 1056.1, 1326.7, 1501.9, 1759.8, 1923.4 and 2100.6, matched against database peptides]
TYGGAARGPGFK
PSTTGVEMFR
EHICLLGR
GANR
samples with peptides
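The matching step on this slide can be sketched in a few lines. This is a minimal illustration only: it assumes trypsin cleavage (after K or R, but not before P) and compares peptide strings directly, whereas real search engines compare measured masses against computed ones. The function names and the toy database entry are hypothetical.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Cut after K or R, except when the next residue is P (assumed trypsin rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if residue in "KR" and not is_last and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def match_peptides(observed: list[str], database: dict[str, str]) -> dict[str, list[str]]:
    """For each database entry, list the observed peptides found in its digest."""
    hits = {}
    for accession, sequence in database.items():
        digest = set(tryptic_digest(sequence))
        found = [peptide for peptide in observed if peptide in digest]
        if found:
            hits[accession] = found
    return hits


# Hypothetical one-entry database built from two of the peptides shown above.
toy_database = {"toy1": "EHICLLGRGANR"}
print(match_peptides(["EHICLLGR", "GANR", "PSTTGVEMFR"], toy_database))
```

A peptide that matches no entry (here PSTTGVEMFR) is simply not reported, which is one reason the choice of database changes the result.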
Large protein lists are not the end point in proteomics
-> importance of protein sequence annotation
New challenge
Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery
Many protein sequence databases…
• Which contains the highest-quality data?
• Which is comprehensive?
• Which is up-to-date?
• Which is redundant?
• Which is indexed (allows complex queries)?
• Which Web server responds most quickly?
• Which contains complete proteomes?
• …
A HUPO test sample study reveals common problems in mass spectrometry–based proteomics
PubMed 19448641 (2009)
• A single mass spectrometry experiment can identify up to about 4000 proteins (15’000 peptides)
• Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (searching different protein databases can give different results).
• Only 7 labs (out of 27) were able to identify the 20 human proteins present in a sample, partly because the search engines used cannot distinguish among different identifiers for the same protein…
Awareness of the content and usage of knowledge
resources is a pre-requisite to do any type of « serious »
research in the field of molecular life sciences
(AMB, 2007)
Menu
Introduction
Nucleic acid sequence databases: ENA, GenBank, DDBJ
Protein sequence databases:
UniProt databases (UniProtKB)
NCBI protein databases
Protein sequence origin
More than 99 % of the protein sequences are derived from the translation of nucleotide sequences
(genomes and/or cDNAs)
-> Important to know where the protein sequence comes from…
(sequencing & gene prediction quality) !
… ~ 2500 genomes sequenced (single organism, varying sizes, including virus)
… ~ 5’000 ongoing genome sequencing projects
… cDNAs sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms
Metagenomics: study of genetic material recovered directly from environmental samples
• Global Ocean Sampling (C. Venter): 1 ml of sea water contains 1 million bacteria and 10 million viruses
• Whale fall (AAFZ00000000.1)
• Soil, sand beach, New-York air, …
• Human fluids, mouse gut (millions of bacteria within human body)
• Water treatment industry…
• Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi
Venter’s Sorcerer II
… ~ 2500 genomes sequenced (single organism, varying sizes)
… ~ 5’000 ongoing genome sequencing projects
… cDNAs sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects
… personal human genomes
New-generation sequencers: Illumina: 25 billion bp/day
http://www.youtube.com/watch?v=mVZI7NBgcWM
2’000’000 $ (2007)
70’000’000 $ (diploid, 2007)
3’000’000’000 $ (public consortium, 2000)
300’000’000 $ (Celera, 2000)
2010
How many protein-coding genes in the end?
190'500'025'042
1st estimate: ~30 million species (1.8 million named)
2nd estimate:
20 million bacteria/archaea x 4'000 genes
1 million protists x 6'000 genes
5 million insects x 14'000 genes
2 million fungi x 6'000 genes
0.5 million plants x 20'000 genes
0.5 million molluscs, worms, arachnids, etc. x 20'000 genes
0.1 million vertebrates x 25'000 genes
The calculation: 2x10^7 x 4000 + 1x10^6 x 6000 + 5x10^6 x 14000 + 2x10^6 x 6000 + 5x10^5 x 20000 + 5x10^5 x 20000 + 1x10^5 x 25000
+ 20000 (Craig Venter) + 42 (Douglas Adams) + …
About 190 billion proteins (?)
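The arithmetic of the second estimate is easy to check. The sketch below only multiplies and sums the seven taxonomic groups listed on the slide; the two tongue-in-cheek terms (Craig Venter, Douglas Adams) are left out, and the variable and function names are illustrative.

```python
# (group, number of species, protein-coding genes per species), as on the slide
GROUPS = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 1_000_000, 6_000),
    ("insects", 5_000_000, 14_000),
    ("fungi", 2_000_000, 6_000),
    ("plants", 500_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 500_000, 20_000),
    ("vertebrates", 100_000, 25_000),
]


def estimate_protein_coding_genes(groups=GROUPS) -> int:
    """Sum of species count times genes per species over all groups."""
    return sum(species * genes for _, species, genes in groups)


print(f"{estimate_protein_coding_genes():,}")  # 190,500,000,000
```

Bacteria/archaea and insects alone account for 150 of the roughly 190 billion, so the total is dominated by the two least certain species counts.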
About 12.0 million ‘known’ protein sequences in 2010 (from ~290’000 species)
More than 99 % of the protein sequences are derived from the translation of nucleotide sequences
Less than 1 % direct protein sequencing (Edman, MS/MS…)
-> It is important that users know where the protein sequence comes from…
(sequencing & gene prediction quality) !
ENA (EMBL-Bank) GenBank
DDBJ
http://www.insdc.org/
ENA/GenBank/DDBJ
cDNAs, ESTs, genes, genomes, …
ENA, GenBank, DDBJ
Data not submitted to public databases, delayed or cancelled…
The hectic life of a sequence …
archive of primary sequence data and corresponding annotation submitted by the
laboratories that did the sequencing.
Journals do not (SHOULD NOT) accept a paper dealing with a nucleic acid sequence if the ENA/GenBank/DDBJ AC
number is not available…
‘journal publishers generally require deposition prior to publication so that an accession number can be included in
the paper.’
…not the case yet for protein sequences !!!
taxonomy
Cross-references
references
accession number
CDS annotation
(Prediction or experimentally determined)
sequence
CDS: CoDing Sequence (proposed by submitters)
annotation provided by
the laboratories that did the sequencing
CoDing Sequence: alignment between an mRNA and a genomic sequence
[Figure: pairwise alignment of the mRNA (CONTIG) against the genomic sequence (Genomic). The aligned blocks are the exons; the stretches present only in the genomic sequence are the introns. The mRNA read contains a few ambiguous bases (N).]
CDS translation provided by ENA
CDS provided by the submitters
The first Met !
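Why the first Met matters can be shown with a small translation sketch. It assumes the standard genetic code and a CDS read from the first ATG to the first in-frame stop codon; the input sequence is a made-up toy, and codons containing an ambiguous base (N) are rendered as X.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order for each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def translate_from_first_met(nucleotides: str) -> str:
    """Translate from the first ATG up to (not including) the first stop codon."""
    sequence = nucleotides.upper()
    start = sequence.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE.get(sequence[i:i + 3], "X")  # X for codons with N
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_from_first_met("CCATGAAGGTCGAATAAGG"))  # MKVE
```

If the submitter annotates a later ATG as the start, or a sequencing error shifts the frame upstream, the same nucleotide entry yields a different protein sequence.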
Very rarely done…
Pitfall no 1 – gene prediction ‘quality’
! ? !
Complete genome (submitted)
but only ~ 2,000 CDS/proteins available !
Pitfall no 2 – CDS annotation and submission
! ? !
http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
…annotated CDS in UniProtKB (no gene prediction)(~290’000 species)
From nucleic acid to amino acid sequence databases…
The hectic life of a protein sequence …
cDNAs, ESTs, genomes, …
ENA, GenBank, DDBJ
Data not submitted to public databases, delayed or cancelled…
…if the submitters provide an annotated Coding Sequence
(CDS)(1/10 ENA entries)
Protein sequence databases
Nucleic acid databases
Gene prediction: RefSeq, Ensembl
no CDS
Why do things in a simple way when you can do them in a very complex one?
The hectic life of a sequence …
TrEMBL Genpept
CoDing Sequences provided by submitters
cDNAs, ESTs, genomes, …
ENA, GenBank, DDBJ
Data not submitted to public databases, delayed or cancelled…
Swiss-Prot
RefSeq PRF
Scientific publications derived sequences
Ensembl
CCDS
UniParc
UniProtKB
PDB(PIR)
+ all ‘species’ specific databases (EcoGene, TAIR, …)
(IPI)
UniMES
CoDing Sequences provided by submitters
and gene prediction
TPA
Major protein sequence database ‘sources’
UniProtKB: Swiss-Prot + TrEMBL
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA
PIR PDB PRF
UniProtKB/Swiss-Prot: manually annotated protein sequences (12’200 species)
UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundant with Swiss-Prot (290’000 species)
GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (230’000 species ?)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank: 3D data and associated sequences (PIR-NRL3D)
PRF: Protein Research Foundation : journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)
TPA: Third Party Annotation Sequence Database (update of entries derived from GenBank primary data)
Integrated resources
‘cross-references’
Resources kept separated
Ensembl: UniProtKB + RefSeq + gene prediction (40 species)
Vega: Ensembl + gene prediction + manual annotation (5 species)
IPI: UniProtKB + RefSeq + Ensembl + TAIR (arabidopsis db) + H-InvDB (human cDNAs manual annotation) + VEGA (Vertebrate Genome Annotation) (7 species)
Closure in 2010 !!!
CCDS: consensus between EBI, NCBI, Sanger, UCSC (3 species)
Others:
OWL: Swiss-Prot + PIR + PDB + GenPept (obsolete)
MSDB (Mascot): Swiss-Prot + PIR + PDB + TrEMBL + GenBank…
dbESTs: translated ESTs (in the 6 frames; no annotated CDSs, no gene prediction)
Major protein sequence database ‘composite’
Phenyx: UniProtKB, IPI, NCBInr
Mascot: NCBInr, Swiss-Prot, dbEST
Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept
Aldente: UniProtKB.
ProFound: NCBInr, Swiss-Prot, dbEST
OMSSA: NCBInr and RefSeq.
Different protein databases available for different online proteomic tools
! ? !
Databases used
Study done on 28 proteomic papers (from 2010):
• The majority of labs ~61% use IPI
• ~18% use Swiss-Prot (mainly human, some bacteria)
• ~20% use other sources such as NCBI, SGD or in house developed databases
Personal communication: Silvia Jimenez, nov 2010
UniProt
SIB + EBI + PIR
UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~12 million entries)
UniParc: protein sequence archive (ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated (query, Blast, download) (~25 million entries)
UniRef: 3 clusters of protein sequences with 100, 90 and 50 % similarity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries)
UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)
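The UniParc idea (one entry per distinct sequence, with cross-links to every source record carrying it) can be pictured as a dictionary keyed by sequence. This is only a sketch of the concept, not how UniParc is implemented; the class name and the accession numbers are hypothetical.

```python
from collections import defaultdict


class SequenceArchive:
    """Toy non-redundant archive: one entry per distinct protein sequence."""

    def __init__(self):
        # sequence -> set of (database, accession) cross-references
        self._entries = defaultdict(set)

    def add(self, database: str, accession: str, sequence: str) -> None:
        self._entries[sequence.upper()].add((database, accession))

    def cross_references(self, sequence: str) -> list[tuple[str, str]]:
        return sorted(self._entries.get(sequence.upper(), set()))

    def __len__(self) -> int:
        return len(self._entries)


archive = SequenceArchive()
archive.add("TrEMBL", "X00001", "MKVETCVYSGYK")    # hypothetical records
archive.add("RefSeq", "NP_TOY001", "MKVETCVYSGYK")  # same sequence, other source
archive.add("TrEMBL", "X00002", "MKVETCVYSGYR")    # one residue differs
print(len(archive), archive.cross_references("MKVETCVYSGYK"))
```

Three submitted records collapse into two archive entries, which is the same principle as the 100% merge used for redundancy checks.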
UniProt databases
UniProtKB: an encyclopedia on proteins
composed of 2 sections: UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
released every 4 weeks
UniProtKB
from ENA to TrEMBL
UniProtKB protein sequence data are mainly derived from ENA (CDS) but also from Ensembl
and other sequence resources such as RefSeq or model organism databases (MODs).
Data from the PIR database have been integrated in UniProt since 2003.
TrEMBL
ENA
Automated extraction of protein sequence (translated CDS), gene name and references
+ Automated annotation
The quality of UniProtKB/TrEMBL data, including the protein sequence, is directly dependent on the
information provided by the submitter of the original nucleotide entry.
Automated annotation
• Redundancy check (100% merge)
• Family attribution (InterPro)
• Many other cross-references
• Rule-based automated annotation
! ? !
UniProtKB
from TrEMBL to Swiss-Prot
Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL
-> minimal redundancy
TrEMBL
ENA
Automated extraction of protein sequence (translated CDS), gene name and references
+ Automated annotation
Manual annotation of the sequence and associated biological information
Swiss-Prot
Sequence
Sequence features
Ontologies
References
Nomenclature
Splice variants
Annotations
UniProtKB: from TrEMBL to Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)
The displayed sequence is the most prevalent protein sequence and/or the protein sequence which is also found in orthologous species.
The displayed sequence is generally derived from the translation of the genomic sequence (when available).
Sequence differences are documented.
1 entry <-> 1 gene (1 species) 1 displayed sequence
(annotation of alternative sequences, when available)
UniProtKB/Swiss-Prot: protein sequence annotation
What is the current status?
• At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence.
• Typical problems
– unsolved conflicts;
– uncorrected initiation sites;
– frameshifts;
– other ‘problems’
… once upon a time, it was a gene on chromosome 11…
All these sequences are available in protein sequence databases (e.g. GenPept) !!!
Quality of protein information from genome projects
• Let's look at proteins originating from genome projects:
– Drosophila: the paradigm of what a curated genome should look like (thanks to FlyBase): only 1.8% of the gene models conflict with Swiss-Prot sequences;
– Arabidopsis: a typical example of a genome where a lot of annotation was done when it was sequenced, but with no update since then (at least in the public view): 20% of the gene models are erroneous;
– Tetraodon nigroviridis: the typical example of a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins;
– Bacteria and Archaea have almost no splicing, so predictions are “easier”; however, errors are still made: start codons, missed small proteins (<100 aa)…
UniProtKB/Swiss-Prot: protein sequence annotation
Example of problem (derived from gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to ortholog sequences.
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.
• Producing a clean set of sequences is not a trivial task;
• It is not getting easier as more and more types of sequence data are submitted;
• The ‘Protein existence’ tag indicates what is the evidence for the existence of a given protein (but not associated with sequence quality);
• Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location), …)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58%)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)
‘Protein existence’ tag
In order to avoid ‘pseudogenes’ and most of the improbable protein sequences, you can filter your query and exclude sequences with ‘protein existence tag’ = ‘Uncertain’
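The same filter can be applied offline to a downloaded flat file. The sketch below assumes the UniProtKB text format, where each line starts with a two-letter code (AC, PE, …) and entries end with `//`. The first sample entry uses the AC and PE values of the URAD_HUMAN example; the second entry is hypothetical, and the function names are illustrative.

```python
SAMPLE = """\
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
PE   4: Predicted;
//
ID   TOY1_HUMAN   Unreviewed;   50 AA.
AC   X00001;
PE   5: Uncertain;
//
"""


def parse_entries(flat_text: str) -> list[dict]:
    """Collect the primary AC and the protein existence level of each entry."""
    entries, current = [], {}
    for line in flat_text.splitlines():
        if line.startswith("//"):  # end of entry
            if current:
                entries.append(current)
            current = {}
            continue
        code, value = line[:2], line[5:].strip()
        if code == "AC" and "ac" not in current:
            current["ac"] = value.split(";")[0].strip()  # first AC is the primary one
        elif code == "PE":
            current["pe"] = int(value.split(":")[0])
    if current:
        entries.append(current)
    return entries


def drop_uncertain(entries: list[dict]) -> list[dict]:
    """Keep every entry whose protein existence level is not 5 (Uncertain)."""
    return [entry for entry in entries if entry.get("pe") != 5]


print([entry["ac"] for entry in drop_uncertain(parse_entries(SAMPLE))])  # ['A6NGE7']
```

Note that A6NGE7 survives the filter: it is tagged ‘Predicted’, not ‘Uncertain’, so the tag alone does not remove every doubtful gene model.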
The ‘alternative’ sequence(s)
(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).
Proteome complexity
Example with human
Not predictable at the genome level! -> important post-genomic data!
~20’000
Multiple alignment of the end of the available GCR sequences
Annotation of the sequence differences (protein diversity)
1 entry <-> 1 gene (1 species)
…and natural variant
Available in separated files!
Important remark
> 30’000 additional sequences (total)
The ‘alternative’ sequence(s)
not ‘directly available’ for a lot of tools, including protein identification tools and Blast, depending on the server!
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/PubMed)
- prediction programs (Prosite, Anabelle)
- contacts with experts
- other databases
- nomenclature committees
Maximum usage of controlled vocabulary: Keywords, Tissues, Post-translational modifications, Strains, Species, Subcellular location, Extracellular domains, Journals, …
Gene Ontology…
Extract literature information and protein sequence analysis
Protein nomenclature
…enable researchers to obtain a summary of what is known about a protein…
General annotation
(Comments)
www.uniprot.org
Sequence annotation
(Features)
Ontologies
Swiss-Prot keywords
Gene Ontology (GO terms)
Human protein manual annotation: some statistics (Aug 2010)
Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between both.
Level | Type of evidence | Qualifier
1st | Strong experimental evidence | (none)
2nd | Light experimental evidence | Probable
3rd | Inferred by similarity with homologous protein (data of 1st or 2nd level) | By similarity
4th | Inferred by sequence prediction | Potential
Phenyx: UniProtKB**, IPI, NCBInr
Mascot: NCBInr, Swiss-Prot*, dbEST, OWL, MSDB
Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL
Aldente: UniProtKB**
ProFound: NCBInr, Swiss-Prot, dbEST
OMSSA: NCBInr and RefSeq
** the tool takes into account AP, PTM, …but not yet variants /conflict annotations
* the tool takes into account AP annotations
Do proteomic analysis tools make use of sequence annotation ?
Identification of biologically active proteins:
use Swiss-Prot annotations with Phenyx
• Sequence processing annotations
– Removal of signal peptides
– Removal of transit peptides
– Extraction of active chains
• Post-translational modifications
• Sequence variants
– Splicing variants
– Sequence mutations
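The sequence-processing step can be sketched as follows. This is an illustration of the idea, not Phenyx code: it assumes features given as (key, start, end) with 1-based inclusive positions, as in UniProtKB feature tables, and uses the feature keys SIGNAL, TRANSIT and CHAIN. The sequence and coordinates are invented.

```python
Feature = tuple[str, int, int]  # (key, start, end), 1-based inclusive


def mature_forms(sequence: str, features: list[Feature]) -> list[str]:
    """Return annotated chains, or the sequence minus its leading signal/transit peptide."""
    chains = [sequence[start - 1:end] for key, start, end in features if key == "CHAIN"]
    if chains:
        return chains
    # No CHAIN annotated: cut off everything up to the end of the targeting peptide.
    cut = max(
        (end for key, _, end in features if key in ("SIGNAL", "TRANSIT")),
        default=0,
    )
    return [sequence[cut:]]


precursor = "MKLLAAGGSTVWYK"  # hypothetical 14-residue precursor
print(mature_forms(precursor, [("SIGNAL", 1, 5), ("CHAIN", 6, 14)]))  # ['AGGSTVWYK']
```

Searching spectra against the mature form rather than the precursor matters because the N-terminal peptide of the active protein does not exist in the unprocessed database sequence.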
Access to UniProtKB
www.uniprot.org
Search
A very powerful text search tool with autocompletion and refinement options that lets you look for UniProt entries and documentation by biological information
The search interface guides users with helpful suggestions and hints
Result pages: highly customizable
The URL (results) can be bookmarked and manually modified.
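Because the result URL carries the query, it can also be built by a script. The sketch below relies only on the `query` parameter, which appears in the taxonomy URL quoted elsewhere in these slides (`http://www.uniprot.org/taxonomy/?query=complete:yes`); any other parameter, and the current layout of the site, would be an assumption. The function name is illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_query_url(base_url: str, query: str) -> str:
    """Append a percent-encoded `query` parameter to a base URL."""
    return f"{base_url}?{urlencode({'query': query})}"


url = build_query_url("http://www.uniprot.org/taxonomy/", "complete:yes")
print(url)
# Decoding the URL gives back the original query text.
print(parse_qs(urlparse(url).query)["query"][0])
```

Encoding the query keeps characters such as `:` or spaces from breaking the URL, which is what makes a bookmarked search reproducible.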
Blast
A tool associated with the standard options to search
sequences in UniProt databases
Blast results: customize display
Align
A ClustalW multiple alignment tool with options for highlighting amino acids and feature annotations
ClustalW multiple alignment of insulin
sequences
Retrieve
A UniProt-specific tool for retrieving a list of entries in several standard formats.
You can then query your ‘personal database’ with the UniProt search tool.
Retrieve tool (UniProt)
Play with the customize display tool…
Human proteins functional distribution
Maybe
Potentially
Putative
Expected
Probably
Hopefully
~40 % of human proteins have no known function (experimental data)…but many more are associated with GO terms…(computer-assigned).
ID Mapping
Gives the possibility to get a mapping between different databases for a given
protein
These identifiers are all pointing to TP53 (p53):
Question: same protein sequence ?
P04637, NP_000537, ENSG00000141510, CCDS11118, GC17M007512, UPI000002ED67, IPI00025087, etc.
- Specific and unique for each database…
- Essential for retrieving your data and for citation
- Beware of their ‘stability’… (linked to a protein sequence or to a gene)
- ID mapping tools
Complete proteomes
• http://www.uniprot.org/taxonomy/?query=complete:yes
KW: Complete proteome
Download
Downloading UniProt: http://www.uniprot.org/downloads
UniProtKB
Statistics
520’000 + 11’60’000 12’000’000
Swiss-Prot & TrEMBL introduce a new arithmetical
concept !
Redundancy in TrEMBL&
Redundancy between TrEMBL and Swiss-Prot
12’000 species 290’000 species
Swiss-Prot TrEMBL
UniProtKB/Swiss-Prot (manual annotation)
UniProtKB/TrEMBL (automatic annotation)
12’000 species, mainly model organisms
Not yet available
~200 new entries/day; new release every 4 weeks
-Annotation is useful, good annotation is better, update is essential !
- Some entries have gone through more than 120 versions since their integration in UniProtKB/Swiss-Prot
UniProtKB entry history
Always cite the primary accession number (AC) !
NCBI protein databases
(Entrez protein, NCBI nr)
http://www.ncbi.nlm.nih.gov/protein
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA
RefSeq: produced by NCBI and NLM
Information: http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=ch1#GenBank_ASM
http://www.ncbi.nlm.nih.gov/RefSeq/
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA
• Reference Sequence (RefSeq)
• provides one example of each natural biological molecule (protein- mRNA- genomic DNA) for major organisms (11’000 species)
-> several entries for the same gene (if alternative splicing)
• gene prediction
• manual annotation (‘reviewed’ tagged entries)
• Annotations are mainly found in Entrez Gene (« interdependent curated resources »)
• Can be queried via the Entrez protein system
• Nice accession numbers: NP_, NM_, etc…
RefSeq
Query RefSeq
KW
AC
Taxonomy
References
GenBank source and status
Annotation and ontologies
UniProtKB vs RefSeq
RefSeq chooses one or several protein reference sequences for a given gene: they do not annotate the sequence differences.
- If there is an alternative splicing event, there will be several distinct entries for a given gene
Example: GCR_HUMAN
GCR_HUMAN (UniProtKB/Swiss-Prot)
1 UniProtKB entry cross-linked with 7 RefSeq entries
GI number ‘GenInfo identifier’ number
- In addition to an AC number specific to the original database, each protein sequence in the NCBInr database (including Swiss-Prot entries) has a GI number.
GI number: ‘GenInfo identifier’ number
- If the sequence changes in any way, a new GI number will be assigned:
- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record:
http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
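The rule "any sequence change gives a new GI number" can be mimicked with a small registry. This is a conceptual sketch only, not NCBI's implementation: the class, the starting number and the accession are hypothetical.

```python
import itertools


class GiRegistry:
    """Toy registry: an accession keeps its name, each new sequence gets a new GI."""

    def __init__(self, first_gi: int = 1):
        self._counter = itertools.count(first_gi)
        self._history = {}  # accession -> list of (version, gi, sequence)

    def submit(self, accession: str, sequence: str) -> int:
        revisions = self._history.setdefault(accession, [])
        if revisions and revisions[-1][2] == sequence:
            return revisions[-1][1]  # unchanged sequence keeps its GI
        gi = next(self._counter)
        revisions.append((len(revisions) + 1, gi, sequence))
        return gi

    def history(self, accession: str) -> list[tuple[int, int]]:
        """(version, GI) pairs, oldest first."""
        return [(version, gi) for version, gi, _ in self._history.get(accession, [])]


registry = GiRegistry(first_gi=100)
registry.submit("X00001", "MKVETC")
registry.submit("X00001", "MKVETC")   # resubmitted unchanged
registry.submit("X00001", "MKVETS")   # one residue corrected
print(registry.history("X00001"))     # [(1, 100), (2, 101)]
```

A GI number therefore identifies one exact sequence, while the accession follows the record through its revisions.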
ID/AC mapping
http://www.ebi.ac.uk/Tools/picr/
Protein databases for
proteomic analysis…
IPI: http://www.ebi.ac.uk/IPI/IPIhelp.html
IPI Closure in 2010 !
Automatic approach that builds clusters through combining knowledge already present in the primary data source (UniProtKB, RefSeq, Ensembl) and sequence similarity.
IPI=UniProtKB + RefSeq + Ensembl (+ H-InvDB, TAIR and VEGA)
human, mouse, rat, zebrafish, Arabidopsis, chicken, and cow proteomes
No annotation
Mascot
http://www.matrixscience.com/search_intro.html
Update November, the 11th….
ENA/GenBank/DDBJ
- Protein and nucleotide data
- Biological data added by the submitters (gene name, tissue…)
- Not curated
- Author submission
- Only the author can revise (except TPA)
- Multiple records for the same locus are common
- Records can contradict each other
- No limit to species included
- Data exchanged among INSDC members

RefSeq (www.ncbi.nlm.nih.gov/RefSeq/)
- Genomic, RNA and protein data
- Biological data annotated by curators, also found in the corresponding Entrez Gene entry
- Partially manually curated (‘reviewed’ entries)
- NCBI creates from existing data
- NCBI revises as new data emerge
- Single records for each molecule of major organisms
- Identification and annotation of discrepancy
- Limited to model organisms
- NCBI database; collaboration with UniProt

UniProt (www.uniprot.org)
- Protein data only
- Biological data annotated by curators (Swiss-Prot), within the entry
- Manually curated in Swiss-Prot, not in TrEMBL
- UniProt creates from existing data
- UniProt revises as new data emerge
- Single records for each protein of major organisms (in Swiss-Prot; TrEMBL is redundant)
- Priority (but not limited) to model organisms
- UniProt database; collaboration with NCBI (RefSeq, CCDS)
All documents are online
http://education.expasy.org/cours/CJSFEAP2010/
Additional material
PIR
PIR: the Protein Information Resource
PIR-PSD is no longer updated, but is still available as an archive
PDB
PDB
• PDB (Protein Data Bank), 3D structure
• Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies
• Also contains the corresponding protein sequences*
• Includes protein sequences which are mutated, chimeric, etc. (created specifically to study the effect of a mutation on the 3D structure)
* The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools
PDB: Protein Data Bank
www.rcsb.org/pdb/
• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
• Associated with specialized programs that allow the visualization of the corresponding 3D structure (e.g., Swiss-PdbViewer, Chime, RasMol).
• Currently holds ~67’000 structures for about 15’000 different proteins, but far fewer protein families (highly redundant)!
PDB: example
Coordinates of each atom
Sequence
Visualisation with Jmol
PRF
Protein Research Foundation
http://www.genome.jp/dbget-bin/www_bfind?prf
Looks for peptide sequences described in publications (including those which have not been submitted to databases!)
Ensembl: http://www.ensembl.org/
Review: http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610
Annotation pipeline: http://www.genome.org/cgi/content/full/14/5/942
- Ensembl aligns the genomic sequences with all the sequences found in ENA, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)
- It also performs gene prediction (-> novel genes)
Ensembl = UniProtKB + RefSeq + gene prediction
- DNA, RNA and protein sequences are available for several species.
- Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant, fungal and metazoan genomes.
Example of problem
Ensembl completes the human ‘proteome’ by annotating missing genes according to the sequences of their orthologs.
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes catalyzing the degradation of uric acid were inactivated and converted to pseudogenes.
CCDS
http://www.ncbi.nlm.nih.gov/CCDS/
CCDS (human)
Combines different approaches (ab initio, by similarity) and takes advantage of the expertise acquired by different institutes, including manual annotation…
Consensus between 4 institutions…
UniParc
UniParc
- non-redundant protein sequence archive, containing both active and inactive sequences (including sequences which are not in UniProtKB, e.g. immunoglobulins)
- the equivalent of ENA/GenBank/DDBJ at the protein level
- species-merged: sequences from different species are merged when they are 100% identical over their whole length
- no annotation (only taxonomy)
- can be searched only with database names, taxonomy, checksum (CRC64), accession numbers (ACs), or UniProtKB, UniRef and UniParc IDs
- Beware: contains wrong predictions, pseudogenes, etc.
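The species-merging and checksum ideas above can be sketched in a few lines. This is only an illustration, not UniParc's actual pipeline: the record IDs and sequences are made up, and the CRC-64 variant (reflected ISO 3309 polynomial, initial value 0) is assumed to be the one behind the CRC64 checksum shown in UniProt entries.

```python
from collections import defaultdict

# Reflected form of the ISO 3309 polynomial x^64 + x^4 + x^3 + x + 1.
# Assumption: this is the CRC-64 variant used for UniProt/UniParc checksums.
_POLY = 0xD800000000000000


def _build_table():
    """Precompute the CRC value of every possible byte."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc64(sequence):
    """Return the CRC-64 of a sequence as 16 upper-case hex digits."""
    crc = 0
    for byte in sequence.encode("ascii"):
        crc = (crc >> 8) ^ _TABLE[(crc ^ byte) & 0xFF]
    return f"{crc:016X}"


def merge_identical(records):
    """Group (source_id, species, sequence) records by exact sequence.

    Sequences that are 100% identical over their whole length end up in
    one archive entry, whatever the species. Returns a dict mapping each
    sequence to (checksum, sorted list of (source_id, species)).
    """
    archive = defaultdict(list)
    for source_id, species, sequence in records:
        archive[sequence.upper()].append((source_id, species))
    return {seq: (crc64(seq), sorted(srcs)) for seq, srcs in archive.items()}


# Hypothetical records: two species share one sequence, a third differs.
records = [
    ("X1", "Homo sapiens", "MKTAYIAK"),
    ("X2", "Pan troglodytes", "MKTAYIAK"),
    ("X3", "Mus musculus", "MKTAYIAR"),
]
merged = merge_identical(records)
for seq, (checksum, sources) in merged.items():
    print(checksum, seq, sources)
```

Keeping the sequence itself as the key, rather than the checksum, avoids merging two different sequences that happen to share a checksum.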
Query UniParc
UniRef
‘UniRef is useful for comprehensive BLAST similarity searches by providing sets of representative sequences’
«Collapsing BLAST results»
Three collections of sequence clusters from UniProtKB and selected UniParc entries:
One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments are grouped in a single record) -> reduction of 12 %
One UniRef90 entry -> sequences that are at least 90 % identical -> reduction of 40 %
One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %
Based on sequence identity -> independent of the species!
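As a toy illustration of the UniRef100 rule (identical sequences and sub-fragments grouped in a single record, regardless of species), the sketch below picks the longest sequence as representative and attaches every sequence that is identical to it or contained in it. The entry names and sequences are hypothetical, and the real UniRef procedure is more elaborate; UniRef90 and UniRef50 additionally need an alignment-based identity measure that is not shown here.

```python
def uniref100_like(seqs):
    """Cluster sequences that are identical to, or fragments of, a longer one.

    seqs: dict mapping entry name -> amino acid sequence.
    Returns a list of (representative_sequence, [member names]).
    """
    clusters = []
    # Longest sequences first, so that representatives are seen before
    # their fragments; ties are broken by entry name for reproducibility.
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, seq in ordered:
        for rep_seq, members in clusters:
            if seq in rep_seq:  # identical or exact sub-fragment
                members.append(name)
                break
        else:
            clusters.append((seq, [name]))
    return clusters


# Hypothetical entries: same protein in two species, one fragment, one unrelated.
entries = {
    "A_HUMAN": "MKTAYIAKQR",
    "A_MOUSE": "MKTAYIAKQR",
    "A_FRAG": "TAYIAK",
    "B_HUMAN": "GGGSLLV",
}
clusters = uniref100_like(entries)
for rep, members in clusters:
    print(rep, members)
```

Note that the species never enters the comparison: only the sequence does.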
UniRef90: independent of species and sequence length
UniMES
The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental protein data (only Global Ocean Sampling (GOS) data for the moment).
Download only (but included in UniParc -> BLAST).
- UniMES FASTA sequences
- UniMES matches to InterPro methods
ftp.uniprot.org/pub/databases/uniprot
UniMES: sequences in FASTA format
Phenyx: UniProtKB***, IPI, NCBInr.
Mascot: NCBInr, Swiss-Prot*, dbEST, OWL, MSDB.
Protein Prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL.
Aldente: UniProtKB***.
ProFound: NCBInr, Swiss-Prot, dbEST.
OMSSA: NCBInr and RefSeq.
dbEST: translation of the EST sequences in the 6 frames (ESTs are not associated with annotated CDSs!)
*** the tool takes into account the AP (alternative products), PTM and variant/conflict annotations
* the tool takes into account the AP annotations (but not in the online public version)
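Because ESTs carry no annotated CDS, a search engine has to translate each EST in all six reading frames before matching peptides. A minimal sketch of that translation, assuming the standard genetic code and a made-up DNA fragment (unknown or ambiguous codons are rendered as X, stop codons as *):

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna):
    """Translate a DNA string codon by codon; trailing bases are ignored."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict of frame label (+1..+3, -1..-3) -> translated protein."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


# Hypothetical six-base fragment, small enough to check by hand.
frames = six_frame_translation("ATGGCC")
for label in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(label, frames[label])
```

Most of the six translations of a real EST are biologically meaningless, which is one reason why EST-derived databases are large and noisy for peptide matching.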
Use of UniProtKB/Swiss-Prot annotation by Phenyx
Improve identification of biologically active proteins:
Use Swiss-Prot annotations with Phenyx
• Sequence processing annotations
– Removal of signal peptides
– Removal of transit peptides
– Extraction of active chains
• Post-translational modifications
• Sequence variants
– Splicing variants
– Sequence mutations
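The sequence processing step can be pictured as follows. This is a hypothetical sketch, not Phenyx's implementation: it assumes features are given as (type, start, end) tuples with 1-based inclusive positions, as in UniProtKB feature tables, and the sequence is invented.

```python
def mature_forms(sequence, features):
    """Derive the mature sequence(s) used for peptide matching.

    features: list of (feature_type, start, end), 1-based inclusive.
    If CHAIN features are annotated, each chain is extracted. Otherwise
    any N-terminal SIGNAL or TRANSIT peptide is removed.
    """
    chains = [sequence[start - 1:end]
              for ftype, start, end in features if ftype == "CHAIN"]
    if chains:
        return chains
    # No chain annotated: cut after the last signal/transit residue, if any.
    cut = max((end for ftype, _, end in features
               if ftype in ("SIGNAL", "TRANSIT")), default=0)
    return [sequence[cut:]]


# Invented precursor: 18-residue signal peptide followed by a 6-residue chain.
precursor = "MKWVTFISLLLLFSSAYS" + "RGVFRR"
with_chain = mature_forms(precursor, [("SIGNAL", 1, 18), ("CHAIN", 19, 24)])
signal_only = mature_forms(precursor, [("SIGNAL", 1, 18)])
unannotated = mature_forms(precursor, [])
print(with_chain, signal_only, unannotated)
```

Searching against the mature form matters because the N-terminal peptide of the processed protein does not exist as such in the unprocessed precursor sequence.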
Phenyx and alternative sequences
http://www.genebio.com/products/phenyx/features.html#section4