
MCB September, 2010

Protein Sequence Databases

[email protected], Swiss-Prot group, Geneva

SIB Swiss Institute of Bioinformatics

Protein sequence databases: use and pitfalls

http://education.expasy.org/cours/CJSFEAP2010/



Mr. Proteomics

Mr. Protein sequence databases



Protein identification by database matching

[Figure: mass spectrometry analysis of the sample. Two peak lists on an m/z axis from 600 to 2200: one with peaks at 624.3, 769.8, 893.4, 1056.1, 1326.7, 1501.9, 1759.8 and 2100.6, and one with the additional peaks 994.5 and 1923.4.]

TYGGAARGPGFK

PSTTGVEMFR

EHICLLGR

GANR

samples with peptides

A large protein list is not the end point in proteomics

-> importance of protein sequence annotation
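Database matching compares measured peptide masses with masses computed from database sequences. The following is a minimal Python sketch of that computation, under assumptions that are not stated in the slides: trypsin cleaves after K or R except before P, there are no missed cleavages or modifications, and standard monoisotopic residue masses are used. The protein sequence is a made-up toy example.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added per free peptide


def tryptic_digest(sequence):
    """Cut after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


toy_protein = "MKTAYRPLKGANR"  # hypothetical sequence, for illustration only
for peptide in tryptic_digest(toy_protein):
    print(peptide, round(peptide_mass(peptide), 4))
```

A search engine would compare such theoretical masses with the measured peak list, so any error in the database sequence (wrong start codon, missing exon) shifts or removes the expected peptides.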



New challenge

Flood of data -> the data need to be stored, curated and made available for analysis and knowledge discovery



Many protein sequence databases…

• Which contains the highest-quality data?
• Which is comprehensive?
• Which is up to date?
• Which is redundant?
• Which is indexed (allows complex queries)?
• Which Web server responds most quickly?
• Which contains complete proteomes?
• …



A HUPO test sample study reveals common problems in mass spectrometry–based proteomics

PubMed 19448641 (2009)

• A single mass spectrometry experiment can identify up to about 4000 proteins (15’000 peptides).

• Protein databases vary greatly in their curation, completeness and comprehensiveness (searching different protein databases can give different results).

• Only 7 of the 27 labs were able to identify the 20 human proteins present in a sample, partly because the search engines used cannot distinguish among different identifiers for the same protein…



Awareness of the content and usage of knowledge resources is a prerequisite for any type of « serious » research in the field of molecular life sciences (AMB, 2007)

Menu

Introduction

Nucleic acid sequence databases: ENA, GenBank, DDBJ

Protein sequence databases: UniProt databases (UniProtKB), NCBI protein databases



Protein sequence origin

More than 99 % of protein sequences are derived from the translation of nucleotide sequences (genomes and/or cDNAs)

-> It is important to know where a protein sequence comes from (sequencing & gene prediction quality)!



… ~ 2500 genomes sequenced (single organism, varying sizes, including viruses)

… ~ 5’000 ongoing genome sequencing projects

… cDNA sequencing projects (ESTs or cDNAs)

… metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms

Metagenomics: study of genetic material recovered directly from environmental samples

• Global Ocean Sampling (C. Venter): 1 ml of sea water contains 1 million bacteria and 10 million viruses

• Whale fall (AAFZ00000000.1)

• Soil, sand beach, New York air, …

• Human fluids, mouse gut (millions of bacteria within the human body)

• Water treatment industry…

• Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi

Venter’s Sorcerer II



… ~ 2500 genomes sequenced (single organism, varying sizes)

… ~ 5’000 ongoing genome sequencing projects

… cDNA sequencing projects (ESTs or cDNAs)

… metagenome sequencing projects

… personal human genomes

New-generation sequencers: Illumina: 25 billion bp/day



http://www.youtube.com/watch?v=mVZI7NBgcWM

2’000’000 $ (2007)

70’000’000 $ (diploid, 2007)

3’000’000’000 $ (public consortium, 2000)

300’000’000 $ (Celera, 2000)

2010



How many protein-coding genes in the end?



190'500'025'042

1st estimate: ~30 million species (1.8 million named)

2nd estimate:

20 million bacteria/archaea x 4'000 genes

1 million protists x 6'000 genes

5 million insects x 14'000 genes

2 million fungi x 6'000 genes

0.5 million plants x 20'000 genes

0.5 million molluscs, worms, arachnids, etc. x 20'000 genes

0.1 million vertebrates x 25'000 genes

The calculation: 2×10^7 × 4'000 + 1×10^6 × 6'000 + 5×10^6 × 14'000 + 2×10^6 × 6'000 + 5×10^5 × 20'000 + 5×10^5 × 20'000 + 1×10^5 × 25'000 + 20'000 (Craig Venter) + 42 (Douglas Adams) + …

About 190 billion proteins (?)
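The seven taxon terms of this estimate can be checked with a few lines of Python. The sketch only restates the figures given on the slide; the two tongue-in-cheek terms (Craig Venter, Douglas Adams) and the open-ended "…" are deliberately left out.

```python
# (estimated number of species, assumed genes per species), as on the slide
ESTIMATE = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (1_000_000, 6_000),
    "insects": (5_000_000, 14_000),
    "fungi": (2_000_000, 6_000),
    "plants": (500_000, 20_000),
    "molluscs, worms, arachnids, etc.": (500_000, 20_000),
    "vertebrates": (100_000, 25_000),
}

total_species = sum(species for species, _ in ESTIMATE.values())
total_genes = sum(species * genes for species, genes in ESTIMATE.values())

print(f"{total_species:,} species")           # species covered by the estimate
print(f"{total_genes:,} protein-coding genes")  # about 190 billion
```

Compared with the roughly 12 million sequences known in 2010, this puts the databases at well under 0.01 % of the estimated total.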

About 12.0 million ‘known’ protein sequences in 2010 (from ~290’000 species)

More than 99 % of protein sequences are derived from the translation of nucleotide sequences

Less than 1 % come from direct protein sequencing (Edman, MS/MS…)

-> It is important that users know where a protein sequence comes from (sequencing & gene prediction quality)!

Menu

Introduction

Nucleic acid sequence databases: ENA/GenBank, DDBJ





ENA (EMBL-Bank) GenBank

DDBJ



http://www.insdc.org/

ENA/GenBank/DDBJ

cDNAs, ESTs, genes, genomes, …

ENA, GenBank, DDBJ

Data not submitted to public databases, delayed or cancelled…

The hectic life of a sequence …

archive of primary sequence data and corresponding annotation submitted by the

laboratories that did the sequencing.



Journals do not (SHOULD NOT) accept a paper dealing with a nucleic acid sequence if the ENA/GenBank/DDBJ accession number (AC) is not available…

‘journal publishers generally require deposition prior to publication so that an accession number can be included in the paper.’

…this is not yet the case for protein sequences!!!

taxonomy

Cross-references

references

accession number

CDS annotation

(Prediction or experimentally determined)

sequence

CDS: CoDing Sequence

(proposed by submitters)

annotation provided by the laboratories that did the sequencing

CoDing Sequence

[Figure: alignment between an mRNA (CONTIG) and a genomic sequence. The mRNA matches the genomic sequence in blocks (exons) separated by genomic-only stretches (introns); the mRNA read contains a few ambiguous bases (N).]

CDS translation provided by ENA

CDS provided by the submitters

The first Met !
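Because almost all protein sequences are obtained by translating an annotated CDS, the choice of the first Met directly determines the protein sequence. The following is a minimal Python sketch, assuming the standard genetic code and a made-up toy transcript (not a sequence from the slides): it translates from the first ATG to the first in-frame stop codon. Real gene prediction is far more involved (splicing, alternative starts, sequencing errors).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from_first_met(dna):
    """Translate from the first ATG up to (not including) the first stop."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) become X.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


toy_mrna = "TTCACCATGGCTAAAGGTCGATGGTAACC"  # hypothetical transcript
print(translate_from_first_met(toy_mrna))
```

Picking a different ATG, or a different frame, yields a different protein, which is why an incorrectly annotated start codon propagates straight into the protein databases.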



Very rarely done…

Pitfall no 1 – gene prediction ‘quality’

! ? !

Complete genome (submitted)

but only ~ 2,000 CDS/proteins available !

Pitfall no 2 – CDS annotation and submission

! ? !



http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

…annotated CDS in UniProtKB (no gene prediction)(~290’000 species)



From nucleic acid to amino acid sequence databases…

The hectic life of a protein sequence …

cDNAs, ESTs, genomes, …

ENA, GenBank, DDBJ


…if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries)

Protein sequence databases

Nucleic acid databases

Gene prediction: RefSeq, Ensembl

no CDS

Why do things the simple way when you can do them in a very complex one?

The hectic life of a sequence …

TrEMBL Genpept

CoDing Sequences provided by submitters

cDNAs, ESTs, genomes, …

ENA, GenBank, DDBJ


Swiss-Prot

RefSeq PRF

Scientific publications derived sequences

Ensembl

CCDS

UniParc

UniProtKB

PDB(PIR)

+ all ‘species’ specific databases (EcoGene, TAIR, …)

(IPI)

UniMES

CoDing Sequences provided by submitters

and gene prediction

TPA

Major protein sequence database ‘sources’

UniProtKB: Swiss-Prot + TrEMBL

NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA

PIR PDB PRF

UniProtKB/Swiss-Prot: manually annotated protein sequences (12’200 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundant with Swiss-Prot (290’000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (230’000 species ?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank: 3D data and associated sequences (PIR-NRL3D)

PRF: Protein Research Foundation : journal scan of ‘published’ peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)

TPA: Third Party Annotation Sequence Database (update of entries derived from GenBank primary data)

Integrated resources

‘cross-references’

Resources kept separated

Ensembl: UniProtKB + RefSeq + gene prediction (40 species)

Vega: Ensembl + gene prediction + manual annotation (5 species)

IPI: UniProtKB + RefSeq + Ensembl + TAIR (arabidopsis db) + H-InvDB (human cDNAs manual annotation) + VEGA (Vertebrate Genome Annotation) (7 species)

Closure in 2010 !!!

CCDS: consensus between EBI, NCBI, Sanger, UCSC (3 species)

Others:

OWL: Swiss-Prot + PIR + PDB + GenPept (obsolete)

MSDB (Mascot): Swiss-Prot + PIR + PDB + TrEMBL + GenBank…

dbESTs: translated ESTs (in the 6 frames; no annotated CDSs, no gene prediction)

Major protein sequence database ‘composite’



Phenyx: UniProtKB, IPI, NCBInr

Mascot: NCBInr, Swiss-Prot, dbEST

Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept

Aldente: UniProtKB.

ProFound: NCBInr, Swiss-Prot, dbEST

OMSSA: NCBInr and RefSeq.

Different protein databases available for different online proteomic tools

! ? !



Databases used

Study done on 28 proteomic papers (from 2010):

• The majority of labs ~61% use IPI

• ~18% use Swiss-Prot (mainly human, some bacteria)

• ~20% use other sources such as NCBI, SGD or in house developed databases

Personal communication: Silvia Jimenez, nov 2010

Menu

Introduction

Nucleic acid sequence databases: ENA/GenBank, DDBJ





UniProt

SIB + EBI + PIR

UniProtKB: protein sequence knowledgebase with 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~12 million entries)

UniParc: protein sequence archive (ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated (query, Blast, download) (~25 million entries)

UniRef: 3 sets of protein sequence clusters at 100, 90 and 50 % identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries)

UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)



UniProt databases



UniProtKB: an encyclopedia on proteins

composed of 2 sections: UniProtKB/TrEMBL and UniProtKB/Swiss-Prot

released every 4 weeks



UniProtKB

from ENA to TrEMBL

UniProtKB protein sequence data are mainly derived from ENA (CDS), but also from Ensembl and other sequence resources such as RefSeq or model organism databases (MODs).

Data from the PIR database have been integrated in UniProt since 2003.

TrEMBL

ENA

Automated extraction of protein sequence (translated CDS), gene name and references + automated annotation



The quality of UniProtKB/TrEMBL data, including the protein sequence, depends directly on the information provided by the submitter of the original nucleotide entry.

Automated annotation
• Redundancy check (100% merge)
• Family attribution (InterPro)
• Many other cross-references
• Rule-based automated annotation
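Rule-based automated annotation can be pictured as "if an entry matches these conditions, add this annotation". The Python sketch below is purely illustrative: the rule, the family identifier, the taxon and the keyword are hypothetical placeholders, not actual UniProt rules or InterPro accessions.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str
    taxonomy: list   # lineage, from the nucleotide entry
    families: set    # family/domain signatures found in the sequence
    keywords: set = field(default_factory=set)


@dataclass
class Rule:
    required_family: str
    required_taxon: str
    keyword: str     # annotation added when both conditions hold


def apply_rules(entry, rules):
    """Add the keyword of every rule whose conditions the entry satisfies."""
    for rule in rules:
        if (rule.required_family in entry.families
                and rule.required_taxon in entry.taxonomy):
            entry.keywords.add(rule.keyword)
    return entry


# Hypothetical rule and entries, for illustration only.
rules = [Rule("FAMILY_TOY_0001", "Metazoa", "Hypothetical keyword")]
matching = Entry("TOY001", ["Eukaryota", "Metazoa"], {"FAMILY_TOY_0001"})
other = Entry("TOY002", ["Bacteria"], {"FAMILY_TOY_0001"})

print(apply_rules(matching, rules).keywords)
print(apply_rules(other, rules).keywords)
```

Such annotation is only as good as the sequence it is applied to: a mispredicted CDS can still match a family signature and receive a plausible-looking annotation.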

! ? !



UniProtKB

from TrEMBL to Swiss-Prot

Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL

-> minimal redundancy

TrEMBL

ENA

Automated extraction of protein sequence (translated CDS), gene name and references + automated annotation

Manual annotation of the sequence and associated biological information

Swiss-Prot



Sequence

Sequence features

Ontologies

References

Nomenclature

Splice variants

Annotations



UniProtKB: from TrEMBL to Swiss-Prot

Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)

2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)




Manual annotation


2. Biological information (extract literature information, ortholog data propagation, protein sequence analysis…)



The displayed sequence is the most prevalent protein sequence and/or the protein sequence which is also found in orthologous species.

The displayed sequence is generally derived from the translation of the genomic sequence (when available).

Sequence differences are documented.

1 entry <-> 1 gene (1 species) 1 displayed sequence

(annotation of alternative sequences, when available)

UniProtKB/Swiss-Prot: protein sequence annotation



What is the current status?

• At least 20% of Swiss-Prot entries required at least a minimal amount of curation effort to obtain the “correct” sequence.

• Typical problems:
– unsolved conflicts;
– uncorrected initiation sites;
– frameshifts;
– other ‘problems’



… once upon a time, it was a gene on chromosome 11…



! ? !

… once upon a time, it was a gene on chromosome 11…

All these sequences are available in protein sequence databases (e.g. GenPept)!!!

Quality of protein information from genome projects

• Let's look at proteins originating from genome projects:

– Drosophila: the paradigm of what a curated genome should look like (thanks to FlyBase): only 1.8% of the gene models conflict with Swiss-Prot sequences;

– Arabidopsis: a typical example of a genome where a lot of annotation was done when it was sequenced, but with no update since then (at least in the public view): 20% of the gene models are erroneous;

– Tetraodon nigroviridis: the typical example of a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins.

– Bacteria and Archaea have almost no splicing, so predictions are “easier”; however, errors are still made: start codons, missed small proteins (<100 aa)…

UniProtKB/Swiss-Prot: protein sequence annotation



Example of problem (derived from gene prediction pipeline)

Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to ortholog sequences.

ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;

In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.

• Producing a clean set of sequences is not a trivial task;

• It is not getting easier as more and more types of sequence data are submitted;



• The ‘Protein existence’ tag indicates the evidence for the existence of a given protein (it is not associated with sequence quality);

• Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location), …)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58 %)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)

‘Protein existence’ tag



To avoid ‘pseudogenes’ and most improbable protein sequences, you can filter your query to exclude sequences whose ‘protein existence’ tag is ‘Uncertain’.
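The same filter can be applied to a downloaded flat file by reading the PE line of each entry. The following is a minimal Python sketch, assuming the flat-file layout shown in the URAD_HUMAN example (two-letter line codes, entries terminated by `//`); the second entry, its identifier and its accession are invented for illustration.

```python
import re

FLAT_FILE = """\
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
PE   4: Predicted;
//
ID   TOY1_HUMAN   Unreviewed;   100 AA.
AC   X0TOY1;
PE   5: Uncertain;
//
"""


def parse_entries(text):
    """Yield (primary accession, protein existence level) for each entry."""
    for block in text.split("\n//"):
        ac = re.search(r"^AC\s+(\w+);", block, re.MULTILINE)
        pe = re.search(r"^PE\s+(\d):", block, re.MULTILINE)
        if ac and pe:
            yield ac.group(1), int(pe.group(1))


def drop_uncertain(text, max_level=4):
    """Keep accessions whose PE level is at most max_level (5 = Uncertain)."""
    return [ac for ac, level in parse_entries(text) if level <= max_level]


print(drop_uncertain(FLAT_FILE))
```

Lowering `max_level` to 2 would keep only entries with evidence at protein or transcript level, at the cost of a much smaller search database.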



The ‘alternative’ sequence(s)



(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).

Proteome complexity: example with human

Not predictable at the genome level! -> important post-genomic data!

~20’000



Multiple alignment of the end of the available GCR sequences

Annotation of the sequence differences (protein diversity)

1 entry <-> 1 gene (1 species)

…and natural variant

P04150

www.uniprot.org

http://www.uniprot.org/



Available in separate files!

Important remark

> 30’000 additional sequences (total)



The ‘alternative’ sequence(s)

not ‘directly available’ to many tools, including protein identification tools and Blast, depending on the server!…




Manual annotation


2. Biological information (extract literature information, ortholog data propagation, protein sequence analysis…)



UniProtKB/Swiss-Prot gathers data from multiple sources:

- publications (literature/PubMed)
- prediction programs (Prosite, Anabelle)
- contacts with experts
- other databases
- nomenclature committees

Maximum usage of controlled vocabulary: Keywords, Tissues, Post-translational modifications, Strains, Species, Subcellular location, Extracellular domains, Journals, …

Gene Ontology…

Extract literature information and protein sequence analysis



Protein nomenclature

…enable researchers to obtain a summary of what is known about a protein…

General annotation

(Comments)

www.uniprot.org


Sequence annotation

(Features)

…enable researchers to obtain a summary of what is known about a protein…

www.uniprot.org




Ontologies

Swiss-Prot keywords

Gene Ontology (GO terms)



Human protein manual annotation: some statistics (Aug 2010)



Non-experimental qualifiers

UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two.

Level | Type of evidence | Qualifier
1st | Strong experimental evidence | -
2nd | Light experimental evidence | Probable
3rd | Inferred by similarity with homologous protein (data of 1st or 2nd level) | By similarity
4th | Inferred by sequence prediction | Potential

Phenyx: UniProtKB**, IPI, NCBInr

Mascot: NCBInr, Swiss-Prot*, dbEST, OWL, MSDB

Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL

Aldente: UniProtKB**


OMSSA: NCBInr and RefSeq

** the tool takes into account AP, PTM, …but not yet variants /conflict annotations

* the tool takes into account AP annotations

Do proteomic analysis tools make use of sequence annotation ?



Identification of biologically active proteins:

use Swiss-Prot annotations with Phenyx

• Sequence processing annotations

– Removal of signal peptides
– Removal of transit peptides
– Extraction of active chains

• Post-translational modifications

• Sequence variants

– Splicing variants

– Sequence mutations
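Sequence processing annotations matter for identification because the peptides observed by MS come from the mature protein, not from the precursor stored in the database. The following is a minimal Python sketch of how a tool might apply such annotations; the precursor sequence and the feature coordinates are hypothetical, and this is not a description of how Phenyx is implemented.

```python
def extract_region(sequence, start, end):
    """Return residues start..end (1-based, inclusive), as in feature tables."""
    return sequence[start - 1:end]


# Hypothetical precursor with a 14-residue signal peptide.
precursor = "MKLLVVAALLSAQAGSPEPTIDEKR"
features = [
    ("SIGNAL", 1, 14),   # removed during maturation
    ("CHAIN", 15, 25),   # the biologically active chain
]

mature_chains = [
    extract_region(precursor, start, end)
    for kind, start, end in features
    if kind == "CHAIN"
]
print(mature_chains)
```

Searching with the mature chain rather than the precursor makes the true N-terminal peptide matchable; with the precursor it would be missed.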



Access to UniProtKB

www.uniprot.org



www.uniprot.org



Search

A very powerful text search tool with autocompletion and refinement options, for finding UniProt entries and documentation by biological information



The search interface guides users with helpful suggestions and hints



Result pages: highly customizable

The URL (results) can be bookmarked and manually modified.



Blast

A tool with the standard options for searching sequences in UniProt databases

Blast results: customize display



Align

A ClustalW multiple alignment tool with amino acid highlighting options and a feature annotation highlighting option



ClustalW multiple alignment of insulin sequences



Retrieve

A UniProt-specific tool for retrieving a list of entries in several standard formats.

You can then query your ‘personal database’ with the UniProt search tool.



AC

A large protein list is not the end point in proteomics

-> importance of protein sequence annotation



Retrieve tool (UniProt)



Play with the customize display tool…



Human proteins functional distribution

Maybe

Potentially

Putative

Expected

Probably

Hopefully

~40 % of human proteins have no known function (experimental data)… but many more are associated with GO terms (computer-assigned).



ID Mapping

Provides a mapping between the identifiers of different databases for a given protein



These identifiers all point to TP53 (p53):

Question: same protein sequence?

P04637, NP_000537, ENSG00000141510, CCDS11118, GC17M007512, UPI000002ED67, IPI00025087, etc.

- Specific and unique for each database…
- Essential for retrieving and citing your data
- Beware of their ‘stability’… (linked to a protein sequence or to a gene)
- ID mapping tools ! ? !
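Before mapping identifiers, it helps to recognise which database each one comes from. The Python sketch below classifies the TP53 identifiers listed above by their shape. The UniProtKB pattern follows the published accession number format; all other patterns are simplified assumptions inferred from the examples on the slide and may not cover every valid identifier.

```python
import re

# Checked in order; first match wins. Only the UniProtKB pattern follows a
# published format, the others are simplified guesses from the examples.
PATTERNS = [
    ("UniProtKB AC", r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"),
    ("RefSeq protein", r"[A-Z]P_\d+(?:\.\d+)?"),
    ("Ensembl gene", r"ENSG\d{11}"),
    ("CCDS", r"CCDS\d+(?:\.\d+)?"),
    ("UniParc", r"UPI[0-9A-F]{10}"),
    ("IPI", r"IPI\d{8}(?:\.\d+)?"),
    ("GeneCards", r"GC\d{2}[MP]\d+"),
]


def classify_identifier(identifier):
    """Return the name of the first database whose pattern matches fully."""
    for database, pattern in PATTERNS:
        if re.fullmatch(pattern, identifier):
            return database
    return "unknown"


tp53_ids = ["P04637", "NP_000537", "ENSG00000141510", "CCDS11118",
            "GC17M007512", "UPI000002ED67", "IPI00025087"]
for identifier in tp53_ids:
    print(identifier, "->", classify_identifier(identifier))
```

Note that some of these identifiers denote a gene (Ensembl gene, GeneCards) and others a specific protein sequence, which is exactly why "same protein sequence?" has no automatic answer.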



Complete proteomes

• http://www.uniprot.org/taxonomy/?query=complete:yes

KW: Complete proteome



Download



Downloading UniProt: http://www.uniprot.org/downloads



UniProtKB

Statistics

520’000 + 11’60’000 12’000’000

Swiss-Prot & TrEMBL introduce a new arithmetical

concept !

Redundancy in TrEMBL&

Redundancy between TrEMBL and Swiss-Prot

12’000 species 290’000 species

Swiss-Prot TrEMBL



UniProtKB/Swiss-Prot (manual annotation)

UniProtKB/TrEMBL (automatic annotation)

12’000 species, mainly model organisms

Not yet available

~ 200 new entries / day new release every 4 weeks

- Annotation is useful, good annotation is better, update is essential!

- Some entries have gone through more than 120 versions since their integration in UniProtKB/Swiss-Prot

UniProtKB entry history

Always cite the primary accession number (AC) !

Menu

Introduction

Nucleic acid sequence databases: ENA/GenBank, DDBJ






(Entrez protein, NCBI nr)

http://www.ncbi.nlm.nih.gov/protein

Major protein sequence database ‘sources’

UniProtKB: Swiss-Prot + TrEMBL


PIR PDB PRF

UniProtKB/Swiss-Prot: manually annotated protein sequences (12’200 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundant with Swiss-Prot (290’000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (230’000 species ?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank: 3D data and associated sequences

PRF: Protein Research Foundation : journal scan of ‘published’ peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)

Integrated resources

‘cross-references’

Resources kept separated



RefSeq: produced by NCBI and NLM

Information: http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=ch1#GenBank_ASM

http://www.ncbi.nlm.nih.gov/RefSeq/


• Reference Sequence (RefSeq)

• provides one example of each natural biological molecule (protein, mRNA, genomic DNA) for major organisms (11’000 species)

-> several entries for the same gene (if alternative splicing)

• gene prediction

• manual annotation (‘reviewed’ tagged entries)

• Annotations are mainly found in Entrez Gene (« interdependent curated resources »)

• Can be queried via the Entrez protein system

• Nice accession numbers: NP_, NM_, etc…

RefSeq



Query RefSeq



KW

AC

Taxonomy

References

GenBank source and status

Annotation and ontologies



UniProtKB vs RefSeq



RefSeq chooses one or several protein reference sequences for a given gene; it does not annotate the sequence differences.

- If there is an alternative splicing event, there will be several distinct entries for a given gene

Example: GCR_HUMAN

GCR_HUMAN (UniProtKB/Swiss-Prot): 1 UniProtKB entry cross-linked with 7 RefSeq entries



GI number ‘GenInfo identifier’ number

- In addition to an AC number specific to the original database, each protein sequence in the NCBInr database (including Swiss-Prot entries) has a GI number.



AC



GI number: ‘GenInfo identifier’ number

- If the sequence changes in any way, a new GI number will be assigned:

- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record:

http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi





ID/AC mapping



http://www.ebi.ac.uk/Tools/picr/



Protein databases for

proteomic analysis…



IPI: http://www.ebi.ac.uk/IPI/IPIhelp.html

IPI Closure in 2010 !



Automatic approach that builds clusters by combining knowledge already present in the primary data sources (UniProtKB, RefSeq, Ensembl) with sequence similarity.

IPI=UniProtKB + RefSeq + Ensembl (+ H-InvDB, TAIR and VEGA)

human, mouse, rat, zebrafish, Arabidopsis, chicken, and cow proteomes



No annotation



Mascot

http://www.matrixscience.com/search_intro.html



Update November, the 11th….

ENA/GenBank/DDBJ | RefSeq (www.ncbi.nlm.nih.gov/RefSeq/) | UniProt (www.uniprot.org)
Protein and nucleotide data | Genomic, RNA and protein data | Protein data only
Biological data added by the submitters (gene name, tissue…) | Biological data annotated by curators, also found in the corresponding Entrez Gene entry | Biological data annotated by curators (Swiss-Prot), within the entry
Not curated | Partially manually curated (‘reviewed’ entries) | Manually curated in Swiss-Prot, not in TrEMBL
Author submission | NCBI creates from existing data | UniProt creates from existing data
Only author can revise (except TPA) | NCBI revises as new data emerge | UniProt revises as new data emerge
Multiple records for same loci common | Single records for each molecule of major organisms | Single records for each protein of major organisms (in Swiss-Prot; TrEMBL is redundant)
Records can contradict each other | - | Identification and annotation of discrepancy
No limit to species included | Limited to model organisms | Priority (but not limited) to model organisms
Data exchanged among INSDC members | NCBI database; collaboration with UniProt | UniProt database; collaboration with NCBI (RefSeq, CCDS)



All documents are online

http://education.expasy.org/cours/CJSFEAP2010/



Additional material



PIR



PIR: the Protein Identification Resource

PIR-PSD is no longer updated, but exists as an archive



PDB



• PDB (Protein Data Bank), 3D structure

• Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies

• Also contains the corresponding protein sequences

* The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools

• Includes protein sequences which are mutated, chimeric, etc. (created specifically to study the effect of a mutation on the 3D structure)



PDB: Protein Data Bank (www.rcsb.org/pdb/)

• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).

• Specialized programs allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).

• Currently there are ~67’000 structures for about 15’000 different proteins, but far fewer protein families (highly redundant)!



PDB: example



Coordinates of each atom

Sequence



Visualisation with Jmol



PRF

Protein Research Foundation



http://www.genome.jp/dbget-bin/www_bfind?prf

Looks for peptide sequences described in publications (and which have not been submitted to databases!!!)



Ensembl: http://www.ensembl.org/

Review: http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610

Annotation pipeline: http://www.genome.org/cgi/content/full/14/5/942



- Ensembl: aligns the genomic sequences with all the sequences found in ENA, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)

- Also does gene prediction (-> novel genes)

Ensembl = UniProtKB + RefSeq + gene prediction

- DNA, RNA and protein sequences available for several species.

- Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant, fungal and metazoan genomes.



Example of problem

Ensembl completes the human ‘proteome’ by annotating missing genes according to orthologous sequences.

ID   URAD_HUMAN              Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;

In primates the genes coding for the enzymes catalyzing the degradation of uric acid were inactivated and converted to pseudogenes.
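Entries of this kind can be flagged automatically before they are trusted in a search database. The sketch below reads two fields from a UniProtKB flat-file entry: the review status on the ID line and the protein existence (PE) level. The function names are hypothetical, and only a few lines of the entry shown above are reproduced.

```python
# A few lines of the URAD_HUMAN entry shown on the slide.
ENTRY = """\
ID   URAD_HUMAN              Unreviewed;       171 AA.
AC   A6NGE7;
DR   Ensembl; ENSG00000183463; Homo sapiens.
PE   4: Predicted;
"""


def parse_flatfile(text):
    """Group the content of each line under its two-letter line code."""
    fields = {}
    for line in text.splitlines():
        code, _, rest = line.partition(" ")
        if len(code) == 2:
            fields.setdefault(code, []).append(rest.strip())
    return fields


def evidence_summary(fields):
    """Return entry name, review status and PE level of one entry."""
    # ID line: entry name, status ("Reviewed;" or "Unreviewed;"), length.
    id_parts = fields["ID"][0].split()
    # PE line: level 1 (evidence at protein level) to 5 (uncertain).
    pe_level = int(fields["PE"][0].split(":")[0])
    return {
        "name": id_parts[0],
        "reviewed": id_parts[1].rstrip(";") == "Reviewed",
        "pe_level": pe_level,
    }


summary = evidence_summary(parse_flatfile(ENTRY))
print(summary)
```

An unreviewed entry with PE level 4, as here, is only a prediction. It is a candidate for closer inspection rather than evidence that the protein exists.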



CCDS



http://www.ncbi.nlm.nih.gov/CCDS/



CCDS (human)

Combining different approaches (ab initio, by similarity) and taking advantage of the expertise acquired by different institutes, including manual annotation…

Consensus between 4 institutions…



UniParc



UniParc

- Non-redundant protein sequence archive, containing both active and inactive sequences (including sequences which are not in UniProtKB, e.g. immunoglobulins)

- The equivalent of ENA/GenBank/DDBJ at the protein level

- Species-merged: sequences from different species are merged when they are 100% identical over their whole length.

- No annotation (only taxonomy)

- Can be searched only with database names, taxonomy, checksum (CRC64), accession numbers (ACs), or UniProtKB, UniRef and UniParc IDs.

- Beware: contains wrong predictions, pseudogenes, etc.
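The species-merged, checksum-searchable archive described above can be illustrated with a small sketch. It makes two assumptions: the accessions and sequences are invented, and MD5 stands in for the CRC64 checksum that UniParc actually uses, only because MD5 ships with the Python standard library.

```python
import hashlib


def checksum(sequence):
    """Stand-in checksum (MD5); UniParc itself uses CRC64."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def build_archive(records):
    """Merge records whose sequences are 100% identical over their length.

    records: iterable of (source_db, accession, taxon, sequence).
    Returns a dict keyed by sequence, so identity is exact and the
    checksum is only a convenient search handle.
    """
    archive = {}
    for source_db, accession, taxon, sequence in records:
        sequence = sequence.upper()
        entry = archive.setdefault(
            sequence, {"checksum": checksum(sequence), "sources": []}
        )
        entry["sources"].append((source_db, accession, taxon))
    return archive


def find_by_checksum(archive, value):
    """Return the sequences whose checksum equals the given value."""
    return [seq for seq, entry in archive.items() if entry["checksum"] == value]


# Hypothetical records: the first two share a sequence across species.
RECORDS = [
    ("UniProtKB", "X00001", "Homo sapiens", "MKTAYIAKQR"),
    ("RefSeq", "XP_000001", "Pan troglodytes", "MKTAYIAKQR"),
    ("Ensembl", "ENSP_TEST1", "Homo sapiens", "MSSPLLK"),
]
archive = build_archive(RECORDS)
print(len(archive), "archive entries for", len(RECORDS), "source records")
```

Because nothing but the sequence decides the merge, a wrong prediction or a translated pseudogene is archived exactly like a well-characterized protein.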



Query UniParc



UniRef



‘UniRef is useful for comprehensive BLAST similarity searches by providing sets of representative sequences’



«Collapsing BLAST results»

Three collections of sequence clusters from UniProtKB and selected UniParc entries:

One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments are grouped in a single record) -> reduction of 12 %

One UniRef90 entry -> sequences that are at least 90 % identical -> reduction of 40 %

One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %

Based on sequence identity -> Independent of the species !
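The UniRef100 rule above (identical sequences and sub-fragments grouped in a single record, whatever the species) can be sketched with plain substring containment. This is an illustration only, not the clustering procedure actually used to build UniRef; the identifiers and sequences are invented, and the 90 % and 50 % levels would need a real sequence alignment, which is not shown.

```python
def cluster_100(sequences):
    """Group identical sequences and sub-fragments under one representative.

    sequences: dict mapping an identifier to its sequence.
    Returns a dict mapping each representative id to its member ids.
    """
    clusters = {}
    # Longest first, so a fragment always meets its full-length parent.
    ordered = sorted(sequences, key=lambda i: (-len(sequences[i]), i))
    for seq_id in ordered:
        for rep_id in clusters:
            if sequences[seq_id] in sequences[rep_id]:
                clusters[rep_id].append(seq_id)
                break
        else:
            clusters[seq_id] = [seq_id]  # no parent found: new cluster
    return clusters


def reduction(sequences, clusters):
    """Fraction of records saved by keeping one record per cluster."""
    return 1 - len(clusters) / len(sequences)


# Hypothetical input: "b" duplicates "a", "c" is a fragment of "a".
SEQUENCES = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQR",
    "c": "TAYIA",
    "d": "MSSSS",
}
clusters = cluster_100(SEQUENCES)
print(clusters, reduction(SEQUENCES, clusters))
```

Species never enters the comparison, which is why a single cluster can mix sequences from different organisms.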



UniRef90: independent of species and sequence length



UniMES



The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental protein data (only GOS data for the moment).

Download only (but included in UniParc -> Blast).

- UniMES Fasta sequences
- UniMES matches to InterPro methods

ftp://ftp.uniprot.org/pub/databases/uniprot



UniMES: sequences in fasta format
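Since UniMES is distributed as download-only FASTA files, a small reader is usually the first step in using it. The following is a minimal sketch of a generic FASTA parser; the headers and sequences in the sample are invented and do not reproduce real UniMES records.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records, for illustration only.
SAMPLE = """\
>SAMPLE_0001 hypothetical environmental sequence
MKTAYIAKQR
QISFVKSHFS
>SAMPLE_0002 another hypothetical sequence
MSSPLLK
"""

entries = list(read_fasta(SAMPLE.splitlines()))
for header, sequence in entries:
    print(header.split()[0], len(sequence))
```

For a downloaded file, the same function can be given an open file handle instead of a list of lines, because it only iterates over its input.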



Phenyx: UniProtKB***, IPI, NCBInr

Mascot: NCBInr, Swiss-Prot*, dbEST, OWL, MSDB

Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL.

Aldente: UniProtKB***.


OMSSA: NCBInr and RefSeq.

Translation of EST sequences in the 6 reading frames (ESTs are not associated with annotated CDSs!)

*** the tool takes into account alternative products (AP), PTM, variant and conflict annotations

* the tool takes into account AP annotations (but not in the online public version)
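The six-frame translation of ESTs mentioned above is needed because an EST carries no annotated CDS, so neither the strand nor the reading frame is known. The sketch below assumes the standard genetic code; the function names and the input sequence are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; '*' is a stop, 'X' an unknown codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Translate both strands in the three possible reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCC").items():
    print(name, protein)
```

At most one of the six translations corresponds to the real protein, so a search against translated ESTs works through many more candidate sequences than a search against annotated entries.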



Use of UniProtKB/Swiss-Prot annotation by Phenyx



Improve Identification of biologically active proteins:

Use Swiss-Prot annotations with Phenyx

• Sequence processing annotations

– Removal of signal peptides
– Removal of transit peptides
– Extraction of active chains

• Post-translational modifications

• Sequence variants

– Splicing variants

– Sequence mutations
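How such annotations change the sequences that are actually searched can be sketched as follows. This is not Phenyx code: the precursor sequence, the feature coordinates and the function names are all hypothetical. The only assumption borrowed from UniProtKB is that feature positions are 1-based and inclusive.

```python
def extract_chain(sequence, start, end):
    """Return the region start..end (1-based, inclusive) of a precursor."""
    return sequence[start - 1:end]


def apply_variant(sequence, position, original, replacement):
    """Replace one residue, checking that the annotation fits the sequence."""
    if sequence[position - 1] != original:
        raise ValueError(f"expected {original} at position {position}")
    return sequence[:position - 1] + replacement + sequence[position:]


# Hypothetical precursor: signal peptide 1-10, mature chain 11-17,
# and one annotated variant K -> R at position 13.
PRECURSOR = "MKALLLAVSAGDKPTYR"
CHAIN = (11, 17)
VARIANT = (13, "K", "R")

mature = extract_chain(PRECURSOR, *CHAIN)
mature_variant = extract_chain(apply_variant(PRECURSOR, *VARIANT), *CHAIN)

# Sequences a search could consider instead of the raw precursor only.
search_space = {PRECURSOR, mature, mature_variant}
print(sorted(search_space))
```

A peptide that starts at the first residue of the mature chain does not follow a protease cleavage site in the precursor, which is one reason why the unprocessed sequence alone can miss the biologically active form.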



Phenyx and alternative sequences

http://www.genebio.com/products/phenyx/features.html#section4
