protein sequence databases -...

Salvador Martínez de Bartolomé [email protected] Bioinformatics support ProteoRed Proteomics Facility, National Center for Biotechnology, Madrid 1. Proteomics database contents Protein sequence databases


Post on 21-Jul-2020


Page 1: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Salvador Martínez de Bartolomé

[email protected]

Bioinformatics support – ProteoRed

Proteomics Facility, National Center for Biotechnology, Madrid

1. Proteomics database contents

Protein sequence databases

Page 2

Menu

Introduction : bioinformatics and sequence databases

Nucleic acid sequence databases

Protein sequence databases (sources)

Protein sequence databases (other)

Page 3

Biology of the XXI century

Three major developments:

High-throughput analysis techniques: DNA sequencing, mass spectrometry, micro-

Numerous biological databases available through the Web

Bioinformatics tools available through the Web

Page 4

[Figure: a scattered cloud of unlabelled "Tool" and "database" boxes]

An overwhelming number of unordered resources

Page 5

[Figure: the same "Tool" and "database" boxes, now grouped under category labels]

Category labels shown in the figure:

- Protein Sequence
- 3° Structure
- Protein 2D PAGE & MS
- PTM
- Protein identification & characterization
- PTM Prediction tool
- 1° Structure Analysis
- 3° Structure Prediction
- Nucleotide Amino Acid Translator
- Sequence Alignment
- Similarity Search
- Gene Expression
- Protein Interactions
- Species / Genomic
- Functional
- 2° Structure Prediction
- Subcellular localization
- Polymorphism / Mutation / Disease
- Topology Prediction
- Pattern & Profile search
- Domains & classification
- 2° Structure
- Phylogenetics & Taxonomy
- References / nomenclature
- Nucleotide sequence repository

Page 6

[Figure: the category map from the previous slide, with example resources named]

Resources named in the figure: UniProtKB (Swiss-Prot/TrEMBL), TargetP, EcoGene, Ensembl, FlyBase, MGD, SGD, SubtiList, TIGR CMR, HIV, TAIR, MEROPS, ENZYME, TRANSFAC, KEGG, HAMAP, PROSITE, InterPro, Pfam, ProDom, BLOCKS, TIGRFAM, ProtoMap, CATH, SCOP, PDB, SWISS-MODEL, ScanProsite, MotifScan, HSSP, Jpred, GOR, DIP, IntAct, ProtScale, ProtParam, BLAST, FASTA, dbSNP, GeneCards, OMIM, CleanEx, DDBJ, GenBank, EMBL, TreeBase, NEWT, Taxonomy, PSORT, Glycosuite, PhosphBase, NetOGlyc, ChloroP, PeptideMass, Mascot, Phenyx, ECO2DBASE, Siena-2D PAGE, SWISS-2D PAGE, TMHMM, SOSUI, PubMed, HUGO, GO, ClustalW, DIALIGN, Translate

Page 7

Molecular bioinformatics: an operational definition

The application of computer science to molecular biology…

…in particular for the study of macromolecules such as proteins, nucleic acids and oligosaccharides

Page 8

Protein sequence databases

- Identification of proteins by proteomics --> completeness, sequence quality

- Similarity searches (functional prediction) --> sequence quality (non-redundancy)

- Training datasets (prediction tools) --> sequence and annotation quality

- Genome annotation…

Page 9

(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).

Proteome complexity

Not predictable at the genome level!

Page 12

… ~ 1630 genomes sequenced (single organism, varying sizes)

… ~ 952 ongoing genome sequencing projects

Page 13

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj

Page 14

http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html

Page 15

… ~ 1630 genomes sequenced (single organism, varying sizes)

… ~ 952 ongoing genome sequencing projects

… ~ 200 metagenome sequencing projects (environmental samples: multiple 'unknown' organisms, varying sizes)

Ecological metagenomes: beach sand, Sargasso Sea….

Organismal metagenomes: mouse gut

~ 17 million sequences being processed at Venter Institute

Page 16

How many 'protein' sequences at the end?

For fun, an estimate: ~30 million species (1.5 million named)

20 million bacteria/archaea x 4'000 genes (182-8500)

5 million protists x 6'000 genes

3 million insects x 14'000 genes

1 million fungi x 6'000 genes

0.6 million plants x 20'000 genes

0.2 million molluscs, worms, arachnids, etc. x 20'000 genes

0.2 million vertebrates x 25'000 genes

The calculation:

$$2\times10^{7}\times4000 + 5\times10^{6}\times6000 + 3\times10^{6}\times14000 + 10^{6}\times6000 + 6\times10^{5}\times20000 + 2\times10^{5}\times20000 + 2\times10^{5}\times25000 = 1.79\times10^{11}$$

= 179'000'000'000

AMB, SP20
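
The back-of-the-envelope estimate on this slide can be re-checked with a few lines of Python. This is only a sketch: the species counts and genes-per-genome figures are copied from the slide, the names `ESTIMATES`, `total_species` and `total_proteins` are made up for illustration, and the "one protein per gene" simplification is an assumption that ignores splicing and post-translational variants.

```python
# (group, estimated number of species, assumed genes per genome), from the slide
ESTIMATES = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 5_000_000, 6_000),
    ("insects", 3_000_000, 14_000),
    ("fungi", 1_000_000, 6_000),
    ("plants", 600_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 200_000, 20_000),
    ("vertebrates", 200_000, 25_000),
]


def total_species(estimates):
    """Sum the species counts over all groups."""
    return sum(species for _, species, _ in estimates)


def total_proteins(estimates):
    """Sum species x genes, counting one protein sequence per gene."""
    return sum(species * genes for _, species, genes in estimates)


if __name__ == "__main__":
    for group, species, genes in ESTIMATES:
        print(f"{group:35s} {species * genes:>18,d}")
    print(f"{'total species':35s} {total_species(ESTIMATES):>18,d}")
    print(f"{'total protein sequences':35s} {total_proteins(ESTIMATES):>18,d}")
```

The per-group products make it easy to see that bacteria and archaea dominate the total, even though each genome is small.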

Page 17

Protein sequence origin

About 4.5 million 'known' protein sequences (in 2007)

More than 99 % of the protein sequences are derived from the translation of nucleotide sequences

Less than 1 %: direct protein sequencing (Edman, MS/MS…)

-> It is important that users know where the protein sequence comes from (sequencing & gene prediction quality)!

Page 18

Menu

Introduction : bioinformatics and sequences

Nucleic acid sequence databases

Protein sequence databases (sources)

Protein sequence databases (other)

Page 19

cDNAs, ESTs (expressed sequence tags), genes, genomes, …

EMBL, GenBank, DDBJ

Data not submitted to public databases*, delayed or cancelled…

http://www.insdc.org/

The hectic life of a sequence …

EMBL: http://www.ebi.ac.uk/embl/

GenBank: http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html

DDBJ: http://www.ddbj.nig.ac.jp/

Page 20

Contribution: EMBL 10 %; GenBank 75 %; DDBJ 15 %

Page 21

Goal

- to accept, process and make freely available sequence data from individual researchers, research groups and patent offices

- available via SRS/Entrez, FTP, web services and similarity search tools

Page 22

The tremendous increase in nucleotide sequences

1980: 80 genes fully sequenced !

http://www3.ebi.ac.uk/Services/DBStats/

Page 23

EMBL/GenBank/DDBJ

• Serve as archives: 'nothing goes out'

• Contain all public sequences derived from:

– Genome projects (> 80 % of entries)

– Sequencing centers (cDNAs, ESTs…)

– Individual scientists (15 % of entries)

– Patent offices (e.g. European Patent Office, EPO)

• Currently: ~152 x 10^6 sequences, ~242 x 10^9 bp

• Sequences from > 260'000 different species

Page 24

human

mouse

rat

http://www3.ebi.ac.uk/Services/DBStats/

More than 260'000 species, but…

Human/Mouse/Rat: organisms with the highest redundancy !

Page 25

Where was the sequenced specimen collected?

Geographical origin of sequenced samples (since 2005) (lat_lon: latitude_longitude qualifier)

http://www3.ebi.ac.uk/Services/EMBLWorld/EMBLWorld.pl
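
To place samples on a map, the `lat_lon` value has to be turned into signed decimal coordinates. The sketch below assumes the usual INSDC layout of the qualifier value, a latitude followed by `N` or `S` and a longitude followed by `E` or `W` (for example `47.94 N 28.12 W`); the function name `parse_lat_lon` is hypothetical and not part of any EMBL tool.

```python
import re

# latitude, hemisphere, longitude, hemisphere, e.g. "47.94 N 28.12 W"
_LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s+([NS])\s+(\d+(?:\.\d+)?)\s+([EW])\s*$"
)


def parse_lat_lon(value):
    """Convert a lat_lon qualifier value to (latitude, longitude) in signed degrees."""
    match = _LAT_LON.match(value)
    if match is None:
        raise ValueError(f"not a lat_lon value: {value!r}")
    latitude = float(match.group(1))
    longitude = float(match.group(3))
    if latitude > 90 or longitude > 180:
        raise ValueError(f"coordinates out of range: {value!r}")
    # South and West are expressed as negative numbers.
    if match.group(2) == "S":
        latitude = -latitude
    if match.group(4) == "W":
        longitude = -longitude
    return latitude, longitude


if __name__ == "__main__":
    print(parse_lat_lon("47.94 N 28.12 W"))
```

Entries without the qualifier, or with free-text locations, would simply fail the pattern and need separate handling.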

Page 26

A very important annotation for proteomics: the CoDing Sequence (CDS)

(in particular for eukaryotes)

EMBL/GenBank/DDBJ

Page 27

cDNAs, ESTs, genes, genomes, …

EMBL, GenBank, DDBJ

Data not submitted to public databases*, delayed or cancelled…

with or without annotated CDS

provided by authors

CDS: CoDing Sequence

portion of DNA/RNA translated into protein (from Met to STOP)

Experimentally proven or derived from gene prediction
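
The "from Met to STOP" definition can be made concrete with a small translation routine. This is a minimal sketch that assumes the standard genetic code, an ATG start codon and a single terminal stop codon; `translate_cds` and the other names are illustrative, and real entries may use alternative start codons or other genetic codes.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a CDS (DNA or RNA) that runs from the start Met to the stop codon."""
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    if not seq.startswith("ATG"):
        raise ValueError("CDS does not start with ATG (Met)")
    protein = []
    for i in range(0, len(seq), 3):
        # Codons with ambiguous bases are reported as X.
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            if i != len(seq) - 3:
                raise ValueError("stop codon inside the CDS")
            return "".join(protein)
        protein.append(aa)
    raise ValueError("CDS does not end with a stop codon")


if __name__ == "__main__":
    print(translate_cds("ATGGCCAAATAA"))
```

The strict checks are a design choice: a CDS that fails them is often a sign of a partial sequence or a wrong gene prediction, which is exactly the quality issue raised on the following slides.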

Page 29

Problem 1: Complete genome (submitted)

only ~2,015 CDS available!

Page 30

http://www3.ebi.ac.uk/Services/DBStats/

At the protein level (example with UniProtKB/TrEMBL): the CDS of viruses and bacteria are easy to obtain!

human

mouse

rat

At the nucleic acid level

At the protein level

http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

Page 31

Problem 2: Variable level of sequence quality

- Sequencing quality

- Gene prediction quality

Authors can specify the nature of the CDS by using the qualifier: "/evidence=experimental" or "/evidence=not_experimental".

Very rarely done…
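
One way to see how rarely the qualifier is used is to count it against the number of CDS features in a flat file. The sketch below works on plain text and assumes EMBL-style `FT` lines and at most one `/evidence` qualifier per CDS; the sample entry and the names `count_evidence` and `SAMPLE` are invented for illustration. Matching is case-insensitive because the spelling in real entries may differ from the slide.

```python
import re
from collections import Counter

EVIDENCE = re.compile(r"/evidence=(experimental|not_experimental)\b", re.IGNORECASE)
CDS_FEATURE = re.compile(r"^FT\s+CDS\s", re.MULTILINE)

# Hypothetical feature-table fragment, not taken from a real entry.
SAMPLE = """\
FT   CDS             1..300
FT                   /evidence=experimental
FT   CDS             400..900
FT   CDS             1000..1500
FT                   /evidence=not_experimental
FT   CDS             1600..2100
"""


def count_evidence(feature_table_text):
    """Count CDS features by evidence label; unlabelled CDS are reported as 'unspecified'."""
    counts = Counter(label.lower() for label in EVIDENCE.findall(feature_table_text))
    n_cds = len(CDS_FEATURE.findall(feature_table_text))
    counts["unspecified"] = n_cds - sum(counts.values())
    return counts


if __name__ == "__main__":
    print(count_evidence(SAMPLE))
```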

Page 33

UniProtKB/Swiss-Prot protein knowledgebase, release 56.6 statistics (16-Dec-08)

Protein existence (PE)              %
1: At protein level              15.3
2: Evidence at transcript level  15.8
3: Inferred from homology        65.2
4: Predicted                      3.4
5: Uncertain                      0.3

http://www.expasy.org/sprot/relnotes/relstat.html
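
A quick check of these percentages shows how small the experimentally supported fraction is. The figures below are copied from the slide; grouping PE levels 1 and 2 as "evidence at protein or transcript level" is an interpretation made for this sketch, and the names `PE_SHARE` and `share` are hypothetical.

```python
# Protein existence level -> (label, percentage of Swiss-Prot entries, release 56.6)
PE_SHARE = {
    1: ("At protein level", 15.3),
    2: ("Evidence at transcript level", 15.8),
    3: ("Inferred from homology", 65.2),
    4: ("Predicted", 3.4),
    5: ("Uncertain", 0.3),
}


def share(levels):
    """Return the summed percentage for the given PE levels, rounded to one decimal."""
    return round(sum(PE_SHARE[level][1] for level in levels), 1)


if __name__ == "__main__":
    print("all levels:", share(PE_SHARE))
    print("evidence at protein or transcript level:", share([1, 2]))
    print("homology, predicted or uncertain:", share([3, 4, 5]))
```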

Page 34

Problem 3: highly redundant

Sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted and published by their authors

(primary sequence repository)

-> Similarity searches are not straightforward…
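
The effect of redundancy on a search database can be illustrated by collapsing entries that carry exactly the same sequence. This is a minimal sketch assuming FASTA input and exact sequence identity as the only merging criterion; the accessions in `SAMPLE` are invented, and curated non-redundant databases apply richer rules than this.

```python
from collections import defaultdict

# Hypothetical FASTA records: two submissions of the same sequence and one distinct entry.
SAMPLE = """\
>entry_A first submission
MAKLVVT
>entry_B later submission of the same protein
MAKL
VVT
>entry_C
MSTNPKPQR
"""


def parse_fasta(text):
    """Return a list of (accession, sequence) pairs from FASTA-formatted text."""
    records, accession, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if accession is not None:
                records.append((accession, "".join(chunks)))
            accession, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if accession is not None:
        records.append((accession, "".join(chunks)))
    return records


def collapse_identical(records):
    """Group accessions by identical sequence: {sequence: [accession, ...]}."""
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence].append(accession)
    return dict(groups)


if __name__ == "__main__":
    for sequence, accessions in collapse_identical(parse_fasta(SAMPLE)).items():
        print(len(accessions), accessions, sequence)
```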

Page 35

Problem 4: Author authority

--> variable quality of the annotation (CDS and other), e.g. gene/protein name attribution…

Page 36

EMBL/GenBank/DDBJ

The authors have full authority over the content of the entries they submit !

(editorial control of the content belongs to the authors)

(exception: TPA (Third Party Annotation), since January 2003)

Page 37

'Problem' 5: Environmental samples…

Page 38

Environmental sequences (ENV)

Aim: to sequence all DNA present in a given sample, without knowing which species the DNA is derived from

- Sargasso Sea (Craig Venter)

- human fluids

- soil

Page 39

Page 41

No idea of the species… (microbial population…)

No idea of the gene prediction program to be used…

No idea of the genetic code to be used for translation!

Not always associated with a CDS. If so, the protein sequences are present in protein sequence databases
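
Why the choice of genetic code matters can be shown with a single codon. The sketch below translates the same reading frame with the standard code and with NCBI translation table 4 (used, among others, by Mycoplasma and Spiroplasma), which reads TGA as tryptophan instead of stop. The function and variable names are illustrative, and the sequence is made up.

```python
from itertools import product

BASES = "TCAG"
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard code (translation table 1), codons enumerated in TCAG order.
STANDARD = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}
# Translation table 4 differs from the standard amino acid assignments only at TGA.
TABLE_4 = dict(STANDARD, TGA="W")


def translate(seq, table):
    """Translate codon by codon and stop at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = table[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    orf = "ATGTGAGCCTAA"  # invented example containing an in-frame TGA
    print("standard code:", translate(orf, STANDARD))
    print("table 4      :", translate(orf, TABLE_4))
```

With an unknown source organism, both readings are plausible, so the predicted protein can be either truncated or full length depending on the table chosen.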

Page 42

Menu

Introduction : bioinformatics and sequences

Nucleic acid sequence databases

Protein sequence databases (sources)

Protein sequence databases (other)

Page 43

cDNAs, ESTs, genomes, …

EMBL, GenBank, DDBJ

Data not submitted to public databases, delayed or cancelled…

…if the submitters provide an annotated Coding Sequence (CDS)

(1/10 EMBL entries)

Protein sequence databases

Nucleic acid databases

Gene prediction

no CDS

Page 44

Major protein sequence database 'sources'

UniProtKB: Swiss-Prot + TrEMBL

NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq

PIR PDB PRF

UniProtKB/Swiss-Prot: manually annotated protein sequences (11'612 species)

UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non-redundant with Swiss-Prot (184'698 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (130'000 species)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank: 3D data and associated sequences

PRF: journal scan of 'published' peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4'000 species)

Integrated resources

'cross-references'

Separated resources

Page 45: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

UniProt, the Universal Protein Resource, is maintained by the UniProt consortium: SIB + EBI + PIR

SIB = Swiss Institute of Bioinformatics

EBI = European Bioinformatics Institute

PIR = Protein Information Resource

www.uniprot.org

Page 46: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

UniProtKB/TrEMBL: 6’964’485 entries (184’698 species)

UniProtKB/Swiss-Prot: 405’506 entries (11’612 species)

Page 47: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence
Page 48: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

The UniProt Knowledgebase (UniProtKB): an encyclopedia on proteins, released biweekly

Page 49: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

EMBL -> TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references, plus automated annotation

Page 50: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

!!!!

The quality of UniProtKB/TrEMBL data, including the protein sequence, is directly dependent on the information provided by the submitter of the original nucleotide entry.

Automated annotation:

• using rules derived from Swiss-Prot manually annotated entries but with no manual oversight – RuleBase

• using automatically generated rules – Spearmint

Page 51: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

EMBL -> TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references, plus automated annotation

TrEMBL -> Swiss-Prot: manual annotation of the sequence and associated biological information

Page 52: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

UniProtKB from TrEMBL to Swiss-Prot

Sequence check

Page 53: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

UniProtKB/Swiss-Prot

1 entry <-> 1 gene (1 species)

i) Merge of all known protein sequences (CDS) derived from the same gene

-> avoid redundancy and improve sequence reliability

(for human: ~6 different sequence reports per entry)

ii) Annotation of the sequence differences

(including conflicts, polymorphisms, splice variants, etc.)

-> annotation of protein diversity

Page 54: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Righting the wrongs

“Sequences are rarely deposited in a “mature” state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.”

“Sequencing error rates: ~1 base in 10’000”

Page 55: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

What evidence exists that proves the existence of a protein?

Different qualifiers:

1. Evidence at protein level (~15.3%)

2. Evidence at transcript level (~15.8%)

3. Inferred from homology (~65.2%)

4. Predicted (~3.4%)

5. Unassigned (mainly in TrEMBL) (~0.3%)

Page 56: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

• Focal point of our efforts to maintain and develop UniProtKB/Swiss-Prot;

• Enables individual researchers to obtain a summary of what is known about a protein

Annotation

Page 57: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

In a UniProtKB/Swiss-Prot entry, you can expect to find:

• An (often corrected) protein sequence and the description of various isoforms/variants;

• Its biological origin, with links to the taxonomic databases;

• All the names of a given protein (and of its gene);

• A summary of what is known about the protein: function, alternative products, PTMs, tissue expression, disease, 3D data, etc.;

• A description of important sequence features: domains, PTMs, variations, etc.;

• A selection of references;

• Selected keywords;

• Numerous cross-references (central hub).

Page 58: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

An easy way to access the history of a protein sequence entry…

UniSave homepage: http://www.ebi.ac.uk/uniprot/unisave/

Page 62: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence
Page 64: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

UniRef is useful for comprehensive BLAST searches because it provides sets of representative sequences (« Collapsing BLAST results »)

= Three collections of sequence clusters from the UniProt Knowledgebase and EnsEMBL, IPI, EMBL_WGS:

One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments with 11 or more residues are placed into a single record) -> reduction of 12 %

One UniRef90 entry -> sequences that have at least 90 % identity -> reduction of 40 %

One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %

Independently of the species !
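The UniRef100 rule quoted above (identical sequences and sub-fragments of 11 or more residues end up in one record) can be illustrated with a toy sketch. This is not the UniRef pipeline: the function name, the greedy longest-first strategy and the example sequences are assumptions made for illustration only.

```python
def collapse_uniref100_style(sequences, min_fragment_len=11):
    """Group identical sequences and sub-fragments (>= min_fragment_len residues).

    `sequences` maps a hypothetical identifier to a protein sequence.
    Returns a dict: representative identifier -> list of member identifiers.
    """
    # Process the longest sequences first so that a full-length sequence
    # becomes the representative before its fragments are examined.
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    representatives = []  # list of (identifier, sequence)
    clusters = {}
    for seq_id, seq in ordered:
        for rep_id, rep_seq in representatives:
            is_identical = seq == rep_seq
            is_fragment = len(seq) >= min_fragment_len and seq in rep_seq
            if is_identical or is_fragment:
                clusters[rep_id].append(seq_id)
                break
        else:
            # No existing record absorbs this sequence: start a new one.
            representatives.append((seq_id, seq))
            clusters[seq_id] = [seq_id]
    return clusters


# Hypothetical input: B is identical to A, C is a 13-residue fragment of A,
# D is a fragment too short to be merged, E is unrelated.
example = {
    "A": "MKTAYIAKQRQISFVKSHFSRQ",
    "B": "MKTAYIAKQRQISFVKSHFSRQ",
    "C": "AYIAKQRQISFVK",
    "D": "MKTAY",
    "E": "GGGGGGGGGGGG",
}
uniref100_like = collapse_uniref100_style(example)
```

Note that the species of origin plays no role in the grouping, which matches the "independently of the species" remark on the slide.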

Page 67: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

UniProt Archive (UniParc) is part of the UniProt project. It is a non-redundant archive of protein sequences extracted from public databases: UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, PIR-PSD, EMBL, EMBL WGS, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, H-Invitational Database, TROME database, European Patent Office proteins, United States Patent and Trademark Office proteins (USPTO) and Japan Patent Office proteins.

UniParc contains only protein sequences. All other information about the protein must be retrieved from the source databases using the database cross-references.

Each unique sequence is stored only once with a stable identifier. The format of the identifier is UPI followed by ten hexadecimal digits, e.g. UPI000000000A.
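As a minimal sketch of the identifier format just described (UPI followed by ten hexadecimal digits), a check could look as follows. The function name is hypothetical, and accepting only upper-case digits is an assumption based on the example identifier.

```python
import re

# "UPI" followed by exactly ten hexadecimal digits (upper case assumed).
UNIPARC_ID_PATTERN = re.compile(r"UPI[0-9A-F]{10}")


def is_uniparc_id(identifier):
    """Return True if `identifier` has the UniParc format described above."""
    return UNIPARC_ID_PATTERN.fullmatch(identifier) is not None


print(is_uniparc_id("UPI000000000A"))
```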

UniParc

Page 68: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Use with extreme caution: it also contains pseudogenes, incorrect CDS predictions, etc.!

It also contains patent office database data (EPO, ESPO…).

UniParc

Page 71: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data.

UniMES is available in FASTA format on the UniProt ftp servers, in the new subdirectory current_release/unimes:

• ftp.uniprot.org/pub/databases/uniprot

• ftp.ebi.ac.uk/pub/databases/uniprot

• ftp.expasy.org/databases/uniprot

Page 73: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

NCBInr (Entrez protein)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Page 74: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Protein sequences: « NR database » (Entrez protein)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Page 75: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence


Page 76: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

GenPept: derived from GenBank/EMBL/DDBJ sequences which have a CDS annotated on them; equivalent to TrEMBL

PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)

PDB: 3D structure database: all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

PRF: sequences derived from scientific publications (« Journal scan »); integrated into TrEMBL

Page 78: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

http://www.ncbi.nlm.nih.gov/RefSeq/

Page 79: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Accession numbers:

- for RNA (NM_)

- for genomic (NT_)

- for protein (NP_)

- for predicted protein (XP_)

RefSeq: The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.

3,648,590 entries (22-May-2007); 4,300 species.

5,590,364 entries (11-July-2008); 5,395 species.

6,042,750 entries (20-November-2008); 5,726 species.
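The four RefSeq prefixes listed on this slide can be turned into a small lookup. This is a sketch covering only the prefixes named here; the function name is hypothetical, and RefSeq uses further prefixes that are deliberately left out.

```python
# Prefix -> record type, limited to the prefixes named on the slide.
REFSEQ_PREFIXES = {
    "NM_": "RNA",
    "NT_": "genomic",
    "NP_": "protein",
    "XP_": "predicted protein",
}


def refseq_record_type(accession):
    """Return the record type for a RefSeq accession, or None if unknown."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


print(refseq_record_type("NM_000492"))
```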

Page 82: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

GenPept: derived from GenBank/EMBL/DDBJ sequences which have a CDS annotated on them; equivalent to TrEMBL, except that it is redundant with Swiss-Prot

PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)

PDB: 3D structure database: all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

PRF: sequences derived from scientific publications (« Journal scan »); integrated into TrEMBL

Page 84: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

PIR: the Protein Information Resource

PIR-PSD is no longer updated, but exists as an archive

Page 86: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

PDB (Protein Data Bank), 3D structure

Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies

Also contains the corresponding protein sequences*

*The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools

Includes protein sequences which are mutated (effect of a mutation on the 3D structure)

Page 87: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

PDB: Protein Data Bank

www.rcsb.org/pdb/

Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).

Associated specialized programs allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).

Currently there are structural data for about different proteins, but for far fewer protein families (highly redundant)!

Page 92: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

http://www.genome.jp/dbget-bin/www_bfind?prf

Looks for peptide sequences described in publications (and which have not been submitted to databases!)

Page 94: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Query at Entrez protein (NCBInr)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Page 95: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Typical result of a query at « Entrez protein »

RefSeq

Swiss-Prot

Genpept(gb/embl/ddbj)

PIR

PDB

Page 97: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

GI number: ‘GenInfo identifier’ number

- In addition to an AC number specific to the original database, each protein sequence in the NCBInr database has a GI number.

- If the sequence changes in any way, a new GI number will be assigned -> not a stable identifier

- A separate GI number is also assigned to each protein translation within a nucleotide sequence record (alternative products)

- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record: http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi

Page 98: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Menu

Introduction : bioinformatics and sequences

Nucleic acid sequence databases

Protein sequences databases (sources)

Protein sequences databases (other)

Page 99: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

EnsEMBL

http://www.ensembl.org/

not only for proteins….

Page 100: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

EnsEMBL

Automated genome annotation and subsequent visualisation of annotated genomes.

Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant and fungal genomes.

http://www.ensembl.org/info/about/index.html

Page 101: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

http://www.ensembl.org/info/data/docs/genome_annotation.html

- EnsEMBL aligns the genomic sequences with all the sequences found in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)

- It also performs gene prediction (-> novel genes)

- DNA, RNA and protein sequences available for ~30 species

- Browsing tool

Page 103: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

http://www.ensembl.org/index.html

Browsing tool available for 49 species…

Page 104: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence
Page 106: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

http://www.ncbi.nlm.nih.gov/CCDS/

Page 107: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

CCDS (human)

Combining different approaches (ab initio, by similarity) and taking advantage of the expertise acquired by different institutes, including manual annotation…

Consensus between 4 institutions…

Page 109: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

IPI: International Protein Index

http://www.ebi.ac.uk/IPI/IPIhelp.html

Page 111: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

IPI (International Protein Index)

Provides a guide to the main databases that describe the human, mouse, rat, zebrafish, Arabidopsis, chicken, and cow proteomes: Swiss-Prot, TrEMBL, RefSeq and Ensembl (and H-InvDB, TAIR and VEGA).

IPI is built to provide maximum coverage, for the same protein, of the major publicly available protein (and gene) databases.

For each protein in IPI, an entry from one of the constituent databases is selected as the master entry, and supplies the IPI entry with its sequence and annotation.

Stable identifiers (with incremental versioning) are maintained to allow the tracking of sequences in IPI between IPI releases.

Page 113: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

IMGT (international ImMunoGeneTics information)

A collection of high-quality integrated databases specialising in immunoglobulins, T cell receptors and the Major Histocompatibility Complex (MHC) of all vertebrate species.

Page 116: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Protein sequence databases for proteomics

Page 117: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Phenyx: UniProtKB

PROWL: NCBInr, Swiss-Prot, dbEST

Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL*.

Peptident (Aldente): UniProtKB.

Mascot: NCBInr, Swiss-Prot, dbEST, OWL*, MSDB

* OWL has been obsolete since 1999

dbEST: translation of EST sequences in the 6 frames (ESTs are not associated with annotated CDSs!)
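Since an EST carries no annotated CDS, a search engine has to consider all six reading frames. The sketch below shows the idea under two assumptions that are not from the slides: the standard genetic code is used, and codons containing ambiguous bases are translated as "X". Function names and the input sequence are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Stop codons appear as "*" in the output; a real pipeline would usually split each frame at stops before building a searchable database.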

Page 118: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

OWL

Non-redundant protein database, including: Swiss-Prot, PIR, NRL3-D* and GenPept.

*The PIR-NRL3D database makes the sequence information in PDB available for similarity searches

Page 119: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence


Page 122: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

-> Accession / version number jungle!

Depending on the database, an AC number can be associated with an entry (gene product: stable even if the sequence changes) or with a sequence (it changes as soon as the sequence changes)

Page 123: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

In summary

For the same protein sequence, you can find:

• A UniProtKB/Swiss-Prot entry

• A RefSeq entry (or GenPept)

• An EnsEMBL entry

• A CCDS entry

• A UniParc entry (archive)

• An IPI…

Page 124: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

The AC number jungle

Type of record: sample accession format

GenBank/EMBL/DDBJ: one letter followed by five digits (e.g. U12345), or two letters followed by six digits (e.g. AF123456)

Swiss-Prot/TrEMBL: one letter followed by five digits/letters (e.g. P12345, A0B533)

RefSeq nucleotide: two letters, an underscore and six digits (e.g. mRNA NM_000492; genomic NT_000907)

RefSeq protein: e.g. NP_000483

RefSeq prediction: e.g. XM_000483, XP_000467

PDB (protein structure): one digit followed by three letters or digits (e.g. 1TUP)
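The formats in this table can be written as regular expressions, which also shows why the "jungle" is real: some accessions fit more than one format. This is a sketch using the simplified rules from the table, not the official specification of any database; the function name is hypothetical, and allowing more than six digits in RefSeq accessions is an assumption.

```python
import re

# Simplified patterns taken from the table above (not official grammars).
ACCESSION_PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "Swiss-Prot/TrEMBL": re.compile(r"[A-Z][A-Z0-9]{5}"),
    # Assumption: six or more digits after the underscore.
    "RefSeq": re.compile(r"[A-Z]{2}_\d{6,}"),
    "PDB": re.compile(r"\d[A-Z0-9]{3}"),
}


def candidate_record_types(accession):
    """Return every record type whose format matches the accession."""
    accession = accession.upper()
    return [
        record_type
        for record_type, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(accession)
    ]


for accession in ("U12345", "AF123456", "A0B533", "NM_000492", "1TUP"):
    print(accession, candidate_record_types(accession))
```

An accession such as P12345 or U12345 matches both the one-letter GenBank/EMBL/DDBJ format and the simplified Swiss-Prot/TrEMBL format, so the format alone does not always identify the source database.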

Page 125: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

uniprot.org

Page 128: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).

Proteome complexity

Not predictable at the genome level!

Page 129: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Chemical aspects

• Post-translational modifications (PTMs) consist of the breaking and/or the making of covalent bonds, catalyzed by enzymes

• PTMs modify both protein mass and isoelectric point (pI)
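The mass effect can be made concrete with monoisotopic delta masses. The sketch below is illustrative: the four delta masses are the commonly tabulated values (rounded to five decimals), while the function name and the 1000.0 Da starting mass are hypothetical.

```python
# Monoisotopic mass shifts in Da, rounded to five decimals.
PTM_DELTA_MASS = {
    "phosphorylation": 79.96633,  # +HPO3
    "acetylation": 42.01057,      # +C2H2O
    "methylation": 14.01565,      # +CH2
    "oxidation": 15.99491,        # +O
}


def modified_mass(unmodified_mass, ptms):
    """Add the delta mass of every listed PTM to the unmodified mass."""
    return unmodified_mass + sum(PTM_DELTA_MASS[ptm] for ptm in ptms)


# Hypothetical peptide of 1000.0 Da carrying two phosphate groups.
doubly_phosphorylated = modified_mass(1000.0, ["phosphorylation", "phosphorylation"])
```

The pI shift is not computed here; qualitatively, each added phosphate group contributes negative charge and lowers the pI.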

Page 130: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

The PTM variety

[Figure: PTMs mapped onto the 20 amino acids (Gly Ala Val Leu Ile Lys Arg His Asp Glu Asn Gln Cys Ser Thr Met Pro Phe Tyr Trp)]

Side-chain modifications: acetylation, methylation, acylation, phosphorylation, oxidation, crosslinks, hydroxylation, cofactor binding, sulfation, C-linked sugar, N-linked sugar, O-linked sugar, S-linked sugar

N-terminal modifications: acetylation, methylation, acylation, crosslinks

C-terminal modifications: GPI, amidation, crosslinks, methylation

Legend: black = cytoplasmic modifications; dark grey = both cytoplasmic and extracellular modifications, depending on the exact type; light grey = extracellular modifications

Page 131: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

PTM distribution among kingdoms

[Figure: distribution of PTMs among bacteria, eukaryotes and archaea. PTMs shown: acetylation, amidation, FMN binding, FAD binding, GPI-anchor, lanthionine crosslink, bacterial lipid anchor, myristoylation, archaean lipid anchor, palmitoylation, methylation, phosphorylation, sulfation, diphthamide, pyrrolysine, archaea-specific methylation, eukaryote-specific methylation, bacteria-specific methylation]

Page 132: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

PTM annotation in UniProtKB entries

PTMs are annotated in the feature table (‘sequence annotation’) when they can be assigned a position on the protein sequence, and in the comments when they cannot.

Page 133: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

PTM-dedicated FT keys

FT key: usage

CARBOHYD (Glycosylation): sugars

DISULFID (Disulfide bond): disulfide bonds

CROSSLNK (Cross-link): other crosslinks

LIPID: lipids

MOD_RES (Modified residue): other modifications

PTMs are grouped by type, and are specifically and uniquely annotated by the use of a controlled vocabulary and a set of specific FT keys
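Because the PTM-dedicated FT keys form a small closed set, counting PTM features in a flat-file entry is straightforward. The sketch below assumes the legacy fixed-column flat-file layout, where the key sits in columns 6 to 13 of each FT line; the entry text, its positions and its descriptions are invented for illustration.

```python
from collections import Counter

# The five PTM-dedicated feature keys listed above.
PTM_FT_KEYS = {"CARBOHYD", "DISULFID", "CROSSLNK", "LIPID", "MOD_RES"}


def count_ptm_features(flat_file_text):
    """Count PTM-dedicated FT lines in a flat-file entry (legacy layout assumed)."""
    counts = Counter()
    for line in flat_file_text.splitlines():
        if not line.startswith("FT"):
            continue
        # Assumption: the feature key occupies columns 6-13 of an FT line.
        key = line[5:13].strip()
        if key in PTM_FT_KEYS:
            counts[key] += 1
    return counts


# Hypothetical entry fragment, not taken from a real record.
EXAMPLE_ENTRY = "\n".join([
    "ID   EXAMPLE_MOUSE           Reviewed;         300 AA.",
    "FT   DOMAIN       10    120       Protein kinase.",
    "FT   MOD_RES      15     15       Phosphoserine.",
    "FT   MOD_RES      88     88       Phosphothreonine.",
    "FT   CARBOHYD     42     42       N-linked (GlcNAc...).",
    "FT   DISULFID     60    105",
])
ptm_counts = count_ptm_features(EXAMPLE_ENTRY)
```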

Page 134: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

PTM annotation in UniProtKB entries

PTMs are annotated in the feature table when they can be assigned a position on the protein sequence, and in the comments when they cannot.

Associated keywords

Page 136: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Find all mouse proteins which are phosphorylated

Page 138: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

UniProtKB/Swiss-Prot

Number of PTMs in Swiss-Prot release 51 (241242 entries), all organisms

PTM | Potential | By similarity | Experimental & probable | Total

signal peptide | 15235 | 2850 | 4996 | 23081

N-GlcNAc | 66264 | 826 | 3161 | 70251

O-GalNAc | 306 | 354 | 628 | 1288

O-GlcNAc | 1 | 167 | 78 | 246

phosphorylation | 1110 | 14288 | 7760 | 23158

sulfation | 252 | 294 | 171 | 717

myristate | 129 | 535 | 131 | 795

GPI-anchor | 478 | 115 | 68 | 661

Page 140: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

RESID

RESID is a database of 473 natural modifications (Rel. 56.00) with chemical and structural annotations such as recommended name and synonyms, delta mass, 3D structure, UniProt annotations, etc.

FTP sites:

ftp://ftp.ebi.ac.uk/pub/databases/RESID/

ftp://ftp.ncifcrf.gov/pub/users/residues

Web sites:

http://www.ebi.ac.uk/RESID

http://www.ncifcrf.gov/RESID/

http://home.earthlink.net/~jsgaravelli/RESIDInfo.HTML

http://hpc.cs.tsinghua.edu.cn/bioinfo/database/index.html

Page 143: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

Other PTM databases

UNIMOD: http://www.unimod.org/

PSI-MOD (ontology): http://psidev.sourceforge.net/mod/

Delta Mass: http://www.abrf.org/index.cfm/dm.home

Page 145: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

http://www.geneontology.org

Page 146: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

GO scope

Three disjoint axes:

• cellular component: sub-cellular location, e.g. nucleus, ribosome, origin recognition complex

• molecular function: molecular role, e.g. catalytic activity, binding

• biological process: broad biological phenomena, e.g. mitosis, growth, digestion

Page 147: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

terms are related within a hierarchy

GO structure

• Terms are linked by two relationships

– is-a

– part-of

Page 148: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

GO structure

[Figure: example hierarchy with the terms cell; membrane, chloroplast; mitochondrial membrane, chloroplast membrane. Edge legend: is-a, part-of]
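A hierarchy in which a term can have several parents is a directed acyclic graph, and annotating a protein with a term implicitly annotates it with all ancestors of that term. The sketch below uses the five terms of the figure; which edge is is-a and which is part-of is an assumption, since the extracted text does not say, and the function name is hypothetical.

```python
# child term -> list of (relationship, parent term); edge types are assumed.
GO_EDGES = {
    "cell": [],
    "membrane": [("part-of", "cell")],
    "chloroplast": [("part-of", "cell")],
    "mitochondrial membrane": [("is-a", "membrane")],
    "chloroplast membrane": [("is-a", "membrane"), ("part-of", "chloroplast")],
}


def ancestors(term, edges=GO_EDGES):
    """Return every term reachable from `term` through is-a or part-of edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("chloroplast membrane")))
```

Here "chloroplast membrane" has two parents, which is what distinguishes this structure from a simple tree.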

Page 149: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

GOA: Gene Ontology Annotation

http://www.ebi.ac.uk/GOA/

What is GOA?

GOA aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase and International Protein Index, using GO terms.

The GOA project is run by EBI and has been a member of the GO consortium since 2001.

In 2001, the first phase of the GOA project involved the large-scale assignment of GO terms to Swiss-Prot and TrEMBL entries using electronic methods, namely the mappings spkw2go, ec2go and Interpro2go.

Page 151: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence

e-proxemis: http://e-proxemis.expasy.org

Page 152: Protein sequence databases - estrellapolar.cnb.csic.esestrellapolar.cnb.csic.es/proteored/docs/Bioinfo... · Protein sequence databases. Menu Introduction : bioinformatics and sequence