Page 1: Genome, Protein and Model Organism Databases

Genome, Protein and Model Organism Databases

Anne Estreicher
Swiss-Prot Group
Swiss Institute of Bioinformatics
Geneva – Switzerland
[email protected]

Bioinformatic and Comparative Genome Analysis Course
HKU-Pasteur Research Centre - Hong Kong, China
August 17 - August 29, 2009

Page 2: Genome, Protein and Model Organism Databases

Outline

1. Introduction (definitions, history…)

2. From DNA sequence to genomic tools

3. The flow of information: from DNA to proteins

4. Protein sequence databases

5. MODs at a glance

Page 3: Genome, Protein and Model Organism Databases

What is a database?

• A collection of related data, which are
  – structured
  – searchable
  – updated periodically
  – cross-referenced

• Also includes the associated tools necessary for access/query, download, etc.

Page 4: Genome, Protein and Model Organism Databases

Why do we need databases?

Data need to be stored, curated and made available for analysis and knowledge discovery

Efficient way of sharing data, independently of regular publications

Essential resources for both experimental and computational biologists

Page 5: Genome, Protein and Model Organism Databases

Databases in biology: not a new issue…

• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)

Page 6: Genome, Protein and Model Organism Databases

The first protein sequence "database" by Margaret Dayhoff (1965)

contained 65 proteins

Page 7: Genome, Protein and Model Organism Databases

Databases: not a new issue…

• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
• Mid 70s Improvements in DNA sequencing
• 1979 Los Alamos Sequence Library (Walter Goad)
• 1980 ~ 80 genes fully sequenced

-> Need to store the data and to make them available for analysis (in a format acceptable for human eyes and machines)

-> ARCHIVE

-> RACE for the central position in life sciences… And the winner is…

Page 8: Genome, Protein and Model Organism Databases

Databases: not a new issue…

EMBL-Bank - Europe 1980
GenBank - USA 1982
DDBJ - Asia 1986

leading to the establishment of the INSDC (International Nucleotide Sequence Database Collaboration) -> daily exchanges of data

Page 9: Genome, Protein and Model Organism Databases

www.insdc.org

Page 10: Genome, Protein and Model Organism Databases

EMBL-Bank - GenBank - DDBJ

• Main resources for DNA and RNA sequences;

• Sequences used to be retrieved from publications -> now direct submissions from individual researchers, genome sequencing projects and patent applications:

“Journal publishers generally require sequence deposition prior to publication so that an accession number can be included in the paper.”

1. True for nucleic acid, not for protein sequences;
2. Not always put into practice

=> Sequences that are not submitted are LOST!!!

• Archives (primary databases)

• Data belong to the submitters

Page 11: Genome, Protein and Model Organism Databases

EMBL-Bank - GenBank - DDBJ

Archive (primary databases) => data belong to the submitter

Minimal checks, such as vector contamination

Annotation by the submitters

Page 12: Genome, Protein and Model Organism Databases

Databases: not a new issue…

• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
• 1979 Los Alamos Sequence Library (Walter Goad) – DNA
• 1982 EMBL-Bank – DNA
• 1984 GenBank – DNA
• 1986 DDBJ – DNA

Page 13: Genome, Protein and Model Organism Databases

Databases: not a new issue…

• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
• 1979 Los Alamos Sequence Library (Walter Goad) – DNA
• 1982 EMBL-Bank – DNA
• 1984 GenBank – DNA
• 1986 DDBJ – DNA

-> ARCHIVES (primary databases) may not be sufficient
-> need to annotate the data to produce KNOWLEDGE

• 1986 Swiss-Prot – protein sequences – a paradigm for annotated (secondary) databases

Page 14: Genome, Protein and Model Organism Databases

The Swiss-Prot concept

Non-redundant: protein products of 1 gene / 1 species -> 1 entry,

Manually annotated (=> curator judgement on data!),

Highly cross-referenced (1st life-science database to provide cross-references) (links to > 130 databases from www.uniprot.org).

Page 15: Genome, Protein and Model Organism Databases

Databases: not a new issue…

• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
• 1979 Los Alamos Sequence Library (Walter Goad) – DNA
• 1982 EMBL-Bank – DNA
• 1984 GenBank – DNA
       Protein Information Resource (PIR) – protein sequences
• 1986 DDBJ – DNA
       Swiss-Prot – protein sequences
• 1996 TrEMBL (Translated EMBL) – protein sequences
       Complement of Swiss-Prot to cope with the increasing amount of new sequences; AUTOMATIC ANNOTATION!

Page 16: Genome, Protein and Model Organism Databases

[Figure: UniProtKB/Swiss-Prot growth. Number of entries (0 to 500'000) plotted against release number (2 to 57).]

1986: 3’939 entries

1996: creation of TrEMBL
Swiss-Prot: 52’205 entries
TrEMBL: 61’137 entries

Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries

Page 17: Genome, Protein and Model Organism Databases

[Figure: UniProtKB growth, 1986 - 1996 - 2009. Number of entries (0 to 9'000'000) plotted against release number. TrEMBL: automated curation; Swiss-Prot: manual curation.]

TrEMBL rel. 40.5 (07-Jul-2009): 8’594’382 entries
Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries

TrEMBL growth (sequences/day)
2004: 1’500
2006-2007: 3’500
2008: >5’000
2009: ~8’000

Page 18: Genome, Protein and Model Organism Databases

New challenge

Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery

Page 19: Genome, Protein and Model Organism Databases

(R)evolution of these last 20 years

Life sciences used to be rich in hypotheses, well-off in knowledge and poor in data;

Today they are very rich in data, not so well-off in knowledge and very poor in hypotheses.

[Diagram labels: List of parts, ??, Complex system]

Page 20: Genome, Protein and Model Organism Databases

Science (1993) 262, 502

Page 21: Genome, Protein and Model Organism Databases

Danger!

EMBL Database Growth
http://www.ebi.ac.uk/embl/Services/DBStats/

Page 22: Genome, Protein and Model Organism Databases

http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat

In 4 months, 374 new genomes and 77 were completed
~ 100 genomes/month
(in 2008 -> ~50 genomes/month)

+ ~2’360 viral (& viroid) genomes
=> Total ~ 5’600 genomes

Page 23: Genome, Protein and Model Organism Databases

http://genomesonline.org/index2.htm

Page 24: Genome, Protein and Model Organism Databases

http://www.genomesonline.org/gold.cgi

Page 25: Genome, Protein and Model Organism Databases
Page 26: Genome, Protein and Model Organism Databases

http://www.genomesonline.org/gold.cgi

Page 27: Genome, Protein and Model Organism Databases

Metagenomics: study of genetic material recovered directly from environmental samples

• Global Ocean Sampling (C. Venter)

• Whale fall

• Soil, sand beach, New York air, …

• Human fluids, mouse gut

• …

Venter’s Sorcerer II

Page 28: Genome, Protein and Model Organism Databases

Flood in the world of proteins…

1965: first protein sequence "database" by Margaret Dayhoff (65 proteins)

July 2009: ~ 20 million unique protein sequences (source: UniParc - http://www.uniprot.org/uniparc/)

UniParc: non-redundant database that contains most of the publicly available protein sequences in the world (includes sequences from EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, FlyBase, H-Invitational Database (H-Inv), International Protein Index (IPI), Patent Offices (EPO, JPO and USPTO), PIR-PSD, Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome Database (SGD), The Arabidopsis Information Resource (TAIR), TROME, UniProtKB/Swiss-Prot and TrEMBL, Vertebrate Genome Annotation database (VEGA) and WormBase).

Page 29: Genome, Protein and Model Organism Databases

New challenge

Flood of data

Flood of databases…

Page 30: Genome, Protein and Model Organism Databases

The 1st issue of the year of NAR (Nucleic Acids Research) is always dedicated to databases + a "clean" list of databases is provided (! not exhaustive !)

Page 31: Genome, Protein and Model Organism Databases

The NAR Online Molecular Biology Database collection in 2009

A total of 1’170 databases (19 obsolete removed)

http://www.oxfordjournals.org/nar/database/a/

Page 32: Genome, Protein and Model Organism Databases

NAR "clean" list of databases
http://www.oxfordjournals.org/nar/database/a/

Page 33: Genome, Protein and Model Organism Databases

Most recent NAR paper about the database (not available for all databases, some are described in other journals)

Page 34: Genome, Protein and Model Organism Databases

A "clean" list of databases can be found in the NAR online molecular biology database collection

http://www.oxfordjournals.org/nar/database/a/

Page 35: Genome, Protein and Model Organism Databases
Page 36: Genome, Protein and Model Organism Databases

BIOLOGICAL DATABASE CATEGORIES

• Databases of nucleic acid sequences (RNA, DNA)
• Databases of protein sequences
• Databases of protein motifs and protein domains
• Databases of structures
• Databases of genomes
• Databases of genes
• Databases of expression profiles
• Databases of SNPs and mutations
• Databases of metabolic pathways
• Databases of protein interactions
• Databases of taxonomy
• …

Databases containing sequences or data directly derived from sequences.

Page 37: Genome, Protein and Model Organism Databases

DNA sequences: What? Where? How?

& genomic tools

NCBI
UCSC

Page 38: Genome, Protein and Model Organism Databases

Accession number
Molecule type
Date of submission
Definition

Nucleotide sequence

Stable accession number (should always be cited in publications)

Possible molecule types: genomic DNA, genomic RNA, mRNA, other DNA, other RNA, rRNA, transcribed RNA, tRNA, unassigned DNA, unassigned RNA, viral cRNA

GenBank entry AF415175
http://www.ncbi.nlm.nih.gov/nuccore/16589063

Page 39: Genome, Protein and Model Organism Databases

Accession number
Molecule type
Date of submission
Definition

Nucleotide sequence

Taxonomy

Page 40: Genome, Protein and Model Organism Databases

Accession number
Molecule type
Date of submission
Definition

Nucleotide sequence

Taxonomy

References

Page 41: Genome, Protein and Model Organism Databases

Accession number
Molecule type
Date of submission
Definition

Taxonomy

References

Features: information provided by the submitter; may include annotation of the sequence
• Organism
• Molecule type
• Chromosomal location
• Tissue type
• Gene name
• CDS annotation => protein sequence + Protein IDentifier (PID: stable identifier & version number)

Nucleotide sequence
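The fields listed on these slides correspond to keyword lines in the GenBank flat-file layout (LOCUS, DEFINITION, ACCESSION, FEATURES, ORIGIN). A minimal sketch of reading the header, assuming a hand-written toy record: the accession XX000000, the protein_id and all other values are invented for illustration and are not taken from entry AF415175, and `parse_header` is a hypothetical helper, not an NCBI tool.

```python
# Toy record in GenBank flat-file layout; every value is invented for illustration.
TOY_RECORD = """\
LOCUS       XX000000                   9 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical example mRNA, complete cds.
ACCESSION   XX000000
VERSION     XX000000.1
SOURCE      Homo sapiens (human)
FEATURES             Location/Qualifiers
     CDS             1..9
                     /protein_id="XXX00000.1"
ORIGIN
        1 atggcctaa
//
"""


def parse_header(record: str) -> dict:
    """Collect the top-level keyword lines that precede the feature table."""
    fields = {}
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            break  # the header ends where the feature table begins
        if line and not line[0].isspace():
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    return fields


header = parse_header(TOY_RECORD)
molecule_type = header["LOCUS"].split()[3]  # 4th token of the LOCUS line
print(header["ACCESSION"], "|", molecule_type, "|", header["DEFINITION"])
```

For real records a dedicated parser is the safer choice; this sketch only shows where the slide's field names sit in the file.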

Page 42: Genome, Protein and Model Organism Databases

Protein sequence

Gives access to the nucleic acid sequence of the CDS (not of the entire mRNA)

Page 43: Genome, Protein and Model Organism Databases

"Features" may provide much more information, depending upon the sequence and the submitter…

3’ end of chromosome Y

EMBL #AJ271736

Page 44: Genome, Protein and Model Organism Databases

Very similar view, links and options from the 3 sites:

EMBL-Bank: http://www.ebi.ac.uk/embl/
GenBank: http://www.ncbi.nlm.nih.gov/
DDBJ: http://www.ddbj.nig.ac.jp/

Page 45: Genome, Protein and Model Organism Databases

How to find a DNA sequence at the NCBI…

Page 46: Genome, Protein and Model Organism Databases

http://www.ncbi.nlm.nih.gov/

Page 47: Genome, Protein and Model Organism Databases

Databases @ NCBI
http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html

The Entrez system: integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others

=> Maximal interconnectivity
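Entrez can also be queried programmatically through the NCBI E-utilities, which are not covered in these slides. A minimal sketch, assuming the `esearch.fcgi` and `efetch.fcgi` endpoints and the accession AF415175 used as an example elsewhere in this lecture; `esearch_url` and `efetch_url` are hypothetical helper names, and the code only builds the request URLs without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption of this sketch, not from the slides).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str) -> str:
    """URL of a text search restricted to one Entrez database."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode({"db": db, "term": term})


def efetch_url(db: str, accession: str) -> str:
    """URL that retrieves one record in GenBank flat-file format."""
    params = {"db": db, "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?" + urlencode(params)


print(esearch_url("nuccore", "AF415175[Accession]"))
print(efetch_url("nuccore", "AF415175"))
```

Choosing `db` explicitly is what restricts the query to a single database.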

Page 48: Genome, Protein and Model Organism Databases

Databases @ NCBI
http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html

Page 49: Genome, Protein and Model Organism Databases

Simple search with an EMBL-Bank/GenBank/DDBJ accession number

Page 50: Genome, Protein and Model Organism Databases
Page 51: Genome, Protein and Model Organism Databases
Page 52: Genome, Protein and Model Organism Databases

Searching from a bibliographic reference…

Page 53: Genome, Protein and Model Organism Databases
Page 54: Genome, Protein and Model Organism Databases

Search result 1 -> corresponds to the RefSeq database…

Search results 2 and 3 -> accession numbers provided by the authors in the article -> GenBank records

Page 55: Genome, Protein and Model Organism Databases

RefSeq (Reference Sequence)

• Provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins;

• Most data extracted from GenBank -> choice of a reference sequence and annotation (no documented comparison between sequences)

• Some entries based on predictions (accession: XM_; XR_; XP_; ZP_);

• Currently, 8'665 species represented;

• Annotation: Manual annotation (only in entries tagged as "reviewed"); Collaboration; Propagation from other sources; Computation.
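The accession prefix tells whether a RefSeq record is prediction-based. A minimal sketch: the XM_, XR_, XP_ and ZP_ prefixes come from the slide above, while NM_, NR_ and NP_ (non-predicted mRNA, non-coding RNA and protein) are added from general RefSeq conventions; `describe_refseq` is a hypothetical helper and the table is not exhaustive.

```python
# prefix -> (molecule, based on prediction?)
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", False),
    "NR_": ("non-coding RNA", False),
    "NP_": ("protein", False),
    "XM_": ("mRNA", True),            # predicted, as listed on the slide
    "XR_": ("non-coding RNA", True),  # predicted, as listed on the slide
    "XP_": ("protein", True),         # predicted, as listed on the slide
    "ZP_": ("protein", True),         # predicted, as listed on the slide
}


def describe_refseq(accession: str):
    """Return (molecule, predicted) or None if the prefix is not in the table."""
    return REFSEQ_PREFIXES.get(accession[:3])


for acc in ("NM_015595", "NP_612564", "XP_000001", "AF415175"):
    print(acc, describe_refseq(acc))
```

XP_000001 is a made-up identifier used only to exercise the predicted branch; AF415175 is a GenBank accession and is therefore not recognised.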

Page 56: Genome, Protein and Model Organism Databases

RefSeq (Reference Sequence)

STATUS               CURATION
GENOME ANNOTATION    No
INFERRED             No
MODEL                No
PREDICTED            No
PROVISIONAL          No
REVIEWED             Yes   Yes (sequence + functional information and features)
VALIDATED            Yes   Yes (initial sequence)
WGS                  No

Page 57: Genome, Protein and Model Organism Databases

RefSeq entry NM_015595: SGEF mRNA

Accession number
Definition
Taxonomy
List of references

Page 58: Genome, Protein and Model Organism Databases

RefSeq entry NM_015595: SGEF mRNA

Gene name
Exon annotation
CDS annotation and sequence

Page 59: Genome, Protein and Model Organism Databases

RefSeq entry NM_015595: SGEF mRNA

Sequence

Page 60: Genome, Protein and Model Organism Databases

Searching with the gene name…

Page 61: Genome, Protein and Model Organism Databases

Etc.

Page 62: Genome, Protein and Model Organism Databases

Etc.

GenBank

Refseq

Page 63: Genome, Protein and Model Organism Databases

NCBI Entrez system

Looks for the request in all NCBI databases

Cannot be ignored -> no simple way to search only in your favourite NCBI database

Page 64: Genome, Protein and Model Organism Databases

Searching using BLAST…

Page 65: Genome, Protein and Model Organism Databases
Page 66: Genome, Protein and Model Organism Databases
Page 67: Genome, Protein and Model Organism Databases
Page 68: Genome, Protein and Model Organism Databases
Page 69: Genome, Protein and Model Organism Databases

RefSeq

Page 70: Genome, Protein and Model Organism Databases
Page 71: Genome, Protein and Model Organism Databases
Page 72: Genome, Protein and Model Organism Databases

UniGene: clusters of transcript sequences that appear to come from the same transcription locus

Page 73: Genome, Protein and Model Organism Databases

!?UniSTS:62643 maps to multiple loci in Homo sapiens

Information on tissue expression

Page 74: Genome, Protein and Model Organism Databases

UniGene

Mapping of known genes

Page 75: Genome, Protein and Model Organism Databases

UniGene

Mapping of known genes

Mapping of RNA (EMBL/GenBank/DDBJ & RefSeq)

Page 76: Genome, Protein and Model Organism Databases

UniGene

Mapping of RNA (EMBL/GenBank/DDBJ & RefSeq)

Mapping of RefSeq RNA

Mapping of known genes

Page 77: Genome, Protein and Model Organism Databases

UniGene

Mapping of RNA (EMBL/GenBank/DDBJ & RefSeq)

Mapping of RefSeq RNA

This default view can be customized

Mapping of known genes

Page 78: Genome, Protein and Model Organism Databases

1. Choose the desired option;
2. Add it (and/or remove undesired ones);
3. Apply the new display

Page 79: Genome, Protein and Model Organism Databases

Zoom out -> a better view of the genomic context of the sequence of interest

Page 80: Genome, Protein and Model Organism Databases

Original view

Page 81: Genome, Protein and Model Organism Databases

Map viewer

~ 110 organisms represented in the Genome database (www.ncbi.nlm.nih.gov/sites/entrez?db=genome).

Page 82: Genome, Protein and Model Organism Databases

Genomic tools on the UCSC server: BLAT search

Page 83: Genome, Protein and Model Organism Databases

And: A. gambiae, A. mellifera, S. cerevisiae

a total of 47 organisms

Page 84: Genome, Protein and Model Organism Databases

http://genome.ucsc.edu/cgi-bin/hgBlat

Feb. 2009 assembly: not all data implemented! May be better to use the former assembly for the time being.

Genome browser @ UCSC

cDNA sequence

Page 85: Genome, Protein and Model Organism Databases
Page 86: Genome, Protein and Model Organism Databases
Page 87: Genome, Protein and Model Organism Databases

Chromosomal location

Consensus CDS & other sequences from reliable resources

gDNA sequence

Page 88: Genome, Protein and Model Organism Databases

http://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi

Annotation of genes is provided by multiple public resources, using different methods, and resulting in information that is similar but not always identical.

CCDS database goal: provide a standard set of gene annotations.

Collaborative project involving teams (manual and automated annotation):
* European Bioinformatics Institute (EBI)
* National Center for Biotechnology Information (NCBI)
* Wellcome Trust Sanger Institute (WTSI)
* University of California, Santa Cruz (UCSC)

Currently available only for human and mouse genomes (July 2009):
20'159 human CCDS (including isoforms) -> 17'054 CCDS genes
17'707 mouse CCDS (including isoforms) -> 16'889 CCDS genes
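The CCDS counts quoted for July 2009 can be turned into an average number of consensus coding sequences per gene. A small worked calculation, using only the four numbers from the slide; `ccds_per_gene` is a hypothetical helper name.

```python
# (number of CCDS including isoforms, number of CCDS genes), July 2009 figures from the slide
CCDS_COUNTS = {
    "human": (20_159, 17_054),
    "mouse": (17_707, 16_889),
}


def ccds_per_gene(ccds_ids: int, genes: int) -> float:
    """Average number of CCDS per gene, rounded to 2 decimals."""
    return round(ccds_ids / genes, 2)


for species, (ccds_ids, genes) in CCDS_COUNTS.items():
    extra = ccds_ids - genes  # CCDS beyond the first one of each gene
    print(f"{species}: {ccds_per_gene(ccds_ids, genes)} CCDS per gene, {extra} additional isoforms")
```

The ratios stay close to 1, which suggests that at that date most genes had a single consensus CDS.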

Page 89: Genome, Protein and Model Organism Databases

Chromosomal location

Consensus CDS & other sequences from reliable resources

gDNA sequence

(Human) ESTs (including unspliced)

(Human) spliced ESTs

(Human) mRNAs

All sequences can be retrieved

Page 90: Genome, Protein and Model Organism Databases

The view can be completely customized…

Page 91: Genome, Protein and Model Organism Databases

…including with various tools allowing comparative genomics

Page 92: Genome, Protein and Model Organism Databases

…and including your own data !

http://genome.ucsc.edu/

Page 93: Genome, Protein and Model Organism Databases

Back to the Blat viewer

Page 94: Genome, Protein and Model Organism Databases

Arrows >>>> show the direction of transcription

Page 95: Genome, Protein and Model Organism Databases

2 transcripts from the same locus:
BDNF (Brain-Derived Neurotrophic Factor)
BDNFOS (BDNF Opposite Strand)

Page 96: Genome, Protein and Model Organism Databases

Exons

Page 97: Genome, Protein and Model Organism Databases

View of alternative exons

Alternative exons

Constitutive exons

Page 98: Genome, Protein and Model Organism Databases

Interested in this exon?

Just zoom in…

Page 99: Genome, Protein and Model Organism Databases
Page 100: Genome, Protein and Model Organism Databases

Genome browser @ UCSC has many great options, give it a try!

http://genome.ucsc.edu/

Page 101: Genome, Protein and Model Organism Databases

Typical problems

or

Why wonderful tools will never replace the brain of a life scientist!

Page 102: Genome, Protein and Model Organism Databases
Page 103: Genome, Protein and Model Organism Databases

… Once upon a time, there was a gene on chromosome 11…

Page 104: Genome, Protein and Model Organism Databases

2 essential genome resources are missing from this lecture:

Ensembl (http://www.ensembl.org/index.html): automated annotation of many genomes;

Vega (http://vega.sanger.ac.uk/index.html): high-quality manual annotation of genomes (currently Homo sapiens, Mus musculus, Danio rerio, Gorilla gorilla, Macropus eugenii, Sus scrofa, Canis familiaris).

Please go and visit them!

Page 105: Genome, Protein and Model Organism Databases

The flow of information

From DNA sequences to protein sequences:

A little biology and a few databases

Page 106: Genome, Protein and Model Organism Databases

From genome to proteome: the example of human

Genome: ~ 20’500 human protein-encoding genes

  -> increase in complexity 2-5 x (alternative promoter usage, alternative splicing, trans-splicing, mRNA editing, …)

Transcriptome: ~ 100’000 human transcripts

  -> increase in complexity 5-10 x (post-translational modifications (PTMs))

Proteome: ~ 1'000'000 human proteins

Most PTMs cannot be predicted from DNA sequences
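The orders of magnitude on this slide can be checked with a short calculation. A minimal sketch, assuming that the 2-5 x factor applies to the step from genes to transcripts and the 5-10 x factor to the step from transcripts to proteins; `scale` is a hypothetical helper.

```python
GENES = 20_500  # ~20'500 human protein-encoding genes (from the slide)


def scale(count_range, factor_range):
    """Multiply a (low, high) count range by a (low, high) factor range."""
    low, high = count_range
    f_low, f_high = factor_range
    return low * f_low, high * f_high


transcripts = scale((GENES, GENES), (2, 5))  # genome -> transcriptome
proteins = scale(transcripts, (5, 10))       # transcriptome -> proteome

print("transcripts:", transcripts)
print("proteins:", proteins)
```

The upper bounds, 102'500 transcripts and 1'025'000 proteins, match the slide's ~100'000 and ~1'000'000.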

Page 107: Genome, Protein and Model Organism Databases

The hectic life of a protein sequence…

[Flow diagram]

cDNAs, ESTs, genomes, …
• Data not submitted to public databases, delayed or cancelled…
• Nucleic acid databases: EMBL, GenBank, DDBJ (www.insdc.org, International Nucleotide Sequence Database Collaboration)
  – …if a Coding Sequence (CDS) is submitted -> Protein sequence databases
  – no CDS -> Gene prediction (RefSeq, Ensembl + some MODs)

Other inputs shown: sequences from publications (journal scan); direct submissions

Page 108: Genome, Protein and Model Organism Databases

!!!!

99% of the protein sequences found in databases come from the translation of nucleotide sequences

=> Experimental evidence may be lacking!

Page 109: Genome, Protein and Model Organism Databases

EMBL (DNA) -> TrEMBL (Translated EMBL)

EMBL (DNA) entry: translated CDS, product name, tissue, reference

TrEMBL entry: translated CDS, protein name, reference + tissue

Automated extraction of protein sequence (translated CDS), gene name and references + automated annotation.

A similar pipeline is used at the NCBI to go from GenBank to GenPept
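The core of such a pipeline is the conceptual translation of the annotated CDS. A minimal sketch, assuming the standard genetic code and a made-up 12-nucleotide CDS; `translate_cds` is a hypothetical helper and says nothing about how the actual TrEMBL or GenPept pipelines are implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a CDS (DNA or RNA letters) up to the first stop codon."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3:
        raise ValueError("CDS length must be a multiple of 3")
    protein = []
    for i in range(0, len(cds), 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCAAATGA"))  # codons ATG GCC AAA TGA
```

A translation of this kind carries no experimental evidence of its own, so the resulting protein sequence is only as reliable as the submitted CDS annotation.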

Page 110: Genome, Protein and Model Organism Databases

!!!!

The quality of UniProtKB/TrEMBL (& GenPept) entries depends upon the quality of the submissions in the original EMBL-Bank/GenBank/DDBJ entry.

Page 111: Genome, Protein and Model Organism Databases

EMBL

TrEMBL

Page 112: Genome, Protein and Model Organism Databases
Page 113: Genome, Protein and Model Organism Databases

EMBL (DNA) -> TrEMBL -> Swiss-Prot

EMBL (DNA) entry: translated CDS, product name, tissue, reference

TrEMBL entry: translated CDS, protein name, reference
Automated extraction of protein sequence (translated CDS), gene name and references. Automated annotation.

Swiss-Prot entry: translated CDS + SAPs + isoforms + …, protein names, many more references, full annotation
Manual annotation of the sequence and review of associated biological information

Page 114: Genome, Protein and Model Organism Databases

Sequence

Sequence features

Ontologies

References

Nomenclature

Splice variants

Annotations

Page 115: Genome, Protein and Model Organism Databases

Evidence for protein existence: annotation in UniProtKB

5 levels of evidence:
1. evidence at protein level,
2. evidence at transcript level,
3. inferred by homology,
4. predicted,
5. uncertain.
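The five levels form an ordered scale, which makes them easy to handle in code. A minimal sketch, assuming the `PE` line of the UniProtKB flat-file format (for example `PE   1: Evidence at protein level;`); `ProteinExistence` and `parse_pe_line` are hypothetical names.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """The 5 levels of evidence, 1 being the strongest."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5


def parse_pe_line(line: str) -> ProteinExistence:
    """Read the numeric level from a flat-file line such as 'PE   1: ...;'."""
    code, _, _ = line.partition(":")
    return ProteinExistence(int(code.split()[1]))


level = parse_pe_line("PE   1: Evidence at protein level;")
print(level.name, level < ProteinExistence.PREDICTED)
```

Because the enum is ordered, entries can be filtered with a simple comparison, for instance keeping only levels 1 and 2.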

Page 116: Genome, Protein and Model Organism Databases

http://www.uniprot.org/uniprot/P35613

Page 117: Genome, Protein and Model Organism Databases
Page 118: Genome, Protein and Model Organism Databases

http://www.uniprot.org/uniprot/Q9Y471

Page 119: Genome, Protein and Model Organism Databases

http://www.uniprot.org/uniprot/Q9Y471

Page 120: Genome, Protein and Model Organism Databases

UniProtKB/Swiss-Prot: 115 explicit links and 19 implicit links!

2D-gel dbs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE

Family and domain dbs: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs

Organism-specific dbs: AGD, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GenAtlas, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, Xenbase, ZFIN

Protein family/group dbs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB

Genome annotation dbs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase

Enzyme and pathway dbs: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome

Sequence dbs: EMBL, IPI, PIR, UniGene, RefSeq

3D structure dbs: DisProt, HSSP, PDB, PDBsum, SMR

PTM dbs: GlycoSuiteDB, PhosphoSite, PhosSite

Proteomic dbs: PeptideAtlas, PRIDE, ProMEX

Protein-protein interaction dbs: DIP, IntAct

Phylogenomic dbs: HOGENOM, HOVERGEN, OMA

Polymorphism dbs: dbSNP

Gene expression dbs: ArrayExpress, Bgee, CleanEx, GermOnline

Ontologies: GO

Others: BindingDB, PMAP-CutDB, DrugBank, NextBio

Page 121: Genome, Protein and Model Organism Databases
Page 122: Genome, Protein and Model Organism Databases

The UniProt consortium

Protein Information Resource

European Bioinformatics Institute (European Molecular Biology Laboratory)

Swiss Institute of Bioinformatics

Page 123: Genome, Protein and Model Organism Databases

UniProt mission:

Provide a comprehensive high-quality and freely accessible resource of protein sequence and functional annotation.

Page 124: Genome, Protein and Model Organism Databases

New release every 3 weeks

Page 125: Genome, Protein and Model Organism Databases

Update frequency: a crucial issue!!

• Sometimes very difficult, or even impossible, to find;

• Crucial not only for the database itself, but also for tools using databases.

Page 126: Genome, Protein and Model Organism Databases

Update frequency

Page 127: Genome, Protein and Model Organism Databases
Page 128: Genome, Protein and Model Organism Databases

http://www.matrixscience.com/search_intro.html

Page 129: Genome, Protein and Model Organism Databases

Mascot MS/MS identification tool is fine, but it cannot be used from this website !

Solution: Download the database of interest and make sure you work with an up-to-date version.

Page 130: Genome, Protein and Model Organism Databases

Never hesitate to ask for an update

Page 131: Genome, Protein and Model Organism Databases
Page 132: Genome, Protein and Model Organism Databases

UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9’232’223 entries)

UniParc: protein sequence archive (equivalent to EMBL-Bank/GenBank/DDBJ at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20’070’606 entries)

Page 133: Genome, Protein and Model Organism Databases

UniParc entry contains all records for a unique sequence in major publicly available databases.

TrEMBL entry merged into Swiss-Prot => does not exist anymore

Page 134: Genome, Protein and Model Organism Databases

UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9’232’223 entries)

UniParc: protein sequence archive (EMBL equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20’070’606 entries)

UniRef: 3 sets of clusters of protein sequences with 100, 90 and 50 % identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100 8’474’689 entries; UniRef90 5’668’669 entries; UniRef50 2’729’565 entries)

Page 135: Genome, Protein and Model Organism Databases

UniRef100, 90 and 50

One UniRef100 entry -> merge of identical sequences (including subfragments, splice variants). Based on UniProtKB sequences and selected UniParc records (such as Ensembl & RefSeq).

One UniRef90 entry -> sequences that have at least 90% identity. Built from UniRef100.

One UniRef50 entry -> sequences that are at least 50% identical. Built from UniRef100.

Page 136: Genome, Protein and Model Organism Databases
Page 137: Genome, Protein and Model Organism Databases

UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (7’097’874 entries)

UniParc: protein sequence archive (EMBL equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (17’646’564 entries)

UniRef: 3 sets of clusters of protein sequences with 100, 90 and 50 % identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100 6’652’983 entries; UniRef90 4’438’653 entries; UniRef50 2’104’702 entries)

UniMES: protein sequences derived from metagenomic projects (Global Ocean Sampling (GOS)) (Blast, download) (UniMES 6'028'191 entries)

Page 138: Genome, Protein and Model Organism Databases

What is "Non-Redundancy" ?

• UniParc
– One UniParc entry for all entries corresponding to 100% identical sequences (100% identity over the entire length) (from many different databases).

• UniRef
– One UniRef100 entry for all entries corresponding to 100% identical sequences (including fragments) from UniProtKB, Ensembl, RefSeq, PDB.

• UniProtKB/Swiss-Prot
– One Swiss-Prot entry for all the protein products of one gene, including fragments, variations/polymorphisms, splice variants, sequencing errors…
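The UniParc notion of non-redundancy (one entry per sequence that is 100% identical over its entire length) amounts to grouping records by their exact sequence. The sketch below assumes made-up database names, accessions and sequences, and a hypothetical function name `merge_identical`; it illustrates the principle only and is not UniParc's implementation.

```python
from collections import defaultdict

# Hypothetical records: (source database, accession, sequence).
TOY_RECORDS = [
    ("dbA", "X0001", "MKTAYIAK"),
    ("dbB", "Y0002", "MKTAYIAK"),   # same sequence as X0001
    ("dbB", "Y0003", "MKTAYIAKQ"),  # one residue longer: a different entry
]


def merge_identical(records):
    """Group source records that share exactly the same sequence.

    Returns {sequence: [(database, accession), ...]}, i.e. one
    non-redundant entry per distinct full-length sequence, carrying
    cross-links to every source record.
    """
    entries = defaultdict(list)
    for database, accession, sequence in records:
        entries[sequence.upper()].append((database, accession))
    return dict(entries)


if __name__ == "__main__":
    for sequence, sources in merge_identical(TOY_RECORDS).items():
        print(sequence, sources)
```

Under this exact-match rule a fragment stays a separate entry, whereas UniRef100 also merges fragments and Swiss-Prot merges all products of one gene.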

Page 139: Genome, Protein and Model Organism Databases

Comparing searches: NCBI and UniProt

Page 140: Genome, Protein and Model Organism Databases

Search for the human Toll-like receptor 4 in Entrez Protein (NCBI)

[Figure: search results, with entries labelled by source database (GenPept, Swiss-Prot, RefSeq) and two groups marked "Identical sequences".]

Accessions shown: AAC34135, CAH72619; AAF05316, BAG55035, CAH72618, AAI17423, AAF89753; NP_612564; O00206

Page 141: Genome, Protein and Model Organism Databases

Search for the human Toll-like receptor 4 in UniProtKB

[Figure: search result, with the entry labelled Swiss-Prot.]

Page 142: Genome, Protein and Model Organism Databases

Sequences retrieved in Entrez Protein:

O00206, AAF05316, CAH72618, CAH72619, BAG55035, AAI17423, AAF89753, NP_612564*, AAC34135

*Based on A126770, BC117422, AL160272 and AA598398
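The accessions retrieved above come from different source databases, and the accession format usually tells them apart. The sketch below is a simplified illustration: the function name `classify_accession` is hypothetical, the UniProtKB pattern follows the published accession format, and the RefSeq and GenPept patterns are deliberately reduced assumptions (a two-letter prefix ending in P plus underscore for RefSeq proteins; three letters followed by five or seven digits for GenPept).

```python
import re

# Simplified patterns, checked in order (assumptions, see lead-in).
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]P_\d+$")),
    ("UniProtKB", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("GenPept", re.compile(r"^[A-Z]{3}\d{5}(?:\d{2})?$")),
]


def classify_accession(accession):
    """Guess the source database from the accession format."""
    # Drop an optional version suffix such as ".1" before matching.
    core = accession.strip().split(".")[0]
    for database, pattern in PATTERNS:
        if pattern.match(core):
            return database
    return "unknown"


if __name__ == "__main__":
    retrieved = ["O00206", "AAF05316", "CAH72618", "CAH72619", "BAG55035",
                 "AAI17423", "AAF89753", "NP_612564", "AAC34135"]
    for accession in retrieved:
        print(accession, classify_accession(accession))
```

Applied to the list above, this sorts the nine accessions into one UniProtKB entry, one RefSeq entry and seven GenPept entries, which is the redundancy the comparison is meant to show.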

Page 143: Genome, Protein and Model Organism Databases

Major protein sequence resources

UniProtKB: Swiss-Prot + TrEMBL

Entrez Protein: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq

PIR PDB PRF

UniProtKB/Swiss-Prot: manually annotated protein sequences (~12’000 species)

UniProtKB/TrEMBL: submitted CDS (EMBL); automated annotation (~202’000 species)

GenPept: submitted CDS (GenBank)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank; 3D data and associated sequences

PRF: journal scan of ‘published’ peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation

Resources integrated in the entries

Resources integrated in the search engine

Page 144: Genome, Protein and Model Organism Databases

Model Organism Databases (MODs) at a glance

Page 145: Genome, Protein and Model Organism Databases

Model organism

Species extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms.

Model organism: MOD, URL

Mus musculus: MGI, http://www.informatics.jax.org/
Rattus norvegicus: RGD, http://rgd.mcw.edu/
Oryza sativa: RAP-DB, http://rapdb.dna.affrc.go.jp/
Arabidopsis thaliana: TAIR, http://www.arabidopsis.org/
Drosophila melanogaster: FlyBase, http://flybase.org/
Schizosaccharomyces pombe: S. pombe GeneDB, http://www.genedb.org/genedb/pombe/
Saccharomyces cerevisiae: SGD, http://www.yeastgenome.org/
Caenorhabditis elegans: WormBase, http://www.wormbase.org/
Dictyostelium discoideum: dictyBase, http://dictybase.org/
Bacillus subtilis: SubtiList, http://genolist.pasteur.fr/SubtiList/
Escherichia coli: EcoGene, http://ecogene.org/
Danio rerio (zebrafish): ZFIN, http://zfin.org/

Just a few examples, not an exhaustive list!

Methanocaldococcus jannaschii -> no MOD

Page 146: Genome, Protein and Model Organism Databases

Model organism databases (MODs)

- Genome annotation;
- Gene models;
- Gene mapping;
- Official nomenclature;
- Gene expression;
- Functional annotation;
- Interactions;
- Information about mutants/knockout/transgenic animals;
- Phenotypes;
- (Cross-)references;
- Species-specific reagents…

Key resources for information on a given organism

Service provided to/from a given community

MODs do not necessarily store sequences, but give access to them

Page 147: Genome, Protein and Model Organism Databases
Page 148: Genome, Protein and Model Organism Databases
Page 149: Genome, Protein and Model Organism Databases
Page 150: Genome, Protein and Model Organism Databases
Page 151: Genome, Protein and Model Organism Databases
Page 152: Genome, Protein and Model Organism Databases

Link to cDNA sequences

Page 153: Genome, Protein and Model Organism Databases

http://gmod.org/wiki/Main_Page

Page 154: Genome, Protein and Model Organism Databases

The world of databases is a jungle

Page 155: Genome, Protein and Model Organism Databases

A few points to remember when using databases

- Content;

- Primary / secondary / meta-databases;
- Curated / non-curated;
- Manual / automated curation;
- Redundant / non-redundant.

- Update frequency;

- Stable identifiers;

- Strategy;
- Dataflow;
- Collaborations between databases.

Page 156: Genome, Protein and Model Organism Databases

Test a few genomic databases and tools

Page 157: Genome, Protein and Model Organism Databases

Genomes and genomic tools: a few sites

NCBI: http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome
EBI: http://www.ebi.ac.uk/genomes/
TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/shared/Genomes.cgi

Genome annotation and analysis tools:
http://www.ensembl.org/index.html
http://vega.sanger.ac.uk/index.html
http://genome.ucsc.edu/ -> BLAT, Galaxy, Custom tracks, …
http://www.jgi.doe.gov/software/ -> Genome portal, Integrated Microbial Genomes (IMG) and other tools

Generic Model Organism Database: http://gmod.org/wiki/Main_Page

Page 158: Genome, Protein and Model Organism Databases

Genomes and genomic tools: Hands-on

- Find your favorite (completely sequenced) organism in a genome db;
- Follow the links to see the options on different sites;
- Find the sequences;
- Look at the annotation of your favorite gene;
- Compare the entries corresponding to this gene across sites;
- Test search engines (restrict searches, compare results, …)

Whenever possible use on-line tutorials, such as: http://www.ensembl.org/info/website/tutorials/index.html

Visit GMOD, see the tools (http://gmod.org/wiki/GMOD_Components)

Play around with the BLAT search, customize display, follow the links, …

Page 159: Genome, Protein and Model Organism Databases

Genomes and genomic tools: Hands-on

Go and visit the databases cited in this lecture;

The databases/tools that should be "familiar" to all are:
http://genome.ucsc.edu/cgi-bin/hgBlat
http://www.ensembl.org/index.html
gene/genome databases/tools on http://www.ncbi.nlm.nih.gov/

If none of the databases is of interest to you, go to the NAR database (http://www.oxfordjournals.org/nar/database/a/) and find the databases that are closest to your interests;

Play around…

Hands-on protein sequence databases and UniProt:
http://education.expasy.org/cours/HK09/Protein_database_TP.html
(corrections: http://education.expasy.org/cours/HK09/Protein_database_TP_correction.html)

Page 160: Genome, Protein and Model Organism Databases

Thank You !