The annotation of plant proteins in UniProtKB

The annotation of Plant Proteins in UniProtKB. Michel Schneider, Plant protein annotation program, Swiss-Prot group, Swiss Institute of Bioinformatics, Geneva, Switzerland. [email protected]

Upload: embl-ebi

Post on 27-May-2015

DESCRIPTION

Event: Plant and Animal Genomes conference 2012. Speaker: Michel Schneider.

The UniProt Knowledgebase consists of two sections: UniProtKB/Swiss-Prot, which contains manually annotated protein sequences enriched with functional information added by expert human curators, and UniProtKB/TrEMBL, which contains unreviewed records enhanced with information provided by automated rule-based annotation systems. The majority of UniProtKB records are based on automatic translation of coding sequences (CDS) provided by submitters at the time of initial deposition to the nucleotide sequence databases. To provide the complete proteome of Arabidopsis thaliana, a complementary curation pipeline for importing protein sequences from TAIR has been developed. Because the complete genome reannotation in the TAIR10 release contains most of the sequences already in UniProtKB, these existing sequences have to be reconciled with the imported ones. Around 7% of them have a different gene model and have to be checked manually. Based on these comparisons, we improved over 200 of our predicted proteins. In exchange, we provide TAIR with the gene model corrections that we introduce on the basis of our trans-species family annotation. This approach allows identification of data that can be seamlessly transferred from one site to the other and the development of common annotations. With the significant increase in the number of complete genomes sequenced (1001 Arabidopsis cultivars are currently under way!), organizing these data in a convenient way is critical. UniProt has selected a set of “reference proteomes”, including A. thaliana cv. Columbia, which provide broad coverage of the tree of life and constitute a representative cross-section of the taxonomic diversity found within UniProtKB.

TRANSCRIPT

Page 1: The annotation of plant proteins in UniProtKB

The annotation of Plant Proteins in UniProtKB

Michel Schneider

Plant protein annotation program, Swiss-Prot group

Swiss Institute of Bioinformatics

Geneva, Switzerland

[email protected]

Page 2: The annotation of plant proteins in UniProtKB

“Pioneers at the Heart of Science” 1998 – 2008

PAG XX, San Diego, January 15, 2012

1. The UniProt consortium and its products

2. Content of an entry in UniProtKB and manual curation

3. Complete proteomes and reference proteomes

4. Synchronization between UniProtKB and TAIR

5. Some statistics

Page 3: The annotation of plant proteins in UniProtKB

The UniProt consortium

Page 4: The annotation of plant proteins in UniProtKB

The missions of the UniProt consortium

Provide the scientific community with a resource of protein sequence and functional annotation which has to be …

comprehensive

high quality

and freely accessible

Page 5: The annotation of plant proteins in UniProtKB

Four components to fulfill specific demands

UniParc – Sequence archive: contains current and obsolete sequences (29.6 million sequences)

UniRef – Sequence clusters: UniRef100, UniRef90, UniRef50

UniMES – Metagenomic and environmental sample sequences

UniProtKB – Protein Knowledgebase

UniProtKB/Swiss-Prot – Reviewed (533’657 entries), manual curation

UniProtKB/TrEMBL – Unreviewed (19 million entries), automated annotation

Page 6: The annotation of plant proteins in UniProtKB

UniProtKB, the expertly curated component of UniProt

The high-quality curated protein knowledge database

where data becomes structured knowledge

Page 7: The annotation of plant proteins in UniProtKB

Shigeo Fukuda

UniProtKB, the expertly curated component of UniProt

Page 8: The annotation of plant proteins in UniProtKB

© 2009 SIB

Protein sequence: one gene, one species

Page 9: The annotation of plant proteins in UniProtKB

Protein sequence: one gene, one species

Protein and gene names

Taxonomic information

Page 10: The annotation of plant proteins in UniProtKB

Protein sequence: one gene, one species

Protein and gene names

Taxonomic information

Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…

Page 11: The annotation of plant proteins in UniProtKB

Protein sequence: one gene, one species

Protein and gene names

Taxonomic information

Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…

General annotation: Function, Subcellular location, Catalytic activity, Tissue specificity, Disruption phenotype…

Page 12: The annotation of plant proteins in UniProtKB

Protein sequence: one gene, one species

Protein and gene names

Taxonomic information

Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…

General annotation: Function, Subcellular location, Catalytic activity, Tissue specificity, Disruption phenotype…

References

Page 13: The annotation of plant proteins in UniProtKB

Protein sequence: one gene, one species

Protein and gene names

Taxonomic information

Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…

General annotation: Function, Subcellular location, Catalytic activity, Tissue specificity, Disruption phenotype…

References

Keywords - Gene Ontology

Page 14: The annotation of plant proteins in UniProtKB

Protein sequence: one gene, one species

Protein and gene names

Taxonomic information

Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…

General annotation: Function, Subcellular location, Catalytic activity, Tissue specificity, Disruption phenotype…

References

Keywords - Gene Ontology

Cross-references (~130 databases)
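
The sections of an entry listed above can be pictured as a simple record type. The following is only an illustrative Python sketch: the class name, field names and types are assumptions made for this example and do not reflect the actual UniProtKB data model or file format.

```python
from dataclasses import dataclass, field


@dataclass
class EntrySketch:
    """Illustrative container mirroring the entry sections named on the slide."""
    sequence: str
    protein_names: list = field(default_factory=list)
    gene_names: list = field(default_factory=list)
    taxonomy: str = ""
    sequence_annotation: list = field(default_factory=list)  # (feature type, start, end)
    general_annotation: dict = field(default_factory=dict)   # topic -> free text
    references: list = field(default_factory=list)
    keywords: list = field(default_factory=list)
    go_terms: list = field(default_factory=list)
    cross_references: dict = field(default_factory=dict)     # database -> identifiers


# Names taken from the QPCT_ARATH example in this talk; the sequence and the
# annotation text are placeholders, not real data.
entry = EntrySketch(
    sequence="MKT",
    protein_names=["Glutaminyl-peptide cyclotransferase"],
    gene_names=["At4g25720"],
    taxonomy="Arabidopsis thaliana",
)
entry.general_annotation["Function"] = "placeholder text"
```

Keeping each section as its own field reflects the "one gene, one species, one entry" idea: everything known about the gene products hangs off a single record.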

Page 15: The annotation of plant proteins in UniProtKB

Origin of the sequences in UniProtKB

International Nucleotide Sequence Database Collaboration (INSDC)

Ensembl or EnsemblGenomes

RefSeq

Direct submissions (protein sequences)

Literature

Protein Data Bank

Page 16: The annotation of plant proteins in UniProtKB

The process of manual sequence curation

1. Select entry/gene (priorities)

2. Identify entries from same gene and homologs using BLAST against UniProtKB

3. Merge entries from the same gene and same species into a single record

4. Select a canonical sequence
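
Steps 3 and 4 above can be sketched in a few lines of Python, assuming each entry has already been assigned a gene identifier. The class and function names are hypothetical, the example data are made up, and the "longest sequence" rule for choosing the canonical sequence is only a placeholder: the slides do not state the criteria curators actually apply.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RawEntry:
    accession: str   # identifier of the source record (made up here)
    gene: str        # gene identifier
    species: str
    sequence: str


@dataclass
class MergedRecord:
    gene: str
    species: str
    canonical: str                                     # sequence chosen for display
    accessions: list = field(default_factory=list)     # all merged source records
    alternatives: list = field(default_factory=list)   # other distinct sequences


def merge_by_gene(entries):
    """Group entries by (gene, species) and build one record per group."""
    groups = defaultdict(list)
    for e in entries:
        groups[(e.gene, e.species)].append(e)

    records = []
    for (gene, species), members in groups.items():
        # Placeholder rule (assumption): the longest sequence is canonical;
        # ties are broken alphabetically so the result is deterministic.
        distinct = sorted({m.sequence for m in members}, key=lambda s: (-len(s), s))
        records.append(MergedRecord(
            gene=gene,
            species=species,
            canonical=distinct[0],
            accessions=sorted(m.accession for m in members),
            alternatives=distinct[1:],
        ))
    return records


entries = [
    RawEntry("X1", "GENE_A", "A. thaliana", "MKT"),
    RawEntry("X2", "GENE_A", "A. thaliana", "MKTLL"),
    RawEntry("X3", "GENE_B", "A. thaliana", "MAA"),
]
records = merge_by_gene(entries)
print(len(records), records[0].canonical, records[0].accessions)
# prints: 2 MKTLL ['X1', 'X2']
```

The non-canonical sequences are kept in `alternatives` rather than discarded, since in practice they may correspond to isoforms or to discrepancies that need to be reported.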

Page 17: The annotation of plant proteins in UniProtKB

Critical analysis and report of sequence discrepancies

QPCT_ARATH (Q84WV9) Glutaminyl-peptide cyclotransferase (At4g25720)

Page 18: The annotation of plant proteins in UniProtKB

Critical analysis and report of sequence discrepancies

QPCT_ARATH (Q84WV9) Glutaminyl-peptide cyclotransferase (At4g25720)

Page 19: The annotation of plant proteins in UniProtKB

Page 20: The annotation of plant proteins in UniProtKB

Literature-based curation

Identify relevant papers by searching literature databases

Read the full text of papers, then extract and summarize relevant information

Page 21: The annotation of plant proteins in UniProtKB

Literature-based curation

Page 22: The annotation of plant proteins in UniProtKB

Literature-based curation

Page 23: The annotation of plant proteins in UniProtKB

Literature-based curation

Page 24: The annotation of plant proteins in UniProtKB

Controlled vocabularies

• Keywords provide a summary of the entry content

• We annotate using the Gene Ontology (GO)

Page 25: The annotation of plant proteins in UniProtKB

UniProtKB, complete proteome sequence sets

• Genome completely sequenced

• Proteins mapped to the genome

2’902 complete proteomes

Fully manually reviewed (e.g. S. cerevisiae)

Partially manually reviewed (e.g. A. thaliana)

Unreviewed (e.g. Chlorella variabilis)

Page 26: The annotation of plant proteins in UniProtKB

UniProtKB, reference proteome sequence sets

A reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research.

509 reference proteomes

Page 27: The annotation of plant proteins in UniProtKB

UniProtKB, complete proteome sequence sets

Page 28: The annotation of plant proteins in UniProtKB

Arabidopsis thaliana

The building of the complete proteome sequence set:

• Based on the re-annotation of complete genome by TAIR:

27’416 protein coding genes

Page 29: The annotation of plant proteins in UniProtKB

UniProtKB – TAIR synchronization

cDNAs, ESTs, genomic sequences

Nucleic acid databases

UniProtKB/Swiss-Prot Reviewed (10’340 entries)

UniProtKB/TrEMBL Unreviewed (40’574 entries)

release 2011_03 - Mar 08, 2011

Page 30: The annotation of plant proteins in UniProtKB

UniProtKB – TAIR synchronization

cDNAs, ESTs, genomic sequences

Nucleic acid databases

Genome re-annotation

Temporary TrEMBL set: 33’341 entries

UniProtKB/Swiss-Prot Reviewed (10’340 entries)

UniProtKB/TrEMBL Unreviewed (40’574 entries)

35’386 gene products

Page 31: The annotation of plant proteins in UniProtKB

UniProtKB – TAIR synchronization

cDNAs, ESTs, genomic sequences

Nucleic acid databases

Genome re-annotation

Temporary TrEMBL set: 33’341 entries

Compare translations from the same gene, merge if 100 % identical, report sequence discrepancies, align with orthologs and paralogs

UniProtKB/Swiss-Prot Reviewed (10’340 entries)

UniProtKB/TrEMBL Unreviewed (40’574 entries)

11’508 sequences

35’386 gene products
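
The comparison step on this slide (merge if 100 % identical, otherwise report a discrepancy) can be sketched as follows. This is a hypothetical illustration, not the pipeline used by UniProt or TAIR: the function names and the gene-to-sequence dictionaries are assumptions, the sequences are made up, and the alignment with orthologs and paralogs is left out.

```python
def first_difference(a, b):
    """Return the 1-based position of the first mismatch, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return i
    if len(a) != len(b):
        # One sequence is a prefix of the other: they differ just past the shorter one.
        return min(len(a), len(b)) + 1
    return None


def reconcile(existing, imported):
    """Compare per-gene translations from two sources.

    Both arguments map a gene identifier to a protein sequence.
    Returns (merged, discrepancies, new_only):
      merged        - genes whose sequences are 100 % identical
      discrepancies - (gene, position of first difference) pairs for manual review
      new_only      - genes present only in the imported set
    """
    merged, discrepancies, new_only = [], [], []
    for gene, seq in sorted(imported.items()):
        if gene not in existing:
            new_only.append(gene)
        elif existing[gene] == seq:
            merged.append(gene)
        else:
            discrepancies.append((gene, first_difference(existing[gene], seq)))
    return merged, discrepancies, new_only


# Made-up example data.
existing = {"G1": "MKTAY", "G2": "MSSLL"}
imported = {"G1": "MKTAY", "G2": "MSALL", "G3": "MPP"}

merged, discrepancies, new_only = reconcile(existing, imported)
print(merged, discrepancies, new_only)
# prints: ['G1'] [('G2', 3)] ['G3']
```

Only exact matches are merged automatically; anything else is flagged rather than resolved, which matches the slide's emphasis on reporting discrepancies for manual checking.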

Page 32: The annotation of plant proteins in UniProtKB

UniProtKB – TAIR synchronization

cDNAs, ESTs, genomic sequences

Nucleic acid databases

Genome re-annotation

Temporary TrEMBL set

Compare translations from the same gene, merge if 100 % identical, report sequence discrepancies, align with orthologs and paralogs

Correct gene models or add new isoforms

283 corrections

Feedback to TAIR

90 gene models

UniProtKB/Swiss-Prot Reviewed

UniProtKB/TrEMBL Unreviewed

Page 33: The annotation of plant proteins in UniProtKB

UniProtKB – TAIR synchronization

cDNAs, ESTs, genomic sequences

Nucleic acid databases

Genome re-annotation

Temporary TrEMBL set

UniProtKB/Swiss-Prot Reviewed

UniProtKB/TrEMBL Unreviewed

Cleaned set of new TrEMBL entries (21’656 entries)

Page 34: The annotation of plant proteins in UniProtKB

UniProtKB – TAIR synchronization

cDNAs, ESTs, genomic sequences

Nucleic acid databases

UniProtKB/Swiss-Prot Reviewed (10’875 entries)

UniProtKB/TrEMBL Unreviewed (44’628 entries)

Genome re-annotation

Temporary TrEMBL set

Cleaned set of new TrEMBL entries (21’656 entries)

+

UniProtKB/Swiss-Prot Reviewed (10’865 entries)

Arabidopsis thaliana, cv. Columbia

Complete proteome: 32’521 entries

release 2011_12 - Dec 14, 2011
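
The figures on this slide can be checked with a short calculation. The sketch below assumes that the "+" on the slide means the complete proteome is the sum of the reviewed Swiss-Prot set (10’865 entries) and the cleaned set of new TrEMBL entries (21’656 entries); the variable names are illustrative.

```python
swiss_prot_reviewed = 10_865   # reviewed A. thaliana entries (figure from the slide)
cleaned_new_trembl = 21_656    # cleaned set of new TrEMBL entries (figure from the slide)
complete_proteome = 32_521     # complete proteome size stated on the slide

total = swiss_prot_reviewed + cleaned_new_trembl
print(total, total == complete_proteome)
# prints: 32521 True
```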

Page 35: The annotation of plant proteins in UniProtKB

1001 Arabidopsis genomes

• Deposited to INSDC?

• Fully annotated? With CDS?

• Should we still merge all the identical sequences together?

• If they are kept separate rather than merged, how can we get relevant BLAST results?

Page 36: The annotation of plant proteins in UniProtKB

Some UniProtKB/Swiss-Prot Statistics concerning plant entries (UniProt release 2011_12 - Dec 14, 2011)

• 31,959 entries of Viridiplantae

• from 1,924 species

• 10,875 entries from Arabidopsis thaliana (with 1,219 isoforms)

• 2,823 entries from Oryza sativa subsp. japonica

• 11,897 plant entries with an EC number

• 966 different complete EC numbers

• 5,744 putative transporters or proteins involved in transport

Page 37: The annotation of plant proteins in UniProtKB

Summary

UniProtKB/Swiss-Prot, the manually curated knowledgebase:

• Protein sequence database covering all kingdoms of life (533’657 sequence entries; 12’664 species)

• Manually annotated

• Non-redundant: all products of one gene in one species in a single entry

• Highly cross-referenced (links to ~130 databases)

Plant protein annotation:

• Complete proteome for Arabidopsis thaliana

• Synchronization with TAIR

Page 38: The annotation of plant proteins in UniProtKB

We need your feedback and your collaboration!

[email protected]

Page 39: The annotation of plant proteins in UniProtKB

Acknowledgements

SIB: Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel Breuza, Alan Bridge, Edouard de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael Doche, Dolnide Dornevil, Severine Duvaud, Anne Estreicher, Livia Famiglietti, Marc Feuermann, Sebastien Gehant, Elisabeth Gasteiger, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller, Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson, Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue and Anne-Lise Veuthey

EBI: Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo Antunes, Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer, Francesco Fazzini, Alexander Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius Jacobsen, Michael Kleen, Duncan Legge, Wudong Liu, Jie Luo, Sandra Orchard, Samuel Patient, Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony Sawford, Harminder Sehra, Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg

PIR: Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen, Pratibha Dubey, Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale, Thanemozhi G. Natarajan, Jules Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su Yeh and Jian Zhang

www.uniprot.org

Page 40: The annotation of plant proteins in UniProtKB

UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U41 HG006104-01. Additional support for the EBI's involvement in UniProt comes from the NIH grant 2P41 HG02273-07. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts SLING (226073), Gen2Phen (200754) and MICROME (222886). PIR activities are also supported by the NIH grants 5R01GM080646-04, 3R01GM080646-04S2, 1G08LM010720-01, and 3P20RR016472-09S2, and NSF grant DBI-0850319.