The annotation of plant proteins in UniProtKB
DESCRIPTION
Event: Plant and Animal Genomes conference 2012
Speaker: Michel Schneider
The UniProt Knowledgebase consists of two sections: UniProtKB/Swiss-Prot, which contains manually annotated protein sequences enriched with functional information added by expert human curators, and UniProtKB/TrEMBL, which contains unreviewed records that are enhanced by information provided by automated rule-based annotation systems. The majority of UniProtKB records are based on automatic translation of coding sequences (CDS) provided by submitters at the time of initial deposition to the nucleotide sequence databases. In order to provide the complete proteome of Arabidopsis thaliana, a complementary curation pipeline for import of protein sequences from TAIR has been developed. As the complete genome reannotation proposed in the TAIR10 release contains most of the sequences already in UniProtKB, these existing sequences have to be reconciled with those imported. Around 7% of them have a different gene model and should be checked manually. Based on these comparisons, we improved over 200 of our predicted proteins. In exchange, we provide TAIR with the gene model corrections that we introduce on the basis of our trans-species family annotation. This approach allows identification of data that can be seamlessly transferred from one site to the other and the development of common annotations. With the significant increase in the number of complete genomes sequenced (1001 Arabidopsis cultivars are currently under way!), organization of this data in a convenient way is critical. UniProt has selected a set of “reference proteomes”, including A. thaliana cv. Columbia, which provide broad coverage of the tree of life and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB.
TRANSCRIPT
The annotation of Plant Proteins in UniProtKB
Michel Schneider
Plant protein annotation program, Swiss-Prot group
Swiss Institute of Bioinformatics
Geneva
“Pioneers at the Heart of Science” 1998 – 2008
PAG XX, San Diego, January 15, 2012
1. The UniProt consortium and its products
2. Content of an entry in UniProtKB and manual curation
3. Complete proteomes and reference proteomes
4. Synchronization between UniProtKB and TAIR
5. Some statistics
The UniProt consortium
The missions of the UniProt consortium
Provide the scientific community with a resource of protein sequence and functional annotation which has to be …
comprehensive
high quality
and freely accessible
Four components to fulfill specific demands
• UniParc: sequence archive, contains current and obsolete sequences (29.6 million sequences)
• UniRef: sequence clusters (UniRef100, UniRef90, UniRef50)
• UniMES: metagenomic and environmental sample sequences
• UniProtKB: protein knowledgebase
  UniProtKB/Swiss-Prot: reviewed, manual curation (533’657 entries)
  UniProtKB/TrEMBL: unreviewed, automated annotation (19 million entries)
UniProtKB, the expertly curated component of UniProt
The high-quality curated protein knowledge database
where data becomes structured knowledge
Shigeo Fukuda
UniProtKB, the expertly curated component of UniProt
© 2009 SIB
• Protein sequence (one gene, one species)
• Protein and gene names
• Taxonomic information
• Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…
• General annotation: function, subcellular location, catalytic activity, tissue specificity, disruption phenotype…
• References
• Keywords
• Gene Ontology
• Cross-references (~130 databases)
Origin of the sequences in UniProtKB
• International Nucleotide Sequence Database Collaboration (INSDC)
• Ensembl or EnsemblGenomes
• RefSeq
• Direct submissions (protein sequences)
• Literature
• Protein Data Bank
The process of manual sequence curation
1. Select entry/gene (priorities)
2. Identify entries from same gene and homologs using BLAST against UniProtKB
3. Merge entries from the same gene and same species into a single record
4. Select a canonical sequence
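Steps 3 and 4 above can be illustrated with a minimal Python sketch. It is not the Swiss-Prot curation tooling: the `Entry` and `MergedRecord` types, the accession strings in the usage note, and the rule of taking the longest sequence as canonical are all assumptions made for illustration. The slides do not state how curators choose the canonical sequence.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str  # hypothetical identifier, for illustration only
    gene: str
    species: str
    sequence: str


@dataclass
class MergedRecord:
    gene: str
    species: str
    canonical: str  # sequence displayed for the merged record
    accessions: list = field(default_factory=list)
    other_sequences: list = field(default_factory=list)


def merge_entries(entries):
    """Merge entries from the same gene and species into one record each."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry.gene, entry.species)].append(entry)

    records = []
    for (gene, species), group in groups.items():
        # Distinct sequences, kept in order of first appearance.
        distinct = list(dict.fromkeys(e.sequence for e in group))
        # ASSUMPTION: the longest sequence is taken as canonical.
        # A curator would decide this case by case.
        canonical = max(distinct, key=len)
        records.append(
            MergedRecord(
                gene=gene,
                species=species,
                canonical=canonical,
                accessions=[e.accession for e in group],
                other_sequences=[s for s in distinct if s != canonical],
            )
        )
    return records
```

Calling `merge_entries` on two entries for the same gene in Arabidopsis thaliana and one entry for that gene in another species would return two records, one per gene and species pair, which mirrors the "one gene, one species" principle of a reviewed entry.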
Critical analysis and report of sequence discrepancies
QPCT_ARATH (Q84WV9) Glutaminyl-peptide cyclotransferase (At4g25720)
Literature-based curation
• Identify relevant papers by searching literature databases
• Read the full text of papers, then extract and summarize relevant information
Controlled vocabularies
• Keywords provide a summary of the entry content
• We annotate using the Gene Ontology (GO)
UniProtKB, complete proteome sequence sets
• Genome completely sequenced
• Proteins mapped to the genome
2’902 complete proteomes
• Fully manually reviewed (e.g. S. cerevisiae)
• Partially manually reviewed (e.g. A. thaliana)
• Unreviewed (e.g. Chlorella variabilis)
UniProtKB, reference proteome sequence sets
A reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research.
509 reference proteomes
Arabidopsis thaliana
The building of the complete proteome sequence set:
• Based on the re-annotation of the complete genome by TAIR:
27’416 protein coding genes
UniProtKB – TAIR synchronization
release 2011_03 - Mar 08, 2011
Nucleic acid databases: cDNAs, ESTs, genomic sequences
UniProtKB/Swiss-Prot, reviewed (10’340 entries)
UniProtKB/TrEMBL, unreviewed (40’574 entries)
UniProtKB – TAIR synchronization
Nucleic acid databases: cDNAs, ESTs, genomic sequences
UniProtKB/Swiss-Prot, reviewed (10’340 entries)
UniProtKB/TrEMBL, unreviewed (40’574 entries)
Genome re-annotation: 35’386 gene products
Temporary TrEMBL set: 33’341 entries
Compare translations from the same gene, merge if 100 % identical, report sequence discrepancies, align with orthologs and paralogs
11’508 sequences
UniProtKB – TAIR synchronization
Nucleic acid databases: cDNAs, ESTs, genomic sequences
UniProtKB/Swiss-Prot, reviewed
UniProtKB/TrEMBL, unreviewed
Genome re-annotation
Temporary TrEMBL set
Compare translations from the same gene, merge if 100 % identical, report sequence discrepancies, align with orthologs and paralogs
Correct gene models or add new isoforms: 283 corrections
Feedback to TAIR: 90 gene models
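The comparison step in this workflow ("merge if 100 % identical, report sequence discrepancies") can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the actual UniProtKB pipeline: the function name `compare_translations` is hypothetical, each gene is reduced to a single translation (real genes can have several isoforms), and the alignment with orthologs and paralogs is left to the curator.

```python
def compare_translations(existing, imported):
    """Sort imported translations into three groups by gene.

    Both arguments map a gene identifier (for example an AGI locus code
    such as "At4g25720") to one protein sequence.
    ASSUMPTION: one translation per gene; isoforms are ignored here.
    """
    merged = []         # identical to the existing sequence: merge
    discrepancies = []  # sequences differ: report for manual checking
    new = []            # no existing sequence for this gene
    for gene, imported_seq in imported.items():
        existing_seq = existing.get(gene)
        if existing_seq is None:
            new.append(gene)
        elif existing_seq == imported_seq:
            merged.append(gene)
        else:
            discrepancies.append(gene)
    return merged, discrepancies, new
```

Genes that land in the `discrepancies` list correspond to the cases where the two gene models differ and a curator has to decide whether to correct the gene model, add a new isoform, or send feedback to TAIR.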
UniProtKB – TAIR synchronization
release 2011_12 - Dec 14, 2011
Nucleic acid databases: cDNAs, ESTs, genomic sequences
UniProtKB/Swiss-Prot, reviewed (10’875 entries)
UniProtKB/TrEMBL, unreviewed (44’628 entries)
Genome re-annotation
Temporary TrEMBL set
Cleaned set of new TrEMBL entries (21’656 entries) + UniProtKB/Swiss-Prot, reviewed (10’865 entries)
Arabidopsis thaliana, cv. Columbia complete proteome: 32’521 entries
1001 Arabidopsis genomes
• Deposited to INSDC?
• Fully annotated? With CDS?
• Should we still merge all the identical sequences together?
• If they are not merged but kept separate, how to get relevant BLAST results?
Some UniProtKB/Swiss-Prot Statistics concerning plant entries (UniProt release 2011_12 - Dec 14, 2011)
• 31,959 entries of Viridiplantae
• from 1,924 species
• 10,875 entries from Arabidopsis thaliana (with 1,219 isoforms)
• 2,823 entries from Oryza sativa subsp. japonica
• 11,897 plant entries with an EC number
• 966 different complete EC numbers
• 5,744 putative transporters or proteins involved in transport
Summary
UniProtKB/Swiss-Prot, the manually curated knowledgebase:
• Protein sequence database covering all kingdoms of life (533’657 sequence entries; 12’664 species)
• Manually annotated
• Non-redundant: all products of one gene in one species in a single entry
• Highly cross-referenced (links to ~130 databases)
Plant protein annotation:
• Complete proteome for Arabidopsis thaliana
• Synchronization with TAIR
We need your feedback and your collaboration!
Acknowledgements
SIB: Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel Breuza, Alan Bridge, Edouard de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael Doche, Dolnide Dornevil, Severine Duvaud, Anne Estreicher, Livia Famiglietti, Marc Feuermann, Sebastien Gehant, Elisabeth Gasteiger, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller, Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson, Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue and Anne-Lise Veuthey
EBI: Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo Antunes, Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer, Francesco Fazzini, Alexander Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius Jacobsen, Michael Kleen, Duncan Legge, Wudong Liu, Jie Luo, Sandra Orchard, Samuel Patient, Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony Sawford, Harminder Sehra, Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg
PIR: Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen, Pratibha Dubey, Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale, Thanemozhi G. Natarajan, Jules Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su Yeh and Jian Zhang
www.uniprot.org
UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U41 HG006104-01. Additional support for the EBI's involvement in UniProt comes from the NIH grant 2P41 HG02273-07. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts SLING (226073), Gen2Phen (200754) and MICROME (222886). PIR activities are also supported by the NIH grants 5R01GM080646-04, 3R01GM080646-04S2, 1G08LM010720-01, and 3P20RR016472-09S2, and NSF grant DBI-0850319.