Advancing Science with DNA Sequence
Data Curation in IMG-ER
Natalia Ivanova
MGM Workshop
May 16, 2012
Tricky question
• What do you need to do data curation in IMG?
  a) iPhone
  b) PhD in Computer Science
  c) Supernatural powers
• Correct answer: you need an IMG account
  http://img.jgi.doe.gov/er
What can be curated in IMG-ER?
1. Gene models
  a) Add a gene
  b) Make a gene pseudogene or “obsolete” (=delete it)
2. Functional annotations:
  c) Product names
  d) EC numbers
  e) Gene symbols
If you believe something else needs to be changed (genome name, taxonomy, etc.), please use the IMG Questions/Comments link
What can’t be changed: automated assignments to protein families (Pfam, COGs, TIGRfam, InterPro, SEED assignments, KO assignments)
Center point for curation – Gene Cart
• Product Name is free text (but see GenBank requirements http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit_annotation.html)
• Prot Description is free text (goes to “note” in GenBank submission)
• EC number and PUBMED ID – see explanation
• Notes are free text (goes to “note” in GenBank submission)
• Gene symbol is the “gene name”, a 4-letter abbreviation; goes to “gene” in GenBank submission
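The field-to-GenBank mapping listed above can be sketched in a few lines of Python. This is only an illustration: the dictionary keys (`product_name`, `prot_description`, `notes`, `gene_symbol`) and the function name are hypothetical, not IMG's actual export format, and joining the two free-text fields with "; " is an assumption. EC number and PUBMED ID are left out because the slide defers them to a separate explanation.

```python
def to_genbank_qualifiers(curation):
    """Map curated fields to GenBank-style qualifiers.

    `curation` is a plain dict with hypothetical keys; this is not an IMG API.
    """
    qualifiers = {}
    if curation.get("product_name"):
        # Product Name is free text and becomes the "product" qualifier.
        qualifiers["product"] = curation["product_name"]
    if curation.get("gene_symbol"):
        # Gene symbol goes to the "gene" qualifier.
        qualifiers["gene"] = curation["gene_symbol"]
    # Prot Description and Notes both go to "note"; the "; " separator
    # is an assumption made for this sketch.
    notes = [curation[key] for key in ("prot_description", "notes") if curation.get(key)]
    if notes:
        qualifiers["note"] = "; ".join(notes)
    return qualifiers
```

Empty or missing fields are skipped, so a gene with only a product name yields only a "product" qualifier.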
How to find the genes that need curation?
Two possible scenarios:
• You have submitted a genome to IMG-ER and want to have the best annotations possible for it (e.g. for GenBank submission)
• You’re an expert and know everything about a certain pathway or protein family (families) = “community service”
Curation of genome annotations
[Workflow diagram; steps shown: find genome, Genome Statistics, refine gene set, review Gene Pages, add to Gene Cart]
Find Genomes:
• Genome Browser
• Genome Search
Compare Gene Annotations:
• “Hypothetical protein”, but with some evidence
• Non-hypothetical protein, but no evidence
Genome Statistics:
• w/o enzymes but with candidate KO based enzymes
Review Gene Pages:
• Protein families
• Homologs/orthologs
• Gene Neighborhoods
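The two review cases named on this slide (a “hypothetical protein” that has some evidence, and a non-hypothetical protein that has none) can be expressed as a simple check. A minimal sketch, assuming each gene is described by its product name and a list of supporting hits; the function name and the data layout are hypothetical and do not reflect how IMG implements Compare Gene Annotations.

```python
HYPOTHETICAL = "hypothetical protein"

def needs_review(product_name, evidence):
    """Return a reason string if the gene matches one of the two cases, else None.

    `evidence` is a list of supporting hits (for example protein family names);
    the structure is an assumption made for this sketch.
    """
    is_hypothetical = HYPOTHETICAL in product_name.lower()
    if is_hypothetical and evidence:
        return "hypothetical, but with some evidence"
    if not is_hypothetical and not evidence:
        return "non-hypothetical, but no evidence"
    return None
```

Genes flagged this way would be the ones to add to the Gene Cart and inspect on their Gene Pages.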
Why do you want to review annotations?
• Most IMG pipelines are optimized for specificity, so they are more likely to have false negatives, but generate few false positives
• Compare Annotations
  – Product name is a consensus of multiple assignments: BLASTp, TIGRfam, COG, Pfam
  – Sources of false negatives - cutoffs: TIGRfam trusted cutoffs are quite stringent; COG doesn’t have trusted cutoffs; BLASTp cutoff of 50% identity
• Candidate genes with KO annotations – sources of false negatives
  – Cutoffs for % identity and alignment length
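To see how cutoffs produce false negatives, consider a sketch that splits hits into accepted assignments and near misses. The 50% identity threshold comes from the slide; the aligned-fraction threshold of 0.7, the use of `>=`, and the hit dictionary layout are assumptions for illustration only, not the actual IMG pipeline settings.

```python
MIN_IDENTITY = 50.0          # percent identity cutoff mentioned on the slide
MIN_ALIGNED_FRACTION = 0.7   # hypothetical alignment-length cutoff

def split_hits(hits):
    """Split hits into (accepted, near_misses) gene lists using both cutoffs."""
    accepted, near_misses = [], []
    for hit in hits:
        aligned_fraction = hit["align_len"] / hit["query_len"]
        if hit["identity"] >= MIN_IDENTITY and aligned_fraction >= MIN_ALIGNED_FRACTION:
            accepted.append(hit["gene"])
        else:
            # Rejected by a cutoff: a possible false negative worth manual review.
            near_misses.append(hit["gene"])
    return accepted, near_misses
```

A hit at 48% identity over nearly the full length lands in the near-miss list even though it may be a genuine homolog, which is the kind of gene manual curation is meant to catch.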
Curation of annotation in one genome (or a set of genomes)
a) Your favorite genes (experimental verification, etc.) -> use Find Genes, Gene Search or BLAST
b) “Compare Annotations” on Organism Details page
c) “Candidate genes with KO annotations” on Organism Details page
d) KEGG Pathways (either from Organism Details page or from Find Functions menu)
e) PhyloProfiler
A shortcut for product name/EC number assignments based on KO
Example of a missed gene
• Run PhyloProfiler with Deinococcus geothermalis as the query and Deinococcus hopiensis as the target (using the “with no homologs in” option)
• Select Dgeo_0119 as a sequence to check whether a homolog of this gene was missed in Deinococcus hopiensis
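Conceptually, the “with no homologs in” profile is a set difference: query genes for which no homolog is recorded in the target genome. The sketch below illustrates only that idea; the function name and the pair-list input are hypothetical, and it says nothing about how PhyloProfiler computes homology.

```python
def genes_without_homologs(query_genes, homolog_pairs):
    """Return query genes that have no homolog recorded in the target genome.

    `homolog_pairs` is a list of (query_gene, target_gene) tuples.
    """
    with_homolog = {query for query, _target in homolog_pairs}
    return [gene for gene in query_genes if gene not in with_homolog]
```

A gene such as Dgeo_0119 showing up in this list can mean either that the target truly lacks it or that the gene was missed during gene calling in the target, which is what the next step checks.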
Adding missed genes (cont'd)
• Use graphical viewer to check the translation
• Adjust the start if other start codons with better RBS exist upstream
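When adjusting a start, the first step is to list the alternative in-frame start codons upstream of the current one. A minimal sketch, assuming a forward-strand nucleotide string and the common bacterial start codons ATG, GTG and TTG; it does not score the RBS, which still has to be judged in the graphical viewer. The function name is hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def upstream_starts(seq, start):
    """Return 0-based positions of in-frame start codons upstream of `start`.

    Walks backwards one codon at a time and stops at the first in-frame
    stop codon, since the reading frame cannot extend past it.
    """
    seq = seq.upper()
    candidates = []
    pos = start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append(pos)
        pos -= 3
    return candidates
```

Candidates are returned nearest-first; each would then be checked for a plausible RBS a short distance upstream before the gene start is changed.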
Reviewing your annotations
• Organism Details page -> Genome Statistics
• MyIMG
IMG curation exercises
Go to the link in the usual place:
http://genomebiology.jgi-psf.org/Content/MGM-12.May2012/agenda.html
The first 2 pages are questions without answers; the rest is a cheat sheet