Automated Prokaryotic Annotation at JCVI
DESCRIPTION
Conference: Annual BRC Meeting (BRC6), Oct 28-29, 2008 in Ft. Lauderdale, Florida. Presenter: Dan Haft
TRANSCRIPT
Automated Prokaryotic Annotation at the JCVI
Daniel Haft
2008
A Dual-Use Pipeline
Multiple types of stored evidence
Persistent & Flexibly Interleaved
Supports selective re-annotation
Features annotation-driving databases
- CHAR
- TIGRFAMs
- Genome Properties
- BrainGrab Rules
Evidence used by Machine and by Experts
MANATEE interface for annotators
Capture new rules with BrainGrab
Computable objects: Output from one program becomes input to another.
HMM results drive Genome Properties
Genome Properties guide GO process assignments
GO process terms
Identification of Genome Features
IMM built
Glimmer builds a statistical model from the training set
Genome Sequence
• rRNA, tRNA, Rfam
• IS elements · Phage regions · Repeats
ORFs
Other Genome Features
Gene Finding: Glimmer & friends, homology methods
Homology Searches (gathering evidence)
BLAST-Extend-Repraze, Hidden Markov Models, misc.
Structural Curation (ORF Management)
Auto_Gene_Curate (start sites, overlaps), InterEvidence
Functional Assignments: Auto_Annotate, Manual, Mapped
Data Availability
Homology Searches
• HMM searches: TIGRFAMs & Pfam
• BLAST searches: against internal NIAA
• PROSITE motifs
• InterPro
• TmHMM
• SignalP
• Lipoprotein
• Psort
• Generate Paralogous Families
• Custom database searches (TransportDB, Rules)
Gene Model Curation
• Overlaps resolved by evidence competition
• Start site curation
• Missed genes / unsupported gene calls
Evidence can Overhang the Gene
Blast-Extend-Repraze (BER)
The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence. Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein.
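[Figure: ORFxxxxx running from end5 to end3, with a 300 bp extension on each side; the search protein is aligned against match proteins. Labeled cases: normal full length match; similarity extends in the same frame through a stop codon; similarity extends upstream through a start, or downstream past a frameshift. Markers used in the figure: * and !]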
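The cases labeled in the BER figure can be sketched as a small classifier. This is only an illustration, not JCVI's code: the `Hit` record, the function name, and the coordinate convention (0-based positions on the ORF plus its two 300 bp flanks) are assumptions made for the example.

```python
from dataclasses import dataclass

FLANK = 300  # bp of extension on each side of the ORF, as on the slide


@dataclass
class Hit:
    """Hypothetical summary of one BER alignment on the extended sequence."""
    aln_start: int    # first aligned base, 0-based, extended-sequence coordinates
    aln_end: int      # last aligned base, inclusive
    same_frame: bool  # True if the alignment never leaves the ORF's reading frame


def classify_ber_hit(orf_len: int, hit: Hit) -> str:
    """Report where the similarity lies relative to the predicted ORF."""
    orf_start = FLANK                 # ORF begins right after the upstream flank
    orf_end = FLANK + orf_len - 1     # last base of the ORF, inclusive
    if hit.aln_start < orf_start:
        return "extends upstream through the start"
    if hit.aln_end > orf_end:
        if hit.same_frame:
            return "extends in frame through a stop codon (possible PM)"
        return "extends downstream past a frameshift (possible FS)"
    return "normal match within the ORF"


# A 900 bp ORF whose match continues 150 bp past end3 in the same frame.
print(classify_ber_hit(900, Hit(aln_start=300, aln_end=1349, same_frame=True)))
```

A real implementation would work from the alignment itself rather than a precomputed `same_frame` flag; the flag keeps the sketch short.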
Pfam vs. TIGRFAMs
TIGRFAMs: RULES
- Functional assignments to proteins
- Granularity tuned for single-hit equivalogs (mono-functional!)
- Generates computable objects --> pathway reconstructions
Pfam: Explanations
- Names for homology domains in proteins
- Granularity tuned for twilight-level sequence similarity detection
- Explains things to annotator
TIGRFAMs equivalogs vs. Pfam domains
[Figure: proteins labeled X, X, X, Y, Z, with brackets marking the sets covered by TIGRxxxxx and by PFxxxxx]
TIGRFAMs as annotation rules
EC number: computable!
GO term: computable!
protein name: computable?
HMM hit: computable!!
Isology (homology) types: ranking our rules
EXCEPTION additional info, e.g. “vegetative”
EQUIVALOG the SAME (in enough ways) to receive the same name across multiple genomes, reflecting one specific function.
SUBFAMILY can name a whole class
DOMAIN class name for a protein region (and apply these classifications also to Pfam)
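One way to read "ranking our rules" is that, when several models hit the same protein, the most specific isology type wins. The sketch below assumes that the order on the slide (EXCEPTION, EQUIVALOG, SUBFAMILY, DOMAIN) is the rank order and that each model carries its own score cutoff; the tuple layout and the model names are hypothetical.

```python
# Assumed ranking: lower number = more specific = preferred.
RANK = {"EXCEPTION": 0, "EQUIVALOG": 1, "SUBFAMILY": 2, "DOMAIN": 3}


def best_rule(hits):
    """Pick the hit that should drive annotation.

    hits: iterable of (model_id, isology_type, score, cutoff) tuples.
    Returns the chosen tuple, or None if no hit passes its cutoff.
    """
    passing = [h for h in hits if h[1] in RANK and h[2] >= h[3]]
    if not passing:
        return None
    # Most specific isology type first; higher score breaks ties.
    return min(passing, key=lambda h: (RANK[h[1]], -h[2]))


hits = [
    ("MODEL_A", "DOMAIN", 300.0, 50.0),      # passes, but least specific
    ("MODEL_B", "EQUIVALOG", 120.0, 100.0),  # passes, more specific
    ("MODEL_C", "EXCEPTION", 10.0, 40.0),    # below its cutoff, ignored
]
print(best_rule(hits))
```

Sorting on specificity before score reflects the idea that an equivalog hit names one function, while a domain hit only names a region.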
CHAR: Experimentally Characterized Protein Database
• Highly curated database of experimentally characterized proteins; connects protein accessions, known function, and the scientific literature.
• What does it include:
– Controlled vocabulary describes the type of experimentation performed in each publication
– Key annotation fields (protein name, gene symbol, Enzyme Commission (EC) number, taxonomic data, Gene Ontology (GO) terms) are extracted
– Synonymous protein accessions obtained from public databases (GenBank and UniProt) are stored
Annotation Proceeds from …
Inside --> out (e.g. AutoAnnotate): for every protein
- Collect evidence
- Best-guess annotation
Outside --> in (e.g. TIGRFAMs): for every model
- Search tool + cutoff + standards = annotation rule
- Achieves partial coverage
Hybrid (BrainGrab): for every unfinished protein
- Look for means to annotate: blastp, synteny, hole-filling, etc.
- Capture annotator logic as a new rule
- Add to library of rules/models for all future genomes
[Figure labels: Subject Genome; Trusted, Complete, Automatic; Proper Realm of Annotator Attention; RULES; BrainGrab; NEW; genome, genome; share; validate; IMPORT]
BLASTP_MATCH [SP|P07363, 1600, 95, 92, 60, 1]
SP|P07363|CHEA_ECOLI Chemotaxis protein cheA EC 2.7.13.3
EcHS_A1984 is manually annotated confidently because it is similar enough to:
(method: defines “similar enough”)
Must be the only protein in genome that scores >= 1600 by blastp, covering >= 95 % of the length of the characterized protein and >= 92 % of the target protein, with >= 60 % sequence identity.
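The BLASTP_MATCH rule above can be sketched as data plus a check. The class and field names are invented for illustration. Reading the final `1` as "at most one passing protein per genome", and applying uniqueness to all four thresholds together, are assumptions based on the plain-language description. The hit values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BlastpMatchRule:
    accession: str                    # characterized protein, e.g. SP|P07363
    min_score: float                  # blastp score threshold
    min_pct_cov_characterized: float  # % of characterized protein covered
    min_pct_cov_target: float         # % of target (genome) protein covered
    min_pct_identity: float
    max_passing: int                  # assumed meaning of the last field


@dataclass
class BlastHit:
    target_id: str
    score: float
    pct_cov_characterized: float
    pct_cov_target: float
    pct_identity: float


def passes(rule: BlastpMatchRule, hit: BlastHit) -> bool:
    return (hit.score >= rule.min_score
            and hit.pct_cov_characterized >= rule.min_pct_cov_characterized
            and hit.pct_cov_target >= rule.min_pct_cov_target
            and hit.pct_identity >= rule.min_pct_identity)


def apply_rule(rule: BlastpMatchRule, genome_hits: list) -> list:
    """Return target IDs to annotate, or [] if the rule does not fire."""
    passing = [h.target_id for h in genome_hits if passes(rule, h)]
    # Fire only when the match is unambiguous within this genome.
    return passing if 1 <= len(passing) <= rule.max_passing else []


rule = BlastpMatchRule("SP|P07363", 1600, 95, 92, 60, 1)
hits = [BlastHit("EcHS_A1984", 1700, 98, 96, 88),    # hypothetical values
        BlastHit("other_protein", 400, 50, 40, 30)]  # hypothetical values
print(apply_rule(rule, hits))
```

Refusing to fire when two proteins pass keeps a captured rule from naming paralogs identically in a future genome.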
A Teachable Moment
a sample of expert opinion: “For This Particular Protein Family”
I (D.H.H.) assert that any > 75 %-identical, full-length match is the same protein.
Ditto any > 65 % match, as long as the region is clearly syntenic.
Ditto any single-copy > 50 % match, as long as it fills this hole in this otherwise mostly complete pathway.
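The three tiers of expert opinion above can be written as one decision function, which is roughly what capturing annotator logic as a rule amounts to. This is a sketch, not BrainGrab's actual representation: the function and argument names are made up, and treating "full-length" as a requirement for all three tiers is an assumption (the slide states it only for the first).

```python
from typing import Optional


def same_protein(pct_identity: float,
                 full_length: bool,
                 syntenic: bool = False,
                 single_copy: bool = False,
                 fills_pathway_hole: bool = False) -> Optional[str]:
    """Return the reason the match is accepted, or None if it is not."""
    if not full_length:  # assumption: every tier needs a full-length match
        return None
    if pct_identity > 75:
        return "identity"
    if pct_identity > 65 and syntenic:
        return "synteny"
    if pct_identity > 50 and single_copy and fills_pathway_hole:
        return "hole-filling"
    return None


print(same_protein(80, True))                  # accepted on identity alone
print(same_protein(70, True, syntenic=True))   # needs the synteny evidence
print(same_protein(55, True, single_copy=True, fills_pathway_hole=True))
```

Returning the reason rather than a bare boolean keeps a record of which kind of evidence justified each annotation.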
B “Bag of Genes”
G Genome Properties
E Evidence to drive other programs
Image from Gödel, Escher, Bach: an Eternal Golden Braid by Douglas Hofstadter, 1979
Genome Properties: annotation at the level of systems
- pathway (glyoxylate shunt)
- system (type III secretion)
- structure (outer membrane)
- genometrics (GC content)
- phenotype (motility, pathogenesis)
Property states: YES, some evidence, not supported, NO
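As a rough illustration of how HMM-derived evidence could be rolled up into a property state, the sketch below counts how many required components of a property were found in a genome. The mapping from counts to states is an assumption, and the sketch never returns NO because the slides do not say how NO differs from "not supported". The two-enzyme definition of the glyoxylate shunt is a simplified example, not the Genome Properties definition.

```python
def property_state(required_steps, found_steps) -> str:
    """Summarize evidence for one property in one genome.

    required_steps: components the property needs (names are illustrative).
    found_steps: components with passing evidence in this genome.
    """
    required = set(required_steps)
    if not required:
        raise ValueError("a property needs at least one required step")
    found = required & set(found_steps)
    if found == required:
        return "YES"
    if found:
        return "some evidence"
    return "not supported"


glyoxylate_shunt = ["isocitrate lyase", "malate synthase"]
print(property_state(glyoxylate_shunt, ["isocitrate lyase", "malate synthase"]))
print(property_state(glyoxylate_shunt, ["isocitrate lyase"]))
print(property_state(glyoxylate_shunt, []))
```

Because the result is a plain value, it can serve as input to a later step such as GO process assignment.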
Some Novel Genome Properties
12 subtypes of CRISPR/Cas system
PEP-CTERM / exo-sortase: Biofilm-associated protein sorting
Type VI secretion (53 loci in B. mallei 23344)
Post-translational selenium-modified enzymes
Heterocycle-containing bacterial toxin production: BA_2677 = “heterocyclo-anthracin”
A family of variable putative toxins with patterns of CGG insertions.
Future Annotation Pipeline Enhancements
• Populate the Characterized Protein Database
• Develop META-RULES from CHAR
• BrainGrab for novel content
• Import additional computable evidence
• Improve exchanges of validation sets
• Build a protein names ontology
Acknowledgements
Ramana Madupu
Jeremy Selengut
Alex Richter
JCVI microbial annotation team