Swiss-Prot Fortaleza
• Sybil: Comparative Analysis System
• Gemina: Epidemiological Resource
Owen White
July 31st, 2006
ISMB POSTERS
• Sam Angiuoli Ergatis/Sybil Poster H-82
• Aaron Gussman Gemina Poster B-46
Sybil searches and computes
• Searches:
– All-vs-all blastp searches
– MUMmer: nucleotide/protein SNPs
• Clustering:
– Evaluation of protein-match networks
– Scoring system set by user
– COGs
– Bidirectional best hits
– Jaccard COG-clustering for transitive closure
– Also paralogs
• Syntenic block:
– Collections of J-COGs between species
– Runs of J genes without K non-homologous intervening genes
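The "runs of J genes without K non-homologous intervening genes" rule can be sketched as a single pass along one genome. This is only an illustration, not Sybil's code: the function name, the input layout (one cluster ID per gene in chromosomal order, None for genes with no J-COG) and the example values are assumptions, and it ignores gene order in the second genome.

```python
from typing import List, Optional, Tuple

def syntenic_runs(cluster_ids: List[Optional[str]], j: int, k: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) gene indices of candidate syntenic runs.

    A run holds at least j clustered genes and is broken whenever k or more
    consecutive genes without a cluster (None) intervene.
    """
    blocks = []
    start = None  # index of the first clustered gene in the current run
    last = None   # index of the most recent clustered gene
    count = 0     # clustered genes seen in the current run
    for i, cid in enumerate(cluster_ids):
        if cid is None:
            continue
        if last is not None and i - last - 1 >= k:
            # Too many non-homologous genes in between: close the run.
            if count >= j:
                blocks.append((start, last))
            start, count = None, 0
        if start is None:
            start = i
        last = i
        count += 1
    if start is not None and count >= j:
        blocks.append((start, last))
    return blocks

# Hypothetical gene order: cluster IDs are made up for illustration.
genes = ["c1", "c2", None, "c3", None, None, "c4", "c5", "c6"]
runs = syntenic_runs(genes, j=3, k=2)
```

With J=3 and K=2, the single unclustered gene at index 2 is tolerated, while the two consecutive unclustered genes at indices 4 and 5 split the genes into two runs.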
How Sybil Computes are Performed
• Blast
• Position effect (conserved gene order)
• MUMmer
– SNPs
• PROmer
• Gene families
– COGs
– Paralogs
[Diagram components: BSML Dumper, blastp, PE, COG, MUMmer (SNPs), PROmer, BSML Loader]
Primary output: BSML-XML
Data Prep For Comparative Analysis
[Diagram components: BSML Dumper, blastp, PE, COG, MUMmer (SNPs), PROmer, BSML Loader]
Data sources and formats shown: Chado, GenBank files, EMBL files, custom files, BSML, GFF3
GMOD Consortium
Jaccard Clustering
Jaccard-filtered Orthologs
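One common way to read "Jaccard filtering" is sketched below: a match between two proteins is kept only if the proteins share enough of their match partners, and clusters are then the connected components (transitive closure) of the surviving links. The function name, the cutoff value and the protein IDs are assumptions for illustration, not taken from Sybil.

```python
from collections import defaultdict

def jaccard_clusters(matches, cutoff):
    """Cluster proteins by Jaccard-filtered links.

    matches: iterable of (protein_a, protein_b) pairs, e.g. blastp hits.
    cutoff:  minimum Jaccard coefficient for a link to be kept.
    """
    # Neighbor set of each protein, including the protein itself.
    neighbors = defaultdict(set)
    for a, b in matches:
        neighbors[a].update((a, b))
        neighbors[b].update((a, b))

    # Keep a link only if the two neighbor sets overlap enough.
    kept = defaultdict(set)
    for a, b in matches:
        shared = len(neighbors[a] & neighbors[b])
        total = len(neighbors[a] | neighbors[b])
        if shared / total >= cutoff:
            kept[a].add(b)
            kept[b].add(a)

    # Transitive closure: connected components over the kept links.
    seen, clusters = set(), []
    for node in sorted(neighbors):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(kept[current] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Hypothetical matches: a1, a2, a3 match each other; b1 matches only a3.
hits = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"), ("a3", "b1")]
clusters = jaccard_clusters(hits, cutoff=0.6)
```

Here the a3-b1 link has a coefficient of 0.5 (two shared neighbors out of four) and is dropped, so b1 ends up in its own cluster.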
Match Reduction
Fig. 6. Using a minimal spanning tree (MST) algorithm to remove redundant matches. Protein cluster image before (left) and after (right) applying the MST filter.
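The match reduction in the caption can be illustrated with Kruskal's algorithm. This is a generic sketch, not the filter Sybil actually ships: it assumes each match carries a weight where lower means a stronger match, and the protein names and weights are made up.

```python
def mst_filter(edges):
    """Keep only the matches that form a minimum spanning forest.

    edges: iterable of (weight, protein_a, protein_b); lower weight is
    assumed to mean a stronger match.
    """
    parent = {}

    def find(x):
        # Union-find root lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    for weight, a, b in sorted(edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # This match connects two separate groups, so it is not redundant.
            parent[root_a] = root_b
            kept.append((weight, a, b))
    return kept

# Hypothetical cluster of four proteins with one redundant match (p1-p3).
matches = [(1.0, "p1", "p2"), (2.0, "p2", "p3"), (3.0, "p1", "p3"), (1.5, "p3", "p4")]
reduced = mst_filter(matches)
```

The cluster stays connected with three matches, and the weakest link that closed a cycle (p1-p3) is removed.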
Sybil: Chromosomal summaries
Preferences for pop-up displays are user configurable.
Syntenic blocks
Jaccard-filtered COGs
Copyright ©2005 by the National Academy of Sciences
Tettelin, Hervé et al. (2005) Proc. Natl. Acad. Sci. USA 102, 13950-13955
Fig. 1. Whole genome alignment of GBS strains
Fig. 2. GBS core genome
Fig. 3. GBS pan-genome
Other Sybil Features
• Open source.
• sybil.sourceforge.net
• Complete demo database
• Other packages:
– Chado relational database
– BSML XML (Bioinformatic Sequence Markup Language)
– Bioperl (Lincoln Stein's Bio::Graphics package)
– Apache Batik SVG toolkit
– MUMmer suffix-tree alignment tools
Important: To run Sybil
• You must load data into Chado.
• We have flat-file BSML parsers
• To be released as open source.
Ergatis: latest discussion.
• me: then when're we releasing ergatis?
• Sam: so, the plan was that all these scripts would just come bundled with Ergatis
• me: right.
• Sam: we need a deadline
• me: oh. is this on the record? I think I'll just put this chat in my power point for tomorrow.
• Sam: i really don't think there is that much we need to do in order to release it. most of the concerns will be about how a user can install and configure it to point to their installs of all the 3rd party search tools they'd want to use.
Sam Angiuoli Ergatis/Sybil Poster H-82
Swiss-Prot Fortaleza
• Sybil: Comparative Analysis System
• Gemina: Epidemiological Resource
Defining Infection Systems
Pathogen, Host, Disease, Transmission Method, Anatomy, Symptom, Reservoir, Geographic Location
• Pathogen: bacterial and viral pathogens; NIAID Category A, B & C Priority Pathogens
• Host: human and animal
• Transmission method: direct, indirect, mechanical, vector-borne
• Anatomy: animal structure, body region, cardiovascular system, cell, digestive system, endocrine system
• Disease: blood and blood-forming organs diseases, circulatory system diseases, complications of pregnancy, digestive system diseases, genitourinary system diseases
Pathogen | Host | Transmission Method | Anatomy | Disease
Clostridium botulinum C | Bos taurus | indirect: vehicle-borne ingestion | gastrointestinal (GI) tract | Foodborne botulism
Clostridium botulinum F | Homo sapiens | indirect: vehicle-borne ingestion | gastrointestinal (GI) tract | Infant botulism
Clostridium botulinum B | Homo sapiens | direct: contact | skin | Wound botulism
Clostridium botulinum | Homo sapiens | indirect: airborne | respiratory tract | Botulism
Infection Systems distinguish modes of transmission, hosts, disease
Pathogen | Host | Transmission Method | Anatomy | Disease
Mycobacterium tuberculosis | Homo sapiens | direct: droplet spread | respiratory tract | Tuberculosis
Mycobacterium tuberculosis | Homo sapiens | indirect: airborne | respiratory tract | Tuberculosis
Mycobacterium tuberculosis | Homo sapiens | | brain | Meningitis
Mycobacterium tuberculosis | Pan troglodytes | indirect: airborne | lymph nodes | Tuberculosis
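As a rough illustration of the table above, each Infection System can be treated as one record with a field per column. The class name, field names and the select helper are assumptions for this sketch and say nothing about how Gemina stores its data; the four rows are the tuberculosis examples from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class InfectionSystem:
    pathogen: str
    host: str
    transmission: Optional[str]  # None where the slide gives no value
    anatomy: str
    disease: str

SYSTEMS = [
    InfectionSystem("Mycobacterium tuberculosis", "Homo sapiens",
                    "direct: droplet spread", "respiratory tract", "Tuberculosis"),
    InfectionSystem("Mycobacterium tuberculosis", "Homo sapiens",
                    "indirect: airborne", "respiratory tract", "Tuberculosis"),
    InfectionSystem("Mycobacterium tuberculosis", "Homo sapiens",
                    None, "brain", "Meningitis"),
    InfectionSystem("Mycobacterium tuberculosis", "Pan troglodytes",
                    "indirect: airborne", "lymph nodes", "Tuberculosis"),
]

def select(systems: List[InfectionSystem], **criteria: str) -> List[InfectionSystem]:
    """Return the systems whose fields equal every given criterion."""
    return [s for s in systems
            if all(getattr(s, field) == value for field, value in criteria.items())]

human_systems = select(SYSTEMS, host="Homo sapiens")
airborne_tb = select(SYSTEMS, transmission="indirect: airborne", disease="Tuberculosis")
```

The same pathogen yields several distinct records because host, transmission method, anatomy or disease differ, which is the point the slide title makes.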
• infectious disease and body system oriented
• hierarchical query and retrieval
• Mapping of terms from newly defined threat_systems and MRS terms
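Hierarchical query and retrieval can be pictured as expanding a selected term to all of its descendants before matching records. The sketch below is an assumption-laden illustration, not Gemina's implementation: the nesting of nasopharynx and oropharynx under pharynx is assumed (the transcript does not preserve the tree indentation), and the record IDs are hypothetical.

```python
# Parent -> children links for a small slice of the anatomy ontology.
# The nesting under "pharynx" is an assumption for this sketch.
CHILDREN = {
    "Respiratory_system": ["larynx", "lung", "pharynx"],
    "pharynx": ["nasopharynx", "oropharynx"],
}

def descendants(term, children=CHILDREN):
    """Return the term together with every term below it in the hierarchy."""
    found = {term}
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def query(records, term):
    """Return IDs of records annotated with the term or any descendant."""
    terms = descendants(term)
    return [record_id for record_id, anatomy in records if anatomy in terms]

# Hypothetical (record ID, anatomy term) annotations.
RECORDS = [("system-1", "lung"), ("system-2", "oropharynx"), ("system-3", "brain")]

respiratory_hits = query(RECORDS, "Respiratory_system")
pharynx_hits = query(RECORDS, "pharynx")
```

Querying the broad term returns records annotated with any narrower term, while querying a narrower term returns the smaller subset.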
Ontologies & Controlled Vocabularies in Gemina
disease – anatomy – symptom – transmission method – reservoir – geographic location
+diseases of the respiratory system
+infectious and parasitic diseases
+arthropod-borne viral disease
+intestinal infectious diseases
+other bacterial diseases
+bacterial infection
+gas gangrene
+staphylococcus infection
+tetanus
+animal reservoir
+arthropod
+mollusc
+environmental reservoir
+soil
+food
+human reservoir
+blood
+respiratory tract
[Tree views shown: disease, reservoir]
(1667) (424) (1322) (964) (243) (16)
+Animal_structure
+Body_region
+Cardiovascular_system
+Cell
+Digestive_system
+Embryonic_structure
+Endocrine_system
+Fluids_and_secretions
+Hemic_and_immune_system
+Integumentary_system
+Musculoskeletal_system
+Nervous_system
+Respiratory_system
+Sense_organ
+Stomatognathic_system
+Tissue
+Urogenital_system
Respiratory_system
+ larynx
+ lung
+ pharynx
+ nasopharynx
+ oropharynx
Anatomy Ontology
+Africa
+Americas
+Caribbean
+Central America
+North America
+South America
+Argentina
+Bolivia
+Brazil
+North Region
+Northeast Region
+Rio Grande do Norte
+Sergipe
+Central West Region
+Antarctic Regions
+Arctic Regions
+Asia
+Atlantic Islands
+Europe
+Indian Ocean Islands
+Oceania
+Oceans and Seas
+World Wide
Geographic location
+ Fortaleza
Gemina query page: select topic tabs to add terms to Selection Summary
Scroll down the list of choices or click on Tree view to navigate the hierarchy of terms
Query Anatomy Ontology for terms including ‘tissue’
Identify Infection Systems involving nerve tissue, select, add to Selection Box
Gemina Search Results
View and sort Infection Systems by topic. Unique ID.
Navigate back to the Gemina Query Page
Curated GEMINA Infection Systems (as of July 28th, 2006)
NIAID Category | Pathogen | Number of Infection Systems | Number of Geographic Locations
Total | 22 | 1616 | 3852
A | Bacillus anthracis | 18 | -
A | Clostridium botulinum | 61 | 257
A | Francisella tularensis | 44 | 18
A | Yersinia pestis | 33 | 48
B | Brucella abortus | 3 | -
B | Brucella canis | 7 | -
B | Brucella melitensis | 18 | -
B | Brucella spp. | 11 | -
B | Brucella suis | 15 | -
B | Burkholderia mallei | 55 | 47
B | Burkholderia pseudomallei | 210 | 108
B | Campylobacter jejuni | 42 | 148
B | Clostridium perfringens | 120 | 30
B | Coxiella burnetii | 67 | 69
B | Escherichia coli | 328 | 545
B | Listeria monocytogenes | 96 | 191
B | Rickettsia prowazekii | 10 | 74
B | Salmonella typhimurium | 105 | 89
B | Staphylococcus aureus | 100 | 86
B | Vibrio cholerae | 31 | 178
C | Influenza | 168 | 97
C | Mycobacterium tuberculosis | 74 | 1867
Microbial Rosetta Stone (MRS): a database that relates microorganism names, taxonomic classifications, diseases, and scientific literature for the most important human, animal and plant microbial pathogens, with linkage to public genomic sequence databases.
Applications of Gemina
• Pathogen Identification Applications:
– biodefense, animal health care, food safety, diagnostics, pathology, clinical research, forensics, drug discovery
• Under Open Access.
• Disease/Anatomy/Symptoms
– DNA sequence, genomes
– Physical resources
– Proteomic data
Case Study: Submit queries of multiple terms to view related Infection Systems
Microbial Identification of Clinically Significant Microbes
NIH Clinical Center Collaboration: Dr. Patrick Murray
• Creation of Identification Clinical Reference Set
• Identify unique signature tags to distinguish organisms
• Goal: identify the minimum number of tests (50 bp unique signatures) needed to identify a gram-negative rod bacterium using Pyrosequencing
• Genus-level identification
• Species, Strain-level identification
• Test Set: Clinical Isolates of Gram Negative Rods not reliably identified by biochemical testing: 140 Proteobacteria
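The goal of finding a small set of signature tests can be illustrated with a greedy heuristic: repeatedly pick the signature that separates the most organism pairs that are still indistinguishable. This is only a sketch under stated assumptions. The presence/absence profiles, organism names and signature IDs are hypothetical, the greedy choice does not guarantee a true minimum, and nothing here reflects the project's actual method.

```python
from itertools import combinations

def pick_signatures(profiles):
    """Greedily choose signatures until every organism pair is separated.

    profiles: dict mapping organism name -> set of signature IDs it carries.
    Returns (chosen signatures, pairs that no signature can separate).
    """
    organisms = sorted(profiles)
    unresolved = set(combinations(organisms, 2))
    candidates = sorted(set().union(*profiles.values()))
    chosen = []

    def separated_by(signature):
        # Pairs where exactly one of the two organisms carries the signature.
        return {(a, b) for a, b in unresolved
                if (signature in profiles[a]) != (signature in profiles[b])}

    while unresolved:
        best = max(candidates, key=lambda s: len(separated_by(s)), default=None)
        gained = separated_by(best) if best is not None else set()
        if not gained:
            break  # the remaining pairs cannot be told apart
        chosen.append(best)
        unresolved -= gained
    return chosen, unresolved

# Hypothetical presence/absence data for three organisms.
PROFILES = {
    "org_a": {"s1", "s2"},
    "org_b": {"s1"},
    "org_c": {"s3"},
}
tests, ambiguous = pick_signatures(PROFILES)
```

In this toy case two signatures are enough to give each of the three organisms a distinct presence/absence pattern.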
Case Study 2: Insignia Homeland Security
[Data-flow diagram. Components: Web Interface; Gemina; Annotation extractor; PANDA (DNA sequence); Chado (genome annotation); MRS database schema (pathogens and disease); TAXON_ID]
Data input: NCBI (genomic sequence); TIGR (Infection Systems)
Diagnostics: DNA signatures (Univ. MD)
Flows shown: Epidemiology Data Flow; Sequence Data Flow
Organism Name Sequencing Center
Acidobacteria bacterium Ellin345 DOE Joint Genome Institute
Acinetobacter baumannii Genoscope
Actinobacillus actinomycetemcomitans HK1651 University of Oklahoma
Bacteriovorax marinus Wellcome Trust Sanger Institute
Bordetella avium Wellcome Trust Sanger Institute
Burkholderia cenocepacia J2315 Wellcome Trust Sanger Institute
Chromohalobacter salexigens DOE Joint Genome Institute
Citrobacter rodentium Wellcome Trust Sanger Institute
Clavibacter michiganensis subsp. sepedonicus Wellcome Trust Sanger Institute
Clostridium botulinum A Wellcome Trust Sanger Institute
Clostridium difficile 630 Wellcome Trust Sanger Institute
Erwinia amylovora Wellcome Trust Sanger Institute
Escherichia coli 042 Wellcome Trust Sanger Institute
Escherichia coli E2348/69 Wellcome Trust Sanger Institute
Francisella tularensis subsp. holarctica FSC200 Baylor College of Medicine
Frankia sp. EAN1pec DOE Joint Genome Institute
Geobacillus stearothermophilus 10 University of Oklahoma
Helicobacter mustelae Wellcome Trust Sanger Institute
Lactobacillus brevis DOE Joint Genome Institute
Mannheimia haemolytica PHL213 Baylor College of Medicine
Methylobacterium extorquens AM1 Integrated Genomics
Mycobacterium marinum M Wellcome Trust Sanger Institute
Mycobacterium microti Wellcome Trust Sanger Institute
Neisseria meningitidis FAM18 Wellcome Trust Sanger Institute
DNA datasets outside of GenBank that we have identified and included in PANDA.
Organism Name Sequencing Center
Paenibacillus larvae subsp. larvae Baylor College of Medicine
Proteus mirabilis Wellcome Trust Sanger Institute
Pseudomonas fluorescens SBW25 Wellcome Trust Sanger Institute
Rhizobium leguminosarum bv. viciae 3841 Wellcome Trust Sanger Institute
Rhodobacter capsulatus SB1003 Integrated Genomics
Salmonella bongori 12149 Wellcome Trust Sanger Institute
Salmonella typhimurium DT104 Wellcome Trust Sanger Institute
Salmonella typhimurium SL1344 Wellcome Trust Sanger Institute
Salmonella typhimurium TR7095 Wellcome Trust Sanger Institute
Serratia marcescens subsp. marcescens Db11 Wellcome Trust Sanger Institute
Shewanella baltica DOE Joint Genome Institute
Shigella dysenteriae M131649 Wellcome Trust Sanger Institute
Shigella sonnei 53G Wellcome Trust Sanger Institute
Spiroplasma kunkelii CR2-3x University of Oklahoma
Streptococcus equi Wellcome Trust Sanger Institute
Streptococcus equi subsp. zooepidemicus Wellcome Trust Sanger Institute
Streptococcus pneumoniae 23F Wellcome Trust Sanger Institute
Streptococcus pyogenes Manfredo Wellcome Trust Sanger Institute
Streptococcus suis P1/7 Wellcome Trust Sanger Institute
Streptococcus uberis 0140J Wellcome Trust Sanger Institute
Thermoanaerobacter ethanolicus DOE Joint Genome Institute
Vibrio salmonicida LFI1238 Wellcome Trust Sanger Institute
Wolbachia endosymbiont of Onchocerca volvulus Wellcome Trust Sanger Institute
Wolbachia pipientis Wellcome Trust Sanger Institute
Yersinia enterocolitica (type 0:8) Wellcome Trust Sanger Institute
Ongoing Development
• Creating links to Insignia from Results page
• Enable choice of target and background genomes from Gemina Search Results
• Links to Web resources for each Pathogen
• Community involvement in development of ontologies
• Workshop on Ontology of Diseases: Nov. 6-7, 2006
• Inclusion of additional datasets (Scotland – Disease data)
Sam Angiuoli, Sybil/Ergatis, Poster H-82
Aaron Gussman, Gemina, Poster B-46
Jonathan Crabtree, Sybil Interface