
Swiss-Prot Fortaleza

• Sybil: Comparative Analysis System

• Gemina: Epidemiological Resource

Owen White, July 31st, 2006


ISMB POSTERS

• Sam Angiuoli Ergatis/Sybil Poster H-82

• Aaron Gussman Gemina Poster B-46


Swiss-Prot Fortaleza

• Sybil: Comparative Analysis System

• Gemina: Epidemiological Resource


Sybil searches and computes

• Searches:
  – All-vs-all blastp searches
  – MUMmer: nucleotide/protein SNPs

• Clustering:
  – Evaluation of protein-match networks
  – Scoring system set by user
  – COGs: bidirectional best hits
  – Jaccard COG clustering for transitive closure
  – Also paralogs

• Syntenic blocks:
  – Collections of J-COGs between species
  – Runs of J genes without K non-homologous intervening genes
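The "runs of J genes without K intervening genes" rule can be sketched as below. This is a minimal illustration, not Sybil's implementation: the function name, the list-of-cluster-IDs input, and the reading of K as "a gap of K or more unclustered genes ends the run" are all assumptions, and the sketch ignores gene order in the second genome.

```python
from typing import List, Optional, Tuple

def syntenic_runs(cluster_ids: List[Optional[str]],
                  min_genes: int,
                  max_gap: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) gene indexes of candidate blocks.

    cluster_ids: one entry per gene in chromosome order; a cluster ID if the
                 gene belongs to a cross-species cluster, otherwise None.
    min_genes:   J, the minimum number of clustered genes in a run.
    max_gap:     K, a run ends once K or more unclustered genes intervene.
    """
    runs = []
    start = None   # index of the first clustered gene in the current run
    last = None    # index of the most recent clustered gene
    count = 0      # clustered genes seen in the current run
    for i, cid in enumerate(cluster_ids):
        if cid is None:
            continue
        if last is not None and (i - last - 1) >= max_gap:
            # The gap is too long: close the current run and start a new one.
            if count >= min_genes:
                runs.append((start, last))
            start, count = None, 0
        if start is None:
            start = i
        last = i
        count += 1
    if start is not None and count >= min_genes:
        runs.append((start, last))
    return runs

genes = ["c1", "c2", None, "c3", None, None, None, "c4", "c5", "c6"]
print(syntenic_runs(genes, min_genes=3, max_gap=2))
```

Raising J or lowering K makes the blocks stricter; with min_genes=4 the toy chromosome above yields no blocks at all.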


How Sybil Computes are Performed

• Blast
• Position effect (conserved gene order)
• MUMmer
  – SNPs
• PROmer
• Gene families
  – COGs
  – Paralogs

[Diagram labels: BSML Dumper, COG, BSML Loader, blastp, PE, SNPs, MUMmer, PROmer]

Primary output: BSML-XML


Data Prep For Comparative Analysis

[Diagram labels: BSML Dumper, COG, BSML Loader, blastp, PE, SNPs, MUMmer, PROmer]

Chado • GenBank files • EMBL files • Custom files • BSML • GFF3

GMOD Consortium


Jaccard Clustering


Jaccard-filtered Orthologs
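A rough sketch of Jaccard filtering followed by transitive closure is given here for orientation. It assumes, without confirmation from the slides, that each protein match is scored by the Jaccard coefficient of the two proteins' match neighbourhoods (each protein counted as its own neighbour), that matches below a user-set cutoff are dropped, and that clusters are the connected components of what remains. All names are hypothetical.

```python
from collections import defaultdict

def jaccard_clusters(matches, cutoff):
    """Cluster proteins from pairwise matches, keeping only well-supported links.

    matches: iterable of (protein_a, protein_b) pairs, e.g. from blastp.
    cutoff:  minimum Jaccard coefficient for a match to be kept.
    """
    matches = list(matches)
    neighbors = defaultdict(set)
    for a, b in matches:
        neighbors[a].add(b)
        neighbors[b].add(a)
    for node in list(neighbors):
        neighbors[node].add(node)   # closed neighbourhood: include the node itself

    # Keep a match only if the two proteins share enough of their neighbours.
    kept = defaultdict(set)
    for a, b in matches:
        shared = len(neighbors[a] & neighbors[b])
        total = len(neighbors[a] | neighbors[b])
        if shared / total >= cutoff:
            kept[a].add(b)
            kept[b].add(a)

    # Transitive closure: connected components of the filtered graph.
    seen, clusters = set(), []
    for node in sorted(neighbors):
        if node in seen:
            continue
        component, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(kept[current] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Two tight triangles joined by one weak bridging match (p3-p4).
toy = [("p1", "p2"), ("p2", "p3"), ("p1", "p3"), ("p3", "p4"),
       ("p4", "p5"), ("p4", "p6"), ("p5", "p6")]
print(jaccard_clusters(toy, cutoff=0.6))
```

In the toy graph the bridging match has a coefficient of 2/6, so a cutoff of 0.6 splits the six proteins into two clusters, while a cutoff of 0.3 merges them into one.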


Match Reduction

Fig. 6. Using a minimal spanning tree (MST) algorithm to remove redundant matches. Protein cluster image before (left) and after (right) applying the MST filter.
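The MST filter in the caption can be illustrated with Kruskal's algorithm. This is a sketch under stated assumptions, not the code behind the figure: edge weights are taken to be distances (lower means a better match), and the function and variable names are made up for the example.

```python
def mst_filter(edges):
    """Keep only the matches that form a minimum spanning forest.

    edges: iterable of (protein_a, protein_b, weight); a lower weight is
           assumed to mean a better match.
    """
    parent = {}

    def find(x):
        # Union-find root lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    for a, b, weight in sorted(edges, key=lambda e: e[2]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # This match links two groups that were not yet connected.
            parent[root_a] = root_b
            kept.append((a, b, weight))
        # Otherwise the match is redundant: a and b are already connected.
    return kept

cluster = [("a", "b", 0.1), ("b", "c", 0.2), ("a", "c", 0.5),
           ("c", "d", 0.3), ("b", "d", 0.9)]
print(mst_filter(cluster))
```

A cluster of n connected proteins keeps exactly n - 1 matches after filtering, which is what thins out the "before" image without disconnecting any protein.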


Sybil: Chromosomal summaries


Preferences for pop-up displays are user configurable.


Syntenic blocks

Jaccard-filtered COGs


Copyright ©2005 by the National Academy of Sciences

Tettelin, Hervé et al. (2005) Proc. Natl. Acad. Sci. USA 102, 13950-13955

Fig. 1. Whole genome alignment of GBS strains


Copyright ©2005 by the National Academy of Sciences

Tettelin, Hervé et al. (2005) Proc. Natl. Acad. Sci. USA 102, 13950-13955

Fig. 2. GBS core genome


Copyright ©2005 by the National Academy of Sciences

Tettelin, Hervé et al. (2005) Proc. Natl. Acad. Sci. USA 102, 13950-13955

Fig. 3. GBS pan-genome


Other Sybil Features

• Open source.
• sybil.sourceforge.net
• Complete demo database
• Other packages:
  – Chado relational database
  – BSML XML (Bioinformatic Sequence Markup Language)
  – BioPerl (Lincoln Stein's Bio::Graphics package)
  – Apache Batik SVG toolkit
  – MUMmer suffix-tree alignment tools


Important: To run Sybil

• You must load data into Chado.

• We have Flat file BSML parsers

• To be released as open source.


Ergatis: latest discussion.

• me: then when're we releasing ergatis?
• Sam: so, the plan was that all these scripts would just come bundled with Ergatis
• me: right.
• Sam: we need a deadline
• me: oh. is this on the record? I think I'll just put this chat in my power point for tomorrow.
• Sam: i really don't think there is that much we need to do in order to release it. most of the concerns will be about how a user can install and configure it to point to their installs of all the 3rd party search tools they'd want to use.

Sam Angiuoli Ergatis/Sybil Poster H-82


Swiss-Prot Fortaleza

• Sybil: Comparative Analysis System

• Gemina: Epidemiological Resource


Defining Infection Systems

Components: Pathogen, Host, Disease, Transmission Method, Anatomy, Symptom, Reservoir, Geographic Location

• Pathogen: bacterial and viral pathogens; NIAID Category A, B & C Priority Pathogens
• Host: human and animal
• Transmission Method: direct, indirect, mechanical, vector-borne
• Anatomy: animal structure, body region, cardiovascular system, cell, digestive system, endocrine system
• Disease: blood and blood-forming organs diseases, circulatory system diseases, complications of pregnancy, digestive system diseases, genitourinary system diseases


Infection Systems distinguish modes of transmission, hosts, disease

Pathogen                    Host             Transmission Method                Anatomy                      Disease
Clostridium botulinum C     Bos taurus       indirect: vehicle-borne ingestion  gastrointestinal (GI) tract  Foodborne botulism
Clostridium botulinum F     Homo sapiens     indirect: vehicle-borne ingestion  gastrointestinal (GI) tract  Infant botulism
Clostridium botulinum B     Homo sapiens     direct: contact                    skin                         Wound botulism
Clostridium botulinum       Homo sapiens     indirect: airborne                 respiratory tract            Botulism
Mycobacterium tuberculosis  Homo sapiens     direct: droplet spread             respiratory tract            Tuberculosis
Mycobacterium tuberculosis  Homo sapiens     indirect: airborne                 respiratory tract            Tuberculosis
Mycobacterium tuberculosis  Homo sapiens     -                                  brain                        Meningitis
Mycobacterium tuberculosis  Pan troglodytes  indirect: airborne                 lymph nodes                  Tuberculosis
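One way to picture an Infection System is as a record with one field per column of the table above. The sketch below is illustrative only: the class, field, and function names are hypothetical, and Gemina's actual storage (Chado and the MRS schema are mentioned elsewhere in the talk) is not shown on the slides.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class InfectionSystem:
    pathogen: str
    host: str
    transmission: Optional[str]   # None where the slide gives no method
    anatomy: str
    disease: str

# Rows copied from the slide.
SYSTEMS: List[InfectionSystem] = [
    InfectionSystem("Clostridium botulinum C", "Bos taurus",
                    "indirect: vehicle-borne ingestion",
                    "gastrointestinal (GI) tract", "Foodborne botulism"),
    InfectionSystem("Clostridium botulinum F", "Homo sapiens",
                    "indirect: vehicle-borne ingestion",
                    "gastrointestinal (GI) tract", "Infant botulism"),
    InfectionSystem("Clostridium botulinum B", "Homo sapiens",
                    "direct: contact", "skin", "Wound botulism"),
    InfectionSystem("Clostridium botulinum", "Homo sapiens",
                    "indirect: airborne", "respiratory tract", "Botulism"),
    InfectionSystem("Mycobacterium tuberculosis", "Homo sapiens",
                    "direct: droplet spread", "respiratory tract", "Tuberculosis"),
    InfectionSystem("Mycobacterium tuberculosis", "Homo sapiens",
                    "indirect: airborne", "respiratory tract", "Tuberculosis"),
    InfectionSystem("Mycobacterium tuberculosis", "Homo sapiens",
                    None, "brain", "Meningitis"),
    InfectionSystem("Mycobacterium tuberculosis", "Pan troglodytes",
                    "indirect: airborne", "lymph nodes", "Tuberculosis"),
]

def find_systems(systems, **criteria):
    """Return the systems whose fields equal every given criterion."""
    return [s for s in systems
            if all(getattr(s, field) == value for field, value in criteria.items())]

for system in find_systems(SYSTEMS, pathogen="Mycobacterium tuberculosis",
                           host="Homo sapiens"):
    print(system.transmission, "|", system.anatomy, "|", system.disease)
```

Because the same pathogen and host can appear in several records, combining criteria (for example host plus anatomy) is what separates one Infection System from another.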


Ontologies & Controlled Vocabularies in Gemina

• infectious disease and body system oriented
• hierarchical query and retrieval
• Mapping of terms from newly defined threat_systems and MRS terms

disease – anatomy – symptom – transmission method – reservoir – geographic location
(1667) (424) (1322) (964) (243) (16)

disease
+diseases of the respiratory system
+infectious and parasitic diseases
  +arthropod-borne viral disease
  +intestinal infectious diseases
  +other bacterial diseases
    +bacterial infection
    +gas gangrene
    +staphylococcus infection
    +tetanus

reservoir
+animal reservoir
  +arthropod
  +mollusc
+environmental reservoir
  +soil
  +food
+human reservoir
  +blood
  +respiratory tract


Anatomy Ontology

+Animal_structure
+Body_region
+Cardiovascular_system
+Cell
+Digestive_system
+Embryonic_structure
+Endocrine_system
+Fluids_and_secretions
+Hemic_and_immune_system
+Integumentary_system
+Musculoskeletal_system
+Nervous_system
+Respiratory_system
+Sense_organ
+Stomatognathic_system
+Tissue
+Urogenital_system

Respiratory_system
  + larynx
  + lung
  + pharynx
    + nasopharynx
    + oropharynx
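Hierarchical query and retrieval over such an ontology can be sketched as a walk from a term down to all of its descendants. The example uses only the Respiratory_system terms listed on the slide; placing nasopharynx and oropharynx under pharynx is an assumption, and the function names are hypothetical.

```python
from typing import Dict, List

# Parent -> children, taken from the Respiratory_system excerpt.
# The pharynx sub-level is assumed, since the slide indentation is not preserved.
CHILDREN: Dict[str, List[str]] = {
    "Respiratory_system": ["larynx", "lung", "pharynx"],
    "pharynx": ["nasopharynx", "oropharynx"],
}

def descendants(term: str, children: Dict[str, List[str]]) -> List[str]:
    """Return every term below `term`, depth-first, in listed order."""
    found = []
    stack = list(reversed(children.get(term, [])))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(reversed(children.get(current, [])))
    return found

def matches(query_term: str, annotated_term: str,
            children: Dict[str, List[str]]) -> bool:
    """True if a record annotated with `annotated_term` satisfies a query
    for `query_term`, i.e. the annotation is the term itself or lies below it."""
    return (annotated_term == query_term
            or annotated_term in descendants(query_term, children))

print(descendants("Respiratory_system", CHILDREN))
print(matches("Respiratory_system", "oropharynx", CHILDREN))
```

With this kind of lookup, a query for a broad term such as Respiratory_system also retrieves records annotated with a narrower term such as lung.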


Geographic location

+Africa
+Americas
  +Caribbean
  +Central America
  +North America
  +South America
    +Argentina
    +Bolivia
    +Brazil
      +North Region
      +Northeast Region
        +Rio Grande do Norte
        +Sergipe
      +Central West Region
+Antarctic Regions
+Arctic Regions
+Asia
+Atlantic Islands
+Europe
+Indian Ocean Islands
+Oceania
+Oceans and Seas
+World Wide

+ Fortaleza


Gemina query page: select topic tabs to add terms to Selection Summary

Scroll down the list of choices or click on Tree view to navigate the hierarchy of terms


Query Anatomy Ontology for terms including ‘tissue’

Identify Infection Systems involving nerve tissue, select, add to Selection Box


Gemina Search Results

View and sort Infection Systems by topic. Unique ID.

Navigate back to the Gemina Query Page


Curated GEMINA Infection Systems (as of July 28th, 2006)

NIAID Category  Pathogen                    Number of Infection Systems  Number of Geographic Locations
Total           22                          1616                         3852
A               Bacillus anthracis          18                           -
A               Clostridium botulinum       61                           257
A               Francisella tularensis      44                           18
A               Yersinia pestis             33                           48
B               Brucella abortus            3                            -
B               Brucella canis              7                            -
B               Brucella melitensis         18                           -
B               Brucella spp.               11                           -
B               Brucella suis               15                           -
B               Burkholderia mallei         55                           47
B               Burkholderia pseudomallei   210                          108
B               Campylobacter jejuni        42                           148
B               Clostridium perfringens     120                          30
B               Coxiella burnetii           67                           69
B               Escherichia coli            328                          545
B               Listeria monocytogenes      96                           191
B               Rickettsia prowazekii       10                           74
B               Salmonella typhimurium      105                          89
B               Staphylococcus aureus       100                          86
B               Vibrio cholerae             31                           178
C               Influenza                   168                          97
C               Mycobacterium tuberculosis  74                           1867


Microbial Rosetta Stone (MRS) is a database that relates microorganism names, taxonomic classifications, diseases, and scientific literature for the most important human, animal and plant microbial pathogens, with linkage to public genomic sequence databases.



Applications of Gemina

• Pathogen Identification Applications:
  – biodefense, animal health care, food safety, diagnostics, pathology, clinical research, forensics, drug discovery
• Under Open Access.
• Disease/Anatomy/Symptoms
  – DNA sequence, genomes
  – Physical resources
  – Proteomic data


Case Study: Submit queries of multiple terms to view related Infection Systems

Microbial Identification of Clinically Significant Microbes

NIH Clinical Center Collaboration: Dr. Patrick Murray

• Creation of Identification Clinical Reference Set
• Identify unique signature tags to distinguish organisms
• Goal: identify the minimum number of tests (50 bp unique signatures) needed to identify a Gram-negative rod bacterium using pyrosequencing
• Genus-level identification
• Species- and strain-level identification
• Test set: clinical isolates of Gram-negative rods not reliably identified by biochemical testing: 140 Proteobacteria
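The idea of a unique signature can be sketched as a k-mer that occurs in a target genome and in none of the background genomes. This is a toy illustration and not the Insignia pipeline: the function names are hypothetical, k is 4 here rather than the 50 bp mentioned above so the example stays readable, and reverse complements are ignored.

```python
from typing import Iterable, Set

def kmers(sequence: str, k: int) -> Set[str]:
    """All distinct substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def unique_signatures(target: str, background: Iterable[str], k: int) -> Set[str]:
    """k-mers present in the target and absent from every background sequence."""
    signatures = kmers(target, k)
    for sequence in background:
        signatures -= kmers(sequence, k)
    return signatures

target = "ACGTACGA"
background = ["ACGTAC", "TTACG"]
print(unique_signatures(target, background, k=4))
```

In the toy data only ACGA survives, because every other 4-mer of the target also occurs in one of the background sequences.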


Case Study 2: Insignia Homeland Security

[Diagram labels]
• Web Interface
• Gemina
• Annotation extractor
• PANDA: DNA sequence
• Chado: genome annotation
• MRS database schema: pathogens and disease
• Data input: NCBI (genomic sequence), TIGR (Infection Systems)
• TAXON_ID
• Diagnostics: DNA signatures
• Univ. MD
• Epidemiology data flow
• Sequence data flow


Organism Name                                     Sequencing Center
Acidobacteria bacterium Ellin345                  DOE Joint Genome Institute
Acinetobacter baumannii                           Genoscope
Actinobacillus actinomycetemcomitans HK1651       University of Oklahoma
Bacteriovorax marinus                             Wellcome Trust Sanger Institute
Bordetella avium                                  Wellcome Trust Sanger Institute
Burkholderia cenocepacia J2315                    Wellcome Trust Sanger Institute
Chromohalobacter salexigens                       DOE Joint Genome Institute
Citrobacter rodentium                             Wellcome Trust Sanger Institute
Clavibacter michiganensis subsp. sepedonicus      Wellcome Trust Sanger Institute
Clostridium botulinum A                           Wellcome Trust Sanger Institute
Clostridium difficile 630                         Wellcome Trust Sanger Institute
Erwinia amylovora                                 Wellcome Trust Sanger Institute
Escherichia coli 042                              Wellcome Trust Sanger Institute
Escherichia coli E2348/69                         Wellcome Trust Sanger Institute
Francisella tularensis subsp. holarctica FSC200   Baylor College of Medicine
Frankia sp. EAN1pec                               DOE Joint Genome Institute
Geobacillus stearothermophilus 10                 University of Oklahoma
Helicobacter mustelae                             Wellcome Trust Sanger Institute
Lactobacillus brevis                              DOE Joint Genome Institute
Mannheimia haemolytica PHL213                     Baylor College of Medicine
Methylobacterium extorquens AM1                   Integrated Genomics
Mycobacterium marinum M                           Wellcome Trust Sanger Institute
Mycobacterium microti                             Wellcome Trust Sanger Institute
Neisseria meningitidis FAM18                      Wellcome Trust Sanger Institute

DNA datasets outside of GenBank that we have identified and included in PANDA.


Organism Name                                     Sequencing Center
Paenibacillus larvae subsp. larvae                Baylor College of Medicine
Proteus mirabilis                                 Wellcome Trust Sanger Institute
Pseudomonas fluorescens SBW25                     Wellcome Trust Sanger Institute
Rhizobium leguminosarum bv. viciae 3841           Wellcome Trust Sanger Institute
Rhodobacter capsulatus SB1003                     Integrated Genomics
Salmonella bongori 12149                          Wellcome Trust Sanger Institute
Salmonella typhimurium DT104                      Wellcome Trust Sanger Institute
Salmonella typhimurium SL1344                     Wellcome Trust Sanger Institute
Salmonella typhimurium TR7095                     Wellcome Trust Sanger Institute
Serratia marcescens subsp. marcescens Db11        Wellcome Trust Sanger Institute
Shewanella baltica                                DOE Joint Genome Institute
Shigella dysenteriae M131649                      Wellcome Trust Sanger Institute
Shigella sonnei 53G                               Wellcome Trust Sanger Institute
Spiroplasma kunkelii CR2-3x                       University of Oklahoma
Streptococcus equi                                Wellcome Trust Sanger Institute
Streptococcus equi subsp. zooepidemicus           Wellcome Trust Sanger Institute
Streptococcus pneumoniae 23F                      Wellcome Trust Sanger Institute
Streptococcus pyogenes Manfredo                   Wellcome Trust Sanger Institute
Streptococcus suis P1/7                           Wellcome Trust Sanger Institute
Streptococcus uberis 0140J                        Wellcome Trust Sanger Institute
Thermoanaerobacter ethanolicus                    DOE Joint Genome Institute
Vibrio salmonicida LFI1238                        Wellcome Trust Sanger Institute
Wolbachia endosymbiont of Onchocerca volvulus     Wellcome Trust Sanger Institute
Wolbachia pipientis                               Wellcome Trust Sanger Institute
Yersinia enterocolitica (type 0:8)                Wellcome Trust Sanger Institute


Ongoing Development

• Creating links to Insignia from Results page
• Enable choice of target and background genomes from Gemina Search Results
• Links to Web resources for each pathogen
• Community involvement in development of ontologies
• Workshop on Ontology of Diseases: Nov. 6-7, 2006
• Inclusion of additional datasets (Scotland – Disease data)


Sam Angiuoli, Sybil/Ergatis, Poster H-82

Aaron Gussman, Gemina, Poster B-46

Jonathan Crabtree, Sybil Interface