The Integrated Microbial Genome (IMG) systems
Nikos Kyrpides
OMICS GROUP
Ken Chu
Krishna Palaniappan
Ernest Szeto
Yuri Grechkin
Amy Chen
Victor Markowitz
Biju Jacob
ANNOTATION GROUP
STANDARDS GROUP
Kostas
Marcel Peter Billis
Natalia Dino
Amrita Denis Iain Bahador Ioanna Reddy
Science driven data generation and analysis
Science Goals
ANALYSIS
User Facility
Data Integration
Comparative Analysis
Data analysis
Data management system for comparative analysis of biological data
IMG
• Genes
• Genomes
• Functions
• Metadata
• Clusters
• SNPs
• Proteomics
• Regulons
• Transcriptomes
What is the Matrix?
IMG’s Mission
Become the HOME of Microbial Genomes and Metagenomes
• support comparative genome analysis
• support community functional annotation
• provide a user friendly interface
What is IMG: IMG is a data management system for comparative analysis and annotation of all publicly available genomes from three domains of life in a uniquely integrated context.
Mission: To become the Home of Microbial Genome and Metagenome Analysis
Background: Launched in March 2005; 3 releases/year, 20 releases so far; >5,000 unique visitors per month; >350 citations
Current Status: 6891 Genomes, 11.6 Million Genes
Bacteria: 2780, Archaea: 107, Eukarya: 121, Plasmids: 1186, Viruses: 2697
• http://img.jgi.doe.gov/
USERS CAN
• Search data
• Browse data
• Compare data
• Export data
Integrated Microbial Genomes (IMG)
[It’s easier to analyze 1000 genomes than a single one]
http://img.jgi.doe.gov/
Why more data are needed: faster and more accurate function prediction
• Ribokinase family
• Fructokinase family
• 2-dehydro-3-deoxy glucokinase family
Metagenomic Analysis
[Figure: metagenomes placed along a species complexity axis (1, 10, 100, 1000, 1000s, 10000): Acid Mine Drainage, Sargasso Sea, Soil, Human Gut, Termite Hindgut; label: Binning]
The road to success in Metagenomics is through Microbial Genomics
Source: Susannah Tringe, JGI
Availability of Reference Genomes
[Figure: share of reference genomes available per environment (scale: 100%, 60%, 50%, 40%, 20%, 1%): Acid Mine Drainage, Human gut, Soil, Termite Gut, Marine]
Data Model Abstraction Example:
IMG Operations
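Genes
Functions/Pathways
Genomes
• Gene occurrence profile across genomes
• Gene occurrence profiles across pathways
• Pathways shared by genomes
Genes present in G1 and absent from G2, G3, G4 and G5
[Figure: occurrence matrix of genes g1, g2, g3 (rows) across genomes G1 to G5 (columns), + for present and - for absent; cell values in extraction order: + + + + + / + + - + + / + - - - -]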
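The query "genes present in G1 and absent from G2, G3, G4 and G5" can be read as a filter over a gene-by-genome occurrence table. The following is a minimal sketch of that idea, not IMG's implementation: the function names are hypothetical, and the assignment of the three +/- patterns to g1, g2 and g3 is an assumption made for illustration.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical toy data: the set of genomes in which each gene occurs.
# Which pattern belongs to which gene is an assumption, not taken from the slide.
OCCURRENCE: Dict[str, Set[str]] = {
    "g1": {"G1", "G2", "G3", "G4", "G5"},
    "g2": {"G1", "G2", "G4", "G5"},
    "g3": {"G1"},
}

GENOMES = ["G1", "G2", "G3", "G4", "G5"]


def occurrence_profile(occurrence: Dict[str, Set[str]], gene: str,
                       genomes: Iterable[str]) -> str:
    """Render one gene's presence/absence across genomes as +/- symbols."""
    return " ".join("+" if g in occurrence[gene] else "-" for g in genomes)


def genes_present_absent(occurrence: Dict[str, Set[str]],
                         present_in: Iterable[str],
                         absent_from: Iterable[str]) -> List[str]:
    """Return genes found in every 'present_in' genome and in no 'absent_from' genome."""
    required = set(present_in)
    excluded = set(absent_from)
    return sorted(
        gene for gene, genomes in occurrence.items()
        if required <= genomes and not (excluded & genomes)
    )


for gene in sorted(OCCURRENCE):
    print(gene, occurrence_profile(OCCURRENCE, gene, GENOMES))

hits = genes_present_absent(OCCURRENCE, ["G1"], ["G2", "G3", "G4", "G5"])
print("present in G1 only:", hits)
```

Storing each gene's genomes as a set keeps both the profile display and the presence/absence filter to a few set operations.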
IMG Data Integration
Genomes
Groupings
• Phylogenetic
• Phenotypic
• Ecotypic
• Disease
• Geographical
• Isolation
Genes
• RNAs, Proteins
• Sequence Clusters
• Positional clusters
• Regulatory clusters
• Fusions
• Operons
• Expression
Functions
• COG
• GO
• Pfam
• TIGRfam
• InterPro
• KEGG
• BioCyc
• SEED
• Protein product
• MyIMG
• IMG Terms
• IMG Pathways
• IMG Networks
6891
11.6M
1.1M
IMG Toolkit
• Chromosome Map
• Function Profile
• Gene Synteny
• Abundance Profiles
• Functional Categories
• Projects Map
• IMG Pathway Profile
• Metadata Search
• Phylogenetic Profile
• Genome Clustering
• Compare Annotations
• KEGG Maps
• Phylogenetic Distribution
• Chromosomal Map
• Artemis
• VISTA
• Recruitment Plot
• Fragment Recruitment
WRITE PAPER
USERS CAN
• Search data
• Browse data
• Compare data
• Export data
USERS CAN
• Submit data
• Annotate data
APRIL 2011
Users: 1370
Submissions: 2626
Private Genes: 188 M
UNIQUE VISITS: ~5,000 / month
Informatics Steps & Services: support of a new user community
• NEW PROJECT
• SEQUENCING
• ASSEMBLY
• ANNOTATION
• DATA RELEASE
• INTEGRATION & COMPARATIVE ANALYSIS
METADATA
EXTERNAL PROJECTS: 2005 IMG, 2008 IMG-ER, 2012 ASSEMBLY
Data Challenges & Opportunities
Data Analysis
Integration
• Metadata
• Gene calling
• Annotation
• Quantity
• Quality
• Number of Genes
• All vs all Blast
• Number of Datasets
• How do we navigate through a sea of data
Challenges we face
• DATA SIZE
• DATA QUALITY
• DATA STANDARDS
Challenges we face
1. DATA SIZE
• Number of Genes
• Number of Datasets
a. How do we compare data
b. How do we find data
c. How do we navigate through data
Use clusters
Metagenome vs. Reference genomes
Metagenome vs. Metagenome
Clusters
• Common/unique genes
• Rapid identification of best hit(s)
• ….
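One way to read "use clusters" for finding common and unique genes: if every gene has already been assigned to a sequence cluster, two datasets can be compared through their sets of cluster IDs rather than through pairwise alignments. The sketch below assumes such an assignment exists; the cluster IDs, gene names and function name are hypothetical, and the slides do not describe how IMG builds or stores its clusters.

```python
from typing import Dict, Set, Tuple


def compare_by_clusters(a: Dict[str, str],
                        b: Dict[str, str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Compare two datasets given gene -> cluster ID mappings.

    Returns (clusters in both, clusters only in a, clusters only in b).
    """
    clusters_a = set(a.values())
    clusters_b = set(b.values())
    return clusters_a & clusters_b, clusters_a - clusters_b, clusters_b - clusters_a


# Hypothetical toy data: each gene has been pre-assigned to one cluster.
metagenome_a = {"a1": "C1", "a2": "C2", "a3": "C2"}
metagenome_b = {"b1": "C2", "b2": "C3"}

common, only_a, only_b = compare_by_clusters(metagenome_a, metagenome_b)
print("common:", sorted(common))
print("unique to A:", sorted(only_a))
print("unique to B:", sorted(only_b))
```

The comparison cost here depends on the number of clusters, not on the number of gene pairs, which is the data-reduction effect the slide points to.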
2. Computation of similarities
ii. Method dev for data reduction & comparison
- Computation of Similarities
SCALING: Computation of Similarities
IMG
• OLD: BLAST, ~30 days for 8 Million Genes
• NEW: CLUSTERS, ~3 days for 8 Million Genes
IMG/M
• OLD: BLAST, Not Possible
• NEW: CLUSTERS, ~10 days for 80 Million Genes
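A back-of-the-envelope calculation suggests why all-vs-all BLAST is listed as "Not Possible" at 80 million genes. The sketch below uses only the run times quoted above and one assumption that is not from the slides: that all-vs-all cost grows in proportion to the number of gene pairs. The variable and function names are illustrative.

```python
def all_vs_all_pairs(n_genes: int) -> int:
    """Number of unordered gene pairs in an all-vs-all comparison."""
    return n_genes * (n_genes - 1) // 2


IMG_GENES = 8_000_000      # from the slide
IMGM_GENES = 80_000_000    # from the slide
BLAST_DAYS_IMG = 30        # from the slide (approximate)
CLUSTER_DAYS_IMG = 3       # from the slide (approximate)

# Observed speedup on the same 8 million genes.
speedup = BLAST_DAYS_IMG / CLUSTER_DAYS_IMG

# Assumption: BLAST run time scales with the number of gene pairs.
pair_ratio = all_vs_all_pairs(IMGM_GENES) / all_vs_all_pairs(IMG_GENES)
projected_blast_days = BLAST_DAYS_IMG * pair_ratio

print(f"cluster speedup at 8M genes: {speedup:.0f}x")
print(f"pair count grows by about {pair_ratio:.0f}x from 8M to 80M genes")
print(f"projected all-vs-all BLAST at 80M genes: about {projected_blast_days / 365:.1f} years")
```

Under that assumption a tenfold increase in genes means roughly a hundredfold increase in pairs, so the 30-day run would stretch to several years, while the cluster-based run is reported at about 10 days.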
Strain / species diversity
Prochlorococcus marinus Pangenome 10
Listeria monocytogenes Pangenome
17
15
Staphylococcus aureus Pangenome
Pangenomes
We need better ways to
• represent and browse through thousands of genomes
• represent an organism
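As an illustration of representing an organism by a pangenome instead of a single strain, the sketch below uses the common definition: the pangenome is the union of gene families across strains, the core is the set shared by all strains, and the accessory set is everything else. The slides do not say how IMG computes its pangenomes, so this definition, the strain names and the family IDs are all assumptions.

```python
from typing import Dict, Set


def pangenome(strains: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split the gene families of several strains into pan, core and accessory sets."""
    gene_sets = list(strains.values())
    pan = set().union(*gene_sets)
    # set.intersection needs at least one argument, so guard the empty case.
    core = set.intersection(*gene_sets) if gene_sets else set()
    return {"pan": pan, "core": core, "accessory": pan - core}


# Hypothetical toy data: gene families observed in three strains of one species.
strains = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2"},
    "strain_C": {"fam1", "fam4"},
}

result = pangenome(strains)
for name in ("pan", "core", "accessory"):
    print(name, sorted(result[name]))
```

A single reference strain would miss the accessory families carried only by other strains, which matters when metagenome reads are matched against a species.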
Metagenome Analysis with Pangenomes
Reference Genome
Best Blast Hit
Pangenome
Challenges we face
2. DATA QUALITY
a. Did we generate enough data to support biological conclusions?
b. Did we introduce any biases during sequencing?
c. Is the quality of assembly comparable between different datasets?
d. Is the quality of predicted genes comparable between different datasets?
e. Is the quality of functional annotation comparable between different datasets?
Microbial Genomes: Gene Prediction Quality Assurance
Gene Prediction Improvement Pipeline
GenePRIMP is a pipeline that consists of a series of computational units that identify erroneous gene calls and missed genes and correct a subset of the identified defective features.
APPLICATIONS
• Identify gene prediction anomalies
• Benchmark the quality of gene prediction algorithms
• Benchmark the quality of combination / coverage of sequencing platforms
• Improve the sequence quality
Pati A. et al. (2010) Nature Methods
GenePRIMP: http://geneprimp.jgi-psf.org
Natalia
Amrita
Challenges we face
3. DATA STANDARDS
a. Assembly
b. Gene Finding
c. Functional Annotation
d. Metadata
Project Catalog & Metadata
Genomes OnLine Database
D. Liolios, I. Pagani
COMPUTATIONS
M5: Pilot Project with ANL
Building a roadmap for a scalable and sustainable computing MetaInfrastructure for the metagenomics community
innovation through collaboration
GSC, CAMERA, JGI, ANL
• develop standards to share and process data more effectively
• run data-intensive workflows once (reduce wasted cycles)
• Develop a single QC data processing pipeline
• Develop a single data submission entry
• Develop a single data processing pipeline
• Develop a common project catalog
Standards in Genomic Sciences
http://standardsingenomics.org
Ongoing Developments
New Data & Tools for Visualization & Analysis of
• Integration of Expression data
• Integration of Regulatory Data
• Resequencing data (strain variation)
• Pangenomes
Data Processing
• Short Read annotation
• Bypass the all vs all Blast bottleneck