CAMERA Presentation at KNAW ICoMM Colloquium, May 2008

CAMERA: A Metagenomics Resource for Microbial Ecology. Saul A. Kravitz, J. Craig Venter Institute, Rockville, Maryland USA. KNAW Colloquium, May 29, 2008

DESCRIPTION

CAMERA Presentation by Saul Kravitz at KNAW ICoMM Colloquium May 2008 in Amsterdam, Netherlands. See http://camera.calit2.net

TRANSCRIPT

Page 1: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA: A Metagenomics Resource for Microbial Ecology

Saul A. Kravitz
J. Craig Venter Institute
Rockville, Maryland USA

KNAW Colloquium
May 29, 2008

Page 2: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Goals

• Introduce you to CAMERA

• Encourage you to use CAMERA

• What can CAMERA do for you?

Page 3: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Presentation Outline

• Introduction to Metagenomics

• Global Ocean Sampling (GOS) Expedition

• CAMERA Capabilities and Features
  - Compute Resources
  - Data Resources
  - Tools Resources

• Looking Forward

Page 4: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Metagenomic Questions

• Within an environment
  - What biological functions are present (absent)?
  - What organisms are present (absent)?
• Compare data from (dis)similar environments
  - What are the fundamental rules of microbial ecology?
• Adapting to environmental conditions?
  - How?
  - Evidence and mechanisms for lateral transfer
• Search for novel proteins and protein families
  - And diversity within known families

Page 5: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Genomics vs Metagenomics

• Genomics – ‘Old School’
  - Study of a single organism's genome
  - Genome sequence determined using shotgun sequencing and assembly
  - >1300 microbes sequenced, first in 1995
  - DNA usually obtained from pure cultures (<1%)
• Metagenomics
  - Application of genome sequencing methods to environmental samples (no culturing)
  - Environmental shotgun sequencing is the most widely used approach
  - Environmental metadata provides key context

Page 6: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Complexity of Microbial Communities

• Simple (e.g., acid mine drainage (AMD), gutless worm)
  - Few species present (<10)
  - Diverse
  - Variations on standard genomics techniques
• Complex (e.g., Soil or Marine)
  - Many species present (>10, often >1000)
  - Many closely related
  - New techniques

Page 7: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Global Ocean Sampling Expedition

Page 8: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Global Ocean Sampling (GOS)

• 178 Total Sampling Locations
  - Phase 1: 7.7M reads, >6M proteins (3/07)
  - Phase 2-IO: 2.2M reads (3/08)
  - Phase 2: ~10M reads (future)
• Diverse Environments
  - Open ocean, estuary, embayment, upwelling, fringing reef, atoll…

[Figure labels: 4/04, 3/07, 3/08]

Page 9: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

GOS: Sequence Diversity in the Ocean
Rusch et al (PLoS 2007)

• Most sequence reads are unique
  - Very limited assembly
  - Most sequences not taxonomically anchored
  - Relating shotgun data to reference genomes
  - Annotation challenging
• New Techniques Needed
  - Fragment Recruitment
  - Extreme Assembly to find pan genomes
  - Sample to Sample Comparisons

Page 10: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Comparison of Dominant Ribotypes

Page 11: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Comparison of Total Genomic Content

Page 12: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

GOS Protein Analysis
Yooseph et al (PLoS 2007)

• Novel clustering process
  - Sequence similarity based
  - Predict proteins and group into related clusters
  - Include GOS and all known proteins
• Findings
  - GOS proteins cover ~all existing prokaryotic families
  - GOS proteins expand the diversity of known protein families
  - ~10% of large clusters are novel; many are of viral origin
  - No saturation in the rate of novel protein family discovery

Page 13: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Added Protein Family Diversity: Rubisco homologs
Yooseph et al (PLoS 2007)

[Figure legend: New Groups, GOS prokaryotes, Known eukaryotes, Known prokaryotes]

Page 14: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

GOS Viral Analysis (Williamson et al, PLoS ONE 2008)

• Study of dsDNA viruses from shotgun data
  - 155k viral proteins identified from 37 GOS I sites (~2.5%)
  - 59% of viral sequences were bacteriophage
• Viral acquisition and retention of host metabolic genes is common and widespread
  - Viruses have made these genes “their own”
  - Clade tightly with viral genes
• Codistribution of P-SSM4-like cyanophage and the dominant ecotype of Prochlorococcus in GOS samples

Page 15: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Viral acquisition of host genes: talC Gene

[Figure legend: GOS Viral, Public Viral, GOS Bacterial, Public Bacterial, Public Euk]

Page 16: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Reference Genomes

• Overview
  - 150+ reference marine microbes (101 released)
  - Scaffold for GOS
  - Sequenced, assembled, autoannotated
• Isolation Metadata
  - Incomplete
• Bottlenecks
  - Availability of DNA
  - Purity of DNA
• Status and Data
  - https://research.venterinstitute.org/moore/

Page 17: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Motivations for CAMERA

• Significant investment in sequencing
  - Only accessible to bioinformatics elite
  - Diversity of user sophistication and needs
• Bioinformatics and Computation Challenges
  - Assembly, annotation, comparative analysis, visualization
  - Dedicated compute resources
• Importance of Metadata
  - Metadata required for environmental analysis
  - Need to drive standards
• Compliance with Convention on Biological Diversity

Page 18: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Convention on Biological Diversity

• Sample in territorial waters?
  - Country granted certain rights by CBD
  - Sampling agreements may contain restrictions

• CAMERA users must acknowledge potential restrictions on commercial data use

• CAMERA maintains mapping of country-of-origin for all data objects

Page 19: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA – http://camera.calit2.net

• “Convenient acronym for cumbersome name…” (Henry Nichols, PLoS Biology)

• Mission
  - Enable Research in Marine Microbiology

• Debuted March 2007

[email protected]

Page 20: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Capabilities

• Compute Resources
  - 512 node compute grid + 200 TB storage
• Data and Metadata Resources
  - Annotated metagenomic and genomic data
• Tools Resources
  - Scalable BLAST
  - Fragment Recruitment
  - Metagenomic Annotation
  - Text Search

Page 21: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

512 Processors ~5 Teraflops

~ 200 Terabytes Storage

CAMERA Compute and Storage Complex at UCSD/Calit2

Source: Larry Smarr, Calit2

Page 22: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Metagenomic Data Volume

by Project

Page 23: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Metagenomic Samples

Page 24: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Users

>2000 Registered Since March 2007

Page 25: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Data Collections

• Metagenomic Sequence Collection
  - Reads and assemblies w/associated metadata
  - CAMERA-computed annotation
• Protein Clusters
  - Maintaining clusters from Yooseph et al (Yooseph and Li, ’08)
• Genomic Data
  - Viral, Fungal, pico-Eukaryotes, Microbial
  - Moore Marine Genomes with Metadata
• Non-redundant Sequence Collection
  - GenBank, RefSeq, UniProt/Swiss-Prot, PDB, etc.

Page 26: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Standardizing Contextual Metadata

• Genomic Standards Consortium
  - Led by Dawn Field, NIEeS
  - Members from EU, UK, US
• Goals are to promote
  - Standardization of genomic descriptions
  - Exchange & Integration of genomic data
• Metadata standardization key enabler
  - MIMS: Min Info for Metagenomic Sample
  - GCDML: Standard format

Page 27: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Contextual Metadata Challenges

• Researchers Need to Collect and Submit

• Relevant metadata depends on study – MIMS
  - Specification of minimum metadata

• Standardize Exchange Format – GCDML
  - Comprehensive and extensible
  - Leverages Existing Ontologies, Validatable
  - And… easy for a scientist to use

• Need ongoing software support for tools

Page 28: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Core Metadata by Project

• De facto Core
  - Latitude and Longitude
  - Collection date
  - Habitat and Geographic Location

• Missing metadata =
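The de facto core listed on this slide lends itself to a simple completeness check before a sample is submitted. The sketch below is illustrative only: the field names (`latitude`, `collection_date`, and so on) and the dict-based sample record are assumptions made for the example, not CAMERA's actual schema.

```python
# Hypothetical field names for the de facto core metadata; CAMERA's real
# schema is not given in the slides.
CORE_FIELDS = (
    "latitude",
    "longitude",
    "collection_date",
    "habitat",
    "geographic_location",
)


def missing_core_fields(sample):
    """Return the core fields that are absent or empty in a sample record (a dict)."""
    return [field for field in CORE_FIELDS if sample.get(field) in (None, "")]


def coordinates_valid(sample):
    """Check that latitude and longitude are numeric and within range."""
    try:
        lat = float(sample["latitude"])
        lon = float(sample["longitude"])
    except (KeyError, TypeError, ValueError):
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
```

A record that lacks, say, a collection date would be reported by `missing_core_fields` and could be flagged per project in a summary like the one on this slide.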

Page 29: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Contextual Metadata

Page 30: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA 1.3

http://camera.calit2.net

Page 31: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Scalable BLAST with Metadata

• Large searches permitted and encouraged

• 454 FLX run vs “All Metagenomic”

• Some larger tblastx jobs have run >20 hrs

• 10kbp BLASTN vs All Metagenomic – 1 min

• BLAST XML or Tabular Export

• Searches against NRAA
  - BLAST XML output feeds MEGAN

• Searches against ‘All Metagenomic’
  - GUI with metadata
  - Tabular with metadata
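As a rough illustration of what "tabular with metadata" makes possible, the sketch below joins BLAST tabular hits to sample metadata and counts hits per habitat. It assumes the standard 12-column BLAST tabular layout; the read-to-sample mapping, the metadata dictionaries, and all identifiers are hypothetical stand-ins, since the slides do not describe CAMERA's export format in detail.

```python
import csv
import io
from collections import Counter

# Standard 12-column BLAST tabular layout (assumed for the export).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast_tabular(text):
    """Parse tab-delimited BLAST hits into dicts, skipping '#' comment lines."""
    reader = csv.DictReader(
        io.StringIO(text),
        fieldnames=BLAST_COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    return [row for row in reader if not row["qseqid"].startswith("#")]


def attach_metadata(hits, read_to_sample, sample_metadata):
    """Add the sample id and that sample's metadata to each hit.

    read_to_sample and sample_metadata are hypothetical lookup tables
    keyed by subject read id and sample id respectively.
    """
    annotated = []
    for hit in hits:
        sample_id = read_to_sample.get(hit["sseqid"])
        meta = sample_metadata.get(sample_id, {})
        annotated.append({**hit, "sample_id": sample_id, **meta})
    return annotated


def hits_per_habitat(annotated):
    """Count hits by habitat; hits without metadata fall under 'unknown'."""
    return Counter(row.get("habitat", "unknown") for row in annotated)
```

Keeping the join separate from the parsing means the same hit list can be regrouped by any other metadata field (collection date, location) without re-reading the BLAST output.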

Page 32: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Scalable BLAST with Metadata

Page 33: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Integration of Metadata and Data

Page 34: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Browsing Large Data Collections: Fragment Recruitment Viewer

• Microbial Communities vs Reference Genomes
  - Millions of sequence reads vs thousands of genomes
• Definition: a read is recruited to a sequence if:
  - End-to-end BLASTN alignment exists
• Rapid Hypothesis Generation and Exploration
  - How do cultured and wild-type genomes differ?
  - Insertions, deletions, translocations
  - Correlation with environmental factors
• Export sequence and annotation
• Credits: Doug Rusch and Michael Press
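The recruitment definition above can be sketched as a filter over alignment records. This is a minimal illustration, not the viewer's implementation: the `Hit` record, the `max_overhang` tolerance, and the choice to keep only the highest-identity hit per read are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLASTN alignment of a read against a reference genome (hypothetical record)."""
    read_id: str
    read_len: int        # full length of the read in bases
    q_start: int         # 1-based alignment start on the read
    q_end: int           # 1-based alignment end on the read
    ref_start: int       # alignment start on the reference genome
    pct_identity: float


def is_recruited(hit, max_overhang=0):
    """True if the alignment spans the read end to end.

    max_overhang is an assumed tolerance for unaligned bases at either end.
    """
    lo, hi = sorted((hit.q_start, hit.q_end))
    return lo - 1 <= max_overhang and hit.read_len - hi <= max_overhang


def recruitment_points(hits, max_overhang=0):
    """Return sorted (genomic position, percent identity) pairs for recruited reads.

    When a read has several end-to-end hits, only the one with the highest
    identity is kept (an assumption for this sketch).
    """
    best = {}
    for hit in hits:
        if not is_recruited(hit, max_overhang):
            continue
        current = best.get(hit.read_id)
        if current is None or hit.pct_identity > current.pct_identity:
            best[hit.read_id] = hit
    return sorted((h.ref_start, h.pct_identity) for h in best.values())
```

The resulting pairs correspond to the two axes of a recruitment plot: genomic position on the x-axis and sequence similarity on the y-axis.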

Page 35: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Fragment Recruitment Viewer

[Figure: recruitment plot; x-axis: Genomic Position, y-axis: Sequence Similarity. Credit: Doug Rusch, JCVI]

Page 36: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

[Figure: recruitment plot; x-axis: Genomic Position, y-axis: Sequence Similarity; panels: Annotation, Geographic Legend]

Page 37: CAMERA Presentation at KNAW ICoMM Colloquium May 2008
Page 38: CAMERA Presentation at KNAW ICoMM Colloquium May 2008
Page 39: CAMERA Presentation at KNAW ICoMM Colloquium May 2008
Page 40: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Prochlorococcus marinus str. MIT 9312

• Coloring by geography
• 80-95% identity cloud
• = GOS Indian Ocean
• Regions with no coverage
  - Where?
  - Real?

Page 41: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Mate Status Highlights Differences

• Paired end (mate) sequencing
• Coloring by mate status
• Highlights cultured vs metagenomic differences
• Selective display of
  - Mates by status
  - Reads by sample
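To make "mate status" concrete, here is one way a pair of mated reads placed on a reference might be classified. The slides do not list the status categories, so the labels below (`good`, `too_close`, `too_far`, `wrong_orientation`, `missing_mate`), the `Placement` record, and the insert-size bounds are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Placement:
    """Where one read of a mate pair was recruited on the reference (hypothetical record)."""
    start: int    # leftmost reference coordinate
    end: int      # rightmost reference coordinate
    strand: str   # "+" or "-"


def mate_status(a: Optional[Placement], b: Optional[Placement],
                min_insert: int, max_insert: int) -> str:
    """Classify a mate pair using assumed category names.

    A pair is 'good' when the reads face each other (left read on '+',
    right read on '-') and span a distance within the expected insert size.
    """
    if a is None or b is None:
        return "missing_mate"
    left, right = sorted((a, b), key=lambda p: p.start)
    if not (left.strand == "+" and right.strand == "-"):
        return "wrong_orientation"
    span = right.end - left.start
    if span < min_insert:
        return "too_close"
    if span > max_insert:
        return "too_far"
    return "good"
```

Pairs that are not `good` are the interesting ones when comparing a cultured reference with metagenomic reads: a cluster of `too_far` pairs, for instance, could hint at a deletion in the environmental genomes relative to the reference.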

Page 42: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Mate Pairs Highlight Variation

Page 43: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

What Genes are Involved

Page 44: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

View by Sample

Page 45: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

View by Sample

Filter by mate status

Page 46: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Annotation of Environmental Shotgun Data

• Gene Finding
  - Using Yooseph’s Protein Clusters, and/or
  - MetaGene
• Functional Assignment
  - Variation of JCVI prok annotation pipeline*
  - Leverages protein cluster annotation (soon)
• Quality Nearly Comparable to Prokaryotic Genomic Annotation

Page 47: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Protein Clusters as Gene Finder

• Identification and soft mask of ncRNAs
• Naïve identification of ORFs (60aa min)
• Add peptides to clusters incrementally
  - Yooseph and Li, 2008
• Predicted Genes based on ORFs in
  - Clusters of sufficient size
  - Clusters that satisfy additional filters
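The naïve ORF step can be sketched as a six-frame scan for stop-free codon runs of at least 60 amino acids. This is an assumed reading of "naïve identification": it treats any stop-to-stop (or fragment-edge) stretch as an ORF, uses the standard stop codons, and ignores the ncRNA soft-masking and clustering steps. Function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_aa=60):
    """Yield (frame, start, end) for stop-free codon runs of at least min_aa codons.

    Coordinates are 0-based with an exclusive end, relative to the strand given.
    Runs may be open at either end of the fragment, since reads are short.
    """
    for frame in range(3):
        run_start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if (i - run_start) // 3 >= min_aa:
                    yield frame, run_start, i
                run_start = i + 3
        # Handle a run that reaches the end of the fragment without a stop.
        last = frame + ((len(seq) - frame) // 3) * 3
        if (last - run_start) // 3 >= min_aa:
            yield frame, run_start, last


def naive_orfs(seq, min_aa=60):
    """Return (strand, frame, start, end) for every candidate ORF in all six frames."""
    seq = seq.upper()
    found = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in orfs_in_strand(strand_seq, min_aa):
            found.append((strand, frame, start, end))
    return found
```

A scan like this over-predicts heavily, which is consistent with the slide's next step: candidate ORFs only become predicted genes if they land in clusters of sufficient size that pass additional filters.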

Page 48: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Protein Clusters: Advantages and Disadvantages

• Weaknesses
  - Homology-based
  - Stateful (also a strength)
  - Less sensitive (for now)
• Strengths
  - More specific
  - Transitive Annotation
  - Learns over time
  - Easy to maintain

Page 49: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Search for Dehalogenase

Page 50: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Browse Clusters

Page 51: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Near Future

• More extensive data collection

• Summary views of data sets by- Annotation

- Samples

- Mate Status

- Taxonomy

- Habitat and other contextual metadata

• 16S datasets?

Page 52: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Credits

• JCVI CAMERA Team
  - Leonid Kagan, Michael Press, Todd Safford, Cristian Goina, Qi Yang, Sean Murphy, Jeff Hoover, Tanja Davidsen, Ramana Madupu, Sree Nampally, Nikhat Zhafar, Prateek Kumar
  - Doug Rusch, Shibu Yooseph, Aaron Halpern*, Granger Sutton, Shannon Williamson
  - Marv Frazier and Bob Friedman
• Calit2 CAMERA Team
  - Adam Brust, Michael Chiu, Brian Fox, Adam Dunne, Kayo Arima
  - Larry Smarr and Paul Gilna

http://camera.calit2.net