Post on 05-Jan-2016

Page 1: CompostBin : A DNA composition based metagenomic binning algorithm Sourav Chatterji *, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen UC Davis schatterji@ucdavis.edu

CompostBin: A DNA composition based metagenomic binning algorithm

Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen

UC Davis schatterji@ucdavis.edu

Page 2:

Overview of Talk

Metagenomics and the binning problem.
CompostBin

Page 3:

The Microbial World

Page 4:

Exploring the Microbial World

Culturing
  Majority of microbes currently unculturable.
  No ecological context.

Molecular Surveys (e.g. 16S rRNA)
  “who is out there?”
  “what are they doing?”

Page 5:

Metagenomics

Page 6:

Interpreting Metagenomic Data

Nature of Metagenomic Data
  Mosaic
  Intraspecies polymorphism
  Fragmentary

New Sequencing Technologies
  Enormous amount of data
  Short Reads

Page 7:

Metagenomic Binning

Classification of sequences by taxa

Page 8:

Binning in Action

Glassy Winged Sharpshooter (Homalodisca coagulata).

Feeds on plant xylem (poor in organic nutrients).

Microbial Endosymbionts

Page 9:

Page 10:

Current Binning Methods

Assembly
Align with Reference Genome
Database Search [MEGAN, BLAST]
Phylogenetic Analysis
DNA Composition [TETRA, PhyloPythia]

Page 11:

Current Binning Methods

Need closely related reference genomes.
Poor performance on short fragments.
  Sanger sequence reads 500-1000 bp long.
  Current assembly methods unreliable.
Complex Communities Hard to Bin.

Page 12:

Overview of Talk

Metagenomics and the binning problem.
CompostBin

Page 13:

Genome Signatures

Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms?
  Yes [Karlin et al. 1990s]

What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

Page 14:

Imperfect World

Horizontal Gene Transfer
  Recent Estimates [Ge et al. 2005]
    Varies between 0-6% of genes.
    Typically ~2%.

But… Amelioration

Page 15:

DNA-composition metrics

The K-mer Frequency Metric
CompostBin uses hexamers
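A minimal sketch of a K-mer frequency vector for one read, with K = 6 (hexamers) as the default. The function name `kmer_frequencies`, the lexicographic ordering of the 4^K entries, and the choice to skip windows containing characters other than A, C, G, T are assumptions made for illustration; the slides do not say how ambiguous bases or reverse complements are handled.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    Windows containing any other character (for example N) are skipped.
    """
    alphabet = "ACGT"
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * (4 ** k)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        slot = index.get(seq[start:start + k])
        if slot is not None:
            counts[slot] += 1
    total = sum(counts)
    # An all-zero vector is returned for reads shorter than k.
    return [c / total for c in counts] if total else [0.0] * len(counts)
```

Each read then becomes one row of a reads-by-4096 matrix when k = 6.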

Page 16:

DNA-composition metrics

Working with K-mers for Binning.
  Curse of Dimensionality: O(4^K) independent dimensions.
  Statistical noise increases with decreasing fragment lengths.
  Project data into a lower dimensional space to decrease noise.
  Principal Component Analysis.
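The projection step might look like the following sketch: plain (unweighted) PCA via an eigendecomposition of the covariance matrix, using NumPy. The function name `pca_project` and the default of three components are illustrative assumptions, and the rows of `X` are assumed to be per-read K-mer frequency vectors.

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project the rows of X (reads x features) onto the top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = centered.T @ centered / X.shape[0]
    # eigh is used because cov is symmetric; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return centered @ eigvecs[:, order]
```

For hexamers the covariance matrix is 4096 x 4096, which is still cheap to decompose directly.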

Page 17:

PCA separates species

Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]

Page 18:

Effect of Skewed Relative Abundance

B. anthracis and L. monocytogenes

Abundance 1:1    Abundance 20:1

Page 19:

A Weighting Scheme

For each read, find overlap with other sequences

Page 20:

A Weighting Scheme

Calculate the redundancy of each position.

4 5 5 3

Weight is inverse of average redundancy.
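As a small worked example of the weighting rule, the sketch below takes a per-position redundancy profile for one read and returns the inverse of its average. Treating the slide's values 4 5 5 3 as four equally sized stretches of a read is an assumption, as is the function name `read_weight`; how overlaps are detected is not covered here.

```python
def read_weight(redundancy):
    """Weight of one read: the inverse of its mean per-position redundancy."""
    if not redundancy:
        raise ValueError("redundancy profile is empty")
    return len(redundancy) / sum(redundancy)


# Mean redundancy is (4 + 5 + 5 + 3) / 4 = 4.25, so the weight is 1 / 4.25.
w = read_weight([4, 5, 5, 3])
```

Reads from an abundant species overlap many other reads, so they get small weights and dominate the projection less.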

Page 21:

Weighted PCA

Calculate weighted mean µw:

  \mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i

Calculate weighted co-variance matrix Mw:

  M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T

PCs are eigenvectors of Mw.
Use first three PCs for further analysis.
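A sketch of the weighted PCA step in NumPy, following the weighted mean and weighted covariance defined on this slide. Rescaling the weights so that they sum to N (which makes the division by N yield a true weighted average) is an assumption, as are the function name `weighted_pca` and its return values.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Weighted PCA of the rows of X with one non-negative weight per row.

    Returns (projected points, weighted mean).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    w = w * n / w.sum()                          # assumption: weights sum to N
    mu_w = (w[:, None] * X).sum(axis=0) / n      # weighted mean
    centered = X - mu_w
    M_w = (w[:, None] * centered).T @ centered   # weighted covariance matrix
    eigvals, eigvecs = np.linalg.eigh(M_w)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return centered @ eigvecs[:, order], mu_w
```

With all weights equal this reduces to ordinary PCA up to a constant scaling of the covariance matrix, which does not change the eigenvectors.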

Page 22:

Weighted PCA separates species

B. anthracis and L. monocytogenes: 20:1

PCA    Weighted PCA

Page 23:

Un-supervised Classification?

Page 24:

Semi-Supervised Classification

31 Marker Genes [courtesy Martin Wu]
  Omni-present
  Relatively Immune to Lateral Gene Transfer

Reads containing these marker genes can be classified with high reliability.

Page 25:

Semi-supervised Classification

Use a semi-supervised version of the normalized cut algorithm

Page 26:

The Semi-supervised Normalized Cut Algorithm

1. Calculate the K-nearest neighbor graph from the point set.

2. Update graph with marker information.
   o If two nodes are from the same species, add an edge between them.
   o If two nodes are from different species, remove any edge between them.

3. Bisect the graph using the normalized-cut algorithm.
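The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the graph is unweighted, Euclidean distance is used for the neighbor search, the cut is the standard spectral relaxation (second-smallest eigenvector of the symmetric normalized Laplacian, thresholded at zero), and the function name `semi_supervised_ncut` and its `labels` convention are hypothetical.

```python
import numpy as np


def semi_supervised_ncut(points, labels, k=5):
    """Bisect points with a normalized cut on a label-adjusted kNN graph.

    labels[i] is a species name for reads carrying a marker gene, else None.
    Returns a boolean array giving the side of the cut for every point.
    """
    P = np.asarray(points, dtype=float)
    n = len(P)
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)

    # Step 1: symmetric, unweighted k-nearest-neighbor graph.
    A = np.zeros((n, n))
    for i in range(n):
        # Position 0 is the point itself (assumes no duplicate points).
        nearest = np.argsort(dist[i])[1:k + 1]
        A[i, nearest] = 1.0
    A = np.maximum(A, A.T)

    # Step 2: add edges within a species, remove edges between species.
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] is not None and labels[j] is not None:
                A[i, j] = A[j, i] = 1.0 if labels[i] == labels[j] else 0.0

    # Step 3: spectral relaxation of the normalized cut.
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)          # ascending eigenvalues
    fiedler = d_inv_sqrt * vecs[:, 1]
    return fiedler > 0
```

If the adjusted graph is disconnected, the second eigenvector is not unique and the zero threshold is unreliable; splitting by connected components would be the safer choice in that case.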

Page 27:

Generalization to multiple bins

Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Apply algorithm recursively
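Recursive application of a two-way split could be organized as in the sketch below. The slides do not state when the recursion stops, so the stopping rule is left to a caller-supplied function; `bisect` and `should_split` are hypothetical hooks introduced for illustration only.

```python
def recursive_bins(items, bisect, should_split):
    """Split items into bins by applying a two-way split recursively.

    bisect(items) -> (left, right) performs one bisection.
    should_split(items) -> bool decides whether a bin is split further.
    Both callables are hypothetical hooks, not part of any published API.
    """
    if len(items) < 2 or not should_split(items):
        return [items]
    left, right = bisect(items)
    if not left or not right:
        # Degenerate cut: keep the bin whole rather than recurse forever.
        return [items]
    return (recursive_bins(left, bisect, should_split)
            + recursive_bins(right, bisect, should_split))
```

In the binning setting, `bisect` would wrap one semi-supervised normalized cut, and `should_split` might check whether a bin still contains marker reads from more than one species.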

Page 28:

Generalization to multiple bins

Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Page 29:

Testing

Simulate Metagenomic Sequencing
  Sanger Reads
  Variables
    Number of species
    Relative abundance
    GC content
    Phylogenetic Diversity

Test on a “real” dataset where answer is well-established.
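The simulation step listed above might be sketched as follows: reads are drawn from known genomes so that the true bin of every read is available for scoring. This toy version samples error-free, single-stranded reads of 500 to 1000 bp (the Sanger read length quoted earlier in the talk) in proportion to abundance times genome length; the function name `simulate_reads` and all of its parameters are assumptions, and it is not the simulator used for the results in this talk.

```python
import random


def simulate_reads(genomes, abundance, n_reads, min_len=500, max_len=1000, seed=0):
    """Sample reads from genomes in proportion to abundance * genome length.

    genomes: dict name -> DNA string; abundance: dict name -> relative abundance.
    Returns a list of (source_name, read_sequence) pairs.
    """
    rng = random.Random(seed)
    names = list(genomes)
    mass = [abundance[name] * len(genomes[name]) for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=mass)[0]
        genome = genomes[name]
        # Reads are clipped to the genome length for very short toy genomes.
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)
        reads.append((name, genome[start:start + length]))
    return reads
```

Varying the `abundance` ratios and the choice of genomes covers the variables named on the slide: number of species, relative abundance, GC content and phylogenetic diversity.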

Page 30:

Results

Page 31:

Conclusions/Future Directions

Satisfactory performance
  No Training on Existing Genomes
  Sanger Reads
  Low number of Species

Future Work
  Holy Grail: Complex Communities
  Semi-supervised projection?
  Hybrid Assembly/Binning

Page 32:

Acknowledgements

UC Davis
  Jonathan Eisen
  Martin Wu
  Dongying Wu
  Ichitaro Yamazaki
  Amber Hartman
  Marcel Huntemann

UC Berkeley
  Lior Pachter
  Richard Karp
  Ambuj Tewari
  Narayanan Manikandan

Princeton University
  Simon Levin
  Josh Weitz
  Jonathan Dushoff
