Post on 05-Jan-2016

CompostBin : A DNA composition based metagenomic binning algorithm

Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen

UC Davis schatterji@ucdavis.edu

Overview of Talk

Metagenomics and the binning problem.
CompostBin

The Microbial World

Exploring the Microbial World

Culturing: Majority of microbes currently unculturable. No ecological context.

Molecular Surveys (e.g. 16S rRNA): “who is out there?” “what are they doing?”

Metagenomics

Interpreting Metagenomic Data

Nature of Metagenomic Data: Mosaic. Intraspecies polymorphism. Fragmentary.

New Sequencing Technologies: Enormous amount of data. Short reads.

Metagenomic Binning

Classification of sequences by taxa

Binning in Action

Glassy Winged Sharpshooter (Homalodisca coagulata).

Feeds on plant xylem (poor in organic nutrients).

Microbial Endosymbionts

Current Binning Methods

Assembly
Align with Reference Genome
Database Search [MEGAN, BLAST]
Phylogenetic Analysis
DNA Composition [TETRA, PhyloPythia]

Current Binning Methods

Need closely related reference genomes.
Poor performance on short fragments.
Sanger sequence reads are 500-1000 bp long.
Current assembly methods unreliable.
Complex communities hard to bin.

Overview of Talk

Metagenomics and the binning problem.
CompostBin

Genome Signatures

Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms? Yes [Karlin et al. 1990s]

What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

Imperfect World

Horizontal Gene Transfer
Recent estimates [Ge et al. 2005]: varies between 0-6% of genes, typically ~2%.

But… Amelioration

DNA-composition metrics

The K-mer Frequency Metric
CompostBin uses hexamers

Working with K-mers for Binning.
Curse of Dimensionality: O(4^K) independent dimensions.
Statistical noise increases with decreasing fragment lengths.
Project data into a lower dimensional space to decrease noise: Principal Component Analysis.
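
As a rough illustration of the K-mer frequency metric, the sketch below turns one read into a vector of 4^K normalized K-mer frequencies (K = 6 for hexamers). The function name `kmer_frequencies` is hypothetical, and the choices to skip K-mers containing ambiguous bases and to leave reverse complements unmerged are assumptions made for this example, not details taken from the talk.

```python
from itertools import product

def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    Hypothetical helper for illustration; k=6 corresponds to hexamers.
    """
    alphabet = "ACGT"
    # Fixed ordering of all 4**k possible k-mers, so every read maps
    # to the same coordinates.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
        if pos is not None:
            counts[pos] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

A read of length L contributes at most L - K + 1 hexamers spread over 4096 coordinates, which is one way to see why the noise grows as fragments get shorter.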

DNA-composition metrics

PCA separates species

Gluconobacter oxydans [65% GC] and Rhodospirillum rubrum [61% GC]

Effect of Skewed Relative Abundance

B. anthracis and L. monocytogenes

(Figure panels: Abundance 1:1; Abundance 20:1.)

A Weighting Scheme

For each read, find overlap with other sequences.

Calculate the redundancy of each position (example values from the figure: 4, 5, 5, 3).

Weight is inverse of average redundancy.
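
A minimal sketch of the weighting step follows, assuming the overlaps of a read with other reads are already known as intervals in that read's coordinates. Both function names are hypothetical, and counting the read itself in the redundancy is an assumption of this example.

```python
def redundancy_from_overlaps(read_len, overlaps):
    """Per-position redundancy of one read.

    overlaps: list of (start, end) half-open intervals, in read
    coordinates, covered by other reads. Assumption: each position
    counts the read itself plus every read overlapping it.
    """
    depth = [1] * read_len
    for start, end in overlaps:
        for pos in range(max(0, start), min(read_len, end)):
            depth[pos] += 1
    return depth

def read_weight(redundancy):
    """Weight of a read: the inverse of its average redundancy."""
    if not redundancy:
        raise ValueError("redundancy profile must be non-empty")
    return len(redundancy) / sum(redundancy)
```

With the redundancy values 4, 5, 5, 3, the average is 4.25 and the weight is 1/4.25, so reads from heavily sampled (abundant) genomes are down-weighted.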

Weighted PCA

Calculate the weighted mean µw:

$$\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$$

Calculate the weighted covariance matrix Mw:

$$M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$$

PCs are eigenvectors of Mw. Use first three PCs for further analysis.
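
The weighted PCA step can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the authors' implementation: the function name `weighted_pca` is hypothetical, and the weights are rescaled to sum to N so that dividing the weighted sum by N gives a proper weighted mean.

```python
import numpy as np

def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the leading eigenvectors of the
    weighted covariance matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=float)   # shape (N, d): one row per read
    w = np.asarray(w, dtype=float)   # shape (N,): one weight per read
    N = X.shape[0]
    # Assumption: rescale the weights so that they sum to N.
    w = w * N / w.sum()
    # Weighted mean: (1/N) * sum_i w_i X_i
    mu_w = (w[:, None] * X).sum(axis=0) / N
    centered = X - mu_w
    # Weighted covariance: sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (w[:, None] * centered).T @ centered
    # M_w is symmetric, so eigh applies; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components, eigvals[order]
```

With all weights equal, this reduces to ordinary PCA up to a constant scaling of the eigenvalues.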

Weighted PCA separates species

B. anthracis and L. monocytogenes, abundance 20:1

(Figure panels: PCA; Weighted PCA.)

Un-supervised Classification?

Semi-Supervised Classification

31 Marker Genes [courtesy Martin Wu]
Omni-present
Relatively immune to Lateral Gene Transfer

Reads containing these marker genes can be classified with high reliability.

Semi-supervised Classification

Use a semi-supervised version of the normalized cut algorithm

The Semi-supervised Normalized Cut Algorithm

1. Calculate the K-nearest neighbor graph from the point set.

2. Update graph with marker information.
   o If two nodes are from the same species, add an edge between them.
   o If two nodes are from different species, remove any edge between them.

3. Bisect the graph using the normalized-cut algorithm.
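
The three steps above can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, edges are unweighted (0/1), and the bisection uses the standard spectral relaxation of the normalized cut (sign of the second-smallest generalized eigenvector), which assumes the graph is connected.

```python
import numpy as np

def knn_graph(points, k):
    """Step 1: symmetric 0/1 adjacency matrix of the k-nearest-neighbor graph."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)          # a node is not its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:
            W[i, j] = W[j, i] = 1.0       # symmetrize
    return W

def apply_markers(W, labels):
    """Step 2: labels maps node index -> species for marker-bearing reads.

    Same species: add an edge. Different species: remove any edge.
    """
    W = W.copy()
    for a in labels:
        for b in labels:
            if a != b:
                W[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    return W

def normalized_cut_bisect(W):
    """Step 3: spectral relaxation of the normalized cut.

    Returns a boolean array giving the side of each node.
    Assumption: the graph is connected.
    """
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)    # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]     # second-smallest eigenvector
    return fiedler > 0
```

Thresholding at zero is the simplest splitting rule; searching for the threshold that minimizes the normalized-cut value is a common refinement.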

Generalization to multiple bins

Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Apply algorithm recursively

Testing

Simulate Metagenomic Sequencing
Sanger reads
Variables: number of species, relative abundance, GC content, phylogenetic diversity

Test on a “real” dataset where answer is well-established.

Results

Conclusions/Future Directions

Satisfactory performance
No training on existing genomes
Sanger reads
Low number of species

Future Work
Holy Grail: Complex Communities
Semi-supervised projection?
Hybrid Assembly/Binning

Acknowledgements

UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann

UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan

Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff
