CompostBin: A DNA Composition-Based Metagenomic Binning Algorithm

Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen

UC Davis schatterji@ucdavis.edu

Overview of Talk

Metagenomics and the binning problem.
CompostBin.

The Microbial World

Exploring the Microbial World

Culturing: Majority of microbes currently unculturable. No ecological context.

Molecular Surveys (e.g. 16S rRNA) “who is out there?” “what are they doing?”

Metagenomics

Interpreting Metagenomic Data

Nature of Metagenomic Data: mosaic, intraspecies polymorphism, fragmentary.

New Sequencing Technologies: enormous amount of data, short reads.

Metagenomic Binning

Classification of sequences by taxa

Binning in Action

Glassy Winged Sharpshooter (Homalodisca coagulata).

Feeds on plant xylem (poor in organic nutrients).

Microbial Endosymbionts

Current Binning Methods

Assembly
Align with Reference Genome
Database Search [MEGAN, BLAST]
Phylogenetic Analysis
DNA Composition [TETRA, PhyloPythia]

Current Binning Methods

Need closely related reference genomes. Poor performance on short fragments.

Sanger sequence reads are 500-1000 bp long. Current assembly methods are unreliable.

Complex communities are hard to bin.

Overview of Talk

Metagenomics and the binning problem.
CompostBin.

Genome Signatures

Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms? Yes [Karlin et al. 1990s]

What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

Imperfect World

Horizontal Gene Transfer: recent estimates [Ge et al. 2005] vary between 0-6% of genes, typically ~2%.

But… Amelioration

DNA-composition metrics

The K-mer Frequency Metric: CompostBin uses hexamers.

Working with K-mers for Binning

Curse of Dimensionality: O(4^K) independent dimensions.
Statistical noise increases with decreasing fragment lengths.
Project data into a lower dimensional space to decrease noise.
Principal Component Analysis.
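The K-mer frequency metric can be sketched as follows. This is a minimal illustration, not CompostBin's actual implementation: the function name `kmer_frequencies` is hypothetical, windows containing ambiguous bases are simply skipped, and strands are not merged with their reverse complements (both are simplifying assumptions). With k = 6 the vector has 4^6 = 4096 entries, which is the dimensionality problem noted above.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    # Fixed lexicographic ordering of all 4^k k-mers over A, C, G, T.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Assumption: windows with ambiguous bases (e.g. N) are ignored.
        if kmer in index:
            counts[index[kmer]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


# Small demonstration with dinucleotides so the vector is easy to inspect.
freqs = kmer_frequencies("ACGTACGTAC", k=2)
```

Each read becomes one row of a reads-by-4096 matrix, which is the input to the projection step.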

DNA-composition metrics

PCA separates species

Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]

Effect of Skewed Relative Abundance

B. anthracis and L. monocytogenes

Panels: abundance 1:1; abundance 20:1

A Weighting Scheme

For each read, find overlap with other sequences

Calculate the redundancy of each position.

[Figure: example per-position redundancy values 4, 5, 5, 3]

Weight is inverse of average redundancy.
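A minimal sketch of this weighting, under stated assumptions: redundancy at a position is taken to be the number of reads covering it, including the read itself, and the helper `redundancy_from_intervals` is a hypothetical stand-in that assumes read positions on a common coordinate are already known. In practice the overlaps would have to come from pairwise read comparison.

```python
import numpy as np


def read_weight(redundancy):
    """Weight of one read: inverse of its average per-position redundancy."""
    return 1.0 / np.mean(redundancy)


def redundancy_from_intervals(reads, length):
    """Toy redundancy: count of reads covering each position of a known layout.

    `reads` is a list of (start, end) half-open intervals (an assumption made
    for illustration only).
    """
    coverage = np.zeros(length, dtype=int)
    for start, end in reads:
        coverage[start:end] += 1
    return [coverage[start:end] for start, end in reads]


# A read whose four positions have redundancy 4, 5, 5, 3.
example_weight = read_weight([4, 5, 5, 3])  # 1 / 4.25

# Three reads on a 6 bp toy layout; the last two are identical copies.
per_read = redundancy_from_intervals([(0, 4), (2, 6), (2, 6)], length=6)
weights = [read_weight(r) for r in per_read]  # [0.5, 0.4, 0.4]
```

Reads from an over-sampled species are covered many times and therefore receive small weights, which is what counteracts a skewed abundance ratio.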

Weighted PCA

Calculate the weighted mean µw.

Calculate the weighted covariance matrix Mw.

PCs are eigenvectors of Mw. Use the first three PCs for further analysis.

$$M_w = \sum_{i=1}^{N} w_i \,(X_i - \mu_w)(X_i - \mu_w)^T$$
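The weighted PCA step can be sketched with NumPy as below. The slide's formula for the weighted mean did not survive extraction, so normalizing the mean by the total weight is an assumption here; the covariance follows the weighted sum of outer products of the centered reads. The function name `weighted_pca` and the random test data are illustrative only.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the leading eigenvectors of the weighted covariance."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    # Assumption: weighted mean normalized by the total weight.
    mu_w = (w[:, None] * X).sum(axis=0) / w.sum()
    centered = X - mu_w
    # M_w = sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (w[:, None] * centered).T @ centered
    # M_w is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components


# Hypothetical data: 50 reads with 8 composition features and random weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
w = rng.uniform(0.1, 1.0, size=50)
projected, components = weighted_pca(X, w)
```

With hexamer frequencies, X would have 4096 columns and `projected` would hold the three coordinates per read used in the later classification step.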

Weighted PCA separates species

B. anthracis and L. monocytogenes, abundance 20:1

Panels: PCA; Weighted PCA

Unsupervised Classification?

Semi-Supervised Classification

31 Marker Genes [courtesy Martin Wu]: omnipresent, relatively immune to lateral gene transfer.

Reads containing these marker genes can be classified with high reliability.

Semi-supervised Classification

Use a semi-supervised version of the normalized cut algorithm

The Semi-supervised Normalized Cut Algorithm

1. Calculate the K-nearest neighbor graph from the point set.

2. Update graph with marker information.
   o If two nodes are from the same species, add an edge between them.
   o If two nodes are from different species, remove any edge between them.

3. Bisect the graph using the normalized-cut algorithm.
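The three steps above can be sketched as follows. This is an illustration under assumptions rather than the talk's implementation: edges are unweighted, the bisection uses the standard spectral relaxation of the normalized cut (second-smallest eigenvector of the symmetric normalized Laplacian, split at zero), and the function names, the toy points, and the `labels` dictionary are all hypothetical.

```python
import numpy as np


def knn_graph(points, k):
    """Step 1: symmetric 0/1 adjacency matrix of the k-nearest-neighbor graph."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    W = np.zeros((len(points), len(points)))
    neighbors = np.argsort(dist, axis=1)[:, :k]
    for i in range(len(points)):
        W[i, neighbors[i]] = 1.0
    return np.maximum(W, W.T)


def apply_markers(W, labels):
    """Step 2: labels maps node index -> species for reads with marker genes."""
    W = W.copy()
    for a in labels:
        for b in labels:
            if a != b:
                # Same species: add an edge. Different species: remove it.
                W[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    return W


def normalized_cut_bisect(W):
    """Step 3: spectral relaxation of the normalized cut, thresholded at zero."""
    degree = np.maximum(W.sum(axis=1), 1e-12)  # guard against isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    return fiedler > 0


# Toy point set: two groups along a line, standing in for projected reads.
xs = [0, 1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 9.5]
points = np.array([[x, 0.0, 0.0] for x in xs])
labels = {0: "A", 3: "A", 6: "B", 9: "B"}
W = apply_markers(knn_graph(points, k=2), labels)
side = normalized_cut_bisect(W)
```

The dense distance matrix keeps the sketch short but scales quadratically with the number of reads; a real data set would need a sparse neighbor search and a sparse eigensolver.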

Generalization to multiple bins

Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Apply the algorithm recursively.

Testing

Simulate metagenomic sequencing with Sanger reads.

Variables: number of species, relative abundance, GC content, phylogenetic diversity.

Test on a “real” dataset where answer is well-established.
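A toy version of such a simulation is sketched below, purely to illustrate how the variables interact. Everything in it is an assumption: the genomes are random strings with a chosen GC fraction rather than real sequences, the source species of each read is drawn in proportion to abundance, read lengths are uniform between 500 and 1000 bp, and sequencing errors are ignored. The function names are hypothetical.

```python
import numpy as np


def random_genome(length, gc, rng):
    """Toy genome: independent bases with the requested GC fraction."""
    probs = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choice(list("ACGT"), size=length, p=probs))


def simulate_reads(genomes, abundances, n_reads, rng, min_len=500, max_len=1000):
    """Draw (species_index, read) pairs, species chosen in proportion to abundance."""
    p = np.asarray(abundances, dtype=float)
    p = p / p.sum()
    reads = []
    for _ in range(n_reads):
        s = int(rng.choice(len(genomes), p=p))
        length = int(rng.integers(min_len, max_len + 1))
        start = int(rng.integers(0, len(genomes[s]) - length + 1))
        reads.append((s, genomes[s][start:start + length]))
    return reads


# Two hypothetical species with different GC content at a 20:1 abundance ratio.
rng = np.random.default_rng(1)
genomes = [random_genome(20_000, 0.35, rng), random_genome(20_000, 0.65, rng)]
reads = simulate_reads(genomes, abundances=[20, 1], n_reads=200, rng=rng)
```

Because each read keeps its species index, binning output on the simulated reads can be scored against the known answer.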

Results

Conclusions/Future Directions

Satisfactory performance: no training on existing genomes, Sanger reads, low number of species.

Future Work
Holy Grail: complex communities.
Semi-supervised projection?
Hybrid assembly/binning.

Acknowledgements

UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann

UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan

Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff
