Genovo: De Novo Assembly for Metagenomes

Gao Song
2010/07/14

Outline
Overview of Metagenomics
Current Assemblers
Genovo Assembly

Overview of Metagenomics

Metagenomics is:

Motivation
Why do we need metagenomics?
Snapshot of the bacterial community
Most bacteria cannot be cultivated (<1% can)

Applications
Monitoring the impact of pollutants on ecosystems
Discovery of new genes, enzymes…
- Global Ocean Sampling Expedition
Human Microbiome Project
JGI sequenced Acid Mine Drainage sample

Two Paradigms
Two ways:
Marker Gene Sequencing
- 16S rRNA:
- Other marker genes: RuBisCO, NifH
- Only composition
Whole Genome Sequencing (WGS)
- Detailed picture of community

Complex Communities
[Figure residue: ">1000X5000200L", "1million"]

Current Assemblers

Current Status
Why not assemble reads?
ORFome assembler*
Three steps:
- The putative ORFs are annotated for each read
- ORFs are assembled using EULER
- ORF homologs are searched for in the Integrated Microbial Genomes (IMG) database
Existing WGS assemblers
- Sanger reads: Phrap, Celera, Arachne, JAZZ…
- Short reads: Velvet, Newbler…
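
The first ORFome step (annotating putative ORFs on each read) can be illustrated with a generic six-frame scan. This is only a minimal sketch, not the annotation method used by Ye and Tang: it assumes that a putative ORF is any stop-free stretch of at least `min_codons` codons, and the names `putative_orfs` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def putative_orfs(read, min_codons=10):
    """List stop-free stretches in all six reading frames of a read.

    Assumption: reads are fragments, so no start codon is required;
    any stretch of >= min_codons codons without a stop codon counts.
    Returns tuples (strand, frame, nucleotide_sequence).
    """
    orfs = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for frame in range(3):
            start = frame
            for pos in range(frame, len(seq) - 2, 3):
                if seq[pos:pos + 3] in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, frame, seq[start:pos]))
                    start = pos + 3
            # Trailing stretch up to the last complete codon in this frame.
            end = frame + ((len(seq) - frame) // 3) * 3
            if (end - start) // 3 >= min_codons:
                orfs.append((strand, frame, seq[start:end]))
    return orfs
```

In the pipeline described on the slide, the stretches found this way would then be passed to the assembly and homology-search steps.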

* Y. Ye and H. Tang, "An orfome assembly approach to metagenomics sequences analysis." Journal of bioinformatics and computational biology, vol. 7, no. 3, pp. 455-471, June 2009

Genovo: De Novo Assembly for Metagenomes

Jonathan Laserson, Vladimir Jojic and Daphne Koller. RECOMB 2010, LNBI 6044, pp. 341-356, 2010

Main Idea
Propose a generative model for metagenome data
Use iterated conditional modes (ICM)
Use hill-climbing steps iteratively
Design a score for evaluation

Model
Initialize contigs:
- Infinite contigs with infinite length
Partition the reads
- Using Chinese Restaurant Process
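
How a Chinese Restaurant Process partitions reads into an unbounded number of contigs can be sketched as follows. This is a generic CRP sampler, not Genovo's code; the concentration parameter `alpha` and the function name `crp_partition` are assumptions made for illustration.

```python
import random


def crp_partition(num_reads, alpha, seed=0):
    """Sample a partition of reads into contigs from a CRP prior.

    Read i joins existing contig k with probability proportional to the
    number of reads already in k, or opens a new contig with probability
    proportional to alpha (assumed > 0).
    Returns (assignments, counts).
    """
    rng = random.Random(seed)
    assignments = []  # contig index of each read
    counts = []       # number of reads per contig
    for _ in range(num_reads):
        weights = counts + [alpha]  # last slot stands for "new contig"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


assignments, counts = crp_partition(num_reads=20, alpha=1.0)
print(len(counts), "contigs for 20 reads")
```

The number of contigs is not fixed in advance: larger values of `alpha` tend to open more contigs, while well-populated contigs attract further reads.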

Model
Generate the starting point o_i
Generate the length of the read
Quality of assembly of each read
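
The generative steps for a single read can be sketched as below. Several simplifications are assumed and are not taken from the slides: the starting point is drawn from a symmetric geometric distribution with parameter `rho`, the read length is uniform in a range, and noise consists of substitutions only (no indels). The contig is a lazily filled dictionary, standing in for a sequence of unbounded length.

```python
import random

BASES = "ACGT"


def symmetric_geometric(rng, rho):
    """Sample an integer o with P(|o| = m) = (1 - rho)**m * rho, random sign."""
    m = 0
    while rng.random() >= rho:
        m += 1
    return m if rng.random() < 0.5 else -m


def contig_letter(contig, rng, pos):
    """Return the contig base at pos, drawing it uniformly on first access."""
    if pos not in contig:
        contig[pos] = rng.choice(BASES)
    return contig[pos]


def generate_read(contig, rng, rho, min_len, max_len, error_rate):
    """Generate one read: starting point, length, then a noisy copy."""
    o = symmetric_geometric(rng, rho)        # starting point o_i
    length = rng.randint(min_len, max_len)   # read length (assumed uniform)
    letters = []
    for j in range(length):
        base = contig_letter(contig, rng, o + j)
        if rng.random() < error_rate:        # substitution noise only
            base = rng.choice([b for b in BASES if b != base])
        letters.append(base)
    return o, "".join(letters)


rng = random.Random(0)
contig = {}
print(generate_read(contig, rng, rho=0.1, min_len=8, max_len=12, error_rate=0.01))
```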

Algorithm
Using ICM
Starting from the initial condition, hill-climbing moves are performed iteratively
Move 1: Consensus Sequence:
- Select the most frequent base
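
Move 1 can be illustrated with a column-wise majority vote. This minimal sketch assumes that each read is already placed at an integer offset and aligned without gaps, which is a simplification; the function name `consensus` is hypothetical.

```python
from collections import Counter, defaultdict


def consensus(placed_reads):
    """Pick the most frequent base in every covered contig position.

    placed_reads: iterable of (offset, sequence) pairs, ungapped.
    Returns a dict mapping position -> consensus base.
    """
    columns = defaultdict(Counter)
    for offset, seq in placed_reads:
        for j, base in enumerate(seq):
            columns[offset + j][base] += 1
    return {pos: counts.most_common(1)[0][0]
            for pos, counts in sorted(columns.items())}


reads = [(0, "ACGT"), (1, "CGTA"), (0, "ACCT")]
cons = consensus(reads)
print("".join(cons[p] for p in sorted(cons)))  # prints "ACGTA"
```

With the reads fixed in place, choosing the majority base per column is the setting that best agrees with the reads covering that column, which is why it works as a hill-climbing step.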

Algorithm
Move 2: Read Mapping
For read i, first remove it, then recalculate its contig and alignment
First, for each potential location, compute the alignment
Then, select the location according to its probability
Filtering: using common 10-mers
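
The read-mapping move can be sketched for a single contig as follows. The assumptions here are not from the slides: alignments are ungapped, each base matches with probability `1 - error_rate`, and overhanging bases are scored as uniform. Candidate locations are restricted to those sharing a k-mer with the read (k = 10, as on the slide), and one location is then sampled in proportion to its likelihood.

```python
import math
import random
from collections import defaultdict


def kmer_index(contig, k):
    """Map every k-mer of the contig to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(contig) - k + 1):
        index[contig[i:i + k]].append(i)
    return index


def candidate_offsets(read, index, k):
    """Offsets at which the read shares at least one k-mer with the contig."""
    offsets = set()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            offsets.add(i - j)
    return sorted(offsets)


def log_likelihood(read, contig, offset, error_rate):
    """Ungapped log-likelihood of the read placed at the given offset."""
    ll = 0.0
    for j, base in enumerate(read):
        pos = offset + j
        if 0 <= pos < len(contig):
            p = 1 - error_rate if contig[pos] == base else error_rate / 3
        else:
            p = 0.25  # overhang beyond the known contig: uniform base
        ll += math.log(p)
    return ll


def sample_location(read, contig, k=10, error_rate=0.01, rng=None):
    """Sample an offset among k-mer-filtered candidates, or None if none."""
    rng = rng or random.Random(0)
    cands = candidate_offsets(read, kmer_index(contig, k), k)
    if not cands:
        return None
    lls = [log_likelihood(read, contig, o, error_rate) for o in cands]
    top = max(lls)
    weights = [math.exp(ll - top) for ll in lls]  # stabilised likelihoods
    return rng.choices(cands, weights=weights)[0]
```

The k-mer filter avoids scoring every possible offset, at the cost of missing locations where errors break all shared 10-mers.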

Algorithm
Move 3: Update geometric variable
Global moves:
- Propose indels
- Center
- Merge contigs
- Chimeric reads
- Disassemble the dangling contigs

Evaluation
BLAST
PFAM
Designed score
- 1st term: quality of assembly
- 2nd term: penalty for total length
- 3rd term: prefer to merge when V > V0

Results
Using 454 reads
Compared with Newbler, Velvet and EULER-SR
Single Genome

Result
Metagenome data

Score

PFAM

Discussion
New idea
Apply a mature algorithm to the assembly domain
Systematically describe and analyze the problem and algorithm
Results are better

Discussion
Slow: minutes vs. hours for 300k 454 reads
Main idea: try to extend as long as possible, so they will have more hits for BLAST
Why choose 20 for V0?
How to deal with branching? Repeats?
Model:
- Why can it capture the properties of metagenomic data?
- How to argue the correctness of that model?
- The distribution of starting points

Thank you