seminar: haider et al. 2014, bioinformatics:btu395

Omega: an overlap-graph de novo assembler for metagenomics Presentation by Rosemary McCloskey Bahlul Haider 1 Tae-Hyuk Ahn 1 Brian Bushnell 2 Juanjuan Chai 1 Alex Copeland 2 Chongle Pan 1 1 Oak Ridge National Laboratory 2 US Department of Energy Joint Genome Institute October 2, 2014 Haider et al. Omega October 2, 2014 1 / 11

Upload: rosemary-mccloskey

Post on 28-Jul-2015

Omega: an overlap-graph de novo assembler for metagenomics

Presentation by Rosemary McCloskey

Bahlul Haider¹, Tae-Hyuk Ahn¹, Brian Bushnell², Juanjuan Chai¹, Alex Copeland², Chongle Pan¹

¹ Oak Ridge National Laboratory

² US Department of Energy Joint Genome Institute

October 2, 2014

Haider et al. Omega October 2, 2014 1 / 11

Metagenomics

Analysis of genetic material (typically bacteria) directly from environmental samples.

The front bumper of a 2006 Dodge Caravan (“The Wanderer”) was . . . taped with a double-sided carpet tape. The tapes were applied . . . in State College, Pennsylvania, and removed . . . in Manchester, Connecticut. New tapes were again applied in Portland, Maine . . . and removed in Moncton, New Brunswick . . . the following day.

The list included unexpected entries such as the genus Homo even though the two trips were uneventful.

Challenges

Metagenomic samples can contain hundreds of distinct genomes.

Most do not have a reference genome.

Representation of genomes is not equal.

Goal of Omega: assemble individual genomes from metagenomic data.

Trindade-Silva, Amaro E., et al. “Polyketide synthase gene diversity within the microbiome of the sponge Arenosclera brasiliensis, endemic to the Southern Atlantic Ocean.” Applied and Environmental Microbiology 79.5 (2013): 1598-1605.

Graph theory

Definition

A graph G is a pair (V, E), where V is the vertex set (any set) and E is a (multi)set of edges connecting elements of V.

[Figure: example graph on the vertices a, b, c, d]

Definition

A bi-directed graph associates two directions to each edge, one at each of its endpoints.

Definition

A path is a sequence of contiguous edges with no loops.
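As a minimal sketch of the definitions above, the following Python stores the edge multiset E as a `Counter` and checks whether a vertex sequence is a path. The specific edges and the helper name `is_path` are illustrative assumptions (the edges of the slide's figure are not recoverable from the transcript), and "no loops" is read here as "no vertex is visited twice".

```python
from collections import Counter

# Vertex set V and edge multiset E for a small, made-up example graph.
V = {"a", "b", "c", "d"}
# The count 2 marks a parallel edge, which is why E is a multiset.
E = Counter({("a", "b"): 1, ("b", "c"): 2, ("c", "d"): 1})


def is_path(vertices, edges):
    """True if consecutive vertices are joined by an edge and no vertex repeats."""
    if len(set(vertices)) != len(vertices):
        return False  # revisiting a vertex would close a loop
    return all(
        (u, v) in edges or (v, u) in edges  # edges are undirected here
        for u, v in zip(vertices, vertices[1:])
    )
```

For instance, `is_path(["a", "b", "c", "d"], E)` holds, while `is_path(["a", "c"], E)` does not because a and c are not adjacent.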

Overlap graph

node = read

edge = overlap

direction = prefix/suffix, forward/reverse

path = contig

disallow paths entering and exiting a node by the same type of arrow
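The mapping above can be illustrated with a toy overlap graph. This is a sketch under stated assumptions, not Omega's implementation: the function names, the brute-force all-pairs comparison, and the `min_len` threshold are all hypothetical. Each read may be used as given (`+`) or reverse-complemented (`-`), and the arrow rule is modelled by requiring that a read leave a walk in the same orientation in which it entered.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
ORIENT = {"+": lambda s: s, "-": lambda s: s.translate(COMPLEMENT)[::-1]}


def overlap_len(a, b, min_len):
    """Longest proper suffix of a equal to a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)) - 1, max(min_len, 1) - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Edges (i, strand_i, j, strand_j, overlap) between distinct reads."""
    edges = []
    for i, j in product(range(len(reads)), repeat=2):
        if i == j:
            continue
        for si, sj in product("+-", repeat=2):
            k = overlap_len(ORIENT[si](reads[i]), ORIENT[sj](reads[j]), min_len)
            if k:
                edges.append((i, si, j, sj, k))
    return edges


def spell(reads, walk):
    """Spell the contig of a walk, rejecting reads that switch orientation."""
    first, strand = walk[0][0], walk[0][1]
    contig = ORIENT[strand](reads[first])
    prev = (first, strand)
    for i, si, j, sj, k in walk:
        if (i, si) != prev:
            raise ValueError("read entered and exited in different orientations")
        contig += ORIENT[sj](reads[j])[k:]  # append only the non-overlapping part
        prev = (j, sj)
    return contig


reads = ["ACGTAC", "TACGGA", "AATCCG"]  # third read lies on the reverse strand
edges = build_overlap_graph(reads)
contig = spell(reads, [(0, "+", 1, "+", 3), (1, "+", 2, "-", 4)])
```

In this sketch every overlap is listed twice, once per strand, whereas a bi-directed graph would store it as a single edge with an arrowhead at each end.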

Simplifications

remove triangles

contract simple vertices

trim small branches

remove bubbles
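The first two simplifications can be sketched on a plain directed graph stored as a dict of successor sets. This is an illustration only: it ignores the bi-directed arrow types, the function names are hypothetical, vertex labels are assumed to be sortable, and branch trimming and bubble removal are not shown.

```python
def remove_triangles(succ):
    """Drop edge a->c whenever a->b and b->c also exist."""
    reduced = {u: set(vs) for u, vs in succ.items()}
    for a, outs in succ.items():
        for b in outs:
            for c in succ.get(b, ()):
                if c in outs and c != b:
                    reduced[a].discard(c)  # a->c is implied by a->b->c
    return reduced


def contract_simple_vertices(succ):
    """Merge unbranched chains; returns one vertex list per contracted edge."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    indeg = {n: 0 for n in nodes}
    for vs in succ.values():
        for v in vs:
            indeg[v] += 1
    # A simple vertex has exactly one way in and one way out.
    simple = {n for n in nodes if indeg[n] == 1 and len(succ.get(n, ())) == 1}
    paths = []
    for u in sorted(nodes - simple):
        for v in sorted(succ.get(u, ())):
            path = [u, v]
            while path[-1] in simple:
                path.append(next(iter(succ[path[-1]])))
            paths.append(path)
    return paths


graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"d"}, "d": {"e"}}
reduced = remove_triangles(graph)
chains = contract_simple_vertices(reduced)
```

Here the edge a→c is removed because a→b→c already implies it, after which the whole graph collapses into the single chain a, b, c, d, e. Cycles made up only of simple vertices are not reported by this sketch.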

Finding contigs

push flow along long edges (>1000 bp)

minimize total flow in network

merge edges with mate-pair support

scaffold edges with mate-pair support

resolve ambiguity by coverage depth
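The mate-pair steps above rest on counting how many read pairs link two edges. The sketch below shows only that counting step, under simplifying assumptions: each mate pair is taken to be already mapped to a pair of edge identifiers, insert-size consistency is ignored, and `supported_links` is a hypothetical name. The default threshold of 3 is the value the talk cites for scaffolding.

```python
from collections import Counter


def supported_links(mate_pairs, min_support=3):
    """Count mate pairs joining two distinct edges; keep well-supported links."""
    counts = Counter(
        tuple(sorted((e1, e2)))  # treat (e1, e2) and (e2, e1) as the same link
        for e1, e2 in mate_pairs
        if e1 != e2  # pairs falling inside a single edge link nothing
    )
    return {pair: n for pair, n in counts.items() if n >= min_support}


mate_pairs = (
    [("e1", "e2")] * 3
    + [("e2", "e1")]
    + [("e2", "e3")] * 2
    + [("e1", "e1")]
)
links = supported_links(mate_pairs)
```

With these made-up pairs, only the link between e1 and e2 reaches the threshold; e2 and e3 share two pairs and are left unjoined.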

Benchmarking: real data

HiSeq 100-bp dataset, 64 micro-organisms.

N80 = k ⇔ contigs of length ≥ k contain 80% of the assembled bases (larger is better).
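A short sketch of how such a statistic can be computed, assuming the usual base-weighted reading of N80: k is the largest length such that contigs of length ≥ k together hold at least 80% of the assembled bases. The function name `nx` and the example lengths are illustrative, not taken from the paper.

```python
def nx(contig_lengths, x=80):
    """Largest length L such that contigs of length >= L hold at least x% of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 100 >= total * x:  # integer arithmetic avoids rounding issues
            return length
    return 0  # empty assembly


lengths = [100, 80, 60, 40, 20]
n80 = nx(lengths)
n50 = nx(lengths, x=50)
```

For these lengths the total is 300 bases, so N80 is 60 (100 + 80 + 60 = 240 bases) and N50 is 80.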

Benchmarking: simulated data

Simulated MiSeq 300-bp dataset, 9 genomes.

Good and bad

What was good:

Clear and detailed description of algorithm.

Upfront about limitations of their software and difficulties of benchmarking.

Room for improvement:

Many unjustified parameter choices (long edges defined as > 1000 bp, scaffolding requires support of 3 mate pairs, . . . ).

Why no Celera on real data?

Thank you!
