de-novo genome assembly and its current state · 2016-11-24 · de-novo genome assembly and its...

32
De-Novo Genome Assembly and its Current State Anne-Katrin Emde April 17, 2013 Freie Universität Berlin, Algorithmische Bioinformatik Max Planck Institut für Molekulare Genetik, Computational Molecular Biology

Upload: others

Post on 11-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

De-Novo Genome Assembly and its Current State

Anne-Katrin Emde April 17, 2013

Freie Universität Berlin, Algorithmische Bioinformatik Max Planck Institut für Molekulare Genetik, Computational Molecular Biology

Page 2: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

3

What is genome assembly and why is it hard?

Original genome sequence (in multiple copies) Sequence short fragments Reconstruct original sequence from reads

Assembler

Difficulties: - Genomes are very long and repeat-rich - Reads are very short and may contain errors and biases

Page 3: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

4

Assembly algorithms - introduction Assemblers use overlaps between reads, to first produce contigs, that are then used to build scaffolds.

Read pairs contribute long-range linking information, especially in the scaffolding phase.

reads

contig NNNN...NNNN

scaffold

Page 4: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

6

2 Paradigmen 1) Overlap – Layout – Consensus: Focus is on sequence reads

2) De Bruijn graph: Focus is on k-tuples occuring in sequence reads

NGS algorithms, March 20 2012

Page 5: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

7

Algorithms – Overlap Graph

ACGTAATT GTAATTCA ATTCAGTC GTCCATGT CATGTTGA TGTTGACT

ACGTAATTCAGTCCATGTTGACT Kececioglu and Myers, 1995

1) Overlap phase: pairwise overlap alignments

2) Layout phase: overlap graph construction and finding relative placement of reads

3) Consensus phase: Produce multiple read alignment and compute contig consensus sequences

Page 6: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

8

Algorithms – Overlap Graph 1) Overlap phase:

for all pairs of reads

Computationally expensive!

read1

re

ad2

pairwise overlap alignments

Page 7: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

9

Algorithms – Overlap Graph 1) Overlap phase:

2) Layout phase:

nodes = reads edges = overlaps

r1

r2

r3

r4 r5

r6

r1 r2

r3

r4

r5 r6 -6

-5 -3

-3 -3

-5 -6

Page 8: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

10

Algorithms – Overlap Graph 1) Overlap phase:

2) Layout phase:

nodes = reads edges = overlaps

r1

r2

r3

r4 r5

r6

r1 r2

r3

r4

r5 r6 -6

-5 -3

-3 -3

-5 -6

path through the graph read layout

In theory Hamiltonian path (NP-complete), in practice heuristics

Page 9: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

11

Algorithms – Overlap Graph

2) Layout phase:

nodes = reads edges = overlaps

r1 r2

r3

r4

r5 r6 -6

-5 -3

-3 -3

-5 -6

3) Consensus phase: multiple read alignment

contig

ACGTAATT GTAATTCA ATTCAGTC GTCCATGT CATGTTGA TGTTGACT

ACGTAATTCAGTCCATGTTGACT

r1 r2 r3 r4 r5 r6

Page 10: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

12

Algorithms – Overlap Graph

2) Layout phase:

nodes = reads edges = overlaps

3) Consensus phase: multiple read alignment

contig

ACGTAATT GTAATTCA AT - CAGTC GTCCATGT CATATTGA TGTTGACT

ACGTAATTCAGTCCATGTTGACT

r1 r2 r3 r4 r5 r6

robust with alignment errors

r1 r2

r3

r4

r5 r6 -6

-5 -3

-3 -3

-5 -6

Page 11: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

13

Page 12: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

14

Page 13: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

15

Page 14: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

16

Overlap Graph There are not only simple paths…

R R A B C

- Coverage is high - Path branches is a repeat! R

Approximate ordering in the overlap graph:

R A

B

C

Page 15: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

17

Overlap Graph There are not only simple paths…

Approximate ordering in the overlap graph:

R R A B C

A R

B

C

Page 16: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

18

Overlap Graph Now: R1 and R2 are nearly identical

Approximate ordering in the overlap graph:

A A A A T

T T T

A A A A

T T T T

R1 R2 A B C

A Rx

B

C

Overlap strictness: Tradeoff between error tolerance and “natural” repeat separation

Page 17: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

20

Overlap Graph Now: R1 and R2 are nearly identical

Approximate ordering in the overlap graph:

A A A

Overlap strictness: Tradeoff between error tolerance and “natural” repeat separation

R1 R2 A B C

A R1

B

C R2

A T T T T

Page 18: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

21

Algorithms - de Bruijn graph (Euler assembler) - No overlap phase, no consensus phase, basically just a layout phase

ACGT CGTA GTAA TAAT TACG TTCA ATTC AATT

contig: TACGTAATTCA

Nodes = k-mers Edges = (k+1)-mers

de Bruijn, 1946; Pevzner, 2001; Medvedev 2007

Given k = 4 and three read sequences: CGTAATTC GTAATTCA TACGTAAT

CGTA GTAA TAAT AATT ATTC GTAA

TACGTAAT CGTAATTC GTAATTCA

r3

r1

r2

r1

r2

r3

In theory „de Bruijn Superwalk Problem“ (NP-hard), in practice heuristics

r1

r2 r3

Page 19: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

22

de Bruijn graph Again, there are not only simple paths…

ACGT

CGTA

GTAC

TACG

GACG

GTCA

CGTC

...G ACGT ACGTCA... GACGTACG

CGTACGTC GTACGTCA

GACGTCA is a read-incoherent path

Repetitive k-mers introduce cycles

Page 20: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

23

de Bruijn graph What if we increase k to 5?

GACGT

...GACGTACGTCA... GACGTACG

CGTACGTC GTACGTCA

Back to a linear graph structure

ACGTA CGTAC GTACG TACGT ACGTC CGTCA

Increasing k leads to better repeat resolution

Page 21: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

24

de Bruijn graph What if we have sequencing errors?

...GACGTACGTCA... GACGTACG

CGTACGTC GTACGGCA

Additional nodes

GACGT ACGTA CGTAC GTACG TACGT ACGTC CGTCA

TACGG ACGGC CGGCA

Page 22: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

25

A quick example

GTCG (1x)

TCGA (1x)

CGAG (1x)

GAGG (1x)

One read: GTCGAGG

Page 23: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

26

A quick example

AGAT (8x)

ATCC (7x)

TCCG (7x)

CCGA (7x)

CGAT (6x)

GATG (5x)

ATGA (8x)

TGAG (9x)

GATC (8x)

GATT (1x)

TAGT (3x)

AGTC (7x)

GTCG (9x)

TCGA (10x)

GGCT (11x)

TAGA (16x)

AGAG (9x)

GAGA (12x)

GACA (8x)

ACAG (5x)

GCTT (8x)

GCTC (2x)

CTTT (8x)

CTCT (1x)

TTTA (8x)

TCTA (2x)

TTAG (12x)

CTAG (2x)

AGAC (9x)

AGAA (1x)

CGAG (8x)

CGAC (1x)

GAGG (16x)

GACG (1x)

AGGC (16x)

ACGC (1x)

All the others…

Page 24: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

27

A quick example

AGAT (8x)

ATCC (7x)

TCCG (7x)

CCGA (7x)

CGAT (6x)

GATG (5x)

ATGA (8x)

TGAG (9x)

GATC (8x)

GATT (1x)

TAGT (3x)

AGTC (7x)

GTCG (9x)

TCGA (10x)

GGCT (11x)

TAGA (16x)

AGAG (9x)

GAGA (12x)

GACA (8x)

ACAG (5x)

GCTT (8x)

GCTC (2x)

CTTT (8x)

CTCT (1x)

TTTA (8x)

TCTA (2x)

TTAG (12x)

CTAG (2x)

AGAC (9x)

AGAA (1x)

CGAG (8x)

CGAC (1x)

GAGG (16x)

GACG (1x)

AGGC (16x)

ACGC (1x)

All the others…

Page 25: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

28

A quick example

TAGTCGA

AGAGA TAGA

AGAT

GCTTTAG

GCTCTAG

AGACAG

AGAA

CGAG

CGACGC

GAGGCT

GATCCGATGAG

GATT

After simplification…

Page 26: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

29

Error removal

TAGTCGA

AGAGA TAGA

AGAT

GCTTTAG

GCTCTAG

AGACAG

CGAG

GAGGCT

GATCCGATGAG

Tips removed…

Page 27: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

30

Error removal

TAGTCGA

AGAGA TAGA

AGAT

GCTTTAG AGACAG

CGAG

GAGGCT

GATCCGATGAG

Bubbles removed …

Page 28: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

31

Error removal

TAGTCGAG AGAGACAG

AGATCCGATGAG

GAGGCTTTAGA

Final simplification…

Page 29: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

32

de Bruijn graph What if we have sequencing errors?

...GACGTACGTCA... GACGTACG

CGTACGTC GTACCTCA

Additional nodes

GACGT ACGTA CGTAC GTACG TACGT ACGTC CGTCA

TACCT ACCTC CCTCA GTACC

& graph disconnected

Choice of k: Tradeoff between error tolerance and repeat resolution

Page 30: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

36

What is a good assembly? Some assembly evaluation metrics:

- High N50 contig size (most commonly used metric)

contigs sorted by size

N50 contig size

assembled as

Only if a reference sequence is known!

original sequence

- Low number of assembly errors (sequence errors, structural misjoins)

Page 31: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

39

Conclusions - Most assemblers use either the OLC or de Bruijn graph paradigm,

both lead to NP-hard assembly models.

- However, assembler performance is independent of the underlying paradigm and mainly depends on heuristics for repeat resolution and handling noise.

- How to measure assembly accuracy is another important aspect, in general tradeoff between assembly contiguity and correctness.

- Evaluations show that assembly is far from solved, assembler performance still quite inconsistent.

“For large genomes, the choice of assemblers is often limited to those that will run without crashing” (GAGE paper, 2012)

Page 32: De-Novo Genome Assembly and its Current State · 2016-11-24 · De-Novo Genome Assembly and its Current State Anne-Katrin Emde . April 17, 2013 . Freie Universität Berlin, Algorithmische

40

References - GAGE: a critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg et

al., Genome Res. 2012 - Assemblathon 2: evaluating de-novo of genome assembly in three vertebrate species, Bradnam et

al., not yet published - Assembly algorithms for next-generation sequencing data. Jason R. Miller, Sergey Koren, Granger

Sutton, Genomics, 2010. - Fragment assembly string graph, Myers, 2005

Figures: - geneed.nlm.nih.gov

- http://paper-shredding-services-review.toptenreviews.com/ - www.illumina.com - A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation

Sequencing Technologies, Zhang et al., PLOS one, 2011

Titel, Datum