Learning to love de Bruijn graphs Ben Woodcroft, Australian Centre for Ecogenomics (ACE) Winter School in Bioinformatics, 2015

Upload: benjwoodcroft

Post on 17-Jan-2017

TRANSCRIPT

Page 1: Learning to Love De Bruijn Graphs

Learning to love de Bruijn graphs

Ben Woodcroft,

Australian Centre for Ecogenomics (ACE)

Winter School in Bioinformatics, 2015

Page 2: Learning to Love De Bruijn Graphs

A slide from Torsten Seemann

Page 3: Learning to Love De Bruijn Graphs

K-mers and assembly

• For next-generation sequencing, comparison of each read with each other read is impossible.
– E.g. 10 million reads -> 10^7 x 10^7 read-read comparisons. Slowww..

• K-mers and de Bruijn graphs help make things tractable
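
As a rough illustration of why k-mers help, the sketch below builds a tiny de Bruijn graph in Python. Each read is cut into k-mers once, so the work grows with the total number of bases rather than with the number of read pairs. The function names and the toy reads are made up for this example, and reverse complements and sequencing errors are ignored; real assemblers handle both.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-mer (substring of length k) in a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph where nodes are (k-1)-mers and edges are k-mers.

    Returns a dict mapping each (k-1)-mer to the set of (k-1)-mers that follow it.
    Assumption for this sketch: forward strand only, no error handling.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # The k-mer links its first k-1 bases to its last k-1 bases.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy reads (hypothetical data) drawn from the sequence ATGGCGTGCA
reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Overlapping reads end up sharing nodes automatically, which is what replaces the explicit read-read comparison.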

Page 4: Learning to Love De Bruijn Graphs

K-mers and assembly

Page 5: Learning to Love De Bruijn Graphs

Forks

Page 6: Learning to Love De Bruijn Graphs

K-mer too small

Page 7: Learning to Love De Bruijn Graphs

K-mer too large

Page 8: Learning to Love De Bruijn Graphs

My favourite k-mer size

Page 9: Learning to Love De Bruijn Graphs

My favourite k-mer size

With a 100bp read, this can never happen with a k-mer size of 51

Page 10: Learning to Love De Bruijn Graphs

Fewer tips, more bubbles

As read lengths get longer, assemblers must move from handling dead ends in the graph to handling bubbles.
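
To make the two structures concrete, here is a minimal sketch that walks each branch leaving a fork and labels it as a tip (a short dead end) or as part of a bubble (branches that rejoin). The function name, the `max_tip_len` threshold and the toy graph are assumptions for illustration only; this is not how any particular assembler does it.

```python
from collections import defaultdict


def classify_forks(graph, max_tip_len=2, max_walk=1000):
    """For each fork (node with more than one successor), walk every branch.

    A branch that dead-ends within max_tip_len nodes is reported as a tip.
    Branches that arrive at the same merge node are reported as a bubble.
    """
    # Count predecessors so we can recognise where branches merge.
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1

    results = {}
    for fork, successors in graph.items():
        if len(successors) < 2:
            continue
        tips = []
        merge_points = defaultdict(list)
        for start in sorted(successors):
            node, length = start, 1
            while length <= max_walk:
                if indegree[node] > 1:        # another path also arrives here
                    merge_points[node].append(start)
                    break
                nexts = graph.get(node, set())
                if not nexts:                 # dead end
                    if length <= max_tip_len:
                        tips.append(start)    # short dead end: call it a tip
                    break
                if len(nexts) > 1:            # nested fork: stop this simple walk
                    break
                node = next(iter(nexts))
                length += 1
        bubbles = {m: b for m, b in merge_points.items() if len(b) > 1}
        results[fork] = {"tips": tips, "bubbles": bubbles}
    return results


# Toy graph (hypothetical): A forks into a bubble, D forks into a tip.
toy = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"D"},      # B and C rejoin at D: a bubble
    "D": {"E", "F"},
    "E": {"G"},
    "G": {"H"},      # long branch that simply ends: not a tip
    "F": set(),      # short dead end: a tip
}
results = classify_forks(toy)
print(results)
```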

Page 11: Learning to Love De Bruijn Graphs

Tips and bubbles

Page 12: Learning to Love De Bruijn Graphs

Metagenome assembly

Me: “I know, why don’t I just assemble all my data together?”

Run assembly
Wait 4 days
Out of memory allocating 18.4 million terabytes of RAM.

Page 13: Learning to Love De Bruijn Graphs

Solutions to RAM issues

• Quality trimming
• Hard trimming
• Throwing away a proportion of reads randomly
• Sequencing something else

Page 14: Learning to Love De Bruijn Graphs

Lossy de Bruijn graphs

The number of k-mers observed is vanishingly small relative to the total number of possible k-mers

The human genome: ~3 Gbp = ~3×10^9 k-mers
Total possible 51-mers: 4^51 = ~10^30

0.00000000000000000002%

When making a list of k-mers, counting extra ones probably has little effect on assembly.
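
The arithmetic behind this slide can be checked in a few lines of Python. The variable names are invented for this sketch, and the genome size is the rounded figure from the slide, so treat the result as an order-of-magnitude estimate only.

```python
# Rounded figure from the slide: ~3 Gbp, so roughly 3×10^9 distinct k-mers at most.
genome_kmers = 3 * 10**9

# Four possible bases at each of 51 positions.
possible_51mers = 4**51

fraction = genome_kmers / possible_51mers
print(f"possible 51-mers:  {possible_51mers:.2e}")
print(f"fraction observed: {fraction:.1e} ({fraction * 100:.1e} %)")
```

Whatever the exact rounding, the observed k-mers are a vanishingly small slice of the possible ones, which is why a few spurious extra k-mers rarely connect to anything real.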

Page 15: Learning to Love De Bruijn Graphs

Bloom filters

A low memory k-mer “store”

Page 16: Learning to Love De Bruijn Graphs

Is my k-mer in these reads?

From a Bloom filter, the answer is either “no” or “probably”
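
A minimal Bloom filter sketch in Python shows where the “no” or “probably” comes from. The class name, the bit-array size, the number of hashes and the use of SHA-256 are all assumptions chosen for readability; production k-mer tools use much faster hash functions and carefully tuned sizes.

```python
import hashlib


class BloomFilter:
    """A fixed-size bit array plus several hash functions (illustrative only)."""

    def __init__(self, num_bits=10_000, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False: at least one bit is unset, so the item was never added ("no").
        # True: every bit is set, possibly by other items ("probably").
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Store the 4-mers of a toy read (hypothetical data).
read = "ATGGCGTGCA"
store = BloomFilter()
for i in range(len(read) - 4 + 1):
    store.add(read[i:i + 4])

print("ATGG" in store)   # a k-mer that was added is always reported as present
print("TTTT" in store)   # a k-mer that was not added is usually, but not always, absent
```

Memory use depends only on `num_bits`, not on how many k-mers are added; the price is a false-positive rate that climbs as the bit array fills up.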

Page 17: Learning to Love De Bruijn Graphs

A finishing approach to assembly

A central assumption of this method is that the genome is “mostly” complete

Page 18: Learning to Love De Bruijn Graphs

Scaffolding without mate pair data

Page 19: Learning to Love De Bruijn Graphs

Gap filling vs. assembly

• Regular assembly ain’t easy
• Re-assembly is more straightforward because you are trying to get to somewhere
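
Having a destination turns gap filling into a path search between two known nodes. The sketch below uses a bounded breadth-first search over a toy de Bruijn graph; the function names, the `max_nodes` limit and the graph are hypothetical, and this is not a description of how finishm itself searches.

```python
from collections import deque


def find_path(graph, source, target, max_nodes=1000):
    """Breadth-first search for a shortest path of nodes from source to target.

    Returns the list of nodes, or None if target is unreachable within max_nodes.
    """
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        if len(path) >= max_nodes:
            continue
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def spell(path):
    """Join overlapping (k-1)-mers: the first node plus the last base of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy graph (hypothetical): ATG is the end of one contig, GCA the start of the next.
toy = {
    "ATG": {"TGG"},
    "TGG": {"GGC", "GGA"},
    "GGC": {"GCA"},
    "GGA": set(),          # a dead end the search can ignore
}
path = find_path(toy, "ATG", "GCA")
print(path, spell(path))
```

Because the search stops as soon as it reaches the target, dead ends and unrelated parts of the graph cost little, which is part of what makes this easier than assembling from scratch.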

Page 20: Learning to Love De Bruijn Graphs

Gap filling can correct assembly errors

• Contigs often contain errors right at their ends

• By starting the search a bit back from the end of the contig (e.g. 200bp), these errors can be overcome

Page 21: Learning to Love De Bruijn Graphs

Gap-filling can account for strain variation

github.com/wwood/finishm

Page 22: Learning to Love De Bruijn Graphs

Thanks!

• Slideshare.com/benjwoodcroft

• Github.com/wwood

• Ecogenomic.org