learning to love de bruijn graphs

Post on 17-Jan-2017

Learning to love de Bruijn graphs

Ben Woodcroft,

Australian Centre for Ecogenomics (ACE)

Winter School in Bioinformatics, 2015

A slide from Torsten Seemann

K-mers and assembly

• For next-generation sequencing, comparing every read with every other read is impractical.

– E.g. 10 million reads -> 10^7 x 10^7 = 10^14 read-read comparisons. Slowww..

• K-mers and de Bruijn graphs help make things tractable
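To make the idea concrete, here is a minimal sketch of how reads can be broken into k-mers and linked into a de Bruijn graph. The function names `kmers` and `build_de_bruijn` and the toy reads are made up for illustration, and the sketch ignores reverse complements and sequencing errors, which a real assembler has to handle. Each read is scanned once, so the work grows with the number of reads rather than with the number of read pairs.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy reads (hypothetical) that overlap one another.
reads = ["ACGTAC", "GTACGG", "TACGGA"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

A node with more than one successor in the printed output is a fork of the kind shown on the following slides.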

K-mers and assembly

Forks

K-mer too small

K-mer too large

My favourite k-mer size

With a 100bp read, this can never happen with a k-mer size of 51

Fewer tips, more bubbles

As read lengths get longer, assemblers must move from handling dead ends in the graph to handling bubbles.

Tips and bubbles
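As a rough illustration of the difference, the sketch below finds forks and dead ends in a small graph and checks whether the branches leaving a fork rejoin (a bubble) or not. The graph layout (a dict from node to a set of successors), the function names and the toy graphs are all assumptions for this example. Real assemblers also use branch length and coverage to decide what counts as a tip, which this sketch does not attempt.

```python
def find_forks_and_dead_ends(graph):
    """Return (forks, dead_ends) for a dict mapping node -> set of successors."""
    nodes = set(graph)
    for successors in graph.values():
        nodes |= set(successors)
    forks = {n for n in nodes if len(graph.get(n, ())) > 1}
    dead_ends = {n for n in nodes if len(graph.get(n, ())) == 0}
    return forks, dead_ends

def is_simple_bubble(graph, fork):
    """True if two branches leaving `fork` rejoin at a common node.

    Each branch is followed only while it has exactly one successor.
    """
    def walk(start):
        seen = [start]
        node = start
        while len(graph.get(node, ())) == 1:
            node = next(iter(graph[node]))
            if node in seen:  # stop if the branch loops back on itself
                break
            seen.append(node)
        return set(seen)

    branches = [walk(s) for s in graph.get(fork, ())]
    return any(a & b for i, a in enumerate(branches) for b in branches[i + 1:])

# Hypothetical toy graphs: one with a bubble, one with a dead-end branch.
bubble = {"A": {"B1", "B2"}, "B1": {"C"}, "B2": {"C"}, "C": {"D"}}
tip = {"A": {"B", "T"}, "B": {"C"}, "C": {"D"}}

print(find_forks_and_dead_ends(bubble), is_simple_bubble(bubble, "A"))
print(find_forks_and_dead_ends(tip), is_simple_bubble(tip, "A"))
```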

Metagenome assembly

Me: “I know, why don’t I just assemble all my data together?”

Run assembly

Wait 4 days

Out of memory allocating 18.4 million terabytes of RAM.

Solutions to RAM issues

• Quality trimming

• Hard trimming

• Throwing away a proportion of reads randomly

• Sequencing something else

Lossy de Bruijn graphs

The number of k-mers observed is vanishingly small relative to the total number of possible k-mers

The human genome: ~3 Gbp = ~3×10^9 k-mers

Total possible 51-mers: 4^51 = ~10^30

0.00000000000000000002%

When making a list of k-mers, counting extra ones probably has little effect on assembly.
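The arithmetic behind these numbers is easy to check. The short sketch below recomputes it with exact integers; the variable names are only for illustration. The exact value of 4^51 is closer to 5×10^30 than to 10^30, which does not change the conclusion.

```python
genome_kmers = 3 * 10**9      # ~3 Gbp, so roughly 3e9 k-mer positions
k = 51
possible_kmers = 4**k         # four possible bases at each of k positions

fraction = genome_kmers / possible_kmers
print(f"possible {k}-mers: {possible_kmers:.2e}")
print(f"fraction observed: {fraction:.1e}")
print(f"as a percentage:   {fraction * 100:.1e}%")
```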

Bloom filters

A low memory k-mer “store”

Is my k-mer in these reads?

From a Bloom filter, the answer is either “no” or “probably”
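A minimal sketch of a Bloom filter used as a k-mer store is shown below. It is not taken from any particular assembler: the class name, the use of SHA-256 from Python's standard library as the hash, and the sizes chosen are all assumptions for illustration. A k-mer that was added always answers "probably"; a k-mer that was never added usually answers "no", but may occasionally collide with bits set by other k-mers.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array queried through several hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False means "no"; True only means "probably".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Store the 4-mers of a hypothetical read, then query.
read = "ACGTACGGA"
stored = [read[i:i + 4] for i in range(len(read) - 3)]
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in stored:
    bloom.add(kmer)

print("ACGT" in bloom)   # was added, so the answer is "probably"
print("TTTT" in bloom)   # never added: "no", unless a false positive occurs
```

The memory used is fixed by `num_bits` regardless of how many k-mers are added; adding more k-mers to the same filter only raises the false positive rate.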

A finishing approach to assembly

A central assumption of this method is that the genome is “mostly” complete

Scaffolding without mate pair data

Gap filling vs. assembly

• Regular assembly ain’t easy

• Re-assembly is more straightforward because you are trying to get to somewhere
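Having a destination turns the problem into a path search. The sketch below is one simple way to picture it, not a description of how FinishM works: it builds a de Bruijn graph from reads and runs a breadth-first search from a node near the end of one contig to a node near the start of the next. The function names, the `max_steps` limit and the toy sequences are all hypothetical.

```python
from collections import defaultdict, deque

def build_graph(reads, k):
    """De Bruijn graph: (k-1)-mer node -> set of following (k-1)-mer nodes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def fill_gap(graph, start, goal, max_steps=1000):
    """Breadth-first search from node `start` to node `goal`.

    Returns the sequence spelled by the shortest path, or None if the
    goal is not reached within max_steps bases.
    """
    queue = deque([(start, start)])
    seen = {start}
    while queue:
        node, seq = queue.popleft()
        if node == goal:
            return seq
        if len(seq) - len(start) >= max_steps:
            continue
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + nxt[-1]))
    return None

# Hypothetical reads spanning a gap between a contig ending in "...CGT"
# and a contig starting with "TGG...".
reads = ["ACGTTGCA", "TGCATGGA"]
graph = build_graph(reads, k=4)
print(fill_gap(graph, start="CGT", goal="TGG"))
```

Because the search stops as soon as it reaches the goal node, it never has to resolve the parts of the graph that lie off the route between the two contigs.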

Gap filling can correct assembly errors

• Contigs often contain errors right at their ends

• By starting the search a little way back from the end of the contig (e.g. 200bp), these errors can be overcome

Gap-filling can account for strain variation

github.com/wwood/finishm

Thanks!

• Slideshare.com/benjwoodcroft

• Github.com/wwood

• Ecogenomic.org
