27626 - Next Generation Sequencing Analysis
de novo assembly
Simon Rasmussen
36626: Next Generation Sequencing Analysis, DTU Bioinformatics



  • 27626 - Next Generation Sequencing Analysis

Generalized NGS analysis

Question → Raw reads → Pre-processing → Assembly: alignment / de novo → Application specific: variant calling, count matrix, ... → Compare samples / methods → Answer?

(Slide figure annotates the pipeline with a data-size axis)


  • 36626 - Next Generation Sequencing Analysis

    Merge small DNA fragments together so they form a previously unknown sequence

    What is de novo assembly?

  • 36626 - Next Generation Sequencing Analysis

Merge millions of reads together so they form previously unknown sequences

    What is de novo assembly?

  • de novo assembly
    • Assemble reads into longer fragments
    • Find overlap between reads
    • Many approaches

    reads → contigs → scaffolds


  • Let's try to assemble some reads! Rules:
    • a minimum of 7-bp overlap
    • overlap must not include any N bases
    • same orientation, so that the sequence can be read left to right
    • there may be 1-bp differences
    • simplified - no double-stranded DNA

..NNNNGGACTATGATTCG ||||||| TGATTCGAGGCTAANN..

..NNNNNNNNCGATTCTGATCCGA ||||||| GTCCTCGATTCTNNNNNNNN..

..NNNNCGGACTATGATT |||||| ATGATTCGAGGCTAANN..

..NNNNNNNNCGCTACTGATCCGA || | ||| GTCCTCGATTCTGNNNNNNN..

  • Which are valid?

    ..NNNNGGACTATGATTCG ||||||| TGATTCGAGGCTAANN..

    ..NNNNNNNNCGATTCTGATCCGA ||||||| GTCCTCGATTCTNNNNNNNN..

    ..NNNNCGGACTATGATT |||||| ATGATTCGAGGCTAANN..

    ..NNNNNNNNCGCTACTGATCCGA || | ||| GTCCTCGATTCTGNNNNNNN..

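The rules above can be sketched as a small check. This is a minimal sketch, assuming a simple suffix-prefix comparison in one orientation; the function name and the 1-mismatch threshold are illustrative, not from the slides:

```python
def valid_overlap(left, right, min_len=7, max_mismatch=1):
    """Longest valid overlap between the suffix of `left` and the prefix
    of `right` under the slide's rules: >= 7 bp, no N bases in the
    overlap, at most 1 mismatch, same left-to-right orientation.
    Returns 0 if no valid overlap exists. (Illustrative sketch.)"""
    best = 0
    for n in range(min_len, min(len(left), len(right)) + 1):
        suffix, prefix = left[-n:], right[:n]
        if "N" in suffix or "N" in prefix:
            continue  # overlap must not include any N bases
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatch:
            best = n
    return best

# first pair from the slide: 7-bp perfect overlap "TGATTCG"
print(valid_overlap("GGACTATGATTCG", "TGATTCGAGGCTAA"))  # -> 7
```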

  • 36626 - Next Generation Sequencing Analysis

    Which approaches?

    • Greedy (“Simple” approach)

    • Overlap-Layout-Consensus (OLC)

    • de Bruijn graphs

  • 36626 - Next Generation Sequencing Analysis

Simple approach - Greedy

Pseudo code:

    1. Pairwise alignment of all reads

    2. Identify fragments that have largest overlap

    3. Merge these

    4. Repeat until all overlaps are used

    • Can only resolve repeats smaller than read length

• High computational cost with an increasing number of reads
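The four pseudo-code steps above can be sketched as follows. This is a toy illustration (the example reads and the `min_len` cutoff are made up; real assemblers also handle reverse complements and base qualities):

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    (at least min_len bp), or 0 if there is none."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Greedy assembly: repeatedly merge the pair of fragments with the
    largest overlap until no overlaps remain. Assumes distinct reads."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for a in reads:                       # 1. pairwise alignment of all reads
            for b in reads:
                if a is not b:
                    n = overlap(a, b, min_len)
                    if n > best[0]:           # 2. identify the largest overlap
                        best = (n, a, b)
        n, a, b = best
        if n == 0:                            # 4. stop when no overlaps are left
            break
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])               # 3. merge the best pair
    return reads

print(greedy_assemble(["ATGGCC", "GGCCTT", "CCTTAA"]))  # -> ['ATGGCCTTAA']
```

The double loop over all pairs in every round is the "high computational cost" the slide warns about.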

  • 36626 - Next Generation Sequencing Analysis

Reads → Contigs → Scaffolds

    • Overlap Layout Consensus and de Bruijn use a similar general approach.

    1. Try to correct sequence errors in reads with high coverage

    2. Assemble reads to contiguous sequence fragments “contigs”

    3. Identify repeat contigs

    4. Combine and order contigs to “scaffolds”, with gaps representing regions of uncertainty

  • 36626 - Next Generation Sequencing Analysis

    Overlap-Layout-Consensus

• Create overlap graph by all-vs-all alignment (Overlap)
• Build graph where each node is a read and edges are overlaps between reads (Layout)
• Example

separate paths. Short repeats of this type can be resolved, but they require additional processing and therefore additional time.

Another potential drawback of the de Bruijn approach is that the de Bruijn graph can require an enormous amount of computer space (random access memory, or RAM). Unlike conventional overlap computations, which can be easily partitioned into multiple jobs with distinct batches of reads, the construction and analysis of a de Bruijn graph is not easily parallelized. As a result, de Bruijn assemblers such as Velvet and ALLPATHS, which have been used successfully on bacterial genomes, do not scale to large genomes. For a human-sized genome, these programs would require several terabytes of RAM to store their de Bruijn graphs, which is far more memory than is available on most computers.

To date, only two de Bruijn graph assemblers have been shown to have the ability to assemble a mammalian-sized genome. ABySS (Simpson et al. 2009) assembled a human genome in 87 h on a cluster of 21 eight-core machines each with 16 GB of RAM (168 cores, 336 GB of RAM total). SOAPdenovo assembled a human genome in 40 h using a single computer with 32 cores and 512 GB of RAM (Li et al. 2010). Although these types of computing resources are not widely available, they are within reach for large-scale scientific centers.

In theory, the size of the de Bruijn graph depends only on the size of the genome, including polymorphic alleles, and should be independent of the number of reads. However, because sequencing errors create their own graph nodes, increasing the number of reads inevitably increases the size of the de Bruijn graph. In the de novo assembly of human from short reads, SOAPdenovo reduced the number of 25-mers from 14.6 billion to 5.0 billion by correcting errors before constructing the de Bruijn graph (Li et al. 2010). Its error correction method first counts the number of occurrences of all k-mers in the reads and replaces any k-mers occurring less than three times with the highest-frequency alternative k-mer.
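The error-correction idea just described (count all k-mers, then replace any k-mer seen fewer than three times with its highest-frequency single-base variant) can be sketched as below. Function names are illustrative and this is not SOAPdenovo's actual implementation:

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count occurrences of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_kmer(kmer, counts, min_count=3):
    """If a k-mer is rare (< min_count), return the highest-frequency
    single-base variant; otherwise return it unchanged."""
    if counts[kmer] >= min_count:
        return kmer
    best, best_count = kmer, counts[kmer]
    for i in range(len(kmer)):
        for base in "ACGT":
            cand = kmer[:i] + base + kmer[i + 1:]
            if counts[cand] > best_count:
                best, best_count = cand, counts[cand]
    return best

# toy data: one read carries a likely sequencing error (C instead of G)
counts = kmer_counts(["ATGGC"] * 5 + ["ATGCC"], k=4)
print(correct_kmer("ATGC", counts))  # -> ATGG (rare k-mer corrected)
```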

Choice of assembler and sequencing strategy

Only de Bruijn graph assemblers have demonstrated the ability to successfully assemble very short reads (<100 bp); for longer reads, overlap graph assemblers have been quite successful and have a much better track record overall. A de Bruijn graph assembler should function with longer reads as well, but a large difference between the read length and the k-mer length will result in many more branching nodes than in the simplified overlap graph. The precise conditions under which one assembly method is superior to the other remain an open question, and the answer may ultimately depend on the specific assembler and genome characteristics.

As Figure 3 illustrates, there is a direct and dramatic tradeoff among read length, coverage, and expected contig length in a genome assembly. The figure shows the theoretical expected contig length, based on the Lander-Waterman model (Lander and Waterman 1988), in an assembly where all overlaps have been detected perfectly. This model, which was widely applied for predicting assembly quality in the Sanger sequencing era, predicts that under ideal conditions, 710-bp reads should require 3× coverage to produce 4-kbp average contig sizes, while 30-bp reads would require 28× coverage. In practice, the model is inadequate for modeling very short reads: the figure also shows the actual contig sizes for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with 52-bp reads. The dog assembly tracked closely to the theoretical prediction, while the panda assembly has contig sizes that are many times lower than predicted by the model. The large discrepancy between predicted and observed assembly quality results from the fact that

Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k-1 bases. In both approaches, repeat sequences create a fork in the graph. Note here we have only considered the forward orientation of each sequence to simplify the figure.

Figure 3. Expected average contig length for a range of different read lengths and coverage values. Also shown are the average contig lengths and N50 lengths for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with reads averaging 52 bp in length.

Schatz et al.
1168 Genome Research, www.genome.org
Downloaded from genome.cshlp.org (Cold Spring Harbor Laboratory Press) on December 2, 2010


    Schatz et al., Genome Res, 2010

  • 36626 - Next Generation Sequencing Analysis

Overlap-Layout-Consensus
• Create consensus sequence
• We need to use graph theory to solve the graph
• Walk the Hamiltonian path, i.e. visit each node exactly once


Imagine trying to solve this for a graph of hundreds of thousands of nodes (= reads) - this is an NP-complete problem
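To see why this blows up, here is a brute-force Hamiltonian path search on a toy directed graph (node names are illustrative). Trying every ordering of n nodes is O(n!), which is hopeless for hundreds of thousands of reads:

```python
from itertools import permutations

def hamiltonian_path(nodes, edges):
    """Brute-force Hamiltonian path: try every ordering of the nodes and
    return the first one in which consecutive nodes are all connected.
    O(n!) - illustrative only, infeasible for real read graphs."""
    edge_set = set(edges)
    for order in permutations(nodes):
        if all((a, b) in edge_set for a, b in zip(order, order[1:])):
            return list(order)
    return None

# three "reads" with two overlaps between them
print(hamiltonian_path(["r1", "r2", "r3"],
                       [("r1", "r2"), ("r2", "r3")]))  # -> ['r1', 'r2', 'r3']
```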

  • 36626 - Next Generation Sequencing Analysis

Overlap-Layout-Consensus
• Not good with many short reads -> lots of alignments!
• With short read lengths, hard to resolve repeats
• Good for long read lengths: PacBio, Oxford Nanopore, 10X Genomics, 454, Ion Torrent, Sanger

    • Example assemblers: Canu, Celera, Newbler

  • 36626 - Next Generation Sequencing Analysis

de Bruijn graph
• Directed graph of overlapping items (here DNA sequences)

    • Instead of comparing reads, decompose reads into k-mers

    • Graph is created by mapping the k-mers to the graph

    • Each k-mer only exists once in the graph

• Problem is reduced to walking an Eulerian path (visiting each edge once) - this is a solvable problem
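An Eulerian walk can be found efficiently with Hierholzer's algorithm. The sketch below is illustrative (not any specific assembler's code) and assumes an Eulerian path exists in the toy de Bruijn edges it is given:

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm: walk every edge exactly once.
    `edges` are (prefix, suffix) pairs of (k-1)-mers from a de Bruijn
    graph. Assumes an Eulerian path exists."""
    graph = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for a, b in edges:
        graph[a].append(b)
        out_deg[a] += 1
        in_deg[b] += 1
    # start at the node with one more outgoing than incoming edge, if any
    start = edges[0][0]
    for node in list(graph):
        if out_deg[node] - in_deg[node] == 1:
            start = node
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: commit this node
    return path[::-1]

# edges between 2-mers, as from the 3-mers of "GGACT"
path = eulerian_path([("GG", "GA"), ("GA", "AC"), ("AC", "CT")])
print(path[0] + "".join(n[-1] for n in path[1:]))  # -> GGACT
```

Unlike the Hamiltonian case, this runs in time linear in the number of edges, which is why de Bruijn assemblers scale to millions of reads.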

  • 36626 - Next Generation Sequencing Analysis

    Drawbacks ...

    • Lots of RAM required (1-1000 GB !)

• The optimal k cannot be identified a priori; it must be experimentally tested for each dataset

• small k: very complex graph; large k: limited overlap in low-coverage areas

    • Iterative approach to find best assembly

  • 36626 - Next Generation Sequencing Analysis

    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:
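The slide's 10 reads are shown in its figure rather than the text, so the construction is sketched below with two toy reads: decompose each read into k-mers (k = 3) and add one edge per k-mer, from its prefix (k-1)-mer to its suffix (k-1)-mer, so every distinct k-mer exists only once in the graph:

```python
from collections import defaultdict

def de_bruijn(reads, k=3):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer in
    a read contributes one edge from its prefix to its suffix. Edge
    multiplicities are counted instead of duplicating edges."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1  # prefix -> suffix
    return edges

# two overlapping toy reads (not the slide's actual 10 reads)
graph = de_bruijn(["GGACT", "GACTA"], k=3)
for (a, b), n in sorted(graph.items()):
    print(f"{a} -> {b} (x{n})")
```

Note how the shared k-mers of the two reads land on the same edges (their multiplicity rises) instead of creating new nodes; coverage collapses into edge counts.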

  • 36626 - Next Generation Sequencing Analysis

    separate paths. Short repeats of this type can be resolved, but theyrequire additional processing and therefore additional time.

    Another potential drawback of the de Bruijn approach is thatthe de Bruijn graph can require an enormous amount of computerspace (random access memory, or RAM). Unlike conventionaloverlap computations, which can be easily partitioned into mul-tiple jobs with distinct batches of reads, the construction andanalysis of a de Bruijn graph is not easily parallelized. As a result, deBruijn assemblers such as Velvet and ALLPATHS, which have beenused successfully on bacterial genomes, do not scale to large ge-nomes. For a human-sized genome, these programs would requireseveral terabytes of RAM to store their de Bruijn graphs, which is farmore memory than is available on most computers.

    To date, only two de Bruijn graph assemblers have been shownto have the ability to assemble a mammalian-sized genome. ABySS(Simpson et al. 2009) assembled a human genome in 87 h ona cluster of 21 eight-core machines each with 16 GB of RAM (168cores, 336 GB of RAM total). SOAPdenovo assembled a human ge-nome in 40 h using a single computer with 32 cores and 512 GB ofRAM (Li et al. 2010). Although these types of computing resourcesare not widely available, they are within reach for large-scale sci-entific centers.

    In theory, the size of the de Bruijn graph depends only on thesize of the genome, including polymorphic alleles, and should beindependent of the number of reads. However, because sequencingerrors create their own graph nodes, increasing the number of readsinevitably increases the size of the de Bruijn graph. In the de novoassembly of human from short reads, SOAPdenovo reduced thenumber of 25-mers from 14.6 billion to 5.0 billion by correctingerrors before constructing the de Bruijn graph (Li et al. 2010). Itserror correction method first counts the number of occurrences ofall k-mers in the reads and replaces any k-mers occurring less thanthree times with the highest frequency alternative k-mer.

    Choice of assembler andsequencing strategyOnly de Bruijn graph assemblers havedemonstrated the ability to successfullyassemble very short reads (100 bp), overlap graph as-semblers have been quite successful andhave a much better track record overall. Ade Bruijn graph assembler should func-tion with longer reads as well, but a largedifference between the read length andthe k-mer length will result in many morebranching nodes than in the simplifiedoverlap graph. The precise conditions un-der which one assembly method is supe-rior to the other remain an open question,and the answer may ultimately dependon the specific assembler and genomecharacteristics.

    As Figure 3 illustrates, there is a di-rect and dramatic tradeoff among readlength, coverage, and expected contiglength in a genome assembly. The figureshows the theoretical expected contigslength, based on the Lander-Watermanmodel (Lander and Waterman 1988), inan assembly where all overlaps have beendetected perfectly. This model, which was

    widely applied for predicting assembly quality in the Sanger se-quencing era, predicts that under ideal conditions, 710-bp readsshould require 33 coverage to produce 4-kbp average contig sizes,while 30-bp reads would require 283 coverage. In practice, themodel is inadequate for modeling very short reads: The figure alsoshows the actual contig sizes for the dog genome, assembled with710-bp reads, and the panda genome, assembled with 52-bp reads.The dog assembly tracked closely to the theoretical prediction,while the panda assembly has contig sizes that are many timeslower than predicted by the model. The large discrepancy betweenpredicted and observed assembly quality results from the fact that

    Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the setof 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bpare indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, areshown as dotted edges. In a de Bruin graph (C ), a node is created for every k-mer in all the reads; herethe k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mersoverlap by k ! 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here wehave only considered the forward orientation of each sequence to simplify the figure.

    Figure 3. Expected average contig length for a range of different readlengths and coverage values. Also shown are the average contig lengthsand N50 lengths for the dog genome, assembled with 710-bp reads, andthe panda genome, assembled with reads averaging 52 bp in length.

    Schatz et al.

    1168 Genome Researchwww.genome.org

    Cold Spring Harbor Laboratory Press on December 2, 2010 - Published by genome.cshlp.orgDownloaded from

    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:

  • 36626 - Next Generation Sequencing Analysis

    separate paths. Short repeats of this type can be resolved, but theyrequire additional processing and therefore additional time.

    Another potential drawback of the de Bruijn approach is thatthe de Bruijn graph can require an enormous amount of computerspace (random access memory, or RAM). Unlike conventionaloverlap computations, which can be easily partitioned into mul-tiple jobs with distinct batches of reads, the construction andanalysis of a de Bruijn graph is not easily parallelized. As a result, deBruijn assemblers such as Velvet and ALLPATHS, which have beenused successfully on bacterial genomes, do not scale to large ge-nomes. For a human-sized genome, these programs would requireseveral terabytes of RAM to store their de Bruijn graphs, which is farmore memory than is available on most computers.

    To date, only two de Bruijn graph assemblers have been shownto have the ability to assemble a mammalian-sized genome. ABySS(Simpson et al. 2009) assembled a human genome in 87 h ona cluster of 21 eight-core machines each with 16 GB of RAM (168cores, 336 GB of RAM total). SOAPdenovo assembled a human ge-nome in 40 h using a single computer with 32 cores and 512 GB ofRAM (Li et al. 2010). Although these types of computing resourcesare not widely available, they are within reach for large-scale sci-entific centers.

    In theory, the size of the de Bruijn graph depends only on thesize of the genome, including polymorphic alleles, and should beindependent of the number of reads. However, because sequencingerrors create their own graph nodes, increasing the number of readsinevitably increases the size of the de Bruijn graph. In the de novoassembly of human from short reads, SOAPdenovo reduced thenumber of 25-mers from 14.6 billion to 5.0 billion by correctingerrors before constructing the de Bruijn graph (Li et al. 2010). Itserror correction method first counts the number of occurrences ofall k-mers in the reads and replaces any k-mers occurring less thanthree times with the highest frequency alternative k-mer.

    Choice of assembler andsequencing strategyOnly de Bruijn graph assemblers havedemonstrated the ability to successfullyassemble very short reads (100 bp), overlap graph as-semblers have been quite successful andhave a much better track record overall. Ade Bruijn graph assembler should func-tion with longer reads as well, but a largedifference between the read length andthe k-mer length will result in many morebranching nodes than in the simplifiedoverlap graph. The precise conditions un-der which one assembly method is supe-rior to the other remain an open question,and the answer may ultimately dependon the specific assembler and genomecharacteristics.

    As Figure 3 illustrates, there is a di-rect and dramatic tradeoff among readlength, coverage, and expected contiglength in a genome assembly. The figureshows the theoretical expected contigslength, based on the Lander-Watermanmodel (Lander and Waterman 1988), inan assembly where all overlaps have beendetected perfectly. This model, which was

    widely applied for predicting assembly quality in the Sanger se-quencing era, predicts that under ideal conditions, 710-bp readsshould require 33 coverage to produce 4-kbp average contig sizes,while 30-bp reads would require 283 coverage. In practice, themodel is inadequate for modeling very short reads: The figure alsoshows the actual contig sizes for the dog genome, assembled with710-bp reads, and the panda genome, assembled with 52-bp reads.The dog assembly tracked closely to the theoretical prediction,while the panda assembly has contig sizes that are many timeslower than predicted by the model. The large discrepancy betweenpredicted and observed assembly quality results from the fact that

    Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the setof 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bpare indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, areshown as dotted edges. In a de Bruin graph (C ), a node is created for every k-mer in all the reads; herethe k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mersoverlap by k ! 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here wehave only considered the forward orientation of each sequence to simplify the figure.

    Figure 3. Expected average contig length for a range of different readlengths and coverage values. Also shown are the average contig lengthsand N50 lengths for the dog genome, assembled with 710-bp reads, andthe panda genome, assembled with reads averaging 52 bp in length.

    Schatz et al.

    1168 Genome Researchwww.genome.org

    Cold Spring Harbor Laboratory Press on December 2, 2010 - Published by genome.cshlp.orgDownloaded from

    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:

    GAC

  • 36626 - Next Generation Sequencing Analysis

    separate paths. Short repeats of this type can be resolved, but theyrequire additional processing and therefore additional time.

    Another potential drawback of the de Bruijn approach is thatthe de Bruijn graph can require an enormous amount of computerspace (random access memory, or RAM). Unlike conventionaloverlap computations, which can be easily partitioned into mul-tiple jobs with distinct batches of reads, the construction andanalysis of a de Bruijn graph is not easily parallelized. As a result, deBruijn assemblers such as Velvet and ALLPATHS, which have beenused successfully on bacterial genomes, do not scale to large ge-nomes. For a human-sized genome, these programs would requireseveral terabytes of RAM to store their de Bruijn graphs, which is farmore memory than is available on most computers.

    To date, only two de Bruijn graph assemblers have been shownto have the ability to assemble a mammalian-sized genome. ABySS(Simpson et al. 2009) assembled a human genome in 87 h ona cluster of 21 eight-core machines each with 16 GB of RAM (168cores, 336 GB of RAM total). SOAPdenovo assembled a human ge-nome in 40 h using a single computer with 32 cores and 512 GB ofRAM (Li et al. 2010). Although these types of computing resourcesare not widely available, they are within reach for large-scale sci-entific centers.

    In theory, the size of the de Bruijn graph depends only on thesize of the genome, including polymorphic alleles, and should beindependent of the number of reads. However, because sequencingerrors create their own graph nodes, increasing the number of readsinevitably increases the size of the de Bruijn graph. In the de novoassembly of human from short reads, SOAPdenovo reduced thenumber of 25-mers from 14.6 billion to 5.0 billion by correctingerrors before constructing the de Bruijn graph (Li et al. 2010). Itserror correction method first counts the number of occurrences ofall k-mers in the reads and replaces any k-mers occurring less thanthree times with the highest frequency alternative k-mer.

    Choice of assembler and sequencing strategy

    Only de Bruijn graph assemblers have demonstrated the ability to successfully assemble very short reads (<100 bp). For longer reads (>100 bp), overlap graph assemblers have been quite successful and have a much better track record overall. A de Bruijn graph assembler should function with longer reads as well, but a large difference between the read length and the k-mer length will result in many more branching nodes than in the simplified overlap graph. The precise conditions under which one assembly method is superior to the other remain an open question, and the answer may ultimately depend on the specific assembler and genome characteristics.

    As Figure 3 illustrates, there is a direct and dramatic tradeoff among read length, coverage, and expected contig length in a genome assembly. The figure shows the theoretical expected contig length, based on the Lander-Waterman model (Lander and Waterman 1988), in an assembly where all overlaps have been detected perfectly. This model, which was widely applied for predicting assembly quality in the Sanger sequencing era, predicts that under ideal conditions, 710-bp reads should require 3× coverage to produce 4-kbp average contig sizes, while 30-bp reads would require 28× coverage. In practice, the model is inadequate for modeling very short reads: The figure also shows the actual contig sizes for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with 52-bp reads. The dog assembly tracked closely to the theoretical prediction, while the panda assembly has contig sizes that are many times lower than predicted by the model. The large discrepancy between predicted and observed assembly quality results from the fact that
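    The Lander-Waterman expectation can be evaluated numerically. The sketch below uses the standard expected-island-length formula; the minimum detectable overlap is an assumed parameter (the settings behind Figure 3 are not given in the text), so the numbers are ballpark only.

```python
import math

def expected_contig_len(read_len, coverage, min_overlap):
    """Lander-Waterman expected contig (island) length under perfect
    overlap detection. min_overlap is the smallest detectable overlap T;
    sigma = 1 - T/L is the usable fraction of each read."""
    sigma = 1 - min_overlap / read_len
    c = coverage
    return read_len * ((math.exp(c * sigma) - 1) / c + 1 - sigma)

# Longer reads buy dramatically longer contigs at the same coverage.
# (min_overlap = 20 bp is an illustrative assumption.)
print(expected_contig_len(710, 3, 20))  # ~4 kbp, matching the text
print(expected_contig_len(30, 3, 20))   # barely longer than one read
```

    The exponential dependence on coverage times usable read fraction is what makes short reads so punishing: shaving the read length cuts sigma and the whole exponent with it.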

    Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note that here we have only considered the forward orientation of each sequence to simplify the figure.

    Figure 3. Expected average contig length for a range of different read lengths and coverage values. Also shown are the average contig lengths and N50 lengths for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with reads averaging 52 bp in length.

    Schatz et al.

    1168 Genome Research, www.genome.org

    Downloaded from genome.cshlp.org, published by Cold Spring Harbor Laboratory Press, on December 2, 2010.

    How is the graph constructed?

    • Same 10 reads: extract k-mers (k = 3) from each read and tally them in the graph:

    GAC
      1

    GAC ACC
      1   1

    GAC ACC CCT
      1   1   1

    GAC ACC CCT CTA TAC ACA
      1   1   1   1   1   1

    GAC ACC CCT CTA TAC ACA
      1   2   2   2   2   2

  • 36626 - Next Generation Sequencing Analysis

    separate paths. Short repeats of this type can be resolved, but theyrequire additional processing and therefore additional time.

    Another potential drawback of the de Bruijn approach is thatthe de Bruijn graph can require an enormous amount of computerspace (random access memory, or RAM). Unlike conventionaloverlap computations, which can be easily partitioned into mul-tiple jobs with distinct batches of reads, the construction andanalysis of a de Bruijn graph is not easily parallelized. As a result, deBruijn assemblers such as Velvet and ALLPATHS, which have beenused successfully on bacterial genomes, do not scale to large ge-nomes. For a human-sized genome, these programs would requireseveral terabytes of RAM to store their de Bruijn graphs, which is farmore memory than is available on most computers.

    To date, only two de Bruijn graph assemblers have been shownto have the ability to assemble a mammalian-sized genome. ABySS(Simpson et al. 2009) assembled a human genome in 87 h ona cluster of 21 eight-core machines each with 16 GB of RAM (168cores, 336 GB of RAM total). SOAPdenovo assembled a human ge-nome in 40 h using a single computer with 32 cores and 512 GB ofRAM (Li et al. 2010). Although these types of computing resourcesare not widely available, they are within reach for large-scale sci-entific centers.

    In theory, the size of the de Bruijn graph depends only on thesize of the genome, including polymorphic alleles, and should beindependent of the number of reads. However, because sequencingerrors create their own graph nodes, increasing the number of readsinevitably increases the size of the de Bruijn graph. In the de novoassembly of human from short reads, SOAPdenovo reduced thenumber of 25-mers from 14.6 billion to 5.0 billion by correctingerrors before constructing the de Bruijn graph (Li et al. 2010). Itserror correction method first counts the number of occurrences ofall k-mers in the reads and replaces any k-mers occurring less thanthree times with the highest frequency alternative k-mer.

    Choice of assembler andsequencing strategyOnly de Bruijn graph assemblers havedemonstrated the ability to successfullyassemble very short reads (100 bp), overlap graph as-semblers have been quite successful andhave a much better track record overall. Ade Bruijn graph assembler should func-tion with longer reads as well, but a largedifference between the read length andthe k-mer length will result in many morebranching nodes than in the simplifiedoverlap graph. The precise conditions un-der which one assembly method is supe-rior to the other remain an open question,and the answer may ultimately dependon the specific assembler and genomecharacteristics.

    As Figure 3 illustrates, there is a di-rect and dramatic tradeoff among readlength, coverage, and expected contiglength in a genome assembly. The figureshows the theoretical expected contigslength, based on the Lander-Watermanmodel (Lander and Waterman 1988), inan assembly where all overlaps have beendetected perfectly. This model, which was

    widely applied for predicting assembly quality in the Sanger se-quencing era, predicts that under ideal conditions, 710-bp readsshould require 33 coverage to produce 4-kbp average contig sizes,while 30-bp reads would require 283 coverage. In practice, themodel is inadequate for modeling very short reads: The figure alsoshows the actual contig sizes for the dog genome, assembled with710-bp reads, and the panda genome, assembled with 52-bp reads.The dog assembly tracked closely to the theoretical prediction,while the panda assembly has contig sizes that are many timeslower than predicted by the model. The large discrepancy betweenpredicted and observed assembly quality results from the fact that

    Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here we have only considered the forward orientation of each sequence to simplify the figure.
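The construction described in the caption (one node per k-mer, a directed edge between each pair of successive k-mers) can be sketched directly; the reads below are invented for illustration and, as in the figure, only the forward orientation is considered.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Nodes are k-mers; a directed edge joins each pair of successive
    k-mers within a read (adjacent k-mers overlap by k - 1 bases)."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            edges[read[i:i + k]].add(read[i + 1:i + k + 1])
    return edges

reads = ["GACCTACA", "CCTACAAG", "TACAAGTT"]
for node, succ in sorted(de_bruijn_graph(reads, 3).items()):
    print(node, "->", *sorted(succ))
```

A k-mer that recurs with two different successors would collect two outgoing edges — the fork that the caption attributes to repeat sequences.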

    Figure 3. Expected average contig length for a range of different read lengths and coverage values. Also shown are the average contig lengths and N50 lengths for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with reads averaging 52 bp in length.

    Schatz et al., Genome Research, p. 1168 (www.genome.org). Downloaded from genome.cshlp.org on December 2, 2010; published by Cold Spring Harbor Laboratory Press.

    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:

      k-mer: GAC ACC CCT CTA TAC ACA CAA
      count:   1   2   2   2   2   2   1

  • 36626 - Next Generation Sequencing Analysis

    separate paths. Short repeats of this type can be resolved, but they require additional processing and therefore additional time.

    Another potential drawback of the de Bruijn approach is that the de Bruijn graph can require an enormous amount of computer space (random access memory, or RAM). Unlike conventional overlap computations, which can be easily partitioned into multiple jobs with distinct batches of reads, the construction and analysis of a de Bruijn graph is not easily parallelized. As a result, de Bruijn assemblers such as Velvet and ALLPATHS, which have been used successfully on bacterial genomes, do not scale to large genomes. For a human-sized genome, these programs would require several terabytes of RAM to store their de Bruijn graphs, which is far more memory than is available on most computers.

    To date, only two de Bruijn graph assemblers have been shown to have the ability to assemble a mammalian-sized genome. ABySS (Simpson et al. 2009) assembled a human genome in 87 h on a cluster of 21 eight-core machines each with 16 GB of RAM (168 cores, 336 GB of RAM total). SOAPdenovo assembled a human genome in 40 h using a single computer with 32 cores and 512 GB of RAM (Li et al. 2010). Although these types of computing resources are not widely available, they are within reach for large-scale scientific centers.
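A back-of-envelope calculation makes the memory pressure concrete. Taking the 5.0 billion corrected 25-mers from the text as given, and assuming a 2-bits-per-base packed key plus a 16-byte per-node overhead for hash-table slot, edge, and coverage fields (both encoding choices are illustrative assumptions, not any particular assembler's layout):

```python
def graph_memory_gb(n_kmers, k, overhead_bytes=16):
    """Rough RAM estimate for a de Bruijn graph's node table."""
    key_bytes = -(-2 * k // 8)  # ceil(2*k / 8): bases packed at 2 bits each
    return n_kmers * (key_bytes + overhead_bytes) / 1e9

print(round(graph_memory_gb(5_000_000_000, 25)))   # corrected k-mers
print(round(graph_memory_gb(14_600_000_000, 25)))  # uncorrected k-mers
```

Even with aggressive packing, the corrected graph lands above 100 GB under these assumptions — in line with the 512-GB machine SOAPdenovo used — and skipping error correction roughly triples the footprint.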


    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:

      k-mer: GAC ACC CCT CTA TAC ACA CAA AAG AGT
      count:   1   2   3   4   4   4   3   2   1


    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:

      k-mer: GAC ACC CCT CTA TAC ACA CAA AAG AGT GTT TTA TAG
      count:   1   2   3   4   5   6   6   5   5   3   2   1
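Once all k-mers are in the graph, a non-branching path is spelled out by taking the first k-mer whole and appending the last base of each successor. Applying this to the twelve 3-mers on the slide recovers one contig; treating them as a single linear chain is an assumption based on the order in which they are listed.

```python
def path_to_contig(kmers):
    """Spell the sequence of a non-branching k-mer path: the first k-mer
    plus the last base of every subsequent k-mer."""
    contig = kmers[0]
    for kmer in kmers[1:]:
        assert contig.endswith(kmer[:-1])  # successive k-mers overlap by k-1
        contig += kmer[-1]
    return contig

kmers = ["GAC", "ACC", "CCT", "CTA", "TAC", "ACA",
         "CAA", "AAG", "AGT", "GTT", "TTA", "TAG"]
print(path_to_contig(kmers))  # GACCTACAAGTTAG
```

This is the step that turns a chain of tiny k-mers back into a contig: 12 overlapping 3-mers collapse into one 14-bp sequence.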
