27626 - Next Generation Sequencing Analysis
de novo assembly
Simon Rasmussen
36626: Next Generation Sequencing Analysis, DTU Bioinformatics



  • 27626 - Next Generation Sequencing Analysis

Generalized NGS analysis

Question → Raw reads → Pre-processing → Assembly: alignment / de novo → Application specific: variant calling, count matrix, ... → Compare samples / methods → Answer?

(Slide figure annotates the pipeline with a data-size axis)


  • 36626 - Next Generation Sequencing Analysis

    Merge small DNA fragments together so they form a previously unknown sequence

    What is de novo assembly?

  • 36626 - Next Generation Sequencing Analysis

Merge millions of reads together so they form previously unknown sequences

    What is de novo assembly?

  • de novo assembly
    • Assemble reads into longer fragments
    • Find overlap between reads
    • Many approaches

    reads → contigs → scaffolds


  • Let's try to assemble some reads! Rules:
    • a minimum of 7-bp overlap
    • overlap must not include any N bases
    • same orientation, so that the sequence can be read left to right
    • there may be 1-bp differences
    • simplified - no double-stranded DNA

..NNNNGGACTATGATTCG ||||||| TGATTCGAGGCTAANN..

..NNNNNNNNCGATTCTGATCCGA ||||||| GTCCTCGATTCTNNNNNNNN..

..NNNNCGGACTATGATT |||||| ATGATTCGAGGCTAANN..

..NNNNNNNNCGCTACTGATCCGA || | ||| GTCCTCGATTCTGNNNNNNN..

  • Which are valid?

    ..NNNNGGACTATGATTCG ||||||| TGATTCGAGGCTAANN..

    ..NNNNNNNNCGATTCTGATCCGA ||||||| GTCCTCGATTCTNNNNNNNN..

    ..NNNNCGGACTATGATT |||||| ATGATTCGAGGCTAANN..

    ..NNNNNNNNCGCTACTGATCCGA || | ||| GTCCTCGATTCTGNNNNNNN..

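The rules above can be sketched as a small check. This is a minimal sketch, assuming a simple suffix-prefix comparison in one orientation; the function name and the 1-mismatch threshold are illustrative, not from the slides:

```python
def valid_overlap(left, right, min_len=7, max_mismatch=1):
    """Longest valid overlap between the suffix of `left` and the prefix
    of `right` under the slide's rules: >= 7 bp, no N bases in the
    overlap, at most 1 mismatch, same left-to-right orientation.
    Returns 0 if no valid overlap exists. (Illustrative sketch.)"""
    best = 0
    for n in range(min_len, min(len(left), len(right)) + 1):
        suffix, prefix = left[-n:], right[:n]
        if "N" in suffix or "N" in prefix:
            continue  # overlap must not include any N bases
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatch:
            best = n
    return best

# first pair from the slide: 7-bp perfect overlap "TGATTCG"
print(valid_overlap("GGACTATGATTCG", "TGATTCGAGGCTAA"))  # -> 7
```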

  • 36626 - Next Generation Sequencing Analysis

    Which approaches?

    • Greedy (“Simple” approach)

    • Overlap-Layout-Consensus (OLC)

    • de Bruijn graphs

  • 36626 - Next Generation Sequencing Analysis

Simple approach - Greedy

Pseudo code:

    1. Pairwise alignment of all reads

    2. Identify fragments that have largest overlap

    3. Merge these

    4. Repeat until all overlaps are used

    • Can only resolve repeats smaller than read length

• High computational cost with an increasing number of reads
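The four pseudo-code steps above can be sketched as follows. This is a toy illustration (the example reads and the `min_len` cutoff are made up; real assemblers also handle reverse complements and base qualities):

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    (at least min_len bp), or 0 if there is none."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Greedy assembly: repeatedly merge the pair of fragments with the
    largest overlap until no overlaps remain. Assumes distinct reads."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for a in reads:                       # 1. pairwise alignment of all reads
            for b in reads:
                if a is not b:
                    n = overlap(a, b, min_len)
                    if n > best[0]:           # 2. identify the largest overlap
                        best = (n, a, b)
        n, a, b = best
        if n == 0:                            # 4. stop when no overlaps are left
            break
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])               # 3. merge the best pair
    return reads

print(greedy_assemble(["ATGGCC", "GGCCTT", "CCTTAA"]))  # -> ['ATGGCCTTAA']
```

The double loop over all pairs in every round is the "high computational cost" the slide warns about.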

  • 36626 - Next Generation Sequencing Analysis

Reads → Contigs → Scaffolds

    • Overlap Layout Consensus and de Bruijn use a similar general approach.

    1. Try to correct sequence errors in reads with high coverage

    2. Assemble reads to contiguous sequence fragments “contigs”

    3. Identify repeat contigs

    4. Combine and order contigs to “scaffolds”, with gaps representing regions of uncertainty

  • 36626 - Next Generation Sequencing Analysis

    Overlap-Layout-Consensus

• Create overlap graph by all-vs-all alignment (Overlap)
• Build graph where each node is a read and edges are overlaps between reads (Layout)
• Example

separate paths. Short repeats of this type can be resolved, but they require additional processing and therefore additional time.

Another potential drawback of the de Bruijn approach is that the de Bruijn graph can require an enormous amount of computer space (random access memory, or RAM). Unlike conventional overlap computations, which can be easily partitioned into multiple jobs with distinct batches of reads, the construction and analysis of a de Bruijn graph is not easily parallelized. As a result, de Bruijn assemblers such as Velvet and ALLPATHS, which have been used successfully on bacterial genomes, do not scale to large genomes. For a human-sized genome, these programs would require several terabytes of RAM to store their de Bruijn graphs, which is far more memory than is available on most computers.

To date, only two de Bruijn graph assemblers have been shown to have the ability to assemble a mammalian-sized genome. ABySS (Simpson et al. 2009) assembled a human genome in 87 h on a cluster of 21 eight-core machines each with 16 GB of RAM (168 cores, 336 GB of RAM total). SOAPdenovo assembled a human genome in 40 h using a single computer with 32 cores and 512 GB of RAM (Li et al. 2010). Although these types of computing resources are not widely available, they are within reach for large-scale scientific centers.

In theory, the size of the de Bruijn graph depends only on the size of the genome, including polymorphic alleles, and should be independent of the number of reads. However, because sequencing errors create their own graph nodes, increasing the number of reads inevitably increases the size of the de Bruijn graph. In the de novo assembly of human from short reads, SOAPdenovo reduced the number of 25-mers from 14.6 billion to 5.0 billion by correcting errors before constructing the de Bruijn graph (Li et al. 2010). Its error correction method first counts the number of occurrences of all k-mers in the reads and replaces any k-mers occurring less than three times with the highest-frequency alternative k-mer.
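The error-correction idea just described (count all k-mers, then replace any k-mer seen fewer than three times with its highest-frequency single-base variant) can be sketched as below. Function names are illustrative and this is not SOAPdenovo's actual implementation:

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count occurrences of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_kmer(kmer, counts, min_count=3):
    """If a k-mer is rare (< min_count), return the highest-frequency
    single-base variant; otherwise return it unchanged."""
    if counts[kmer] >= min_count:
        return kmer
    best, best_count = kmer, counts[kmer]
    for i in range(len(kmer)):
        for base in "ACGT":
            cand = kmer[:i] + base + kmer[i + 1:]
            if counts[cand] > best_count:
                best, best_count = cand, counts[cand]
    return best

# toy data: one read carries a likely sequencing error (C instead of G)
counts = kmer_counts(["ATGGC"] * 5 + ["ATGCC"], k=4)
print(correct_kmer("ATGC", counts))  # -> ATGG (rare k-mer corrected)
```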

Choice of assembler and sequencing strategy

Only de Bruijn graph assemblers have demonstrated the ability to successfully assemble very short reads (<100 bp); for longer reads, overlap graph assemblers have been quite successful and have a much better track record overall. A de Bruijn graph assembler should function with longer reads as well, but a large difference between the read length and the k-mer length will result in many more branching nodes than in the simplified overlap graph. The precise conditions under which one assembly method is superior to the other remain an open question, and the answer may ultimately depend on the specific assembler and genome characteristics.

As Figure 3 illustrates, there is a direct and dramatic tradeoff among read length, coverage, and expected contig length in a genome assembly. The figure shows the theoretical expected contig length, based on the Lander-Waterman model (Lander and Waterman 1988), in an assembly where all overlaps have been detected perfectly. This model, which was widely applied for predicting assembly quality in the Sanger sequencing era, predicts that under ideal conditions, 710-bp reads should require 3× coverage to produce 4-kbp average contig sizes, while 30-bp reads would require 28× coverage. In practice, the model is inadequate for modeling very short reads: the figure also shows the actual contig sizes for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with 52-bp reads. The dog assembly tracked closely to the theoretical prediction, while the panda assembly has contig sizes that are many times lower than predicted by the model. The large discrepancy between predicted and observed assembly quality results from the fact that

Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k-1 bases. In both approaches, repeat sequences create a fork in the graph. Note here we have only considered the forward orientation of each sequence to simplify the figure.

Figure 3. Expected average contig length for a range of different read lengths and coverage values. Also shown are the average contig lengths and N50 lengths for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with reads averaging 52 bp in length.

Schatz et al.
1168 Genome Research, www.genome.org
Downloaded from genome.cshlp.org (Cold Spring Harbor Laboratory Press) on December 2, 2010


    Schatz et al., Genome Res, 2010

  • 36626 - Next Generation Sequencing Analysis

Overlap-Layout-Consensus
• Create consensus sequence
• We need to use graph theory to solve the graph
• Walk the Hamiltonian path, i.e. visit each node exactly once


Imagine trying to solve this for a graph of hundreds of thousands of nodes (= reads) - this is an NP-complete problem
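To see why this blows up, here is a brute-force Hamiltonian path search on a toy directed graph (node names are illustrative). Trying every ordering of n nodes is O(n!), which is hopeless for hundreds of thousands of reads:

```python
from itertools import permutations

def hamiltonian_path(nodes, edges):
    """Brute-force Hamiltonian path: try every ordering of the nodes and
    return the first one in which consecutive nodes are all connected.
    O(n!) - illustrative only, infeasible for real read graphs."""
    edge_set = set(edges)
    for order in permutations(nodes):
        if all((a, b) in edge_set for a, b in zip(order, order[1:])):
            return list(order)
    return None

# three "reads" with two overlaps between them
print(hamiltonian_path(["r1", "r2", "r3"],
                       [("r1", "r2"), ("r2", "r3")]))  # -> ['r1', 'r2', 'r3']
```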

  • 36626 - Next Generation Sequencing Analysis

Overlap-Layout-Consensus
• Not good with many short reads -> lots of alignments!
• With short read lengths, hard to resolve repeats
• Good for long read lengths: PacBio, Oxford Nanopore, 10X Genomics, 454, Ion Torrent, Sanger

    • Example assemblers: Canu, Celera, Newbler

  • 36626 - Next Generation Sequencing Analysis

de Bruijn graph
• Directed graph of overlapping items (here DNA sequences)

    • Instead of comparing reads, decompose reads into k-mers

    • Graph is created by mapping the k-mers to the graph

    • Each k-mer only exists once in the graph

• Problem is reduced to walking an Eulerian path (visiting each edge once) - this is a solvable problem
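An Eulerian walk can be found efficiently with Hierholzer's algorithm. The sketch below is illustrative (not any specific assembler's code) and assumes an Eulerian path exists in the toy de Bruijn edges it is given:

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm: walk every edge exactly once.
    `edges` are (prefix, suffix) pairs of (k-1)-mers from a de Bruijn
    graph. Assumes an Eulerian path exists."""
    graph = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for a, b in edges:
        graph[a].append(b)
        out_deg[a] += 1
        in_deg[b] += 1
    # start at the node with one more outgoing than incoming edge, if any
    start = edges[0][0]
    for node in list(graph):
        if out_deg[node] - in_deg[node] == 1:
            start = node
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: commit this node
    return path[::-1]

# edges between 2-mers, as from the 3-mers of "GGACT"
path = eulerian_path([("GG", "GA"), ("GA", "AC"), ("AC", "CT")])
print(path[0] + "".join(n[-1] for n in path[1:]))  # -> GGACT
```

Unlike the Hamiltonian case, this runs in time linear in the number of edges, which is why de Bruijn assemblers scale to millions of reads.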

  • 36626 - Next Generation Sequencing Analysis

    Drawbacks ...

    • Lots of RAM required (1-1000 GB !)

• The optimal k cannot be identified a priori; it must be experimentally tested for each dataset

• small k: very complex graph; large k: limited overlap in low-coverage areas

    • Iterative approach to find best assembly

  • 36626 - Next Generation Sequencing Analysis

    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:
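The slide's 10 reads are shown in its figure rather than the text, so the construction is sketched below with two toy reads: decompose each read into k-mers (k = 3) and add one edge per k-mer, from its prefix (k-1)-mer to its suffix (k-1)-mer, so every distinct k-mer exists only once in the graph:

```python
from collections import defaultdict

def de_bruijn(reads, k=3):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer in
    a read contributes one edge from its prefix to its suffix. Edge
    multiplicities are counted instead of duplicating edges."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1  # prefix -> suffix
    return edges

# two overlapping toy reads (not the slide's actual 10 reads)
graph = de_bruijn(["GGACT", "GACTA"], k=3)
for (a, b), n in sorted(graph.items()):
    print(f"{a} -> {b} (x{n})")
```

Note how the shared k-mers of the two reads land on the same edges (their multiplicity rises) instead of creating new nodes; coverage collapses into edge counts.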

  • 36626 - Next Generation Sequencing Analysis

    separate paths. Short repeats of this type can be resolved, but theyrequire additional processing and therefore additional time.

    Another potential drawback of the de Bruijn approach is thatthe de Bruijn graph can require an enormous amount of computerspace (random access memory, or RAM). Unlike conventionaloverlap computations, which can be easily partitioned into mul-tiple jobs with distinct batches of reads, the construction andanalysis of a de Bruijn graph is not easily parallelized. As a result, deBruijn assemblers such as Velvet and ALLPATHS, which have beenused successfully on bacterial genomes, do not scale to large ge-nomes. For a human-sized genome, these programs would requireseveral terabytes of RAM to store their de Bruijn graphs, which is farmore memory than is available on most computers.

    To date, only two de Bruijn graph assemblers have been shownto have the ability to assemble a mammalian-sized genome. ABySS(Simpson et al. 2009) assembled a human genome in 87 h ona cluster of 21 eight-core machines each with 16 GB of RAM (168cores, 336 GB of RAM total). SOAPdenovo assembled a human ge-nome in 40 h using a single computer with 32 cores and 512 GB ofRAM (Li et al. 2010). Although these types of computing resourcesare not widely available, they are within reach for large-scale sci-entific centers.

    In theory, the size of the de Bruijn graph depends only on thesize of the genome, including polymorphic alleles, and should beindependent of the number of reads. However, because sequencingerrors create their own graph nodes, increasing the number of readsinevitably increases the size of the de Bruijn graph. In the de novoassembly of human from short reads, SOAPdenovo reduced thenumber of 25-mers from 14.6 billion to 5.0 billion by correctingerrors before constructing the de Bruijn graph (Li et al. 2010). Itserror correction method first counts the number of occurrences ofall k-mers in the reads and replaces any k-mers occurring less thanthree times with the highest frequency alternative k-mer.

    Choice of assembler andsequencing strategyOnly de Bruijn graph assemblers havedemonstrated the ability to successfullyassemble very short reads (100 bp), overlap graph as-semblers have been quite successful andhave a much better track record overall. Ade Bruijn graph assembler should func-tion with longer reads as well, but a largedifference between the read length andthe k-mer length will result in many morebranching nodes than in the simplifiedoverlap graph. The precise conditions un-der which one assembly method is supe-rior to the other remain an open question,and the answer may ultimately dependon the specific assembler and genomecharacteristics.

    As Figure 3 illustrates, there is a di-rect and dramatic tradeoff among readlength, coverage, and expected contiglength in a genome assembly. The figureshows the theoretical expected contigslength, based on the Lander-Watermanmodel (Lander and Waterman 1988), inan assembly where all overlaps have beendetected perfectly. This model, which was

    widely applied for predicting assembly quality in the Sanger se-quencing era, predicts that under ideal conditions, 710-bp readsshould require 33 coverage to produce 4-kbp average contig sizes,while 30-bp reads would require 283 coverage. In practice, themodel is inadequate for modeling very short reads: The figure alsoshows the actual contig sizes for the dog genome, assembled with710-bp reads, and the panda genome, assembled with 52-bp reads.The dog assembly tracked closely to the theoretical prediction,while the panda assembly has contig sizes that are many timeslower than predicted by the model. The large discrepancy betweenpredicted and observed assembly quality results from the fact that

    Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the setof 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bpare indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, areshown as dotted edges. In a de Bruin graph (C ), a node is created for every k-mer in all the reads; herethe k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mersoverlap by k ! 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here wehave only considered the forward orientation of each sequence to simplify the figure.

    Figure 3. Expected average contig length for a range of different readlengths and coverage values. Also shown are the average contig lengthsand N50 lengths for the dog genome, assembled with 710-bp reads, andthe panda genome, assembled with reads averaging 52 bp in length.

    Schatz et al.

    1168 Genome Researchwww.genome.org

    Cold Spring Harbor Laboratory Press on December 2, 2010 - Published by genome.cshlp.orgDownloaded from

    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:

  • 36626 - Next Generation Sequencing Analysis

    separate paths. Short repeats of this type can be resolved, but theyrequire additional processing and therefore additional time.

    Another potential drawback of the de Bruijn approach is thatthe de Bruijn graph can require an enormous amount of computerspace (random access memory, or RAM). Unlike conventionaloverlap computations, which can be easily partitioned into mul-tiple jobs with distinct batches of reads, the construction andanalysis of a de Bruijn graph is not easily parallelized. As a result, deBruijn assemblers such as Velvet and ALLPATHS, which have beenused successfully on bacterial genomes, do not scale to large ge-nomes. For a human-sized genome, these programs would requireseveral terabytes of RAM to store their de Bruijn graphs, which is farmore memory than is available on most computers.

    To date, only two de Bruijn graph assemblers have been shownto have the ability to assemble a mammalian-sized genome. ABySS(Simpson et al. 2009) assembled a human genome in 87 h ona cluster of 21 eight-core machines each with 16 GB of RAM (168cores, 336 GB of RAM total). SOAPdenovo assembled a human ge-nome in 40 h using a single computer with 32 cores and 512 GB ofRAM (Li et al. 2010). Although these types of computing resourcesare not widely available, they are within reach for large-scale sci-entific centers.

    In theory, the size of the de Bruijn graph depends only on thesize of the genome, including polymorphic alleles, and should beindependent of the number of reads. However, because sequencingerrors create their own graph nodes, increasing the number of readsinevitably increases the size of the de Bruijn graph. In the de novoassembly of human from short reads, SOAPdenovo reduced thenumber of 25-mers from 14.6 billion to 5.0 billion by correctingerrors before constructing the de Bruijn graph (Li et al. 2010). Itserror correction method first counts the number of occurrences ofall k-mers in the reads and replaces any k-mers occurring less thanthree times with the highest frequency alternative k-mer.

    Choice of assembler andsequencing strategyOnly de Bruijn graph assemblers havedemonstrated the ability to successfullyassemble very short reads (100 bp), overlap graph as-semblers have been quite successful andhave a much better track record overall. Ade Bruijn graph assembler should func-tion with longer reads as well, but a largedifference between the read length andthe k-mer length will result in many morebranching nodes than in the simplifiedoverlap graph. The precise conditions un-der which one assembly method is supe-rior to the other remain an open question,and the answer may ultimately dependon the specific assembler and genomecharacteristics.

    As Figure 3 illustrates, there is a di-rect and dramatic tradeoff among readlength, coverage, and expected contiglength in a genome assembly. The figureshows the theoretical expected contigslength, based on the Lander-Watermanmodel (Lander and Waterman 1988), inan assembly where all overlaps have beendetected perfectly. This model, which was

    widely applied for predicting assembly quality in the Sanger se-quencing era, predicts that under ideal conditions, 710-bp readsshould require 33 coverage to produce 4-kbp average contig sizes,while 30-bp reads would require 283 coverage. In practice, themodel is inadequate for modeling very short reads: The figure alsoshows the actual contig sizes for the dog genome, assembled with710-bp reads, and the panda genome, assembled with 52-bp reads.The dog assembly tracked closely to the theoretical prediction,while the panda assembly has contig sizes that are many timeslower than predicted by the model. The large discrepancy betweenpredicted and observed assembly quality results from the fact that

    Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the setof 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bpare indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, areshown as dotted edges. In a de Bruin graph (C ), a node is created for every k-mer in all the reads; herethe k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mersoverlap by k ! 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here wehave only considered the forward orientation of each sequence to simplify the figure.

    Figure 3. Expected average contig length for a range of different readlengths and coverage values. Also shown are the average contig lengthsand N50 lengths for the dog genome, assembled with 710-bp reads, andthe panda genome, assembled with reads averaging 52 bp in length.

    Schatz et al.

    1168 Genome Researchwww.genome.org

    Cold Spring Harbor Laboratory Press on December 2, 2010 - Published by genome.cshlp.orgDownloaded from

    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:

    GAC

  • 36626 - Next Generation Sequencing Analysis

    separate paths. Short repeats of this type can be resolved, but theyrequire additional processing and therefore additional time.

    Another potential drawback of the de Bruijn approach is thatthe de Bruijn graph can require an enormous amount of computerspace (random access memory, or RAM). Unlike conventionaloverlap computations, which can be easily partitioned into mul-tiple jobs with distinct batches of reads, the construction andanalysis of a de Bruijn graph is not easily parallelized. As a result, deBruijn assemblers such as Velvet and ALLPATHS, which have beenused successfully on bacterial genomes, do not scale to large ge-nomes. For a human-sized genome, these programs would requireseveral terabytes of RAM to store their de Bruijn graphs, which is farmore memory than is available on most computers.

    To date, only two de Bruijn graph assemblers have been shownto have the ability to assemble a mammalian-sized genome. ABySS(Simpson et al. 2009) assembled a human genome in 87 h ona cluster of 21 eight-core machines each with 16 GB of RAM (168cores, 336 GB of RAM total). SOAPdenovo assembled a human ge-nome in 40 h using a single computer with 32 cores and 512 GB ofRAM (Li et al. 2010). Although these types of computing resourcesare not widely available, they are within reach for large-scale sci-entific centers.

    In theory, the size of the de Bruijn graph depends only on thesize of the genome, including polymorphic alleles, and should beindependent of the number of reads. However, because sequencingerrors create their own graph nodes, increasing the number of readsinevitably increases the size of the de Bruijn graph. In the de novoassembly of human from short reads, SOAPdenovo reduced thenumber of 25-mers from 14.6 billion to 5.0 billion by correctingerrors before constructing the de Bruijn graph (Li et al. 2010). Itserror correction method first counts the number of occurrences ofall k-mers in the reads and replaces any k-mers occurring less thanthree times with the highest frequency alternative k-mer.

    Choice of assembler and sequencing strategy

    Only de Bruijn graph assemblers have demonstrated the ability to successfully assemble very short reads (<100 bp). For longer reads (>100 bp), overlap graph assemblers have been quite successful and have a much better track record overall. A de Bruijn graph assembler should function with longer reads as well, but a large difference between the read length and the k-mer length will result in many more branching nodes than in the simplified overlap graph. The precise conditions under which one assembly method is superior to the other remain an open question, and the answer may ultimately depend on the specific assembler and genome characteristics.

    As Figure 3 illustrates, there is a direct and dramatic tradeoff among read length, coverage, and expected contig length in a genome assembly. The figure shows the theoretical expected contig length, based on the Lander-Waterman model (Lander and Waterman 1988), in an assembly where all overlaps have been detected perfectly. This model, which was widely applied for predicting assembly quality in the Sanger sequencing era, predicts that under ideal conditions, 710-bp reads should require 3× coverage to produce 4-kbp average contig sizes, while 30-bp reads would require 28× coverage. In practice, the model is inadequate for modeling very short reads: The figure also shows the actual contig sizes for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with 52-bp reads. The dog assembly tracked closely to the theoretical prediction, while the panda assembly has contig sizes that are many times lower than predicted by the model. The large discrepancy between predicted and observed assembly quality results from the fact that
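    The Lander-Waterman expectation can be evaluated numerically. The sketch below uses the standard expected-island-length formula; the minimum detectable overlap is an assumed parameter (the settings behind Figure 3 are not given in the text), so the numbers are ballpark only.

```python
import math

def expected_contig_len(read_len, coverage, min_overlap):
    """Lander-Waterman expected contig (island) length under perfect
    overlap detection. min_overlap is the smallest detectable overlap T;
    sigma = 1 - T/L is the usable fraction of each read."""
    sigma = 1 - min_overlap / read_len
    c = coverage
    return read_len * ((math.exp(c * sigma) - 1) / c + 1 - sigma)

# Longer reads buy dramatically longer contigs at the same coverage.
# (min_overlap = 20 bp is an illustrative assumption.)
print(expected_contig_len(710, 3, 20))  # ~4 kbp, matching the text
print(expected_contig_len(30, 3, 20))   # barely longer than one read
```

    The exponential dependence on coverage times usable read fraction is what makes short reads so punishing: shaving the read length cuts sigma and the whole exponent with it.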

    Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note that here we have only considered the forward orientation of each sequence to simplify the figure.

    Figure 3. Expected average contig length for a range of different read lengths and coverage values. Also shown are the average contig lengths and N50 lengths for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with reads averaging 52 bp in length.

    Schatz et al.

    1168 Genome Research, www.genome.org

    Downloaded from genome.cshlp.org, published by Cold Spring Harbor Laboratory Press, on December 2, 2010.

    How is the graph constructed?

    • Same 10 reads: extract k-mers (k = 3) from each read and tally them in the graph:

    GAC
      1

    GAC ACC
      1   1

    GAC ACC CCT
      1   1   1

    GAC ACC CCT CTA TAC ACA
      1   1   1   1   1   1

    GAC ACC CCT CTA TAC ACA
      1   2   2   2   2   2

  • 36626 - Next Generation Sequencing Analysis

    separate paths. Short repeats of this type can be resolved, but theyrequire additional processing and therefore additional time.

    Another potential drawback of the de Bruijn approach is thatthe de Bruijn graph can require an enormous amount of computerspace (random access memory, or RAM). Unlike conventionaloverlap computations, which can be easily partitioned into mul-tiple jobs with distinct batches of reads, the construction andanalysis of a de Bruijn graph is not easily parallelized. As a result, deBruijn assemblers such as Velvet and ALLPATHS, which have beenused successfully on bacterial genomes, do not scale to large ge-nomes. For a human-sized genome, these programs would requireseveral terabytes of RAM to store their de Bruijn graphs, which is farmore memory than is available on most computers.

    To date, only two de Bruijn graph assemblers have been shownto have the ability to assemble a mammalian-sized genome. ABySS(Simpson et al. 2009) assembled a human genome in 87 h ona cluster of 21 eight-core machines each with 16 GB of RAM (168cores, 336 GB of RAM total). SOAPdenovo assembled a human ge-nome in 40 h using a single computer with 32 cores and 512 GB ofRAM (Li et al. 2010). Although these types of computing resourcesare not widely available, they are within reach for large-scale sci-entific centers.

    In theory, the size of the de Bruijn graph depends only on thesize of the genome, including polymorphic alleles, and should beindependent of the number of reads. However, because sequencingerrors create their own graph nodes, increasing the number of readsinevitably increases the size of the de Bruijn graph. In the de novoassembly of human from short reads, SOAPdenovo reduced thenumber of 25-mers from 14.6 billion to 5.0 billion by correctingerrors before constructing the de Bruijn graph (Li et al. 2010). Itserror correction method first counts the number of occurrences ofall k-mers in the reads and replaces any k-mers occurring less thanthree times with the highest frequency alternative k-mer.

    Choice of assembler andsequencing strategyOnly de Bruijn graph assemblers havedemonstrated the ability to successfullyassemble very short reads (100 bp), overlap graph as-semblers have been quite successful andhave a much better track record overall. Ade Bruijn graph assembler should func-tion with longer reads as well, but a largedifference between the read length andthe k-mer length will result in many morebranching nodes than in the simplifiedoverlap graph. The precise conditions un-der which one assembly method is supe-rior to the other remain an open question,and the answer may ultimately dependon the specific assembler and genomecharacteristics.

    As Figure 3 illustrates, there is a di-rect and dramatic tradeoff among readlength, coverage, and expected contiglength in a genome assembly. The figureshows the theoretical expected contigslength, based on the Lander-Watermanmodel (Lander and Waterman 1988), inan assembly where all overlaps have beendetected perfectly. This model, which was

    widely applied for predicting assembly quality in the Sanger se-quencing era, predicts that under ideal conditions, 710-bp readsshould require 33 coverage to produce 4-kbp average contig sizes,while 30-bp reads would require 283 coverage. In practice, themodel is inadequate for modeling very short reads: The figure alsoshows the actual contig sizes for the dog genome, assembled with710-bp reads, and the panda genome, assembled with 52-bp reads.The dog assembly tracked closely to the theoretical prediction,while the panda assembly has contig sizes that are many timeslower than predicted by the model. The large discrepancy betweenpredicted and observed assembly quality results from the fact that

    Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here we have only considered the forward orientation of each sequence to simplify the figure.
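The construction described in the caption (one node per k-mer, a directed edge between each pair of successive k-mers) can be sketched directly; the reads below are invented for illustration and, as in the figure, only the forward orientation is considered.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Nodes are k-mers; a directed edge joins each pair of successive
    k-mers within a read (adjacent k-mers overlap by k - 1 bases)."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            edges[read[i:i + k]].add(read[i + 1:i + k + 1])
    return edges

reads = ["GACCTACA", "CCTACAAG", "TACAAGTT"]
for node, succ in sorted(de_bruijn_graph(reads, 3).items()):
    print(node, "->", *sorted(succ))
```

A k-mer that recurs with two different successors would collect two outgoing edges — the fork that the caption attributes to repeat sequences.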

    Figure 3. Expected average contig length for a range of different read lengths and coverage values. Also shown are the average contig lengths and N50 lengths for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with reads averaging 52 bp in length.

    Schatz et al., Genome Research, p. 1168 (www.genome.org). Downloaded from genome.cshlp.org on December 2, 2010; published by Cold Spring Harbor Laboratory Press.

    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:

      k-mer: GAC ACC CCT CTA TAC ACA CAA
      count:   1   2   2   2   2   2   1

  • 36626 - Next Generation Sequencing Analysis

    separate paths. Short repeats of this type can be resolved, but they require additional processing and therefore additional time.

    Another potential drawback of the de Bruijn approach is that the de Bruijn graph can require an enormous amount of computer space (random access memory, or RAM). Unlike conventional overlap computations, which can be easily partitioned into multiple jobs with distinct batches of reads, the construction and analysis of a de Bruijn graph is not easily parallelized. As a result, de Bruijn assemblers such as Velvet and ALLPATHS, which have been used successfully on bacterial genomes, do not scale to large genomes. For a human-sized genome, these programs would require several terabytes of RAM to store their de Bruijn graphs, which is far more memory than is available on most computers.

    To date, only two de Bruijn graph assemblers have been shown to have the ability to assemble a mammalian-sized genome. ABySS (Simpson et al. 2009) assembled a human genome in 87 h on a cluster of 21 eight-core machines each with 16 GB of RAM (168 cores, 336 GB of RAM total). SOAPdenovo assembled a human genome in 40 h using a single computer with 32 cores and 512 GB of RAM (Li et al. 2010). Although these types of computing resources are not widely available, they are within reach for large-scale scientific centers.
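A back-of-envelope calculation makes the memory pressure concrete. Taking the 5.0 billion corrected 25-mers from the text as given, and assuming a 2-bits-per-base packed key plus a 16-byte per-node overhead for hash-table slot, edge, and coverage fields (both encoding choices are illustrative assumptions, not any particular assembler's layout):

```python
def graph_memory_gb(n_kmers, k, overhead_bytes=16):
    """Rough RAM estimate for a de Bruijn graph's node table."""
    key_bytes = -(-2 * k // 8)  # ceil(2*k / 8): bases packed at 2 bits each
    return n_kmers * (key_bytes + overhead_bytes) / 1e9

print(round(graph_memory_gb(5_000_000_000, 25)))   # corrected k-mers
print(round(graph_memory_gb(14_600_000_000, 25)))  # uncorrected k-mers
```

Even with aggressive packing, the corrected graph lands above 100 GB under these assumptions — in line with the 512-GB machine SOAPdenovo used — and skipping error correction roughly triples the footprint.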


    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:

      k-mer: GAC ACC CCT CTA TAC ACA CAA AAG AGT
      count:   1   2   3   4   4   4   3   2   1


    How is the graph constructed?

    • Same 10 reads, extract k-mers from reads and map onto graph, k = 3:

      k-mer: GAC ACC CCT CTA TAC ACA CAA AAG AGT GTT TTA TAG
      count:   1   2   3   4   5   6   6   5   5   3   2   1
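Once all k-mers are in the graph, a non-branching path is spelled out by taking the first k-mer whole and appending the last base of each successor. Applying this to the twelve 3-mers on the slide recovers one contig; treating them as a single linear chain is an assumption based on the order in which they are listed.

```python
def path_to_contig(kmers):
    """Spell the sequence of a non-branching k-mer path: the first k-mer
    plus the last base of every subsequent k-mer."""
    contig = kmers[0]
    for kmer in kmers[1:]:
        assert contig.endswith(kmer[:-1])  # successive k-mers overlap by k-1
        contig += kmer[-1]
    return contig

kmers = ["GAC", "ACC", "CCT", "CTA", "TAC", "ACA",
         "CAA", "AAG", "AGT", "GTT", "TTA", "TAG"]
print(path_to_contig(kmers))  # GACCTACAAGTTAG
```

This is the step that turns a chain of tiny k-mers back into a contig: 12 overlapping 3-mers collapse into one 14-bp sequence.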
