
Methods, Models & Techniques

High-throughput DNA sequencing – concepts and limitations

Martin Kircher and Janet Kelso*

Introduction

In 1977 the first genome, that of the 5,386 nucleotide (nt), single-stranded bacteriophage φX174, was completely sequenced [1] using a technology invented just a few years earlier [2–5]. Since then the sequencing of whole genomes as well as of individual regions and genes has become a major focus of modern biology and completely transformed the field of genetics.

At the time of the sequencing of φX174, and for almost another decade, DNA sequencing was a barely automated and very tedious process which involved determining only a few hundred nucleotides at a time. In the late 1980s, semi-automated sequencers with higher throughput became available [6, 7], still only able to determine a few sequences at a time. A breakthrough in the early 1990s was the development of capillary array electrophoresis and appropriate detection systems [8–12]. As recently as 1996, these developments converged in the production of a commercial single capillary sequencer (ABI Prism 310). In 1998, the GE Healthcare MegaBACE 1000 and the ABI Prism 3700 DNA Analyzer became the first commercial 96 capillary sequencers, a development which was termed high-throughput sequencing.

Over the last decade, alternative sequencing strategies have become available [13–18] which force us to completely redefine "high-throughput sequencing." These technologies outperform the older Sanger-sequencing technologies by a factor of 100–1,000 in daily throughput, and at the same time reduce the cost of sequencing one million nucleotides (1 Mb) to between 4% and 0.1% of that associated with Sanger sequencing. To reflect these huge changes, several companies, researchers, and recent reviews [19–24] use the term "next-generation sequencing" instead of high-throughput sequencing, yet this term itself may soon be outdated considering the speed of ongoing developments.

DOI 10.1002/bies.200900181

Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany

*Corresponding author: Janet Kelso. E-mail: [email protected]

Abbreviations: A/C/G/T, Deoxyadenosine, Deoxycytosine, Deoxyguanosine, Deoxythymidine; ATP, Adenosine triphosphate; dATPαS, Deoxy-adenosine-5′-(alpha-thio)-triphosphate; CCD, Charge-coupled Device, i.e. semi-conductor device used in digital cameras; ChIP-Seq, Chromatin Immuno-Precipitation sequencing; CNV, Copy Number Variation; dNTPs/NTPs, deoxy-nucleotides; ddNTPs, dideoxy-nucleotides (modified nucleotides missing a hydroxyl group at the third carbon atom of the sugar); GA, short for Illumina Genome Analyzer; InDel, Insertion/Deletion; kb/Mb/Gb, kilo base (10³ nt)/mega base (10⁶ nt)/giga base (10⁹ nt); MeDIP-Seq, Methylation-Dependent Immuno-Precipitation sequencing; nt, nucleotide(s); PCR, Polymerase Chain Reaction; RNA-Seq, Sequencing of mRNAs/transcripts; SAGE, Serial Analysis of Gene Expression; SNP, Single Nucleotide Polymorphism; mRNA, messenger RNA/transcripts.

Recent advances in DNA sequencing have revolutionized the field of genomics, making it possible for even single research groups to generate large amounts of sequence data very rapidly and at a substantially lower cost. These high-throughput sequencing technologies make deep transcriptome sequencing and transcript quantification, whole genome sequencing and resequencing available to many more researchers and projects. However, while the cost and time have been greatly reduced, the error profiles and limitations of the new platforms differ significantly from those of previous sequencing technologies. The selection of an appropriate sequencing platform for particular types of experiments is an important consideration, and requires a detailed understanding of the technologies available, including sources of error, error rate, as well as the speed and cost of sequencing. We review the relevant concepts and compare the issues raised by the current high-throughput DNA sequencing technologies. We analyze how future developments may overcome these limitations and what challenges remain.

Keywords: ABI/Life Technologies SOLiD; Helicos HeliScope; Illumina Genome Analyzer; Roche/454 GS FLX Titanium; Sanger capillary sequencing

www.bioessays-journal.com Bioessays 32: 524–536, © 2010 WILEY Periodicals, Inc.


Here we review the five sequencing technologies currently available on the market (capillary sequencing, pyrosequencing, reversible terminator chemistry, sequencing-by-ligation, and virtual terminator chemistry), discuss the intrinsic limitations of each, and provide an outlook on new technologies on the horizon. We explain how the vast increases in throughput are associated with both new and old types of problems in the resulting sequence data, and how these limit the potential applications and pose challenges for data analysis.

Sanger capillary sequencing

Current Sanger capillary sequencing systems, like the widely used Applied Biosystems 3xxx series or the GE Healthcare MegaBACE instrument, are still based on the same general scheme applied in 1977 for the φX174 genome [1, 3]. First, millions of copies of the sequence to be determined are purified or amplified, depending on the source of the sequence. Reverse strand synthesis is performed on these copies using a known priming sequence upstream of the sequence to be determined and a mixture of deoxy-nucleotides (dNTPs, the standard building blocks of DNA) and dideoxy-nucleotides (ddNTPs, modified nucleotides missing a hydroxyl group at the third carbon atom of the sugar). The dNTP/ddNTP mixture causes random, non-reversible termination of the extension reaction, creating from the different copies molecules extended to different lengths. Following denaturation and clean up of free nucleotides, primers, and the enzyme, the resulting molecules are sorted by their molecular weight (corresponding to the point of termination) and the label attached to the terminating ddNTPs is read out sequentially in the order created by the sorting step. A schematic representation of this process is available in Fig. 1.
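The chain-termination scheme described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the function name `sanger_read`, the number of copies, and the per-base termination probability are hypothetical values chosen for illustration, and sorting by fragment length stands in for the electrophoresis step.

```python
import random

def sanger_read(target, n_copies=20000, ddntp_fraction=0.05, seed=0):
    """Toy model of chain termination.

    `target` is the strand being synthesized. Each copy is extended base by
    base; at every position a labeled ddNTP is incorporated with probability
    `ddntp_fraction` (an assumed value), which terminates the copy.
    """
    rng = random.Random(seed)
    label_by_length = {}
    for _ in range(n_copies):
        for length, base in enumerate(target, start=1):
            if rng.random() < ddntp_fraction:
                # The terminating ddNTP carries the label of this base.
                label_by_length.setdefault(length, base)
                break
        # Copies that never terminate carry no label and are ignored.
    # Sorting by length mimics separation by molecular weight; reading the
    # labels in that order gives the sequence.
    return "".join(label_by_length[n] for n in sorted(label_by_length))

print(sanger_read("GATTACAGATTACAGATTACA"))
```

Because termination is random, every position is only covered if enough copies are present, which is one way to see why the method needs many copies of the input molecule.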

Sorting by molecular weight was originally performed using gel electrophoresis but is nowadays carried out by capillary electrophoresis [7, 25]. Originally, radioactive or optical labels were applied in four different terminator reactions (each sorted and read out separately), but today four different fluorophores, one per nucleotide (A, C, G, and T) are used in a single reaction [6]. Additionally, the advent of more sensitive detection systems and several rounds of primer extensions (equivalent to a linear amplification) permit smaller amounts of starting DNA to be used for modern sequencing reactions.

Unfortunately, there is still little automation for creation of the high copy input DNA with known priming sites. Typically this is done by cloning, i.e., introducing the target sequence into a known vector sequence using restriction and ligation procedures and using a bacterial strain to amplify the target sequence in vivo, thereby exploiting the low amplification error due to inherent proof-reading and repair mechanisms. However, this process is very tedious and is sometimes hampered by difficulties such as cloning specific sequences due to their base composition, length, and interactions with the bacterial host system. Although not yet widely used, integrated microfluidic devices have been developed which aim to automate the DNA extraction, in vitro amplification, and sequencing on the same chip [26–29].

Using current Sanger sequencing technology, it is technically possible for up to 384 sequences [29, 30] of between 600 and 1,000 nt in length [23, 31] to be sequenced in parallel. However, these 384-capillary systems are rare. The more standard 96-capillary instruments yield a maximum of approximately 6 Mb of DNA sequence per day, with costs for consumables amounting to about $500 per 1 Mb.

Figure 1. Schematic representation of the Sanger sequencing process. Input DNA is fragmented and cloned into bacterial vectors for in vivo amplification. Reverse strand synthesis is performed on the obtained copies starting from a known priming sequence and using a mixture of deoxy-nucleotides (dNTPs) and dideoxy-nucleotides (ddNTPs). The dNTP/ddNTP mixture randomly causes the extension to be non-reversibly terminated, creating differently extended molecules. Subsequently, after denaturation, clean up of free nucleotides, primers, and the enzyme, the resulting molecules are sorted using capillary electrophoresis by their molecular weight (corresponding to the point of termination) and the fluorescent label attached to the terminating ddNTPs is read out sequentially.


The sequencing error observed for Sanger sequencing is mainly due to errors in the amplification step (a low rate when done in vivo), natural variance, and contamination in the sample used, as well as polymerase slippage at low complexity sequences like simple repeats (short variable number tandem repeats) and homopolymers (stretches of the same nucleotide). Further, lower intensities and missing termination variants tend to lead to sequencing errors accumulating toward the end of long sequences. In combination with reduced separation by the electrophoresis, base miscalls [32] and deletions increase with read length. However, the average error rate (the average over all bases of a sequence) after sequence end trimming is typically very low, with an error every 10,000–100,000 nt [33].
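To put the quoted average error rate into perspective, the following sketch converts a per-base error rate into the expected number of errors per read and the probability that a read is entirely error-free. It assumes independent errors at a uniform rate along the read, which is a simplification given that errors accumulate toward the read end; the function names are illustrative.

```python
def expected_errors(read_length, error_rate):
    """Expected number of erroneous bases in a read (uniform-rate assumption)."""
    return read_length * error_rate

def prob_error_free(read_length, error_rate):
    """Probability that no base in the read is wrong (independence assumption)."""
    return (1.0 - error_rate) ** read_length

# Error rates of one error per 10,000 or 100,000 nt, read lengths of 600-1,000 nt.
for error_rate in (1e-4, 1e-5):
    for read_length in (600, 1000):
        print(
            f"rate={error_rate:g} length={read_length}: "
            f"expected errors={expected_errors(read_length, error_rate):.3f}, "
            f"P(error-free)={prob_error_free(read_length, error_rate):.3f}"
        )
```

The same two functions can be applied to the error rates reported for the other platforms, bearing in mind that their read lengths and error types differ.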

Roche/454 GS FLX Titanium sequencer

The 454 sequencing platform was the first of the new high-throughput sequencing platforms on the market (released in October 2005). It is based on the pyrosequencing approach developed by Pål Nyrén and Mostafa Ronaghi at the Royal Institute of Technology, Stockholm in 1996 [34]. In contrast to the Sanger technology, pyrosequencing is based on iteratively complementing single strands and simultaneously reading out the signal emitted from the nucleotide being incorporated (also called sequencing by synthesis, sequencing during extension). Electrophoresis is therefore no longer required to generate an ordered read out of the nucleotides, as the read out is now done simultaneously with the sequence extension.

In the pyrosequencing process (Fig. 2), one nucleotide at a time is washed over several copies of the sequence to be determined, causing polymerases to incorporate the nucleotide if it is complementary to the template strand. The incorporation stops if the longest possible stretch of complementary nucleotides has been synthesized by the polymerase. In the process of incorporation, one pyrophosphate per nucleotide is released and converted to ATP by an ATP sulfurylase. The ATP drives the light reaction of luciferases present and the emitted light signal is measured. To prevent the dATP provided for the sequencing reaction from being used directly in the light reaction, deoxy-adenosine-5′-(α-thio)-triphosphate (dATPαS), which is not a substrate of the luciferase, is used for the base incorporation reaction. Standard deoxyribose nucleotides are used for all other nucleotides. After capturing the light intensity, the remaining unincorporated nucleotides are washed away and the next nucleotide is provided.
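The flow-by-flow logic of pyrosequencing can be sketched as follows. The code computes an idealized, noise-free flowgram: for each nucleotide flow, the number of bases incorporated, which is what the light signal is approximately proportional to. The flow order "TACG" and the function name `flowgram` are assumptions made for illustration, and `read` denotes the strand being synthesized, not the template.

```python
def flowgram(read, flow_order="TACG"):
    """Return (nucleotide, incorporations) per flow until `read` is synthesized.

    Idealized model: every molecule incorporates the full homopolymer run in
    one flow and the signal is exactly the run length.
    """
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not covered by the flow order: {sorted(unknown)}")
    flows = []
    position = 0
    flow_index = 0
    while position < len(read):
        nucleotide = flow_order[flow_index % len(flow_order)]
        run_length = 0
        # Incorporation continues as long as the next base matches the flow.
        while position < len(read) and read[position] == nucleotide:
            run_length += 1
            position += 1
        flows.append((nucleotide, run_length))
        flow_index += 1
    return flows

def call_bases(flows):
    """Convert an ideal flowgram back into a sequence."""
    return "".join(nucleotide * count for nucleotide, count in flows)

signal = flowgram("TTAGGC")
print(signal)
print(call_bases(signal))
```

In this representation a homopolymer such as "TT" is a single flow with signal 2, so any inaccuracy in measuring signal intensity translates into an insertion or deletion rather than a substitution.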

In 2005, pyrosequencing technology was parallelized on a picotiter plate by 454 Life Sciences (later bought by Roche Diagnostics) to allow high-throughput sequencing [16]. The sequencing plate has about two million wells, each of them able to accommodate exactly one 28-µm diameter bead covered with single-stranded copies of the sequence to be determined. The beads are incubated with a polymerase and single-strand binding proteins and, together with smaller beads carrying the ATP sulfurylases and luciferases, gravitationally deposited in the wells. Free nucleotides are then washed over the flow cell and the light emitted during the incorporation is captured for all wells in parallel using a high-resolution charge-coupled device (CCD) camera, exploiting the light-transporting features of the plate used.

Figure 2. The pyrosequencing process. One of four nucleotides is washed sequentially over copies of the sequence to be determined, causing polymerases to incorporate complementary nucleotides. The incorporation stops if the longest possible stretch of the available nucleotide has been synthesized. In the process of incorporation, one pyrophosphate per nucleotide is released and converted to ATP by an ATP sulfurylase. The ATP drives the light reaction of luciferases present and a light signal proportional (within limits) to the number of nucleotide incorporations can be measured.

One of the main prerequisites for applying this array-based pyrosequencing approach is covering individual beads with multiple copies of the same molecule. This is done by first creating sequencing libraries in which every individual molecule gets two different adapter sequences, one at the 5′ end and one at the 3′ end of the molecule. In the case of the 454/Roche sequencing library preparation [16], this is done by sequential ligation of two pre-synthesized oligos. One of the adapters added is complementary to oligonucleotides on the sequencing beads and thus allows molecules to be bound to the beads by hybridization. Low molecule-to-bead ratios and amplification from the hybridized double-stranded sequence on the beads (kept separate using emulsion PCR) make it possible to grow beads with thousands of copies of a single starting molecule. Using the second adapter, beads covered with molecules can be separated from empty beads (using special capture beads with oligonucleotides complementary to the second adapter) and are then used in the sequencing reaction as described above.

The average substitution (excluding insertion/deletion, InDel) error rate is in the range of 10⁻³–10⁻⁴ [16, 35], which is higher than the rates observed for Sanger sequencing, but is the lowest average substitution error rate of the new sequencing technologies discussed here. As mentioned earlier for Sanger sequencing, in vitro amplifications performed for the sequencing preparation cause a higher background error rate, i.e., the error introduced into the sample before it enters the sequencer. In addition, in bead preparation (i.e., emulsion PCR) a fraction of the beads end up carrying copies of multiple different sequences. These "mixed beads" will participate in a high number of incorporations per flow cycle, resulting in sequencing reads that do not reflect real molecules. Most of these reads are automatically filtered during the software post-processing of the data. The filtering of mixed beads may, however, cause a depletion of real sequences with a high fraction of incorporations per flow cycle.
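The trade-off between empty beads and mixed beads can be made concrete with a simple loading model. The sketch below assumes that the number of library molecules captured per bead follows a Poisson distribution, which is a modeling assumption and not something stated in the text; the mean molecule-to-bead ratios used are hypothetical.

```python
import math

def bead_loading(mean_molecules_per_bead):
    """Fractions of beads that are empty, clonal (one molecule), or mixed.

    Assumes Poisson-distributed molecule counts per bead.
    """
    lam = mean_molecules_per_bead
    empty = math.exp(-lam)
    single = lam * math.exp(-lam)
    mixed = 1.0 - empty - single
    return empty, single, mixed

for ratio in (0.05, 0.1, 0.5, 1.0):  # hypothetical molecule-to-bead ratios
    empty, single, mixed = bead_loading(ratio)
    loaded = single + mixed
    print(
        f"ratio={ratio}: empty={empty:.3f} clonal={single:.3f} "
        f"mixed={mixed:.4f} mixed among loaded={mixed / loaded:.3f}"
    )
```

Under this model a low ratio keeps the share of mixed beads among loaded beads small, at the price of most beads staying empty, which is consistent with the need for an enrichment step that removes empty beads.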

A large fraction of the errors observed for this instrument are small InDels, mostly arising from inaccurate calling of homopolymer length, and single base-pair deletions or insertions caused by signal-to-noise thresholding issues [35]. Most of these problems can be resolved by higher coverage. For long (>10 nt) homopolymers, however, there is often a consistent length miscall that is not resolvable by coverage [35–37]. Strong light signals in one well of the picotiter plate may also result in insertions in sequences in neighboring wells. If the neighboring well is empty, this can generate so-called ghost wells, i.e., wells for which a signal is recorded even though they contain no sequence template; hence, the intensities measured are completely caused by bleed-over signal from the neighboring wells. Computational post-processing may correct for these artifacts [38]. As for Sanger sequencing, the error rate increases with the position in the sequence. In the case of 454 sequencing, this is caused by a reduction in enzyme efficiency or loss of enzymes (resulting in a reduction of the signal intensities), some molecules no longer being elongated and by an increasing phasing effect. Phasing is observed when a population of DNA molecules amplified from the same starting molecule (ensemble) is sequenced, and describes the process whereby not all molecules in the ensemble are extended in every cycle. This causes the molecules in the ensemble to lose synchrony/phase, and results in an echo of the preceding cycles being added to the signal as noise.

The current 454/Roche GS FLX Titanium platform makes it possible to sequence about 1.5 million such beads in a single experiment and to determine sequences of length between 300 and 500 nt. The length of the reads is determined by the number of flow cycles (the number of times all four nucleotides are washed over the plate) as well as by the base composition and the order of the bases in the sequence to be determined. Currently, 454/Roche limits this number to 200 flow cycles, resulting in an expected average read length of about 400 nt. This is largely due to limitations imposed by the efficiency of polymerases and luciferases, which drops over the sequencing run, resulting in decreased base qualities. Currently the platform allows about 750 Mb of DNA sequence to be created per day with costs of about $20/Mb.

Illumina Genome Analyzer II/IIx

The reversible terminator technology used by the Illumina Genome Analyzer (GA) employs a sequencing-by-synthesis concept that is similar to that used in Sanger sequencing, i.e. the incorporation reaction is stopped after each base, the label of the base incorporated is read out with fluorescent dyes, and the sequencing reaction is then continued with the incorporation of the next base [13, 39] (Fig. 3).

Like 454/Roche, the Illumina sequencing protocol requires that the sequences to be determined are converted into a special sequencing library, which allows them to be amplified and immobilized for sequencing [13, 40]. For this purpose two different adapters are added to the 5′ and 3′ ends of all molecules using ligation of so-called forked adapters.¹ The library is then amplified using longer primer sequences, which extend and further diversify the adapters to create the final sequence needed in subsequent steps.

¹ Hybrids of partially complementary oligonucleotides creating one double-stranded end with a T overhang, with a single-stranded and a different sequence at the other end.

This double-stranded library is melted using sodium hydroxide to obtain single-stranded DNAs, which are then pumped at a very low concentration through the channels of a flow cell. This flow cell has on its surface two populations of immobilized oligonucleotides complementary to the two different single-stranded adapter ends of the sequencing library. These oligonucleotides hybridize to the single-stranded library molecules. By reverse strand synthesis starting from the hybridized (double-stranded) part, the new strand being created is covalently bound to the flow cell. If this new strand bends over and attaches to another oligonucleotide complementary to the second adapter sequence on the free end of the strand, it can be used to synthesize a second covalently bound reverse strand. This process of bending and reverse strand synthesis, called bridge amplification, is repeated several times and creates clusters of several 1,000 copies of the original sequence in very close proximity to each other on the flow cell [13, 40].

These randomly distributed clusters contain molecules that represent the forward as well as reverse strands of the original sequences. Before determining the sequence, one of the strands has to be removed to prevent it from hindering the extension reaction sterically or by complementary base pairing. Strands are selectively cleaved at base modifications of oligonucleotides on the flow cell. Following strand removal, each cluster on the flow cell consists of single-stranded, identically oriented copies of the same sequence, which can be sequenced by hybridizing the sequencing primer onto the adapter sequences and starting the reversible terminator chemistry.

"Solexa sequencing", as it was introduced in early 2007, initially allowed for the simultaneous sequencing of several million very short sequences (at most 26 nt) in a single experiment. In recent years there have been several technical, chemical, and software updates. The product, which is now called the Illumina Genome Analyzer, has increased flow cell cluster densities (more than 200 million clusters per run), a wider range of the flow cell is imaged, and sequence reads of up to 100 nt can be generated. A technical update also enabled the sequencing of the reverse strand of each molecule. This is achieved by chemical melting and washing away the synthesized sequence, repeating a few bridge amplification cycles for reverse strand synthesis, and then selectively removing the starting strand (again using base modifications of the flow cell oligonucleotide populations), before annealing another sequencing primer for the second read. Using this "paired-end sequencing" approach, approximately twice the amount of data can be generated. The Illumina library and flow cell preparation includes several in vitro amplification steps, which cause a high background error rate and contribute to the average error rate of about 10⁻²–10⁻³ [41, 42]. Further, the flow cell preparation creates a fraction of ordinary-looking clusters that are initiated from more than one individual sequence. This results in mixed signals and mostly low quality sequences for these clusters. Similar to the 454 ghost wells, the Illumina image analysis may identify chemistry crystals, dust, and lint particles as clusters and call sequences from these. In such cases the resulting sequences typically appear to be of low sequence complexity.

As is the case for the other platforms, the error rate increases with increasing position in the determined sequence. This is mainly due to phasing, which increases the background noise as sequencing progresses. While the ensemble sequencing process for pyrosequencing creates uni-directional phasing, reversible terminator sequencing creates bi-directional phasing [41, 43] as some incorporated nucleotides may also fail to be correctly terminated, allowing the extension of the sequence by another nucleotide in the same cycle.
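How phasing erodes the signal can be sketched with a simple per-cycle model of an ensemble. In each cycle a molecule is assumed to fall behind with probability `lag` (no incorporation), to run ahead with probability `lead` (two incorporations, i.e. a failed termination), and otherwise to extend by exactly one base. The probabilities used are hypothetical illustration values, not measured instrument parameters; setting `lead=0` corresponds to uni-directional phasing.

```python
def in_phase_fraction(n_cycles, lag=0.005, lead=0.0):
    """Fraction of an ensemble that has incorporated exactly n_cycles bases."""
    # distribution[k]: fraction of molecules that have incorporated k bases.
    distribution = [1.0]
    for _ in range(n_cycles):
        updated = [0.0] * (len(distribution) + 2)
        for k, fraction in enumerate(distribution):
            updated[k] += fraction * lag                     # no extension
            updated[k + 1] += fraction * (1.0 - lag - lead)  # regular extension
            updated[k + 2] += fraction * lead                # double extension
        distribution = updated
    return distribution[n_cycles]

for cycles in (10, 36, 100):
    uni = in_phase_fraction(cycles, lag=0.005, lead=0.0)
    bi = in_phase_fraction(cycles, lag=0.005, lead=0.005)
    print(f"cycle {cycles}: in phase uni-directional={uni:.3f}, bi-directional={bi:.3f}")
```

The out-of-phase remainder contributes signal from neighboring positions, which is the growing background noise described above.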

Figure 3. Reversible terminator chemistry applied by the Illumina GA. Sequencing primers are annealed to the adapters of the sequences to be determined. Polymerases are used to extend the sequencing primers by incorporation of fluorescently labeled and terminated nucleotides. The incorporation stops immediately after the first nucleotide due to the terminators. The polymerases and free nucleotides are washed away and the label of the bases incorporated for each sequence is read with four images taken through different filters (T nucleotide filter is indicated in the figure) and using two different lasers (red: A, C and green: G, T) to illuminate fluorophores. Subsequently, the fluorophores and terminators are removed and the sequencing continued with the incorporation of the next base.


With increasing cycle numbers, the intensities extracted from the clusters decline [41, 43, 44]. This is due to fewer molecules participating in the extension reaction as a result of non-reversible termination, or due to dimming effects of the sequencing fluorophores. In early versions of the chemistry, one of the fluorophores could become stuck to the clusters, creating another source of increased background noise [41]. The simultaneous identification of four different nucleotides is also an issue. The GA uses four fluorescent dyes to distinguish the four nucleotides A, C, G, and T. Of these, two pairs (A/C and G/T), each excited using the same laser, are similar in their emission spectra and show only limited separation using optical filters. Therefore, the highest substitution errors observed are between A/C and G/T [41, 42].

Even though the Illumina GA reads show a higher average error rate, a wider average error range, and are considerably shorter than 454/Roche reads, the GA instrument determines more than 5,000 Mb/day at a price of about $0.50/Mb. This is a more than six times higher daily throughput at a considerably lower price per megabase.
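The throughput and cost figures quoted in this text for the three platforms discussed so far can be compared directly. The sketch below uses only those quoted numbers (6 Mb/day and $500/Mb for 96-capillary Sanger, 750 Mb/day and $20/Mb for 454/Roche, 5,000 Mb/day and $0.50/Mb for the Illumina GA); the dictionary layout and function name are illustrative.

```python
PLATFORMS = {
    "Sanger": {"mb_per_day": 6.0, "usd_per_mb": 500.0},
    "454": {"mb_per_day": 750.0, "usd_per_mb": 20.0},
    "Illumina": {"mb_per_day": 5000.0, "usd_per_mb": 0.50},
}

def compare(baseline, other):
    """Return (throughput fold change, cost as a percentage of the baseline)."""
    base, new = PLATFORMS[baseline], PLATFORMS[other]
    fold = new["mb_per_day"] / base["mb_per_day"]
    cost_percent = 100.0 * new["usd_per_mb"] / base["usd_per_mb"]
    return fold, cost_percent

for name in ("454", "Illumina"):
    fold, cost_percent = compare("Sanger", name)
    print(f"{name} vs Sanger: {fold:.0f}x throughput, {cost_percent:.1f}% of the cost per Mb")

fold, _ = compare("454", "Illumina")
print(f"Illumina vs 454: {fold:.1f}x throughput")
```

These ratios are consistent with the ranges given in the introduction: a 100 to 1,000-fold gain in daily throughput and a per-megabase cost of 4% to 0.1% of that of Sanger sequencing.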

Applied Biosystems SOLiD

The prototype of what was further developed and later sold by Life Technologies/Applied Biosystems (ABI) as the SOLiD sequencing platform, was developed by Harvard Medical School and the Howard Hughes Medical Institute and published in 2005 [17]. With its commercial release in late 2007, SOLiD was only the third new high-throughput system entering a highly competitive market with all three vendors selling their instruments for around half a million dollars. The Church lab at Harvard Medical School continued the development of the system and now offers a cheaper (<$200,000) open source version of the system (called Polonator) in collaboration with Dover System. In the third quarter of 2008, a biotechnology company from Mountain View, California, named Complete Genomics started offering a human genome sequencing service. Their technology is also based on the Church lab sequencing-by-ligation concept, but combines it with a new strategy of sequencing library construction and sequence immobilization using rolling circle amplification [45]. Here, we focus on the commercial SOLiD system as this is the most widespread application of this concept.

The principle behind sequencing-by-ligation is very different from the approaches discussed thus far. The sequence extension reaction is not carried out by polymerases but rather by ligases [17] (see Fig. 4 for a schematic representation of the SOLiD 2/3 platform). In the sequencing-by-ligation process, a sequencing primer is hybridized to single-stranded copies of the library molecules to be sequenced. A mixture of 8-mer probes carrying four distinct fluorescent labels compete for ligation to the sequencing primer. The fluorophore encoding, which is based on the two 3′-most nucleotides of the probe, is read. Three bases including the dye are cleaved from the 5′ end of the probe, leaving a free 5′ phosphate on the extended (by five nucleotides) primer, which is then available for further ligation. After multiple ligations (typically up to 10 cycles), the synthesized strands are melted and the ligation product is washed away before a new sequencing primer (shifted by one nucleotide) is annealed. Starting from the new sequencing primer the ligation reaction is repeated. The same process is followed for three other primers, facilitating the read out of the dinucleotide encoding for each start position in the sequence. Using specific fluorescent label encoding, the dye read outs (i.e. colors) can be converted to a sequence [46]. This conversion from color space to sequence requires a known first base, which is the last base of the used library adapter sequence. Given a reference sequence, this encoding system allows detection of machine errors and the application of an error correction to reduce the average error rate. In the absence of a reference sequence, however, color conversion fails with an error in the dye read out and causes the sequence downstream of the error to be incorrect.
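The conversion between color space and sequence, and the way a single dye read-out error corrupts everything downstream, can be sketched in a few lines. The text does not spell out the encoding table; the code below assumes the commonly documented two-base scheme in which each color is the XOR of 2-bit base codes (A=0, C=1, G=2, T=3), so that for example AA, CC, GG, and TT all give color 0. Function names and the example sequence are illustrative.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"

def encode(sequence, first_base):
    """Encode a sequence as colors, one per dinucleotide, given the known first base."""
    colors = []
    previous = first_base
    for base in sequence:
        colors.append(BASE_TO_CODE[previous] ^ BASE_TO_CODE[base])
        previous = base
    return colors

def decode(colors, first_base):
    """Decode colors back to bases; every base depends on all preceding colors."""
    bases = []
    code = BASE_TO_CODE[first_base]
    for color in colors:
        code ^= color
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)

colors = encode("ATGGCA", first_base="T")  # "T" plays the role of the last adapter base
print(colors)
print(decode(colors, first_base="T"))

# A single wrong color (third read-out) changes every base from that point on.
corrupted = list(colors)
corrupted[2] = 2
print(decode(corrupted, first_base="T"))
```

With a reference sequence available, an isolated color mismatch can be recognized as a read-out error, whereas a true single-base difference changes two adjacent colors; without a reference, decoding simply propagates the error as shown.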

For parallelization, the sequencing process uses beads covered with multiple copies of the sequence to be determined. These beads are created in a similar fashion to that described earlier for the 454/Roche platform. In contrast to the 454/Roche technology, the SOLiD system does not use a picotiter plate for fixation of the beads in the sequencing process; instead the 3′ ends of the sequences on the beads are modified in a way that allows them to be covalently bound onto a glass slide. As for the Illumina GA system, this creates a random dispersion of the beads in the sequencing chamber and allows for higher loading densities. However, random dispersion complicates the identification of bead positions from images, and results in the possibility that chemical crystals, dust, and lint particles can be misidentified as clusters. Further, dispersal of the beads results in a wide range of inter-bead distances, so that beads differ in their susceptibility to signals from neighboring beads.

Types and causes of sequence errors are diverse: first, the in vitro amplification steps cause a higher background error rate. Second, beads carrying a mixture of sequences and beads in close proximity to one another create false reads and low quality bases. Further, signal decline, a small regular phasing effect, and incomplete dye removal result in increasing error as the ligation cycles progress [47]. Phasing, as described earlier, is a minor issue on this platform as sequences not extended in the last cycle are non-reversibly terminated using phosphatases. Since hybridization is a stochastic process, this causes a considerable reduction in the number of molecules participating in subsequent ligation reactions, and therefore substantial signal decline. On the other hand, given the efficiency of phosphatases the remaining phasing effect can be considered very low. However, incomplete cleavage of the dyes may allow cleavage in the next ligation reaction, which then allows for the extension in the next but one cycle. This causes a different phasing effect and additional noise from the previous cycle’s dyes in the dye identification process.

The SOLiD system currently allows sequencing of more than 300 million beads in parallel, with a typical read length of between 25 and 75 nt. At the time of writing, the ABI SOLiD system is therefore comparable to the Illumina GA system in terms of throughput and price per million nucleotides (~5,000 Mb/day, ~0.50$/Mb). Average error rates are, however, dependent on the availability of a reference genome for error correction (10⁻³–10⁻⁴ vs. 10⁻²–10⁻³). In the absence of a reference genome, assembly and consensus calling may be performed based on dye read outs (so-called color space sequences) to reduce the errors before conversion to the nucleotide sequence. If no reference genome is available for error correction, and no assembly and consensus calling is performed, then the average error rate is higher than for the Illumina GA.
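Why a reference helps can be sketched as follows. Under the XOR color code assumed here (A=0, C=1, G=2, T=3; see [46] for the actual scheme), a single-base substitution changes two adjacent colors by the same amount, whereas a measurement error typically changes only one. The classifier below is a simplified, hypothetical illustration and ignores cases such as adjacent substitutions or errors next to a true variant.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(seq):
    """Return one color (0-3) for every pair of adjacent bases in seq."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def classify_color_mismatches(ref_colors, read_colors):
    """Label mismatches between two equally long color strings.

    Returns (kind, index) tuples. "snp_candidate" marks two adjacent colors
    changed by the same XOR difference; "isolated_error" marks a single
    changed color, which is more likely a measurement error.
    """
    diffs = [r ^ q for r, q in zip(ref_colors, read_colors)]
    calls = []
    i = 0
    while i < len(diffs):
        if diffs[i] == 0:
            i += 1
        elif i + 1 < len(diffs) and diffs[i + 1] == diffs[i]:
            calls.append(("snp_candidate", i))
            i += 2
        else:
            calls.append(("isolated_error", i))
            i += 1
    return calls


reference = "TACGGTCA"   # hypothetical reference fragment
sample = "TACGATCA"      # same fragment with one substitution (G to A)
ref_colors = encode_colors(reference)
snp_colors = encode_colors(sample)

noisy_colors = list(ref_colors)
noisy_colors[1] ^= 2     # one misread color, no underlying base change
```

Here `classify_color_mismatches(ref_colors, snp_colors)` reports a single candidate substitution, while the same call on `noisy_colors` reports an isolated error that could be corrected back to the reference color.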

Helicos HeliScope

Helicos is the first company to sell a sequencer able to sequence individual molecules instead of molecule ensembles created by an amplification process. Single molecule sequencing has the advantage that it is not affected by biases or errors introduced in a library preparation or amplification step, and may facilitate sequencing of minimal amounts of input DNA. Using methods able to detect non-standard nucleotides, it could also allow for the identification of DNA modifications, commonly lost in the in vitro amplification process.

The HeliScope, as the Helicos sequencer is called, was first sold in March 2008, and by the end of the first quarter of 2009 only four machines had been installed worldwide. This might be surprising given the advantages of single molecule sequencing, but probably reflects the specific limitations of this platform, the price (about one million dollars), and a relatively small market that has already invested extensively in new sequencing technologies.

Figure 4. Applied Biosystems’ SOLiD sequencing by ligation. A sequencing primer is annealed to single-stranded copies of sequences to be determined. Octamer probes are hybridized, ligated to the sequencing primer, and a fluorescent dye at the 5′ end of the ligated 8-mer probes, encoding the two 3′-most nucleotides of the probe, is read out. Non-extended primers are dephosphorylated. Three nucleotides of the probe including the dye are cleaved, creating a free 5′ phosphate for further ligations. After multiple ligations, the synthesized strands are melted and the ligation product is washed away before a new, by-one-nucleotide-shifted sequencing primer is annealed. Starting from the new sequencing primer the ligation reaction is repeated. The same is done for three other primers, allowing the read out of the dinucleotide label for every position in the sequence.

The technology applied (Fig. 5) could be termed asynchronous virtual terminator chemistry [15]. Input DNA is fragmented and melted before a poly-A tail is synthesized onto each single-stranded molecule using a polyadenylate polymerase. In the last step of polyadenylation, a fluorescently labeled adenine is added. The library is washed over a flow cell where the poly-A tails bind to poly-T oligonucleotides. The bound coordinates on the flow cell are determined using a fluorescence-based read out of the flow cell. Having these coordinates identified, the fluorescent label of the 3′ adenine is removed and the sequencing reaction started. Polymerases are washed through the flow cell with one type of fluorescently labeled nucleotide (A, C, G, or T) at a time and the polymerases extend the reverse strand of the sequences starting from the poly-T oligonucleotides. The nucleotide incorporation of the polymerases is slowed down by the fluorescent labeling and allows for at most one incorporation before the polymerase is washed away together with the non-incorporated nucleotides (termed virtual termination [48, 49]). The flow cell is then imaged again, the fluorescent dyes are removed, and the reaction continued with another nucleotide. By this process not every molecule is extended in every cycle, which is why it is an asynchronous sequencing process resulting in sequences of different length (as is the case for the 454/Roche platform).
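The asynchronous character of this chemistry can be made concrete with a toy simulation. The sketch below is an illustration under stated assumptions, not a model of the instrument: the flow order `CTAG`, the incorporation probability, and the function name are hypothetical, and the template is written directly as the strand being synthesized so that no complementing is needed.

```python
import random


def simulate_flows(template, flow_order="CTAG", n_cycles=8,
                   p_incorporate=1.0, rng=None):
    """Simulate single-nucleotide flows on one template molecule.

    In every flow only one nucleotide type is offered and at most one base
    is added (virtual termination), so homopolymers need several cycles and
    molecules fall out of step with each other.
    """
    rng = rng or random.Random(0)
    read = []
    pos = 0
    for _ in range(n_cycles):
        for nucleotide in flow_order:
            matches = pos < len(template) and template[pos] == nucleotide
            if matches and rng.random() < p_incorporate:
                read.append(nucleotide)
                pos += 1
    return "".join(read)


template = "CCTAG"  # hypothetical strand to be synthesized
after_one_cycle = simulate_flows(template, n_cycles=1)
after_two_cycles = simulate_flows(template, n_cycles=2)
# With imperfect incorporation the read is a (possibly shorter) prefix.
partial = simulate_flows(template, n_cycles=2, p_incorporate=0.7,
                         rng=random.Random(1))
```

With perfect incorporation the first cycle adds only one of the two leading C bases, and the remaining bases follow in the second cycle. Lowering `p_incorporate` yields reads of different lengths for identical templates, which mirrors the length variation described above.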

Since single molecules are sequenced, the signals being measured are weak, and there is no possibility that misincorporation errors can be corrected by an ensemble effect. Due to the fact that molecules are attached to the flow cell by hybridization only, there is a chance that template molecules can be lost in the wash steps. In addition, molecules may be irreversibly terminated by the incorporation of incorrectly synthesized nucleotides. Overall, reads are between 24 and 70 nt long (average 32 nt) [50] and thus shorter than for the other platforms. Due to the higher number of sequences determined in parallel, the total throughput per day (4150 Mb/day with a cost of ~0.33$/Mb [50]) is in the same range as for the GA and SOLiD systems. The average error rate, which is in the range of a few percent, is slightly higher than for all other instruments and biased toward InDels rather than substitutions.

Applications and general considerations

Figure 5. Asynchronous virtual terminator chemistry performed by the HeliScope. Input DNA is fragmented, melted, and polyadenylated. A fluorescently labeled adenine is added in the last step. This single-stranded DNA is washed over a flow cell with poly-T oligonucleotides allowing hybridization. The bound coordinates on the flow cell are determined using the fluorescently labeled adenines. Having the coordinates identified, the fluorescent label of the 3′ adenines is removed. Polymerases are washed through with one type of fluorescently labeled nucleotide (A, C, G, or T) at a time, and the polymerases extend the reverse strand of the sequences starting from the poly-T oligonucleotides. The nucleotide incorporation of the polymerases is slowed down by the fluorescent labeling and allows for at most one incorporation before the polymerase is washed away. The flow cell is then imaged, the fluorescent dyes removed, and the reaction continued with another nucleotide.

All current high-throughput technologies have an average error rate that is considerably higher than the typical 1/10,000 to 1/100,000 observed for high-quality Sanger sequences. Further, the GS FLX Titanium, GA, SOLiD, and HeliScope platforms each have very specific biases and limitations, making it necessary to choose a platform appropriate for a specific project or application (for a summary see Table 1). A combination of technologies [51–54] and experimental protocols [55–57] may also be appropriate, and even complementary, for specific projects.
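Error rates such as these are often compared on the Phred scale introduced for Sanger base calling [33], where a quality score Q corresponds to an error probability p via Q = −10 log10(p). The short sketch below only restates this formula; the function names are hypothetical.

```python
import math


def error_prob_to_phred(p):
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def phred_to_error_prob(q):
    """Error probability for a Phred-scaled quality q."""
    return 10.0 ** (-q / 10.0)


sanger_low = error_prob_to_phred(1 / 10_000)     # about Q40
sanger_high = error_prob_to_phred(1 / 100_000)   # about Q50
one_percent = error_prob_to_phred(0.01)          # about Q20
```

On this scale the Sanger range of 1/10,000 to 1/100,000 corresponds to roughly Q40 to Q50, while an error rate of about one percent corresponds to roughly Q20.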

High-quality Sanger sequencing is now commonly used for low-coverage sequencing of individual positions and regions (e.g., diagnostic genotyping) or for sequencing virus- and phage-sized whole genomes. As the Sanger sequence length is longer than most abundant short repeat classes, it allows the unambiguous assembly of most genomic regions – something that is generally not possible using the shorter read platforms. However, the technology is expensive and too slow for sequencing large samples, extended genomic regions, or the many molecules required for quantitative applications [e.g., gene expression quantification; chromatin immunoprecipitation sequencing (ChIP-Seq); and methylation-dependent immunoprecipitation sequencing (MeDip-Seq)]. For quantitative applications the HeliScope provides the highest throughput in terms of sequence number and has the advantage of not requiring a multistep library preparation protocol. On the other hand, the HeliScope provides the lowest resolution in mapping accuracy for complex genomes due to its short read length and error profile. The GA or SOLiD platforms may thus provide equivalent results for quantitative applications, while providing fewer but longer reads and requiring a more elaborate library preparation.

While it has not yet been fully analyzed, it is possible (and even likely) that library preparation protocols could bias the sequence representation in a sample [42, 58, 59], making the replacement of this step an important goal. Further, multistep library preparation protocols require higher amounts of input material, limiting their general application. However, protocols for library construction from limited sample amounts are available or being developed for each of the platforms, and publications demonstrate that, while vendor protocols indicate the need for higher sample quantities (microgram range), many users are proceeding successfully with low input DNA amounts (nanogram to picogram range), as, for example, from ancient DNA specimens [60–62].

Like Sanger sequencing, the GS FLX Titanium provides a read length spanning many of the short repeat sequences – an important feature for accurate sequence mapping and assembly of genomes [63]. Despite the InDel errors, this technology has very low rates of misidentifying individual bases, making it perfectly suited for the identification of single nucleotide polymorphisms (SNPs). Also geared to the identification of SNPs, at least for samples with an existing reference genome, is the SOLiD instrument with its dinucleotide encoding scheme [46]. Considerably higher coverage is needed to perform SNP calling with similar accuracy using the Illumina GA [64]. Neither the Illumina GA nor the ABI SOLiD sequencing system is prone to generating high rates of small InDels, making them well suited for studying InDel variation.

As mentioned earlier, the drawback of short reads (below about 75 nt) obtained from Helicos, SOLiD, or GA instruments is in genome assembly and mapping applications, where the placement of repeated or very similar sequences cannot be resolved unambiguously. The correct placement is further complicated by high error rates, which increase the minimum sequence distance required for an unambiguous placement. Paired-end or mate-pair protocols help to overcome some of these limitations of short reads [65] by providing information about relative location and orientation of a pair of reads. Currently a paired-end protocol is only commonly applied on the GA, while mate-pair protocols are available for SOLiD, GS FLX Titanium, and GA. In paired-end sequencing the actual ends of rather short DNA molecules (<1 kb) are determined, while mate-pair sequencing requires the preparation of special libraries. In these protocols, the ends of longer, size-selected molecules (e.g., 8, 12, or 20 kb) are connected with an internal adapter sequence in a circularization reaction. The circular molecule is then processed using restriction enzymes or fragmentation before outer library adapters are added around the two combined molecule ends. The internal adapter can then be used as a second priming site for an additional sequencing reaction on the same immobilized molecules. Thus, mate-pair sequencing provides distance information useful for assembly, but does not allow the merging of the two end reads, since by design the reads will not overlap. However, merging of two overlapping forward and reverse paired-end reads from short insert libraries allows the reconstruction of a complete consecutive molecule sequence, longer than the individual read length, and with reduced average error rates in the overlapping sequence part [60, 66].
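The merging of overlapping paired-end reads can be sketched as follows. This is a simplified illustration, not the procedure of [60, 66]: it assumes that the insert is at least as long as each read (no adapter read-through), that per-base qualities are given as integers, and that the longest acceptable overlap is the correct one. All names and thresholds are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, fwd_qual, rev, rev_qual,
               min_overlap=10, max_mismatch_frac=0.1):
    """Merge a forward read with the reverse complement of its mate.

    Overlaps are tried from longest to shortest. Within the overlap the base
    with the higher quality is kept. Returns None if no overlap is accepted.
    """
    mate = reverse_complement(rev)
    mate_qual = rev_qual[::-1]
    for overlap in range(min(len(fwd), len(mate)), min_overlap - 1, -1):
        tail, head = fwd[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches > max_mismatch_frac * overlap:
            continue
        start = len(fwd) - overlap
        consensus = [
            a if qa >= qb else b
            for a, b, qa, qb in zip(tail, head, fwd_qual[start:], mate_qual)
        ]
        return fwd[:start] + "".join(consensus) + mate[overlap:]
    return None


# Hypothetical 28 nt molecule sequenced with 20 nt reads from both ends.
molecule = "ACGTTGCAAGGCTTACCGATGGATCCAT"
fwd = molecule[:20]
rev = reverse_complement(molecule[-20:])
quals = [30] * 20
merged = merge_pair(fwd, quals, rev, quals)

# A low-quality error in the forward read is repaired by the mate.
fwd_err = fwd[:15] + "A" + fwd[16:]
err_quals = list(quals)
err_quals[15] = 5
merged_err = merge_pair(fwd_err, err_quals, rev, quals)
```

Both calls reconstruct the full 28 nt molecule, which is longer than either read; in the second call the conflicting position is resolved in favor of the higher-quality base from the mate.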

Due to the large amounts of sequences created, there is interest in sequencing targeted regions (e.g. a genomic locus, from sequence capture experiments [67–69]) in multiple individuals/samples instead of sequencing one sample in excessive depth. All technologies therefore provide a separation of their sequencing plate into defined regions or channels. However, at most, 16 such regions/channels are available (GS FLX Titanium and HeliScope plates), which may not be sufficient for some applications. Using different library construction protocols, some platforms allow addition of sample-specific barcode (sometimes called “index”) sequences to the library molecules. These molecules can then be sequenced in the same region/channel, and later separated (computationally) based on their barcode sequence [70–73]. This facilitates highly parallel sequencing of a large number of samples beyond that possible using the physical lane/channel separation. Currently such protocols (mostly non-vendor protocols) are available for the GS FLX Titanium, GA, and SOLiD instruments.
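The computational separation by barcode can be sketched in a few lines. The example assumes, purely for illustration, that the barcode occupies the first bases of each read, that all barcodes have the same length, and that one mismatch is tolerated; actual protocols differ in barcode position and design. Barcode sequences and sample names are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equally long strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcodes, max_mismatches=1):
    """Assign reads to samples by their leading barcode.

    reads: iterable of sequences whose first bases are the barcode.
    barcodes: dict mapping barcode sequence to sample name.
    Reads that match no barcode well enough, or match two equally well,
    are collected under "unassigned".
    """
    length = len(next(iter(barcodes)))
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:length], read[length:]
        ranked = sorted((hamming(tag, bc), sample)
                        for bc, sample in barcodes.items())
        best_dist, best_sample = ranked[0]
        ambiguous = len(ranked) > 1 and ranked[1][0] == best_dist
        if best_dist <= max_mismatches and not ambiguous:
            bins[best_sample].append(insert)
        else:
            bins["unassigned"].append(read)
    return dict(bins)


barcodes = {"ACGT": "sample1", "TGCA": "sample2"}
reads = ["ACGTAAAA", "TGCACCCC", "ACGAGGGG", "GGGGTTTT"]
bins = demultiplex(reads, barcodes)
```

The third read carries one error in its barcode and is still assigned to `sample1`, whereas the fourth read is equally distant from both barcodes and remains unassigned. Barcode sets are usually designed with enough pairwise distance that such single errors can be tolerated.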

Although sequencing prices per gigabase have fallen considerably in recent years, making projects like the 1000 Human Genome Variation Project, 1001 Arabidopsis thaliana Genomes Project, the Mammalian Genome Project, or the International Cancer Genome Consortium possible, high-throughput sequencing still has high acquisition, running and maintenance costs, which are not included in Table 1. Further, each of these platforms requires a substantial investment in data management and analysis, time, and personnel [74–77]. Smaller research groups may still find prohibitive the costs of the infrastructure needed for storing, handling, and analyzing several tens of gigabytes of pure sequence data and terabytes of several thousand intermediate files generated by these instruments each week. Even for larger, experienced genome centers this aspect remains an ever-increasing challenge for the ongoing use of these platforms.

Upcoming developments

Motivated by the goal of a $1,000 genome set by NIH/NHGRI to enable personalized medicine, the throughput of all systems described is constantly increasing and the numbers given here will rapidly become outdated. However, in addition to the improvements of current technologies, including the January 2010 announcement of the Illumina HiSeq 2000 system, which determines sequences of clusters on bottom and top of the flow cell and processes two flow cells in parallel, a new generation of sequencers is already on the horizon.

What started with the Helicos system – the sequencing of single molecules without prior library preparation or amplification – will likely become a popular paradigm. Specifically, three other systems have captured media and scientific attention well in advance of their actual availability: Pacific Biosciences’ Single Molecule Real Time (SMRT) sequencing technology [18], Oxford Nanopore’s BASE technology [14] and, recently, IBM’s proposal of silicon-based nanopores [78].

Pacific Biosciences’ SMRT technology performs the sequencing reaction on silicon dioxide chips with a 100 nm metal film containing thousands of tens-of-nanometer diameter holes, so-called zero-mode waveguides (ZMWs) [79]. Each ZMW is used as a nano-visualization chamber, providing a detection volume of 20 zeptoliters (20 × 10⁻²¹ l). At this volume, a single molecule can be illuminated while excluding other labeled nucleotides in the background – saving time and sequencing chemistry by omitting wash steps. A single DNA polymerase is fixed to the bottom of the surface within the detection volume, and nucleotides, with different dyes attached to the phosphate chain, are used in concentrations allowing normal enzyme processivity. As the polymerase incorporates a complementary nucleotide, the nucleotide is held within the detection volume for tens of milliseconds, orders of magnitude longer than for unspecific diffusion events. This way the fluorescent dye of the incorporated nucleotide can be identified during normal speed reverse strand synthesis [79]. In pilot experiments, Pacific Biosciences has shown that its technology allows for direct sequencing of a few thousand bases before the polymerase is denatured due to the laser read out of the dyes. The SMRT technology is intended for release in 2010. Even though further development is needed to create a more robust system, the omission of library preparation and amplification as well as the long sequences generated will undoubtedly provide an advantage over the current systems for many applications.

Table 1. Comparison of high-throughput sequencing technologies available

Throughput Length Quality Costs Applications Main sources of errors

The table summarizes throughput, length, quality, and costs for the current versions of the mentioned technologies. These approximate numbers are constantly improving and based on figures available in January 2010. Costs do not include instrument acquisition and maintenance; further they may be affected by discounts and scale effects for multiple instruments. Where numbers are very similar, colors ranging from red (low performance) to green (good performance) indicate a general trend. In the last column, example applications fitting the throughput and error profiles of each of the platforms are given. Typically, this does not mean that the technology is limited to these applications, but that it is currently best suited to such applications. + High sequencing depth/number of runs required.

Oxford Nanopore’s BASE technology is unlikely to be released as soon as the SMRT technology. BASE offers the potential to identify individual nucleotide modifications (e.g. 5-methylcytosine vs. cytosine) during the sequencing process [14]. The idea behind this technology is the identification of individual nucleotides using a change in the membrane potential as they pass through a modified α-hemolysin membrane pore with a cyclodextrin sensor [14, 80]. However, to apply this technology for sequencing, the pore has to be fused to an exonuclease, which degrades single-stranded DNA sequences and releases individual nucleotides into the pore. In addition, the technology needs to be parallelized in array format before its release as a high-throughput sequencing platform. While the sensitivity for individual nucleotide modifications seems to be a major advantage, the destructive fashion of the outlined sequencing process might be considered a hindrance for applications with precious samples, and it does not allow a second read cycle for error reduction.

In early October 2009, IBM issued a press release [78] describing a method to slow down the speed of an individual DNA strand passing through a nanopore. For this purpose they developed a multilayer metal/dielectric nanopore device that utilizes the interaction of the DNA backbone charges with a modulated electric field to trap and slowly release an individual DNA molecule. The technology described could theoretically be combined with, for example, the nanopore technologies developed at Harvard University [81] or the previously described BASE technology, where it may overcome the destructive approach followed so far.

Conclusion

Current high-throughput sequencing technologies provide a huge variety of sequencing applications to many researchers and projects. Given the immense diversity, we have not discussed these applications in depth here; other reviews with a stronger focus on specific applications and data analysis are available [24, 82–88]. The discussed technologies make it possible for even single research groups to generate large amounts of sequence data very rapidly and at substantially lower costs than traditional Sanger sequencing. While costs have been reduced to 4–0.1% of those of Sanger sequencing and time has been shortened by a factor of 100–1,000 based on daily throughput, the error profiles and limitations observed for the new platforms differ significantly from Sanger sequencing and between approaches. Further, each of these new sequencing platforms requires substantial additional investments – factors that have often not been sufficiently stressed in research publications describing a specific application. Some vendors have recently started to offer budget versions of their instruments (e.g. Illumina GA IIe or 454/Roche GS Junior) with lower sequencing capacity. However, while the instrument price is lower, the financial investment remains high. Costs per base are generally higher than for the standard instrument, and very similar overall infrastructure is still required. Often the choice of an appropriate sequencing platform is project specific and sometimes combinations can be advantageous. This may open the market further to companies providing sequencing-on-demand services, but will not replace the need for laboratories to invest considerable time and expertise in both the production of libraries and analysis of the vast quantities of data that will be generated.

New technologies on the horizon, SMRT by Pacific Biosciences, BASE by Oxford Nanopore, and other technologies such as that suggested by IBM, demonstrate the major future directions in the field of DNA sequencing: the ability to use individual molecules without any library preparation or amplification, the identification of specific nucleotide modifications, and the ability to generate longer sequence reads. These developments will facilitate future research in many fields, make data analysis easier, and further reduce sequencing costs, hopefully achieving the aim of a $1,000 human genome suggested by NIH/NHGRI to be required for personalized medicine.

Acknowledgments

We thank the members of the Department of Evolutionary Genetics, and particularly members of the sequencing group, for providing sequencing data from multiple platforms, as well as interesting discussions and useful insights. We are also indebted to A. Wilkins and the three anonymous reviewers for critical reading of the manuscript and thoughtful comments. This work was supported by the Max Planck Society.

References

1. Sanger F, Air GM, Barrell BG, et al. 1977. Nucleotide sequence of bacteriophage phiX174 DNA. Nature 265: 687–95.

2. Gilbert W, Maxam A. 1973. The nucleotide sequence of the lac operator. Proc Natl Acad Sci USA 70: 3581–4.

3. Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 74: 5463–7.

4. Sanger F, Coulson AR. 1975. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol 94: 441–8.

5. Wu R, Kaiser AD. 1968. Structure and base sequence in the cohesive ends of bacteriophage lambda DNA. J Mol Biol 35: 523–37.

6. Smith LM, Sanders JZ, Kaiser RJ, et al. 1986. Fluorescence detection in automated DNA sequence analysis. Nature 321: 674–9.

7. Swerdlow H, Gesteland R. 1990. Capillary gel electrophoresis for rapid, high resolution DNA sequencing. Nucleic Acids Res 18: 1415–9.

8. Zagursky RJ, McCormick RM. 1990. DNA sequencing separations in capillary gels on a modified commercial DNA sequencing instrument. Biotechniques 9: 74–9.

9. Huang XC, Quesada MA, Mathies RA. 1992. DNA sequencing using capillary array electrophoresis. Anal Chem 64: 2149–54.

10. Kambara H, Takahashi S. 1993. Multiple-sheathflow capillary array DNA analyser. Nature 361: 565–6.

11. Ueno K, Yeung ES. 1994. Simultaneous monitoring of DNA fragments separated by electrophoresis in a multiplexed array of 100 capillaries. Anal Chem 66: 1424–31.

12. Kim S, Yoo HJ, Hahn JH. 1996. Postelectrophoresis capillary scanning method for DNA sequencing. Anal Chem 68: 936–9.

13. Bentley DR, Balasubramanian S, Swerdlow HP, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53–9.

14. Clarke J, Wu HC, Jayasinghe L, et al. 2009. Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol 4: 265–70.

15. Harris TD, Buzby PR, Babcock H, et al. 2008. Single-molecule DNA sequencing of a viral genome. Science 320: 106–9.

16. Margulies M, Egholm M, Altman WE, et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–80.

17. Shendure J, Porreca GJ, Reppas NB, et al. 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309: 1728–32.

18. Korlach J, Marks PJ, Cicero RL, et al. 2008. Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures. Proc Natl Acad Sci USA 105: 1176–81.

19. Ansorge WJ. 2009. Next-generation DNA sequencing techniques. N Biotechnol 25: 195–203.

20. Mardis ER. 2008. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9: 387–402.

21. Schuster SC. 2008. Next-generation sequencing transforms today’s biology. Nat Methods 5: 16–8.

22. Shendure J, Ji H. 2008. Next-generation DNA sequencing. Nat Biotechnol 26: 1135–45.

23. Shendure JA, Porreca GJ, Church GM. 2008. Overview of DNA sequencing strategies. Curr Protoc Mol Biol Chapter 7: Unit 7.1.

24. Metzker ML. 2010. Sequencing technologies – the next generation. Nat Rev Genet 11: 31–46.

25. George KS, Zhao X, Gallahan D, et al. 1997. Capillary electrophoresis methodology for identification of cancer related gene expression patterns of fluorescent differential display polymerase chain reaction. J Chromatogr B Biomed Sci Appl 695: 93–102.

26. Blazej RG, Kumaresan P, Mathies RA. 2006. Microfabricated bioprocessor for integrated nanoliter-scale Sanger DNA sequencing. Proc Natl Acad Sci USA 103: 7240–5.

27. Mariella R Jr. 2008. Sample preparation: the weak link in microfluidics-based biodetection. Biomed Microdevices 10: 777–84.

28. Roper MG, Easley CJ, Legendre LA, et al. 2007. Infrared temperature control system for a completely noncontact polymerase chain reaction in microfluidic chips. Anal Chem 79: 1294–1300.

29. Emrich CA, Tian H, Medintz IL, et al. 2002. Microfabricated 384-lane capillary array electrophoresis bioanalyzer for ultrahigh-throughput genetic analysis. Anal Chem 74: 5076–83.

30. Shibata K, Itoh M, Aizawa K, et al. 2000. RIKEN integrated sequence analysis (RISA) system – 384-format sequencing pipeline with 384 multicapillary sequencer. Genome Res 10: 1757–71.

31. Hert DG, Fredlake CP, Barron AE. 2008. Advantages and limitations of next-generation sequencing technologies: a comparison of electrophoresis and non-electrophoresis methods. Electrophoresis 29: 4618–26.

32. Ewing B, Hillier L, Wendl MC, et al. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8: 175–85.

33. Ewing B, Green P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8: 186–94.

34. Ronaghi M, Karamohamed S, Pettersson B, et al. 1996. Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 242: 84–9.

35. Quinlan AR, Stewart DA, Stromberg MP, et al. 2008. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods 5: 179–81.

36. Wicker T, Schlagenhauf E, Graner A, et al. 2006. 454 sequencing put to the test using the complex genome of barley. BMC Genomics 7: 275.

37. Green RE, Malaspinas AS, Krause J, et al. 2008. A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell 134: 416–26.

38. Green RE, Krause J, Ptak SE, et al. 2006. Analysis of one million base pairs of Neanderthal DNA. Nature 444: 330–6.

39. Turcatti G, Romieu A, Fedurco M, et al. 2008. A new class of cleavable fluorescent nucleotides: synthesis and optimization as reversible terminators for DNA sequencing by synthesis. Nucleic Acids Res 36: e25.

40. Fedurco M, Romieu A, Williams S, et al. 2006. BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Res 34: e22.

41. Kircher M, Stenzel U, Kelso J. 2009. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol 10: R83.

42. Dohm JC, Lottaz C, Borodina T, et al. 2008. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 36: e105.

43. Erlich Y, Mitra PP, delaBastide M, et al. 2008. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nat Methods 5: 679–82.

44. Rougemont J, Amzallag A, Iseli C, et al. 2008. Probabilistic base calling of Solexa sequencing data. BMC Bioinf 9: 431.

45. Drmanac R, Sparks AB, Callow MJ, et al. 2010. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327: 78–81.

46. Applied Biosystems. A TheoreticalUnderstanding of 2 Base Color Codes andIts Application to Annotation, ErrorDetection, and Error Correction. WhitePaper SOLiDTM System; 2008.

47. Dimalanta ET, Zhang L, Hendrickson CL, et al. 2009. Increased Read Length on the SOLiD™ Sequencing Platform. Poster SOLiD™ System.

48. Zhu Z, Waggoner AS. 1997. Molecular mechanism controlling the incorporation of fluorescent nucleotides into DNA by PCR. Cytometry 28: 206–11.

49. Bowers J, Mitchell J, Beer E, et al. 2009. Virtual terminator nucleotides for next-generation DNA sequencing. Nat Methods 6: 593–5.

50. Pushkarev D, Neff NF, Quake SR. 2009. Single-molecule sequencing of an individual human genome. Nat Biotechnol 27: 847–52.

51. Reinhardt JA, Baltrus DA, Nishimura MT, et al. 2009. De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Res 19: 294–305.

52. Diguistini S, Liao NY, Platt D, et al. 2009. De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biol 10: R94.

53. Miller JR, Delcher AL, Koren S, et al. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24: 2818–24.

54. Chen W, Ullmann R, Langnick C, et al. 2009. Breakpoint analysis of balanced chromosome rearrangements by next-generation paired-end sequencing. Eur J Hum Genet DOI: 10.1038/ejhg.2009.21118 [Epub ahead of print].

55. Zimin AV, Delcher AL, Florea L, et al. 2009. A whole-genome assembly of the domestic cow, Bos taurus. Genome Biol 10: R42.

56. Zhou X, Su Z, Sammons RD, et al. 2009. Novel software package for cross-platform transcriptome analysis (CPTRA). BMC Bioinf. 11: S16.

57. Kim JI, Ju YS, Park H, et al. 2009. A highly annotated whole-genome sequence of a Korean individual. Nature 460: 1011–5.

58. Linsen SE, de Wit E, Janssens G, et al. 2009. Limitations and possibilities of small RNA digital gene expression profiling. Nat Methods 6: 474–6.

59. Quail MA, Swerdlow H, Turner DJ. 2009. Improved protocols for the Illumina Genome Analyzer sequencing system. Curr Protoc Hum Genet Chapter 18: Unit 18.2.

60. Briggs AW, Stenzel U, Meyer M, et al. 2009. Removal of deaminated cytosines and detection of in vivo methylation in ancient DNA. Nucleic Acids Res 38(6): e87 [Epub ahead of print].

61. Maricic T, Pääbo S. 2009. Optimization of 454 sequencing library preparation from small amounts of DNA permits sequence determination of both DNA strands. Biotechniques 46: 51–2, 54–7.

62. Rohland N, Hofreiter M. 2007. Comparison and optimization of ancient DNA extraction. Biotechniques 42: 343–52.

63. Wheeler DA, Srinivasan M, Egholm M, et al. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452: 872–6.

64. Harismendy O, Ng PC, Strausberg RL, et al. 2009. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 10: R32.

65. Chaisson MJ, Brinza D, Pevzner PA. 2009. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res 19: 336–46.

66. Krause J, Briggs AW, Kircher M, et al. 2009. A complete mtDNA genome of an early modern human from Kostenki, Russia. Curr Biol 20: 231–6.

67. Gnirke A, Melnikov A, Maguire J, et al. 2009. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 27: 182–9.

Bioessays 32: 524–536, © 2010 WILEY Periodicals, Inc.

68. Hodges E, Rooks M, Xuan Z, et al. 2009. Hybrid selection of discrete genomic intervals on custom-designed microarrays for massively parallel sequencing. Nat Protoc 4: 960–74.

69. Briggs AW, Good JM, Green RE, et al. 2009. Targeted retrieval and analysis of five Neandertal mtDNA genomes. Science 325: 318–21.

70. Meyer M, Stenzel U, Hofreiter M. 2008. Parallel tagged sequencing on the 454 platform. Nat Protoc 3: 267–78.

71. Meyer M, Stenzel U, Myles S, et al. 2007. Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res 35: e97.

72. Erlich Y, Chang K, Gordon A, et al. 2009. DNA Sudoku – harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Res 19: 1243–53.

73. Meyer M, Kircher M. 2010. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc DOI: 10.1101/pdb.prot5448.

74. Pop M, Salzberg SL. 2008. Bioinformatics challenges of new sequencing technology. Trends Genet 24: 142–9.

75. Richter BG, Sexton DP. 2009. Managing and analyzing next-generation sequence data. PLoS Comput Biol 5: e1000369.

76. Quail MA, Kozarewa I, Smith F, et al. 2008. A large genome center’s improvements to the Illumina sequencing system. Nat Methods 5: 1005–10.

77. Batley J, Edwards D. 2009. Genome sequence data: management, storage, and visualization. Biotechniques 46: 333–4, 336.

78. IBM Research. 2009. IBM research aims to build nanoscale DNA sequencer to help drive down cost of personalized genetic analysis. In Loughran M, ed.; Press Releases, Vol. 2009. New York: IBM.

79. Eid J, Fehr A, Gray J, et al. 2009. Real-time DNA sequencing from single polymerase molecules. Science 323: 133–8.

80. Astier Y, Braha O, Bayley H. 2006. Toward single molecule DNA sequencing: direct identification of ribonucleoside and deoxyribonucleoside 5′-monophosphates by using an engineered protein nanopore equipped with a molecular adapter. J Am Chem Soc 128: 1705–10.

81. Albertorio F, Hughes ME, Golovchenko JA, et al. 2009. Base dependent DNA-carbon nanotube interactions: activation enthalpies and assembly-disassembly control. Nanotechnology 20: 395101.

82. Medvedev P, Stanciu M, Brudno M. 2009. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6: S13–20.

83. Pepke S, Wold B, Mortazavi A. 2009. Computation for ChIP-seq and RNA-seq studies. Nat Methods 6: S22–32.

84. Flicek P, Birney E. 2009. Sense from sequence reads: methods for alignment and assembly. Nat Methods 6: S6–12.

85. Park PJ. 2009. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 10: 669–80.

86. Wall PK, Leebens-Mack J, Chanderbali AS, et al. 2009. Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics 10: 347.

87. Holt RA, Jones SJ. 2008. The new paradigm of flow cell sequencing. Genome Res 18: 839–46.

88. Dalca AV, Brudno M. 2010. Genome variation discovery with high-throughput sequencing data. Brief Bioinf. 11: 3–14.
