
proovread: large-scale high accuracy PacBio correction through iterative short read consensus

Thomas Hackl 1,2, Rainer Hedrich 1, Jörg Schultz 2, and Frank Förster 2,∗

1 Department for Molecular Plant Physiology and Biophysics, University of Würzburg, Julius-von-Sachs-Institut, Julius-von-Sachs-Platz 2, 97082 Würzburg

2 Department of Bioinformatics, University of Würzburg, Biocenter, Am Hubland, 97074 Würzburg

Motivation:

Today, the base code of DNA is mostly determined through sequencing by synthesis as provided by the Illumina sequencers. Although highly accurate, resulting reads are short, making their analyses challenging. Recently, a new technology, Single Molecule Real-Time (SMRT) sequencing, was developed which could address these challenges as it generates reads of several thousand bases. However, its broad application has been hampered by a high error rate. Therefore, hybrid approaches which use high quality short reads to correct erroneous SMRT long reads have been developed. Still, current implementations have great demands on hardware, work only in well-defined computing infrastructures and reject a substantial amount of reads. This limits their usability considerably, especially in the case of large sequencing projects.

Results:

Here we present proovread, a hybrid correction pipeline for SMRT reads, which can be flexibly adapted on existing hardware and infrastructure from a laptop to a high performance computing cluster. On genomic and transcriptomic test cases covering Escherichia coli, Arabidopsis thaliana and human, proovread achieved accuracies up to 99.9 % and outperformed the existing hybrid correction programs. Furthermore, proovread corrected sequences were longer and the throughput was higher. Thus, proovread combines the most accurate correction results with an excellent adaptability to the available hardware. It will therefore increase the applicability and value of SMRT sequencing.

∗ to whom correspondence should be addressed

Availability:

proovread is available at the following URL: http://www.bioinfo.biozentrum.uni-wuerzburg.de/index.php?id=159865

Contact:

[email protected]

Supplementary information:

Supplementary data are available at Bioinformatics online.

1 Introduction

Looking back just a decade, sequencing a genome was a time consuming and expensive endeavour. The emergence of second generation sequencers and their sequencing by synthesis has changed this drastically, thereby revolutionizing molecular biology. Today, a single run of a HiSeq 2500 can generate as much as 600 Gbp of high quality output data, which covers a human genome 200×. Unfortunately, this new technology came with a drawback. Compared to the traditional Sanger sequencing, resulting reads are short (150 bp). This became a major challenge for the assembly, especially in the case of large, repetitive genomes. Accordingly, a plenitude of short read assemblers have been developed, e.g. ALLPATHS-LG (Gnerre et al., 2011), the Celera Assembler (Myers et al., 2000; Miller et al., 2008), and SOAPdenovo (R. Li et al., 2010). Still, repeats longer than the short reads (SRs) cannot be resolved and therefore the genome cannot be reconstructed in these regions (Gnerre et al., 2011). For the assembly of repetitive genomes, a combination of short and long insert libraries and additional fosmid sequencing is therefore recommended (Gnerre et al., 2011).

In 2009, however, a new long read (LR) sequencing technology emerged: Single Molecule Real-Time (SMRT) sequencing. Here, the incorporation of nucleotides in a DNA molecule is recorded during synthesis for several thousand single template strands simultaneously (Eid et al., 2009). With the latest chemistry, this approach delivers reads longer than 4 kbp, enabling the assembly of larger repeat structures (Roberts, Carneiro, and Michael C Schatz, 2013). Additionally, amplification can be omitted. Since 2011, SMRT based sequencing is commercially available from Pacific Biosciences of California.

Their third generation sequencer, PacBio RS II, generates to date up to 400 Mbp per sequencing run.

Still, the advantages of third generation sequencing come at a price. The accuracy of its LRs falls far behind that of short second generation reads. While current Illumina instruments offer a sequencing accuracy of 99 % (Dohm et al., 2008), PacBio RS II achieves only 80 % to 85 % (Ono, Asai, and Hamada, 2013; Ross et al., 2013). Furthermore, the error models of the two technologies differ. Whereas Illumina reads mainly contain miscalled bases with increasing frequency towards read ends, SMRT generates primarily insertions (10 %) and deletions (5 %) in a random pattern (Ross et al., 2013). Since SMRT uses circular templates, accuracy can be increased for shorter sequences (< 1 kbp). By sequencing each position multiple times, circular consensus sequencing (CCS) can generate a consensus with an accuracy of 99 %. However, this approach substantially decreases read length (Travers et al., 2010), erasing one of the major advantages of SMRT sequencing. In addition to this technical approach, two different methods for in-silico correction of SMRT reads have been developed. (i) The hierarchical genome-assembly process (HGAP) uses shorter SMRT reads contained within longer reads to generate pre-assemblies and to calculate consensus sequences (Chin et al., 2013). (ii) PacBioToCA (Koren et al., 2012) and LSC (Au et al., 2012) utilize Illumina short reads in a hybrid approach to correct SMRT reads. Indeed, these approaches result in higher quality LRs. Nevertheless, both approaches also have limitations. In the case of HGAP, a coverage of 80× to 100× has been recommended (Chin et al., 2013). This might not be an issue when targeting smaller, e.g. bacterial, genomes, but for larger, especially eukaryotic, genomes this would imply sequencing several hundreds or thousands of SMRT cells. Obviously, this increases the costs of the genome project substantially.

For a hybrid correction, as implemented by PacBioToCA and LSC, millions of SR to LR alignments have to be computed and processed. As these alignments have to tolerate error rates up to 20 %, this can be a formidable computational challenge. These computational demands are usually met using massive parallelization on high performance computers (HPCs) or computer grids, providing dozens or hundreds of computer nodes. Accordingly, LSC and PacBioToCA are designed to run on HPCs. PacBioToCA also works on computer clusters providing Sun Grid Engine (SGE) as a queueing system. Still, both require a large amount of memory during the correction process of large genomes. This can become a considerable limitation, as computing nodes in a grid are typically equipped only with limited memory. A second point which can determine the success of a genome project is the throughput of the correction method, i.e. the percentage of the bases in corrected reads which can be used in the assembly. The lower the throughput, the more material needs to be sequenced in the first place in order to achieve sufficient coverage for assembly after correction. As an example, PacBioToCA lost more than 40 % of the input when correcting sequences from Escherichia coli (Koren et al., 2012), which is a considerable loss of data. Ultimately, we expect that with the increasing use of SMRT sequencing more genomes and transcriptomes with unusual features will be sequenced. Thus, a correction pipeline developed today should be flexible enough to be easily adapted to these new use cases. Whereas LSC was developed mainly for the correction of (human) transcriptomic data, PacBioToCA can handle different data sets, but is part of the Celera WGS pipeline and requires the installation of the complete package. Distributed computing is restricted to the now commercial SGE.

These limitations motivated us to implement a new SMRT sequencing correction pipeline. The goal was high flexibility such that the pipeline can (i) run on standard computers as well as computer grids and (ii) be easily adapted to different use cases. Obviously, these objectives should not come at the cost of accuracy, length of corrected reads or throughput.

2 Implementation

2.1 Mapping—Sensitive and trusted hybrid alignments

Due to the high rate and distinctive nature of SMRT sequencing errors, calculating hybrid alignments is challenging. Mapping SRs with common alignment models and mapping programs results in either no alignments or, at best, spurious ones. For proovread, we thus devised an adapted alignment scoring scheme reflecting SMRT sequencing features. The underlying model comprises three basic rules: (1) The penalty for mismatches is several times higher than for gaps. (2) The cost of extending a gap is higher than that of opening a gap. (3) The costs for gaps in the LR, which correspond to deletions, are about twice as high as for gaps in the SR, which represent insertions. By default, proovread uses SHRiMP2 (David et al., 2011) as mapping software. Its versatile interface allowed us to completely implement the hybrid scoring model through different parameter settings. Optimized sets for different applications have been determined empirically on simulated test data (Suppl. Table S1). All results presented here have been generated using these settings with SHRiMP2 version 2.2.3.
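The three scoring rules can be made concrete with a small example. The following Python sketch scores a given pairwise alignment under such a scheme. All numeric penalties and the names `HybridScoring` and `score_alignment` are illustrative assumptions; they are not the empirically determined settings of Suppl. Table S1, and the sketch does not reproduce how SHRiMP2 charges gaps internally. Here the first column of a gap is charged the opening penalty and every further column the extension penalty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridScoring:
    """Illustrative penalties only; the values are assumptions, not proovread defaults."""
    match: int = 5
    mismatch: int = -11      # rule 1: several times more costly than a gap
    gap_open_sr: int = -1    # gap in the SR, i.e. an insertion in the LR
    gap_ext_sr: int = -2     # rule 2: extending costs more than opening
    gap_open_lr: int = -2    # rule 3: gaps in the LR (deletions) cost about twice as much
    gap_ext_lr: int = -4


def score_alignment(lr: str, sr: str, s: HybridScoring = HybridScoring()) -> int:
    """Score two equal-length gapped strings ('-' marks a gap)."""
    if len(lr) != len(sr):
        raise ValueError("aligned strings must have equal length")
    total = 0
    prev = None  # which sequence carried a gap in the previous column
    for a, b in zip(lr, sr):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-":
            total += s.gap_ext_lr if prev == "lr" else s.gap_open_lr
            prev = "lr"
        elif b == "-":
            total += s.gap_ext_sr if prev == "sr" else s.gap_open_sr
            prev = "sr"
        else:
            total += s.match if a == b else s.mismatch
            prev = None
    return total
```

With these assumed values, a single mismatch lowers the score far more than a single gap, so an aligner using such a scheme prefers to explain a discrepancy by short isolated indels, which matches the SMRT error profile described above.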

Additionally, proovread is able to directly digest mapping data provided in SAM format (H. Li et al., 2009), leaving the mapping procedure entirely up to the user. As an alternative to SHRiMP2, Bowtie2 (Langmead and Steven L Salzberg, 2012) is supported as an experimental feature. However, corrections using Bowtie2 lagged behind due to a limited set of parameters regarding scoring and sensitivity. LR and SR data can be provided in either FASTA or FASTQ format. Usage of trimmed (e.g. sickle, https://github.com/najoshi/sickle) or corrected SRs (e.g. Quake (Kelley, Michael C. Schatz, and Steven L. Salzberg, 2010)) is recommended as it increases correction accuracy.

Figure 1: Generic work flow of the proovread correction pipeline. Box 1: Simplified representation of the correction-by-short-read-consensus approach; short reads (dark green bars) are mapped onto an erroneous and chimeric (yellow X) long read (blue bar); black strokes indicate sequencing errors and non-matching alignment positions. Regions with an unusually high amount of errors prevent short read mappings. Base call qualities are represented below the long read (light green curve). During consensus generation the majority of errors are removed and possible chimeric break points are identified. New quality scores are inferred from coverage and the composition of the consensus at each position. Processed reads and chimera annotations are written to files. Box 2: The resulting reads are trimmed using a quality cutoff and the chimera annotations, generating proovread's primary output: high accuracy long reads.

To distinguish valid from spurious alignments, the obtained mapping results require further evaluation. In general, the number of reported alignments differs strongly between different regions of the reference sequences. Repetitive regions tend to collect a vast amount of SRs, in contrast to non-repetitive regions with a high local error rate, which might not trigger alignment computation at all. In addition, the alignment score distribution reported by the mapping program fluctuates along the references with the varying amount of errors. We therefore assess length normalized scores in a localized context. For this purpose, LRs are internally represented by a consecutive series of 20 bp long bins. Each SR mapping is allocated to a bin by the position of its centre. Only the highest scoring alignments of each bin, not the overall highest scoring alignments, up to the coverage cut-off are considered for the next step, the calculation of the consensus sequence.
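The bin-wise selection of alignments can be sketched as follows. This is a minimal Python illustration, not the pipeline's implementation: the bin size of 20 bp comes from the text, whereas the `Alignment` record, the function name and the parameter `max_per_bin` (a simple stand-in for the coverage cut-off) are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple

BIN_SIZE = 20  # bp, as stated in the text


class Alignment(NamedTuple):
    start: int    # 0-based start on the long read
    end: int      # exclusive end on the long read (must be > start)
    score: float  # raw alignment score reported by the mapper


def select_per_bin(alignments, max_per_bin=3):
    """Keep the best length-normalized alignments of every bin.

    max_per_bin is a hypothetical stand-in for the coverage cut-off.
    """
    bins = defaultdict(list)
    for aln in alignments:
        centre = (aln.start + aln.end) // 2
        bins[centre // BIN_SIZE].append(aln)  # allocate by centre position
    kept = {}
    for bin_id, members in bins.items():
        # rank by score per aligned base, best first
        members.sort(key=lambda a: a.score / (a.end - a.start), reverse=True)
        kept[bin_id] = members[:max_per_bin]
    return kept
```

Because the ranking is done per bin, a repeat region that attracts many high scoring SRs cannot crowd out the few, lower scoring alignments in an error-rich region elsewhere on the same LR.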

2.2 Consensus call with quality computation and chimera detection

To compute the consensus, a matrix with one column per nucleotide is composed for each LR. This matrix is filled with SR bases according to the alignment information obtained from the mapper (Figure 1). Empty cells represent insertions on the LR; multiple nucleotides within one cell indicate deletions. The gap favouring scoring scheme, which is needed to accurately model SMRT sequencing errors, is prone to incorrectly introduce gaps close to the end of an alignment. Thus, it is crucial that alignments with gaps close to their ends are trimmed during matrix initialization. The consensus is computed by a majority vote in each column. If no SR bases are available, the original LR sequence is kept. The support of the called base is used as confidence criterion and converted to a Phred-like quality score (Suppl. Equation S1, Suppl. Table S2).
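A column-wise majority vote of this kind can be sketched in a few lines of Python. The data layout (one list of SR states per LR position) and the function name are assumptions for illustration. The conversion of support into a quality score is defined in Suppl. Equation S1 and is deliberately not reproduced here; the sketch only returns the fraction of SRs supporting the called state.

```python
from collections import Counter


def call_consensus(long_read, columns):
    """Majority vote per column.

    columns[i] holds one string per short read covering LR position i:
    a single base, "" (the SR has no base here, i.e. an insertion in the LR),
    or several bases (the SR carries bases that are missing from the LR).
    Returns the consensus sequence and the per-column support fraction.
    """
    if len(columns) != len(long_read):
        raise ValueError("need exactly one column per long read base")
    consensus, support = [], []
    for lr_base, votes in zip(long_read, columns):
        if not votes:                  # no short read evidence: keep the LR base
            consensus.append(lr_base)
            support.append(0.0)
            continue
        state, count = Counter(votes).most_common(1)[0]
        consensus.append(state)        # "" drops the LR base, "TA" adds a base
        support.append(count / len(votes))
    return "".join(consensus), support
```

For example, for the LR `ACGTT` with the third base unsupported by most SRs and an extra `A` seen after the fourth base, the vote removes the `G`, inserts the `A` and keeps the last base, which has no SR coverage, unchanged.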

More than 1 % of raw SMRT single pass reads are chimeric (Fichot and Norman, 2013). The majority of these reads emerge through "misligation" of LRs during SMRT cell library construction. The location of the fusion within the chimeric read is defined as the chimeric break point. Here, adjacent sequences derive from different locations of the sequenced reference. Thus, SRs only align to one side of the break point with high identity. The second part of the alignment is enriched for gaps and mismatches. In combination with a score threshold, alignments of reads that largely overlap break points are highly unlikely. However, the sensitive mapping parameters allow read ends to overlap break points. Therefore, break points are hard to localize considering only per base coverage. For detection of possible break points, proovread instead considers the per bin coverage generated during the consensus call in the finishing correction cycle. The bin coverage is obtained by computing the total number of bases of all reads assigned to one bin. Only reads centred close to a break point contribute to the coverage of the corresponding bin, while reads that only partially overlap the break point do not. Hence, in contrast to per base coverage, rapid decreases in bin coverage are strong indicators for break points. We further verify possible break points by analysing the error pattern in the alignments overlapping with the break point using an entropy based approach (Figure 2). At a true break point an increase in entropy is observed. Such confirmed coordinates are stored to a chimera annotation file and subsequently used during trimming.
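The difference between per base and per bin coverage can be illustrated with a short Python sketch. The bin coverage follows the definition in the text (sum of the bases of all reads whose centre falls into the bin). The rule used to flag a "rapid decrease", a bin with less than `drop_ratio` times the coverage of both neighbours, is a hypothetical simplification chosen for the example, as are the function names.

```python
def bin_coverage(alignments, read_length, bin_size=20):
    """Sum the aligned bases of all short reads whose centre falls into each bin.

    alignments is a list of (start, end) pairs on the long read, end exclusive.
    """
    n_bins = (read_length + bin_size - 1) // bin_size
    coverage = [0] * n_bins
    for start, end in alignments:
        centre = (start + end) // 2
        coverage[centre // bin_size] += end - start
    return coverage


def candidate_break_bins(coverage, drop_ratio=0.2):
    """Return inner bins whose coverage is below drop_ratio times both neighbours.

    drop_ratio is a hypothetical threshold chosen for illustration.
    """
    hits = []
    for i in range(1, len(coverage) - 1):
        lowest_neighbour = min(coverage[i - 1], coverage[i + 1])
        if lowest_neighbour > 0 and coverage[i] < drop_ratio * lowest_neighbour:
            hits.append(i)
    return hits
```

Reads whose ends merely reach across a break point still add to the per base coverage there, but their centres lie in neighbouring bins, so the bin at the break point stays nearly empty.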

2.3 Quality and chimera trimming

The reads obtained from the consensus step are considered untrimmed corrected LRs. These reads are returned in FASTQ format, with consensus Phred scores ASCII encoded in the quality line. This data comprises high accuracy regions as well as uncorrected regions and unprocessed chimeras. Using a window based quality filter we identify and trim low quality regions from these reads. Chimeric reads are cut according to annotations. The trimming procedure can easily be rerun on the untrimmed corrected reads to adjust strictness according to the user's needs.
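A window based quality filter can take many forms; the following Python sketch shows one simple variant under stated assumptions. It keeps every position that is covered by at least one window whose mean quality reaches `min_mean` and reports the resulting regions if they are at least `min_length` long. The window size, thresholds and function name are hypothetical and do not correspond to proovread's actual trimming parameters.

```python
def high_quality_regions(quals, window=20, min_mean=20.0, min_length=50):
    """Return (start, end) intervals covered by windows with mean quality >= min_mean."""
    n = len(quals)
    if n < window:
        return []
    keep = [False] * n
    window_sum = sum(quals[:window])
    for start in range(n - window + 1):
        if start > 0:  # slide the window by one position
            window_sum += quals[start + window - 1] - quals[start - 1]
        if window_sum >= min_mean * window:
            for i in range(start, start + window):
                keep[i] = True
    regions, run_start = [], None
    for i, flag in enumerate(keep + [False]):  # sentinel closes the last run
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= min_length:
                regions.append((run_start, i))
            run_start = None
    return regions
```

Since the filter only needs the quality line of the untrimmed FASTQ records, it can be rerun with stricter or more lenient thresholds without repeating the correction. Note that this variant keeps a few low quality bases at region borders, as long as the window containing them still passes the threshold.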


Figure 2: Detection of chimera break points. Box 1: Mapping of short reads (short light blue and green bars) onto a pre-corrected long read (blue bar) containing a chimeric break point (yellow X); black strokes indicate non-matching alignment positions. While not detectable by per-base coverage (grey curve), break point candidate sites (c1, c2) can be inferred from decreases in bin coverage (green curve, red arrows). Reads overlapping candidate sites from the right hand side (dark green bars) and reads overlapping from the left hand side (light green bars) are considered for further classification of each individual site. Box 2: Entropy profiles derived from the short read alignment matrix at the two candidate sites. Each position comprises four entropy values: HT calculated from the entire read set (grey bar), HU for the reads placed upstream (dark green bar), HD for the reads placed downstream (light green bar), and ∆H (red bar). ∆H is the difference of HT and the greater value of HU and HD. An accumulation of positive ∆H values (yellow arrows) is observed at true chimera locations.
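The quantity ∆H from the Figure 2 caption can be written down directly. The Python sketch below computes it for a single matrix column, assuming Shannon entropy in bits over the states observed in that column; the choice of logarithm base, the handling of empty columns and the function names are assumptions, since the text only defines ∆H as HT minus the greater of HU and HD.

```python
import math
from collections import Counter


def shannon_entropy(states):
    """Shannon entropy in bits of the states observed in one matrix column."""
    if not states:
        return 0.0
    total = len(states)
    return -sum((c / total) * math.log2(c / total) for c in Counter(states).values())


def delta_entropy(upstream, downstream):
    """Delta H = H_T - max(H_U, H_D) for one column, following the Figure 2 caption."""
    h_total = shannon_entropy(list(upstream) + list(downstream))
    return h_total - max(shannon_entropy(upstream), shannon_entropy(downstream))
```

If upstream and downstream reads each agree among themselves but disagree with each other, as expected at a chimeric junction, ∆H is positive. If the disagreement is spread evenly over both groups, as expected for random sequencing or alignment errors, ∆H stays at or near zero.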


2.4 Iterative correction

In order to entirely cover a set of erroneous LRs with SR alignments, the level of sensitivity has to match the regions with the locally highest error rates, even though most regions exhibit much lower rates. Mapping at the required level of sensitivity, especially on large scale data, is computationally demanding and time consuming. Therefore, proovread provides an iterative procedure for mapping and correcting with successively increasing sensitivity (Figure 3). It consists of three pre-correction cycles and one finishing cycle. During pre-correction, only subsets of the SRs are utilized, by default 20 %, 30 % and 50 %, respectively. These subsets are generated through systematic sampling such that samples from successive cycles complement each other.
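One way to obtain complementary subsets by systematic sampling is to assign reads to cycles by their running number. The Python sketch below illustrates this idea; it is an assumption about how such a scheme could look, not a description of proovread's sampler, and the function names and the `shares` parameter are hypothetical.

```python
from itertools import accumulate


def assign_cycle(index, shares=(2, 3, 5)):
    """Map the running number of a short read to a pre-correction cycle (0-based).

    shares are parts per sum(shares); (2, 3, 5) gives 20 %, 30 % and 50 %.
    """
    slot = index % sum(shares)
    for cycle, upper in enumerate(accumulate(shares)):
        if slot < upper:
            return cycle
    raise AssertionError("unreachable: slot is always below sum(shares)")


def subsample(reads, cycle, shares=(2, 3, 5)):
    """Yield the reads of one cycle without loading the whole set into memory."""
    for index, read in enumerate(reads):
        if assign_cycle(index, shares) == cycle:
            yield read
```

Every read belongs to exactly one cycle, so the three subsets never overlap and together cover the complete SR set. Because the generator streams over the input, a sample can be passed directly to a mapper without being written to disk first.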

In the first cycle, the SR subset is mapped at a moderate level of sensitivity and thus at high speed. Subsequently, regions with sufficient SR support are identified, pre-corrected and masked. As a result, the effective search space is substantially reduced prior to the next cycles. As masking is limited, unmasked edges of pre-corrected regions can act as seeds for mappings during the next cycle, facilitating extensions of masked regions. The process is repeated in cycles two and three, with larger and complementary SR sub-samples at increased sensitivity. We tuned parameters such that on average more than 80 % of the reference is masked after the first two cycles. Consequently, mapping at highest sensitivity during cycle three is limited to remaining, error enriched islands. Finally, the entire SR set is mapped at high specificity to the unmasked, pre-corrected LRs to merge and refine all previously performed corrections. The time benefit of this iterative approach far outweighs the computational overhead generated by repeated mapping and correction. In summary, proovread combines high sensitivity with low run time.
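The masking step with unmasked edges can be sketched as follows. This Python example masks every run of positions with sufficient coverage but leaves a margin at both ends of the run untouched so that it can seed alignments in the next cycle. The thresholds `min_cov` and `edge`, the use of `N` as mask character and the function name are assumptions for illustration.

```python
def mask_supported_regions(sequence, coverage, min_cov=5, edge=10, mask_char="N"):
    """Mask well-covered regions of a pre-corrected long read.

    Runs with coverage >= min_cov are masked except for `edge` bases on both
    sides; runs shorter than 2 * edge are therefore left untouched.
    """
    if len(sequence) != len(coverage):
        raise ValueError("need one coverage value per base")
    masked = list(sequence)
    i, n = 0, len(sequence)
    while i < n:
        if coverage[i] < min_cov:
            i += 1
            continue
        j = i
        while j < n and coverage[j] >= min_cov:
            j += 1
        # [i, j) is sufficiently supported; keep the edges as mapping seeds
        for k in range(i + edge, j - edge):
            masked[k] = mask_char
        i = j
    return "".join(masked)
```

After each cycle only the unmasked remainder has to be searched by the mapper, which is what makes the more sensitive and therefore more expensive settings of later cycles affordable.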

2.5 Correction of transcriptomic and other non-standard data

The default set-up of proovread is optimized for genomic data. During the initial pre-correction cycles, for example, only sub-samples of the provided SRs are used. For data with uniform coverage distribution, this optimization significantly reduces run time without any loss in overall correction efficiency. However, on data with uneven coverage distribution, e.g. derived from RNA-Seq experiments, the set-up can be improved. In order to sufficiently cover low abundance transcripts, increasing the sub-sample ratios is required. Such data specific adjustments can be achieved using the optional, comprehensive configuration of the pipeline. Moreover, customizable parameters are not limited to SR sampling, but include core settings for scoring schemes, iteration procedure and post-processing. With appropriate parameter settings, proovread can thus be easily tuned to correct data with entirely different characteristics, e.g. obtained from newly emerging LR sequencing technologies.

2.6 Scalability and Parallelization

Figure 3: Iterative correction by short read consensus. Short reads (dark green bars) are mapped onto an erroneous long read (blue bar) in four iterations. During pre-correction cycles (c1-c3) subreads are mapped with increasing sensitivity and with complementary subsamples of SRs (pie charts: 20 %, 30 % and 50 %). After each cycle a consensus is generated. Sufficiently covered regions are masked prior to the next iteration, reducing effective reference size for computationally more expensive mappings at increased sensitivity. In the final iteration (cf, 100 %), all available short read data is mapped onto unmasked, pre-corrected long reads at high specificity, resulting in a high accuracy consensus with trusted qualities and chimera annotations.

The advantages of SMRT LR sequencing render the technology useful for a variety of sequencing projects, ranging from specific studies on microbial genomes, viruses and plastids to large scale sequencing of eukaryotic genomes. The varying amount of data demands high flexibility in scalability and parallelization of the processing software. proovread provides several optimizations for this purpose. By design, LRs are corrected independently from each other and as a consequence, instances of proovread can operate on packages of LR data of adaptable size without affecting correction results (Suppl. Figure S1). The memory footprint of the pipeline increases linearly with the amount of supplied LR bases (Suppl. Figure S2). The amount of SR data, however, does not affect memory consumption, as our approach does not require the generation of SR indexes or kmer spectra. Therefore, by tuning the package size, the memory requirements of proovread can easily be adjusted to meet available hardware, regardless of the scale of the project. Thus, proovread can run efficiently on high memory multi-core machines as well as distributed cluster architectures with nodes of limited capabilities or even desktop PCs. Queueing engines like SGE (commercially available from http://www.univa.com/products/grid-engine.php) or Slurm (Jette, Yoo, and Grondona, 2002) can be used to distribute individual jobs.
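Because memory grows with the number of LR bases per instance, a package size in bases is a natural tuning knob. The following Python sketch splits a stream of LR records into packages with a bounded number of bases; the function name and the record layout as (name, sequence) pairs are assumptions for illustration, and how the packages are then submitted to SGE or Slurm is left out.

```python
def chunk_by_bases(records, max_bases):
    """Group (name, sequence) records into packages of at most max_bases bases.

    A single record longer than max_bases forms a package of its own.
    """
    chunk, size = [], 0
    for name, seq in records:
        if chunk and size + len(seq) > max_bases:
            yield chunk
            chunk, size = [], 0
        chunk.append((name, seq))
        size += len(seq)
    if chunk:
        yield chunk
```

Each package can then be corrected by an independent instance, for example one job per package on a cluster node with limited memory.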

Additionally, the number of temporary files is strongly reduced. The SR sub-samples are generated on-the-fly and directly fed to the mapper. The primary output of the mapper is again processed directly and only a pre-filtered, reduced version, already indexed for sorting, is written to disk. These optimizations decrease the required disk space as well as the read and write operations several fold.

3 Material and Methods

3.1 Data sets and preparation

All LR sequences, except the Homo sapiens transcriptome data, were obtained from the PacBio DevNet (http://www.pacb.com/devnet/). Here, we give an overview of the test data sets; more detailed information is provided in the Supplementary Material (Suppl. Table S3).

As bacterial data set, E. coli genomic sequences (available from PacBio DevNet as "E. coli K12 MG1655 Resequencing") of about 100 Mbp and an N50 of 4,082 bp were analyzed. These data were also used as benchmark by Koren et al. (2012). For the correction a subsample of a 100 bp Illumina library (SRA ERX002508) was used. The reference was the K12 strain of E. coli (Assembly GCA_000005845.1).

As first eukaryotic genomic data set about 126 Mbp of Arabidopsis thaliana (available from PacBio DevNet as "Arabidopsis P5C3"; N50 8,109 bp) were corrected with 100 bp Illumina sequences (SRA SRX158552). The TAIR10 assembly was used as reference (Assembly GCA_000001735.1).

The second genomic data set was 393 Mbp from a human sequencing project (available from PacBio DevNet as "H. sapiens 10× Sequence Coverage with PacBio data"; N50 9,938 bp). These sequences were corrected with SRs from the 1000 Genomes Project (SRA SRX246904, SRX246905, SRX246906, SRX246907, SRX247361, SRX247362), sampled to a 50× coverage. The reference sequence was the human hg19 assembly (Assembly GCA_000001405.12).

The applicability of proovread for the correction of transcriptomic data was tested with the human brain transcriptome set used by Au et al. (2012) (available from http://www.stanford.edu/~kinfai/human_cerebellum_PacBioLR.zip). It contained 138 Mbp of RNA-seq LRs with an N50 of 972 bp. SRs from the human Map 2.0 project (SRA ERX011200, ERX011186) were used for correction. The human hg19 assembly was used as reference (Assembly GCA_000001405.12).

Before correction all SRs were quality trimmed using sickle (https://github.com/najoshi/sickle). The normalization of the SRs for the human genome was achieved using the normalize-by-median.py script of the khmer package (Brown et al., 2012).

3.2 Correction programs and parameters

We evaluated the correction efficiency of proovread (v1.01) in comparison to the existing pipelines PacBioToCA (v7.0) and LSC (v0.3.1). The majority of the corrections were performed on an HPC offering 32 CPU cores and 192 GB memory. The correction of the genomic human data set was performed in a computer grid offering single nodes with 4 CPU cores and 8 GB memory. The queueing system Slurm was used to distribute jobs. The results were assessed in terms of correction accuracy, total throughput, N50, run time and required memory. The required memory and CPU time were monitored for each correction run by a custom script. LSC and PacBioToCA were run with 32 threads. proovread was run in eight independent processes, each using four mapping threads. The upper bounds for memory requirements were estimated by summing up the eight most memory intensive subtasks.

3.3 Quality assessment procedure

The quality of the correction results was assessed by an evaluation script (available on request from the authors). During the assessment the corrected LRs were remapped onto the reference sequence by GMAP (Wu and Watanabe, 2005). All LR sequences without a global alignment were realigned using exonerate v2.2.0 (Slater and Birney, 2005). Genomic LRs were aligned using the affine:bestfit model to assure global alignments to the reference. Exonerate does not offer global alignment with intron modeling, therefore transcriptomic LRs were aligned with the model est2genome and reduced gap costs to maximize the length of the local hits. All LR nucleotides without an alignment were considered as erroneous positions. Only reads with a unique mapping were used for the accuracy estimation. Reads were classified as chimeric by GMAP. Reads without an initial GMAP alignment were not considered for the accuracy assessment. The proportions of these ambiguous read placements for all corrections were low, with the exception of the A. thaliana correction with up to 17 % for LSC (Suppl. Table S4). The overall throughput and the N50 values for each correction method were calculated for the complete sequence set returned by the programs, respectively, regardless of their mapping result.
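N50 and throughput are simple functions of the read lengths and can be computed as in the following Python sketch. The function names are assumptions, and the code is not the authors' evaluation script; throughput is taken here as the percentage of input bases that are still present in the returned reads.

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def throughput(corrected_lengths, raw_lengths):
    """Percentage of input bases that are still present after correction."""
    return 100.0 * sum(corrected_lengths) / sum(raw_lengths)
```

Both values depend only on the lengths of the returned reads, which is why they can be reported for the complete output of each program independent of the mapping result.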

3.4 Estimation of the required SR coverage

The SR coverage has a major effect on mapping run time and correction efficiency. We empirically determined the optimal SR coverage for proovread using the E. coli data set. We selected random subsets of the SRs yielding an expected coverage of 10×, 20×, 35×, 50×, 75×, 100× and 200×, with larger sets including the preceding smaller set. The correction result of the 50× set was only slightly improved by increasing the expected coverage (Suppl. Figure S3 and S4). Indeed, memory consumption saturated at higher expected coverage (Suppl. Figure S5). Therefore, we used 50× coverage as a good trade-off between run time and accuracy, which is in agreement with Koren et al. (2012).
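Nested random subsets of this kind can be drawn by shuffling the reads once and taking prefixes of increasing size. The Python sketch below assumes reads of uniform length L, so that an expected coverage c on a genome of size G requires c × G / L reads; the function names, the fixed seed and the uniform read length are assumptions for illustration.

```python
import random


def reads_for_coverage(coverage, genome_size, read_length):
    """Number of reads giving the expected coverage: coverage * G / L, rounded up."""
    return -(-coverage * genome_size // read_length)


def nested_subsets(read_ids, coverages, genome_size, read_length, seed=0):
    """Return {coverage: ids}, where every larger subset contains the smaller ones."""
    order = list(read_ids)
    random.Random(seed).shuffle(order)  # one shuffle, then prefixes of growing size
    subsets = {}
    for cov in sorted(coverages):
        n = reads_for_coverage(cov, genome_size, read_length)
        if n > len(order):
            raise ValueError(f"not enough reads for {cov}x coverage")
        subsets[cov] = order[:n]
    return subsets
```

Taking prefixes of a single random permutation guarantees that each larger set includes the preceding smaller one, so differences between coverage levels are due to the added reads only.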

4 Results

Today, the major application of the PacBio RS II is the sequencing of bacterial genomes. Therefore, we used an E. coli data set as first test case (Table 1). Mapping of the corrected reads onto the E. coli reference genome revealed accuracies of 99.98 % for proovread, followed by 99.93 % (PacBioToCA) and 88.79 % (LSC). LSC correction resulted in an N50 of 4,158 bp, which was higher than in the uncorrected reads, as LSC returns only LRs with SR mappings and omits all other LRs. Therefore, the starting N50 for LSC was in fact 4,450 bp. proovread corrected reads had an N50 of 2,147 bp, PacBioToCA corrected reads of 1,639 bp. In terms of throughput, proovread recovered 81 % of the input, while LSC and PacBioToCA returned 76 % and 73 %, respectively. The required run time was shortest for PacBioToCA (2.6 h), while proovread took 7× longer, and LSC was most time consuming (25× of PacBioToCA). The memory consumption was in the same range for LSC and proovread (18.4 GB and 19.6 GB), while PacBioToCA required more than twice as much (44 GB). This differs from the originally published 2.1 GB (Koren et al., 2012) as here PacBioToCA ran multi threaded.

With their outstanding length, SMRT sequencing reads will be of increasing use for the sequencing of eukaryotic genomes. To evaluate the performance of proovread on this type of data, we used the comparably small genome of A. thaliana as a second test case (Table 1). For this data set, correction with LSC exceeded the maximum available memory (192 GB) and therefore did not finish. proovread and PacBioToCA achieved a high correction accuracy of 98.48 % and 97.44 %, respectively. Starting with an N50 of 8,109 bp in the raw reads, proovread returned an N50 of 2,714 bp and PacBioToCA of 1,528 bp. The throughput was 79 % for proovread and 46 % for PacBioToCA. The latter required a run time of 481 h, while proovread took 8.7× longer. Concerning the memory consumption, proovread required 43.6 GB and PacBioToCA 27.5 GB.

To analyse the influence of the size and the complexity of the sequenced genome on the correction result, we used the human genome as third test case (Table 1). For run time reasons we corrected this set on a cluster infrastructure. Unfortunately, the available grid provided only Slurm as queuing system, whereas PacBioToCA is implemented only for SGE. Furthermore, as is typical for larger grids, each node offers only four CPU cores and 8 GB memory. This does not meet the hardware requirements of LSC or of a standalone PacBioToCA correction. Therefore, only the performance of proovread was analyzed for this LR data set. Here, we examined two different SR data sets: first a 50× coverage SR sample and second a digitally normalized version of this data. The normalization process required 37 GB memory and took 49 h. The correction accuracy for the non-normalized and the normalized SR set was 98.9 % and 99.1 %, with an N50 of 2,219 bp and 1,327 bp, respectively. The throughput was 88.4 % for the complete SR set and 79.5 % for the normalized data set. The total run time was shorter for the normalized set (1,009 d vs. 2,288 d) and the required memory was similar for both cases (33.6 GB and 29.3 GB).

Table 1: Benchmark correction results for proovread, PacBioToCA, and LSC: Ec – Escherichia coli, At – Arabidopsis thaliana, Hsg – Homo sapiens genome, Hst – H. sapiens transcriptome. A "-" marks cells without a reported value.

Program            Data set   accuracy [%]   N50 [bp]   throughput [%]   run time [h or d]   memory [GB]
uncorrected        Ec         -              4,082      -                -                   -
uncorrected        At         -              8,109      -                -                   -
uncorrected        Hsg        -              9,938      -                -                   -
uncorrected        Hst        -              1,234      -                -                   -
proovread def.     Ec         99.98          2,147      81               19.3 h              19.6
proovread def.     At         98.48          2,714      79               173 d               43.6
proovread def.     Hsg        98.90          2,219      88               2,288 d             40.1
proovread def.     Hst        99.87          821        48               84 h                8.1
proovread norm.    Hsg        99.10          1,327      80               1,009 d             33.6
proovread adapt.   Hst        99.88          730        70               282 h               9.9
PacBioToCA         Ec         99.93          1,639      73               2.6 h               44.0
PacBioToCA         At         97.44          1,528      46               20 d                27.5
PacBioToCA         Hst        98.67          193        29               49 h                45.4
LSC                Ec         88.79          4,158      76               66.2 h              18.4
LSC                Hst        95.43          908        60               137 h               18.4

In addition to genomics studies, SMRT sequencing is frequently used for transcriptome analyses. With its LRs, chances are high that a whole transcript is sequenced in a single read, thereby avoiding the assembly process. To evaluate the applicability of proovread for this type of data, we used human brain transcriptomic data as a final test case (Table 1). This data set was previously used to assess the performance of LSC (Au et al., 2012), although here we used a more recent version of the program (0.3.1). proovread was run with two different parameter settings: (i) the default settings and (ii) a modified set tailor-made for transcriptomic data (see Implementation). Indeed, the correction accuracy was highest with the default and modified proovread settings (99.87 % and 99.88 %). PacBioToCA and LSC performed worse (98.67 % and 95.43 %). Starting with an N50 of 972 bp, LSC returned the highest N50 (908 bp) followed by default proovread (821 bp), modified proovread (730 bp) and PacBioToCA (193 bp). Modified proovread resulted in the highest throughput (69.5 % of input data), followed by LSC (59.6 %), proovread with default settings (47.5 %), and PacBioToCA (29 %). The shortest run time was required by PacBioToCA (49.1 h). LSC took 1.35× longer, while standard proovread ran 1.71× longer. Modified proovread required the longest run time (5.75×). In contrast, proovread needed less than 10 GB memory (8.1 GB and 9.9 GB for default and modified settings), while LSC needed almost twice as much (18.4 GB) and PacBioToCA needed most memory (45.4 GB).

5 Discussion

proovread is designed to correct erroneous LRs sequenced by SMRT with high quality SR data as generated by Illumina sequencers. Our benchmarks revealed that proovread is well suited for the correction of microbial and eukaryotic as well as genomic and transcriptomic data.

Arguably the most prominent characteristic of a correction pipeline is the accuracy of the corrected reads. With accuracies of > 99 % in almost all test cases, proovread and PacBioToCA clearly outperformed LSC. The latter achieved only < 90 % for genomic and 95 % for transcriptomic reads, as LSC omits trimming of corrected reads, which results in reads that are only partially corrected. Consequently, the overall accuracy of these reads is low. When comparing accuracies, it has to be taken into account that all corrected reads which could not be mapped onto the reference or were identified as chimeric were classified as ambiguous and not considered. In general, their amount was small. Only for the A. thaliana set did it exceed 6.5 % for all programs, as here the reference and the sequenced strain differ (Suppl. Table S4). Still, if these reads were included, the accuracy of all programs would decrease. This effect would be smallest on proovread as it generated fewer of these ambiguous reads than PacBioToCA and LSC.

Still, accuracy is only one criterion for the evaluation of a LR correction pipeline. A key advantage of SMRT sequencing is the length of the resulting reads. Therefore, the correction process should shorten the reads as little as possible. When comparing the N50 of the corrected reads, LSC resulted in the longest reads. This does not come as a surprise, taking into account that LSC does not trim corrected reads (see above). Still, we think that trimming is an important feature as it not only avoids the inclusion of poorly corrected regions in the following analyses but also enables the correction of chimeric reads. Therefore, both proovread and PacBioToCA trim corrected reads. Still, the N50 of proovread corrected reads was in all test cases considerably higher than for PacBioToCA. To give the user maximum flexibility, proovread also reports the untrimmed, corrected reads. Furthermore, the trimming step is independent of the correction, thereby enabling the user to easily optimise the trimming parameters for the given data set.
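The N50 values compared above can be reproduced from a list of read lengths. The following is a minimal sketch using the common definition of N50 (the largest length L such that reads of length at least L contain at least half of all bases); the function name `n50` and the example lengths are illustrative and not taken from the benchmark.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the largest length L such that all reads of length >= L
    together contain at least half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_of_bases:
            return length
    raise ValueError("N50 is undefined for an empty collection")


# Illustrative lengths only: trimming the reads lowers the N50.
untrimmed = [3000, 2500, 2000, 1200, 800]
trimmed = [2200, 1800, 1500, 900, 400]
print(n50(untrimmed), n50(trimmed))
```

Comparing the value before and after trimming in this way shows how much read length a given set of trimming parameters costs.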

Finally, during the correction process, regions and reads can be rejected, decreasing the throughput of the pipeline. This factor might be negligible if initially a high coverage of the sequenced material was provided. In the case of larger, e.g. eukaryotic, genomes, this is usually too expensive, resulting in lower coverage. Here, a decrease in throughput could have a strong impact on the further steps of the project, especially the assembly. In the worst case, costly re-sequencing might be needed. Thus, minimizing the rejection of reads is an important objective of a correction pipeline. proovread corrected in almost all cases ≥ 80 % of the input data, while LSC and PacBioToCA generated dramatically smaller throughput. In the case of the H. sapiens transcriptome and the A. thaliana genome, PacBioToCA omitted > 50 % of the LRs. Thus, 1.6× more raw reads would be needed for the same amount of corrected reads. Taken together, proovread is able to correct larger percentages with higher accuracy, leading to longer reads than previous tools.
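The 1.6× figure is consistent with dividing a throughput of 80 % by a throughput of 50 %. The sketch below spells out that arithmetic; the helper name `raw_data_fold` is hypothetical, and the interpretation of the figure as this ratio is an assumption rather than something stated explicitly in the text.

```python
def raw_data_fold(reference_throughput, other_throughput):
    """Fold increase in raw data needed by a pipeline with
    `other_throughput` to yield as much corrected data as a pipeline
    with `reference_throughput`. Both are fractions in (0, 1]."""
    for value in (reference_throughput, other_throughput):
        if not 0 < value <= 1:
            raise ValueError("throughput must be a fraction in (0, 1]")
    return reference_throughput / other_throughput


# 80 % retained versus 50 % retained (assumed reading of the text).
print(round(raw_data_fold(0.80, 0.50), 2))
```

With a throughput below 50 %, as reported for two of the data sets, the required fold increase would be correspondingly larger.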

When designing proovread, one goal was to achieve a maximum of independence from the existing computer infrastructure. Therefore, the correction can be performed in a single process, which requires a maximum of memory, or can be split into smaller chunks, each needing less memory. In addition, the amount of SRs is unlimited as proovread does not require an indexing of the SR data. This allows the parallel execution of independent proovread processes on computers with limited memory and CPU configuration. Indeed, the memory footprint of a single proovread process is smaller than in the other programs. If the available memory is not sufficient, the package size can be lowered to fit the available memory. Admittedly, this increases the run time of proovread. Still, we think that this can be overcome by clusters of comparably cheap machines. Here, each machine corrects only a fraction of the reads. In contrast, LSC requires large memory and does not benefit from running in a grid infrastructure. PacBioToCA allows the parallel execution in a grid system, but is restricted to the SGE queueing system. Moreover, it requires up to 48 GB memory on a local computer, which requires large server systems. Thus, proovread does not dictate the architecture of the computer system.
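The idea of splitting the LR input into packages of bounded size can be sketched as follows. This is not proovread's own implementation; the function name `split_into_packages`, the in-memory `(read_id, sequence)` representation and the greedy filling strategy are assumptions made purely for illustration.

```python
def split_into_packages(reads, max_bases):
    """Greedily group (read_id, sequence) pairs into packages whose
    summed sequence length does not exceed max_bases.

    A single read longer than max_bases is placed in a package of its
    own, so no read is ever dropped or cut.
    """
    package, package_bases = [], 0
    for read_id, sequence in reads:
        if package and package_bases + len(sequence) > max_bases:
            yield package
            package, package_bases = [], 0
        package.append((read_id, sequence))
        package_bases += len(sequence)
    if package:
        yield package


# Toy reads; each resulting package could be corrected by a separate process.
reads = [("r1", "A" * 6), ("r2", "C" * 5), ("r3", "G" * 4), ("r4", "T" * 12)]
for number, package in enumerate(split_into_packages(reads, max_bases=10), 1):
    print(number, [read_id for read_id, _ in package])
```

A smaller `max_bases` lowers the memory needed per package at the price of more packages, which mirrors the trade-off between memory and run time described above.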

This idea of flexibility is also encoded in the correction process itself. This starts with the mapping of the reads. Currently, we support SHRiMP2 as default mapper and Bowtie2 as experimental option. As we cannot foresee the development of new mappers, proovread can also work directly on user-provided mapping files. Next, the proovread iterative correction is highly configurable. A user can modify the number of iterations and thereby decrease the overall run time by performing fewer iterations. Obviously, this might affect the correction accuracy and therefore has to be considered carefully. Finally, the correction and the filtering steps are independent from each other. Without filtering, the maximum of the input read length can be preserved. In contrast, with strict filtering, only highly accurate positions will be returned. Furthermore, this separation enables repeating the filtering step without re-running the time-consuming correction.

The idea to correct errors in LRs from SMRT sequencing with Illumina SRs has been implemented before by LSC and PacBioToCA. However, both have strong demands on hardware as well as software infrastructure. If either cannot be met, correction cannot be performed. This makes subsequent analyses such as an assembly challenging, if possible at all. In contrast, proovread can be easily adapted to the available resources. It handles the correction of an E. coli genome on a laptop as well as of a human genome on an HPC cluster. This built-in flexibility also enables the adaptation to and optimization for different data sets as generated in genomic and transcriptomic projects. Finally, correction with proovread delivered more, more accurate and longer reads. Thus, proovread is well suited for the correction of LRs irrespective of the target of sequencing and regardless of the computational resources.

Acknowledgment

We would like to thank the ‘Leibniz Rechenzentrum’ and the ‘Data Intensive Academic Grid’ for providing us with the infrastructure for evaluation and benchmarking of the programs, Henri van de Geest, Wageningen UR, Plant Research International, Netherlands, for testing of proovread and Jessika D’Lorge-Harnisch for proofreading the manuscript.

Funding: This work was supported by the European Research Council [ERC acronym: Carnivorum].

Conflicts of Interest: FF is a shareholder of Pacific Biosciences.

References

Au, Kin Fai et al. (2012). “Improving PacBio long read accuracy by short read alignment.” In: PLoS One 7.10, e46679. doi: 10.1371/journal.pone.0046679. url: http://dx.doi.org/10.1371/journal.pone.0046679.

Brown, C. Titus et al. (2012). “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”. In: eprint: arXiv:1203.4802.


Chin, Chen-Shan et al. (2013). “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.” In: Nat Methods 10.6, pp. 563–569. doi: 10.1038/nmeth.2474. url: http://dx.doi.org/10.1038/nmeth.2474.

David, Matei et al. (2011). “SHRiMP2: sensitive yet practical SHort Read Mapping.” In: Bioinformatics 27.7, pp. 1011–1012. doi: 10.1093/bioinformatics/btr046. url: http://dx.doi.org/10.1093/bioinformatics/btr046.

Dohm, Juliane C. et al. (2008). “Substantial biases in ultra-short read data sets from high-throughput DNA sequencing.” In: Nucleic Acids Res 36.16, e105. doi: 10.1093/nar/gkn425. url: http://dx.doi.org/10.1093/nar/gkn425.

Eid, John et al. (2009). “Real-time DNA sequencing from single polymerase molecules.” In: Science 323.5910, pp. 133–138. doi: 10.1126/science.1162986. url: http://dx.doi.org/10.1126/science.1162986.

Fichot, Erin B. and R. Sean Norman (2013). “Microbial phylogenetic profiling with the Pacific Biosciences sequencing platform”. In: Microbiome 1.27.

Gnerre, Sante et al. (2011). “High-quality draft assemblies of mammalian genomes from massively parallel sequence data.” In: Proc Natl Acad Sci U S A 108.4, pp. 1513–1518. doi: 10.1073/pnas.1017351108. url: http://dx.doi.org/10.1073/pnas.1017351108.

Jette, Morris A., Andy B. Yoo, and Mark Grondona (2002). “SLURM: Simple Linux Utility for Resource Management”. In: Lecture Notes in Computer Science: Proceedings of Job Scheduling Strategies for Parallel Processing (JSSPP) 2003. Springer-Verlag, pp. 44–60.

Kelley, David R., Michael C. Schatz, and Steven L. Salzberg (2010). “Quake: quality-aware detection and correction of sequencing errors.” In: Genome Biol 11.11, R116. doi: 10.1186/gb-2010-11-11-r116. url: http://dx.doi.org/10.1186/gb-2010-11-11-r116.

Koren, Sergey et al. (2012). “Hybrid error correction and de novo assembly of single-molecule sequencing reads.” In: Nat Biotechnol 30.7, pp. 693–700. doi: 10.1038/nbt.2280. url: http://dx.doi.org/10.1038/nbt.2280.

Langmead, Ben and Steven L Salzberg (2012). “Fast gapped-read alignment with Bowtie 2.” In: Nat Methods 9.4, pp. 357–359. doi: 10.1038/nmeth.1923. url: http://dx.doi.org/10.1038/nmeth.1923.

Li, Heng et al. (2009). “The Sequence Alignment/Map format and SAMtools.” In: Bioinformatics 25.16, pp. 2078–2079.

Li, Ruiqiang et al. (2010). “De novo assembly of human genomes with massively parallel short read sequencing.” In: Genome Res 20.2, pp. 265–272. doi: 10.1101/gr.097261.109. url: http://dx.doi.org/10.1101/gr.097261.109.

Miller, Jason R et al. (2008). “Aggressive assembly of pyrosequencing reads with mates.” In: Bioinformatics 24.24, pp. 2818–2824. doi: 10.1093/bioinformatics/btn548. url: http://dx.doi.org/10.1093/bioinformatics/btn548.

Myers, E. W. et al. (2000). “A whole-genome assembly of Drosophila.” In: Science 287.5461, pp. 2196–2204.

Ono, Yukiteru, Kiyoshi Asai, and Michiaki Hamada (2013). “PBSIM: PacBio reads simulator–toward accurate genome assembly.” In: Bioinformatics 29.1, pp. 119–121. doi: 10.1093/bioinformatics/bts649. url: http://dx.doi.org/10.1093/bioinformatics/bts649.

Roberts, Richard J, Mauricio O Carneiro, and Michael C Schatz (2013). “The advantages of SMRT sequencing.” In: Genome Biol 14.6, p. 405. doi: 10.1186/gb-2013-14-6-405. url: http://dx.doi.org/10.1186/gb-2013-14-6-405.

Ross, Michael G et al. (2013). “Characterizing and measuring bias in sequence data.” In: Genome Biol 14.5, R51. doi: 10.1186/gb-2013-14-5-r51. url: http://dx.doi.org/10.1186/gb-2013-14-5-r51.

Slater, Guy St C. and Ewan Birney (2005). “Automated generation of heuristics for biological sequence comparison.” In: BMC Bioinformatics 6, p. 31. doi: 10.1186/1471-2105-6-31. url: http://dx.doi.org/10.1186/1471-2105-6-31.

Travers, Kevin J. et al. (2010). “A flexible and efficient template format for circular consensus sequencing and SNP detection.” In: Nucleic Acids Res 38.15, e159. doi: 10.1093/nar/gkq543. url: http://dx.doi.org/10.1093/nar/gkq543.

Wu, Thomas D. and Colin K. Watanabe (2005). “GMAP: a genomic mapping and alignment program for mRNA and EST sequences.” In: Bioinformatics 21.9, pp. 1859–1875. doi: 10.1093/bioinformatics/bti310. url: http://dx.doi.org/10.1093/bioinformatics/bti310.


Supplementary data

Table S1: Smith-Waterman scoring parameters applied for Illumina-PacBio hybrid alignments

Parameter           default   LR raw   LR refine
match                    10        5           5
mismatch                -15      -11         -10
gap-open on LR          -33       -2          -5
gap-open on SR          -33       -1          -5
gap-extend on LR         -7       -4          -2
gap-extend on SR         -3       -3          -2
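To illustrate how the asymmetric gap penalties of Table S1 act on an alignment, the following minimal sketch scores a given gapped pairwise alignment with the "default" column. Two assumptions are made here that the table does not state: a gap "on LR" is read as a gap character in the LR row, and a gap of length k is charged one gap-open plus k gap-extend penalties. The function and dictionary names are hypothetical, and this is not the mappers' own scoring code.

```python
# Values copied from Table S1, column "default".
DEFAULT_SCORES = {
    "match": 10, "mismatch": -15,
    "gap_open_lr": -33, "gap_open_sr": -33,
    "gap_extend_lr": -7, "gap_extend_sr": -3,
}


def score_alignment(lr_row, sr_row, scores=DEFAULT_SCORES):
    """Score two equal-length alignment rows in which '-' marks a gap.

    Assumed convention: a gap of length k costs open + k * extend.
    """
    if len(lr_row) != len(sr_row):
        raise ValueError("alignment rows must have equal length")
    total = 0
    gap_row = None  # row that carried a gap in the previous column
    for lr_base, sr_base in zip(lr_row, sr_row):
        if lr_base == "-" and sr_base == "-":
            raise ValueError("column with gaps in both rows")
        if lr_base == "-":
            if gap_row != "lr":
                total += scores["gap_open_lr"]
            total += scores["gap_extend_lr"]
            gap_row = "lr"
        elif sr_base == "-":
            if gap_row != "sr":
                total += scores["gap_open_sr"]
            total += scores["gap_extend_sr"]
            gap_row = "sr"
        else:
            total += scores["match"] if lr_base == sr_base else scores["mismatch"]
            gap_row = None
    return total


print(score_alignment("ACG-T", "ACGTT"), score_alignment("ACGTT", "ACG-T"))
```

Under these assumptions, a single-base gap in the LR row is penalised more heavily than one in the SR row with the default parameters, because the two gap-extend values differ.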

Phred conversion

\[ p = \min\left(\sqrt{f \cdot 50},\ 40\right) \tag{S1} \]
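The conversion of Equation S1 can be sketched in a few lines. The sketch assumes that the square root covers the product f · 50 and that the result is rounded to the nearest integer; both are inferences from the values listed in Table S2 rather than statements in the text, and the function names are hypothetical.

```python
import math


def freq_to_phred(freq, cap=40):
    """Convert a frequency into a Phred score following Equation S1.

    Assumptions: the root spans freq * 50 and the value is rounded to
    the nearest integer before the cap is applied.
    """
    if freq < 0:
        raise ValueError("frequency must be non-negative")
    return min(round(math.sqrt(freq * 50)), cap)


def phred_to_ascii(phred, offset=64):
    """Encode a Phred score as a single character (Phred+64)."""
    return chr(phred + offset)


for freq in (0, 1, 8, 20, 32, 40):
    phred = freq_to_phred(freq)
    print(freq, phred, phred_to_ascii(phred))
```

The offset of 64 corresponds to the Phred+64 encoding named in the header of Table S2.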

Figure S1: Package size vs. accuracy. [Plot not reproduced; x-axis: package size (Mbp), 0.10 to 50.00; y-axis: accuracy in %, 99.90 to 99.98.]

Figure S2: Package size vs. memory requirement. [Plot not reproduced; x-axis: package size (Mbp), 0.10 to 50.00; y-axis: memory requirement (Gbyte), 0 to 12.]

Figure S3: Expected coverage vs. correction accuracy. [Plot not reproduced; x-axis: expected coverage, 10 to 200; y-axis: accuracy in %, 99.75 to 99.95.]

Figure S4: Expected coverage vs. correction throughput. [Plot not reproduced; x-axis: expected coverage, 10 to 200; y-axis: throughput in %, 0 to 100.]

Figure S5: Expected coverage vs. memory consumption. [Plot not reproduced; x-axis: expected coverage, 10 to 200; y-axis: memory in GB, 1.5 to 2.5.]

Table S2: Frequency to Phred conversion table

Freq.  Phred  ASCII (Phred+64)    Freq.  Phred  ASCII (Phred+64)
0      0      @                   20     32     `
1      7      G                   21     32     `
2      10     J                   22     33     a
3      12     L                   23     34     b
4      14     N                   24     35     c
5      16     P                   25     35     c
6      17     Q                   26     36     d
7      19     S                   27     37     e
8      20     T                   28     37     e
9      21     U                   29     38     f
10     22     V                   30     39     g
11     23     W                   31     39     g
12     24     X                   32     40     h
13     25     Y                   33     40     h
14     26     Z                   34     40     h
15     27     [                   35     40     h
16     28     \                   36     40     h
17     29     ]                   37     40     h
18     30     ^                   38     40     h
19     31     _                   39     40     h
20     32     `                   40     40     h

Table S3: Data sets used for the performance estimation of proovread.

organism              type            genome reference   genome size  PacBio cell id / accession          SR accession  read length
Escherichia coli      genomic         GCA_000005845.1    4.64 Mbp     c100247042550000001523002504251220  ERX002508     100 bp
Arabidopsis thaliana  genomic         GCA_000001735.1    119.6 Mbp    c100539462550000001823089611241346  SRX158552     110 bp
Homo sapiens          genomic         GCA_000001405.12   3.23 Gbp     c100518541910000001823079209281310  SRX246904     101 bp
                                                                                                          SRX246905     101 bp
                                                                                                          SRX246906     101 bp
                                                                                                          SRX246907     101 bp
                                                                                                          SRX247361     101 bp
                                                                                                          SRX247362     101 bp
Homo sapiens          transcriptomic  GCA_000001405.12   3.23 Gbp     c100202382555500000315044810141103  ERX011200     50 bp
                                                                      c100202382555500000315044810141104  ERX011186     75 bp
                                                                      c100202382555500000315044810141105

Table S4: Percentage of ambiguous bases obtained from the corrections by proovread, PacBioToCA, and LSC

Program           Escherichia coli  Arabidopsis thaliana  Homo sapiens genome  Homo sapiens transcriptome
proovread def.    0.54              9.65                  3.40                 0.27
proovread norm.   -                 -                     1.49                 -
proovread adapt.  -                 -                     -                    0.31
PacBioToCA        1.37              6.49                  -                    0.30
LSC               8.51              16.88                 -                    1.65
