1-s2.0-s153204640900149x-main


  • facilitates the sequencing of millions of fragments of DNA in parallel, shifting the burden from intensive wet laboratory processes and production, as was the case with Sanger sequencing, to bioinformatics. Informatics processes in support of DNA sequencing include large-scale computation, storage and analysis of increasingly large image and sequence data sets. With the rapid deployment of the latest generation of sequencing technologies, such as Roche (formerly

    1.0) as a baseline, we explored computational approaches for improving: (1) accuracy of base calls (measured using bar codes); (2) computational efficiency (speed of processing image data); and (3) density of identified sequences (using pixel-based and cluster-based approaches).

    BING is one of the first software tools to perform pixel-to-base analysis of NGS data. In a pixel-based approach a sequence base call is made at each pixel, as opposed to the conventional cluster-based approach where a centroid is derived from multiple neighboring pixels. When compared to the Illumina informatics tool, BING's pixel-based approach produces a significant increase in the number of bar-code validated sequence data.

    * Corresponding author. Address: Department of Biomedical Informatics, Arizona State University, ABC1 425 N 5th Street, Phoenix, AZ 85004-2157, USA.

    Journal of Biomedical Informatics 43 (2010) 428-434

    E-mail address: [email protected] (J. Kriseman).

    entire genomes for genomic variations that can aid in the discovery of genetic elements that contribute to both traits and diseases.

    Beginning with the Human Genome Project, researchers gained a general understanding of the layout of the human genome [2]. This project highlighted the importance of sequence information on a large scale and the challenges involved in the process of acquiring it. The duration, cost and insight provided by this project prompted the development of new high-throughput, parallel sequencing methods.

    Parallel sequencing, a.k.a. Next Generation Sequencing (NGS)

    cesses, and minimizing of the storage footprint. At the same time, incorporation of new algorithms, such as pixel-based image analysis algorithms (as opposed to the current cluster-based algorithms), into this pipeline will allow a potential increase in the amount of sequence data generated per experiment.

    In this manuscript we introduce such an informatics analysis pipeline, named BING (Biomedical Informatics for Next Generation Sequencing). BING was developed to improve the analysis workflow of NGS image data. Using results generated with the Illumina Genome Analyzer (versions GA and GA-II) Pipeline (versions 0.2.2.6, 0.3, and

    Keywords: Image processing; Image alignment; Base calling; Image analysis; Signal processing

    1. Introduction

    In 1977, Sanger and Coulson int scientific tool that would revolution would be studied [1]. Three decades High-Throughput Sequencing, resear

    1532-0464/$ - see front matter © 2009 Elsevier Inc. doi:10.1016/j.jbi.2009.11.003

    pixel-based approach produces a significant increase in the number of sequence reads, while reducing the computational time per experiment and error rate (

  • greatest between the reference plane and the plane in question.

    and y, n is the sample size.

    The matrix is calculated by shifting the sample window and multiplying the intensity values (for a particular sample, S is the reference tile plane, T is the tile plane of a specific cycle, and y and x are the coordinates on the image).

    Two-dimensional cross-correlation coefficient scoring algorithm used for image alignment.

    $$C(y, x) = \sum_{y_1=0}^{M} \sum_{x_1=0}^{N} S(y_1, x_1)\,\operatorname{conj}\bigl(T(y_1 + y,\ x_1 + x)\bigr) \qquad (1)$$

    This approach has the potential of increasing the density and throughput of NGS technologies. In comparison to the Illumina Genome Analyzer Pipeline, which utilizes a cluster-based approach [3], BING provides greater accuracy, delivers a significant boost in computational performance and sequence information, while reducing the storage footprint. These findings open up the possibility for further exploration into higher density sequencing images, potentially increasing the number of sequence data points per sequencing experiment.

    2. Methods

    2.1. BING approach

    The BING approach comprises a set of distinct modules that encompass the characterization, alignment, and analysis of sequencing image data. The first module performs image alignment, then transitions data to the second module, cluster registration, where clusters (parametrically defined as one or more pixels) are detected from a universal map of aligned images. Next, the third module, signal measurement and base calling, measures and normalizes intensity values, and base calling is performed. The fourth module performs quality assurance, where signals are sampled for significance and each sequence is validated against a barcode library. The validated sequences are then aligned against a reference sequence library or genome and the genomic coverage is measured and compared to the expected coverage. Finally, optional modules generate a variety of outputs including FASTA formats, which can be used in third party reporting of genetic variation. Below we detail these steps.

    2.1.1. Image alignment

    Due to variation of the mechanical movement of the camera across the flow cell surface, the acquired images need to be aligned. There is no camera movement between the acquisition of each nucleotide image, only the introduction of lasers and filters to isolate the responses. Images are captured separately for each of the four possible nucleotides, A, C, G and T. Thus, a super-positioning of all four images of a cycle provides the location of every cluster regardless of which nucleotide it represents. The super-positioned images for each cycle (e.g., there are 30-40 cycles, corresponding to read lengths of 30-40 bases) must then be aligned to one another.

    For image alignment, BING implements a variation of the Lucky algorithm from the field of Observational Astronomy [4]. The "shift, score and add" alignment begins by iterating through each of the cycles and preparing a sample tile plane, a superposition of all channels. The resulting planar image demonstrates intensity peaks at all cluster positions regardless of their specific designation (A, C, T, G). Next, a centered sample window of size M × N (e.g., 200 × 200 pixels) is constructed to identify the offset of each image. The cross-correlation coefficient matrix is then computed between each sample tile plane and the initial tile plane; the offset is determined by selecting the coordinate at which the correlation is greatest between the reference plane and the plane in question.

    where C represents the scoring coefficient, [x, y] represent offset coordinates, and [x1, y1] are the scanning coordinates over which the sample iterates. conj is the complex conjugate. S and T are 2-D matrices representing nucleotide signals.

    In order to compare signals in a standardized format, the mean intensity value of the image must be subtracted from every pixel value. The correlation is calculated for every channel pair and is applied back to the images, such that the coefficients of the other channels are subtracted in proportion to the values of each image, respectively. The values of each channel pair are corrected as in Eq. (3), as there is insignificant correlation between the other combinations of channels once the main correlation effects (A/C, G/T) are removed.

    Functions used to standardize signals.

    $$\begin{aligned}
    A_{\mathrm{corrected}} &= A - \bar{A} - r_{A,C}\,C \\
    C_{\mathrm{corrected}} &= C - \bar{C} - r_{C,A}\,A \\
    G_{\mathrm{corrected}} &= G - \bar{G} - r_{G,T}\,T \\
    T_{\mathrm{corrected}} &= T - \bar{T} - r_{T,G}\,G
    \end{aligned} \qquad (3)$$

    The application of this correlation compensation principle throughout the progression of cycles achieves maximum signal

    The offset at which the minimum distance exists is then recorded for later use in coordinated base-calling. The maximum spatial domain is determined by the maximum offset to prevent out of bounds requests from being called. The selection of specific images in creating a composite reference image is crucial to the quality of alignment. When processing images generated with the Illumina Genome Analyzer, a compilation of the first five cycles results in a blurred representation of cluster locations; this phenomenon is due to the initial mechanical variation of the sequencing machine, while compilation of the last five cycles results in a near accurate representation of cluster locations with minimal adjustments required.

    To better understand which images to utilize as a reference image in the creation of the stack of aligned images, sample offsets are calculated to determine the cycles to be used in the generation of a compiled reference image.
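The shift-and-score search described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the BING implementation: the function name `cross_correlation_offset` and the `max_shift` search radius are assumptions introduced here, the planes are assumed to be real-valued superpositions of the four channels, and the score is the windowed sum of Eq. (1).

```python
import numpy as np


def cross_correlation_offset(reference, plane, max_shift=5):
    """Return the (dy, dx) shift of `plane` relative to `reference`
    that maximizes the cross-correlation score of Eq. (1).

    `max_shift` is a hypothetical search radius, not a BING parameter.
    """
    ref = np.asarray(reference, dtype=float)
    pln = np.asarray(plane, dtype=float)
    ref = ref - ref.mean()  # mean-subtract so scores are comparable
    pln = pln - pln.mean()
    h, w = ref.shape
    m = max_shift
    # Sample window S: the reference with a border of width m trimmed off,
    # so every shifted window of T stays inside the image.
    window = ref[m:h - m, m:w - m]
    best_score, best_offset = -np.inf, (0, 0)
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            # T(y1 + y, x1 + x): the same window, displaced by (dy, dx)
            shifted = pln[m + dy:h - m + dy, m + dx:w - m + dx]
            score = float(np.sum(window * np.conj(shifted)).real)
            if score > best_score:
                best_score, best_offset = score, (dy, dx)
    return best_offset
```

An exhaustive search over a small radius keeps the sketch readable; for the 200 × 200 windows mentioned above, an FFT-based correlation would be the usual faster alternative.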

    2.1.2. Signal correlation, compensation and separation

    Signal correlation, compensation, and separation are applied to the images to remove cross-talk between the fluorescent signals. The majority of the cross-talk phenomenon is illustrated by the distinct correlation between the A and C channels and G and T channels, respectively.

    Another issue is the correlation between different cycles. As cycles progress, the correlation between cycles increases for all bases/channels as seen in Fig. 1A. This phenomenon likely occurs due to the accumulation of fluorophores at a low level of excitation during progressive sequencing cycling. This may occur due to incomplete cleavage of fluorophores in previous cycles.

    Throughout cycle progression baseline signals increase while high intensity peaks diminish proportionally. In an effort to recover high intensity peaks and restore the baseline signal, Pearson's correlation coefficients are calculated and utilized in phase correction. To remove the effect that one signal has upon the other (crosstalk), the correlation is determined between two images, and the image means subtracted. Next, the proportional contributions (correlation × mean-subtracted image) are subtracted, resulting in the elimination of spectral overlap as illustrated in Fig. 1B.

    $$r_{x,y} = \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{(n - 1)\,S_x S_y} \qquad (2)$$

    where x and y represent different bases while i are corresponding pixel locations, S_x and S_y represent the standard deviations of x and y, and n is the sample size.

    separation. When applied to the images, as illustrated in the cluster evidence image in Fig. 2, the cross-talk and phasing have been simultaneously minimized.
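The pairwise compensation of Eqs. (2) and (3) can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are assumptions, and the proportional term uses the mean-subtracted partner image, following the textual description ("correlation × mean-subtracted image").

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient as written in Eq. (2)."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    n = x.size
    numerator = np.sum(x * y) - n * x.mean() * y.mean()
    # ddof=1 gives the sample standard deviations S_x and S_y
    return numerator / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))


def correct_pair(a, c):
    """Subtract each image's mean, then subtract the partner channel's
    proportional contribution (hypothetical helper for one channel pair,
    e.g. A/C or G/T)."""
    a = np.asarray(a, dtype=float)
    c = np.asarray(c, dtype=float)
    a0 = a - a.mean()
    c0 = c - c.mean()
    r = pearson_r(a, c)  # r_{A,C} equals r_{C,A}
    return a0 - r * c0, c0 - r * a0
```

For two channels of similar variance, each corrected image is then nearly uncorrelated with its raw partner, which is the regression-residual property this subtraction relies on.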

  • An example of the application of the correction algorithm for a single cluster across all cycles and bases displays the distinct separation of signals as seen in Fig. 3.

    [Fig. 1 plots. Panel A: "Correlation of Raw Signal Intensities"; panel B: "Correlation of Corrected Signal Intensities". X axis: Cycle (1 to 35); Y axis: Correlation; series: channel pairs A,C; A,G; A,T; C,G; C,T; G,T.]

    Fig. 1. (A, top) Correlation between raw signal intensities increases as cycles progress. (B, bottom) Correlation filtering corrects for spectral overlap.

    Fig. 2. Evidence image of a single cluster. A single column represents a signal collected from each base (four images). The collection of columns from left to right represents the position in the sequence from 1 to 36. The top evidence image represents the raw image data before correction and the bottom image represents the data after the correction of cross-talk and phasing.

    Fig. 3. Illustration of signal correlation between the bases before and after the correction algorithm. The unit of the Y axis is the standard score, the X axis represents each cycle, and the colors correspond to each of the four bases.

    2.1.3. Cluster registration

    The content of each image contains many point sources of illumination. Cluster registration is the process of identifying these significant point sources across the flow-cell. Due to the choice of fluorescent nucleotides and their responses, only one of the four bases will always have a significant intensity. BING utilizes two approaches to cluster detection and registration (Fig. 4). The first, a pixel-based approach, registers every pixel in the image. The second, a cluster-based technique, leverages the characteristics of the uniform signal response in defining regions in registration of clusters. The pixel-based approach ensures no loss of information, whereas the cluster-based method identifies probable clusters. Tradeoffs of coverage and performance are decided by the requirement of the experiment.

    The pixel-based approach to cluster registration performs analysis of each pixel independently, eliminating the need to characterize a cluster region. The coordinates of significant intensity values between the different bases are recorded, identifying regions where fluorophores exist. This reduces the amount of data to be processed (no base calling at that location).
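As a rough sketch of pixel-based registration, the snippet below records every coordinate at which at least one of the four channel images exceeds a significance threshold. The function name `register_pixels` and the `threshold` parameter are assumptions made for illustration; the text does not specify how significance is decided.

```python
import numpy as np


def register_pixels(channels, threshold):
    """Return (row, col) coordinates where any base channel is significant.

    `channels` is a sequence of four equally sized 2-D arrays (A, C, G, T);
    `threshold` is a hypothetical significance cutoff.
    """
    stack = np.asarray(channels, dtype=float)  # shape (4, H, W)
    strongest = stack.max(axis=0)              # brightest channel per pixel
    rows, cols = np.nonzero(strongest > threshold)
    return list(zip(rows.tolist(), cols.tolist()))
```

Pixels that never exceed the threshold are skipped entirely, which matches the remark above that no base calling is needed at such locations.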

  • at each pixel location based on the neighboring sequences. Based upon the physical proximity of sequences, as determined by simi-

    Cluster-based algorithms implemented in BING require global based processing in the identification of significant point sources. The intensity of a point source becomes inversely proportional to the square of its distance from its center, resulting in the appearance of a point spread function or a Gaussian surface. For the purpose of this analysis, the Laplacian operator is utilized, and defined as the second order differential of a two-dimensional space. It can be applied to differentiate the peaks from the troughs across an image. Computing the Laplacian matrix and applying the watershed algorithm [5] identifies all regional peaks across the image.

    Laplacian operator:

    $$I \propto \frac{1}{r^2}, \qquad \Delta f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} \qquad (4)$$

    To define a global reference coordinate, an image set is chosen that has all nucleotides present with strong luminance. As there are four specific nucleotide probes, a superposition of all four images will result in an image where every oligonucleotide has an active illuminating probe.

    To achieve the highest possible quality in cluster registration, the average of super-positions for each cycle is taken before the Laplacian matrix is calculated and the watershed algorithm is applied.
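The peak-finding idea can be illustrated with a discrete Laplacian. Note the substitution: this sketch uses the standard 5-point finite-difference form of Eq. (4) and replaces the watershed step with a simple strict 3 × 3 local-maximum test, so it is a simplified stand-in rather than the method used in BING. Function names are assumptions.

```python
import numpy as np


def laplacian(img):
    """5-point finite-difference approximation of Eq. (4)."""
    f = np.pad(np.asarray(img, dtype=float), 1, mode="edge")
    return (f[:-2, 1:-1] + f[2:, 1:-1] + f[1:-1, :-2] + f[1:-1, 2:]
            - 4.0 * f[1:-1, 1:-1])


def regional_peaks(img):
    """Return (row, col) of pixels that are strict 3x3 local maxima and
    have a negative Laplacian (i.e. sit on a peak, not in a trough)."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    padded = np.pad(img, 1, mode="constant", constant_values=-np.inf)
    is_max = np.ones((h, w), dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            is_max &= img > neighbor
    rows, cols = np.nonzero(is_max & (laplacian(img) < 0))
    return list(zip(rows.tolist(), cols.tolist()))
```

In practice the input would be the averaged superposition of all cycles, as described above, so that every cluster contributes a peak regardless of its base.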

    Fig. 4. The accurate determination and broad coverage of clusters identified using the cluster (left) and pixel (right) based algorithms, in conjunction with the barcode indexes, substantially reduces errors. Green dots represent bar-code validated sequences. (For interpretation of color mentioned in this figure the reader is referred to the web version of the article.)

    2.1.4. Signal measurement, and base calling

    The signal intensities across all four channels determine the nucleotide that is present at each cluster location. Thus, the intensities are recorded for each channel and each registered cluster (or one pixel, in the pixel-based approach). Two different approaches are implemented and maintained in BING, to allow for further evaluation based upon the experimental conditions such as cluster density and fluorescent probes.

    One approach normalizes cluster intensities by converting them to fractions across all four channels. Due to the different and overlapping spectral responses of each fluorescently labeled nucleotide, the fractions are adjusted such that their distributions are comparable. For example, signals from channels 1 and 3 (A and G) are never greater than 50%, while channels 2 and 4 (C and T) approach 100%. The adjustments scale and shift the distributions of each channel between 0% and 100%. The result is a probability distribution for each channel. Using this approach, the base-call is chosen as the maximum score between all four channels.

    The second approach, the pipeline default algorithm, relies on the standardizing effect of cross-talk compensation, explained previously, which allows the base to be directly determined by the maximum signal for a location and cycle. This approach is a more efficient and simplified technique, removing the need for complex base calling algorithms.

    2.1.5. Quality control and accuracy measurement

    Several levels of quality measurement are utilized for error control. The first level applies rules for probabilistic sequence clustering which calculates the probabilities of the validity of a sequence at each pixel location based on the neighboring sequences. Based upon the physical proximity of sequences, as determined by similarity in neighboring pixels (configuration parameter), regions are defined. The consensus sequence of the pixels defined in each cluster (a BING cluster is parametrically defined as one or more pixels) is then calculated and low scoring values (configuration parameter) are removed from the sequence set to eliminate redundancy and reduce base calling error rates.

    Next, utilization of indexes or "bar-codes" [6] as a further quality control measure ensures a specific sequence of nucleotides exists within the target sequence. Bar codes represent indexed DNA sequences that are short (for this study, five or six nucleotides, with Thymidine as the last nucleotide necessary for ligation of the Adenosine overhang), known oligonucleotides designed to include multiple redundancy checks using a checksum mechanism [6]. Barcode DNA sequences are ligated to the target sequences for use in the identification of a sample.

    Sequence Position Consistency is calculated by the sum of QC location provided by the bar code indexes; the QC location is based upon the known final Thymine position in the barcode DNA. For each cluster the algorithm tests if the base call at the QC location equals the expected base from the known barcode sequence. If the test is positive, the QC count is incremented. This count gives the Feature Accuracy, which is calculated by summing the total QC count and dividing by the number of valid sequences times 100 to yield a percentage.

    Barcode Matches are calculated by comparing the intersection between the barcode library and the BING sequence data set for all clusters. If the sequence passes, it is flagged as a valid cluster; if it fails, the cluster is removed from further processing, although the index is stored for further QC analysis. The Barcode Accuracy is determined by the sum of Barcode Hits divided by the total number of clusters multiplied by 100, resulting in a percentage. Valid Clusters represent the total number of clusters which pass the Barcode Matching processing.

    The pipeline parameters may be adjusted manually to discard sequences that do not match the barcode library, or automated to utilize the bar code indexes in the initial stages of the run to assess quality of the read and make adjustments to application parameters for each module (e.g., image alignment window size, cluster size, etc.) based upon the barcode matches, ensuring quality reads prior to completion of an experiment and reducing the resources allocated, including personnel, machine utilization, and cost.

    Optionally, sequence quality is evaluated by aligning the generated sequences to a reference library, or genome. As there are a variety of efficient short read alignment algorithms, the alignment algorithm may be defined by the end user in the pipeline configuration parameters. For the purpose of this experiment, the BLAST algorithm [7] was utilized to demonstrate alignment to the provided reference library. The resulting Bit-Scores and E-values of the sequence alignments provide an indication of the read quality. These scores are provided to the end user for further analysis.

    3. Results

    This section provides a detailed comparative analysis between the theoretical results described above, and the actual results delivered by the BING and Illumina analytical pipelines. The data illustrated below corresponds to a single experiment, lane and tile for 36 cycles.
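Before turning to the results, the default base-calling rule of Section 2.1.4 and the barcode metrics of Section 2.1.5 can be sketched together. This is a hedged illustration: the function names, the assumption that the barcode occupies the first `barcode_len` cycles of each read, and the choice of all reads as the denominator for position consistency are ours; the text leaves these details open.

```python
import numpy as np

BASES = "ACGT"  # channel order assumed for this sketch


def call_bases(intensities):
    """Call one base per cycle as the channel with the maximum corrected
    signal. `intensities` has shape (cycles, 4)."""
    winners = np.argmax(np.asarray(intensities, dtype=float), axis=1)
    return "".join(BASES[i] for i in winners)


def barcode_metrics(reads, barcode_library, barcode_len, qc_base="T"):
    """Return (valid_reads, barcode_accuracy_pct, position_consistency_pct).

    A read is valid when its first `barcode_len` bases appear in
    `barcode_library`. Position consistency checks the final barcode
    position against the expected base (Thymine in the study).
    """
    if not reads:
        return [], 0.0, 0.0
    valid = [r for r in reads if r[:barcode_len] in barcode_library]
    barcode_accuracy = 100.0 * len(valid) / len(reads)
    qc_count = sum(1 for r in reads if r[barcode_len - 1] == qc_base)
    position_consistency = 100.0 * qc_count / len(reads)
    return valid, barcode_accuracy, position_consistency
```

Reads that fail the barcode match are simply excluded from `valid` here; a fuller implementation would also keep their indexes for later QC analysis, as the text describes.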

  • Fig. 5. Illumina, BING base distribution, and Base Call Comparison for a single sequencing experiment. Base Distribution (A and B). The Y axis represents base distribution (percentage), X axis corresponds to each cycle, and colors represent each of the four bases. The Base Call Distribution (C) compares the distribution of base calls for a single experiment.

    [Fig. 6A plot: "Global Comparison of Error Rate". X axis: Cycle Number; Y axis: Error Rate; series: BING (Pixel), BING (Cluster), Illumina.]

    3.1. Sequence data

    The cluster analysis performed with the BING and Illumina pipelines demonstrates a significant difference in the quality-controlled sequences that are identified. The Illumina technology is constrained by theoretical limits as to the number of clusters which should be resolvable for a given image size. Experimental cluster sizes in the Illumina technology vary from 6 to 8 pixels in diameter. Based upon a 6 × 6 pixel cluster size, the theoretical maximum cluster yield is 102,058 per image, while an 8 × 8 pixel cluster size yield is 57,408 per image.
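The quoted yields follow from dividing the image area by the area of one cluster. The image dimensions are not stated in the text; the sketch below assumes a 2048 × 1794 pixel image, chosen only because it reproduces both figures, and the function name is ours.

```python
def max_cluster_yield(width, height, cluster_diameter):
    """Upper bound on non-overlapping square clusters in one image."""
    return (width * height) // (cluster_diameter ** 2)


# Assumed image size (not given in the text): 2048 x 1794 pixels.
print(max_cluster_yield(2048, 1794, 6))  # 102058
print(max_cluster_yield(2048, 1794, 8))  # 57408
```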

    The introduction of pixel-based processing methods allows for detailed examination of all pixels within an image, increasing the theoretical yield to the maximum number based upon the spatial distribution of clusters, which can be described by the Poisson distribution. While this is a significant increase in sequence possibilities, cluster size, density, and distribution directly affect the number of expected sequences in an image.

    As illustrated in Fig. 7, BING sequence distribution using the pixel-based approach has the ability to identify more unique, quality-controlled sequences than the Illumina pipeline.

    3.2. Quality control

    When comparing the base distributions (Fig. 5), it is evident that the similarity between distributions is high. This correlation raises the confidence level in the BING algorithm, and its ability to accurately detect usable data from Illumina's Genome Analyzer.

    Fig. 6. (A, top) BING and Illumina error rate by cycle, averaged over titration levels 1, 2 and 4. Cycles 11-28 displayed. Cycles 1-5 represent barcode information, cycles 6-10 represent an adapter, and cycles 11 onward represent the sequence of interest. To calculate error rates: (1) Determine the best match sequence from the reference library. (2) Determine the mismatches across two sequences. (3) Average all of the sequence mismatches across each cycle. (B, bottom) BING evidence image, illustrating image samples which represent a region of a specific tile (multiple clusters). Rows represent bases, columns represent cycles. This image demonstrates the increase in error from cycle 26 onward (A). This phenomenon, which is known as pre-phasing, is attributed to early sequencing kits, where poor fluorophore cleavage resulted in accumulation of signal over cycles [8].

    [Fig. 7 plot: "Megabases/Lane per Titration Level". X axis: Titration Level (1x, 2x, 4x, 8x, 10x); Y axis: Megabases (0 to 250); series: Illumina, BING (Cluster), BING (Pixel).]

    Fig. 7. Quality-controlled sequence read yield per experiment, non-redundant reads (five lanes).

    3.3. Benchmark: BING vs. Illumina

    Accuracy for both BING and Illumina was assessed by using NCBI's BLAST algorithm to find the best alignment for each sequence [7]. The BLAST database consisted of a sequence library which was designed and utilized in the sequencing experiment. The Bit-Scores and E-values of the sequence alignments gave

  • an indication of the increased quality of the BING pipeline over the Illumina pipeline results. Table 1 provides the results of a statistical analysis of E-values between the qualities demonstrated by both platforms. The E-values for the BING-generated sequences were smaller (p-value < 0.0001), indicating better matches against the genomic library utilized for this experiment.

    Moreover, since the genomic sequence library utilized for this experiment was known, it was also possible to perform a cycle-by-cycle error rate estimation (Fig. 6) by using direct comparison of expected sequence vs. the sequence output from the Illumina or BING pipelines. The error rates were comparable for the two pipelines, with the BING pixel-based approach having a slightly lower error rate. As illustrated in Figs. 6 and 7, BING consistently outperforms Illumina, providing more reads and a low error rate.
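The cycle-by-cycle estimate amounts to counting mismatches between each read and its expected sequence at every position. The sketch below assumes the best-matching reference has already been found for each read (step 1 in the Fig. 6 caption) and that both strings have equal length; the function name and input layout are illustrative assumptions.

```python
def error_rate_by_cycle(pairs):
    """Return the percentage of mismatching base calls at each cycle.

    `pairs` is a list of (read, expected) strings of equal length, where
    `expected` is the best-matching reference sequence for `read`.
    """
    if not pairs:
        return []
    n_cycles = len(pairs[0][0])
    mismatches = [0] * n_cycles
    for read, expected in pairs:
        if len(read) != n_cycles or len(expected) != n_cycles:
            raise ValueError("all reads and references must share one length")
        for cycle, (called, truth) in enumerate(zip(read, expected)):
            if called != truth:
                mismatches[cycle] += 1
    # Average the mismatches over all sequences, per cycle
    return [100.0 * m / len(pairs) for m in mismatches]
```

Plotting the returned list against cycle number gives a curve of the kind summarized in Fig. 6A.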

    As illustrated in Fig. 7, BING produces higher density readswhile maintaining

  • identify intensities and calculate locations for base calling. To initiate the analysis, each image is scaled and registered, then passed through a set of filters to de-noise, sharpen and enhance the clusters. Once the image has been adjusted, Firecrest performs cluster detection based upon feature extraction. 3. Bustard performs base calling by deconvolving the signal and applying two distinct areas of correction to the clusters, spectral cross talk and phasing. These corrections must be accounted for when using the Genome Analyzer image capture technique as the reactions are sensitive to both cross talk and phasing. 4. Generation of Recursive Analyses Linked by Dependency (GERALD) provides a mechanism for Sequence Analysis, Visualization, Filtering, and Alignment. It is the only module which allows for configurable levels of parallelism (Illumina, Genome Analyzer Pipeline Software User Guide, 2008).

    References

    [1] Pillsbury E (n.d.). A history of genome sequencing. [accessed September 2008].

    [2] U.S. Department of Energy Office of Science (n.d.). Human Genome Project Information. [accessed July 2008].

    [3] Illumina. Genome Analyzer Pipeline Software User Guide. Illumina; 2008.

    [4] Mackay C. Lucky Imaging Web Site Home. [accessed September 2008].

    [5] Najman L, Schmitt M. Watershed of a continuous function. Signal Process 1994:99-112.

    [6] Craig DW, Pearson JV, Szelinger S, Sekar A, Redman M, Corneveaux JJ, et al. Identification of genetic variants using barcoded multiplexed sequencing. Nat Methods 2008;5(10):887-93.

    [7] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990:403-10.

    [8] Erlich Y, Mitra PP, delaBastide M, McCombie WR, Hannon G. Alta-cyclic: a self-optimizing base caller for next generation sequencing. Nat Methods 2008;5:679-82.


    BING: Biomedical informatics pipeline for Next Generation Sequencing
        Introduction
        Methods
            BING approach
                Image alignment
                Signal correlation, compensation and separation
                Cluster registration
                Signal measurement, and base calling
                Quality control and accuracy measurement
        Results
            Sequence data
            Quality control
            Benchmark: BING vs. Illumina
            Run time analysis
            Results discussion
        Future directions
            Sequencing technology
            Real-time image near lossless compression
        Acknowledgments
        Appendix A
            approach
        References