01556805

Upload: fer6669993

Post on 04-Jun-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/14/2019 01556805

    1/5

    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSII: EXPRESS BRIEFS, VOL. 52, NO. 12, DECEMBER 2005 851

    Reconfigurable Architectures for Bio-SequenceDatabase Scanning on FPGAs

    Timothy F. Oliver, Bertil Schmidt, Member, IEEE, and Douglas L. Maskell, Senior Member, IEEE

    AbstractProtein sequences with unknown functionality areoften compared to a set of known sequences to detect functionalsimilarities. Efficient dynamic-programming algorithms existfor solving this problem, however current solutions still requiresignificant scan times. These scan time requirements are likelyto become even more severe due to the rapid growth in the sizeof these databases. In this paper, we present a new approach tobio-sequence database scanning using re-configurable field-pro-grammable gate array (FPGA)-based hardware platforms togain high performance at low cost. Efficient mappings of theSmithWaterman algorithm using fine-grained parallel pro-cessing elements (PEs) that are tailored toward the parameters ofa query have been designed. We use customization opportunities

    available at run time to dynamically reconfigure the PEs to makebetter use of available resources. Our FPGA implementationachieves a speedup of approximately 170 for linear gap penaltiesand 125 for affine gap penalties compared to a standard desktopcomputing platform. We show how run-time reconfiguration canbe used to further improve performance.

    Index TermsBiomedical computing, database searching, pro-teins, reconfigurable architectures.

    I. INTRODUCTION

    SCANNING protein sequence databases is a common andoften repeated task in molecular biology. The need for

    speeding up these searches comes from the rapid growth of the

    bio-sequence banks: every year their size is scaled by a factor1.5 to 2. The scan operation consists of finding similarities be-tween a particular query sequence and all sequences of a bank.This allows biologists to identify sequences sharing commonsubsequences, which from a biological viewpoint have similar

    functionality. Comparison algorithms whose complexities arequadratic with respect to the length of the sequences detectsimilarities between the query sequence and a subject sequence.

    One frequently used approach to speed up this time consumingoperation is to introduce heuristics in the search algorithm [1].The drawback is that the more efficient the heuristics, the worseis the result.

    A number of parallel architectures have been developed forsequence analysis. Special-purpose systolic arrays provide thebest area/performance ratio for a particular algorithm [2]. How-

    ever, they are limited to one single algorithm, and thus cannotsupply the flexibility necessary to run a variety of algorithms

    required for analyzing DNA, RNA, and proteins. P-NAC wasthe first such machine and computed edit distance over a four-character alphabet [3]. More recent examples, better tuned to

    Manuscript received October 12, 2004. This paper was recommended byAssociate Editor N. R. Aluru.

    The authors are with the School of Computer Engineering, Nanyang Techno-logical University, Singapore 63978 (e-mail: [email protected]).

    Digital Object Identifier 10.1109/TCSII.2005.853340

    the needs of computational biology, include BISP and SAMBA[4], [5].

    An approach presented in [6] is based on instruction systolic

    arrays (ISAs). ISAs combine the speed and simplicity of systolic

    arrays with flexible programmability. Several other approaches

    are based on the single-instruction multiple data (SIMD) con-

    cept, e.g., MGAP [7], Kestrel [8], and Fuzion [6]. SIMD and

    ISA architectures are programmable and applicable to a wider

    range of applications, such as image processing and scientific

    computing. Since these architectures contain more general-pur-

    pose parallel processors, their processing element (PE) density

    is less than the density of special-purpose application-specificintegrated circuits (ASICs). Nevertheless, SIMD solutions can

    still achieve significant runtime savings. However, the costs in-

    volved in designing and producing SIMD architectures are quite

    high. As a consequence, none of the above solutions has a suc-

    cessor generation, making upgrading impossible.

    Reconfigurable solutions including Splash-2 [9] and Deci-

    pher [10] are based on field-programmable gate arrays (FPGAs)

    while processing in memory (PIM) [11] has its own reconfig-

    urable design. Solutions based on FPGAs have the additional

    advantage that they can be regularly upgraded to state-of-the-

    art-technology. This makes FPGAs a good alternative to spe-

    cial-purpose and SIMD architectures.Compared to the previously published FPGA solutions, we

    are using a new partitioning technique for varying query se-

    quence lengths. The design presented in [12] is closest to our

    approach since it also uses a linear array of PEs on a recon-

    figurable platform. Unfortunately, it only allows for linear gap

    penalties and global alignment, while our implementation con-

    siders both linear and affine gap penalties and is able to compute

    local alignments.

    II. SEQUENCECOMPARISONALGORITHM

    Surprising relationships have been discovered between pro-

    tein sequences that have little overall similarity but have similarsubsequences. Thus, the identification of similar subsequences

    is probably the most useful and practical method for comparing

    two sequences. The SmithWaterman algorithm [13] finds

    the most similar subsequences of two sequences (the local

    alignment) by dynamic programming. The algorithm com-

    pares two sequences by computing a distance representing the

    minimal cost of transforming one segment into another. Two

    elementary operations are used: substitution and insertion/dele-

    tion. Through a series of these operations, any segment can

    be transformed into any other segment. The smallest number

    of operations required to change one segment to another is a

    measure of the distance between the segments.

    1057-7130/$20.00 2005 IEEE

  • 8/14/2019 01556805

    2/5

    852 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSII: EXPRESS BRIEFS, VOL. 52, NO. 12, DECEMBER 2005

    Fig. 1. Sequence comparison on a linear processor array: the query sequenceis loaded into the processor array (one character per PE) and a subject sequence

    flows from left to right through the array.

    Consider two strings and of length and . To iden-

    tify common subsequences, the SmithWaterman algorithmcomputes the similarity of two sequences ending at po-

    sition and of the two sequences S1 and S2. The computation

    of is given by the following:

    for

    for

    for

    where is a character substitution cost table. Initialization

    of these values are given by

    for , . Multiple gap costs are

    taken into account as follows: is the cost of thefirst gap; is

    the cost of the following gaps. This type of gap cost is known

    as affine gap penalty. Some applications also use a linear gap

    penalty, i.e., . For linear gap penalties the above can be

    simplified to

    for

    for

    III. MAPPINGSEQUENCECOMPARISON

    The algorithm can be efficiently mapped to a linear array of

    PEs, where one PE is assigned to each character in the query

    string and then a subject sequence is shifted systolically through

    the linear chain of PEs (see Fig. 1). If and are the lengths

    of the two sequences, the comparison is performed in

    steps on PEs, instead of the steps required on a se-quential processor. Taking advantage of the reconfigurable de-

    vice, we tailor the PE design toward different gap penalty func-

    tions. This approach allows us to include only the required com-

    putational hardware and local memory. Fig. 2 shows our designs

    for linear gap penalties and affine gap penalties.

    The data width is scaled to the required precision (usu-

    ally is suf ficient). The look-up table (LUT) depth is

    scaled to hold the number of substitution table rows. The substi-

    tution width is scaled to accommodate the dynamic range

    required by the substitution table. The look-up address width

    is scaled in relation to the LUT depth. Each PE has local

    memory to store , and .

    The PE holds a column of the substitution table in its LUT. Thelook up of and addition to is

    done in one cycle. The score is calculated in the next cycle and

    is passed, along with the maximum score computed so far, to the

    next PE in the array. The affine gap penalty PE has additional

    storage for and and the additional score.

    Assume we are aligning the sequences

    and , on a linear processor array of size

    with affine gap penalties, where is the query sequence andis a subject sequence of the database. As a preprocessing step,

    symbol , i s loaded i nto P E , . A fter t hat t he r ow o f

    the substitution table corresponding to the respective character

    is loaded into each PE as well as the gap penalties and . is

    thencompletelyshiftedthrough the array in steps as

    displayed in Fig. 1. In iteration step , , the

    values , , and for all , with ,

    and are computed in parallel in all PEs

    , within a single clock cycle. For this calculation PE

    , , receives the values , , and

    from its left neighbor , while the values ,

    , , , , , and are stored

    locally. PE 0 receives in s teps with . C omputationfor a lineargap penalty is similar. Thesequences canbe pipelined

    with only one step delay between two different sequences.

    Our PE design incorporates the maximum computation of the

    matrix with only a constant time penalty as follows: After each

    iteration step all PEs compute a new value by taking the

    maximum of the newly computed value and the old value of

    from its left neighbor. After the last character of a subject

    sequence has been processed in PE , the maximum of matrix

    is stored in PE , which is then written to off-chip memory.

    Ranking and reconstructing the alignments are carried out by the

    front end PC. Because this last operation is only performed for

    very few subject sequences, its computation time is negligible.So far, we have assumed a processor array is equal in size

    to the query sequence length. In practice, the length of the se-

    quences may vary and thus the computation must be partitioned

    on thefixed size processor array. The query sequence is usually

    larger than the processor array. For sake of clarity we firstly as-

    sume a query sequence of length and a processor array of size

    where is a multiple of , i.e., where is

    an integer. A possible solution is to split the computation into

    passes. Unfortunately, this solution requires a large amount of

    off-chip memory. The memory requirement can be reduced by

    factor by splitting the database into equal-sized pieces and

    computing the alignment scores of all subject sequences within

    each piece. However, this approach also increases the loading

    time of substitution table columns by factor .

    To eliminate this loading time, we have extended our PE de-

    sign. Each PE now stores columns of the substitution table in-

    stead of only one. Although this increases the area of each PE, it

    allows alignment of each database sequence with the complete

    query sequence without additional delays. The off-chip memory

    requirement for intermediate results is also reduced to four times

    the longest database sequence size. Fig. 3 illustrates our solu-

    tion. Different configurations are designed for different values

    of , allowing the loading of a particular configuration suited

    for a range of query sequence lengths.

    The database sequences are passed in from the host one byone through afirst-infirst out (FIFO) buffer to the interface.

  • 8/14/2019 01556805

    3/5

    OLIVERet al.: RECONFIGURABLE ARCHITECTURES FOR BIO-SEQUENCE DATABASE SCANNING 853

    Fig. 2. (a) Linear gap penalty PE. (b) Affine gap penalty PE.

    Fig. 3. System implementation.

    For query lengths longer than the PE array the intermediate re-

    sults are stored i n a FIFO of width for af fine gap

    penalty. For linear gap penalty the FIFO width is .

    The FIFO depth is sized to hold the longest sequence in thedatabase. The database sequence is also stored in the FIFO. On

    each consecutive pass an LUT offset is added to address the next

    column of the substitution table stored within the PEs. The max-

    imum score on each pass is compared with those from all other

    passes and the absolute maximum is returned.

    IV. PERFORMANCEEVALUATION

    We have described the PE design in Verilog and targeted it to

    the Xilinx Virtex II architecture. The size of a linear gap penalty

    PE is 3 10 configurable logic blocks (CLBs) and the size

    of an affine gap penalty PE is 6 8 CLBs. Using a Virtex II

    XC2V6000 we are able to accommodate 252 linear PEs or 168

    affine PEs using . This allows for query sequence lengths

    up to 756 and 504, respectively, which is sufficient in most cases

    (74% of sequences in SwissProt are 500 [14]). For longer

    queries we have a design with , which can accommodate

    168 linear PEs or 126 affine PEs. The clock frequencies are 55MHz for linear and 45 MHz for affine.

  • 8/14/2019 01556805

    4/5

    854 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSII: EXPRESS BRIEFS, VOL. 52, NO. 12, DECEMBER 2005

    TABLE IPERFORMANCE OFFPGA-BASEDBIO-SEQUENCESCANNER

    TABLE IIPERFORMANCEEVALUATIONVERSUSQUERYSEQUENCELENGTH

    A performance measure commonly used in computational bi-

    ology is cell updates per second(CUPS). A CUPS represents

    the time for a complete computation of one entry of the matrix, including all comparisons, additions and maxima computa-

    tions. The CUPS performance of our implementations can be

    measured by multiplying number of PEs times clock frequency.

    Table I summarizes our results.

    As CUPS does not consider data transfer time, query length

    and initialization time, it is often a weak measure that does not

    reflect the behavior of the complete system. Thus, we use data-base scans for different query lengths in our evaluation. Table II

    reports the performance for scanning the SwissProt proteindatabank (release 42.5, which contains sequences com-

    prising amino acids [14]) for query sequences of var-

    ious lengths using our design on an RC2000 FPGA Mezzanine

    PCI-board with a Virtex II XC2V6000.

    An optimized C-program on a Pentium IV 1.6 GHz has a

    performance of 52 MCUPS for linear gap penalties and 40

    MCUPS for affine gap penalties. Hence, our FPGA implemen-

    tation achieves a speedup of approximately 170 for linear gap

    penalties and 125 for affine gap penalties.For the comparison of different parallel machines, we have

    taken data from [4], [6], [8], [12] for a database search with the

    SW algorithm for different query lengths. The XC2V6000 is

    around ten times faster than the larger 16 K-PE MasPar. Kestrel,

    Fuzion and Systola 1024 are one-board SIMD solutions. Kestrel

    is 12 times slower [8], Fuzion is two to three times slower [6],

    and Systola is around 50 times slower [6] than our solution.All have a lower performance, as they use older CMOS tech-

    nology. Extrapolating to current technologies both SIMD and

    reconfigurable FPGA platforms have approximately equal per-

    formance. However, the difference between approaches is that

    FPGAs allow easy upgrading.

    Our implementation is slower than the implementations of

    [9], [15], [16]. However, these designs implement only edit

    distance. This greatly simplifies the PE design and achieves ahigher PE density and clock frequency. Although of theoretical

    interest, edit distance is not used in practice as it does not allow

    for different gap penalties and substitution tables. The Virtex

    XCV2000E implementation presented in [12] is around three

    times slower than our solution. Unfortunately, the design onlyimplements global alignment.

    Fig. 4. Web service architecture.

    V. RUN-TIMERE-CONFIGURATION

    A web service based on our PE designs has been developed. A

    user entersa database query using a form-based webpage similarto MPSrch [17]. Fig. 4 shows the service architecture. A MySQL

    server is used to store the SwissProt database [14] and queueuser queries. The application manager handles communication

    between the MySQL server and the FPGA board. On start-up the

    sequence data is extracted from the database and compiled into

    a form suitable for fast streaming to the FPGA. The parameter

    set from a user query is given a unique ID and stored in a MySQL

    table. When the application manager sees the query in the list, it

    processes it for transfer to the FPGA and configures the FPGAwith a lineararray suitedto therequest.The query parameters are

    loadedontothelineararrayandthecompileddatabaseisstreamed

    across the PCI interface. Results are collected and ranked in theapplication manager. When the search isfinished the results arereturned to the MySQL database where the web server identifiesthem using the unique ID and displays them to the user.

    A. Customizing a Query

    Reconfiguration is worthwhile only if the time saving isgreater than the time to re-configure. It takes approximately80 ms to configure the XC2V6000 over the PCI bus, while a

    database scan takes several seconds. Thus, it is feasible to use

    the queriesparameters to customize the PEs to make better useof the available resource.

    B. Dynamic Precision Scaling

    The precision of the PE array is dictated by the bit-width of

    the data path, while the resource used by the data path scales

    linearly with its bit width. As the operations in the data path

    are all additions, the propagation delay scales linearly with its

    bit-width. Thus, for a lower bit-width it is possible tofit morePEs on to the FPGA area and run them faster. Given the query

    parameters, we must decide on an adequate precision for the PE

    data path. As the PE design uses zero limited saturation arith-

    metic, the minimum value is always 0 and the maximum value

    is , where is the bit width. The negative values

    summed in the PE naturally saturate to zero. The score will only

    increment when two amino acids match. The largest incrementin any one processing step is the highest positive value in the

  • 8/14/2019 01556805

    5/5

    OLIVERet al.: RECONFIGURABLE ARCHITECTURES FOR BIO-SEQUENCE DATABASE SCANNING 855

    Fig. 5. Dynamic precision database scan scaling to the length of SwissProtdatabase sequences.

    Fig. 6. Half and quarter size database scanners on one Virtex-II XC2V6000.

    substitution table . If is the length of thefirst se-quence and is the length of the second, the maximum score

    that could occur is . Thus, the product of

    the shortest sequence length and the maximum value in the sub-

    stitution table dictates the precision.

    Protein databases exhibit a Gaussian curve length distribu-

    tion centered around 300 amino acids. The SwissProt database

    [14] is no exception. The database is scanned from shortest to

    longest sequence, using a lower precision PE array at the begin-

    ning and a higher precision for longer sequences. A full device

    re-configuration is required each time the precision is changed

    which takes approximately 80 ms. Fig. 5 shows how dynamic

    scaling gives a higher MCUPS value at the start of the scan,

    slowing down toward the end. Using this technique with a query

    sequence of length 1452 and a PE array we achieved a

    6% increase in the performance.

    C. Multi-Threading Requests

    To maintain performance for short query sequences the data-

    base is split into two to four segments. We can place half-length

    and/or quarter-length linear arrays to make better use of the

    FPGA resource as shown in Fig. 6. The different database seg-

    ments have to be streamed across the PCI bus to each linear

    array, which could prove to be a performance bottleneck.

    It is conceivable that several short sequence requests could be

    in the queue at any one time. To handle this we propose sortingthem into bins depending on the size of PE array required. Short

    sequences are grouped and all scheduled to the device at once.

    The database streaming is synchronized between the linear ar-

    rays. We can regain the 2575% performance lost in short se-quence alignments while using the same FPGA resource and

    database-streaming overhead.

    VI. CONCLUSIONIn this paper, we have demonstrated that re-configurable

    hardware platforms provide a cost-effective solution to high

    performance bio-sequence database scanning. A partitioning

    strategy to implement database scans with a fixed-size pro-

    cessor array and varying sequence lengths was described. Using

    our PE design and our partitioning strategy we can achieve

    supercomputer-like performance on a low cost off-the-shelf

    FPGA. Our FPGA implementation achieves a speedup of

    approximately 170 for linear gap penalties and 125 for affine

    gap penalties as compared to a desktop computing platform.

    We have presented several strategies for further improving

    the performance of our approach using customization enabled

    by run-time re-configuration. Our investigation suggests a 6%improvement in performance when using dynamic precision

    scaling during a database scan. Scheduling multiple different

    sized processing arrays on a single device allows us to regain the

    25%75% performance lost in short-sequence alignments whileusingthesameFPGAresourceanddatabase-streamingoverhead.

    REFERENCES

    [1] S. F. Altschul,W. Gish, W. Miller, E. W. Myers,and D. J. Lipman, Basiclocalalignment search tool,J. Molec. Biol., vol.215,pp.403410, 1990.

    [2] R. Hughey, Parallel hardware for sequence comparison and alignment,CABIOS, vol. 12, no. 6, pp. 473 479, 1996.

    [3] D. P. Lopresti,P-NAC: a systolic array for comparing nucleic acid se-

    quences,Computer, vol. 20, no. 7, pp. 9899, 1987.[4] P. Guerdoux-Jamet andD. Lavenier, SAMBA: hardware acceleratorfor

    biological sequence comparison,CABIOS, vol. 12, no. 6, pp. 609615,1997.

    [5] R. K. Singh et al.,BIOSCAN: A network sharable computational re-source for searching biosequence databases, CABIOS, vol. 12, no. 3,pp. 191196, 1996.

    [6] B. Schmidt, H. Schrder, and M. Schimmler, Massively parallel solu-tions for molecular sequence analysis,in Proc. IEEE Int. Workshop on

    High Performance Computational Biology, Ft. Lauderdale, FL, 2002.[7] M. Borah, R. S. Bajwa, S. Hannenhalli, and M. J. Irwin,A SIMD so-

    lution to the sequence comparison problem on the MGAP, in Proc.ASAP94, 1994, pp. 144160.

    [8] D. Dahle, L. Grate, E. Rice, and R. Hughey,The UCSC Kestrel generalpurpose parallel processor,in Proc. Int. Conf. Parallel and DistributedProcessing Techniques and Applications., 1999, pp. 12431249.

    [9] D. T. Hoang,Searching genetic databases on Splash 2,in Proc. IEEE

    Workshop FPGAs for Custom Comp. Mach., 1993, pp. 185191.[10] TimeLogic Corporation [Online]. Available: http://www.timelogic.com[11] M. Gokhaleet al., Processing in memory: The terasys massively par-

    allel PIM array,Computer, vol. 28, no. 4, pp. 2331, 1995.[12] Y. Yamaguchi, T. Maruyama, and A. Konagaya,High speed homology

    search with FPGAs, in Proc. Pacific Symp. Biocomputing, 2002, pp.271282.

    [13] T. F. Smith and M. S. Waterman,Identification of common molecularsubsequences,J. Molec. Biol., vol. 147, pp. 195197, 1981.

    [14] B. Boeckmann etal., The SWISS-PROT protein knowledgebase and itssupplement TrEMBL in 2003,Nucl. Acids Res., vol. 31, pp. 365370,2003.

    [15] S. A. Guccione and E. Keller, Gene matching using JBits, in Proc.12th Int. Workshop Field-Programmable Logic and Applications, 2002,pp. 11681171.

    [16] C. W. Yu, K. H. Kwong, K. H. Lee, and P. H. W. Leong,A SmithWa-terman systolic cell,in Proc. 13th Int. Workshop Field Programmable

    Logic an d Applications, 2003, pp. 375384.[17] S. Sturrock and J. Collins,MPsrch Version 1.3, Biocomputing Re-

    search Unit, Univ. of Edinburgh, Edinburgh, U.K., 1993.