01556805

8/14/2019 01556805

1/5

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSII: EXPRESS BRIEFS, VOL. 52, NO. 12, DECEMBER 2005 851

Reconfigurable Architectures for Bio-SequenceDatabase Scanning on FPGAs

Timothy F. Oliver, Bertil Schmidt, Member, IEEE, and Douglas L. Maskell, Senior Member, IEEE

AbstractProtein sequences with unknown functionality areoften compared to a set of known sequences to detect functionalsimilarities. Efficient dynamic-programming algorithms existfor solving this problem, however current solutions still requiresignificant scan times. These scan time requirements are likelyto become even more severe due to the rapid growth in the sizeof these databases. In this paper, we present a new approach tobio-sequence database scanning using re-configurable field-pro-grammable gate array (FPGA)-based hardware platforms togain high performance at low cost. Efficient mappings of theSmithWaterman algorithm using fine-grained parallel pro-cessing elements (PEs) that are tailored toward the parameters ofa query have been designed. We use customization opportunities

available at run time to dynamically reconfigure the PEs to makebetter use of available resources. Our FPGA implementationachieves a speedup of approximately 170 for linear gap penaltiesand 125 for affine gap penalties compared to a standard desktopcomputing platform. We show how run-time reconfiguration canbe used to further improve performance.

Index TermsBiomedical computing, database searching, pro-teins, reconfigurable architectures.

I. INTRODUCTION

SCANNING protein sequence databases is a common andoften repeated task in molecular biology. The need for

speeding up these searches comes from the rapid growth of the

bio-sequence banks: every year their size is scaled by a factor1.5 to 2. The scan operation consists of finding similarities be-tween a particular query sequence and all sequences of a bank.This allows biologists to identify sequences sharing commonsubsequences, which from a biological viewpoint have similar

functionality. Comparison algorithms whose complexities arequadratic with respect to the length of the sequences detectsimilarities between the query sequence and a subject sequence.

One frequently used approach to speed up this time consumingoperation is to introduce heuristics in the search algorithm [1].The drawback is that the more efficient the heuristics, the worseis the result.

A number of parallel architectures have been developed forsequence analysis. Special-purpose systolic arrays provide thebest area/performance ratio for a particular algorithm [2]. How-

ever, they are limited to one single algorithm, and thus cannotsupply the flexibility necessary to run a variety of algorithms

required for analyzing DNA, RNA, and proteins. P-NAC wasthe first such machine and computed edit distance over a four-character alphabet [3]. More recent examples, better tuned to

Manuscript received October 12, 2004. This paper was recommended byAssociate Editor N. R. Aluru.

The authors are with the School of Computer Engineering, Nanyang Techno-logical University, Singapore 63978 (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCSII.2005.853340

the needs of computational biology, include BISP and SAMBA[4], [5].

An approach presented in [6] is based on instruction systolic

arrays (ISAs). ISAs combine the speed and simplicity of systolic

arrays with flexible programmability. Several other approaches

are based on the single-instruction multiple data (SIMD) con-

cept, e.g., MGAP [7], Kestrel [8], and Fuzion [6]. SIMD and

ISA architectures are programmable and applicable to a wider

range of applications, such as image processing and scientific

computing. Since these architectures contain more general-pur-

pose parallel processors, their processing element (PE) density

is less than the density of special-purpose application-specificintegrated circuits (ASICs). Nevertheless, SIMD solutions can

still achieve significant runtime savings. However, the costs in-

volved in designing and producing SIMD architectures are quite

high. As a consequence, none of the above solutions has a suc-

cessor generation, making upgrading impossible.

Reconfigurable solutions including Splash-2 [9] and Deci-

pher [10] are based on field-programmable gate arrays (FPGAs)

while processing in memory (PIM) [11] has its own reconfig-

urable design. Solutions based on FPGAs have the additional

advantage that they can be regularly upgraded to state-of-the-

art-technology. This makes FPGAs a good alternative to spe-

cial-purpose and SIMD architectures.Compared to the previously published FPGA solutions, we

are using a new partitioning technique for varying query se-

quence lengths. The design presented in [12] is closest to our

approach since it also uses a linear array of PEs on a recon-

figurable platform. Unfortunately, it only allows for linear gap

penalties and global alignment, while our implementation con-

siders both linear and affine gap penalties and is able to compute

local alignments.

II. SEQUENCECOMPARISONALGORITHM

Surprising relationships have been discovered between pro-

tein sequences that have little overall similarity but have similarsubsequences. Thus, the identification of similar subsequences

is probably the most useful and practical method for comparing

two sequences. The SmithWaterman algorithm [13] finds

the most similar subsequences of two sequences (the local

alignment) by dynamic programming. The algorithm com-

pares two sequences by computing a distance representing the

minimal cost of transforming one segment into another. Two

elementary operations are used: substitution and insertion/dele-

tion. Through a series of these operations, any segment can

be transformed into any other segment. The smallest number

of operations required to change one segment to another is a

measure of the distance between the segments.

1057-7130/$20.00 2005 IEEE

8/14/2019 01556805

2/5

852 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSII: EXPRESS BRIEFS, VOL. 52, NO. 12, DECEMBER 2005

Fig. 1. Sequence comparison on a linear processor array: the query sequenceis loaded into the processor array (one character per PE) and a subject sequence

flows from left to right through the array.

Consider two strings and of length and . To iden-

tify common subsequences, the SmithWaterman algorithmcomputes the similarity of two sequences ending at po-

sition and of the two sequences S1 and S2. The computation

of is given by the following:

for

for

for

where is a character substitution cost table. Initialization

of these values are given by

for , . Multiple gap costs are

taken into account as follows: is the cost of thefirst gap; is

the cost of the following gaps. This type of gap cost is known

as affine gap penalty. Some applications also use a linear gap

penalty, i.e., . For linear gap penalties the above can be

simplified to

for

for

III. MAPPINGSEQUENCECOMPARISON

The algorithm can be efficiently mapped to a linear array of

PEs, where one PE is assigned to each character in the query

string and then a subject sequence is shifted systolically through

the linear chain of PEs (see Fig. 1). If and are the lengths

of the two sequences, the comparison is performed in

steps on PEs, instead of the steps required on a se-quential processor. Taking advantage of the reconfigurable de-

vice, we tailor the PE design toward different gap penalty func-

tions. This approach allows us to include only the required com-

putational hardware and local memory. Fig. 2 shows our designs

for linear gap penalties and affine gap penalties.

The data width is scaled to the required precision (usu-

ally is suf ficient). The look-up table (LUT) depth is

scaled to hold the number of substitution table rows. The substi-

tution width is scaled to accommodate the dynamic range

required by the substitution table. The look-up address width

is scaled in relation to the LUT depth. Each PE has local

memory to store , and .

The PE holds a column of the substitution table in its LUT. Thelook up of and addition to is

done in one cycle. The score is calculated in the next cycle and

is passed, along with the maximum score computed so far, to the

next PE in the array. The affine gap penalty PE has additional

storage for and and the additional score.

Assume we are aligning the sequences

and , on a linear processor array of size

with affine gap penalties, where is the query sequence andis a subject sequence of the database. As a preprocessing step,

symbol , i s loaded i nto P E , . A fter t hat t he r ow o f

the substitution table corresponding to the respective character

is loaded into each PE as well as the gap penalties and . is

thencompletelyshiftedthrough the array in steps as

displayed in Fig. 1. In iteration step , , the

values , , and for all , with ,

and are computed in parallel in all PEs

, within a single clock cycle. For this calculation PE

, , receives the values , , and

from its left neighbor , while the values ,

, , , , , and are stored

locally. PE 0 receives in s teps with . C omputationfor a lineargap penalty is similar. Thesequences canbe pipelined

with only one step delay between two different sequences.

Our PE design incorporates the maximum computation of the

matrix with only a constant time penalty as follows: After each

iteration step all PEs compute a new value by taking the

maximum of the newly computed value and the old value of

from its left neighbor. After the last character of a subject

sequence has been processed in PE , the maximum of matrix

is stored in PE , which is then written to off-chip memory.

Ranking and reconstructing the alignments are carried out by the

front end PC. Because this last operation is only performed for

very few subject sequences, its computation time is negligible.So far, we have assumed a processor array is equal in size

to the query sequence length. In practice, the length of the se-

quences may vary and thus the computation must be partitioned

on thefixed size processor array. The query sequence is usually

larger than the processor array. For sake of clarity we firstly as-

sume a query sequence of length and a processor array of size

where is a multiple of , i.e., where is

an integer. A possible solution is to split the computation into

passes. Unfortunately, this solution requires a large amount of

off-chip memory. The memory requirement can be reduced by

factor by splitting the database into equal-sized pieces and

computing the alignment scores of all subject sequences within

each piece. However, this approach also increases the loading

time of substitution table columns by factor .

To eliminate this loading time, we have extended our PE de-

sign. Each PE now stores columns of the substitution table in-

stead of only one. Although this increases the area of each PE, it

allows alignment of each database sequence with the complete

query sequence without additional delays. The off-chip memory

requirement for intermediate results is also reduced to four times

the longest database sequence size. Fig. 3 illustrates our solu-

tion. Different configurations are designed for different values

of , allowing the loading of a particular configuration suited

for a range of query sequence lengths.

The database sequences are passed in from the host one byone through afirst-infirst out (FIFO) buffer to the interface.

8/14/2019 01556805

3/5

OLIVERet al.: RECONFIGURABLE ARCHITECTURES FOR BIO-SEQUENCE DATABASE SCANNING 853

Fig. 2. (a) Linear gap penalty PE. (b) Affine gap penalty PE.

Fig. 3. System implementation.

For query lengths longer than the PE array the intermediate re-

sults are stored i n a FIFO of width for af fine gap

penalty. For linear gap penalty the FIFO width is .

The FIFO depth is sized to hold the longest sequence in thedatabase. The database sequence is also stored in the FIFO. On

each consecutive pass an LUT offset is added to address the next

column of the substitution table stored within the PEs. The max-

imum score on each pass is compared with those from all other

passes and the absolute maximum is returned.

IV. PERFORMANCEEVALUATION

We have described the PE design in Verilog and targeted it to

the Xilinx Virtex II architecture. The size of a linear gap penalty

PE is 3 10 configurable logic blocks (CLBs) and the size

of an affine gap penalty PE is 6 8 CLBs. Using a Virtex II

XC2V6000 we are able to accommodate 252 linear PEs or 168

affine PEs using . This allows for query sequence lengths

up to 756 and 504, respectively, which is sufficient in most cases

(74% of sequences in SwissProt are 500 [14]). For longer

queries we have a design with , which can accommodate

168 linear PEs or 126 affine PEs. The clock frequencies are 55MHz for linear and 45 MHz for affine.

8/14/2019 01556805

4/5

854 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSII: EXPRESS BRIEFS, VOL. 52, NO. 12, DECEMBER 2005

TABLE IPERFORMANCE OFFPGA-BASEDBIO-SEQUENCESCANNER

TABLE IIPERFORMANCEEVALUATIONVERSUSQUERYSEQUENCELENGTH

A performance measure commonly used in computational bi-

ology is cell updates per second(CUPS). A CUPS represents

the time for a complete computation of one entry of the matrix, including all comparisons, additions and maxima computa-

tions. The CUPS performance of our implementations can be

measured by multiplying number of PEs times clock frequency.

Table I summarizes our results.

As CUPS does not consider data transfer time, query length

and initialization time, it is often a weak measure that does not

reflect the behavior of the complete system. Thus, we use data-base scans for different query lengths in our evaluation. Table II

reports the performance for scanning the SwissProt proteindatabank (release 42.5, which contains sequences com-

prising amino acids [14]) for query sequences of var-

ious lengths using our design on an RC2000 FPGA Mezzanine

PCI-board with a Virtex II XC2V6000.

An optimized C-program on a Pentium IV 1.6 GHz has a

performance of 52 MCUPS for linear gap penalties and 40

MCUPS for affine gap penalties. Hence, our FPGA implemen-

tation achieves a speedup of approximately 170 for linear gap

penalties and 125 for affine gap penalties.For the comparison of different parallel machines, we have

taken data from [4], [6], [8], [12] for a database search with the

SW algorithm for different query lengths. The XC2V6000 is

around ten times faster than the larger 16 K-PE MasPar. Kestrel,

Fuzion and Systola 1024 are one-board SIMD solutions. Kestrel

is 12 times slower [8], Fuzion is two to three times slower [6],

and Systola is around 50 times slower [6] than our solution.All have a lower performance, as they use older CMOS tech-

nology. Extrapolating to current technologies both SIMD and

reconfigurable FPGA platforms have approximately equal per-

formance. However, the difference between approaches is that

FPGAs allow easy upgrading.

Our implementation is slower than the implementations of

[9], [15], [16]. However, these designs implement only edit

distance. This greatly simplifies the PE design and achieves ahigher PE density and clock frequency. Although of theoretical

interest, edit distance is not used in practice as it does not allow

for different gap penalties and substitution tables. The Virtex

XCV2000E implementation presented in [12] is around three

times slower than our solution. Unfortunately, the design onlyimplements global alignment.

Fig. 4. Web service architecture.

V. RUN-TIMERE-CONFIGURATION

A web service based on our PE designs has been developed. A

user entersa database query using a form-based webpage similarto MPSrch [17]. Fig. 4 shows the service architecture. A MySQL

server is used to store the SwissProt database [14] and queueuser queries. The application manager handles communication

between the MySQL server and the FPGA board. On start-up the

sequence data is extracted from the database and compiled into

a form suitable for fast streaming to the FPGA. The parameter

set from a user query is given a unique ID and stored in a MySQL

table. When the application manager sees the query in the list, it

processes it for transfer to the FPGA and configures the FPGAwith a lineararray suitedto therequest.The query parameters are

loadedontothelineararrayandthecompileddatabaseisstreamed

across the PCI interface. Results are collected and ranked in theapplication manager. When the search isfinished the results arereturned to the MySQL database where the web server identifiesthem using the unique ID and displays them to the user.

A. Customizing a Query

Reconfiguration is worthwhile only if the time saving isgreater than the time to re-configure. It takes approximately80 ms to configure the XC2V6000 over the PCI bus, while a

database scan takes several seconds. Thus, it is feasible to use

the queriesparameters to customize the PEs to make better useof the available resource.

B. Dynamic Precision Scaling

The precision of the PE array is dictated by the bit-width of

the data path, while the resource used by the data path scales

linearly with its bit width. As the operations in the data path

are all additions, the propagation delay scales linearly with its

bit-width. Thus, for a lower bit-width it is possible tofit morePEs on to the FPGA area and run them faster. Given the query

parameters, we must decide on an adequate precision for the PE

data path. As the PE design uses zero limited saturation arith-

metic, the minimum value is always 0 and the maximum value

is , where is the bit width. The negative values

summed in the PE naturally saturate to zero. The score will only

increment when two amino acids match. The largest incrementin any one processing step is the highest positive value in the

8/14/2019 01556805

5/5

OLIVERet al.: RECONFIGURABLE ARCHITECTURES FOR BIO-SEQUENCE DATABASE SCANNING 855

Fig. 5. Dynamic precision database scan scaling to the length of SwissProtdatabase sequences.

Fig. 6. Half and quarter size database scanners on one Virtex-II XC2V6000.

substitution table . If is the length of thefirst se-quence and is the length of the second, the maximum score

that could occur is . Thus, the product of

the shortest sequence length and the maximum value in the sub-

stitution table dictates the precision.

Protein databases exhibit a Gaussian curve length distribu-

tion centered around 300 amino acids. The SwissProt database

[14] is no exception. The database is scanned from shortest to

longest sequence, using a lower precision PE array at the begin-

ning and a higher precision for longer sequences. A full device

re-configuration is required each time the precision is changed

which takes approximately 80 ms. Fig. 5 shows how dynamic

scaling gives a higher MCUPS value at the start of the scan,

slowing down toward the end. Using this technique with a query

sequence of length 1452 and a PE array we achieved a

6% increase in the performance.

C. Multi-Threading Requests

To maintain performance for short query sequences the data-

base is split into two to four segments. We can place half-length

and/or quarter-length linear arrays to make better use of the

FPGA resource as shown in Fig. 6. The different database seg-

ments have to be streamed across the PCI bus to each linear

array, which could prove to be a performance bottleneck.

It is conceivable that several short sequence requests could be

in the queue at any one time. To handle this we propose sortingthem into bins depending on the size of PE array required. Short

sequences are grouped and all scheduled to the device at once.

The database streaming is synchronized between the linear ar-

rays. We can regain the 2575% performance lost in short se-quence alignments while using the same FPGA resource and

database-streaming overhead.

VI. CONCLUSIONIn this paper, we have demonstrated that re-configurable

hardware platforms provide a cost-effective solution to high

performance bio-sequence database scanning. A partitioning

strategy to implement database scans with a fixed-size pro-

cessor array and varying sequence lengths was described. Using

our PE design and our partitioning strategy we can achieve

supercomputer-like performance on a low cost off-the-shelf

FPGA. Our FPGA implementation achieves a speedup of

approximately 170 for linear gap penalties and 125 for affine

gap penalties as compared to a desktop computing platform.

We have presented several strategies for further improving

the performance of our approach using customization enabled

by run-time re-configuration. Our investigation suggests a 6%improvement in performance when using dynamic precision

scaling during a database scan. Scheduling multiple different

sized processing arrays on a single device allows us to regain the

25%75% performance lost in short-sequence alignments whileusingthesameFPGAresourceanddatabase-streamingoverhead.

REFERENCES

[1] S. F. Altschul,W. Gish, W. Miller, E. W. Myers,and D. J. Lipman, Basiclocalalignment search tool,J. Molec. Biol., vol.215,pp.403410, 1990.

[2] R. Hughey, Parallel hardware for sequence comparison and alignment,CABIOS, vol. 12, no. 6, pp. 473 479, 1996.

[3] D. P. Lopresti,P-NAC: a systolic array for comparing nucleic acid se-

quences,Computer, vol. 20, no. 7, pp. 9899, 1987.[4] P. Guerdoux-Jamet andD. Lavenier, SAMBA: hardware acceleratorfor

biological sequence comparison,CABIOS, vol. 12, no. 6, pp. 609615,1997.

[5] R. K. Singh et al.,BIOSCAN: A network sharable computational re-source for searching biosequence databases, CABIOS, vol. 12, no. 3,pp. 191196, 1996.

[6] B. Schmidt, H. Schrder, and M. Schimmler, Massively parallel solu-tions for molecular sequence analysis,in Proc. IEEE Int. Workshop on

High Performance Computational Biology, Ft. Lauderdale, FL, 2002.[7] M. Borah, R. S. Bajwa, S. Hannenhalli, and M. J. Irwin,A SIMD so-

lution to the sequence comparison problem on the MGAP, in Proc.ASAP94, 1994, pp. 144160.

[8] D. Dahle, L. Grate, E. Rice, and R. Hughey,The UCSC Kestrel generalpurpose parallel processor,in Proc. Int. Conf. Parallel and DistributedProcessing Techniques and Applications., 1999, pp. 12431249.

[9] D. T. Hoang,Searching genetic databases on Splash 2,in Proc. IEEE

Workshop FPGAs for Custom Comp. Mach., 1993, pp. 185191.[10] TimeLogic Corporation [Online]. Available: http://www.timelogic.com[11] M. Gokhaleet al., Processing in memory: The terasys massively par-

allel PIM array,Computer, vol. 28, no. 4, pp. 2331, 1995.[12] Y. Yamaguchi, T. Maruyama, and A. Konagaya,High speed homology

search with FPGAs, in Proc. Pacific Symp. Biocomputing, 2002, pp.271282.

[13] T. F. Smith and M. S. Waterman,Identification of common molecularsubsequences,J. Molec. Biol., vol. 147, pp. 195197, 1981.

[14] B. Boeckmann etal., The SWISS-PROT protein knowledgebase and itssupplement TrEMBL in 2003,Nucl. Acids Res., vol. 31, pp. 365370,2003.

[15] S. A. Guccione and E. Keller, Gene matching using JBits, in Proc.12th Int. Workshop Field-Programmable Logic and Applications, 2002,pp. 11681171.

[16] C. W. Yu, K. H. Kwong, K. H. Lee, and P. H. W. Leong,A SmithWa-terman systolic cell,in Proc. 13th Int. Workshop Field Programmable

Logic an d Applications, 2003, pp. 375384.[17] S. Sturrock and J. Collins,MPsrch Version 1.3, Biocomputing Re-

search Unit, Univ. of Edinburgh, Edinburgh, U.K., 1993.

01556805

Documents