

A GENERAL-PURPOSE DYNAMICALLY RECONFIGURABLE SVM

Jonas Gomes Filho, Mario Raffo, Marius Strum, and Wang Jiang Chau

Department of Electronic Systems, University of Sao Paulo

Av. Professor Luciano Gualberto, trav. 3, 158
email: [email protected], [email protected], [email protected], [email protected]

ABSTRACT

This paper presents a hardware implementation of the Sequential Minimal Optimization (SMO) algorithm for the Support Vector Machine (SVM) training phase. A general-purpose reconfigurable architecture, aimed at partial-reconfiguration FPGAs, is developed, i.e., it supports training sets of different sizes, covering a wide range of numbers of samples and elements. The effects of the fixed-point implementation are analyzed, and data on area and frequency targeting the Xilinx Virtex-IV XC4VLX25 FPGA are provided. The architecture was able to perform the training on different learning benchmarks, and the reconfigurable version saved 22.38% of the FPGA's area.

1. INTRODUCTION

In recent years, the Support Vector Machine (SVM) has been widely used in different applications [1], due to its high classification capability on unseen data (generalization performance). The use of SVMs consists of two phases: training and testing. SVM training aims to find a hyperplane that separates the positive from the negative examples while maximizing a margin parameter. The margin is calculated by adding the minimum distances from both classes to the separating hyperplane, which requires the solution of a quadratic programming (QP) problem [2] that can be solved by numerical methods [3]. However, the associated numerical solutions have been pointed out as inadequate for large problems, due to the storage of large matrices of size proportional to $m^2$ and the long training times caused by their slow convergence [3].

As an alternative, the Sequential Minimal Optimization (SMO) algorithm [4] is one of the fastest and easiest solutions to implement for the most general case, the non-linear and non-separable one. The algorithm is iterative and adopts an analytical solution, optimizing one pair of Lagrange multipliers per iteration and therefore avoiding the storage of large matrices in memory.

This work was partially supported by the National Council for Scientific and Technological Development of Brazil (CNPq).

Important dedicated hardware implementations of the SVM training phase have been realized on digital FPGAs [5, 6, 10] to obtain speed-ups over software processing. A first fully hardware implementation of an SVM algorithm was reported in [5]; it was based on a numerical resolution of the QP problem and was therefore applicable only to small cases. SMO-based solutions were presented later, first by the authors of [7], as a software-hardware (SW-HW) system, and then in [10], as a parallel architecture for the computation of multiple pairs of samples. Although faster than a fully software SMO implementation, these solutions were targeted either at small examples or at a specific case.

In the last decade, with the technology based on static memory (SRAM), FPGAs have provided a unique aspect of flexibility: the capability for dynamic reconfiguration, which involves altering the programmed design at run-time [8]. Dynamically reconfigurable systems (DRS) provide benefits such as design flexibility and area reduction, while still providing significant speed benefits over purely software solutions.

In this work we present a dynamically reconfigurable SVM-SMO architecture, motivated by the sequential nature of some of the SMO tasks, basically the iterations over: 1- working set selection; 2- coefficient pair optimization; and 3- global data updating. To the best of our knowledge, this is the first dynamically reconfigurable architecture developed for SVM training. The system was designed following a modular scheme, defining concurrent and interchangeable logical areas for the reconfiguration and leaving a high-level synthesis system to generate the RTL model for each module [12]. In addition, we propose a general-purpose implementation of the SMO algorithm, which is able, for a given fixed-point data representation, to train a varied set of examples with different numbers of samples and elements. To determine a suitable size of the fixed-point data for an ample set of benchmarks, a study of its effects on precision and classification error was carried out. After the RTL model generation and its synthesis, results on area and timing are presented.



In Section 2, the SVM theory and the SMO algorithm are reviewed, while in Section 3 the digital architecture is described. In Section 4, fixed-point representations of different sizes are compared. Section 5 presents the reconfigurable architecture adopted in this work and, finally, conclusions are drawn in Section 6.

2. BACKGROUND

2.1. Support Vector Machines

The objective of the SVM training phase is to find a classification function based on a set of samples $\{\vec{x}_i, y_i\}_{i=1}^{m}$, where $\vec{x}_i$ is an input pattern represented as a d-dimensional vector and $y_i \in \{+1, -1\}$ is the corresponding output. In order to split the input vectors into two classes, a separating hyperplane is found. The classification function is given by [2]:

$$\hat{\Phi}(\vec{x}) = \vec{w} \cdot \vec{x} + b \quad (1)$$

where $\vec{w}$ is the normal vector of the separating hyperplane and $b$ is a threshold parameter. Among all possible separating hyperplanes, the SVM finds the one that maximizes a margin parameter [1], i.e., the maximum distance from both the negative and the positive points. In most cases some points fall on the wrong side of the hyperplane, which may lead to classification errors. To reduce errors, the points can be mapped from the d-dimensional input space to an n-dimensional feature space through a nonlinear function $\Re^d \rightarrow \Re^n$, with $n > d$.

The resolution of this problem requires calculating the minimum of a multivariate quadratic objective function with inequality constraints. This quadratic programming (QP) problem is solved by using the Lagrange multipliers theory [1, 2]. Subsequently, SVM training can be expressed by the following equation:

$$\min_{\alpha} \; L(\alpha) = \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, \varphi(\vec{x}_i) \cdot \varphi(\vec{x}_j) - \sum_{i=1}^{m} \alpha_i \quad (2)$$

with $0 \le \alpha \le C$ and

$$\sum_{i=1}^{m} \alpha_i y_i = 0 \quad (3)$$

$C$ is an arbitrary parameter, which sets the trade-off between the margin parameter and the number of misclassified points, both important for a small generalization error.

The advantage of using the dual formulation given by (2) is to avoid knowing the nonlinear function $\varphi$ explicitly, needing only the inner product of two points in the feature space. This is possible due to the kernel function, defined by:

$$K(\vec{x}_i, \vec{x}_j) = \varphi(\vec{x}_i) \cdot \varphi(\vec{x}_j) \quad (4)$$

Among the many kernel functions that have been proposed for SVM mapping, the Gaussian kernel given by (5) is one of the most applicable kernels for the general case. It can efficiently map the original sample input space into one of much larger dimension, increasing the separation margin between the classes.

$$K(\vec{x}_i, \vec{x}_j) = e^{-\frac{\|\vec{x}_i - \vec{x}_j\|^2}{2\sigma^2}} \quad (5)$$
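For reference, the short NumPy sketch below mirrors the Gaussian kernel of (5); the function name and the example values are ours and not part of the original design.

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    """Gaussian (RBF) kernel of eq. (5): exp(-||xi - xj||^2 / (2*sigma^2))."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# Example: two nearby 3-element samples give a kernel value close to 1.
print(gaussian_kernel([0.2, 0.5, 0.1], [0.25, 0.5, 0.1], sigma=0.5))
```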

2.2. The Sequential Minimal Optimization

The Sequential Minimal Optimization (SMO) algorithm, proposed by Platt in 1998 [4], solves the QP problem analytically, avoiding problems related to numerical methods such as [3, 5], namely the matrix storage proportional to $m^2$ and the time-consuming inner loop. The algorithm breaks the large original SVM problem into a series of smaller QP problems and, once all Lagrange coefficients satisfy the KKT conditions [1], the algorithm finishes.

The algorithm is iterative and three main tasks are executed per iteration: 1- working set selection; 2- coefficient pair optimization; 3- global data updating. The selection task chooses one pair of samples whose Lagrange coefficients will be optimized and, for that, the heuristic proposed by Keerthi et al. [13] is widely used. The training samples are split into two groups, $I_{up}$ and $I_{low}$, according to the values of $\alpha$ and $y$:

$$I_{up} = \{\, i \mid \alpha_i < C, y_i = +1 \;\text{ or }\; \alpha_i > 0, y_i = -1 \,\} \quad (6)$$

$$I_{low} = \{\, i \mid \alpha_i < C, y_i = -1 \;\text{ or }\; \alpha_i > 0, y_i = +1 \,\} \quad (7)$$

Afterwards, by evaluating every sample $i$, the working set $\{i1, i2\}$ is formed by selecting the sample inside $I_{up}$ that maximizes function (8) and the sample in $I_{low}$ that minimizes it.

$$F_i = -y_i \frac{\partial L(\alpha)}{\partial \alpha_i} \quad (8)$$

In the optimization task, two bounds, Hi and Lo, are calculated based on constraint (3), and then the new values $\alpha_{i2}^{new}$ and $\alpha_{i1}^{new}$ are calculated.

$$\alpha_{i2}^{new} = \alpha_{i2}^{old} + \frac{y_{i2}(b_{low} - b_{up})}{\eta} \quad (9)$$

$$\alpha_{i1}^{new} = \alpha_{i1}^{old} + y_{i1} y_{i2} (\alpha_{i2}^{old} - \alpha_{i2}^{new}) \quad (10)$$

where

$$\eta = \frac{\partial^2 L(\alpha)}{\partial \alpha_{i2}^2} = K(\vec{x}_{i1}, \vec{x}_{i1}) + K(\vec{x}_{i2}, \vec{x}_{i2}) - 2K(\vec{x}_{i1}, \vec{x}_{i2}) \quad (11)$$

and $b_{up}$ and $b_{low}$ are the values of $F_i$ for the pair $\{i1, i2\}$.



Fig. 1. Modular architecture.

Finally, the global data update is performed for every sample $i$ through the following equations:

$$F_i^{new} = F_i^{old} + \Delta F_{i1} + \Delta F_{i2} \quad (12a)$$

$$\Delta F_{i1} = (\alpha_{i1}^{old} - \alpha_{i1}^{new}) K(\vec{x}_{i1}, \vec{x}_i) \quad (12b)$$

$$\Delta F_{i2} = (\alpha_{i2}^{old} - \alpha_{i2}^{new}) K(\vec{x}_{i2}, \vec{x}_i) \quad (12c)$$

The initial values of $\alpha$ are 0 and the initial values of $F$ are the same as those of $y$. The algorithm finishes when $b_{up} - b_{low} < \varepsilon$, where $\varepsilon$ is a user-defined tolerance parameter. This condition is proved to satisfy the KKT conditions [4].
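To make the three tasks concrete, the sketch below is a plain software model of the SMO loop just described, written under the paper's conventions (eqs. (6)-(12), $\alpha = 0$ and $F = y$ at the start, stop when $b_{up} - b_{low} < \varepsilon$). All names are ours; the $F$ update carries the label factors implied by the definition in (8), the box clipping follows the standard SMO bounds alluded to above, and the hardware of Section 3 additionally replaces $\eta$ by the constant 2.

```python
import numpy as np

def smo_train(X, y, C=1.0, gamma=0.5, eps=1e-3, max_iter=100000):
    """Software model of the SMO iteration: selection, pair optimization, global update.

    X: (m, d) array of normalized samples; y: (m,) array of +1/-1 labels.
    Returns the Lagrange multipliers alpha (a sketch, not the hardware design).
    """
    m = X.shape[0]
    # Gaussian kernel matrix (eq. 5), with gamma playing the role of 1/(2*sigma^2)
    K = np.exp(-gamma * np.square(X[:, None, :] - X[None, :, :]).sum(-1))
    alpha = np.zeros(m)          # initial alpha = 0
    F = y.astype(float)          # initial F equals y (eq. 8 evaluated at alpha = 0)

    for _ in range(max_iter):
        # 1- Working set selection (eqs. 6-8)
        i_up = np.where(((alpha < C) & (y == +1)) | ((alpha > 0) & (y == -1)))[0]
        i_low = np.where(((alpha < C) & (y == -1)) | ((alpha > 0) & (y == +1)))[0]
        if len(i_up) == 0 or len(i_low) == 0:
            break
        i1 = i_up[np.argmax(F[i_up])]
        i2 = i_low[np.argmin(F[i_low])]
        b_up, b_low = F[i1], F[i2]
        if b_up - b_low < eps:   # stopping criterion equivalent to the KKT conditions
            break

        # 2- Coefficient pair optimization (eqs. 9-11); the hardware replaces eta by 2
        eta = K[i1, i1] + K[i2, i2] - 2.0 * K[i1, i2]
        a1_old, a2_old = alpha[i1], alpha[i2]
        a2_new = a2_old + y[i2] * (b_low - b_up) / max(eta, 1e-12)
        if y[i1] != y[i2]:       # box bounds derived from constraint (3)
            lo, hi = max(0.0, a2_old - a1_old), min(C, C + a2_old - a1_old)
        else:
            lo, hi = max(0.0, a1_old + a2_old - C), min(C, a1_old + a2_old)
        a2_new = min(max(a2_new, lo), hi)
        a1_new = a1_old + y[i1] * y[i2] * (a2_old - a2_new)   # eq. (10)

        # 3- Global data update for every sample (eq. 12, with the label factors of eq. 8)
        F += (a1_old - a1_new) * y[i1] * K[i1, :] + (a2_old - a2_new) * y[i2] * K[i2, :]
        alpha[i1], alpha[i2] = a1_new, a2_new
    return alpha
```

A call such as `smo_train(X, y, C=1, gamma=1)` on a normalized data set follows the same flow that the blocks of Section 3 implement in hardware.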

3. DIGITAL ARCHITECTURE

This SVM architecture is designed for general-purpose training, i.e., it supports different numbers of input samples and elements; furthermore, the elements may have either real or binary attributes. Based on the SMO algorithm, four separate blocks were developed in a modular design, represented by the gray boxes in Fig. 1. The kernel block is conceptually part of the updater block; however, due to its computational complexity and high usage, a dedicated block was designed for it. The blocks execute the tasks described in Section 2 with minor changes, such as the substitution of the parameter $\eta$ by the constant 2. That allows a faster computation, since the kernel function is not computed and the division is executed as a simple right shift, while the classification error rate remains unaffected, because the SMO algorithm always converges to an optimal solution, based on the tolerance variable $\varepsilon$. Only the number of iterations increases slightly, as shown in Section 4.

The Gaussian kernel was adopted for this general-purpose architecture. Since the kernel function is quite complex, Anguita et al. have proposed a hardware-friendly one, given as [9]:

$$K(\vec{x}_i, \vec{x}_j) = 2^{-\gamma \|\vec{x}_i - \vec{x}_j\|_1} \quad (13)$$

where $\|\vec{x}_i - \vec{x}_j\|_1$ is the L1 norm and $\gamma = 2^{\pm p}$ is an arbitrary parameter with $p = 0, 1, 2, \dots$ [9].

This kernel can be implemented in fixed-point arithmetic using only shift and add operations, through a coordinate rotation digital computer (CORDIC)-like algorithm. In order to adapt this calculation to any number of elements, the following procedure is adopted. First, the L1 norm is calculated as follows:

$$\|\vec{x}_i - \vec{x}_j\|_1 = \sum_{s=1}^{d} |x_{is} - x_{js}| \quad (14)$$

where $d$ is the number of elements of the input vectors. Then the result is scaled by $\gamma$, which amounts to a shift since $\gamma$ is a power of two, and the CORDIC-like algorithm is performed.
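As an illustration only, the following sketch models eqs. (13)-(14) in software; the function name and the choice $\gamma = 2^{-p}$ are ours, and the final power of two is evaluated in floating point rather than with the CORDIC-like shift-and-add scheme used in the hardware [9].

```python
import numpy as np

def hw_friendly_kernel(xi, xj, p=1):
    """Software model of the hardware-friendly kernel of eq. (13):
    K = 2**(-gamma * ||xi - xj||_1), here with gamma = 2**(-p), a power of two.
    """
    z = np.abs(np.asarray(xi, float) - np.asarray(xj, float)).sum()  # L1 norm, eq. (14)
    z = z / (2 ** p)   # multiplication by gamma = 2**(-p) is a right shift in hardware
    return 2.0 ** (-z) # in hardware this power of two uses a CORDIC-like shift-add scheme

# Example with two normalized 4-element samples and p = 1 (gamma = 0.5).
print(hw_friendly_kernel([0.1, 0.9, 0.3, 0.0], [0.2, 0.5, 0.3, 0.1], p=1))
```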

Although the architecture is able to support a wide range of numbers of input samples and elements, it is limited by the internal memory of the targeted FPGA. The quantity of memory required for storing all data is: $m$ bits for $y$ and $mw$ bits for each of $\alpha$ and $F$, where $w$ is the word size in bits. For the input vectors ($\vec{x}_i$), the memory consumption is $d(m(w - 6))$ bits, since the input data is normalized and does not use the first six bits of the fixed-point word. Therefore, the total memory required, in bits, can be expressed as:

$$r_q = m\,(2w + d(w - 6)) \quad (15)$$

Taking, for instance, some benchmarks from the University of California, Irvine Machine Learning Repository [11], the required storage for training sets can reach megabits. For example, the Adult benchmark is composed of 1605 samples, each one with 123 elements. In that case, adopting a 24-bit word representation, the required memory would be 4.08 Mb. Since large memory blocks are not available in FPGAs, this issue must be tackled by including external RAM blocks with a fast interface for the dynamic access required by the SMO algorithm. Since we have targeted the Xilinx Virtex-IV XC4VLX25 FPGA in this work, for this first version of the architecture we have taken the simplifying assumption that the input training set data is limited to 1.29 Mb.
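A small helper, sketched below under our own naming, evaluates eq. (15) so that the storage of a candidate training set can be checked against the on-chip limit; the Breast Cancer dimensions are taken from Table 1.

```python
def required_memory_bits(m, d, w):
    """Storage estimate of eq. (15): 2*m*w bits for alpha and F plus
    d*m*(w - 6) bits for the normalized input vectors."""
    return m * (2 * w + d * (w - 6))

# Breast Cancer set (m = 569, d = 30) with a 24-bit word; compare the result
# against the 1.29 Mb assumed available on the target FPGA.
bits = required_memory_bits(m=569, d=30, w=24)
print(bits, "bits =", round(bits / 1e6, 2), "Mb")
```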



Table 1. Testing error and number of iterations.

Benchmarks       m/d       20-bit Hardware      24-bit Hardware      32-bit Hardware      Software             Parameters
                           T. Error  N. Iter.   T. Error  N. Iter.   T. Error  N. Iter.   T. Error  N. Iter.   C     γ
Dermatology      358/132   2.86%     393        2.86%     379        2.86%     381        2.86%     340        10    0.125
Breast Cancer    569/30    0.86%     515        0.86%     472        0.86%     472        0.86%     466        1     1
Tic Tac Toe      958/27    N/C       N/C        5.68%     2572       5.68%     2568       5.68%     2481       3     0.5

4. FIXED-POINT REPRESENTATION SIZE

One of the main concerns of an SVM implementation in hardware is the type and size of the numeric representation, which may affect the precision of the computation. In general-purpose computing, variables are stored with 32 or 64 bits in floating-point precision, which is enough to avoid most quantization problems. Dedicated hardware design, on the other hand, makes use of fixed-point registers of limited precision. For this implementation, where the error is propagated $m$ times per iteration, it is advisable to analyze the quantization effects.

The technical literature shows only a few works on this problem in the context of SVMs [4, 14], and they are not applicable to this case. In this paper, a short study, based on experimental data, was carried out in order to obtain a satisfactory fixed-point implementation. Several implementations were tested under a [6.k] fixed-point representation, i.e., with 6 + k bits. Only 6 bits were assigned to the integer part, since the input data is normalized and belongs to the [0, 1] interval, as recommended by [15]. Moreover, with the usual values for the user-defined parameters C and γ, the operations will hardly reach numbers that exceed the representation limits. For testing the different cases, we fixed the tolerance parameter $\varepsilon$ at the widely used value of 0.001 [10].
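The quantization used in this study can be mimicked in software as below; the rounding and saturation behaviour shown here is our assumption of a generic signed [6.k] two's-complement format, not a description of the synthesized datapath.

```python
import numpy as np

def to_fixed_point(x, k, int_bits=6):
    """Quantize x to a signed [int_bits.k] fixed-point value, i.e. int_bits + k bits
    of which k are fractional, with saturation at the representation limits."""
    scale = 2 ** k
    lo = -(2 ** (int_bits + k - 1))       # smallest representable integer code
    hi = (2 ** (int_bits + k - 1)) - 1    # largest representable integer code
    q = np.clip(np.round(np.asarray(x, float) * scale), lo, hi)
    return q / scale                      # back to a real value for comparison

# Worst quantization error of a normalized sample under the [6.18] (24-bit) format.
x = np.array([0.125, 0.731, 0.0009, 0.999])
print(np.abs(x - to_fixed_point(x, k=18)).max())
```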

Table 1 shows the results of the SMO algorithm applied to three benchmarks (column 1), whose numbers of samples and elements are given in column 2. Three hardware fixed-point representations were tested, with the computation of the testing (classification) error, in percent, and the number of iterations: the [6.14] representation in columns 3 and 4, [6.18] in columns 5 and 6, and [6.26] in columns 7 and 8. A software implementation of the SMO algorithm in the Matlab language was simulated and its results are in columns 9 and 10. Finally, columns 11 and 12 present the best values of the parameters C and γ (for the software implementation). For measuring the error and iteration values, we used a ten-fold cross-validation method, in which ten stratified folds are drawn from the original set, nine of them are used for training and one for testing. This process is repeated ten times so that every single fold is tested, and the average figures are reported.

It can be seen in Table 1 that, for the 20-bit codification, the training algorithm could not converge in all cases, as for the Tic-Tac-Toe example. That can be explained by the great number of input samples and iterations, which amplifies the error. Actually, for any k with 0 < k < 14, the tested benchmarks did not converge for the given $\varepsilon$. The results also show that the classification error rate is not affected by the data representation precision of the hardware implementations, even when compared to the software floating-point representation. It can also be observed that, as the representation precision increases, the number of iterations decreases, but this process stabilizes for codifications of 24 bits and above. Therefore, the solution with the smallest number of bits that maintains the quality of the results is the 24-bit codification.

Table 2. Number of slices and relative area occupation.
           Sel.    Opt.    Upd.    Kernel   Others
#Slices    103     322     196     402      416
% Total    7.16    22.38   13.62   27.94    28.91

Fig. 2. Reconfiguration schedule.

Fig. 3. Partitions allocation.

5. RECONFIGURABLE ARCHITECTURE

For the definition of the reconfigurable architecture, the area consumption of each block was obtained by synthesis and is presented in Table 2. It can be observed that the kernel and the optimization blocks occupy most of the FPGA area. Since there is no direct dependence between them and they are not fired at the same time, it is possible to interchange these two blocks by dynamic reconfiguration, reaching an area saving of 22.38%.

Based on these facts, the reconfiguration schedule illustrated in Fig. 2 is proposed. The figure illustrates the execution of one complete iteration. In t1, the selection task is performed while the bitstream of the optimizer block is loaded into the FPGA. In t2, the optimization task is executed, while in t3 the FPGA is once more reconfigured to load the kernel block, which performs the update task together with the updater block in t4, as described in Fig. 1. The partitions according to this time split are shown in Fig. 3.

This reconfiguration plan affects the execution time to some extent. In order to analyze this effect, let us consider the iteration execution time in the architecture without dynamic reconfiguration:

$$it = \frac{m(9d + 18) + 17}{f} \quad (16)$$

where $it$ is the iteration execution time and $f$ is the operating frequency. Consider now the execution time in an implementation using dynamic reconfiguration as described in Fig. 2, which is given by

$$rit = \frac{m(9d + 13) + 17}{f} + 2\,rt \quad (17)$$

with

$$rt = \frac{Bitstream}{\#Bytes_{recp} \cdot rf_{recp}} \quad (18)$$

where $rit$ is the iteration execution time with dynamic reconfiguration, $rt$ is the reconfiguration time, $Bitstream$ is the number of bytes of the partial reconfiguration bitstream, $\#Bytes_{recp}$ is the width in bytes of the reconfiguration port, and $rf_{recp}$ is the frequency of the reconfiguration port. The first term of equation (17) is reduced by $5m$ with respect to (16) because the selection task is executed in parallel with the reconfiguration process. The time penalty, i.e., the extra time imposed by a reconfigurable implementation, can be easily calculated as $rit/it$. Fig. 4 shows the time penalty as a function of $rt$ for the three benchmarks; a dashed line is placed at 274 µs, which represents the reconfiguration time for our target device, the Xilinx Virtex-IV XC4VLX25 FPGA. This time was estimated considering $f = 50$ MHz, $rf_{recp} = 100$ MHz, a 4-byte reconfiguration port, and a bitstream size calculated from information given in the datasheet. It can be observed that the time penalty varies according to the benchmark: 16.5%, 6.5%, and 10% for the Breast Cancer, Dermatology, and Tic Tac Toe benchmarks, respectively. This shows that the larger the associated computation time, the smaller the penalty will be. As stated before, the synthesized circuit indicates an FPGA area saving of 22.38% compared to a static solution without dynamic reconfiguration.
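The sketch below simply evaluates eqs. (16)-(18) for one iteration; the 110 KB partial bitstream is a hypothetical figure chosen only to land near the 274 µs quoted above, while the port parameters follow the values stated in the text.

```python
def reconfiguration_time(bitstream_bytes, port_bytes=4, port_hz=100e6):
    """Reconfiguration time of eq. (18)."""
    return bitstream_bytes / (port_bytes * port_hz)

def iteration_times(m, d, f_hz, rt_s):
    """Iteration time without (eq. 16) and with (eq. 17) dynamic reconfiguration."""
    it = (m * (9 * d + 18) + 17) / f_hz
    rit = (m * (9 * d + 13) + 17) / f_hz + 2 * rt_s
    return it, rit

# Hypothetical 110 KB partial bitstream, f = 50 MHz, Breast Cancer set (m = 569, d = 30).
rt = reconfiguration_time(110_000)
it, rit = iteration_times(569, 30, 50e6, rt)
print(f"rt = {rt * 1e6:.0f} us, it = {it * 1e3:.2f} ms, rit/it = {rit / it:.2f}")
```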

Fig. 4. Time penalty for $0 \le rt \le 502\ \mu s$.

6. EXPERIMENTAL RESULTS

To validate the architecture, a simulation using the dynamic circuit switching (DCS) technique [16] was carried out. This technique employs isolation switches between the reconfigurable modules and the static modules, allowing one reconfigurable module to be active while the other one is inactive; the technique also simulates the reconfiguration times in the interval. For that, a control unit must be developed to activate the modules at the right time and manage the reconfiguration delay. The control unit was described in VHDL, based on the reconfiguration schedule shown in Fig. 2. The state machine of the control unit is illustrated in Fig. 5.

The enable signals were used as the information for knowing when a task starts and, consequently, when another task has finished. States 1 and 4 define the behavior of the isolation switches: in the first state, the control unit activates the optimizer block and deactivates the kernel block after the reconfiguration delay (transition from state 1 to state 2), whereas in the fourth state the optimizer is deactivated and the kernel is activated after waiting for the reconfiguration delay. The first state also executes the selection task, state 3 executes the optimization task, and state 5 executes the update task.
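Since Fig. 5 is not reproduced here, the following behavioural sketch is only our guess at the transition structure suggested by the description above; the state names, signal names, and placement of the wait conditions are assumptions.

```python
from enum import Enum, auto

class State(Enum):
    S1_SELECT_LOAD_OPTIMIZER = auto()  # run selection; load the optimizer bitstream
    S2_SWITCH_TO_OPTIMIZER = auto()    # after the reconfiguration delay: optimizer on, kernel off
    S3_OPTIMIZE = auto()               # run the coefficient pair optimization
    S4_SWITCH_TO_KERNEL = auto()       # optimizer off; after the delay, kernel on
    S5_UPDATE = auto()                 # run the global data update with kernel + updater

def next_state(state, reconfig_done, selection_done, optimization_done, update_done):
    """Guessed transition structure of the DCS control unit (signal names are ours)."""
    if state is State.S1_SELECT_LOAD_OPTIMIZER and reconfig_done:
        return State.S2_SWITCH_TO_OPTIMIZER
    if state is State.S2_SWITCH_TO_OPTIMIZER and selection_done:
        return State.S3_OPTIMIZE
    if state is State.S3_OPTIMIZE and optimization_done:
        return State.S4_SWITCH_TO_KERNEL
    if state is State.S4_SWITCH_TO_KERNEL and reconfig_done:
        return State.S5_UPDATE
    if state is State.S5_UPDATE and update_done:
        return State.S1_SELECT_LOAD_OPTIMIZER  # start the next SMO iteration
    return state
```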

Table 3 shows the total training time for software, for hardware (DCS simulation), and the acceleration obtained with this architecture. It can be observed that, when compared to a purely software solution, the hardware solution is at least 12.53 times faster.



Fig. 5. Control unit's state machine.

Table 3. Total training time
            Breast Cancer   Dermatology   Tic-Tac-Toe
Software    10.44 s         13.10 s       109.2 s
Hardware    0.348 s         1.045 s       3.761 s
Accel.      30              12.53         29.03

7. CONCLUSION

Experimental results show that a 24-bit [6.18] representation is good enough for coding the input data, even for benchmarks with a great number of operations and iterations. The reconfigurable architecture allowed an area saving of 22.38% at the cost of an acceptable time penalty.

In order to improve the architecture and make it even more general, an external memory module may be added and a smart interface designed in order to support the dynamic access required by the SMO algorithm.

8. REFERENCES

[1] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, pp. 121-167, 1998.

[2] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.

[3] L. Bottou and C. J. Lin, "Support vector machine solvers," in Large Scale Kernel Machines, MIT Press, Cambridge, MA, pp. 1-27, 2007.

[4] J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Learning, 1st ed., B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA, USA: MIT Press, pp. 185-208, 1999.

[5] D. Anguita, A. Boni, and S. Ridella, "A digital architecture for support vector machines: theory, algorithm, and FPGA implementation," IEEE Transactions on Neural Networks, vol. 14, no. 5, pp. 993-1009, 2003.

[6] A. Ghio and S. Pischiutta, "A support vector machine based pedestrian recognition system on resource-limited hardware architectures," in Research in Microelectronics and Electronics, IEEE Conference, pp. 161-163, 2007.

[7] R. Pedersen and M. Schoeberl, "An embedded support vector machine," in Intelligent Solutions in Embedded Systems, International Workshop on, Vienna, Austria, pp. 1-11, 2006.

[8] X. Zhang and K. Ng, "A review of high-level synthesis for dynamically reconfigurable FPGAs," Microprocessors and Microsystems, vol. 24, pp. 199-211, 2000.

[9] D. Anguita, S. Pischiutta, S. Ridella, and D. Sterpi, "Feed-forward support vector machine without multipliers," IEEE Transactions on Neural Networks, vol. 17, pp. 1328-1331, 2006.

[10] R. A. Hernandez, M. Strum, Wang Jiang Chau, and J. A. Q. Gonzalez, "A VLSI implementation of the SVM training," in XV Workshop Iberchip, Buenos Aires, Argentina, pp. 204-209, 2009.

[11] A. Asuncion and D. J. Newman, "UCI Machine Learning Repository," available from: http://www.ics.uci.edu/~mlearn/MLRepository.html, last access: November 2009.

[12] J. A. Q. Gonzalez and J. C. Wang, "Circuit partitioning and HDL restructuring for behavioral simulation of dynamically reconfigurable circuit partitions: a case study," in 2nd International Conference on Electronic Design, Vera Cruz, Mexico, pp. 6-11, 2006.

[13] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to Platt's SMO algorithm for SVM classifier design," Neural Computation, vol. 13, pp. 637-649, 2001.

[14] D. Anguita and G. Bozza, "The effect of quantization on support vector machines with Gaussian kernel," in Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05), vol. 2, pp. 681-684, 2005.

[15] C. W. Hsu, C. C. Chang, and C. J. Lin, "A practical guide to support vector classification," Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2003.

[16] P. Lysaght and J. Stockwood, "A simulation tool for dynamically reconfigurable field programmable gate arrays," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, no. 3, pp. 381-390, 1996.
