
Scalable Parallel Architecture for Singular Value Decomposition of Large Matrices

Unai Martinez-Corral∗, Koldo Basterretxea†, and Raul Finker‡
∗‡Grupo de Diseño en Electrónica Digital (GDED), †Dept. Electronics Technology

University of the Basque Country (UPV/EHU), Bilbao, Basque Country, Spain

∗[email protected], †[email protected], ‡[email protected]

Abstract—Singular Value Decomposition (SVD) is a key linear algebraic operation in many scientific and engineering applications, many of them involving high dimensionality datasets and real-time response. In this paper we describe a scalable parallel processing architecture for accelerating the SVD of large m × n matrices. Based on a linear array of simple processing units (PUs), the proposed architecture follows a double data-flow paradigm (FIFO memories and a shared bus) for optimizing the time spent in data transfers. The PUs, which perform elemental column-pair evaluations and rotations, have been designed for an efficient utilization of available FPGA resources and to achieve maximum algorithm speed-ups. The architecture is fully scalable from a two-PU scheme to an arrangement with as many as n/2 PUs. This allows for a trade-off between occupied area and processing acceleration in the final implementation, and permits the SVD processor to be implemented both on low-cost and high-end FPGAs. The system has been prototyped on Spartan-6 and Kintex-7 devices for performance comparison.

Index Terms—Singular Value Decomposition, scalable architecture, adaptive threshold, CORDIC, co-processor, FPGA

I. INTRODUCTION AND RELATED WORK

The Singular Value Decomposition (SVD) of an m × n matrix A is defined by:

U_{m×m} Σ_{diag(n)} V^T_{n×n} = A_{m×n}    (1)

where the U and V matrices are orthogonal, and Σ = diag(σ_1, σ_2, ..., σ_n), where σ_1, σ_2, ..., σ_n are the singular values of A. Among the existing factorization algorithms for extracting quantitative information from (very) high dimensionality datasets, SVD is, on balance, one of the most accurate and numerically stable methods. SVD is commonly used in the solution of unconstrained linear least squares problems, matrix rank estimation, and canonical correlation analysis, and it is a key linear operation in many scientific and engineering applications, such as signal and image processing, data mining, and information retrieval. It is also a basic stage in more complex algorithms, such as Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA). Fast processing of the SVD algorithm to meet real-time requirements usually demands the use of a parallel and direct-mapped hardware realization. This is especially so when available computational resources are limited, as in embedded systems. Our main objective in this work has been the design of a scalable parallel SVD architecture to solve medium to large problems in embedded applications, which can fit in both low-cost and high-end FPGAs.

This work was supported by the Spanish Ministry of Economy and Competitiveness, and European FEDER funds (grant TEC2010-15388), and by the Basque Government (grants IT733-13, S-PC12UN016, and S-PC13UN034).
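Before reviewing related work, note that the definition in (1) can be checked numerically in a few lines. The following NumPy sketch (purely illustrative, unrelated to the proposed hardware) verifies the orthogonality of U and V and the reconstruction of a small random A; the sizes are arbitrary:

import numpy as np

m, n = 6, 4                     # sizes chosen only for illustration
A = np.random.randn(m, n)

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # A = U @ Sigma @ Vt
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)      # s holds sigma_1 >= ... >= sigma_n >= 0

assert np.allclose(U @ Sigma @ Vt, A)       # reconstruction, as in (1)
assert np.allclose(U.T @ U, np.eye(m))      # U orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(n))    # V orthogonal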

The striking activity of emerging computer scientists in the 70s and 80s led to the design of efficient techniques to achieve parallel computation schemes, among which systolic arrays showed great potential for improving the performance of the SVD. Since Jacobi-like algorithms are based on the orthogonalization of pairs of rows/columns, data sharing is nearly non-conflicting, thus providing an opportunity for parallel processing. Brent, Luk and Van Loan's [1] idea of an expandable square systolic array of simple 2×2 Processing Units (PUs), named BLV, together with Sibul and Fogelsanger's [2] proposal of using the COordinate Rotation DIgital Computer (CORDIC) for SVD, was merged by Cavallaro and Luk [3], and a milestone was set. In fact, most of the subsequent parallel architectures for SVD proposed in the literature are based on this scheme. A linear array of n/2 processors was also proposed by the same authors, and Schreiber [4] studied this scheme in depth for undersized linear architectures, i.e. pu < n/2.

Since each PU tackles two rows and two columns in a two-sided Jacobi algorithm, managing so much data can be a complex task when targeting embedded systems and large matrices. Besides that, any parallelisation attempt implies at least two PUs sharing eight common elements. Given an m × n non-symmetric matrix, the one-sided Jacobi variant (the Hestenes-Jacobi algorithm) avoids data-dependence issues: an orthogonal matrix W is generated by applying a product of plane rotations, Q_ij, to matrix A, and exactly the same rotations are applied to I_{n×n}

in order to get V. By applying the formula of Rutishauser [5], shown in (2), in each rotation the angle θ is chosen in such a way that the resulting column pairs are orthogonal.

θ = (1/2) · atan( 2 · (W_i · W_j) / (||W_j||_2^2 − ||W_i||_2^2) )    (2)

The decomposition is then completed by extracting the singular values of A, i.e. Σ in (1), from W, as these equal the l_2 norm or Euclidean length of its columns: σ_i = ||W(:, i)||_2.

Finally, U is obtained by normalizing each column by its singular value, U(:, i) = W(:, i)/σ_i. Linear arrays to compute the one-sided Jacobi have also been reported (see [6], [7], for instance).
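For reference, the whole one-sided procedure described above condenses into a short software model. The sketch below is a plain cyclic Hestenes-Jacobi implementation of (2) with the Σ and U extraction steps; it is our illustrative reading, not the modified algorithm of Section II, and the tolerance and sweep limit are arbitrary choices:

import numpy as np

def hestenes_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """Reference one-sided Jacobi SVD: A = U * diag(sigma) * V^T."""
    m, n = A.shape
    W, V = A.astype(float).copy(), np.eye(n)
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):                 # cyclic column-pair sweep
            for j in range(i + 1, n):
                p = W[:, i] @ W[:, j]          # W_i * W_j
                di = W[:, i] @ W[:, i]         # ||W_i||_2^2
                dj = W[:, j] @ W[:, j]         # ||W_j||_2^2
                if p * p <= tol * tol * di * dj:
                    continue                   # pair already orthogonal
                if dj == di:                   # limiting case of eq. (2)
                    theta = np.pi / 4 if p > 0 else -np.pi / 4
                else:
                    theta = 0.5 * np.arctan(2.0 * p / (dj - di))  # eq. (2)
                c, s = np.cos(theta), np.sin(theta)
                R = np.array([[c, s], [-s, c]])
                W[:, [i, j]] = W[:, [i, j]] @ R   # Givens rotation on W...
                V[:, [i, j]] = V[:, [i, j]] @ R   # ...and the same on V
                rotated = True
        if not rotated:
            break
    sigma = np.linalg.norm(W, axis=0)          # sigma_i = ||W(:, i)||_2
    U = W / np.where(sigma > 0.0, sigma, 1.0)  # U(:, i) = W(:, i) / sigma_i
    return U, sigma, V

For a full-rank A, U @ np.diag(sigma) @ V.T reconstructs A up to the chosen tolerance.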

Since FPGA-based processor implementation allows for the use of advanced design methods such as pipelining, parallelization and HW/SW co-design to achieve higher processing performance per unit area and power consumption, the design of specifically tailored SVD processing architectures has been a common approach in contemporary literature. In [8] the broadcasting of BLV is improved, and it is claimed to be the first FPGA implementation of an SVD unit. In [9] a module compensation relaxation scheme is proposed to achieve shorter computation times at the expense of prolonging the time needed to converge. In [10] direct broadcasting to all off-diagonal PUs is implemented. While the mentioned proposals are based on the two-sided Jacobi, in [11] Hestenes-Jacobi is used in a system made up of a PC and an FPGA board plugged into a PCI slot, where dynamic reconfiguration is used to implement two configurations in a 4 × 4 processor scheme. In [12] the same PU design is synthesised along with the entire system in an FPGA, relying on CORDIC instead of the PC and avoiding dynamic reconfiguration. In [13], a 4 × 4 complex SVD processor is proposed, which is composed of a vector product computation unit and eight customized CORDIC cores in a 2 × 4 scheme with a shared bus for each line.

In brief, since many published architectures attempt to theoretically reach full parallelization, mostly small to medium-size examples are described (matrix sizes up to 8×8, 40×40, 127×32 and 150×150) and, as far as the authors are aware, no practical optimal-performance designs have been published considering the parallel computation of large matrices in FPGAs. Moreover, all the mentioned works focus on the orthogonalization of a given matrix A, and neither data-transfer issues nor the computation of the auxiliary matrix V are considered when analysing time and resource requirements. When addressed, this has usually been solved by doubling the resource usage [1], [13]. However, since effective computation time has been heavily reduced in the past decades by making the best of CORDIC and DSP blocks, the relative impact of data-transfer times has become critical. In this sense, mimicking computer-network schemes with limited communication resources leads to considering the non-simultaneity of data transfers and computation. In consequence, a means for improving the SVD processing speed is customizing the Jacobi-like algorithms to achieve data-flow schemes that effectively map to reconfigurable devices and make the most of the available HW resources.

In this work, we present a scalable parallel processing architecture for a modified Hestenes-Jacobi SVD based on a linear array of PUs with a double data-flow paradigm (FIFO memories and a shared bus) for efficient data transfer. The design has been specifically tailored to the factorization of large matrices, but can be adapted to process square and non-square matrices of any size. Some internal arithmetic operations have been carefully designed, using both CORDIC as in [3] and DSPs in MACC mode as in [11], to achieve a computing scheme that makes the most of the embedded resources in modern FPGAs. At the same time, data dependency has been thoroughly analysed to achieve maximum overlapping of stages in each PU.

II. A MODIFIED HESTENES-JACOBI ALGORITHM

Being iterative, the Hestenes-Jacobi algorithm goes through all column pairs at least once in a minimum-length sequence named a 'sweep'. A threshold value is needed to decide whether the orthogonalization process is finished. By setting this threshold, the user can obtain a trade-off between the desired accuracy and the computing effort (processing time). A modification of the Hestenes-Jacobi algorithm named Active Adaptive Rutishauser (AARH) was proposed by the authors in [14]. The modified algorithm relies on the evaluation of the shifted minimum norm and the angle of Rutishauser (θ). On top of that, columns are swapped before orthogonalization when ||W_i||_2^2 < ||W_j||_2^2. This proposal outperforms previously published approaches in terms of the total number of sweeps/rotations required to achieve the same accuracy.

Considering the restrictions imposed by memory bandwidth when dealing with big matrices in embedded applications, moving data only when needed and performing computations in parallel with column transfers can help accelerate the SVD. Cyclic sequences seem to be the most suitable when seeking minimum data transfer, since a column i may be kept in a PU while computing n − i pairs, that is, from j = i + 1 to j = n − 1. Besides that, when any ||W_i||_2 or ||W_j||_2 is null, no orthogonalization is required, and due to the implicit sorting in AARH, the rank of the matrix can be estimated online, rather than waiting until convergence is met. In consequence, the computation can be notably shortened after each sweep. The flowchart of the modified algorithm is shown in Figure 1, where ns tracks the smallest index with a null norm during a

Fig. 1. Active Adaptive Rutishauser (AARH).

sweep, and nl holds the rank at the end of it. At the beginning of the next sweep a problem of size m × nl is computed. If a full-rank matrix is processed, at the end ns = nl = n − 1.
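Our reading of that control flow, with the rank-tracking variables ns and nl, is sketched below; the flowchart details (and the exact index conventions, e.g. whether nl counts columns or the last index) are abstracted, and the threshold test on θ is simplified:

import numpy as np

def aarh_sweep_loop(A, threshold=1e-10, max_sweeps=50):
    """Sketch of the AARH sweep loop (our reading of Fig. 1): cyclic sweeps
    with column swapping, null-norm skipping and online rank estimation."""
    m, n = A.shape
    W, V = A.astype(float).copy(), np.eye(n)
    nl = n                                     # active problem width
    for _ in range(max_sweeps):
        rotated, ns = False, nl                # ns: smallest null-norm index
        for i in range(nl - 1):
            for j in range(i + 1, nl):
                di, dj = W[:, i] @ W[:, i], W[:, j] @ W[:, j]
                if di == 0.0 or dj == 0.0:     # null column: skip, note index
                    ns = min(ns, i if di == 0.0 else j)
                    continue
                if di < dj:                    # swap so that ||W_i|| >= ||W_j||
                    W[:, [i, j]] = W[:, [j, i]]
                    V[:, [i, j]] = V[:, [j, i]]
                    di, dj = dj, di
                p = W[:, i] @ W[:, j]
                if p == 0.0:
                    continue                   # already orthogonal
                if dj == di:
                    theta = np.pi / 4 if p > 0 else -np.pi / 4
                else:
                    theta = 0.5 * np.arctan(2.0 * p / (dj - di))   # eq. (2)
                if abs(theta) > threshold:     # adaptive test on the angle
                    c, s = np.cos(theta), np.sin(theta)
                    R = np.array([[c, s], [-s, c]])
                    W[:, [i, j]] = W[:, [i, j]] @ R
                    V[:, [i, j]] = V[:, [i, j]] @ R
                    rotated = True
        nl = ns                                # next sweep: an m x nl problem
        if not rotated:
            break
    return W, V, nl                            # nl estimates rank(A)

Once the loop exits, the factorization stage (Σ and U extraction) operates on the first nl columns only.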

III. A DATA-DRIVEN COMPUTING SCHEME

The proposed architecture for the parallelization of the algorithm described in the previous section is based on the interconnection of pu basic processing units, or PUs. As a result of the adaptability of AARH, not all column-pair computations take the same time, so tight schedules for purely parallel-working PUs would not be optimal. We designed a 'loose' linear systolic array architecture by fitting one FIFO between consecutive stages to enable maximum PU workload (see Figure 2, left).

As soon as A is loaded into the main memory, each PU performs a column-pair evaluation according to the defined threshold and, when orthogonalizations are required, Givens rotations on the columns of A (renamed to W) and V are performed. Focusing on the transfer load, each column W_i or W_j, where 0 ≤ i ≤ nl − 2 and 1 ≤ j ≤ nl − 1, has to be computed (evaluated and, if needed, rotated) nl − 1 times at each sweep. If keeping a cyclic sequence, a PU can sequentially handle all the computations corresponding to a column W_i, thus requiring only two transfers from/to the main memory (pulling first and pushing if modified) at each sweep. On the other hand, W_j columns have to be transferred twice for each W_i. Consequently, the total amount of transfers required per sweep is γ_0^{nl} = nl^2 + nl − 2.
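This count is easy to verify by direct enumeration (a throwaway check, not part of the design):

def gamma_0(nl):
    """Count per-sweep main-memory transfers of the plain cyclic sequence."""
    transfers = 0
    for i in range(nl - 1):
        transfers += 2                 # W_i: one pull and one push per sweep
        transfers += 2 * (nl - 1 - i)  # each W_j is pulled and pushed per W_i
    return transfers

for nl in (8, 50, 500):
    assert gamma_0(nl) == nl**2 + nl - 2   # closed form from the text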

A. Double data-flow paradigm

Since minimum convergence time is achieved when (i, j) pairs are computed before any (i + k_i, j + k_j), it is sensible to assign sequential i indexes to the PUs placed in a linear scheme and make the W_j columns cross the array at most β = 1 + ⌊nl/pu⌋ times in a sweep. This is shown in Figure 2 (right) for pu = 3 and nl = 8. Since each W_j column is sent directly from the PU computing column pair (i, j) to the PU waiting to start (i + 1, j), the number of required transfers from/to the main memory is reduced to γ_F^{nl} = β · (pu · (β − 1) + 2 · (nl mod pu)), which saves up to about 65% of the required transfer time: the remaining fractions are γ_F^8/γ_0^8 = 42.85%, γ_F^50/γ_0^50 = 34.69% and γ_F^500/γ_0^500 = 33.5%. This is achieved with no penalty in data dependence, i.e. convergence, allowing for the simplest architecture with a unique data-flow: from the reading port of the main memory, through the array of PUs and back to the writing port (see Figure 2, left).
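These figures can be reproduced with a few lines; we assume pu = 3 for all three ratios, matching the example above (only the nl = 8 case is stated explicitly in the text):

def gamma_F(nl, pu):
    """Per-sweep main-memory transfers with direct PU-to-PU forwarding."""
    beta = 1 + nl // pu               # times a W_j column crosses the array
    return beta * (pu * (beta - 1) + 2 * (nl % pu))

def gamma_0(nl):
    return nl**2 + nl - 2

for nl in (8, 50, 500):
    ratio = 100.0 * gamma_F(nl, pu=3) / gamma_0(nl)
    print(f"nl={nl}: {ratio:.2f}% of the transfers remain")
# -> 42.86%, 34.69%, 33.47% (the text rounds these to 42.85%, 34.69%, 33.5%)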

Furthermore, since the columns V_i and V_j only have to be computed if the corresponding W_i and W_j have been rotated, a secondary data-flow is adopted to move data without crossing the whole array. This has been accomplished by implementing two unidirectional shared buses to produce a disjoint full-duplex communication channel, which involves a multiplexer-based crossbar switch and an arbiter for managing the connections. This secondary data-flow is used to push W_i columns into the main memory once all the corresponding computations are done, as well as to pull/push V_i columns (which are transferred just once per sweep) and V_j columns (each time needed). The maximum number of data-transfer cycles required per sweep for the orthogonalization of W and V is n · γ_0^{nl} + m · γ_F^{nl}, although it can be expected to be less since many transfers may be overlapped (e.g. if PU_p is requesting access to push W_j and PU_{p+1} is requesting to pull it, the main memory may be bypassed). If the resources in the target FPGA suffice, and in the case that the main memory is split into different blocks, multichannel (ch > 1) buses can be implemented to reduce potential access collisions. Ideally, no collision will exist if n memory blocks (mem) are used, holding m + n pieces each (a

Fig. 2. Proposed processing architecture for pu = 3, three blocks of main memory and a bus with three channels (left); Processing Unit (PU) core design (center); and column-pair schedule for n = 8 (right).

column of W and its corresponding column of V), and pu + 1 channels are implemented.
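As a back-of-the-envelope illustration of this bound (an upper bound only, since overlapped and bypassed transfers reduce the real figure), consider a first sweep (nl = n) of the Spartan-6 configuration reported later in Table I:

def transfer_cycle_bound(m, n, nl, pu):
    """Worst-case data-transfer cycles per sweep: n*gamma_0 + m*gamma_F."""
    gamma_0 = nl**2 + nl - 2
    beta = 1 + nl // pu
    gamma_F = beta * (pu * (beta - 1) + 2 * (nl % pu))
    return n * gamma_0 + m * gamma_F

cycles = transfer_cycle_bound(m=300, n=100, nl=100, pu=9)
print(cycles, cycles / 55e6)   # ~1.37e6 cycles, i.e. ~25 ms at 55 MHz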

B. The Processing Unit (PU)

As shown in Figure 1, the Hestenes-Jacobi algorithm can

be analysed in two separate non-overlapping stages: 1) orthogonalization and 2) factorization. During orthogonalization, PUs are required to check whether ||W_i||_2^2 < ||W_j||_2^2 and to evaluate each pair of columns according to AARH, which involves computing (2). If the columns in a pair are not orthogonal, a Givens rotation has to be performed. The arithmetic operations to be performed by the PUs have been carefully chosen to achieve a hardware-friendly scheme. Fixed-point arithmetic has been used, all the fixed shifts have been hardwired to avoid barrel shifters, and unrolled pipelined realizations have been selected to improve throughput. Embedded DSP blocks have been directly used in MACC mode to compute the squared Euclidean norms and vector multiplications (||W_i||_2^2, ||W_j||_2^2 and W_i · W_j). CORDIC is used in vectoring/evaluation/accumulation mode to compute the angle of Rutishauser (2), which is directly compared against the threshold as described in [14]. If rotation is required, the same CORDIC core is used in rotating mode to effectively perform the Givens rotations; thus all the computations have been reduced to add/subs or fixed shifts, except for the computation of the squared Euclidean norms and vector products.
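For readers unfamiliar with the two CORDIC modes mentioned here, the sketch below shows the underlying shift-and-add iteration in both modes; it is a generic floating-point textbook model, not the PU's fixed-point core, and the iteration count is an arbitrary choice:

import math

N_ITERS = 16
ATAN_LUT = [math.atan(2.0 ** -k) for k in range(N_ITERS)]    # hardwired angles
GAIN = math.prod(math.sqrt(1 + 2.0 ** (-2 * k)) for k in range(N_ITERS))

def cordic_vectoring(x, y):
    """Vectoring mode: drive y to 0; the accumulated z is atan(y/x), for x > 0.
    Only add/subs and fixed shifts per iteration, as in the evaluation stage."""
    z = 0.0
    for k in range(N_ITERS):
        d = -1.0 if y > 0 else 1.0
        x, y, z = x - d * y * 2.0 ** -k, y + d * x * 2.0 ** -k, z - d * ATAN_LUT[k]
    return z                     # (x now holds GAIN * sqrt(x0^2 + y0^2))

def cordic_rotation(x, y, theta):
    """Rotation mode: rotate (x, y) by theta, e.g. one element pair of a
    Givens rotation; the constant gain is compensated at the end."""
    z = theta
    for k in range(N_ITERS):
        d = 1.0 if z > 0 else -1.0
        x, y, z = x - d * y * 2.0 ** -k, y + d * x * 2.0 ** -k, z - d * ATAN_LUT[k]
    return x / GAIN, y / GAIN

For instance, cordic_rotation(1.0, 0.0, math.pi / 6) returns approximately (0.866, 0.500), i.e. (cos 30°, sin 30°).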

As exposed previously, the SVD may be accelerated by computing in parallel with data transfers. Since fine-grained pipelining has nearly no hardware cost in modern FPGAs, transfers can be executed at the maximum rate of one piece of data per clock cycle, and high throughput may be achieved when managing large datasets. Figure 2 (center) shows a simplified scheme of the core PU design, which is composed of three modules (EVALUATION, CORDIC and CACHE), and two cooperative FSMs that manage an internal multiplexer-based crossbar switch to share data between the modules.

C. Results

The proposed architecture has been implemented in both

Spartan-6 and Kintex-7 devices by Xilinx. Resource utilization and system specifications for the orthogonalization, i.e. to obtain W and V, are given in Table I. Processing times have been measured considering A_{m×n} and I_{n×n} are already loaded into the main memory.

IV. CONCLUSION AND FUTURE WORK

A parallel processing scheme based on a scalable linear array of processing units has been developed, which adopts a double data-flow paradigm to optimize data transfers by avoiding unnecessary transmissions and computations. Speed-up factors of up to 600× compared to a softcore microprocessor-based solution (MicroBlaze) have been obtained, validating the fixed-step simulation-based speed-up and excellent scalability figures previously reported in [14]. We have not found any previously proposed architectures to compare with in terms of resource utilization figures, either because small matrices are used or because insufficient details are reported.

TABLE I
SPARTAN-6 AND KINTEX-7 RESOURCE UTILIZATION AND ELAPSED TIME

Model                    | xc6slx45-3fgg484 | xc7k160t-3fbg484
DSPs / RAMs / Area (%)   | 93 / 62 / 56     | 40 / 44 / 61
Freq.                    | 55 MHz           | 90 MHz
Matrix size / pu         | 300x100 / 9      | 500x250 / 40
Processing time          | 60 ms            | 150 ms
Speed-up (MicroBlaze)    | 112x             | 600x
Precision                | 18 bits          | 18 bits

Regarding future work, redundant arithmetic-based CORDIC designs, the use of square-root- and division-free Givens rotations, and the design of ad-hoc estimators to compute Euclidean norms and the atan function are some of the upgrade paths we would like to explore. For improved area efficiency, two approaches can be explored: customizing the internal connections for each function, and using dynamic partial reconfiguration to switch from orthogonalization to factorization once convergence is met.

REFERENCES

[1] R. P. Brent, F. T. Luk, and C. Van Loan, "Computation of the Singular Value Decomposition Using Mesh-Connected Processors," Journal of VLSI and Computer Systems, vol. 1, no. 3, pp. 242-270, 1985.
[2] L. H. Sibul and A. L. Fogelsanger, "Application of Coordinate Rotation Algorithm to Singular Value Decomposition," in IEEE International Symposium on Circuits and Systems, 1984, pp. 821-824.
[3] J. R. Cavallaro and F. T. Luk, "Architectures for a CORDIC SVD Processor," in 30th Annual Technical Symposium, International Society for Optics and Photonics, 1986, pp. 45-53.
[4] R. Schreiber, "Solving eigenvalue and singular value problems on an undersized systolic array," SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 2, pp. 441-451, April 1986.
[5] H. Rutishauser, "The Jacobi method for real symmetric matrices," Numerische Mathematik, vol. 9, no. 1, pp. 1-10, 1966.
[6] F. T. Luk, "Computing the Singular Value Decomposition on the Illiac IV," Cornell University, Tech. Rep. 415, 1980.
[7] A. H. Sameh, "Solving the linear least squares problem on a linear array of processors," in Algorithmically-Specialized Parallel Computers, L. Snyder et al., Eds. Academic Press, 1985, pp. 191-200.
[8] A. Ahmedsaid, A. Amira, and A. Bouridane, "Improved SVD systolic array and implementation on FPGA," in Proc. IEEE International Conference on Field-Programmable Technology (FPT), 2003, pp. 35-42.
[9] Z. Liu, K. Dickson, and J. V. McCanny, "A floating-point CORDIC based SVD processor," in Proc. IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP2003), 2003, pp. 194-203.
[10] W. Ma, M. E. Kaye, D. M. Luke, and R. Doraiswami, "An FPGA-Based Singular Value Decomposition Processor," in Proc. Canadian Conference on Electrical and Computer Engineering (CCECE '06), 2006, pp. 1047-1050.
[11] C. Bobda, K. Danne, and A. Linarth, "Efficient Implementation of the Singular Value Decomposition on a Reconfigurable System," in Lecture Notes in Computer Science, vol. 2778. Springer, 2003, pp. 1123-1126.
[12] L. M. Ledesma-Carrillo, E. Cabal-Yepez, R. d. J. Romero-Troncoso, A. Garcia-Perez, R. A. Osornio-Rios, and T. D. Carozzi, "Reconfigurable FPGA-Based Unit for Singular Value Decomposition of Large m x n Matrices," in Proc. International Conference on Reconfigurable Computing and FPGAs (ReConFig), 2011, pp. 345-350.
[13] D. Milford and M. Sandell, "Singular value decomposition using an array of CORDIC processors," Signal Processing, 2014, dx.doi.org/10.1016/j.sigpro.2014.03.022.
[14] I. Bildosola, U. Martinez-Corral, and K. Basterretxea, "Adaptive Scalable SVD Unit for Fast Processing of Large LSE Problems," in Proc. IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP2014), 2014.