

FPGA-based Matrix Inversion Using an Iterative Chebyshev-type Method in the Context of Compressed Sensing

Hector Daniel Rico-Aniles, Juan Manuel Ramirez-Cortes, Jose de Jesus Rangel-Magdaleno
INAOE, Tonantzintla, México.
{h.d.rico, jram, jrangel}@inaoep.mx

Abstract—Compressed sensing is a recently proposed technique aiming to acquire a signal with a sparse or compressible representation in some domain, using a number of samples under the limit established by the Nyquist theorem. The challenge is to recover the sensed signal by solving an underdetermined linear system. Several techniques can be used for that purpose, such as l1 minimization, Greedy algorithms, and combinatorial algorithms. Greedy algorithms have been found to be more suitable for hardware solutions; however, they rely on efficient matrix inversion techniques in order to solve the underdetermined linear systems involved. In this paper, a novel and efficient FPGA architecture to find a matrix inversion is presented. The architecture is based on an iterative Chebyshev-type method, and it was developed on a Xilinx Spartan-6 XC6SLX45 FPGA. Preliminary results show high accuracy, with an error of 0.0001 on average.

    I. INTRODUCTION

In the middle of a digital revolution, a plethora of sensing systems have been developed dealing with the compromise between data size, resolution, and quality. Traditionally, acquisition systems follow the mathematical analysis established by Nyquist and Shannon [1][2] in the so-called sampling theorem. Derived from their work, it is widely known that for the reconstruction of a signal with a bandwidth F, the sampling frequency Fs must be at least two times the signal bandwidth, Fs ≥ 2F.

When a signal is acquired and digitized, considering the Nyquist-Shannon theorem, dealing with large data sets becomes a challenge. Transform Coding is a popular technique that aims to find a base or frame where the signal has a sparse or compressible representation [3]. A signal is k-sparse if it can be represented with k elements, k << N, and is compressible if the signal can be well approximated with a few elements.

In the past few years a new sensing technology has appeared. Donoho, in 2006 [4], coined the term Compressed Sensing for this technology. The field of Compressed Sensing has grown from the work of Candès, Romberg, Tao, and Donoho, who showed that a finite-dimensional signal having a sparse or compressible representation can be reconstructed from a small set of linear and non-adaptive measurements [4][5].

Compressed Sensing acquires a compressed signal representation without going through the intermediate stage of acquiring N samples. The measurements are acquired by computing M < N inner products between the signal of interest x and a collection {φj}, j = 1, ..., M, as in yj = ⟨x, φj⟩. The measurement vectors φj^T are arranged in an M × N matrix Φ [6].

Having fewer equations than unknowns, the goal of Compressed Sensing is to solve an underdetermined linear system to find the sensed signal x. This can be expressed as in (1).

y = Φx = ΦΨs  (1)

Where:
y measurements,
Φ M × N sensing matrix,
Ψ sparse base or frame,
x sensed signal,
s sparse signal representation.
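
To make the measurement model of (1) concrete, the following is a minimal sketch under the common assumption of a random Gaussian sensing matrix (the paper does not fix a particular Φ); the sizes N, M, and k are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k = 256, 64, 5  # signal length, measurements (M < N), sparsity

# A k-sparse signal x: only k of its N entries are nonzero.
x = np.zeros(N)
x[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)

# Random M x N sensing matrix Phi; each measurement is the inner
# product y_j = <x, phi_j>, so all M measurements at once are y = Phi @ x.
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
y = Phi @ x

print(y.shape)  # (64,): far fewer equations than the N = 256 unknowns
```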

While a compressed signal has been acquired and the data sets to deal with are smaller, a special effort must be made on the signal reconstruction. Unlike the reconstruction of a signal acquired considering the Nyquist-Shannon sampling theorem, compressed sensed signals cannot be reconstructed with a simple interpolation of the measurements.

Hence, several compressed sensing recovery algorithms have been developed, such as l1 minimization, Greedy algorithms, and combinatorial algorithms [7].

Greedy algorithms are the most commonly used for implementation in hardware. These algorithms are a set of methods that iteratively construct an approximation of the sensed signal: starting from a zero vector, they estimate the set of nonzero elements, adding new elements every iteration. Such algorithms often converge quickly and can be applied to large data sets [7][8].

Greedy algorithms require solving a matrix inversion. Two methods are frequently used to find a matrix inversion in hardware: the CORDIC algorithm [9][10] and QR decomposition (QRD) [11][12][13].

In this work, an FPGA implementation of an iterative Chebyshev-type method, which solves the matrix inversion problem in the context of a compressed sensing application, is presented.

A. Paper overview

This paper has been divided into several parts. In Section II, the Chebyshev-type method is presented. Then we propose an FPGA architecture to implement this method in Section III. Sections IV and V present the results of the developed FPGA architecture and the conclusions of the work.

    II. CHEBYSHEV-TYPE METHOD FOR MATRIX INVERSION

In 2003, Amat et al. [15] introduced an iterative Chebyshev-type method of third order, or cubic convergence, to find the inverse of a given matrix. A few years later, in 2011, Hou-Biao Li et al. [14] compared the method proposed by Amat with the iterative Newton method and demonstrated that the Chebyshev-type method has less computational complexity and needs a smaller number of iterations to find the solution. In that work they suggested a preconditioning technique for the initial guess of the method to ensure that the method will converge.

    A. Mathematical Formalization

The mathematical formalization of the Chebyshev-type method is given by (2).

N_{m+1} = N_m (3I − A N_m (3I − A N_m))  (2)

Where:
N_{m+1} next inverse approximation,
N_m previous inverse approximation,
I identity matrix,
A matrix to be inverted.
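
As an illustration of the update rule, here is a minimal numpy sketch of a single step of (2); the function and argument names are ours. The residual satisfies I − A N_{m+1} = (I − A N_m)^3, which is the cubic convergence mentioned above.

```python
import numpy as np

def chebyshev_step(A: np.ndarray, N: np.ndarray) -> np.ndarray:
    """One step of (2): N_{m+1} = N_m (3I - A N_m (3I - A N_m))."""
    I = np.eye(A.shape[0])
    AN = A @ N
    return N @ (3 * I - AN @ (3 * I - AN))
```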

    B. Preconditioning

In order to find the solution to the inverse matrix problem through the Chebyshev-type method, it is important to choose a suitable initial guess; otherwise, the method diverges.

With equation (3), a suitable initial guess can be computed to ensure the method's convergence.

N_0 = A^T / (||A||_1 ||A||_∞)  (3)

Where:
N_0 initial guess,
A matrix to be inverted,
A^T transpose of A,
||A||_1 maximum value of the sum of the absolute values of the elements in each column (4),
||A||_∞ maximum value of the sum of the absolute values of the elements in each row (5).

||A||_1 = max_j Σ_{i=1}^{n} |a_{ij}|  (4)

||A||_∞ = max_i Σ_{j=1}^{n} |a_{ij}|  (5)
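
As a software sketch of this preconditioning step, numpy's matrix norms compute (4) and (5) directly: ord=1 is the maximum absolute column sum and ord=inf the maximum absolute row sum. The function name is ours.

```python
import numpy as np

def initial_guess(A: np.ndarray) -> np.ndarray:
    """Initial guess (3): N_0 = A^T / (||A||_1 ||A||_inf)."""
    norm_1 = np.linalg.norm(A, ord=1)         # (4): max absolute column sum
    norm_inf = np.linalg.norm(A, ord=np.inf)  # (5): max absolute row sum
    return A.T / (norm_1 * norm_inf)
```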

    III. FPGA IMPLEMENTATION

The implementation of the Chebyshev-type algorithm for the inversion of a matrix is based on the mathematical formulation proposed in [15][14] and described in the previous section. Starting from that, an algorithm and an FPGA architecture are proposed.

    A. Algorithm

The algorithm has been divided, as shown in Table I, into three stages: the preconditioning stage, based on equation (3); the iterative stage, based on equation (2); and the verification stage, based on the premise of (6).

A A^{-1} = I  (6)

In the preconditioning stage, matrix A is transposed and ||A||_1 and ||A||_∞ are calculated using equations (4) and (5). The output of this stage is the initial guess, N_0, of the matrix inversion, which is saved into a memory location.

The iterative stage has been divided into its simplest operations, such as matrix multiplication and subtraction. Every step at this stage is saved into an embedded RAM memory.

The verification stage performs a multiplication between the matrix A and the previously found inverse matrix approximation N_{m+1} and verifies whether the result meets the condition established in (6). Otherwise, it copies the N_{m+1} approximation to the N_m memory address and sends a signal to the control block to start another iteration.

    B. System Structure

The architecture developed has been divided into blocks, each of which carries out a specific task. These blocks are: preconditioner, core, verifier, multiplexer, control, and storage, and they are described in the next subsections.

Figure 1 graphically depicts the system structure and its constituent blocks.

    Fig. 1. System structure.

TABLE I. ALGORITHM

Input: Matrix A, dimension of A.

Preconditioning:
    ||A||_1, ||A||_∞
    Transpose of A
    N_0 = A^T / (||A||_1 ||A||_∞)

Iterative:
    A * N_m
    3I − A N_m
    A N_m (3I − A N_m)
    3I − A N_m (3I − A N_m)
    N_{m+1} = N_m (3I − A N_m (3I − A N_m))

Verification:
    A * N_{m+1}
    if (A * N_{m+1}) ≈ I
        finish algorithm
    else
        N_m = N_{m+1}
        go to iterative stage
    end

Output: N_{m+1} as the inverted matrix A^{-1}.
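
For reference, here is a minimal software sketch of the complete flow in Table I in Python/numpy; the tolerance `tol` and iteration cap `max_iter` are illustrative choices, not parameters taken from the paper.

```python
import numpy as np

def chebyshev_inverse(A: np.ndarray, tol: float = 1e-4, max_iter: int = 100) -> np.ndarray:
    n = A.shape[0]
    I = np.eye(n)
    # Preconditioning stage (3).
    N = A.T / (np.linalg.norm(A, ord=1) * np.linalg.norm(A, ord=np.inf))
    for _ in range(max_iter):
        # Iterative stage (2), broken into the same sub-steps as Table I.
        AN = A @ N                 # A * N_m
        t1 = 3 * I - AN            # 3I - A N_m
        t2 = AN @ t1               # A N_m (3I - A N_m)
        t3 = 3 * I - t2            # 3I - A N_m (3I - A N_m)
        N = N @ t3                 # N_{m+1}
        # Verification stage (6): stop when A N_{m+1} is close to I.
        if np.max(np.abs(A @ N - I)) < tol:
            break
    return N

# Usage: invert a well-conditioned 5x5 test matrix and check the residual.
A = np.random.rand(5, 5) + 5 * np.eye(5)
N = chebyshev_inverse(A)
print(np.max(np.abs(A @ N - np.eye(5))))  # on the order of tol or smaller
```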

1) Preconditioner: The preconditioner block's aim is to give the first approximation to the inverse matrix A^{-1}, taking as input the LUT A, which contains matrix A. This block consists of two sub-blocks, max and transpose, as shown in figure 2.

Equations (4) and (5) are computed in the max sub-block with the max row and max column blocks, and at the output of max the product ||A||_1 ||A||_∞ is obtained using a multiplier. The output is saved in a register and sent to the input of the transpose sub-block.

The transpose sub-block has an input and an output address generator, each composed of two counters and one adder. The input address generator reads LUT A as rows and the output address generator sends the data to be stored as columns; that way the matrix A is transposed.

The divider is inside the transpose sub-block; it takes as inputs the output of the max sub-block, divides every element of the transposed matrix A^T, and sends the result to the preconditioner block output to be stored into an embedded RAM memory. An address multiplexer is included to select the transpose or max output address.
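
This address-generation scheme can be sketched behaviorally as follows (software only, not the RTL); `lut_a`, assumed to be laid out row-major in a flat memory, and `norm_product` = ||A||_1 ||A||_∞ are illustrative names.

```python
def precondition_flat(lut_a: list[float], n: int, norm_product: float) -> list[float]:
    """Behavioral sketch of the transpose sub-block: read row-wise,
    write column-wise, dividing each element on the way out."""
    out = [0.0] * (n * n)
    for row in range(n):
        for col in range(n):
            in_addr = row * n + col   # input address generator: rows
            out_addr = col * n + row  # output address generator: columns
            out[out_addr] = lut_a[in_addr] / norm_product
    return out
```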

When this process is finished and the first approximation, N_0, has been sent to memory, the block sends a signal to the control block so that the core block can start an iteration.

    Fig. 2. Preconditioner block diagram.

2) Core: The core block has two sub-blocks inside: matrix multiplication and subtraction. Based on those two sub-blocks, it computes the next inverse matrix approximation, N_{m+1}, from (2). The block distribution is shown in figure 3.

The multiplication sub-block has three address generators: two of them are for the multiplication inputs and the other one allocates the row-column product. It has a multiplier to do the operation and a finite state machine to control the operation flow.
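
Behaviorally, the three address generators correspond to the three index counters of a flat-memory matrix multiply, sketched below under the assumption of row-major storage; the function and names are ours, not from the paper.

```python
def matmul_flat(a: list[float], b: list[float], n: int) -> list[float]:
    """Sketch of the multiplication sub-block over flat (row-major) memories."""
    c = [0.0] * (n * n)
    for i in range(n):              # first input address counter (row)
        for j in range(n):          # second input address counter (column)
            acc = 0.0
            for k in range(n):      # inner-product counter
                acc += a[i * n + k] * b[k * n + j]
            c[i * n + j] = acc      # output address: row-column product slot
    return c
```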

The subtraction sub-block has a finite state machine to control the process, an adder to do the subtraction given by (3I − input), and an address generator that works for both the input and output data.

Fig. 3. Core block diagram.

3) Verifier: When the core block finishes its procedure, the multiplier does another matrix multiplication between the matrix A and the last stored approximation N_{m+1}. The verifier checks whether the product gives as a result the identity matrix I; if it does, it sends a finish signal, otherwise it copies the N_{m+1} approximation to the N_m memory location and the algorithm goes back to the iterative stage.

The verifier has a sub-block, called the checker, to verify whether the approximation found is the inverse matrix A^{-1}. The checker uses two counters and a comparator to determine whether the element being analyzed is on or off the matrix diagonal. A second comparator examines whether the result is zero if the element is off the matrix diagonal, or one if it is on the diagonal. The checker FSM controls the checking procedure, and a third counter gives as output how many elements of the computed identity matrix deviate from the expected zero or one.
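
A behavioral sketch of the checker logic follows; the comparator threshold `tol` is an assumption, since the paper does not state the exact fixed-point comparison used.

```python
def count_identity_violations(P, tol: float = 1e-4) -> int:
    """Count elements of P = A @ N_{m+1} that deviate from the identity:
    diagonal elements are compared against 1, off-diagonal against 0."""
    n = len(P)
    count = 0
    for i in range(n):
        for j in range(n):
            expected = 1.0 if i == j else 0.0  # on vs. off the diagonal
            if abs(P[i][j] - expected) > tol:
                count += 1
    return count  # zero means the approximation passes verification
```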

The copier, the other sub-block of the verifier, generates the addresses with three counters. Its finite state machine controls the counters' behavior and sends a signal to enable writing to memory.

4) Multiplexer: The multiplexer block, consisting of several multiplexers and demultiplexers, routes the data needed at the input and output of each block.

5) Control and storage: For the control of the system, a finite state machine (FSM) has been implemented. An embedded RAM memory is used to store the data.

The calculation is controlled by the FSM and is carried out recursively as follows:

FSM computing flow:

1) Max values ||A||_1, ||A||_∞ are computed.
2) Initial guess N_{m=0} (3) is calculated.
3) Multiplication A * N_m is done.
4) 3I − A N_m is computed.
5) A signal is sent to the core block to compute and store the multiplication A N_m (3I − A N_m).
6) The core block performs and stores the operation 3I − A N_m (3I − A N_m).
7) The core block computes N_m (3I − A N_m (3I − A N_m)).
8) A signal is sent to the core block to compute the multiplication A * N_{m+1}.
9) The verifier checks every multiplication and counts every unexpected result.
10) If the verifier counter output is greater than zero, N_{m+1} is copied to the N_m memory address and the process starts again from step 3.
11) If the verifier counter output is zero, the process is finished.

Fig. 4. Verifier block diagram.

    IV. RESULTS

The Chebyshev-type method was first implemented in a MATLAB program with a 5x5 matrix, and the inverse matrix was found in eight iterations of the algorithm. The FPGA found the matrix inversion in eight iterations, as in the program developed in MATLAB, with an approximate time of 4.4 µs, about 6.5 times faster than MATLAB, and an error of 0.0001.

Two more experiments were performed with matrix sizes of 8x8 and 16x16, with computation times of 2.077 ms and 23.06 ms on the FPGA architecture and 4.3 ms and 58.5 ms in MATLAB, respectively. The implementation is focused on resource optimization.

A Xilinx Spartan-6 XC6SLX45 was used to implement the algorithm, with the device utilization summarized in Table II. The maximum working frequency is 50.7 MHz. Data was in a 36-bit fixed-point format, with 18 bits for the integer part and 18 bits for the fractional part. The architecture implementation is depicted in figure 5.
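
To illustrate what this 18.18 format implies, here is a small quantization sketch; signed saturation and overflow handling are omitted, so this shows only the resolution side of the format.

```python
FRAC_BITS = 18  # 36-bit fixed point: 18 integer bits, 18 fractional bits

def to_fixed(x: float) -> int:
    """Quantize a real value to the 18.18 fixed-point grid."""
    return int(round(x * (1 << FRAC_BITS)))

def from_fixed(q: int) -> float:
    return q / (1 << FRAC_BITS)

# The resolution is 2**-18, roughly 3.8e-6 per step, which is consistent
# with the average error of 0.0001 reported for the architecture.
print(from_fixed(to_fixed(1.0 / 3.0)))  # ~0.333332, to 2**-18 resolution
```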

Fig. 5. The architecture is described as follows: 1) copier, 2) max, 3) transpose, 4) multiplexer, 5) multiplication, 6) subtraction, 7) RAM, 8) checker, and 9) FSM.

TABLE II. DEVICE UTILIZATION

Logic Utilization           Used    Available   Utilization
Slice Registers              801        54576            1%
Slice LUTs                  1335        27288            4%
Fully used LUT-FF pairs      460         1676           27%
Bonded IOBs                   13          316            4%
Block RAM/FIFO                 8          116            6%
BUFG/BUFGCTRLs                 1           16            6%
DSP48A1s                      27           58           46%

    V. CONCLUSION AND FUTURE WORK

In this work a new FPGA architecture to solve the least-squares problem is proposed. It adopts an iterative Chebyshev-type method with a cubic order of convergence. This architecture is easy to extend: when the matrix to be inverted, LUT A, is enlarged, only the size of the memory and the format of the fixed-point data need to be expanded. The number of bits used in the data format is directly affected by the division done in the preconditioning block, because a small number is divided by a much greater number.

Finally, the system was developed on a Xilinx Spartan-6 XC6SLX45 FPGA chip. The results obtained have high accuracy. Incorporation of this work to solve the least-squares problem in a Compressed Sensing algorithm is currently in progress.

REFERENCES

[1] C. E. Shannon. Communication in the Presence of Noise. Proceedings of the IEEE, vol. 86, no. 2, pp. 447-457, February 1998.
[2] H. Nyquist. Certain Topics in Telegraph Transmission Theory. Proceedings of the IEEE, vol. 90, no. 2, February 2002.
[3] A. M. Bruckstein, D. L. Donoho, M. Elad. From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images. SIAM Review, vol. 51, no. 1, pp. 34-81, 2009.
[4] D. L. Donoho. Compressed Sensing. IEEE Transactions on Information Theory, vol. 52, no. 4, April 2006.
[5] E. J. Candès, J. Romberg, T. Tao. Robust Uncertainty Principles: Exact Signal Reconstruction From Highly Incomplete Frequency Information. IEEE Transactions on Information Theory, vol. 52, no. 2, February 2006.
[6] R. Baraniuk. Compressive Sensing. IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 118-120, 124, 2007.
[7] Y. C. Eldar, G. Kutyniok. Compressed Sensing: Theory and Applications. Cambridge University Press, ISBN 978-1-107-00558-7, 2012.
[8] Linfeng D., Rui W., Wanggen W., XiaoQing Y., Shuai Y. Analysis on Greedy Reconstruction Algorithms Based on Compressed Sensing. International Conference on Audio, Language and Image Processing (ICALIP), pp. 783-789, Shanghai, July 2012.
[9] L. Jicheng, Z. Hao, M. Huadong. Novel Hardware Architecture of Sparse Recovery Based on FPGAs. 2nd International Conference on Signal Processing Systems (ICSPS), vol. 1, pp. 302-306, Dalian, July 2010.
[10] P. Blache, H. Rabah, A. Amira. High Level Prototyping and FPGA Implementation of the Orthogonal Matching Pursuit Algorithm. 11th International Conference on Information Science, Signal Processing and their Applications (ISSPA), pp. 1336-1340, Montreal, QC, July 2012.
[11] L. Bai, P. Maechler, M. Muehlberghuber, H. Kaeslin. High-Speed Compressed Sensing Reconstruction on FPGA Using OMP and AMP. 19th IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp. 53-56, Seville, December 2012.
[12] J. L. V. M. Stanislaus, T. Mohsenin. Low-Complexity FPGA Implementation of Compressive Sensing Reconstruction. International Conference on Computing, Networking and Communications (ICNC), pp. 671-675, San Diego, CA, January 2013.
[13] J. L. V. M. Stanislaus, T. Mohsenin. High Performance Compressive Sensing Reconstruction Hardware with QRD Process. IEEE International Symposium on Circuits and Systems (ISCAS), pp. 29-32, Seoul, May 2012.
[14] H.-B. Li, T.-Z. Huang, Y. Zhang, X.-P. Liu, T.-X. Gu. Chebyshev-type methods and preconditioning techniques. Applied Mathematics and Computation, vol. 218, no. 2, pp. 260-270, September 2011.
[15] S. Amat, S. Busquier, J. M. Gutiérrez. Geometric constructions of iterative functions to solve nonlinear equations. Applied Mathematics and Computation, vol. 157, no. 1, pp. 197-205, August 2003.