National University of Singapore
Faculty of Engineering
Department of Electrical and Computer Engineering
M.Eng. Dissertation
Accelerating Real-time Computer Vision Algorithms on Parallel
Hardware Architectures
Submitted by: Mr. Ang Zhi Ping (B. Eng. (Hons.), NUS), [email protected]
Supervisor: Assistant Professor Akash Kumar, [email protected]
A thesis submitted for the degree of Master of Engineering
2014
Typeset in LaTeX 2ε. Last revised on October 28, 2014.
Declaration
I hereby declare that this thesis is my original work and it has been written
by me in its entirety. I have duly acknowledged all the sources of information
which have been used in the thesis. This thesis has also not been submitted
for any degree in any university previously.
Ang Zhi Ping, October 28, 2014
Acknowledgments
The author would like to express gratitude to his research supervisor, Assistant Professor Akash Kumar, who provided invaluable advice on the choice of algorithms and on implementation aspects on hardware platforms, and to Joling, who assisted in collecting and tabulating run-time results. He would also like to thank DSO National Laboratories for providing full sponsorship during the course of study.
Contents
Declaration
Acknowledgments
Summary
List of Tables
List of Figures
List of Symbols and Abbreviations

1 Introduction
  1.1 Foundations
  1.2 Formulation
  1.3 ℓ1-Optimization
  1.4 Application: Image Registration Under Projective Transformation
      1.4.1 Method
      1.4.2 Registration Under Gaussian Error
      1.4.3 Registration Under Gross Occlusion Error
      1.4.4 Remarks
  1.5 ℓ1 Optimization on Embedded Platforms
      1.5.1 Computer Vision and Imaging
      1.5.2 Biomedical Sensing
      1.5.3 Wireless Sensor Networks
  1.6 Thesis Overview

2 Unsuitability of Targeting Existing Solvers to Embedded Platforms
  2.1 Floating Point Division
  2.2 Expensive Transcendental Functions
  2.3 Large Software Libraries
  2.4 Inefficient Memory Usage
  2.5 Lack of Parallelism
  2.6 Poor Recovery Performance of Orthogonal Matching Pursuit
  2.7 Questionable Scalability of OMP

3 Vectorization on Embedded Systems
  3.1 SIMD Instruction Set
  3.2 Eliminating if-else Using Vectorization
  3.3 Limitations of SIMD Processing

4 Proposed BPDN Solver
  4.1 Overview
  4.2 Initializing x and Ω
  4.3 Phase I: Solving for x Given Ω
  4.4 Phase II: Updating Ω Given x
  4.5 Estimating the Sparsity of x
      4.5.1 Wright et al. [1]: Face recognition from a database
      4.5.2 Wang et al. [2]: Three-dimensional arrangement of light sources in bioassays
      4.5.3 Elhamifar and Vidal [3]: Motion segmentation using sparse subspace clustering

5 BPDN Solver Benchmark
  5.1 List of Solvers
      5.1.1 ADMM LASSO [4]
      5.1.2 CGIST [5]
      5.1.3 FPC-BB [6]
      5.1.4 GLMNET [7]
      5.1.5 GPSR-BB6 [8]
      5.1.6 Homotopy [9]
      5.1.7 L1-LS [10]
      5.1.8 OMP [11]
      5.1.9 SESOP_PACK [12]
      5.1.10 SPAMS [13]
      5.1.11 SpaRSA2 [14]
      5.1.12 TFOCS [15]
      5.1.13 TwIST2 [16]
      5.1.14 YALL1 [17]
  5.2 Test Input
  5.3 Run-Time Performance
  5.4 Algorithm Bottleneck
  5.5 Accuracy of Recovered Results
  5.6 Memory Usage
  5.7 Convergence Properties of Solver

6 Solver Implementation on the Xilinx Zynq Z-7020
  6.1 Porting MATLAB to C
  6.2 Eigen BLAS Library
  6.3 Accelerating A(:,Ω)^T A(:,Ω) Using Programmable Logic
      6.3.1 Multiply-And-Accumulate Hardware Engine
      6.3.2 Detailed Operation
      6.3.3 Specification of MAC Engine
  6.4 Remaining Bottleneck

7 Solver Implementation on NVIDIA CUDA GPU Architecture
  7.1 Comparisons Between GPU and FPGA Architectures
  7.2 I/O-boundness of A^T (A(:,Ω) x_Ω)
  7.3 Accelerating Level 2 BLAS Operation (GEMV) on CUDA
  7.4 Problem Partitioning
  7.5 Thread Block Operation
      7.5.1 Staging Data onto Shared Memory
      7.5.2 Multiply and Accumulate
      7.5.3 Copy to Shared Memory
      7.5.4 Final Summation
      7.5.5 Writing Results to DDR Memory
  7.6 Parallelism
  7.7 Hardware Benchmark
      7.7.1 Hardware Environment
      7.7.2 Results

8 Conclusion
  8.1 Proposed ℓ1-Solver and FPGA Implementation
  8.2 GPU Acceleration of GEMV
Summary
Computer vision routines are typically designed as sequential algorithms intended to run on CPUs; hence they are often formulated with parallelism as an afterthought. Targeting these algorithms for real-time applications on parallel hardware architectures such as field-programmable gate arrays (FPGAs) and graphics processing units (GPUs) is therefore a tedious exercise.
In this thesis, a class of ℓ1-optimization problems known as basis pursuit denoising (BPDN) is explored. BPDN has been widely used in situations where the underlying signal is known to be sparse and measurements are corrupted by Gaussian error. Robust computer vision algorithms are often formulated as ℓ1-optimization problems because they give better quality results when the data is corrupted by non-Gaussian sources of error. For example, occlusion, salt-and-pepper noise and discontinuous flow boundaries can be effectively modeled using ℓ1 penalty terms.
Real-world applications rely heavily on an embedded real-time ℓ1-solver to recover the sparse signal within a reasonable time frame, because such a solver can be integrated with a portable device. Unfortunately, existing solvers are generally unsuitable for embedded implementation due to poor run-time performance, lack of parallelism or high memory usage. To address these issues, this thesis proposes an efficient ℓ1-solver suitable for implementation on parallel architectures. The algorithm is implemented on the Xilinx Zynq-7020 All Programmable System-on-Chip FPGA and the NVIDIA Tesla M2050 GPU Computing Module.
For a problem with 5000 variables and 500 constraints, the solver occupies a small memory footprint of 29 kB and takes 0.14 seconds to complete on the Zynq Z-7020. The same problem takes 0.19 seconds on the second-generation Intel Core i7-2620M mobile processor, which runs at 4 times the clock frequency and 114 times the power budget of the Z-7020. Without sacrificing run-time performance, the solver is highly optimized for power-constrained embedded applications. To date, this is the first energy-efficient embedded solver capable of handling large-scale problems with several thousand variables.
Although FPGAs are suited to accelerating compute-bound operations, they under-perform on I/O-bound operations due to limited off-chip memory bandwidth. GPUs, in contrast, have substantially larger memory bandwidths and are therefore ideal for speeding up I/O-bound operations. In this respect, the solver is also implemented on the NVIDIA Tesla M2050 GPU Computing Module, with the cuBLAS library used to port linear algebraic routines to the GPU. The thesis also furnishes an optimized algorithm for generalized matrix-vector multiplication, which is at least twice as fast as cuBLAS and 702 times as fast as the FPGA implementation.
Some of the figures in the thesis are animations that can be viewed using
Adobe Reader 7 or later. An electronic copy of this thesis is accessible at
http://x.co/mthesis.
List of Tables
3.1 4-way single precision floating point SIMD assembly instructions in common architectures
3.2 A sequence of transformations vectorizing algorithm 2
5.1 Run-time profiling results on MATLAB
6.1 Run-time profiling results on the Z-7020 using Gprof
6.2 Hardware MAC engine specifications
7.1 Comparison of memory bandwidth for various hardware systems
List of Figures
1.1 Image registration error using ℓ1 and ℓ2 penalties for various Gaussian noise levels
1.2 Animation — Comparison of ℓ1 versus ℓ2 registration for varying Gaussian noise levels. Brighter pixels indicate higher registration error
1.3 Animation — Comparison of ℓ1 versus ℓ2 registration for varying degrees of gross occlusions. Brighter pixels indicate higher registration error
1.4 Image registration error using ℓ1 and ℓ2 penalties for varying degrees of gross occlusions
4.1 Animation — Convergence towards the true solution (red boxes)
5.1 Run-time for various problem sizes, where A ∈ R^(m×n) and there are s non-zero entries in x
5.2 Recovery accuracy for various problem sparsities, where A ∈ R^(m×n) and there are s non-zero entries in x
5.3 Peak memory usage for various problem sizes, where A ∈ R^(m×n) and there are s non-zero entries in x
5.4 Convergence of the proposed solver for the test case m = 10000, n = 100000 and s = 100 for various η
6.1 ZedBoard hardware development platform
6.2 Hardware engine comprising 9 floating point MAC units
6.3 Submatrix partitioning of A(:,Ω)^T A(:,Ω) for various sizes. Redundantly computed entries are coloured purple
6.4 Output of every MAC being demultiplexed to a bank of 8 registers
6.5 Post-routed layout on the Z-7020 using Vivado 2013.2 (Legend: yellow – MAC engine, brown – ARM Cortex-A9 and DDR3 memory bus, green – AXI_ACP interconnect logic, blue – AXI_GP interconnect logic, red – reset logic)
6.6 System module schematic (Legend: processing_system7_1 – ARM Cortex-A9, matmul_1 – hardware MAC engine, axi_mem_intercon – AXI_ACP bus logic, processing_system7_1_axi_periph – AXI_HP bus logic, proc_sys_reset – reset logic)
7.1 Animation — Computation of A^T (A(:,Ω) x_Ω). The rectangle in the middle is A^T, and the column vector on the right is A(:,Ω) x_Ω. Blue represents read, and red represents write operations in the respective memory locations
7.2 Animation — A^T (A(:,Ω) x_Ω) partitioned into a one-dimensional kernel grid comprising 16 × 16 thread blocks
7.3 Organization of the problem at the thread block level
7.4 Animation — Serialized visualisation of staging 16 × 1 partitions of A(:,Ω) x_Ω onto the shared memory. Yellow indicates the thread in charge of the transfer
7.5 Animation — Multiply and accumulate operation
7.6 Animation — Contents of the registers for every thread copied into shared memory
7.7 Animation — First 16 threads performing the final summation
7.8 Animation — Writing the final accumulated result to off-chip memory
7.9 GEMV speedup of the proposed kernel over cuBLAS (plotted along the z-axis) and the corresponding Zynq-7020 implementation (plotted as a color map). Error bars annotate the 95% confidence interval at every data point
7.10 Memory transfer efficiency with respect to the advertised peak memory bandwidth
7.11 Computational efficiency with respect to the advertised peak performance
List of Symbols and Abbreviations

a · b        Vector dot product, or equivalently a^T b.

Ω, Ψ         Symbols used to represent ordered sets of integers, i.e. indices of a matrix. Can be interpreted as a |Ω| × 1 (|Ψ| × 1) matrix.

(·)(i,j)     Returns the entry residing in the i-th row and j-th column of the bracketed matrix expression. i and j are 1-indexed.

(·)(a:b,c:d) Returns a submatrix of the bracketed matrix expression, comprising rows a through b inclusively, and columns c through d inclusively. Special cases are (·)(i,:) and (·)(:,j), which give the i-th row and j-th column vector respectively.

N(A)         Null space of the column space of matrix A.

O(f(n))      Big O notation, where there exist finite positive constants M, n0 such that g(n) = O(f(n)) ↔ |g(n)| ≤ M|f(n)| for all n ≥ n0.

ARM          Advanced RISC Machines, originally known as Acorn RISC Machine

AXI          Advanced eXtensible Interface, a bus standard which is part of the Advanced Microcontroller Bus Architecture

BLAS         Basic Linear Algebra Subprograms. These routines are often highly optimized by hardware vendors, and form the backbone of advanced matrix libraries such as LAPACK. Examples are the Intel Math Kernel Library and NVIDIA cuBLAS.

BPDN         Basis pursuit denoising

BRAM         Block random access memory, a hardware resource commonly available in FPGAs

CMOS         Complementary metal oxide semiconductor

CPU          Central processing unit

DDR          Double data rate, a memory bus standard

FPGA         Field-programmable gate array

GEMV         Generalized matrix-vector multiplication y ← αAx + βy. This operation and the triangular solver Tx = y are part of the Level 2 BLAS functionality.

GPGPU        General-purpose computing on graphics processing units

GSM          Global System for Mobile Communications

LAPACK       Linear Algebra Package

LU           Lower upper, used within the context of LU decomposition

MAC          Multiply-and-accumulate

MAP          Maximum a posteriori

MATLAB       Matrix Laboratory, a scientific computing software environment

MRI          Magnetic resonance imaging, a non-invasive medical imaging technology

RISC         Reduced instruction set computing

SM           Streaming multiprocessor, a hardware building block of NVIDIA GPUs

SIMD         Single instruction multiple data. Often used in the context of SIMD instruction sets.

WSN          Wireless sensor network
Chapter 1
Introduction
Sparse recovery has made inroads into several fields of research such as computer vision [18], radar [19] and medical imaging [20]. By relaxing NP-hard sparse recovery problems into convex ℓ1-optimization programs, problems previously reckoned to be intractable are now solvable in polynomial time [21, 22]. A category of sparse recovery, collectively known as basis pursuit denoising (BPDN), recovers a sparse solution from linear measurements that are corrupted by Gaussian noise. This method is therefore highly relevant for dealing with real-world data and is the focus of this thesis.
Hosting an embedded real-time ℓ1-optimization solver is desirable because analysis can be performed in situ instead of being deferred to off-line processing. Moreover, an embedded target is highly suited to power- and space-constrained environments. Unfortunately, existing solvers are either time-consuming, memory intensive or unsuitable for parallelization because of inappropriate embedded design methodology. This thesis therefore motivates the design of an efficiently parallelizable solver suitable for targeting various hardware platforms such as CPUs, DSPs, GPUs and FPGAs.
1.1 Foundations
In several applications, experimenters are interested in recovering an underlying physical signal x ∈ R^n. Due to limitations of the sensor technology, x cannot be directly observed. Instead, a set of linear measurements y = Ax is available, where A ∈ R^(m×n) characterizes the sensor's physics. Recovering x is straightforward if A has full column rank; the pseudo-inverse x = A⁺y uniquely recovers the x which agrees with the measurements in the least-squares sense.

The number of measurements made is roughly proportional to the sensor's power consumption, so it makes sense to reduce the number of measurements on power-constrained embedded platforms. But when the number of constraints is less than the dimension of x, the pseudo-inverse yields an infinite family of solutions, namely x = A⁺y + N(A)z for z ∈ R^(dim N(A)).
This ambiguity in x has been elegantly resolved in breakthrough research by Candes et al. [21], subject to additional sparsity constraints imposed on x and incoherency on A. In their paper, it is proven that if x is a k-sparse signal (i.e. there are up to k non-zero entries in x), and the number of randomly chosen frequency samples taken as linear measurements is no less than O(k log n), then x can be uniquely recovered by solving an ℓ1 convex optimization problem. Specifically, the optimization entails minimizing the sum of absolute values of x, subject to the set of equality constraints where the solution agrees with the linear measurements (Equation 1.1). Using this method, the number of measurements can be reduced by an exponential factor.
x* = argmin_x ||x||_1   subject to   y = Ax        (1.1)
1.2 Formulation
The scenario in section 1.1 is highly idealistic because y and A are assumed to be noiseless. This is rarely true, as y is often corrupted by thermal noise, and A can never be tailored or measured to infinite precision. Thus, the method of basis pursuit denoising (BPDN, [23]) loosens the equality constraint and assumes the presence of Gaussian error in the measurement term (Equation 1.2). The parameter λ trades signal fidelity (λ → 0) for solution sparsity (λ → ∞).
x* = argmin_x (1/2)||y − Ax||_2^2 + λ||x||_1        (1.2)
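Many first-order BPDN solvers are built around the soft-thresholding operator, the proximal map of the ℓ1 term in Equation 1.2. The following pure-Python sketch of iterative soft-thresholding (ISTA) is illustrative only — it is not the solver proposed in this thesis, and `ista`/`soft_threshold` are hypothetical helper names:

```python
def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink each entry of v towards zero by t."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]

def ista(A, y, lam, step, iters=500):
    """Minimize 0.5*||y - Ax||_2^2 + lam*||x||_1 by gradient step + shrinkage.
    A is a list of rows; step must be below 1/||A||_2^2 for convergence."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # residual r = Ax - y, then gradient g = A^T r of the smooth term
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = soft_threshold([x[j] - step * g[j] for j in range(n)], step * lam)
    return x
```

Each iteration costs two matrix-vector products and an elementwise shrinkage, which is why the matrix-vector kernels accelerated in later chapters dominate solver run-time.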
A variant of Equation 1.2 is the weighted BPDN (Equation 1.3), where w is a set of non-negative weights. This formulation is used in re-weighted ℓ1-minimization, which enhances recovery performance by solving a sequence of BPDN problems with an adaptive w [24].
x* = argmin_x (1/2)||y − Ax||_2^2 + ||Qx||_{1,w}        (1.3)
Q is an optional sparsifying orthogonal basis for x, i.e. the signal has a sparse representation under Q. This is commonly encountered in image processing applications where the underlying image vector Qx has a sparse representation in a frequency domain; transformations suitable for images include the discrete cosine and Haar transforms. A relevant application is magnetic resonance imaging (MRI), which involves solving Equation 1.3, where x represents the underlying MRI image to be recovered, y − Ax represents the residual against the set of measurements made by the MRI scanner, and Q is a suitable wavelet transform which efficiently compresses MRI imagery [25].
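To make the notion of a sparsifying basis concrete, here is a sketch (not from the thesis; `haar_1d` is a hypothetical helper) of a 1-D Haar decomposition. A piecewise-constant signal, which is dense in the sample domain, maps to a coefficient vector that is mostly zeros:

```python
def haar_1d(signal):
    """Full Haar decomposition of a length-2^k signal (unnormalized averages).
    Returns [overall average] followed by detail coefficients, coarse to fine."""
    details = []
    s = list(signal)
    while len(s) > 1:
        # pairwise averages carry the coarse shape, differences carry the detail
        avg = [(s[2 * i] + s[2 * i + 1]) / 2 for i in range(len(s) // 2)]
        det = [(s[2 * i] - s[2 * i + 1]) / 2 for i in range(len(s) // 2)]
        details = det + details
        s = avg
    return s + details
```

For example, the 8-sample step signal [5, 5, 5, 5, 2, 2, 2, 2] has only two non-zero Haar coefficients, so an ℓ1 penalty on Qx favours exactly such signals.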
1.3 ℓ1-Optimization
Solving BPDN entails performing ℓ1-optimization. Consider the overdetermined problem in Equation 1.4. Given that the error e_G has a Gaussian prior, the first case yields the MAP estimate, whereas if the error e_L has a Laplace prior, the second case gives the MAP estimate [26]. Therefore, the most probable solution depends on the properties of the underlying error distribution.
The Gaussian prior has light tails, therefore errors that can be modeled as Gaussian should not deviate far from the mean. An example would be Johnson-Nyquist noise introduced by thermal agitation of electrons, which in turn reduces the precision of CMOS image sensors by a few bits. The Laplace prior, on the other hand, has much heavier tails, hence it is better suited for modeling errors which may be arbitrarily large, for example salt-and-pepper noise. Depending on the type of error being modeled, the respective ℓ2 or ℓ1 objective should be solved.
x* = argmin_{x ∈ R^n} ||e_G||_2,  e_G = Ax − b        (Gaussian prior)
x* = argmin_{x ∈ R^n} ||e_L||_1,  e_L = Ax − b        (Laplace prior)        (1.4)
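The contrast between the two cases of Equation 1.4 is easiest to see when A is a column of ones: estimating a scalar from repeated measurements, the ℓ2 minimizer is the sample mean while the ℓ1 minimizer is the sample median, which a single gross outlier cannot drag away. A small illustrative sketch (not from the thesis):

```python
def l2_estimate(b):
    """argmin_x sum_i (x - b_i)^2 is the sample mean."""
    return sum(b) / len(b)

def l1_estimate(b):
    """argmin_x sum_i |x - b_i| is the sample median."""
    s = sorted(b)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else 0.5 * (s[mid - 1] + s[mid])

clean = [1.0, 1.1, 0.9, 1.0, 1.05]
corrupted = [1.0, 1.1, 0.9, 1.0, 100.0]  # one gross, non-Gaussian outlier
# l2_estimate(corrupted) is dragged towards the outlier; l1_estimate is not.
```

This is exactly the behaviour exploited in the registration experiment below: gross occlusions act like the outlier, and the ℓ1 objective simply ignores them.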
1.4 Application: Image Registration Under Projective Transformation
To illustrate the differences between using ℓ2 and ℓ1 objectives, both methods
are used to register a pair of closely related images (denoted I and I′),
possibly coming from consecutive frames of an aerial video feed, or from the same scene captured at different times. We want to infer a homography H
such that the projective relation x′ ∼ Hx applies, where x is in homogeneous
coordinate form corresponding to a point in I, and x′ to I ′. Assuming there
are no changes in scene illumination, one would expect that the optimal H
would minimize Σi‖I(xi) − I ′(Hxi)‖2, since the same pixel from different
vantage points should have equal intensities. This reasoning is valid if the matching errors can be attributed to Gaussian sources of noise, e.g. sensor noise. If the pixel mismatch is due to sources of error that
do not obey Gaussian statistics and have large variances, the objective
Σi‖I(xi) − I ′(Hxi)‖1 would be a better choice. Examples of errors that can
be modeled using the �1 objective include objects which are either absent or
displaced between the image pair, or due to hard-coded video watermarks.
1.4.1 Method
I and I′ are grayscale images of a scale model of an urban environment, taken from slightly different viewpoints. The task is to register I′ to I under a projective transformation H. Two
types of errors are investigated: Gaussian noise and occluding rectangles.
For the first case, every pixel of both images is subjected to additive zero-mean Gaussian noise of fixed standard deviation σ, varied from 0 to 0.099 in steps of 0.001. For the second case, to simulate sources of non-Gaussian error with large variance, occlusion error is introduced by overlaying randomly-sized, uniformly coloured rectangles at random locations in I and I′. The colour of every overlay is uniformly distributed over the grayscale range. The degree of occlusion is measured by the proportion of pixels that are unoccluded in both I and I′. Image registration proceeds by overlaying an increasing number of rectangles, corresponding
Figure 1.1: Image registration error using ℓ1 and ℓ2 penalties for various Gaussian noise levels.
to increasing occlusion degree. In both scenarios, 100 registrations are done
for varying degrees of error. The quality of the registration is measured by
Σi‖I(xi) − I ′(Hxi)‖1.
1.4.2 Registration Under Gaussian Error
Figure 1.2 shows a composite animation, comprising the corrupted {I, I′} and the registration error for both the ℓ1 and ℓ2 methods. Although it is not obvious from the error images, ℓ2 outperforms ℓ1 for large Gaussian errors (Figure 1.1), albeit by a small proportion compared to the baseline error. This result naturally follows from the assumption that the residual error is Gaussian.
Figure 1.2: Animation — Comparison of ℓ1 versus ℓ2 registration for varying Gaussian noise levels. Brighter pixels indicate higher registration error.
[Animation frames: corrupted image pair and the corresponding ℓ1 and ℓ2 registration errors, for Gaussian σ from 0.000.]
It is notable that ℓ1 slightly outperforms the ℓ2 method for small Gaussian noise. This is not surprising considering that the registered images substantially deviate at the boundaries, where there are no matching pixels. Boundary mismatches are sources of non-Gaussian error which are elegantly handled by the ℓ1 objective. As ℓ1 does not penalize boundary mismatches as harshly as ℓ2, the latter overcompensates and, as a result, deduces a non-optimal transformation.
1.4.3 Registration Under Gross Occlusion Error
From Figure 1.3, it is evident that ℓ1 outperforms ℓ2 for high degrees of occlusion. Even when just a fifth of the pixels are unoccluded, the ℓ1 objective solves for an H comparable to that obtained when there is no occlusion.
1.4.4 Remarks
We have seen that ℓ1-optimization robustly models non-Gaussian sources of error. When registering a pair of corrupted images with up to 80% occlusion error, the ℓ1 penalty gives a residual error that is comparable to the case of no occlusion, whereas with the ℓ2 penalty, the registration error consistently increases with occlusion, with the total absolute error reaching nine times the baseline error. Even when the ℓ1 objective is used in place of ℓ2 in the presence of Gaussian sources of error, the residual error incurred is only slightly greater than the baseline error. Therefore, the case of Gaussian error can be conveniently subsumed within the framework of ℓ1-optimization.
Figure 1.3: Animation — Comparison of ℓ1 versus ℓ2 registration for varying degrees of gross occlusions. Brighter pixels indicate higher registration error.
1.5 ℓ1 Optimization on Embedded Platforms
Embedded applications which exploit compressive sensing realize power
savings by lowering the sampling rate below the Nyquist frequency. The
underlying sparse signal can then be reliably recovered from limited sensor
readings using sparse recovery. A survey of compressive sensing applications
in constrained embedded environments is detailed in this section.
[Animation frames: corrupted image pair and the corresponding ℓ1 and ℓ2 registration errors, for occlusion from 0.0%.]
Figure 1.4: Image registration error using ℓ1 and ℓ2 penalties for varying degrees of gross occlusions.
1.5.1 Computer Vision and Imaging
Object recognition based on a distributed camera system is explored in [27], where compressive sensing is used to reduce the amount of data transferred over the bandwidth-constrained network. Recovery is performed on a centralized base station which hosts an ℓ1-solver. Similar is the work of Wani and Rahnavard [28], which explores using compressive sensing via linear projections to reduce bandwidth usage in a wireless camera sensor network. Like [27], recovery is done at a base station.
A high frame rate, low power CMOS sensor has been realized in [29], where power is reduced by using compressive sensing to cut the number of readings taken per frame. Image reconstruction is performed off-line using basis pursuit with total variation. Charbiwala et al. [30] presented a low power, scalable analog frontend which samples directly in the compressive domain by randomly projecting the input signals onto a set of pseudo-random sparse bases.
1.5.2 Biomedical Sensing
The works of Hussein et al. [31] and Chiang and Ward [32] use compressive sensing to lengthen the operational life of wireless electroencephalography (EEG) sensors used in detecting the onset of epileptic seizures. [33] demonstrates low power recovery of sparsely sampled EEG signals with an ASIC chip that consumes a little over 107 μW during operation. Similarly, Shoaran et al. [34] fabricated a wireless neural recording chip that uses compressive sensing to cut power consumption. Imtiaz et al. [35] explore various topologies of wearable sensor nodes and how they impact recovery accuracy and battery lifespan. The aforementioned applications rely on a computer to do off-line sparse recovery.
1.5.3 Wireless Sensor Networks
Wireless sensor networks (WSN) are used in applications where there is a need to collect data (e.g. acoustic, seismic, humidity, soil composition) across vast geographical regions. They are realized by the distributed placement of autonomous sensing elements, each within communication proximity of its neighbors. Data is generated on the fly by the nodes and aggregated by wireless transmission to gateway nodes, which in turn transmit data out of the WSN through GSM or satellite networks. The operational time of individual nodes is constrained by the limited initial energy available to them.
A use case for compressive sensing is to reduce the number of actively
sensing nodes. Ling and Tian [36] exploited the sparse nature of local
spatial phenomena by powering down part of the network in a randomized
fashion. Also, their work features the novel use of distributed computing to
perform sparse signal recovery by judiciously spreading the computational
workload across the WSN.
1.6 Thesis Overview
In this introduction we motivate the need for a power efficient ℓ1 solver suited to performing sparse recovery on embedded platforms. The next chapter discusses existing solvers and their unsuitability for being ported to embedded platforms. Chapter three covers various architectural aspects of embedded processors that ℓ1 solvers can exploit. Chapter four proposes a low power ℓ1 solver, and performance metrics such as run-time, recovery accuracy and memory usage of the proposed solver are compared against a benchmark of state-of-the-art solvers in chapter five. Chapters six and seven discuss in detail implementation aspects of the proposed solver on the Xilinx Zynq Z-7020 FPGA and NVIDIA CUDA GPUs respectively. The final chapter concludes by summarizing key results and future directions.
Chapter 2
Unsuitability of Targeting Existing Solvers to Embedded Platforms
Solving BPDN problems generally requires more time than solving problems with closed-form linear algebraic solutions (e.g. least-squares regression). This is because optimized software libraries for matrix computations, most notably the Basic Linear Algebra Subprograms (BLAS), can be used to accelerate problems formulated using linear algebra, something that cannot be easily applied to BPDN problems. The main difficulty with ℓ1-optimization is the minimization of non-smooth functions. Most solvers minimize a smooth approximation of the objective function, progressively refining the solution by solving a sequence of sub-problems approaching the original objective. In this respect, ℓ1-solvers are often control-flow rather than data-flow intensive, making acceleration on parallel hardware a nontrivial task. With this combination of factors, it is rare to find ℓ1-solvers deployed to solve real-time, large scale problems on embedded devices. This chapter explains the factors contributing to the unsuitability of porting existing ℓ1 solvers onto embedded targets.
2.1 Floating Point Division
Solver designers often assume that all mathematical operations run in
constant time on the CPU, which is false because different operations have
critical paths of varying lengths. Between two n-bit operands, addition
takes one clock cycle because the critical path has log2 n levels of logic
[37]. Multiplication takes slightly longer, as the critical path is
log1.5 n deep based on the Wallace tree multiplier [38], whereas division
is the most expensive because there is no trivial circuit that computes
the quotient within a single clock cycle. Instead, iterative algorithms
such as Goldschmidt division [39] are used to successively approximate the
quotient, requiring several clock cycles to complete. For example, the
Cortex-A9 has an initiation interval of 10 cycles for floating point
division. The prudent designer avoids frivolous use of division, as it
heavily impacts run-time performance.
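As a hedged illustration of why division is iterative, the following C sketch (function name hypothetical, not from a hardware divider) implements Goldschmidt-style division: the denominator is normalized to [0.5, 1), and both operands are repeatedly multiplied by f = 2 − d, which drives the denominator towards 1 and roughly doubles the number of correct bits per step.

```c
#include <math.h>

/* Sketch of Goldschmidt division, q = a / b (b > 0); each of the five
   iterations squares the error term 1 - d, mimicking the multi-cycle
   behaviour of hardware dividers. Function name is illustrative. */
double goldschmidt_div(double a, double b) {
    int e;
    double d = frexp(b, &e);   /* b = d * 2^e with d in [0.5, 1) */
    double n = ldexp(a, -e);   /* scale the numerator identically */
    for (int i = 0; i < 5; i++) {
        double f = 2.0 - d;    /* correction factor */
        n *= f;
        d *= f;                /* d -> 1 - (1 - d)^2 */
    }
    return n;                  /* d is now ~1, so n is ~a / b */
}
```

Five iterations suffice here because the relative error shrinks quadratically; a hardware divider trades the same convergence for several pipeline cycles.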
2.2 Expensive Transcendental Functions
A handful of solvers liberally use transcendental functions. For example,
ℓ1-MAGIC [40] uses the log-barrier method to solve BPDN, but it is rare to
find hardware support for computing logarithms. These functions are often
implemented as polynomial approximations in software that typically take
hundreds of clock cycles per evaluation. Hardware support for transcendental
functions is sporadic and, where available, expensive at run-time: the ARM
Cortex-A9 on board the Z-7020 supports floating point square root, but each
computation takes 13 clock cycles.
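To make the cost concrete, here is a hedged sketch (not the thesis's or any library's code) of how a software logarithm is typically built: range-reduce via the exponent, then evaluate a truncated odd polynomial. Even this short five-term series needs a dozen multiply-adds plus a division, and production-quality accuracy requires more terms still.

```c
#include <math.h>

/* Illustrative software ln(x) for x > 0: write x = m * 2^e with
   m in [0.5, 1), then ln(m) = 2*atanh(z) with z = (m-1)/(m+1),
   evaluated as a truncated series. Accuracy here is only ~1e-6. */
double approx_log(double x) {
    int e;
    double m = frexp(x, &e);             /* x = m * 2^e */
    double z = (m - 1.0) / (m + 1.0);    /* |z| <= 1/3 on [0.5, 1) */
    double z2 = z * z;
    double p = z * (1.0 + z2 * (1.0/3 + z2 * (1.0/5 + z2 * (1.0/7 + z2/9))));
    return 2.0 * p + (double)e * 0.6931471805599453; /* + e * ln 2 */
}
```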
2.3 Large Software Libraries
Several solvers rely on software libraries that provide advanced matrix
operations (e.g. LAPACK). An example is SparseLab [41], which uses Cholesky
and LU factorization. Hosting such a library on an embedded platform is
memory intensive: a pre-compiled LAPACK library1 for x86 targets consumes
7.4 MB of program space, and possibly more on a RISC architecture like ARM.
2.4 Inefficient Memory Usage
A number of solvers achieve speed-up by pre-computing ATA. These solvers
are likely to run out of memory when handling problems with thousands of
variables because memory usage scales quadratically with n. CGIST [5]
and L1-LS [10] cache AT along with A, and ADMM LASSO [4] saves the
LU factorization of A, making them consume an extra O(mn) memory.
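A back-of-envelope calculation (helper names mine, single precision assumed) shows why caching ATA is untenable at the problem sizes benchmarked later in this thesis: for m = 10000 and n = 100000, A occupies about 4 GB while ATA would occupy about 40 GB.

```c
/* Hypothetical helpers: storage needed for A (m x n) versus a cached
   A'A (n x n), both in single precision (4 bytes per entry). */
unsigned long long matrix_bytes(unsigned long long m, unsigned long long n) {
    return m * n * 4ULL;        /* A itself: O(mn) */
}
unsigned long long gram_bytes(unsigned long long n) {
    return n * n * 4ULL;        /* A'A: O(n^2), quickly dominates */
}
```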
2.5 Lack of Parallelism
Some solvers are control-flow intensive, complicating any attempt at
pipelined processing or data-flow acceleration. Examples are TFOCS
[15] and SpaRSA2 [14].
1 http://icl.cs.utk.edu/lapack-for-windows/lapack/
2.6 Poor Recovery Performance of Orthogonal Matching Pursuit
Because hosting an embedded ℓ1-solver is challenging, the majority of
FPGA systems that require ℓ1-optimization [42, 43, 44, 45, 46, 47, 48, 49]
use orthogonal matching pursuit (OMP) [11] to approximately solve the
BPDN problem. Since OMP is a greedy heuristic, implementation is
straightforward and the run-time is short provided x is very sparse. Although
OMP has a recovery accuracy that rivals standard ℓ1-solvers for Gaussian A,
performance is poor for correlated matrices. A comparison of the recovery
accuracy between BPDN and OMP for problems with non-random A can
be found in [50].
2.7 Questionable Scalability of OMP
OMP has good run-time performance under the assumption that x is highly
sparse, which is unnecessarily pessimistic because an ℓ1 problem is
recoverable as long as x is O(√m)-sparse. Since the run-time of OMP scales
quadratically with sparsity, these solvers are only capable of handling
problems with low sparsity. The timings reported by studies that implement
OMP on FPGA are therefore highly optimistic because the sparsity (defined
as the fraction s/n) used is close to 0. Some reported sparsity parameters
are: 0.03 (n = 256) [42], 0.04 (n = 128) [43], 0.04 (n = 128) [45],
0.008 (n = 255) [46] and 0.04 (n = 128) [48]. Given that most of the
publications reported timings for n ≤ 256, it is doubtful whether these
solvers will scale gracefully with problem size, let alone with increasing
problem sparsity.
Chapter 3
Vectorization on Embedded Systems
To run a solver on an embedded system while achieving real-time performance,
the algorithm has to fully exploit the parallel hardware features commonly
found in modern embedded processors. Modern CPUs and GPU shader cores have
built-in single-instruction multiple-data (SIMD) instruction sets, making
them highly efficient at computing programming constructs that are
embarrassingly parallel. Examples of embedded processors with SIMD
capability include the Intel Atom (Streaming SIMD Extensions), Power
Architecture (AltiVec) and ARM (NEON). Vectorization also naturally extends
to GPUs, where every shader core holds the same instructions and executes
on different partitions of the dataset. It is therefore important to
formulate algorithms to use SIMD instructions to the fullest extent
possible.
3.1 SIMD Instruction Set
Relevant to efficient solver design are floating point operations such as
vector add and multiply. Modern processors compute four single-precision
floating point operations in one go, whereas later models such as the Intel
Core i7 support 8-way single-precision processing using Advanced Vector
Extensions (AVX). The latest Intel AVX-512 instruction set, to be
implemented in the upcoming Intel Xeon Phi processor family, is able to
perform 16-way single precision floating point operations in a clock cycle.
Table 3.1 details SIMD assembly instructions for common mathematical
operations on some processor architectures.

Table 3.1: 4-way single precision floating point SIMD assembly instructions
in common architectures

Instruction             Intel Atom SSE   Power Arch. AltiVec   ARM NEON   OpenGL ARB
Addition                ADDPS            VADDFP                VADD       ADD
Multiplication          MULPS            VMADDFP               VMUL       MUL
Division                DIVPS            -                     -          -
Maximum                 MAXPS            VMAXFP                VMAX       MAX
Compare                 CMPPS            VCMP*                 VC*        CMP
Reciprocal square root  RSQRTPS          VRSQRTEFP             VRSQRTE    -
3.2 Eliminating if-else Using Vectorization
Consider the pseudo-code in Algorithm 1. Array x is assigned from either
a or b depending on cond, which we assume represents boolean true as 1
and false as 0. A naïve method would be to use if-else. The disadvantages
are two-fold: firstly, if-else incurs a branch penalty, which may be heavy
on architectures that lack sophisticated branch prediction hardware (e.g.
the Cell Broadband Engine synergistic processing units); secondly, SIMD
instructions could have been used, but compilers are rarely able to infer
SIMD from if-else statements.
Equation 3.1 shows a reformulation that can be vectorized. It exploits
the fact that multiplying by a one/zero achieves the effect of if-else
selection. a − b can be computed using vector subtract, and x can be
computed using
Algorithm 1: Simple if-else
for i ← 0 to n − 1 do
    if cond[i] then x[i] ← a[i];
    else x[i] ← b[i];
end
a single vector multiply-accumulate.
x[i] = b[i] + (a[i] − b[i]) × cond[i] (3.1)
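As a concrete sketch (helper name hypothetical), Equation 3.1 becomes a branch-free C loop; with no data-dependent branches, an auto-vectorizer can map it directly onto SIMD subtract and multiply-accumulate instructions.

```c
/* Branchless select per Equation 3.1: cond[i] holds 1.0f for true and
   0.0f for false, so the multiply picks a[i] or b[i] without branching. */
void select_blend(float *x, const float *a, const float *b,
                  const float *cond, int n) {
    for (int i = 0; i < n; i++)
        x[i] = b[i] + (a[i] - b[i]) * cond[i];
}
```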
Algorithm 2 shows an if-else statement that is used in the proposed ℓ1
solver presented in chapter 4. Two modifications distinguish it from
Algorithm 1: firstly, the if-statement contains a min-term and the
else-statement a max-term; secondly, the arguments of these two functions
differ. Despite these complications, the snippet can be transformed into
vectorized code by applying mathematical identities via a series of
transformations outlined in table 3.2. All functions are converted into
min-form with matching parameters, and the remaining differences are
elegantly handled by computing the sign and magnitude of
Δx*_{λ=∞}[i] − Δx*_{λ=0}[i]. The final expression can be expressed using
commonly available vector instructions such as absolute, minimum and
compare. Data operands like Δx*_{λ=0}[i], Δx*_{λ=∞}[i] − Δx*_{λ=0}[i] and
λ/(aᵢᵀaᵢ) can be stored in memory-aligned arrays which are efficiently
handled by vector instructions.
Algorithm 2: If-else with non-trivial logic
for i ← 0 to |Ω| − 1 do
    if Δx*_{λ=0}[i] ≤ Δx*_{λ=∞}[i] then
        Δx*[i] ← min(λ/(aᵢᵀaᵢ) + Δx*_{λ=0}[i], Δx*_{λ=∞}[i]);
    else
        Δx*[i] ← max(−λ/(aᵢᵀaᵢ) + Δx*_{λ=0}[i], Δx*_{λ=∞}[i]);
end
Table 3.2: A sequence of transformations vectorizing algorithm 2

(Initial)
    If:   min(λ/(aᵢᵀaᵢ) + Δx*_{λ=0}[i], Δx*_{λ=∞}[i])
    Else: max(−λ/(aᵢᵀaᵢ) + Δx*_{λ=0}[i], Δx*_{λ=∞}[i])
    Condition: Δx*_{λ=0}[i] ≤ Δx*_{λ=∞}[i]

Apply max(a, b) ≡ −min(−a, −b):
    If:   min(λ/(aᵢᵀaᵢ) + Δx*_{λ=0}[i], Δx*_{λ=∞}[i])
    Else: −min(λ/(aᵢᵀaᵢ) − Δx*_{λ=0}[i], −Δx*_{λ=∞}[i])

Apply min(a, b) ≡ min(a ± c, b ± c) ∓ c:
    If:   min(λ/(aᵢᵀaᵢ), Δx*_{λ=∞}[i] − Δx*_{λ=0}[i]) + Δx*_{λ=0}[i]
    Else: −min(λ/(aᵢᵀaᵢ), Δx*_{λ=0}[i] − Δx*_{λ=∞}[i]) + Δx*_{λ=0}[i]

Apply (x if x ≥ 0, else −x) ⇔ |x| together with equation 3.1:
    Δx*_{λ=0}[i] + sign(Δx*_{λ=∞}[i] − Δx*_{λ=0}[i]) .* min(λ/(aᵢᵀaᵢ), |Δx*_{λ=∞}[i] − Δx*_{λ=0}[i]|)
    The expression Δx*_{λ=∞}[i] − Δx*_{λ=0}[i] serves as the conditioning
    statement.
3.3 Limitations of SIMD Processing
One has to be mindful of the limitations of floating-point SIMD
instructions. In most cases, processors expect memory arrays holding vector
operands to be aligned. Processors that can take in unaligned packed data
often process them at suboptimal speed. Vector division is poorly supported
because division hardware is area-intensive. Even if vector division is
available, the operation has long latency and low throughput, making
sequential use of vector division time-consuming. Support for
double-precision vector operations is available on most architectures, but
at half the throughput of single-precision computation. If the compiler is
not intelligent enough to vectorize code, one has to resort to
vendor-supplied intrinsics within C code.
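One hedged way to sidestep the alignment pitfall in C is to allocate vector operands with C11 aligned_alloc (helper name illustrative), rounding the size up to a multiple of the alignment as the standard requires.

```c
#include <stdlib.h>

/* Illustrative helper: a float array aligned for 4-way SIMD loads.
   C11 requires the allocation size to be a multiple of the alignment,
   hence the round-up; 16 bytes suits 128-bit NEON/SSE vectors. */
float *alloc_simd_array(size_t n) {
    size_t bytes = ((n * sizeof(float) + 15) / 16) * 16;
    return aligned_alloc(16, bytes);
}
```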
Chapter 4
Proposed BPDN Solver
To address the constraints of implementing a BPDN solver for real-time
embedded applications, the proposed solver is formulated such that
computationally intensive bottlenecks are amenable to SIMD vectorization,
efficient pipelined processing and data-flow parallelization. The use of
transcendental functions is eschewed and floating point divisions are kept
to a minimum. Economical memory usage is ensured by in-place manipulation
of A and judicious pre-computation of intermediate results.
4.1 Overview
At all times the solver maintains a prediction of the set of non-zero
entries in the sparse x, denoted Ω. At the beginning, x is initialized to a
rough estimate and Ω is updated from this estimate. Thereafter, the
algorithm alternates between two phases: during the first phase, Equation
1.3 is solved with the additional constraint that entries indexed by Ωc
are 0. In the second phase, Ω is intelligently updated based on the current
estimate of x, gradually introducing more correct non-zeros after each
iteration. The algorithm iterates until x converges (Algorithm 3).
Algorithm 3: Proposed Solver Overview
Input: y ∈ ℝᵐ, A ∈ ℝ^{m×n}, λ ∈ ℝ⁺ from Equation 1.3, number of sparse
entries num_nonzeros, convergence parameter η
Output: Optimal x* satisfying Equation 1.3
begin
    x ← (Aᵀy) ./ diag(AᵀA);
    [∼, sorted_index] ← sort(|x|, 1, 'descend');
    Ω ← sorted_index(1:num_nonzeros);
    idx ← sorted_index(1);
    while x has not converged do
        x ← EstimateFromNonzeros(y, A, x, λ, Ω, η);
        Ω ← GuessNonzeros(y, A, x, λ, Ω, idx, num_nonzeros);
    end
    return x;
end
4.2 Initializing x and Ω
An approximate solution to Equation 1.3 is x = (AᵀA)†Aᵀy. If the problem is
designed such that A has mutually incoherent columns (as would be the case
for compressive sensing applications), AᵀA would be strongly diagonal and
the pseudo-inverse can be approximated as diag(1 ./ diag(AᵀA)). The i-th
entry of x can then be efficiently computed as aᵢᵀy/(aᵢᵀaᵢ) using
vectorization. Ω is populated by picking the largest entries of x sorted by
magnitude.
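The initialization above can be sketched in C as follows (function name and layout assumptions mine: column-major A, plain loops standing in for the vectorized BLAS calls).

```c
#include <stddef.h>

/* Sketch of the Section 4.2 initialization: x[i] = (a_i . y) / (a_i . a_i)
   for each column a_i of a column-major A (m rows, n columns). */
void init_estimate(const float *A, const float *y, float *x, int m, int n) {
    for (int j = 0; j < n; j++) {
        const float *aj = A + (size_t)j * m;   /* column j of A */
        float num = 0.0f, den = 0.0f;
        for (int i = 0; i < m; i++) {
            num += aj[i] * y[i];               /* a_j . y   */
            den += aj[i] * aj[i];              /* a_j . a_j */
        }
        x[j] = num / den;
    }
}
```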
4.3 Phase I: Solving for x Given Ω
Given that only entries indexed by Ω are non-zero, x can be incrementally
solved by minimizing Equation 1.3 one variable at a time. Equation 1.3 is
differentiated with respect to the perturbation Δxᵢ (where xᵢ = xᵢ₀ + Δxᵢ,
i ∈ Ω, and y₀ = y − A(:,Ω) xΩ₀) to give:

    df/dΔxᵢ = −y₀ᵀaᵢ + aᵢᵀaᵢ Δxᵢ + λ sign(xᵢ₀ + Δxᵢ)    (4.1)
The optimal perturbation Δxᵢ* is obtained when Equation 4.1 vanishes or,
due to the gradient discontinuity introduced by the modulus term, when the
derivative changes sign at Δxᵢ = −xᵢ₀. Determining Δxᵢ* can be reasoned as
follows: setting λ = 0 would mean only the quadratic term is minimized,
giving Δx*_{i,λ=0} = y₀ᵀaᵢ/(aᵢᵀaᵢ); setting λ = ∞ would mean only the
modulus term is minimized, giving Δx*_{i,λ=∞} = −xᵢ₀. Thus Δxᵢ* lies
between Δx*_{i,λ=0} and Δx*_{i,λ=∞}, and can be computed from Equation 4.2.

    Δxᵢ* = min((λ + y₀ᵀaᵢ)/(aᵢᵀaᵢ), Δx*_{i,λ=∞}),   if Δx*_{i,λ=0} ≤ Δx*_{i,λ=∞}
    Δxᵢ* = max((−λ + y₀ᵀaᵢ)/(aᵢᵀaᵢ), Δx*_{i,λ=∞}),  otherwise    (4.2)

    Δx*_Ω = Δx*_{Ω,λ=0} + sign(Δx*_{Ω,λ=∞} − Δx*_{Ω,λ=0}) .*
            min(λ ./ diag(A(:,Ω)ᵀA(:,Ω)), |Δx*_{Ω,λ=∞} − Δx*_{Ω,λ=0}|)    (4.3)
By applying the identities max(a, b) ≡ −min(−a, −b) and
min(a, b) ≡ min(a − c, b − c) + c, Equation 4.2 can be vectorized to give
Equation 4.3. If the matrix A remains constant from problem to problem
(commonly the case for compressive sensing applications), the reciprocals
1/(aᵢᵀaᵢ) can be precomputed to avoid performing expensive divisions. The
optimal perturbations of every variable, Δx*_Ω, are aggregated to update x,
and this is repeated until convergence. Because one variable is perturbed
at a time, applying all changes at once does not yield the optimal
solution; the change is therefore weighted to ensure convergence, i.e.
xΩ ← xΩ + η Δx*_Ω, 0 < η < 1. Algorithm 4 summarizes what has been
described so far. Note that b is loop invariant and can be pre-computed in
the program preamble.
Algorithm 4: EstimateFromNonzeros(y, A, x, λ, Ω, η)
Input: y ∈ ℝᵐ, A ∈ ℝ^{m×n}, current estimate of x ∈ ℝⁿ, λ ∈ ℝ⁺ from
Equation 1.3, indices of sparse entries Ω, convergence parameter η
Output: Optimal solution to Equation 1.3 among the family of x with
non-zero entries in Ω
b ← diag(AᵀA);
begin
    c ← A(:,Ω)ᵀ y;
    while xΩ has not converged do
        Δx*_{Ω,λ=0} ← (c − (A(:,Ω)ᵀA(:,Ω)) xΩ) ./ bΩ;
        Δx*_{Ω,λ=∞} ← −xΩ;
        Δx*_Ω ← Δx*_{Ω,λ=0} + sign(Δx*_{Ω,λ=∞} − Δx*_{Ω,λ=0}) .*
                min(λ ./ bΩ, |Δx*_{Ω,λ=∞} − Δx*_{Ω,λ=0}|);
        xΩ ← xΩ + η Δx*_Ω;
    end
    xΩc ← 0;
    return x;
end
4.4 Phase II: Updating Ω Given x
The current Ω may not contain all the non-zero entries of the underlying
true solution, therefore phase two refines Ω based on the current x. The
gist is to optimize Equation 1.3 with respect to a pair (xᵢ, xⱼ) with the
rest held constant. A large magnitude for xᵢ or xⱼ indicates a higher
probability that the respective variable should be included in Ω. To solve
for (xᵢ, xⱼ), consider λ = 0, where Equation 1.3 simplifies to the
ℓ2-penalty f(xᵢ, xⱼ) = ½‖y − Ax‖₂². Letting b = −Aᵀy and
a₀ = Ax − aᵢxᵢ − aⱼxⱼ, the global minimum occurs at (x^ℓ2_i, x^ℓ2_j)
satisfying Equation 4.4, a linear system that can be easily solved.

    ∂f/∂xᵢ = ‖aᵢ‖² x^ℓ2_i + (aᵢ·aⱼ) x^ℓ2_j + a₀·aᵢ + bᵢ = 0
    ∂f/∂xⱼ = (aᵢ·aⱼ) x^ℓ2_i + ‖aⱼ‖² x^ℓ2_j + a₀·aⱼ + bⱼ = 0    (4.4)
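Equation 4.4 is a symmetric 2 × 2 linear system; a hedged C sketch (names mine) solves it by Cramer's rule, with the right-hand sides r = −(a₀·a + b) moved over.

```c
#include <math.h>

/* Solve the symmetric 2x2 system of Equation 4.4:
   [ g11 g12 ] [xi]   [ r1 ]    with g11 = |a_i|^2, g12 = a_i.a_j,
   [ g12 g22 ] [xj] = [ r2 ],   r1 = -(a0.a_i + b_i), r2 = -(a0.a_j + b_j).
   Returns 0 on success, -1 if the two columns are nearly collinear. */
int solve_pair(double g11, double g12, double g22,
               double r1, double r2, double *xi, double *xj) {
    double det = g11 * g22 - g12 * g12;
    if (fabs(det) < 1e-12) return -1;
    *xi = (r1 * g22 - g12 * r2) / det;
    *xj = (g11 * r2 - g12 * r1) / det;
    return 0;
}
```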
When the ℓ1 penalty term is included, the minimum point shifts from
(x^ℓ2_i, x^ℓ2_j) to (x^ℓ2ℓ1_i, x^ℓ2ℓ1_j) according to Equation 4.5.

    ‖aᵢ‖² x^ℓ2ℓ1_i + (aᵢ·aⱼ) x^ℓ2ℓ1_j + a₀·aᵢ + bᵢ + λ sign(x^ℓ1_i) = 0
    (aᵢ·aⱼ) x^ℓ2ℓ1_i + ‖aⱼ‖² x^ℓ2ℓ1_j + a₀·aⱼ + bⱼ + λ sign(x^ℓ1_j) = 0    (4.5)

If x^ℓ2_j and x^ℓ2ℓ1_j are opposite in sign, xⱼ is not a candidate for Ω.
xᵢ is fixed to be the entry with the largest magnitude from the
initialization in section 4.2, and xⱼ is independently solved for all
variables. Ω is subsequently updated by selecting the largest magnitudes
among x^ℓ2ℓ1. Algorithm 5 outlines what has been described. Due to loop
invariance, b, c and d can be pre-computed in the program preamble.
Algorithm 5: GuessNonzeros(y, A, x, λ, Ω, idx, num_nonzeros)
Input: y ∈ ℝᵐ, A ∈ ℝ^{m×n}, current estimate of x ∈ ℝⁿ, λ ∈ ℝ⁺ from
Equation 1.3, indices of sparse entries Ω, index of the entry with the
largest initial magnitude idx, number of sparse entries num_nonzeros
Output: Updated index set Ω of predicted non-zero entries
b ← AᵀA(:,idx);  c ← Aᵀy;  d ← diag(AᵀA);
begin
    a ← Aᵀ(A(:,Ω) xΩ);
    x^ℓ2 ← b .* (a_idx − (b .* x) − c_idx) − d_idx (a − (d .* x) − c);
    x^ℓ2ℓ1 ← x^ℓ2 + λ sign(x_idx) b − λ d_idx sign(x^ℓ2);
    x_nonsparse ← (x^ℓ2 .* x^ℓ2ℓ1) > 0;
    [∼, sorted_index] ← sort(|x_nonsparse .* x^ℓ2ℓ1|, 1, 'descend');
    Ω ← sorted_index(1:num_nonzeros);
    return Ω;
end
Figure 4.1 shows an animation of how the proposed algorithm converges. The
underlying BPDN solution is a randomly generated sparse vector with a
hundred thousand variables, a hundred non-zero entries and ten thousand
constraints.

Figure 4.1: Animation — Convergence towards the true solution (red boxes)

At the beginning of every major iteration (i.e. Phase II), the solver
introduces new non-zeros based on the current trial solution (coloured
solid red). After that, Phase I iterates till convergence. Note that some
of the introduced non-zeros correctly converge to the underlying solution
(red boxes), while the others converge to zero. At the end of the first
major iteration, there are a handful of unmatched non-zeros, i.e. red boxes
with no matching blue circles. The second Phase II therefore introduces
more non-zeros to guess the unmatched non-zeros. This alternation between
Phase I and II continues till the objective function converges.
4.5 Estimating the Sparsity of x
The solver requires knowing the sparsity of x, a piece of information that
most solvers ignore (except OMP and Homotopy). It is therefore reasonable
to question whether the problem sparsity can be determined beforehand. The
following are examples where the number of non-zeros is known due to the
structure of the problem.
4.5.1 Wright et al. [1]: Face recognition from a database
An input facial image needs to be identified from a database containing a
fixed number (k) of images per person. This is accomplished by encoding the
query image as y and the database images as the columns of A. The recovered
x is sparse because the face of a person correlates strongly with the few
matching images in the database, and the sparsity is given by k.
4.5.2 Wang et al. [2]: Three-dimensional arrangement of light sources in bioassays
Experimentation with biological assays requires detecting the
three-dimensional arrangement of fluorescent beads that are tagged to
molecular structures. The authors use m angle-sensitive pixels to image a
volume that is partitioned into n sections. The problem is to recover which
of these pockets contain a prominent light source, admitting a sparse
solution if the n partitions are interpreted as entries in x. Since the
number of fluorescent beads introduced into the assay is controllable, it
can serve as a proxy for the problem sparsity, with the number of beads
introduced proportional to the sparsity of the recovered x.
4.5.3 Elhamifar and Vidal [3]: Motion segmentation using sparse subspace clustering
Given several rigid bodies undergoing independent motions in front of an
affine camera, the image trajectories of n feature points across several
frames can be grouped as columns in A. Due to geometrical constraints,
trajectories of feature points residing on the same body are embedded
within a 3-dimensional affine space. The algorithm involves finding a
sparse correlation of every column of A with the others excluding itself,
and the expected sparsity is given by the dimension of the affine space.
Chapter 5
BPDN Solver Benchmark
The proposed solver is benchmarked against several state-of-the-art solvers
for run-time performance, recovery accuracy and peak memory usage. Default
settings recommended by the respective authors are used. MEX-files are used
where provided, or compiled where instructed. The benchmark runs on an
Intel Core i7-2620M (2.7 GHz) machine with 8 GB memory. MATLAB R2011b
(7.13.0.564, 64-bit) is used as the benchmarking environment.
5.1 List of Solvers
General-purpose solvers (i.e. CVX, Gurobi, GPLX) are omitted from the
benchmark since they are not customized to the intricacies of BPDN, and it
would thus be unfair to expect performance comparable to specialized
solvers. The following solvers are included in the benchmark.
5.1.1 ADMM LASSO [4]
Alternating Direction Method of Multipliers efficiently solves large scale
problems by processing sub-problems across distributed computing re-
sources.
5.1.2 CGIST [5]
Conjugate Gradient Iterative Shrinkage/Thresholding solves BPDN using a
forward-backward splitting method with an acceleration step. The test for
adjointness between A and AT is omitted because it would fail due to
finite machine precision.
5.1.3 FPC-BB [6]
Fixed-Point Continuation is advertised for large-scale image and data
processing. The solver uses Barzilai-Borwein steps to accelerate
convergence.
5.1.4 GLMNET [7]
Generalized Linear Model with elastic-net regularization is the reference
algorithm used to implement the MATLAB lasso function.
5.1.5 GPSR-BB6 [8]
Gradient Projection for Sparse Reconstruction uses special line search and
termination techniques to yield faster solutions compared to SparseLab
[41], ℓ1-MAGIC [40], the bound-optimization method [51] and interior-point
methods. Similar to FPC-BB, Barzilai-Borwein steps are used to accelerate
convergence.
5.1.6 Homotopy [9]
Homotopy refers to a class of methods that solve BPDN by solving a
sequence of intermediate problems with varying λ.
5.1.7 L1-LS [10]
L1-LS is an interior-point method for large-scale sparse problems, or dense
problems where A has structure admitting fast transform computations. The
preconditioned conjugate gradient algorithm is used to discover the search
direction.
5.1.8 OMP [11]
Although Orthogonal Matching Pursuit only approximately solves BPDN, its
recovery accuracy for Gaussian A is comparable to ℓ1-solvers and it has
excellent run-time performance. It is therefore an attractive candidate
for FPGA implementation.
5.1.9 SESOP_PACK [12]
Sequential Subspace Optimization solves large-scale smooth unconstrained
optimization problems.
5.1.10 SPAMS [13]
Sparse Modeling Software is a MATLAB toolbox for sparse recovery problems.
Its C++ library makes use of the Intel Math Kernel Library for floating
point computations. In this respect, SPAMS is a strong contender for
run-time performance.
5.1.11 SpaRSA2 [14]
Sparse Reconstruction by Separable Approximation is an iterative method
where each step is an optimization sub-problem involving a separable
quadratic term plus the sparsity-inducing term. This solver is recommended
for cases where the sub-problem can be efficiently solved.
5.1.12 TFOCS [15]
Templates for First-Order Conic Solvers provides a set of modules that can
be mixed-and-matched to create customized solvers.
5.1.13 TwIST2 [16]
Two-Step Iterative Shrinkage/Thresholding implements a nonlinear two-step
iterative version of the original iterative shrinkage/thresholding
procedure to provide faster convergence for ill-conditioned problems.
5.1.14 YALL1 [17]
Your Algorithms for L1 is a suite of solvers that use alternating direction
algorithms, with the option of enforcing joint sparsity among related
variables.
5.2 Test Input
For various m, n and s, the entries of A are drawn from the standard
normal distribution. Positions of the non-zero entries in x are randomly
picked, and their values follow a uniform distribution on the interval
(−1, 1). The ideal measurements y = Ax are corrupted by scaled Gaussian
noise of zero mean and 0.1 variance.
5.3 Run-Time Performance
The overall run-time complexity of the proposed solver is O(k1(mn + k2 s)),
where k1 is the number of iterations in Algorithm 3 and k2 is the number of
iterations in Algorithm 4. From Figure 5.1, it is evident that the proposed
solver has superior run-time performance over state-of-the-art BPDN solvers
due to its extensive use of matrix multiplication and vectorized
operations. For large problems, the proposed solver is at least ten times
faster than the next fastest solver.
[Plot: CPU run time (seconds, log scale) versus problem size, for (m, n, s)
ranging from (500, 5000, 5) to (10000, 100000, 100); solvers compared:
proposed, admm_lasso, cgist, fpc_bb, glmnet, gpsr_bb6, homotopy, l1ls, omp,
sesop_pack, spams, sparsa2, tfocs, twist2, yall1]
Figure 5.1: Run-time for various problem sizes, where A ∈ Rm×n and there are s non-zero entries in x
Table 5.1: Run-time profiling results on MATLAB

#  Time Taken (s)  % Total Time  Operation
1  14.771          43.2          Aᵀ(A(:,Ω) xΩ)
2  7.669           22.4          diag(AᵀA)
3  4.878           14.3          AᵀA(:,idx)
4  4.863           14.2          Aᵀy
5  0.554           1.6           A(:,Ω)ᵀA(:,Ω)
6  1.431           4.2           Rest of the code
5.4 Algorithm Bottleneck
The solver is profiled in MATLAB for the problem size m = 10000 and
n = 100000. Table 5.1 shows the run-time summary for 20 consecutive trials.
About 95% of the run-time is spent on matrix-matrix and matrix-vector
multiplications, which are excellent candidates for speed-up on
SIMD-capable embedded processors or for acceleration on programmable logic.
5.5 Accuracy of Recovered Results
For varying problem sparsity, x is recovered and debiased. Recovery
accuracy is expected to decrease as problem sparsity increases because the
problem progressively enters the ill-conditioned region of the
Donoho-Tanner phase transition diagram [52]. The error measure between the
debiased solution x_recovered and the underlying true solution x_actual is
given by ‖x_actual − x_recovered‖₂ / n. From Figure 5.2, the proposed
solver has superior recovery accuracy compared to all the other solvers.
[Plot: recovery error ‖x_actual − x_recovered‖₂/n (log scale) versus
problem sparsity, for m = 1500, n = 15000 and s from 5 to 290; same set of
solvers as Figure 5.1]
Figure 5.2: Recovery accuracy for various problem sparsity, where A ∈ Rm×n and there are s non-zero entries in x
5.6 Memory Usage
The proposed solver requires O(n + sm + s²) of additional memory besides
storing A and y, making the implementation memory efficient if the
underlying x is sparse. The solver does not require advanced linear
algebraic decompositions (i.e. LU, Cholesky or singular value
decomposition), hence there is no hidden memory requirement. Because the
solver caches pre-computed values, its memory footprint is not the lowest,
but it nevertheless remains highly competitive among the benchmarked
solvers (Figure 5.3). Due to the use of dynamic memory allocation in some
solvers, memory measurements fluctuate with increasing problem size,
whereas solvers that preallocate memory exhibit a monotonic increase in
memory usage.
[Plot: peak memory usage (kB, log scale) versus problem size, for (m, n, s)
ranging from (500, 5000, 5) to (10000, 100000, 100); same set of solvers as
Figure 5.1]
Figure 5.3: Peak memory usage for various problem sizes, where A ∈ Rm×n and there are s non-zero entries in x
[Plot: objective ½‖y − Ax_estimate‖₂² + λ‖x_estimate‖₁ (log scale) versus
inner-loop iteration, for η = 0.2, 0.4, 0.6, 0.8]
Figure 5.4: Convergence of proposed solver of test case m = 10000, n =100000 and s = 100 for various η
5.7 Convergence Properties of Solver
For the test case where m = 10000, n = 100000 and s = 100, the number of
outer iterations (k1) is around 3 regardless of η, whereas the number of
inner iterations (k2) within EstimateFromNonzeros ranges from 20 to 80 for
the first outer iteration as η decreases, and progressively decreases over
subsequent outer iterations (Figure 5.4). Given a well-tuned η, convergence
is achievable within 40 iterations.
Chapter 6
Solver Implementation on the Xilinx Zynq Z-7020
The solver is implemented on the ZedBoard development board (Figure 6.1),
comprising a Zynq XC7Z020-CLG484-1 All Programmable System-on-Chip with
512 MB of DDR3 memory. The Z-7020 chip features a dual-core ARM Cortex-A9
MPCore that is tightly coupled with Artix-7 FPGA fabric. Each core has
separate 32 kB L1 instruction and data caches, and both share a unified
512 kB L2 cache. Matrix and vector operations are efficiently handled by
BLAS libraries that use the NEON SIMD engine on board each CPU. Custom
hardware data-paths can be instantiated within the FPGA fabric if hardware
acceleration is necessary. The CPU is clocked at 667 MHz and the FPGA
fabric at 125 MHz. All implementations are benchmarked with respect to the
problem size m = 500, n = 5000, s = 75. The reference MATLAB solver takes
0.19 seconds to complete on the i7-2620M.
6.1 Porting MATLAB to C
The solver is ported to C code using the Embedded Coder v6.1 toolbox.

Figure 6.1: ZedBoard hardware development platform

On the Z-7020, the compiled executable occupies 93 kB of .text memory and
1.93 MB of .bss memory, and takes 1.84 seconds to run. Since matrix
multiplication has been identified as the run-time bottleneck, an improved
BLAS library is sought to replace the stock library provided by MATLAB.
6.2 Eigen BLAS Library
Eigen1 is an open-source library providing optimized assembly routines for
matrix operations. The library supports hardware vectorization on ARM
targets. Run-time benchmarks by its authors show that Eigen outperforms
the Intel Math Kernel Library for operations such as y ← αx + βy, y ← Ax
and Y ← AAᵀ, therefore this library is chosen to replace MATLAB's BLAS
library. The Eigen-compiled executable occupies 29 kB of .text memory and
40 kB of .bss memory, and takes 0.30 seconds to complete. The compact
.text program size allows the solver to be fully loaded within the L1
instruction cache, ensuring fast program execution without expensive
memory fetches. The pre-computed .bss data structures used by the solver
also fit economically within the L2 cache.

1 http://eigen.tuxfamily.org/
6.3 Accelerating A(:,Ω)ᵀA(:,Ω) Using Programmable Logic
Table 6.1 shows the run-time summary of the solver running on a single
Cortex-A9 CPU without FPGA acceleration; 89% of the overall run-time
is spent executing Eigen library code. Further profiling using the Snoop
Control Unit's global timer reveals that the matrix operations Aᵀ(A(:,Ω)xΩ)
and A(:,Ω)ᵀA(:,Ω) occupy 34% and 55% of the run-time respectively.
Aᵀ(A(:,Ω)xΩ) cannot be further accelerated because for every entry of A
read from memory, only one multiply-and-accumulate (MAC) is performed,
making the operation I/O-bound. A ballpark estimate illustrates the
problem: the maximum read bandwidth from DDR memory to the
programmable logic over an AXI_HP interface is 1.2 GB/s [53, §22.3], so
it takes 8.3 ms to deliver A to the programmable logic. The average
number of fetches of A per run is around 10, giving a paltry speed-up of
1.23×. It is possible in principle to utilize all four AXI_HP buses on the
Z-7020 to achieve a 4.9× speed-up, but doing so would deprive other
hardware modules of the routing resources needed to interface with the
CPU. In contrast, A(:,Ω)ᵀA(:,Ω) is compute-bound because s(s+1)/2
MACs have to be performed for every s entries read, making it an
excellent candidate for FPGA acceleration.

Table 6.1: Run-time profiling results on the Z-7020 using Gprof

  % Total Time   Function
  45.8           Eigen::general_matrix_vector_product
  40.9           Eigen::gebp_kernel
  10.4           Preamble of fastBPDN
   1.47          Eigen::gemm_pack_lhs
   0.95          Eigen::gemm_pack_rhs
   0.48          Rest of the code
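The ballpark arithmetic above can be reproduced directly. A minimal sketch, assuming the benchmark instance m = 500, n = 5000 with single-precision (4-byte) entries streamed over one AXI_HP port; the 1.23× figure additionally depends on the measured software run-time, so only the transfer cost is reproduced here:

```python
# Back-of-envelope check of the AXI_HP bandwidth estimate (illustrative).
AXI_HP_BW = 1.2e9            # B/s, max read bandwidth into programmable logic
m, n = 500, 5000             # benchmark problem size

bytes_A = m * n * 4                       # 10 MB for the whole of A
transfer_ms = bytes_A / AXI_HP_BW * 1e3   # ~8.3 ms to deliver A once
per_run_ms = 10 * transfer_ms             # ~10 fetches of A per solver run
```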
6.3.1 Multiply-And-Accumulate Hardware Engine
Figure 6.2 shows a hardware engine capable of parallelizing A(:,Ω)ᵀA(:,Ω)
by pipelining 9 MACs per clock cycle. The matrix A(:,Ω)ᵀA(:,Ω) is
partitioned into 3 × 3 sub-matrices, and the engine computes all elements
of a sub-matrix in parallel. Since A(:,Ω)ᵀA(:,Ω) is symmetric, sub-matrices
lying on and above the main diagonal are computed on the FPGA, and
entries beneath the diagonal are populated by the CPU, which replicates
the above-diagonal entries. Inputs from A(:,Ω) are served over the
AXI_ACP bus and results are written back over the same interface.
Transactions over the AXI_ACP are cache coherent because the bus has
access to the Snoop Control Unit (SCU) that governs both caches. Prior
to operation, the engine is configured by the CPU over the AXI_GP
interface.
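This work split can be sketched in software. The following is an illustrative model, not the author's RTL: blocks on and above the diagonal of G = A(:,Ω)ᵀA(:,Ω) are computed (the FPGA's job), then mirrored below the diagonal (the CPU's job).

```python
B = 3  # sub-matrix size, matching the 3 x 3 engine

def gram_symmetric(A):
    """A: m x s matrix as a list of rows, with s a multiple of B."""
    m, s = len(A), len(A[0])
    G = [[0.0] * s for _ in range(s)]
    for bi in range(0, s, B):
        for bj in range(bi, s, B):            # on- and above-diagonal blocks
            for i in range(bi, bi + B):
                for j in range(bj, bj + B):   # 9 parallel MACs in hardware
                    G[i][j] = sum(A[k][i] * A[k][j] for k in range(m))
    for i in range(s):
        for j in range(i):                    # CPU mirrors the upper triangle
            G[i][j] = G[j][i]
    return G
```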
6.3.2 Detailed Operation
In the first stage of operation, the MAC engine populates internal BRAM
with entries from A(:,Ω). It makes sense to stage A(:,Ω) since every entry will
be used s times, thus minimizing the amount of reads from the slow DDR
memory. The transfer occurs over the AXI_ACP bus interface which
arbitrates transfers between the DDR memory and the programmable logic.
Burst transfer commands are issued for this initial transaction to ensure
high throughput.

Figure 6.2: Hardware engine comprising 9 floating-point MAC units, fed
by three dual-ported BRAM banks over the AXI_ACP bus
The BRAMs are partitioned into three banks so as to independently serve
each of the nine multipliers. The columns of A(:,Ω) are stored in
alternating banks: A(:,1), A(:,4), . . . , A(:,3k+1) reside in bank 1, A(:,2),
A(:,5), . . . , A(:,3k+2) in bank 2 and A(:,3), A(:,6), . . . , A(:,3k+3) in bank 3.
Since computing an off-diagonal submatrix requires two independent read
accesses from each bank, while an on-diagonal one requires a single access,
the BRAMs are configured to be dual-ported. The submatrices are then
processed in row-major order. For every submatrix, nine MAC operations
are executed in parallel over m consecutive entries from the BRAM.
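The interleaving and the dual-port requirement can be checked with a small sketch (illustrative only): column j of A(:,Ω) sits in bank ((j − 1) mod 3) + 1, so each bank supplies at most two columns per sub-matrix.

```python
def bank_of(j):
    """1-based column index of A(:,Omega) -> BRAM bank 1, 2 or 3."""
    return (j - 1) % 3 + 1

def reads_per_bank(bi, bj):
    """Columns each bank must serve while the engine works on the sub-matrix
    pairing block-column bi with block-column bj (0-based block indices)."""
    cols = {3 * bi + t for t in (1, 2, 3)} | {3 * bj + t for t in (1, 2, 3)}
    counts = {1: 0, 2: 0, 3: 0}
    for c in cols:
        counts[bank_of(c)] += 1
    return counts
```

An off-diagonal block needs two simultaneous reads per bank, an on-diagonal block only one, which is exactly why dual-ported BRAM suffices.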
Parallelization can be extended to processing p × p submatrices using p²
multipliers, with the BRAMs divided into p banks, the i-th bank storing
entries from A(:,pk+i). A larger speed-up can be achieved, but the
disadvantages are two-fold. Firstly, more DSP resources are needed (5
DSP48E1 slices are required to synthesize a single-precision MAC), and
timing closure becomes harder to achieve because a larger mesh demands
more routing resources and therefore invites congestion. Secondly, a larger
degree of parallelization leads to poorer work efficiency because more
redundant entries are computed, evident from the growing number of
redundant entries (coloured purple in Figure 6.3) as the partition size
increases.
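The work-efficiency trade-off admits a closed form. A sketch of the counting argument: with p × p sub-matrices of the s × s symmetric result (s a multiple of p), each diagonal block computes its strictly-lower triangle redundantly, giving s(p − 1)/2 redundant entries in total.

```python
def redundant_entries(s, p):
    """Redundantly computed entries for p x p partitioning; equals s*(p-1)/2."""
    return (s // p) * (p * (p - 1) // 2)

def work_efficiency(s, p):
    useful = s * (s + 1) // 2                 # entries actually needed
    return useful / (useful + redundant_entries(s, p))
```

For the benchmark size s = 75 this gives roughly 0.97 efficiency at p = 3 but only about 0.51 at p = 75, matching the growth of purple entries across the panels of Figure 6.3.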
Each MAC output is demultiplexed to a bank of 8 registers (Figure 6.4)
because each MAC has a finite latency of 5 clock cycles before producing
a result. By spreading the accumulation across 8 registers, the multiplier
mesh can operate at maximum throughput without data hazards. A
post-processing step combines the 8 partial sums before the result is
written back to memory over the AXI_ACP bus.
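A behavioural sketch of this latency-hiding trick (illustrative, not the RTL): with a 5-cycle result latency, a single accumulator could accept a new product only every 5 cycles, so successive products round-robin into 8 partial-sum registers, selected by the 3-bit up-counter, and are combined at the end.

```python
MAC_LATENCY = 5
N_REGS = 8                           # 8 >= MAC_LATENCY, so no data hazard

def pipelined_dot(xs, ys):
    regs = [0.0] * N_REGS
    for t, (x, y) in enumerate(zip(xs, ys)):
        regs[t % N_REGS] += x * y    # each register is revisited every 8 cycles
    return sum(regs)                 # post-processing: combine partial sums
```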
6.3.3 Specification of MAC Engine
Figure 6.5 and Figure 6.6 show the post-routed design comprising both
the Cortex-A9 CPU and the MAC engine, and Table 6.2 details the engine
specifications. The MAC engine provides a seven-fold speed-up over the
software implementation of A(:,Ω)ᵀA(:,Ω), while utilizing a modest 7% of
the slice logic available on the Z-7020. Although it is possible to increase
the hardware speed-up by enlarging the sub-matrix size, a larger multiplier
mesh would have to be synthesized. By running the accelerator in parallel
with CPU code, the solver takes 0.14 seconds to complete the test instance,
a speed-up of 26% over the i7-2620M. Although the speed-up may seem
modest, keep in mind that the clock frequency of the Cortex-A9 on board
the Z-7020 (667 MHz) is a quarter of that of the i7-2620M (2.7 GHz), and
the power consumption of the i7-2620M (35 W²) is 114 times that of the
Z-7020 (305 mW³). Hence, the solver implementation on the Z-7020 has
been optimized for low-power applications without sacrificing run-time
performance. Note that the Z-7020's power figure is pessimistic, as the
hardware logic is only active for 17% of the final run-time; clock gating
can be used to power down FPGA logic during inactivity.

Figure 6.3: Submatrix partitioning of A(:,Ω)ᵀA(:,Ω) for sizes 1 × 1, 3 × 3,
5 × 5, 15 × 15, 25 × 25 and 75 × 75. Redundantly computed entries are
coloured purple.

Figure 6.4: Output of every MAC being demultiplexed to a bank of 8
registers, selected by a 3-bit up-counter, with inputs from the BRAMs and
output to the AXI_ACP bus

² http://ark.intel.com/products/52231
³ http://www.arm.com/products/processors/cortex-a/cortex-a9.php
Figure 6.5: Post-routed layout on the Z-7020 using Vivado 2013.2
(Legend: yellow – MAC engine, brown – ARM Cortex-A9 and DDR3
memory bus, green – AXI_ACP interconnect logic, blue – AXI_GP
interconnect logic, red – reset logic)
Figure 6.6: System module schematic (Legend: processing_system7_1 –
ARM Cortex-A9, matmul_1 – hardware MAC engine, axi_mem_intercon –
AXI_ACP bus logic, processing_system7_1_axi_periph – AXI_HP bus
logic, proc_sys_reset – reset logic)
Table 6.2: Hardware MAC engine specifications

  Operating Frequency            125 MHz
  Power                          55 mW

  Resource Usage
    Slice Logic                  13735 (7%)
    Block RAM (RAMB36E1)         48 (34%)
    DSP Slices (DSP48E1)         46 (21%)

  Timing
    Speed-Up                     7.00×
    Initiation Interval          224680 cycles
    Latency                      224679 cycles
    Worst Negative Slack         0.280 ns
    Worst Hold Slack             0.052 ns
    Worst Pulse Width Slack      2.750 ns
6.4 Remaining Bottleneck
After applying FPGA acceleration, the remaining bottleneck of the
embedded ℓ1-solver is the computation of Aᵀ(A(:,Ω)xΩ), which takes up
75% of the run-time. As this operation is I/O-bound, FPGA acceleration
is ineffective because the memory bandwidth between the DDR memory
and the programmable logic is a mere 4.8 GB/s. A computational platform
with a larger memory bandwidth is required, which leads to the choice of
the graphics processing unit (GPU) as an alternative implementation
platform for the solver.
Chapter 7
Solver Implementation on NVIDIA CUDA GPU Architecture
One defining feature of the GPU is its large memory bandwidth that services
the many processing cores (Table 7.1). Given that a GPU comprises many
shader cores running in parallel, each possibly accessing different texture
memory locations, memory bandwidth has to scale up to keep pace with
the memory transactions issued by every core.
In this respect, the GPU excels at I/O-bound tasks thanks to its large
memory bandwidth. For example, the matrix-vector multiplication Ax
Table 7.1: Comparison of memory bandwidth for various hardware systems

  Name                 Category                  Launch  Process  Standard  Bandwidth (GB/s)
  Zynq-7100 AP SoC     FPGA with integrated CPU  Mar-13  28 nm    DDR3      10.7
  KeyStone 66AK2H12    DSP with integrated CPU   Nov-12  28 nm    DDR3      12.8
  Intel Core i7-4770R  CPU with integrated GPU   Jun-13  22 nm    DDR3L     25.6
  GeForce GTX 780 Ti   GPU                       Nov-13  28 nm    GDDR5     336
involves fetching every entry of A from memory to the processor, thus the
execution speed is limited by the amount of memory bandwidth available
to deliver every entry to the execution cores.
A natural extension from parallel computing with SIMD instructions is to
perform vectorized operations on the GPU. The typical GPU has multiple
shader cores, extending the 4-way SIMD processing of the CPU to n-way
parallel computing. For example, the embedded ARM Mali-450 MP
graphics processor has 8 fragment processors, allowing 8-way
general-purpose floating-point processing; at the upper end, NVIDIA's
GRID K2 boasts 3072 thread processors.
7.1 Comparisons Between GPU and FPGA Architectures
Both architectures are similar in terms of the hardware parallelism. The
GPU has multiple stream processors which execute code in parallel, whereas
the FPGA can instantiate multiple hardware blocks that run in parallel. A
striking difference would be that the GPU has a massive amount of memory
bandwidth compared to state-of-the-art commercial FPGAs. For a start,
the NVIDIA M2050 features the GDDR5 memory standard, whereas Xilinx
supports up to DDR4 under its Kintex and Virtex UltraScale products.
Wide memory buses of the kind commonly found on GPUs are also not
available on smaller FPGA devices; the Xilinx Zynq-7000 series, for
example, has at most 128 pins dedicated to the DDR interface.
7.2 I/O-Boundedness of Aᵀ(A(:,Ω)xΩ)
We have seen in chapter 6 that the FPGA is not suited to accelerating
I/O-bound operations. One such operation is the matrix-vector
multiplication Aᵀ(A(:,Ω)xΩ). To see why this is I/O-bound, consider
Figure 7.1. Every entry of A is used only once, so the speed of computation
is limited by how fast the processor can fetch the entire A. Since this
matrix may be large, especially when the optimization problem has many
variables and constraints, it is often stored off-chip. The computational
speed is limited by the bandwidth between the processor and off-chip main
memory, and the situation does not benefit from data caching as entries
are only read once.
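The contrast with chapter 6 can be quantified in MACs performed per matrix entry fetched; the following sketch assumes the benchmark sizes m = 500, n = 5000, s = 75.

```python
m, n, s = 500, 5000, 75

# GEMV A' * v: all m*n entries of A are read once, one MAC per entry.
gemv_intensity = (m * n) / (m * n)                 # 1 MAC per entry fetched

# Gram matrix A(:,Omega)' * A(:,Omega): m*s entries read,
# m * s(s+1)/2 MACs performed, i.e. (s+1)/2 MACs per entry fetched.
gram_intensity = (m * s * (s + 1) / 2) / (m * s)   # 38 MACs per entry
```

One MAC per fetched entry leaves the GEMV starved for memory bandwidth, whereas 38 MACs per entry keeps the Gram-matrix engine busy between fetches.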
7.3 Accelerating the Level 2 BLAS Operation (GEMV) on CUDA

The Level 2 BLAS operation in question is the generalized matrix-vector
multiplication GEMV (Equation 7.1). This subroutine is often used as a
primitive (together with GAXPY and GEMM) by advanced linear algebra
libraries such as LAPACK. Since the solver implementation on the FPGA
is I/O-limited on this particular operation, with the matrix A and the
vector A(:,Ω)xΩ as inputs, it makes sense to explore ways to speed it up
on the GPU by optimizing the GEMV primitive.
y ← αAx + βy (7.1)
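A plain-Python reference for Equation 7.1 is useful as ground truth when checking any partitioned or accelerated GEMV; this is a minimal sketch, not a tuned implementation.

```python
def gemv(alpha, A, x, beta, y):
    """y <- alpha*A*x + beta*y, with A a list of rows."""
    return [alpha * sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + beta * y_i
            for row, y_i in zip(A, y)]
```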
Figure 7.1: Animation — computation of Aᵀ(A(:,Ω)xΩ). The rectangle in
the middle is Aᵀ, and the column vector on the right is A(:,Ω)xΩ. Blue
represents read and red represents write operations in the respective
memory locations.
7.4 Problem Partitioning
The matrix-vector operation Aᵀ(A(:,Ω)xΩ) is partitioned into sub-problems
that are independently processed by a one-dimensional kernel grid of n/16
thread blocks (assuming that m and n are multiples of 16), each block a
two-dimensional array of 16 × 16 CUDA threads (Figure 7.2). Aᵀ is
partitioned into 16 × 16 submatrices, and A(:,Ω)xΩ and the result vector
are partitioned into 16 × 1 vectors. Denoting Aᵀ as M and A(:,Ω)xΩ as v,
the i-th thread block processes Equation 7.2.
y = Mv, with M partitioned into 16 × 16 blocks M16i−15:16i, 16j−15:16j and
v into 16 × 1 blocks v16j−15:16j, so that

    y16i−15:16i = Σ_{k=1..m/16} M16i−15:16i, 16k−15:16k v16k−15:16k        (7.2)

and, entry-wise, the p-th element of the i-th block is

    y16(i−1)+p = Σ_{q=1..16} Σ_{k=1..m/16} M16(i−1)+p, 16(k−1)+q v16(k−1)+q   (7.3)
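The partitioning in Equation 7.2 can be sketched serially; the 16 × 16 tile is shrunk to 4 × 4 here to keep the example small, and the result is checked against a direct evaluation.

```python
T = 4  # tile size standing in for 16

def blocked_matvec(M, v):
    """M: n x m matrix (list of rows), v: length-m vector; n, m multiples of T."""
    n, m = len(M), len(v)
    y = [0.0] * n
    for bi in range(0, n, T):          # one 'thread block' per output tile
        for bk in range(0, m, T):      # the k-summation of Equation 7.2
            for p in range(T):
                y[bi + p] += sum(M[bi + p][bk + q] * v[bk + q]
                                 for q in range(T))
    return y
```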
7.5 Thread Block Operation
Figure 7.3 shows the partitioned operation at the level of the CUDA
thread block. Every thread block is in charge of computing the dot product
of a row vector of 16 × 16 submatrices with a column vector of 16 × 1
vectors. Boxes in blue show memory elements that reside on the off-chip
DDR
Figure 7.2: Animation — Aᵀ(A(:,Ω)xΩ) partitioned into a one-dimensional
kernel grid of thread blocks, each comprising 16 × 16 threads
Figure 7.3: Organization of the problem at the thread block level
memory. Green boxes are on-chip shared memory that can be found in every
streaming multiprocessor (SM). Since shared memory has lower latency
and higher throughput compared to DDR memory, it is common to stage
data structures onto shared memory prior to operation.
Red boxes are the register storage of every CUDA thread. Reading and
writing the register file is faster than shared memory access, but a thread
cannot access registers residing in other threads. Therefore, threads write
to shared memory to make their data accessible to threads residing within
the same thread block. The only method for threads of different blocks to
share data is to use off-chip DDR memory.
7.5.1 Staging Data onto Shared Memory
Every entry of the input vector is used 16 times, incurring at least 16
memory transactions for every element. Since transactions with the off-chip
DDR memory are highly inefficient, the 16 × 1 input vector is staged
onto the shared memory so that the cost of an off-chip memory read can
be amortized across 16 fast shared memory reads. Since there are only
16 entries to be fetched, the first 16 threads of the block simultaneously
launch a 64B coalesced memory read transaction from the off-chip memory,
with the remaining 240 threads idle; Figure 7.4 illustrates this step.

Figure 7.4: Animation — serialized visualisation of staging the 16 × 1
partitions of A(:,Ω)xΩ onto the shared memory. Yellow indicates the
thread in charge of the transfer

Note that although the animation is shown serialized, the 16 threads
concurrently fetch all 16 entries from off-chip memory into shared memory.
7.5.2 Multiply and Accumulate
Every thread within a block performs a multiplication between an entry
in the shared memory and another from the off-chip memory (Figure 7.5).
The products are accumulated in the register of the (p, q)-th thread
according to the inner summation of Equation 7.3. Assuming that A is
stored in
column-major order, since consecutive warps of threads read consecutive
memory on the off-chip memory, the GPU recognizes these coherent access
patterns and will issue 16 64B coalesced memory transactions instead of
256 separate instructions, making this operation highly I/O-efficient. Also
note that every entry of A is used only once, hence there is no need to stage
the data onto shared memory.
Figure 7.5: Animation — Multiply and accumulate operation
Figure 7.6: Animation — contents of the registers of every thread are
copied into shared memory
7.5.3 Copy to Shared Memory
Since threads cannot access the registers of other threads, all the threads
within the same block write their accumulated products into a 16 × 16
array in shared memory. Figure 7.6 shows a serialized visualization of the
operation; in actual operation, the writes to shared memory occur
simultaneously across all 256 threads.
7.5.4 Final Summation
The first 16 threads perform the outer summation of Equation 7.3 based
on the inputs that have been copied into shared memory during the
previous step. The accumulated results are stored within the thread
registers.
Figure 7.7 shows the serialized operation of this step; in the actual
operation the 16 threads run concurrently.

Figure 7.7: Animation — the first 16 threads performing the final
summation

Figure 7.8: Animation — writing the final accumulated result to off-chip
memory
7.5.5 Writing Results to DDR Memory
Each of the 16 threads writes the final accumulated result to off-chip memory.
As the memory locations are consecutive, a single 64B coalesced memory
transaction is issued instead of 16 separate instructions. Figure 7.8 shows a
serialized animation of the operation. Note that in the actual hardware the
write operation occurs simultaneously.
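The five steps above can be emulated serially. The following sketch shrinks the 16 × 16 block to 4 × 4 and is illustration only: real CUDA threads execute each phase concurrently.

```python
T = 4  # stands in for the 16 x 16 thread block

def thread_block(M, v, bi):
    """Compute the length-T tile of y = Mv starting at row bi, as one block would."""
    acc = [[0.0] * T for _ in range(T)]            # per-thread registers
    for bk in range(0, len(v), T):
        shared_v = v[bk:bk + T]                    # 7.5.1: stage a tile of v
        for p in range(T):
            for q in range(T):                     # 7.5.2: thread (p, q) MACs
                acc[p][q] += M[bi + p][bk + q] * shared_v[q]
    shared = [row[:] for row in acc]               # 7.5.3: registers -> shared
    partial = [sum(shared[p]) for p in range(T)]   # 7.5.4: first T threads sum
    return partial                                 # 7.5.5: coalesced write-back
```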
7.6 Parallelism
Exposing parallelism serves two purposes. Firstly, it allows the CUDA
hardware to efficiently schedule sub-problems onto SMs that are blocked
on the currently executing warp, whether by pending memory transactions,
block synchronization or read-after-write register dependencies. Secondly,
code performance scales with the number of physical processors available
on the hardware.
There are two levels of parallelism exposed by how the problem has
been partitioned in section 7.4. At the kernel grid level, each thread block
progresses independently of one another. Within the underlying hardware
implementation, thread blocks are executed on SMs, the number of which
depends on the hardware version. Therefore the same code running on a
later generation of hardware achieves more speed-up, owing to the larger
number of physical SMs, than when executed on older hardware.
The second type of parallelism arises at the thread level. Every thread
within the block executes independently of one another, except when they
meet at synchronization points. Threads are mapped onto physical stream
processors within every SM, the actual number depending on the hardware
version. For example, the NVIDIA GT200 architecture contains 8 shader
processors and 2 special function units per SM, whereas the more advanced
GF100 series has 32 shaders and 4 special function cores. A block does not
require one physical shader per thread, because threads are executed in a
round-robin fashion.
One expects computations to be faster on an architecture which has more
processors per SM.
Although there are sections where only 16 threads participate in the
processing (subsection 7.5.1 and subsection 7.5.5), this does not imply that
other threads of the same block are idling. Instead, inactive threads are
swapped out and the multiprocessor schedules threads of other blocks to
run. Since the processor schedules threads in multiples of 32 [54, §2.1],
only 16 threads idle during these phases. This idling can be avoided by
allowing a full warp to stage data onto shared memory and write data to
device memory, albeit with additional code that does not translate into a
significant speed-up.
7.7 Hardware Benchmark
To see how well the proposed GEMV compares with cuBLAS, both codes
are run on the Amazon AWS EC2 cg1.4xlarge hardware instance, backed
by two Intel Xeon X5570 CPUs and two NVIDIA Tesla M2050 GPUs. The size
of A is fixed at 295 million entries. The aspect ratio of A, defined as the
width-to-height ratio, is varied from 1 to around 33. The thread block
aspect ratio, q/p, is varied between 1 and 8. For each configuration the
run-time average and standard deviation of 16 trials are tallied.
7.7.1 Hardware Environment
The NVIDIA Tesla M2050 GPU Computing Module features 448 thread
processors grouped into 14 SMs of 32 cores each, each running at a clock
speed of 1.15 GHz. The entire unit provides a peak processing power of
1.03 TFLOPS¹. A 3 GiB graphics memory services all the processors
through a 384-bit GDDR5 bus interface at a clock speed of 1.55 GHz,
giving an aggregate bandwidth of 148.4 GB/s. The hardware module is
offered by Amazon Web Services Elastic Compute Cloud (EC2) under the
cg1.4xlarge instance type.

¹ Counting a single-precision fused multiply–add as one operation
Figure 7.9: GEMV speedup of proposed over cuBLAS (plotted along the
z-axis) and the corresponding Zynq-7020 implementation (plotted as a
color map). Error bars annotate the 95% confidence interval at every data
point.
7.7.2 Results
Figure 7.9 shows the run-time speedup of the proposed method over
cuBLAS and over the implementation on the Zynq-7020. For all
configurations the proposed method consistently performs at least twice as
fast as cuBLAS, and as much as 2.5 times as fast for a thread block aspect
ratio of 1.23. The speedup factor over the FPGA implementation is an
impressive 702×. Performance is consistent over different matrix aspect
ratios.
The memory transfer efficiency is computed for all test cases based on
the maximum memory bandwidth of 148.4 GB/s advertised by NVIDIA.
From Figure 7.10, the proposed method consistently uses no less than 50%
of the
Figure 7.10: Memory transfer efficiency with respect to the advertised
peak memory bandwidth
memory bandwidth, with a peak utilization of 68%. A possible reason for
not achieving maximum throughput is the overhead of context switching
between different kernels on the same physical SM. The proposed method
is not compute-bound, because the peak computational efficiency is only
4.9% (Figure 7.11).
Figure 7.11: Computational efficiency with respect to the advertised peak
performance
Chapter 8
Conclusion
With advances in chip technology, parallel computing structures are
increasingly ubiquitous in modern embedded processors. Examples of
embedded architectures that already feature SIMD instructions include
ARM (NEON), the Power Architecture (AltiVec) and the Intel Atom
(SSE). Hybrid CPU-FPGA system-on-chips, such as the Xilinx Zynq-7000
Extensible Processing Platform and Altera's Hard Processor System, are
also becoming the norm. Therefore, algorithms intended for execution on
an embedded target should be designed with as much data-flow parallelism
as possible, so as to exploit such parallel hardware.
8.1 Proposed ℓ1-Solver and FPGA Implementation
Compared to state-of-the-art solvers, the proposed solver exhibits superior
run-time performance by formulating compute-intensive routines as
matrix-matrix and matrix-vector multiplications, both of which are
efficiently handled by BLAS libraries. Since these libraries are tuned to
use architecture-specific SIMD instructions, computations execute close to
peak efficiency. The program code running on the Cortex-A9 CPU is
economical enough to fit within the L1 cache, and its data structures
within the L2 cache. The bottleneck of the solver is implemented in
programmable logic, which achieves a seven-fold speed-up over software
running on the embedded processor. Without sacrificing run-time
performance, the embedded implementation on the Z-7020 is at least 114
times as power-efficient as the MATLAB prototype on the i7-2620M.
8.2 GPU Acceleration of GEMV
Also highlighted is a major limitation of the FPGA architecture: it
performs poorly on I/O-bound operations. In this respect, GPU
architectures have the upper hand due to their larger memory bandwidths.
The remaining bottleneck of the solver, namely the GEMV operation
Aᵀ(A(:,Ω)xΩ), has been accelerated on the NVIDIA Tesla M2050 GPU
Computing Module. The proposed parallel algorithm to speed up GEMV
is at least twice as fast as the proprietary cuBLAS library, and provides
an overall speedup of 702× over the respective FPGA implementation.
Bibliography
[1] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Yi Ma. Robust facerecognition via sparse representation. Pattern Analysis and MachineIntelligence, IEEE Transactions on, 31(2):210–227, 2009. ISSN 0162-8828. doi: 10.1109/TPAMI.2008.79.
[2] A. Wang, P.R. Gill, and A. Molnar. Fluorescent imaging and localiza-tion with angle sensitive pixel arrays in standard cmos. In Sensors,IEEE, pages 1706–1709, 2010. doi: 10.1109/ICSENS.2010.5689914.
[3] E. Elhamifar and R. Vidal. Sparse subspace clustering. In ComputerVision and Pattern Recognition. CVPR. IEEE Conference on, pages2790–2797, 2009. doi: 10.1109/CVPR.2009.5206547.
[4] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
[5] Tom Goldstein and Simon Setzer. High-order methods for basis pursuit.Methods, pages 1–17, 2011.
[6] Elaine T Hale, Wotao Yin, and Yin Zhang. Fixed-point continuationfor �1-minimization: Methodology and convergence. SIAM Journal onOptimization, 19(3):1107–1130, 2008.
[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. glmnet: Lassoand elastic-net regularized generalized linear models. R package version,1, 2009.
[8] M. A T Figueiredo, R.D. Nowak, and S.J. Wright. Gradient projectionfor sparse reconstruction: Application to compressed sensing and otherinverse problems. Selected Topics in Signal Processing, IEEE Journal of,1(4):586–597, 2007. ISSN 1932-4553. doi: 10.1109/JSTSP.2007.910281.
[9] M Salman Asif and Justin Romberg. Fast and accurate algorithmsfor re-weighted l1-norm minimization. arXiv preprint arXiv:1208.0651,2012.
[10] Kwangmoo Koh, S Kim, and S Boyd. l1_ls: A MATLAB solver for large-scale l1-regularized least squares problems. Stanford University, 2007.
[11] Y.C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matchingpursuit: recursive function approximation with applications to waveletdecomposition. In Signals, Systems and Computers, Proceedings of27th Asilomar Conference on, pages 40–44 vol.1, 1993. doi: 10.1109/ACSSC.1993.342465.
[12] Guy Narkiss and Michael Zibulevsky. Sequential subspace optimizationmethod for large-scale unconstrained problems. Technion-IIT, Depart-ment of Electrical Engineering, 2005.
[13] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozin-ski. Optimization with sparsity-inducing penalties. arXiv preprintarXiv:1108.0775, 2011.
[14] S.J. Wright, R.D. Nowak, and M. A T Figueiredo. Sparse reconstructionby separable approximation. Signal Processing, IEEE Transactionson, 57(7):2479–2493, 2009. ISSN 1053-587X. doi: 10.1109/TSP.2009.2016892.
[15] Stephen R Becker. Practical Compressed Sensing: modern data ac-quisition and signal processing. PhD thesis, California Institute ofTechnology, 2011.
[16] J.M. Bioucas-Dias and M. A T Figueiredo. A new twist: Two-stepiterative shrinkage/thresholding algorithms for image restoration. Im-age Processing, IEEE Transactions on, 16(12):2992–3004, 2007. ISSN1057-7149. doi: 10.1109/TIP.2007.909319.
[17] Junfeng Yang and Yin Zhang. Alternating direction algorithms for �1-problems in compressive sensing. SIAM journal on scientific computing,33(1):250–278, 2011.
[18] J. Wright, Yi Ma, J. Mairal, G. Sapiro, T.S. Huang, and ShuichengYan. Sparse representation for computer vision and pattern recognition.Proceedings of the IEEE, 98(6):1031–1044, June 2010. ISSN 0018-9219.doi: 10.1109/JPROC.2010.2044470.
[19] R. Baraniuk and P. Steeghs. Compressive radar imaging. In RadarConference, 2007 IEEE, pages 128–133, April 2007. doi: 10.1109/RADAR.2007.374203.
[20] Michael Lustig, David Donoho, and John M Pauly. Sparse mri: Theapplication of compressed sensing for rapid mr imaging. Magneticresonance in medicine, 58(6):1182–1195, 2007.
[21] E.J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles:exact signal reconstruction from highly incomplete frequency informa-tion. Information Theory, IEEE Transactions on, 52(2):489–509, Feb2006. ISSN 0018-9448. doi: 10.1109/TIT.2005.862083.
[22] Stephen P Boyd and Lieven Vandenberghe. Convex optimization.Cambridge university press, 2004.
[23] Scott Shaobing Chen, David L Donoho, and Michael A Saunders.Atomic decomposition by basis pursuit. SIAM journal on scientificcomputing, 20(1):33–61, 1998.
[24] Emmanuel J Candes, Michael B Wakin, and Stephen P Boyd. En-hancing sparsity by reweighted �1 minimization. Journal of FourierAnalysis and Applications, 14(5-6):877–905, 2008.
[25] Wotao Yin and Yin Zhang. Extracting salient features from less datavia l1-minimization. SIAG/OPT Views-and-News, 19(1):11–19, 2008.
[26] Hongjing Lu, Tungyou Lin, Alan LF Lee, Luminita A Vese, and Alan LYuille. Functional form of motion priors in human motion perception.In NIPS, pages 1495–1503, 2010.
[27] A.Y. Yang, S. Maji, C.M. Christoudias, T. Darrell, J. Malik, and S.S.Sastry. Multiple-view object recognition in band-limited distributedcamera networks. In Distributed Smart Cameras, 2009. ICDSC 2009.Third ACM/IEEE International Conference on, pages 1–8, Aug 2009.doi: 10.1109/ICDSC.2009.5289410.
[28] A. Wani and N. Rahnavard. Compressive sampling for energy efficientand loss resilient camera sensor networks. In MILITARY COMMUNI-CATIONS CONFERENCE, 2011 - MILCOM 2011, pages 1766–1771,Nov 2011. doi: 10.1109/MILCOM.2011.6127567.
[29] N. Katic, M.H. Kamal, M. Kilic, A. Schmid, P. Vandergheynst, andY. Leblebici. Power-efficient cmos image acquisition system based oncompressive sampling. In Circuits and Systems (MWSCAS), 2013IEEE 56th International Midwest Symposium on, pages 1367–1370,Aug 2013. doi: 10.1109/MWSCAS.2013.6674910.
[30] Z. Charbiwala, P. Martin, and M.B. Srivastava. Capmux: A scalableanalog front end for low power compressed sensing. In Green ComputingConference (IGCC), 2012 International, pages 1–10, June 2012. doi:10.1109/IGCC.2012.6322255.
[31] Ramy Hussein, Amr Mohamed, Masoud Alghoniemy, and Alaa Awad.Design and analysis of an adaptive compressive sensing architecture
for epileptic seizure detection. In Energy Aware Computing Systemsand Applications (ICEAC), 2013 4th Annual International Conferenceon, pages 141–146, Dec 2013. doi: 10.1109/ICEAC.2013.6737653.
[32] J. Chiang and R. Ward. Data reduction for wireless seizure de-tection systems. In Neural Engineering (NER), 2013 6th Interna-tional IEEE/EMBS Conference on, pages 48–52, Nov 2013. doi:10.1109/NER.2013.6695868.
[33] M. Shoaib, K.H. Lee, N.K. Jha, and N. Verma. A 0.6–107 μw energy-scalable processor for directly analyzing compressively-sensed eeg. Cir-cuits and Systems I: Regular Papers, IEEE Transactions on, PP(99):1–14, 2014. ISSN 1549-8328. doi: 10.1109/TCSI.2013.2285912.
[34] M. Shoaran, M.M. Lopez, V.S.R. Pasupureddi, Y. Leblebici, andA. Schmid. A low-power area-efficient compressive sensing approachfor multi-channel neural recording. In Circuits and Systems (ISCAS),2013 IEEE International Symposium on, pages 2191–2194, May 2013.doi: 10.1109/ISCAS.2013.6572310.
[35] S.A. Imtiaz, A. Casson, and E. Rodriguez-Villegas. Compressionin wearable sensor nodes: Impacts of node topology. BiomedicalEngineering, IEEE Transactions on, PP(99):1–1, 2013. ISSN 0018-9294. doi: 10.1109/TBME.2013.2293916.
[36] Qing Ling and Zhi Tian. Decentralized sparse signal recovery forcompressive sleeping wireless sensor networks. Signal Processing, IEEETransactions on, 58(7):3816–3827, July 2010. ISSN 1053-587X. doi:10.1109/TSP.2010.2047721.
[37] Peter M. Kogge and Harold S. Stone. A parallel algorithm for theefficient solution of a general class of recurrence equations. Computers,IEEE Transactions on, C-22(8):786–793, Aug 1973. ISSN 0018-9340.doi: 10.1109/TC.1973.5009159.
[38] C. S. Wallace. A suggestion for a fast multiplier. Electronic Computers,IEEE Transactions on, EC-13(1):14–17, Feb 1964. ISSN 0367-7508.doi: 10.1109/PGEC.1964.263830.
[39] Robert E Goldschmidt. Applications of division by convergence. PhDthesis, Massachusetts Institute of Technology, 1964.
[40] Emmanuel Candes and Justin Romberg. l1-magic: Recovery of sparse signals via convex programming. URL: www.acm.caltech.edu/l1magic/downloads/l1magic.pdf, 4, 2005.
[41] A. Maleki and D.L. Donoho. Optimally tuned iterative reconstruction algorithms for compressed sensing. Selected Topics in Signal Processing, IEEE Journal of, 4(2):330–341, 2010. ISSN 1932-4553. doi: 10.1109/JSTSP.2009.2039176.
[42] Jerome LVM Stanislaus and Tinoosh Mohsenin. Low-complexity fpga implementation of compressive sensing reconstruction. In International Conference on Computing, Networking and Communications, 2013.
[43] K. Karakus and H.A. Ilgin. Implementation of image reconstruction algorithm using compressive sensing in fpga. In Signal Processing and Communications Applications Conference (SIU), 20th, pages 1–4, 2012. doi: 10.1109/SIU.2012.6204682.
[44] H. Rabah, A. Amira, and A. Ahmad. Design and implementation of a fall detection system using compressive sensing and shimmer technology. In Microelectronics (ICM), 24th International Conference on, pages 1–4, 2012. doi: 10.1109/ICM.2012.6471399.
[45] P. Blache, H. Rabah, and A. Amira. High level prototyping and fpga implementation of the orthogonal matching pursuit algorithm. In Information Science, Signal Processing and their Applications (ISSPA), 11th International Conference on, pages 1336–1340, 2012. doi: 10.1109/ISSPA.2012.6310501.
[46] Jicheng Lu, Hao Zhang, and Huadong Meng. Novel hardware architecture of sparse recovery based on fpgas. In Signal Processing Systems (ICSPS), 2nd International Conference on, volume 1, pages V1–302–V1–306, 2010. doi: 10.1109/ICSPS.2010.5555628.
[47] Lin Bai, P. Maechler, M. Muehlberghuber, and H. Kaeslin. High-speed compressed sensing reconstruction on fpga using omp and amp. In Electronics, Circuits and Systems (ICECS), 19th IEEE International Conference on, pages 53–56, 2012. doi: 10.1109/ICECS.2012.6463559.
[48] A. Septimus and R. Steinberg. Compressive sampling hardware reconstruction. In Circuits and Systems (ISCAS), Proceedings of IEEE International Symposium on, pages 3316–3319, 2010. doi: 10.1109/ISCAS.2010.5537976.
[49] Y.V. Zakharov and V. Nascimento. Orthogonal matching pursuit with dcd iterations. Electronics Letters, 49(4):295–297, 2013.
[50] P.R. Gill, A. Wang, and A. Molnar. The in-crowd algorithm for fast basis pursuit denoising. Signal Processing, IEEE Transactions on, 59(10):4595–4605, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2011.2161292.
[51] M.A.T. Figueiredo and R.D. Nowak. A bound optimization approach to wavelet-based image deconvolution. In Image Processing. ICIP. IEEE International Conference on, volume 2, pages II–782–5, 2005. doi: 10.1109/ICIP.2005.1530172.
[52] David Donoho and Jared Tanner. Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4273–4293, 2009.
[53] Zynq-7000 AP SoC Technical Reference Manual. Xilinx Inc., v1.6.1 edition, September 2013.
[54] CUDA C Best Practices Guide. Nvidia Corporation, Santa Clara,California, USA, 5.5 edition, July 2013.