National University of Singapore
Faculty of Engineering
Department of Electrical and Computer Engineering
M.Eng. Dissertation
Accelerating Real-time Computer Vision Algorithms on Parallel
Hardware Architectures
Submitted by: Mr. Ang Zhi Ping (B. Eng. (Hons.), NUS), [email protected]
Supervisor: Assistant Professor Akash Kumar, [email protected]
A thesis submitted for the degree of Master of Engineering
2014
Typeset in LaTeX 2ε. Last revised on October 28, 2014.
Declaration
I hereby declare that this thesis is my original work and it has been written
by me in its entirety. I have duly acknowledged all the sources of information
which have been used in the thesis. This thesis has also not been submitted
for any degree in any university previously.
Ang Zhi Ping, October 28, 2014
Acknowledgments
The author would like to express gratitude to his research supervisor, Assistant Professor Akash Kumar, who provided invaluable advice on the choice of algorithms and on implementation aspects on hardware platforms, and to Joling, who assisted in collecting and tabulating run-time results. He would also like to thank DSO National Laboratories for providing full sponsorship during the course of study.
Contents
Declaration
Acknowledgments
Summary
List of Tables
List of Figures
List of Symbols and Abbreviations

1 Introduction
  1.1 Foundations
  1.2 Formulation
  1.3 ℓ1-Optimization
  1.4 Application: Image Registration Under Projective Transformation
      1.4.1 Method
      1.4.2 Registration Under Gaussian Error
      1.4.3 Registration Under Gross Occlusion Error
      1.4.4 Remarks
  1.5 ℓ1 Optimization on Embedded Platforms
      1.5.1 Computer Vision and Imaging
      1.5.2 Biomedical Sensing
      1.5.3 Wireless Sensor Networks
  1.6 Thesis Overview

2 Unsuitability of Targeting Existing Solvers to Embedded Platforms
  2.1 Floating Point Division
  2.2 Expensive Transcendental Functions
  2.3 Large Software Libraries
  2.4 Inefficient Memory Usage
  2.5 Lack of Parallelism
  2.6 Poor Recovery Performance of Orthogonal Matching Pursuit
  2.7 Questionable Scalability of OMP

3 Vectorization on Embedded Systems
  3.1 SIMD Instruction Set
  3.2 Eliminating if-else Using Vectorization
  3.3 Limitations of SIMD Processing

4 Proposed BPDN Solver
  4.1 Overview
  4.2 Initializing x and Ω
  4.3 Phase I: Solving for x Given Ω
  4.4 Phase II: Updating Ω Given x
  4.5 Estimating the Sparsity of x
      4.5.1 Wright et al. [1]: Face recognition from a database
      4.5.2 Wang et al. [2]: Three-dimensional arrangement of light sources in bioassays
      4.5.3 Elhamifar and Vidal [3]: Motion segmentation using sparse subspace clustering

5 BPDN Solver Benchmark
  5.1 List of Solvers
      5.1.1 ADMM LASSO [4]
      5.1.2 CGIST [5]
      5.1.3 FPC-BB [6]
      5.1.4 GLMNET [7]
      5.1.5 GPSR-BB6 [8]
      5.1.6 Homotopy [9]
      5.1.7 L1-LS [10]
      5.1.8 OMP [11]
      5.1.9 SESOP_PACK [12]
      5.1.10 SPAMS [13]
      5.1.11 SpaRSA2 [14]
      5.1.12 TFOCS [15]
      5.1.13 TwIST2 [16]
      5.1.14 YALL1 [17]
  5.2 Test Input
  5.3 Run-Time Performance
  5.4 Algorithm Bottleneck
  5.5 Accuracy of Recovered Results
  5.6 Memory Usage
  5.7 Convergence Properties of Solver

6 Solver Implementation on the Xilinx Zynq Z-7020
  6.1 Porting MATLAB to C
  6.2 Eigen BLAS Library
  6.3 Accelerating A(:,Ω)^T A(:,Ω) Using Programmable Logic
      6.3.1 Multiply-And-Accumulate Hardware Engine
      6.3.2 Detailed Operation
      6.3.3 Specification of MAC Engine
  6.4 Remaining Bottleneck

7 Solver Implementation on NVIDIA CUDA GPU Architecture
  7.1 Comparisons Between GPU and FPGA Architectures
  7.2 I/O-boundness of A^T (A(:,Ω) x_Ω)
  7.3 Accelerating Level 2 BLAS Operation (GEMV) on CUDA
  7.4 Problem Partitioning
  7.5 Thread Block Operation
      7.5.1 Staging Data onto Shared Memory
      7.5.2 Multiply and Accumulate
      7.5.3 Copy to Shared Memory
      7.5.4 Final Summation
      7.5.5 Writing Results to DDR Memory
  7.6 Parallelism
  7.7 Hardware Benchmark
      7.7.1 Hardware Environment
      7.7.2 Results

8 Conclusion
  8.1 Proposed ℓ1-Solver and FPGA Implementation
  8.2 GPU Acceleration of GEMV
Summary
Computer vision routines are typically designed as sequential algorithms intended to run on CPUs; hence they are often formulated with parallelism as an afterthought. Targeting these algorithms for real-time applications on parallel hardware architectures such as field-programmable gate arrays (FPGAs) and graphics processing units (GPUs) is therefore a tedious exercise.
In this thesis, a class of ℓ1-optimization problems known as basis pursuit denoising (BPDN) is explored. BPDN has been widely used in situations where the underlying signal is known to be sparse and measurements are corrupted by Gaussian error. Robust computer vision algorithms are often formulated as ℓ1-optimization problems because they give better quality results when the data is corrupted by non-Gaussian sources of error. For example, occlusion, salt-and-pepper noise and discontinuous flow boundaries can be effectively modeled using ℓ1 penalty terms.
Real-world applications rely heavily on an embedded real-time ℓ1-solver to recover the sparse signal within a reasonable time frame, because such a solver can be integrated with a portable device. Unfortunately, existing solvers are generally unsuitable for embedded implementation due to poor run-time performance, lack of parallelism or high memory usage. To address these issues, this thesis proposes an efficient ℓ1-solver suitable for implementation on parallel architectures. The algorithm is implemented on the Xilinx Zynq-7020 All Programmable System-on-Chip FPGA and the NVIDIA Tesla M2050 GPU Computing Module.
For a problem with 5000 variables and 500 constraints, the solver occupies a small memory footprint of 29 kB and takes 0.14 seconds to complete on the Zynq Z-7020. The same problem takes 0.19 seconds on the second-generation Intel Core i7-2620M mobile processor, which runs at 4 times the clock frequency and 114 times the power budget of the Z-7020. Without sacrificing run-time performance, the solver is highly optimized for power-constrained embedded applications. To date, this is the first energy-efficient embedded solver capable of handling large-scale problems with several thousand variables.
Although FPGAs are suited to accelerating compute-bound operations, they under-perform on I/O-bound operations due to limited off-chip memory bandwidth. GPUs, in contrast, have substantially larger memory bandwidths and are therefore ideal for speeding up I/O-bound operations. In this respect, the solver is also implemented on the NVIDIA Tesla M2050 GPU Computing Module, with the cuBLAS library used to port linear algebraic routines to the GPU. The thesis also furnishes an optimized algorithm for generalized matrix-vector multiplication, which is at least twice as fast as cuBLAS and 702 times as fast as the FPGA implementation.
Some of the figures in the thesis are animations that can be viewed using
Adobe Reader 7 or later. An electronic copy of this thesis is accessible at
http://x.co/mthesis.
List of Tables
3.1 4-way single precision floating point SIMD assembly instructions in common architectures
3.2 A sequence of transformations vectorizing algorithm 2
5.1 Run-time profiling results on MATLAB
6.1 Run-time profiling results on the Z-7020 using Gprof
6.2 Hardware MAC engine specifications
7.1 Comparison of memory bandwidth for various hardware systems
List of Figures
1.1 Image registration error using ℓ1 and ℓ2 penalties for various Gaussian noise levels
1.2 Animation — Comparison of ℓ1 versus ℓ2 registration for varying Gaussian noise levels. Brighter pixels indicate higher registration error
1.3 Animation — Comparison of ℓ1 versus ℓ2 registration for varying degrees of gross occlusions. Brighter pixels indicate higher registration error
1.4 Image registration error using ℓ1 and ℓ2 penalties for varying degrees of gross occlusions
4.1 Animation — Convergence towards the true solution (red boxes)
5.1 Run-time for various problem sizes, where A ∈ R^(m×n) and there are s non-zero entries in x
5.2 Recovery accuracy for various problem sparsities, where A ∈ R^(m×n) and there are s non-zero entries in x
5.3 Peak memory usage for various problem sizes, where A ∈ R^(m×n) and there are s non-zero entries in x
5.4 Convergence of the proposed solver for the test case m = 10000, n = 100000 and s = 100 for various η
6.1 ZedBoard hardware development platform
6.2 Hardware engine comprising 9 floating point MAC units
6.3 Submatrix partitioning of A(:,Ω)^T A(:,Ω) for various sizes. Redundantly computed entries are coloured purple
6.4 Output of every MAC being demultiplexed to a bank of 8 registers
6.5 Post-routed layout on the Z-7020 using Vivado 2013.2 (Legend: yellow – MAC engine, brown – ARM Cortex-A9 and DDR3 memory bus, green – AXI_ACP interconnect logic, blue – AXI_GP interconnect logic, red – reset logic)
6.6 System module schematic (Legend: processing_system7_1 – ARM Cortex-A9, matmul_1 – hardware MAC engine, axi_mem_intercon – AXI_ACP bus logic, processing_system7_1_axi_periph – AXI_HP bus logic, proc_sys_reset – reset logic)
7.1 Animation — Computation of A^T (A(:,Ω) x_Ω). The rectangle in the middle is A^T, and the column vector on the right is A(:,Ω) x_Ω. Blue represents read, and red represents write operations in the respective memory locations
7.2 Animation — A^T (A(:,Ω) x_Ω) partitioned into a one-dimensional kernel grid comprising 16 × 16 thread blocks
7.3 Organization of the problem at the thread block level
7.4 Animation — Serialized visualisation of staging 16 × 1 partitions of A(:,Ω) x_Ω onto the shared memory. Yellow indicates the thread in charge of the transfer
7.5 Animation — Multiply and accumulate operation
7.6 Animation — Contents of the registers for every thread copied into shared memory
7.7 Animation — First 16 threads performing the final summation
7.8 Animation — Writing the final accumulated result to off-chip memory
7.9 GEMV speedup of the proposed kernel over cuBLAS (plotted along the z-axis) and the corresponding Zynq-7020 implementation (plotted as a color map). Error bars annotate the 95% confidence interval at every data point
7.10 Memory transfer efficiency with respect to the advertised peak memory bandwidth
7.11 Computational efficiency with respect to the advertised peak performance
List of Symbols and Abbreviations

a · b        Vector dot product, or equivalently a^T b.

Ω, Ψ         Symbols used to represent ordered sets of integers, i.e. indices of a matrix. Can be interpreted as a |Ω| × 1 (|Ψ| × 1) matrix.

(·)(i,j)     Returns the entry residing in the i-th row and j-th column of the bracketed matrix expression. i and j are 1-indexed.

(·)(a:b,c:d) Returns a submatrix of the bracketed matrix expression, comprising rows a through b inclusively, and columns c through d inclusively. Special cases are (·)(i,:) and (·)(:,j), which give the i-th row and j-th column vector respectively.

N(A)         Null space of the column space of matrix A.

O(f(n))      Big O notation, where there exist finite positive constants M, n0 such that g(n) = O(f(n)) ↔ |g(n)| ≤ M|f(n)| for all n ≥ n0.

ARM          Advanced RISC Machines, originally known as Acorn RISC Machine

AXI          Advanced eXtensible Interface, a bus standard which is part of the Advanced Microcontroller Bus Architecture

BLAS         Basic Linear Algebra Subprograms. These routines are often highly optimized by hardware vendors, and form the backbone of advanced matrix libraries such as LAPACK. Examples are the Intel Math Kernel Library and NVIDIA cuBLAS.

BPDN         Basis pursuit denoising

BRAM         Block random access memory, a hardware resource commonly available in FPGAs

CMOS         Complementary metal oxide semiconductor

CPU          Central processing unit

DDR          Double data rate, a memory bus standard

FPGA         Field-programmable gate array

GEMV         Generalized matrix-vector multiplication y ← αAx + βy. This operation and the triangular solver Tx = y are part of the Level 2 BLAS functionality.

GPGPU        General-purpose computing on graphics processing units

GSM          Global System for Mobile Communications

LAPACK       Linear Algebra Package

LU           Lower upper, used within the context of LU decomposition

MAC          Multiply-and-accumulate

MAP          Maximum a posteriori

MATLAB       Matrix Laboratory, a scientific computing software environment

MRI          Magnetic resonance imaging, a non-invasive medical imaging technology

RISC         Reduced instruction set computing

SM           Streaming multiprocessor, a hardware building block of NVIDIA GPUs

SIMD         Single instruction multiple data. Often used in the context of SIMD instruction sets.

WSN          Wireless sensor network
Chapter 1
Introduction
Sparse recovery has made inroads into several fields of research such as computer vision [18], radar [19] and medical imaging [20]. By relaxing NP-hard sparse recovery problems into convex ℓ1-optimization programs, problems previously reckoned to be intractable are now solvable in polynomial time [21, 22]. A category of sparse recovery, collectively known as basis pursuit denoising (BPDN), recovers a sparse solution from linear measurements that are corrupted by Gaussian noise. This method is therefore highly relevant for dealing with real-world data and is the focus of this thesis.
Hosting an embedded real-time ℓ1-optimization solver is desirable because analysis can be performed in situ instead of being deferred to off-line processing. Moreover, an embedded target is highly suited to power- and space-constrained environments. Unfortunately, existing solvers are either time-consuming, memory intensive or unsuitable for parallelization because of inappropriate embedded design methodology. This thesis therefore motivates the design of an efficiently parallelizable solver suitable for targeting various hardware platforms such as CPUs, DSPs, GPUs and FPGAs.
1.1 Foundations
In several applications, experimenters are interested in recovering an underlying physical signal x ∈ R^n. Due to limitations of the sensor technology, x cannot be directly observed. Instead, a set of linear measurements y = Ax is available, where A ∈ R^(m×n) characterizes the sensor's physics. Recovering x is straightforward if A has full column rank; the pseudo-inverse x = A⁺y uniquely recovers the x which agrees with the measurements in the least-squares sense.

The number of measurements made is roughly proportional to the sensor's power consumption, so it makes sense to reduce the number of measurements on power-constrained embedded platforms. But when the number of constraints is less than the dimension of x, the pseudo-inverse yields an infinite family of solutions, namely x = A⁺y + N(A)z for z ∈ R^(dim N(A)).
This ambiguity in x has been elegantly resolved in breakthrough research by Candes et al. [21], subject to additional sparsity constraints imposed on x and incoherency on A. In their paper, it is proven that if x is a k-sparse signal (i.e. there are up to k non-zero entries in x), and the number of randomly chosen frequency samples taken as linear measurements is no less than O(k log n), then x can be uniquely recovered by solving an ℓ1 convex optimization problem. Specifically, the optimization entails minimizing the sum of absolute values of x, subject to the set of equality constraints where the solution agrees with the linear measurements (Equation 1.1). Using this method, the number of measurements can be reduced by an exponential factor.
x* = argmin_x ||x||_1   subject to   y = Ax        (1.1)
1.2 Formulation
The scenario in section 1.1 is highly idealistic because y and A are assumed to be noiseless. This is rarely true, as y is often corrupted by thermal noise, and A can never be tailored or measured to infinite precision. Thus, the method of basis pursuit denoising (BPDN, [23]) loosens the equality constraint and assumes the presence of Gaussian error in the measurement term (Equation 1.2). The parameter λ trades signal fidelity (λ → 0) for solution sparsity (λ → ∞).
x* = argmin_x (1/2)||y − Ax||_2^2 + λ||x||_1        (1.2)
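Many first-order BPDN solvers are built around the soft-thresholding operator, the proximal map of the ℓ1 term in Equation 1.2. The following pure-Python sketch of iterative soft-thresholding (ISTA) is illustrative only — it is not the solver proposed in this thesis, and `ista`/`soft_threshold` are hypothetical helper names:

```python
def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink each entry of v towards zero by t."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]

def ista(A, y, lam, step, iters=500):
    """Minimize 0.5*||y - Ax||_2^2 + lam*||x||_1 by gradient step + shrinkage.
    A is a list of rows; step must be below 1/||A||_2^2 for convergence."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # residual r = Ax - y, then gradient g = A^T r of the smooth term
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = soft_threshold([x[j] - step * g[j] for j in range(n)], step * lam)
    return x
```

Each iteration costs two matrix-vector products and an elementwise shrinkage, which is why the matrix-vector kernels accelerated in later chapters dominate solver run-time.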
A variant of Equation 1.2 is the weighted BPDN (Equation 1.3), where w is a set of non-negative weights. This formulation is used in re-weighted ℓ1-minimization, which enhances recovery performance by solving a sequence of BPDN problems with an adaptive w [24].
x* = argmin_x (1/2)||y − Ax||_2^2 + ||Qx||_{1,w}        (1.3)
Q is an optional sparsifying orthogonal basis for x, i.e. the signal has a sparse representation under Q. This is commonly encountered in image processing applications where the underlying image vector Qx has a sparse representation in a frequency domain; transformations suitable for images include the discrete cosine and Haar transforms. A relevant application is magnetic resonance imaging (MRI), which involves solving Equation 1.3, where x represents the underlying MRI image to be recovered, y − Ax represents the residual against the set of measurements made by the MRI scanner, and Q is a suitable wavelet transform which efficiently compresses MRI imagery [25].
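To make the notion of a sparsifying basis concrete, here is a sketch (not from the thesis; `haar_1d` is a hypothetical helper) of a 1-D Haar decomposition. A piecewise-constant signal, which is dense in the sample domain, maps to a coefficient vector that is mostly zeros:

```python
def haar_1d(signal):
    """Full Haar decomposition of a length-2^k signal (unnormalized averages).
    Returns [overall average] followed by detail coefficients, coarse to fine."""
    details = []
    s = list(signal)
    while len(s) > 1:
        # pairwise averages carry the coarse shape, differences carry the detail
        avg = [(s[2 * i] + s[2 * i + 1]) / 2 for i in range(len(s) // 2)]
        det = [(s[2 * i] - s[2 * i + 1]) / 2 for i in range(len(s) // 2)]
        details = det + details
        s = avg
    return s + details
```

For example, the 8-sample step signal [5, 5, 5, 5, 2, 2, 2, 2] has only two non-zero Haar coefficients, so an ℓ1 penalty on Qx favours exactly such signals.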
1.3 ℓ1-Optimization
Solving BPDN entails performing ℓ1-optimization. Consider the overdetermined problem in Equation 1.4. Given that the error e_G has a Gaussian prior, the first case yields the MAP estimate, whereas if the error e_L has a Laplace prior, the second case gives the MAP estimate [26]. Therefore, the most probable solution depends on the properties of the underlying error distribution.
The Gaussian prior has light tails, therefore errors that can be modeled as Gaussian should not deviate far from the mean. An example would be Johnson-Nyquist noise introduced by thermal agitation of electrons, which in turn reduces the precision of CMOS image sensors by a few bits. The Laplace prior, on the other hand, has much heavier tails, hence it is better suited for modeling errors which may be arbitrarily large, for example salt-and-pepper noise. Depending on the type of error being modeled, the respective ℓ2 or ℓ1 objective should be solved.
x* = argmin_{x ∈ R^n} ||e_G||_2,  e_G = Ax − b        (Gaussian prior)
x* = argmin_{x ∈ R^n} ||e_L||_1,  e_L = Ax − b        (Laplace prior)        (1.4)
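The contrast between the two cases of Equation 1.4 is easiest to see when A is a column of ones: estimating a scalar from repeated measurements, the ℓ2 minimizer is the sample mean while the ℓ1 minimizer is the sample median, which a single gross outlier cannot drag away. A small illustrative sketch (not from the thesis):

```python
def l2_estimate(b):
    """argmin_x sum_i (x - b_i)^2 is the sample mean."""
    return sum(b) / len(b)

def l1_estimate(b):
    """argmin_x sum_i |x - b_i| is the sample median."""
    s = sorted(b)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else 0.5 * (s[mid - 1] + s[mid])

clean = [1.0, 1.1, 0.9, 1.0, 1.05]
corrupted = [1.0, 1.1, 0.9, 1.0, 100.0]  # one gross, non-Gaussian outlier
# l2_estimate(corrupted) is dragged towards the outlier; l1_estimate is not.
```

This is exactly the behaviour exploited in the registration experiment below: gross occlusions act like the outlier, and the ℓ1 objective simply ignores them.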
1.4 Application: Image Registration Under Projective Transformation
To illustrate the differences between using ℓ2 and ℓ1 objectives, both methods
are used to register a pair of closely related images (denoted I and I′),
possibly coming from consecutive frames of an aerial video feed, or from the same scene captured at different times. We want to infer a homography H
such that the projective relation x′ ∼ Hx applies, where x is in homogeneous
coordinate form corresponding to a point in I, and x′ to I ′. Assuming there
are no changes in scene illumination, one would expect that the optimal H
would minimize Σi‖I(xi) − I ′(Hxi)‖2, since the same pixel from different
vantage points should have equal intensities. This reasoning is valid if the matching errors can be attributed to Gaussian sources of noise, e.g. sensor noise. If the pixel mismatch is due to sources of error that
do not obey Gaussian statistics and have large variances, the objective
Σi‖I(xi) − I ′(Hxi)‖1 would be a better choice. Examples of errors that can
be modeled using the �1 objective include objects which are either absent or
displaced between the image pair, or due to hard-coded video watermarks.
1.4.1 Method
I and I′ are grayscale images of a scale model of an urban environment, taken from slightly different viewpoints. The task is to register I′ to I under a projective transformation H. Two
types of errors are investigated: Gaussian noise and occluding rectangles.
For the first case, every pixel of both images is subjected to additive zero-mean Gaussian noise of fixed standard deviation σ, varied from 0 to 0.099 in steps of 0.001. For the second case, to simulate sources of non-Gaussian error with large variance, occlusion error is introduced by overlaying randomly-sized, uniformly coloured rectangles at random locations in I and I′. The colour of every overlay is uniformly distributed over the grayscale range. The degree of occlusion is measured by the proportion of pixels that are unoccluded in both I and I′. Image registration proceeds by overlaying an increasing number of rectangles, corresponding
Figure 1.1: Image registration error using ℓ1 and ℓ2 penalties for various Gaussian noise levels.
to increasing occlusion degree. In both scenarios, 100 registrations are done
for varying degrees of error. The quality of the registration is measured by
Σi‖I(xi) − I ′(Hxi)‖1.
1.4.2 Registration Under Gaussian Error
Figure 1.2 shows a composite animation, comprising the corrupted {I, I′} and the registration error for both the ℓ1 and ℓ2 methods. Although it is not obvious from the error images, ℓ2 outperforms ℓ1 for large Gaussian errors (Figure 1.1), albeit by a small proportion compared to the baseline error. This result naturally follows from the assumption that the residual error is Gaussian.
Figure 1.2: Animation — Comparison of ℓ1 versus ℓ2 registration for varying Gaussian noise levels. Brighter pixels indicate higher registration error.
[Animation frames: corrupted image pair and the corresponding ℓ1 and ℓ2 registration errors, for Gaussian σ from 0.000.]
It is notable that ℓ1 slightly outperforms the ℓ2 method for small Gaussian noise. This is not surprising considering that the registered images substantially deviate at the boundaries, where there are no matching pixels. Boundary mismatches are sources of non-Gaussian error which are elegantly handled by the ℓ1 objective. As ℓ1 does not penalize boundary mismatches as harshly as ℓ2, the latter overcompensates and, as a result, deduces a non-optimal transformation.
1.4.3 Registration Under Gross Occlusion Error
From Figure 1.3, it is evident that ℓ1 outperforms ℓ2 for high degrees of occlusion. Even when just a fifth of the pixels are unoccluded, the ℓ1 objective solves for an H comparable to that obtained when there is no occlusion.
1.4.4 Remarks
We have seen that ℓ1-optimization robustly models non-Gaussian sources of error. When registering a pair of corrupted images with up to 80% occlusion error, the ℓ1 penalty gives a residual error that is comparable to the case of no occlusion, whereas with the ℓ2 penalty, the registration error consistently increases with occlusion, with the total absolute error reaching nine times the baseline error. Even when the ℓ1 objective is used in place of ℓ2 in the presence of Gaussian sources of error, the residual error incurred is only slightly greater than the baseline error. Therefore, the case of Gaussian error can be conveniently subsumed within the framework of ℓ1-optimization.
Figure 1.3: Animation — Comparison of ℓ1 versus ℓ2 registration for varying degrees of gross occlusions. Brighter pixels indicate higher registration error.
1.5 ℓ1 Optimization on Embedded Platforms
Embedded applications which exploit compressive sensing realize power
savings by lowering the sampling rate below the Nyquist frequency. The
underlying sparse signal can then be reliably recovered from limited sensor
readings using sparse recovery. A survey of compressive sensing applications
in constrained embedded environments is detailed in this section.
[Animation frames: corrupted image pair and the corresponding ℓ1 and ℓ2 registration errors, for occlusion from 0.0%.]
Figure 1.4: Image registration error using ℓ1 and ℓ2 penalties for varying degrees of gross occlusions.
1.5.1 Computer Vision and Imaging
Object recognition based on a distributed camera system is explored in [27], where compressive sensing is used to reduce the amount of data transferred over the bandwidth-constrained network. Recovery is performed on a centralized base station which hosts an ℓ1-solver. Similar is the work of Wani and Rahnavard [28], which explores using compressive sensing via linear projections to reduce bandwidth usage in a wireless camera sensor network. Like [27], recovery is done at a base station.
A high frame rate, low power CMOS sensor has been realized in [29], where power is reduced by using compressive sensing to cut the number of readings taken per frame. Image reconstruction is performed off-line using basis pursuit with total variation. Charbiwala et al. [30] presented a low power, scalable analog frontend which samples directly in the compressive domain by randomly projecting the input signals onto a set of pseudo-random sparse bases.
1.5.2 Biomedical Sensing
The works of Hussein et al. [31] and Chiang and Ward [32] use compressive sensing to lengthen the operational life of wireless electroencephalography (EEG) sensors used in detecting the onset of epileptic seizures. [33] demonstrates low power recovery of sparsely sampled EEG signals with an ASIC chip that consumes a little over 107 μW during operation. Similarly, Shoaran et al. [34] fabricated a wireless neural recording chip that uses compressive sensing to cut power consumption. Imtiaz et al. [35] explore various topologies of wearable sensor nodes and how they impact recovery accuracy and battery lifespan. The aforementioned applications rely on a computer to do off-line sparse recovery.
1.5.3 Wireless Sensor Networks
Wireless sensor networks (WSN) are used in applications where there is a need to collect data (e.g. acoustic, seismic, humidity, soil composition) across vast geographical regions. They are realized by the distributed placement of autonomous sensing elements, each within communication proximity of its neighbors. Data is generated on the fly by the nodes and aggregated by wireless transmission to gateway nodes, which in turn transmit data out of the WSN through GSM or satellite networks. The operational time of individual nodes is constrained by the limited initial energy available to them.
A use case for compressive sensing is to reduce the number of actively
sensing nodes. Ling and Tian [36] exploited the sparse nature of local
spatial phenomena by powering down part of the network in a randomized
fashion. Also, their work features the novel use of distributed computing to
perform sparse signal recovery by judiciously spreading the computational
workload across the WSN.
1.6 Thesis Overview
In this introduction we motivate the need for a power efficient ℓ1 solver suited to performing sparse recovery on embedded platforms. The next chapter discusses existing solvers and their unsuitability for being ported to embedded platforms. Chapter three covers various architectural aspects of embedded processors that ℓ1 solvers can exploit. Chapter four proposes a low power ℓ1 solver, and performance metrics such as run-time, recovery accuracy and memory usage of the proposed solver are compared against a benchmark of state-of-the-art solvers in chapter five. Chapters six and seven discuss in detail implementation aspects of the proposed solver on the Xilinx Zynq Z-7020 FPGA and NVIDIA CUDA GPUs respectively. The final chapter concludes by summarizing key results and future directions.
Chapter 2
Unsuitability of Targeting Existing Solvers to Embedded Platforms
Solving BPDN problems generally requires more time than solving problems with closed-form linear algebraic solutions (e.g. least-squares regression). This is because optimized software libraries for matrix computations, most notably the Basic Linear Algebra Subprograms (BLAS), can be used to accelerate problems formulated using linear algebra, something that cannot be easily applied to BPDN problems. The main difficulty with ℓ1-optimization is the minimization of non-smooth functions. Most solvers minimize a smooth approximation of the objective function, progressively refining the solution by solving a sequence of sub-problems approaching the original objective. In this respect, ℓ1-solvers are often control-flow rather than data-flow intensive, making acceleration on parallel hardware a nontrivial task. With this combination of factors, it is rare to find ℓ1-solvers deployed to solve real-time, large scale problems on embedded devices. This chapter explains the factors contributing to the unsuitability of porting existing ℓ1 solvers onto embedded targets.
2.1 Floating Point Division
Solver designers often assume that all mathematical operations run in
constant time on the CPU, which is false because different operations have
critical paths of varying lengths. Between two n-bit operands, addition
takes one clock cycle because the critical path has log2 n levels of logic
[37]. Multiplication takes slightly longer, as the critical path is
log1.5 n deep based on the Wallace tree multiplier [38], whereas division
is the most expensive because there is no trivial circuit that computes
the quotient within a single clock cycle. Instead, iterative algorithms
such as Goldschmidt division [39] are used to successively approximate the
quotient, requiring several clock cycles to complete. For example, the
Cortex-A9 has an initiation interval of 10 cycles for floating point
division. The prudent designer avoids frivolous use of division, as it
heavily impacts run-time performance.
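As a hedged illustration of why division is iterative, the following C sketch (function name hypothetical, not from a hardware divider) implements Goldschmidt-style division: the denominator is normalized to [0.5, 1), and both operands are repeatedly multiplied by f = 2 − d, which drives the denominator towards 1 and roughly doubles the number of correct bits per step.

```c
#include <math.h>

/* Sketch of Goldschmidt division, q = a / b (b > 0); each of the five
   iterations squares the error term 1 - d, mimicking the multi-cycle
   behaviour of hardware dividers. Function name is illustrative. */
double goldschmidt_div(double a, double b) {
    int e;
    double d = frexp(b, &e);   /* b = d * 2^e with d in [0.5, 1) */
    double n = ldexp(a, -e);   /* scale the numerator identically */
    for (int i = 0; i < 5; i++) {
        double f = 2.0 - d;    /* correction factor */
        n *= f;
        d *= f;                /* d -> 1 - (1 - d)^2 */
    }
    return n;                  /* d is now ~1, so n is ~a / b */
}
```

Five iterations suffice here because the relative error shrinks quadratically; a hardware divider trades the same convergence for several pipeline cycles.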
2.2 Expensive Transcendental Functions
A handful of solvers liberally use transcendental functions. For example,
ℓ1-MAGIC [40] uses the log-barrier method to solve BPDN, but it is rare to
find hardware support for computing logarithms. These functions are often
implemented as polynomial approximations in software that typically take
hundreds of clock cycles per evaluation. Hardware support for transcendental
functions is sporadic and, where available, expensive at run-time: the ARM
Cortex-A9 on board the Z-7020 supports floating point square root, but each
computation takes 13 clock cycles.
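To make the cost concrete, here is a hedged sketch (not the thesis's or any library's code) of how a software logarithm is typically built: range-reduce via the exponent, then evaluate a truncated odd polynomial. Even this short five-term series needs a dozen multiply-adds plus a division, and production-quality accuracy requires more terms still.

```c
#include <math.h>

/* Illustrative software ln(x) for x > 0: write x = m * 2^e with
   m in [0.5, 1), then ln(m) = 2*atanh(z) with z = (m-1)/(m+1),
   evaluated as a truncated series. Accuracy here is only ~1e-6. */
double approx_log(double x) {
    int e;
    double m = frexp(x, &e);             /* x = m * 2^e */
    double z = (m - 1.0) / (m + 1.0);    /* |z| <= 1/3 on [0.5, 1) */
    double z2 = z * z;
    double p = z * (1.0 + z2 * (1.0/3 + z2 * (1.0/5 + z2 * (1.0/7 + z2/9))));
    return 2.0 * p + (double)e * 0.6931471805599453; /* + e * ln 2 */
}
```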
2.3 Large Software Libraries
Several solvers rely on software libraries that provide advanced matrix
operations (e.g. LAPACK). An example is SparseLab [41], which uses Cholesky
and LU factorization. Hosting such a library on an embedded platform is
memory intensive: a pre-compiled LAPACK library1 for x86 targets consumes
7.4 MB of program space, and possibly more on a RISC architecture like ARM.
2.4 Inefficient Memory Usage
A number of solvers achieve speed-up by pre-computing ATA. These solvers
are likely to run out of memory when handling problems with thousands of
variables because memory usage scales quadratically with n. CGIST [5]
and L1-LS [10] cache AT along with A, and ADMM LASSO [4] saves the
LU factorization of A, making them consume an extra O(mn) memory.
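A back-of-envelope calculation (helper names mine, single precision assumed) shows why caching ATA is untenable at the problem sizes benchmarked later in this thesis: for m = 10000 and n = 100000, A occupies about 4 GB while ATA would occupy about 40 GB.

```c
/* Hypothetical helpers: storage needed for A (m x n) versus a cached
   A'A (n x n), both in single precision (4 bytes per entry). */
unsigned long long matrix_bytes(unsigned long long m, unsigned long long n) {
    return m * n * 4ULL;        /* A itself: O(mn) */
}
unsigned long long gram_bytes(unsigned long long n) {
    return n * n * 4ULL;        /* A'A: O(n^2), quickly dominates */
}
```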
2.5 Lack of Parallelism
Some solvers are control-flow intensive, complicating any attempt at
pipelined processing or data-flow acceleration. Examples are TFOCS
[15] and SpaRSA2 [14].
1 http://icl.cs.utk.edu/lapack-for-windows/lapack/
2.6 Poor Recovery Performance of Orthogonal Matching Pursuit
Because hosting an embedded ℓ1-solver is challenging, the majority of
FPGA systems that require ℓ1-optimization [42, 43, 44, 45, 46, 47, 48, 49]
use orthogonal matching pursuit (OMP) [11] to approximately solve the
BPDN problem. Since OMP is a greedy heuristic, implementation is
straightforward and the run-time is short provided x is very sparse. Although
OMP has a recovery accuracy that rivals standard ℓ1-solvers for Gaussian A,
performance is poor for correlated matrices. A comparison of the recovery
accuracy between BPDN and OMP for problems with non-random A can
be found in [50].
2.7 Questionable Scalability of OMP
OMP has good run-time performance under the assumption that x is highly
sparse, which is unnecessarily pessimistic because an ℓ1 problem is
recoverable as long as x is O(√m)-sparse. Since the run-time of OMP scales
quadratically with sparsity, these solvers are only capable of handling
problems with low sparsity. The timings reported by studies that implement
OMP on FPGA are therefore highly optimistic because the sparsity (defined
as the fraction s/n) used is close to 0. Some reported sparsity parameters
are: 0.03 (n = 256) [42], 0.04 (n = 128) [43], 0.04 (n = 128) [45],
0.008 (n = 255) [46] and 0.04 (n = 128) [48]. Given that most of the
publications reported timings for n ≤ 256, it is doubtful whether these
solvers will scale gracefully with problem size, let alone with increasing
problem sparsity.
Chapter 3
Vectorization on Embedded Systems
To run a solver on an embedded system while achieving real-time performance,
the algorithm has to fully exploit the parallel hardware features commonly
found in modern embedded processors. Modern CPUs and GPU shader cores have
built-in single-instruction multiple-data (SIMD) instruction sets, making
them highly efficient at computing programming constructs that are
embarrassingly parallel. Examples of embedded processors with SIMD
capability include the Intel Atom (Streaming SIMD Extensions), Power
Architecture (AltiVec) and ARM (NEON). Vectorization also naturally extends
to GPUs, where every shader core holds the same instructions and executes
on different partitions of the dataset. It is therefore important to
formulate algorithms to use SIMD instructions to the fullest extent
possible.
3.1 SIMD Instruction Set
Relevant to efficient solver design are floating point operations such as
vector add and multiply. Modern processors compute four single-precision
floating point operations in one go, whereas later models such as the Intel
Core i7 support 8-way single-precision processing using Advanced Vector
Extensions (AVX). The latest Intel AVX-512 instruction set, to be
implemented in the upcoming Intel Xeon Phi processor family, is able to
perform 16-way single precision floating point operations in a clock cycle.
Table 3.1 details SIMD assembly instructions for common mathematical
operations on some processor architectures.

Table 3.1: 4-way single precision floating point SIMD assembly instructions
in common architectures

Instruction             Intel Atom SSE   Power Arch. AltiVec   ARM NEON   OpenGL ARB
Addition                ADDPS            VADDFP                VADD       ADD
Multiplication          MULPS            VMADDFP               VMUL       MUL
Division                DIVPS            -                     -          -
Maximum                 MAXPS            VMAXFP                VMAX       MAX
Compare                 CMPPS            VCMP*                 VC*        CMP
Reciprocal square root  RSQRTPS          VRSQRTEFP             VRSQRTE    -
3.2 Eliminating if-else Using Vectorization
Consider the pseudo-code in Algorithm 1. Array x is assigned from either
a or b depending on cond, which we assume represents boolean true as 1
and false as 0. A naïve method would be to use if-else. The disadvantages
are two-fold: firstly, if-else incurs a branch penalty, which may be heavy
on architectures that lack sophisticated branch prediction hardware (e.g.
the Cell Broadband Engine synergistic processing units); secondly, SIMD
instructions could have been used, but compilers are rarely able to infer
SIMD from if-else statements.
Equation 3.1 shows a reformulation that can be vectorized. It exploits
the fact that multiplying by a one/zero achieves the effect of if-else
selection. a − b can be computed using vector subtract, and x can be
computed using
Algorithm 1: Simple if-else
for i ← 0 to n − 1 do
    if cond[i] then x[i] ← a[i];
    else x[i] ← b[i];
end
a single vector multiply-accumulate.
x[i] = b[i] + (a[i] − b[i]) × cond[i] (3.1)
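As a concrete sketch (helper name hypothetical), Equation 3.1 becomes a branch-free C loop; with no data-dependent branches, an auto-vectorizer can map it directly onto SIMD subtract and multiply-accumulate instructions.

```c
/* Branchless select per Equation 3.1: cond[i] holds 1.0f for true and
   0.0f for false, so the multiply picks a[i] or b[i] without branching. */
void select_blend(float *x, const float *a, const float *b,
                  const float *cond, int n) {
    for (int i = 0; i < n; i++)
        x[i] = b[i] + (a[i] - b[i]) * cond[i];
}
```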
Algorithm 2 shows an if-else statement that is used in the proposed ℓ1
solver presented in chapter 4. Two modifications distinguish it from
Algorithm 1: firstly, the if-statement contains a min-term and the
else-statement a max-term; secondly, the arguments of these two functions
differ. Despite these complications, the snippet can be transformed into
vectorized code by applying mathematical identities via a series of
transformations outlined in table 3.2. All functions are converted into
min-form with matching parameters, and the remaining differences are
elegantly handled by computing the sign and magnitude of
Δx*_{λ=∞}[i] − Δx*_{λ=0}[i]. The final expression can be expressed using
commonly available vector instructions such as absolute, minimum and
compare. Data operands like Δx*_{λ=0}[i], Δx*_{λ=∞}[i] − Δx*_{λ=0}[i] and
λ/(aᵢᵀaᵢ) can be stored in memory-aligned arrays which are efficiently
handled by vector instructions.
Algorithm 2: If-else with non-trivial logic
for i ← 0 to |Ω| − 1 do
    if Δx*_{λ=0}[i] ≤ Δx*_{λ=∞}[i] then
        Δx*[i] ← min(λ/(aᵢᵀaᵢ) + Δx*_{λ=0}[i], Δx*_{λ=∞}[i]);
    else
        Δx*[i] ← max(−λ/(aᵢᵀaᵢ) + Δx*_{λ=0}[i], Δx*_{λ=∞}[i]);
end
Table 3.2: A sequence of transformations vectorizing algorithm 2

(Initial)
    If:   min(λ/(aᵢᵀaᵢ) + Δx*_{λ=0}[i], Δx*_{λ=∞}[i])
    Else: max(−λ/(aᵢᵀaᵢ) + Δx*_{λ=0}[i], Δx*_{λ=∞}[i])
    Condition: Δx*_{λ=0}[i] ≤ Δx*_{λ=∞}[i]

Apply max(a, b) ≡ −min(−a, −b):
    If:   min(λ/(aᵢᵀaᵢ) + Δx*_{λ=0}[i], Δx*_{λ=∞}[i])
    Else: −min(λ/(aᵢᵀaᵢ) − Δx*_{λ=0}[i], −Δx*_{λ=∞}[i])

Apply min(a, b) ≡ min(a ± c, b ± c) ∓ c:
    If:   min(λ/(aᵢᵀaᵢ), Δx*_{λ=∞}[i] − Δx*_{λ=0}[i]) + Δx*_{λ=0}[i]
    Else: −min(λ/(aᵢᵀaᵢ), Δx*_{λ=0}[i] − Δx*_{λ=∞}[i]) + Δx*_{λ=0}[i]

Apply (x if x ≥ 0, else −x) ⇔ |x| together with equation 3.1:
    Δx*_{λ=0}[i] + sign(Δx*_{λ=∞}[i] − Δx*_{λ=0}[i]) .* min(λ/(aᵢᵀaᵢ), |Δx*_{λ=∞}[i] − Δx*_{λ=0}[i]|)
    The expression Δx*_{λ=∞}[i] − Δx*_{λ=0}[i] serves as the conditioning
    statement.
3.3 Limitations of SIMD Processing
One has to be mindful of the limitations of floating-point SIMD
instructions. In most cases, processors expect memory arrays holding vector
operands to be aligned. Processors that can take in unaligned packed data
often process them at suboptimal speed. Vector division is poorly supported
because division hardware is area-intensive. Even if vector division is
available, the operation has long latency and low throughput, making
sequential use of vector division time-consuming. Support for
double-precision vector operations is available on most architectures, but
at half the throughput of single-precision computation. If the compiler is
not intelligent enough to vectorize code, one has to resort to
vendor-supplied intrinsics within C code.
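One hedged way to sidestep the alignment pitfall in C is to allocate vector operands with C11 aligned_alloc (helper name illustrative), rounding the size up to a multiple of the alignment as the standard requires.

```c
#include <stdlib.h>

/* Illustrative helper: a float array aligned for 4-way SIMD loads.
   C11 requires the allocation size to be a multiple of the alignment,
   hence the round-up; 16 bytes suits 128-bit NEON/SSE vectors. */
float *alloc_simd_array(size_t n) {
    size_t bytes = ((n * sizeof(float) + 15) / 16) * 16;
    return aligned_alloc(16, bytes);
}
```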
Chapter 4
Proposed BPDN Solver
To address the constraints of implementing a BPDN solver for real-time
embedded applications, the proposed solver is formulated such that
computationally intensive bottlenecks are amenable to SIMD vectorization,
efficient pipelined processing and data-flow parallelization. The use of
transcendental functions is eschewed and floating point divisions are kept
to a minimum. Economical memory usage is ensured by in-place manipulation
of A and judicious pre-computation of intermediate results.
4.1 Overview
At all times the solver maintains a prediction of the set of non-zero
entries in the sparse x, denoted Ω. At the beginning, x is initialized to a
rough estimate and Ω is updated from this estimate. Thereafter, the
algorithm alternates between two phases: during the first phase, Equation
1.3 is solved with the additional constraint that entries indexed by Ωc
are 0. In the second phase, Ω is intelligently updated based on the current
estimate of x, gradually introducing more correct non-zeros after each
iteration. The algorithm iterates until x converges (Algorithm 3).
Algorithm 3: Proposed Solver Overview
Input: y ∈ ℝᵐ, A ∈ ℝ^{m×n}, λ ∈ ℝ⁺ from Equation 1.3, number of sparse
entries num_nonzeros, convergence parameter η
Output: Optimal x* satisfying Equation 1.3
begin
    x ← (Aᵀy) ./ diag(AᵀA);
    [∼, sorted_index] ← sort(|x|, 1, 'descend');
    Ω ← sorted_index(1:num_nonzeros);
    idx ← sorted_index(1);
    while x has not converged do
        x ← EstimateFromNonzeros(y, A, x, λ, Ω, η);
        Ω ← GuessNonzeros(y, A, x, λ, Ω, idx, num_nonzeros);
    end
    return x;
end
4.2 Initializing x and Ω
An approximate solution to Equation 1.3 is x = (AᵀA)†Aᵀy. If the problem is
designed such that A has mutually incoherent columns (as would be the case
for compressive sensing applications), AᵀA would be strongly diagonal and
the pseudo-inverse can be approximated as diag(1 ./ diag(AᵀA)). The i-th
entry of x can then be efficiently computed as aᵢᵀy/(aᵢᵀaᵢ) using
vectorization. Ω is populated by picking the largest entries of x sorted by
magnitude.
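The initialization above can be sketched in C as follows (function name and layout assumptions mine: column-major A, plain loops standing in for the vectorized BLAS calls).

```c
#include <stddef.h>

/* Sketch of the Section 4.2 initialization: x[i] = (a_i . y) / (a_i . a_i)
   for each column a_i of a column-major A (m rows, n columns). */
void init_estimate(const float *A, const float *y, float *x, int m, int n) {
    for (int j = 0; j < n; j++) {
        const float *aj = A + (size_t)j * m;   /* column j of A */
        float num = 0.0f, den = 0.0f;
        for (int i = 0; i < m; i++) {
            num += aj[i] * y[i];               /* a_j . y   */
            den += aj[i] * aj[i];              /* a_j . a_j */
        }
        x[j] = num / den;
    }
}
```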
4.3 Phase I: Solving for x Given Ω
Given that only entries indexed by Ω are non-zero, x can be incrementally
solved by minimizing Equation 1.3 one variable at a time. Equation 1.3 is
differentiated with respect to the perturbation Δxᵢ (where xᵢ = xᵢ₀ + Δxᵢ,
i ∈ Ω, and y₀ = y − A(:,Ω) xΩ₀) to give:

    df/dΔxᵢ = −y₀ᵀaᵢ + aᵢᵀaᵢ Δxᵢ + λ sign(xᵢ₀ + Δxᵢ)    (4.1)
The optimal perturbation Δxᵢ* is obtained when Equation 4.1 vanishes or,
due to the gradient discontinuity introduced by the modulus term, when the
derivative changes sign at Δxᵢ = −xᵢ₀. Determining Δxᵢ* can be reasoned as
follows: setting λ = 0 would mean only the quadratic term is minimized,
giving Δx*_{i,λ=0} = y₀ᵀaᵢ/(aᵢᵀaᵢ); setting λ = ∞ would mean only the
modulus term is minimized, giving Δx*_{i,λ=∞} = −xᵢ₀. Thus Δxᵢ* lies
between Δx*_{i,λ=0} and Δx*_{i,λ=∞}, and can be computed from Equation 4.2.

    Δxᵢ* = min((λ + y₀ᵀaᵢ)/(aᵢᵀaᵢ), Δx*_{i,λ=∞}),   if Δx*_{i,λ=0} ≤ Δx*_{i,λ=∞}
    Δxᵢ* = max((−λ + y₀ᵀaᵢ)/(aᵢᵀaᵢ), Δx*_{i,λ=∞}),  otherwise    (4.2)

    Δx*_Ω = Δx*_{Ω,λ=0} + sign(Δx*_{Ω,λ=∞} − Δx*_{Ω,λ=0}) .*
            min(λ ./ diag(A(:,Ω)ᵀA(:,Ω)), |Δx*_{Ω,λ=∞} − Δx*_{Ω,λ=0}|)    (4.3)
By applying the identities max(a, b) ≡ −min(−a, −b) and
min(a, b) ≡ min(a − c, b − c) + c, Equation 4.2 can be vectorized to give
Equation 4.3. If the matrix A remains constant from problem to problem
(commonly the case for compressive sensing applications), the reciprocals
1/(aᵢᵀaᵢ) can be precomputed to avoid performing expensive divisions. The
optimal perturbations of every variable, Δx*_Ω, are aggregated to update x,
and this is repeated until convergence. Because one variable is perturbed
at a time, applying all changes at once does not yield the optimal
solution; the change is therefore weighted to ensure convergence, i.e.
xΩ ← xΩ + η Δx*_Ω, 0 < η < 1. Algorithm 4 summarizes what has been
described so far. Note that b is loop invariant and can be pre-computed in
the program preamble.
Algorithm 4: EstimateFromNonzeros(y, A, x, λ, Ω, η)
Input: y ∈ ℝᵐ, A ∈ ℝ^{m×n}, current estimate of x ∈ ℝⁿ, λ ∈ ℝ⁺ from
Equation 1.3, indices of sparse entries Ω, convergence parameter η
Output: Optimal solution to Equation 1.3 among the family of x with
non-zero entries in Ω
b ← diag(AᵀA);
begin
    c ← A(:,Ω)ᵀ y;
    while xΩ has not converged do
        Δx*_{Ω,λ=0} ← (c − (A(:,Ω)ᵀA(:,Ω)) xΩ) ./ bΩ;
        Δx*_{Ω,λ=∞} ← −xΩ;
        Δx*_Ω ← Δx*_{Ω,λ=0} + sign(Δx*_{Ω,λ=∞} − Δx*_{Ω,λ=0}) .*
                min(λ ./ bΩ, |Δx*_{Ω,λ=∞} − Δx*_{Ω,λ=0}|);
        xΩ ← xΩ + η Δx*_Ω;
    end
    xΩc ← 0;
    return x;
end
4.4 Phase II: Updating Ω Given x
The current Ω may not contain all the non-zero entries of the underlying
true solution, therefore phase two refines Ω based on the current x. The
gist is to optimize Equation 1.3 with respect to a pair (xᵢ, xⱼ) with the
rest held constant. A large magnitude for xᵢ or xⱼ indicates a higher
probability that the respective variable should be included in Ω. To solve
for (xᵢ, xⱼ), consider λ = 0, where Equation 1.3 simplifies to the
ℓ2-penalty f(xᵢ, xⱼ) = ½‖y − Ax‖₂². Letting b = −Aᵀy and
a₀ = Ax − aᵢxᵢ − aⱼxⱼ, the global minimum occurs at (x^ℓ2_i, x^ℓ2_j)
satisfying Equation 4.4, a linear system that can be easily solved.

    ∂f/∂xᵢ = ‖aᵢ‖² x^ℓ2_i + (aᵢ·aⱼ) x^ℓ2_j + a₀·aᵢ + bᵢ = 0
    ∂f/∂xⱼ = (aᵢ·aⱼ) x^ℓ2_i + ‖aⱼ‖² x^ℓ2_j + a₀·aⱼ + bⱼ = 0    (4.4)
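Equation 4.4 is a symmetric 2 × 2 linear system; a hedged C sketch (names mine) solves it by Cramer's rule, with the right-hand sides r = −(a₀·a + b) moved over.

```c
#include <math.h>

/* Solve the symmetric 2x2 system of Equation 4.4:
   [ g11 g12 ] [xi]   [ r1 ]    with g11 = |a_i|^2, g12 = a_i.a_j,
   [ g12 g22 ] [xj] = [ r2 ],   r1 = -(a0.a_i + b_i), r2 = -(a0.a_j + b_j).
   Returns 0 on success, -1 if the two columns are nearly collinear. */
int solve_pair(double g11, double g12, double g22,
               double r1, double r2, double *xi, double *xj) {
    double det = g11 * g22 - g12 * g12;
    if (fabs(det) < 1e-12) return -1;
    *xi = (r1 * g22 - g12 * r2) / det;
    *xj = (g11 * r2 - g12 * r1) / det;
    return 0;
}
```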
When the ℓ1 penalty term is included, the minimum point shifts from
(x^ℓ2_i, x^ℓ2_j) to (x^ℓ2ℓ1_i, x^ℓ2ℓ1_j) according to Equation 4.5.

    ‖aᵢ‖² x^ℓ2ℓ1_i + (aᵢ·aⱼ) x^ℓ2ℓ1_j + a₀·aᵢ + bᵢ + λ sign(x^ℓ1_i) = 0
    (aᵢ·aⱼ) x^ℓ2ℓ1_i + ‖aⱼ‖² x^ℓ2ℓ1_j + a₀·aⱼ + bⱼ + λ sign(x^ℓ1_j) = 0    (4.5)

If x^ℓ2_j and x^ℓ2ℓ1_j are opposite in sign, xⱼ is not a candidate for Ω.
xᵢ is fixed to be the entry with the largest magnitude from the
initialization in section 4.2, and xⱼ is independently solved for all
variables. Ω is subsequently updated by selecting the largest magnitudes
among x^ℓ2ℓ1. Algorithm 5 outlines what has been described. Due to loop
invariance, b, c and d can be pre-computed in the program preamble.
Algorithm 5: GuessNonzeros(y, A, x, λ, Ω, idx, num_nonzeros)
Input: y ∈ ℝᵐ, A ∈ ℝ^{m×n}, current estimate of x ∈ ℝⁿ, λ ∈ ℝ⁺ from
Equation 1.3, indices of sparse entries Ω, index of the entry with the
largest initial magnitude idx, number of sparse entries num_nonzeros
Output: Updated index set Ω of predicted non-zero entries
b ← AᵀA(:,idx);  c ← Aᵀy;  d ← diag(AᵀA);
begin
    a ← Aᵀ(A(:,Ω) xΩ);
    x^ℓ2 ← b .* (a_idx − (b .* x) − c_idx) − d_idx (a − (d .* x) − c);
    x^ℓ2ℓ1 ← x^ℓ2 + λ sign(x_idx) b − λ d_idx sign(x^ℓ2);
    x_nonsparse ← (x^ℓ2 .* x^ℓ2ℓ1) > 0;
    [∼, sorted_index] ← sort(|x_nonsparse .* x^ℓ2ℓ1|, 1, 'descend');
    Ω ← sorted_index(1:num_nonzeros);
    return Ω;
end
Figure 4.1 shows an animation of how the proposed algorithm converges. The
underlying BPDN solution is a randomly generated sparse vector with a
hundred thousand variables, a hundred non-zero entries and ten thousand
constraints.

Figure 4.1: Animation — Convergence towards the true solution (red boxes)

At the beginning of every major iteration (i.e. Phase II), the solver
introduces new non-zeros based on the current trial solution (coloured
solid red). After that, Phase I iterates till convergence. Note that some
of the introduced non-zeros correctly converge to the underlying solution
(red boxes), while the others converge to zero. At the end of the first
major iteration, there are a handful of unmatched non-zeros, i.e. red boxes
with no matching blue circles. The second Phase II therefore introduces
more non-zeros to guess the unmatched non-zeros. This alternation between
Phase I and II continues till the objective function converges.
4.5 Estimating the Sparsity of x
The solver requires knowing the sparsity of x, a piece of information that
most solvers ignore (except OMP and Homotopy). It is therefore reasonable
to question whether the problem sparsity can be determined beforehand. The
following are examples where the number of non-zeros is known due to the
structure of the problem.
4.5.1 Wright et al. [1]: Face recognition from a database
An input facial image needs to be identified from a database containing a
fixed number (k) of images per person. This is accomplished by encoding the
query image as y and the database images as the columns of A. The recovered
x is sparse because the face of a person correlates strongly with the few
matching images in the database, and the sparsity is given by k.
4.5.2 Wang et al. [2]: Three-dimensional arrangement of light sources in bioassays
Experimentation with biological assays requires detecting the
three-dimensional arrangement of fluorescent beads that are tagged to
molecular structures. The authors use m angle-sensitive pixels to image a
volume that is partitioned into n sections. The problem is to recover which
of these pockets contain a prominent light source, admitting a sparse
solution if the n partitions are interpreted as entries in x. Since the
number of fluorescent beads introduced into the assay is controllable, it
can serve as a proxy for the problem sparsity, with the number of beads
introduced proportional to the sparsity of the recovered x.
4.5.3 Elhamifar and Vidal [3]: Motion segmentation using sparse subspace clustering
Given several rigid bodies undergoing independent motions in front of an
affine camera, the image trajectories of n feature points across several
frames can be grouped as columns in A. Due to geometrical constraints,
trajectories of feature points residing on the same body are embedded
within a 3-dimensional affine space. The algorithm involves finding a
sparse correlation of every column of A with the others excluding itself,
and the expected sparsity is given by the dimension of the affine space.
Chapter 5
BPDN Solver Benchmark
The proposed solver is benchmarked against several state-of-the-art solvers
for run-time performance, recovery accuracy and peak memory usage. Default
settings recommended by the respective authors are used. MEX-files are used
where provided, or compiled where instructed. The benchmark runs on an
Intel Core i7-2620M (2.7 GHz) machine with 8 GB memory. MATLAB R2011b
(7.13.0.564, 64-bit) is used as the benchmarking environment.
5.1 List of Solvers
General-purpose solvers (i.e. CVX, Gurobi, GPLX) are omitted from the
benchmark since they are not customized to the intricacies of BPDN, and it
would thus be unfair to expect performance comparable to specialized
solvers. The following solvers are included in the benchmark.
5.1.1 ADMM LASSO [4]
Alternating Direction Method of Multipliers efficiently solves large scale
problems by processing sub-problems across distributed computing re-
sources.
5.1.2 CGIST [5]
Conjugate Gradient Iterative Shrinkage/Thresholding solves BPDN using a
forward-backward splitting method with an acceleration step. The test for
adjointness between A and AT is omitted because it would fail due to
finite machine precision.
5.1.3 FPC-BB [6]
Fixed-Point Continuation is advertised for large-scale image and data
processing. The solver uses Barzilai-Borwein steps to accelerate
convergence.
5.1.4 GLMNET [7]
Generalized Linear Model with elastic-net regularization is the reference
algorithm used to implement the MATLAB lasso function.
5.1.5 GPSR-BB6 [8]
Gradient Projection for Sparse Reconstruction uses special line search and
termination techniques to yield faster solutions compared to SparseLab
[41], ℓ1-MAGIC [40], the bound-optimization method [51] and interior-point
methods. Similar to FPC-BB, Barzilai-Borwein steps are used to accelerate
convergence.
5.1.6 Homotopy [9]
Homotopy refers to a class of methods that solve BPDN by solving a
sequence of intermediate problems with varying λ.
5.1.7 L1-LS [10]
L1-LS is an interior-point method for large-scale sparse problems, or dense
problems where A has structure admitting fast transform computations. The
preconditioned conjugate gradient algorithm is used to discover the search
direction.
5.1.8 OMP [11]
Although Orthogonal Matching Pursuit only approximately solves BPDN, its
recovery accuracy for Gaussian A is comparable to ℓ1-solvers and it has
excellent run-time performance. It is therefore an attractive candidate
for FPGA implementation.
5.1.9 SESOP_PACK [12]
Sequential Subspace Optimization solves large-scale smooth unconstrained
optimization problems.
5.1.10 SPAMS [13]
Sparse Modeling Software is a MATLAB toolbox for sparse recovery problems.
Its C++ library makes use of the Intel Math Kernel Library for floating
point computations. In this respect, SPAMS is a strong contender for
run-time performance.
5.1.11 SpaRSA2 [14]
Sparse Reconstruction by Separable Approximation is an iterative method
where each step is an optimization sub-problem involving a separable
quadratic term plus the sparsity-inducing term. This solver is recommended
for cases where the sub-problem can be efficiently solved.
5.1.12 TFOCS [15]
Templates for First-Order Conic Solvers provides a set of modules that can
be mixed-and-matched to create customized solvers.
5.1.13 TwIST2 [16]
Two-Step Iterative Shrinkage/Thresholding implements a nonlinear two-step
iterative version of the original iterative shrinkage/thresholding
procedure to provide faster convergence for ill-conditioned problems.
5.1.14 YALL1 [17]
Your Algorithms for L1 is a suite of solvers that use alternating direction
algorithms, with the option of enforcing joint sparsity among related
variables.
5.2 Test Input
For various m, n and s, the entries of A are drawn from the standard
normal distribution. Positions of the non-zero entries in x are randomly
picked, and their values follow a uniform distribution on the interval
(−1, 1). The ideal measurements y = Ax are corrupted by scaled Gaussian
noise of zero mean and 0.1 variance.
5.3 Run-Time Performance
The overall run-time complexity of the proposed solver is O(k1(mn + k2 s)),
where k1 is the number of iterations in Algorithm 3 and k2 is the number of
iterations in Algorithm 4. From Figure 5.1, it is evident that the proposed
solver has superior run-time performance over state-of-the-art BPDN solvers
due to its extensive use of matrix multiplication and vectorized
operations. For large problems, the proposed solver is at least ten times
faster than the next fastest solver.
[Plot: CPU run time (seconds, log scale) versus problem size, for (m, n, s)
ranging from (500, 5000, 5) to (10000, 100000, 100); solvers compared:
proposed, admm_lasso, cgist, fpc_bb, glmnet, gpsr_bb6, homotopy, l1ls, omp,
sesop_pack, spams, sparsa2, tfocs, twist2, yall1]
Figure 5.1: Run-time for various problem sizes, where A ∈ Rm×n and there are s non-zero entries in x
Table 5.1: Run-time profiling results on MATLAB

#  Time Taken (s)  % Total Time  Operation
1  14.771          43.2          Aᵀ(A(:,Ω) xΩ)
2  7.669           22.4          diag(AᵀA)
3  4.878           14.3          AᵀA(:,idx)
4  4.863           14.2          Aᵀy
5  0.554           1.6           A(:,Ω)ᵀA(:,Ω)
6  1.431           4.2           Rest of the code
5.4 Algorithm Bottleneck
The solver is profiled in MATLAB for the problem size m = 10000 and
n = 100000. Table 5.1 shows the run-time summary for 20 consecutive trials.
About 95% of the run-time is spent on matrix-matrix and matrix-vector
multiplications, which are excellent candidates for speed-up on
SIMD-capable embedded processors or for acceleration on programmable logic.
5.5 Accuracy of Recovered Results
For varying problem sparsity, x is recovered and debiased. Recovery
accuracy is expected to decrease as problem sparsity increases because the
problem progressively enters the ill-conditioned region of the
Donoho-Tanner phase transition diagram [52]. The error measure between the
debiased solution x_recovered and the underlying true solution x_actual is
given by ‖x_actual − x_recovered‖₂ / n. From Figure 5.2, the proposed
solver has superior recovery accuracy compared to all the other solvers.
[Plot: recovery error ‖x_actual − x_recovered‖₂/n (log scale) versus
problem sparsity, for m = 1500, n = 15000 and s from 5 to 290; same set of
solvers as Figure 5.1]
Figure 5.2: Recovery accuracy for various problem sparsity, where A ∈ Rm×n and there are s non-zero entries in x
5.6 Memory Usage
The proposed solver requires O(n + sm + s²) of additional memory besides
storing A and y, making the implementation memory efficient if the
underlying x is sparse. The solver does not require advanced linear
algebraic decompositions (i.e. LU, Cholesky or singular value
decomposition), hence there is no hidden memory requirement. Because the
solver caches pre-computed values, its memory footprint is not the lowest,
but it nevertheless remains highly competitive among the benchmarked
solvers (Figure 5.3). Due to the use of dynamic memory allocation in some
solvers, memory measurements fluctuate with increasing problem size,
whereas solvers that preallocate memory exhibit a monotonic increase in
memory usage.
[Plot: peak memory usage (kB, log scale) versus problem size, for (m, n, s)
ranging from (500, 5000, 5) to (10000, 100000, 100); same set of solvers as
Figure 5.1]
Figure 5.3: Peak memory usage for various problem sizes, where A ∈ Rm×n and there are s non-zero entries in x
[Plot: objective ½‖y − Ax_estimate‖₂² + λ‖x_estimate‖₁ (log scale) versus
inner-loop iteration, for η = 0.2, 0.4, 0.6, 0.8]
Figure 5.4: Convergence of proposed solver of test case m = 10000, n =100000 and s = 100 for various η
5.7 Convergence Properties of Solver
For the test case where m = 10000, n = 100000 and s = 100, the number of
outer iterations (k1) is around 3 regardless of η, whereas the number of
inner iterations (k2) within EstimateFromNonzeros ranges from 20 to 80 for
the first outer iteration as η decreases, and progressively decreases over
subsequent outer iterations (Figure 5.4). Given a well-tuned η, convergence
is achievable within 40 iterations.
Chapter 6
Solver Implementation on the Xilinx Zynq Z-7020
The solver is implemented on the ZedBoard development board (Figure 6.1),
comprising a Zynq XC7Z020-CLG484-1 All Programmable System-on-Chip with
512 MB of DDR3 memory. The Z-7020 chip features a dual-core ARM Cortex-A9
MPCore that is tightly coupled with Artix-7 FPGA fabric. Each core has
separate 32 kB L1 instruction and data caches, and both share a unified
512 kB L2 cache. Matrix and vector operations are efficiently handled by
BLAS libraries that use the NEON SIMD engine on board each CPU. Custom
hardware data-paths can be instantiated within the FPGA fabric if hardware
acceleration is necessary. The CPU is clocked at 667 MHz and the FPGA
fabric at 125 MHz. All implementations are benchmarked with respect to the
problem size m = 500, n = 5000, s = 75. The reference MATLAB solver takes
0.19 seconds to complete on the i7-2620M.
6.1 Porting MATLAB to C
The solver is ported to C code using the Embedded Coder v6.1 toolbox.

Figure 6.1: ZedBoard hardware development platform

On the Z-7020, the compiled executable occupies 93 kB of .text memory and
1.93 MB of .bss memory, and takes 1.84 seconds to run. Since matrix
multiplication has been identified as the run-time bottleneck, an improved
BLAS library is sought to replace the stock library provided by MATLAB.
6.2 Eigen BLAS Library
Eigen1 is an open-source library providing optimized assembly routines for
matrix operations. The library supports hardware vectorization on ARM
targets. Run-time benchmarks by its authors show that Eigen outperforms
the Intel Math Kernel Library for operations such as y ← αx + βy, y ← Ax
and Y ← AAᵀ, therefore this library is chosen to replace MATLAB's BLAS
library. The Eigen-compiled executable occupies 29 kB of .text memory and
40 kB of .bss memory, and takes 0.30 seconds to complete. The compact
.text program size allows the solver to be fully loaded within the L1
instruction cache, ensuring fast program execution without expensive
memory fetches. The pre-computed .bss data structures used by the solver
also fit economically within the L2 cache.

1 http://eigen.tuxfamily.org/
6.3 Accelerating A(:,Ω)ᵀA(:,Ω) Using Programmable Logic
Table 6.1 shows the run-time summary of the solver running on a single
Cortex-A9 CPU without FPGA acceleration; 89% of the overall run-time
is spent executing Eigen library code. Further profiling using the Snoop
Control Unit's global timer reveals that the matrix operations Aᵀ(A(:,Ω)xΩ)
and A(:,Ω)ᵀA(:,Ω) occupy 34% and 55% of the run-time respectively.
Aᵀ(A(:,Ω)xΩ) cannot be further accelerated because for every entry of A
read from memory, only one multiply-and-accumulate (MAC) is performed,
making the operation I/O-bound. A ballpark estimate illustrates the
problem: the maximum read bandwidth from DDR memory to the
programmable logic over an AXI_HP interface is 1.2 GB/s [53, §22.3], so
it takes 8.3 ms to deliver A to the programmable logic. The average
number of fetches of A per run is around 10, giving a paltry speed-up of
1.23×. It is possible in principle to utilize all four AXI_HP buses on the
Z-7020 to achieve a 4.9× speed-up, but doing so would deprive other
hardware modules of the routing resources needed to interface with the
CPU. In contrast, A(:,Ω)ᵀA(:,Ω) is compute-bound because s(s+1)/2
MACs have to be performed for every s entries read, making it an
excellent candidate for FPGA acceleration.

Table 6.1: Run-time profiling results on the Z-7020 using Gprof

  % Total Time   Function
  45.8           Eigen::general_matrix_vector_product
  40.9           Eigen::gebp_kernel
  10.4           Preamble of fastBPDN
   1.47          Eigen::gemm_pack_lhs
   0.95          Eigen::gemm_pack_rhs
   0.48          Rest of the code
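The ballpark arithmetic above can be reproduced directly. A minimal sketch, assuming the benchmark instance m = 500, n = 5000 with single-precision (4-byte) entries streamed over one AXI_HP port; the 1.23× figure additionally depends on the measured software run-time, so only the transfer cost is reproduced here:

```python
# Back-of-envelope check of the AXI_HP bandwidth estimate (illustrative).
AXI_HP_BW = 1.2e9            # B/s, max read bandwidth into programmable logic
m, n = 500, 5000             # benchmark problem size

bytes_A = m * n * 4                       # 10 MB for the whole of A
transfer_ms = bytes_A / AXI_HP_BW * 1e3   # ~8.3 ms to deliver A once
per_run_ms = 10 * transfer_ms             # ~10 fetches of A per solver run
```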
6.3.1 Multiply-And-Accumulate Hardware Engine
Figure 6.2 shows a hardware engine capable of parallelizing A(:,Ω)ᵀA(:,Ω)
by pipelining 9 MACs per clock cycle. The matrix A(:,Ω)ᵀA(:,Ω) is
partitioned into 3 × 3 sub-matrices, and the engine computes all elements
of a sub-matrix in parallel. Since A(:,Ω)ᵀA(:,Ω) is symmetric, sub-matrices
lying on and above the main diagonal are computed on the FPGA, and
entries beneath the diagonal are populated by the CPU, which replicates
the above-diagonal entries. Inputs from A(:,Ω) are served over the
AXI_ACP bus and results are written back over the same interface.
Transactions over the AXI_ACP are cache coherent because the bus has
access to the Snoop Control Unit (SCU) that governs both caches. Prior
to operation, the engine is configured by the CPU over the AXI_GP
interface.
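This work split can be sketched in software. The following is an illustrative model, not the author's RTL: blocks on and above the diagonal of G = A(:,Ω)ᵀA(:,Ω) are computed (the FPGA's job), then mirrored below the diagonal (the CPU's job).

```python
B = 3  # sub-matrix size, matching the 3 x 3 engine

def gram_symmetric(A):
    """A: m x s matrix as a list of rows, with s a multiple of B."""
    m, s = len(A), len(A[0])
    G = [[0.0] * s for _ in range(s)]
    for bi in range(0, s, B):
        for bj in range(bi, s, B):            # on- and above-diagonal blocks
            for i in range(bi, bi + B):
                for j in range(bj, bj + B):   # 9 parallel MACs in hardware
                    G[i][j] = sum(A[k][i] * A[k][j] for k in range(m))
    for i in range(s):
        for j in range(i):                    # CPU mirrors the upper triangle
            G[i][j] = G[j][i]
    return G
```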
6.3.2 Detailed Operation
In the first stage of operation, the MAC engine populates internal BRAM
with entries from A(:,Ω). It makes sense to stage A(:,Ω) since every entry will
be used s times, thus minimizing the amount of reads from the slow DDR
memory. The transfer occurs over the AXI_ACP bus interface which
arbitrates transfers between the DDR memory and the programmable logic.
Burst transfer commands are issued for this initial transaction to ensure
high throughput.

Figure 6.2: Hardware engine comprising 9 floating-point MAC units, fed
by three dual-ported BRAM banks over the AXI_ACP bus
The BRAMs are partitioned into three banks so as to independently serve
each of the nine multipliers. The columns of A(:,Ω) are stored in
alternating banks: A(:,1), A(:,4), . . . , A(:,3k+1) reside in bank 1, A(:,2),
A(:,5), . . . , A(:,3k+2) in bank 2 and A(:,3), A(:,6), . . . , A(:,3k+3) in bank 3.
Since computing an off-diagonal submatrix requires two independent read
accesses from each bank, while an on-diagonal one requires a single access,
the BRAMs are configured to be dual-ported. The submatrices are then
processed in row-major order. For every submatrix, nine MAC operations
are executed in parallel over m consecutive entries from the BRAM.
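The interleaving and the dual-port requirement can be checked with a small sketch (illustrative only): column j of A(:,Ω) sits in bank ((j − 1) mod 3) + 1, so each bank supplies at most two columns per sub-matrix.

```python
def bank_of(j):
    """1-based column index of A(:,Omega) -> BRAM bank 1, 2 or 3."""
    return (j - 1) % 3 + 1

def reads_per_bank(bi, bj):
    """Columns each bank must serve while the engine works on the sub-matrix
    pairing block-column bi with block-column bj (0-based block indices)."""
    cols = {3 * bi + t for t in (1, 2, 3)} | {3 * bj + t for t in (1, 2, 3)}
    counts = {1: 0, 2: 0, 3: 0}
    for c in cols:
        counts[bank_of(c)] += 1
    return counts
```

An off-diagonal block needs two simultaneous reads per bank, an on-diagonal block only one, which is exactly why dual-ported BRAM suffices.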
Parallelization can be extended to processing p × p submatrices using p²
multipliers, with the BRAMs divided into p banks, the i-th bank storing
entries from A(:,pk+i). A larger speed-up can be achieved, but the
disadvantages are two-fold. Firstly, more DSP resources are needed (5
DSP48E1 slices are required to synthesize a single-precision MAC), and
timing closure becomes harder to achieve because a larger mesh demands
more routing resources and therefore invites congestion. Secondly, a larger
degree of parallelization leads to poorer work efficiency because more
redundant entries are computed, evident from the growing number of
redundant entries (coloured purple in Figure 6.3) as the partition size
increases.
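The work-efficiency trade-off admits a closed form. A sketch of the counting argument: with p × p sub-matrices of the s × s symmetric result (s a multiple of p), each diagonal block computes its strictly-lower triangle redundantly, giving s(p − 1)/2 redundant entries in total.

```python
def redundant_entries(s, p):
    """Redundantly computed entries for p x p partitioning; equals s*(p-1)/2."""
    return (s // p) * (p * (p - 1) // 2)

def work_efficiency(s, p):
    useful = s * (s + 1) // 2                 # entries actually needed
    return useful / (useful + redundant_entries(s, p))
```

For the benchmark size s = 75 this gives roughly 0.97 efficiency at p = 3 but only about 0.51 at p = 75, matching the growth of purple entries across the panels of Figure 6.3.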
Each MAC output is demultiplexed to a bank of 8 registers (Figure 6.4)
because each MAC has a finite latency of 5 clock cycles before producing
a result. By spreading the accumulation across 8 registers, the multiplier
mesh can operate at maximum throughput without data hazards. A
post-processing step combines the 8 partial sums before the result is
written back to memory over the AXI_ACP bus.
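A behavioural sketch of this latency-hiding trick (illustrative, not the RTL): with a 5-cycle result latency, a single accumulator could accept a new product only every 5 cycles, so successive products round-robin into 8 partial-sum registers, selected by the 3-bit up-counter, and are combined at the end.

```python
MAC_LATENCY = 5
N_REGS = 8                           # 8 >= MAC_LATENCY, so no data hazard

def pipelined_dot(xs, ys):
    regs = [0.0] * N_REGS
    for t, (x, y) in enumerate(zip(xs, ys)):
        regs[t % N_REGS] += x * y    # each register is revisited every 8 cycles
    return sum(regs)                 # post-processing: combine partial sums
```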
6.3.3 Specification of MAC Engine
Figure 6.5 and Figure 6.6 show the post-routed design comprising both
the Cortex-A9 CPU and the MAC engine, and Table 6.2 details the engine
specifications. The MAC engine provides a seven-fold speed-up over the
software implementation of A(:,Ω)ᵀA(:,Ω), while utilizing a modest 7% of
the slice logic available on the Z-7020. Although it is possible to increase
the hardware speed-up by enlarging the sub-matrix size, a larger multiplier
mesh would have to be synthesized. By running the accelerator in parallel
with CPU code, the solver takes 0.14 seconds to complete the test instance,
a speed-up of 26% over the i7-2620M. Although the speed-up may seem
modest, keep in mind that the clock frequency of the Cortex-A9 on board
the Z-7020 (667 MHz) is a quarter of that of the i7-2620M (2.7 GHz), and
the power consumption of the i7-2620M (35 W²) is 114 times that of the
Z-7020 (305 mW³). Hence, the solver implementation on the Z-7020 has
been optimized for low-power applications without sacrificing run-time
performance. Note that the Z-7020's power figure is pessimistic, as the
hardware logic is only active for 17% of the final run-time; clock gating
can be used to power down FPGA logic during inactivity.

Figure 6.3: Submatrix partitioning of A(:,Ω)ᵀA(:,Ω) for sizes 1 × 1, 3 × 3,
5 × 5, 15 × 15, 25 × 25 and 75 × 75. Redundantly computed entries are
coloured purple.

Figure 6.4: Output of every MAC being demultiplexed to a bank of 8
registers, selected by a 3-bit up-counter, with inputs from the BRAMs and
output to the AXI_ACP bus

² http://ark.intel.com/products/52231
³ http://www.arm.com/products/processors/cortex-a/cortex-a9.php
Figure 6.5: Post-routed layout on the Z-7020 using Vivado 2013.2
(Legend: yellow – MAC engine, brown – ARM Cortex-A9 and DDR3
memory bus, green – AXI_ACP interconnect logic, blue – AXI_GP
interconnect logic, red – reset logic)
Figure 6.6: System module schematic (Legend: processing_system7_1 –
ARM Cortex-A9, matmul_1 – hardware MAC engine, axi_mem_intercon –
AXI_ACP bus logic, processing_system7_1_axi_periph – AXI_HP bus
logic, proc_sys_reset – reset logic)
Table 6.2: Hardware MAC engine specifications

  Operating Frequency            125 MHz
  Power                          55 mW

  Resource Usage
    Slice Logic                  13735 (7%)
    Block RAM (RAMB36E1)         48 (34%)
    DSP Slices (DSP48E1)         46 (21%)

  Timing
    Speed-Up                     7.00×
    Initiation Interval          224680 cycles
    Latency                      224679 cycles
    Worst Negative Slack         0.280 ns
    Worst Hold Slack             0.052 ns
    Worst Pulse Width Slack      2.750 ns
6.4 Remaining Bottleneck
After applying FPGA acceleration, the remaining bottleneck of the
embedded ℓ1-solver is the computation of Aᵀ(A(:,Ω)xΩ), which takes up
75% of the run-time. As this operation is I/O-bound, FPGA acceleration
is ineffective because the memory bandwidth between the DDR memory
and the programmable logic is a mere 4.8 GB/s. A computational platform
with a larger memory bandwidth is required, which leads to the choice of
the graphics processing unit (GPU) as an alternative implementation
platform for the solver.
Chapter 7
Solver Implementation on NVIDIA CUDA GPU Architecture
One defining feature of the GPU is its large memory bandwidth that services
the many processing cores (Table 7.1). Given that a GPU comprises many
shader cores running in parallel, each possibly accessing different texture
memory locations, memory bandwidth has to scale up to keep pace with
the memory transactions issued by every core.
In this respect, the GPU excels at I/O-bound tasks thanks to its large
memory bandwidth. For example, the matrix-vector multiplication Ax
Table 7.1: Comparison of memory bandwidth for various hardware systems

  Name                 Category                  Launch  Process  Standard  Bandwidth (GB/s)
  Zynq-7100 AP SoC     FPGA with integrated CPU  Mar-13  28 nm    DDR3      10.7
  KeyStone 66AK2H12    DSP with integrated CPU   Nov-12  28 nm    DDR3      12.8
  Intel Core i7-4770R  CPU with integrated GPU   Jun-13  22 nm    DDR3L     25.6
  GeForce GTX 780 Ti   GPU                       Nov-13  28 nm    GDDR5     336
involves fetching every entry of A from memory to the processor, thus the
execution speed is limited by the amount of memory bandwidth available
to deliver every entry to the execution cores.
A natural extension from parallel computing with SIMD instructions is to
perform vectorized operations on the GPU. The typical GPU has multiple
shader cores, extending the 4-way SIMD processing of the CPU to n-way
parallel computing. For example, the embedded ARM Mali-450 MP
graphics processor has 8 fragment processors, allowing 8-way
general-purpose floating-point processing; at the upper end, NVIDIA's
GRID K2 boasts 3072 thread processors.
7.1 Comparisons Between GPU and FPGA Architectures
Both architectures are similar in terms of the hardware parallelism. The
GPU has multiple stream processors which execute code in parallel, whereas
the FPGA can instantiate multiple hardware blocks that run in parallel. A
striking difference would be that the GPU has a massive amount of memory
bandwidth compared to state-of-the-art commercial FPGAs. For a start,
the NVIDIA M2050 features the GDDR5 memory standard, whereas Xilinx
supports up to DDR4 under its Kintex and Virtex UltraScale products.
Wide memory buses of the kind commonly found on GPUs are also not
available on smaller FPGA devices; the Xilinx Zynq-7000 series, for
example, has at most 128 pins dedicated to the DDR interface.
7.2 I/O-Boundedness of Aᵀ(A(:,Ω)xΩ)
We have seen in chapter 6 that the FPGA is not suited to accelerating
I/O-bound operations. One such operation is the matrix-vector
multiplication Aᵀ(A(:,Ω)xΩ). To see why this is I/O-bound, consider
Figure 7.1. Every entry of A is used only once, so the speed of computation
is limited by how fast the processor can fetch the entire A. Since this
matrix may be large, especially when the optimization problem has many
variables and constraints, it is often stored off-chip. The computational
speed is limited by the bandwidth between the processor and off-chip main
memory, and the situation does not benefit from data caching as entries
are only read once.
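The contrast with chapter 6 can be quantified in MACs performed per matrix entry fetched; the following sketch assumes the benchmark sizes m = 500, n = 5000, s = 75.

```python
m, n, s = 500, 5000, 75

# GEMV A' * v: all m*n entries of A are read once, one MAC per entry.
gemv_intensity = (m * n) / (m * n)                 # 1 MAC per entry fetched

# Gram matrix A(:,Omega)' * A(:,Omega): m*s entries read,
# m * s(s+1)/2 MACs performed, i.e. (s+1)/2 MACs per entry fetched.
gram_intensity = (m * s * (s + 1) / 2) / (m * s)   # 38 MACs per entry
```

One MAC per fetched entry leaves the GEMV starved for memory bandwidth, whereas 38 MACs per entry keeps the Gram-matrix engine busy between fetches.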
7.3 Accelerating the Level 2 BLAS Operation (GEMV) on CUDA

The Level 2 BLAS operation in question is the generalized matrix-vector
multiplication GEMV (Equation 7.1). This subroutine is often used as a
primitive (together with GAXPY and GEMM) by advanced linear algebra
libraries such as LAPACK. Since the solver implementation on the FPGA
is I/O-limited on this particular operation, with the matrix A and the
vector A(:,Ω)xΩ as inputs, it makes sense to explore ways to speed it up
on the GPU by optimizing the GEMV primitive.
y ← αAx + βy (7.1)
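A plain-Python reference for Equation 7.1 is useful as ground truth when checking any partitioned or accelerated GEMV; this is a minimal sketch, not a tuned implementation.

```python
def gemv(alpha, A, x, beta, y):
    """y <- alpha*A*x + beta*y, with A a list of rows."""
    return [alpha * sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + beta * y_i
            for row, y_i in zip(A, y)]
```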
Figure 7.1: Animation — computation of Aᵀ(A(:,Ω)xΩ). The rectangle in
the middle is Aᵀ, and the column vector on the right is A(:,Ω)xΩ. Blue
represents read and red represents write operations in the respective
memory locations.
7.4 Problem Partitioning
The matrix-vector operation Aᵀ(A(:,Ω)xΩ) is partitioned into sub-problems
that are independently processed by a one-dimensional kernel grid of n/16
thread blocks (assuming that m and n are multiples of 16), each block a
two-dimensional array of 16 × 16 CUDA threads (Figure 7.2). Aᵀ is
partitioned into 16 × 16 submatrices, and A(:,Ω)xΩ and the result vector
are partitioned into 16 × 1 vectors. Denoting Aᵀ as M and A(:,Ω)xΩ as v,
the i-th thread block processes Equation 7.2.
y = Mv, with M partitioned into 16 × 16 blocks M16i−15:16i, 16j−15:16j and
v into 16 × 1 blocks v16j−15:16j, so that

    y16i−15:16i = Σ_{k=1..m/16} M16i−15:16i, 16k−15:16k v16k−15:16k        (7.2)

and, entry-wise, the p-th element of the i-th block is

    y16(i−1)+p = Σ_{q=1..16} Σ_{k=1..m/16} M16(i−1)+p, 16(k−1)+q v16(k−1)+q   (7.3)
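The partitioning in Equation 7.2 can be sketched serially; the 16 × 16 tile is shrunk to 4 × 4 here to keep the example small, and the result is checked against a direct evaluation.

```python
T = 4  # tile size standing in for 16

def blocked_matvec(M, v):
    """M: n x m matrix (list of rows), v: length-m vector; n, m multiples of T."""
    n, m = len(M), len(v)
    y = [0.0] * n
    for bi in range(0, n, T):          # one 'thread block' per output tile
        for bk in range(0, m, T):      # the k-summation of Equation 7.2
            for p in range(T):
                y[bi + p] += sum(M[bi + p][bk + q] * v[bk + q]
                                 for q in range(T))
    return y
```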
7.5 Thread Block Operation
Figure 7.3 shows the partitioned operation at the level of the CUDA
thread block. Every thread block is in charge of computing the dot product
of a row vector of 16 × 16 submatrices with a column vector of 16 × 1
vectors. Boxes in blue show memory elements that reside on the off-chip
DDR
Figure 7.2: Animation — Aᵀ(A(:,Ω)xΩ) partitioned into a one-dimensional
kernel grid of thread blocks, each comprising 16 × 16 threads
Figure 7.3: Organization of the problem at the thread block level
memory. Green boxes are on-chip shared memory that can be found in every
streaming multiprocessor (SM). Since shared memory has lower latency
and higher throughput compared to DDR memory, it is common to stage
data structures onto shared memory prior to operation.
Red boxes are the register storage of every CUDA thread. Reading and
writing the register file is faster than shared memory access, but a thread
cannot access registers residing in other threads. Therefore, threads write
to shared memory to make their data accessible to threads residing within
the same thread block. The only method for threads of different blocks to
share data is to use off-chip DDR memory.
7.5.1 Staging Data onto Shared Memory
Every entry of the input vector is used 16 times, incurring at least 16
memory transactions for every element. Since transactions with the off-chip
DDR memory are highly inefficient, the 16 × 1 input vector is staged
onto the shared memory so that the cost of an off-chip memory read can
be amortized across 16 fast shared memory reads. Since there are only
16 entries to be fetched, the first 16 threads of the block simultaneously
launch a 64B coalesced memory read transaction from the off-chip memory,
with the remaining 240 threads idle; Figure 7.4 illustrates this step.

Figure 7.4: Animation — serialized visualisation of staging the 16 × 1
partitions of A(:,Ω)xΩ onto the shared memory. Yellow indicates the
thread in charge of the transfer

Note that although the animation is shown serialized, the 16 threads
concurrently fetch all 16 entries from off-chip memory into shared memory.
7.5.2 Multiply and Accumulate
Every thread within a block performs a multiplication between an entry
in the shared memory and another from the off-chip memory (Figure 7.5).
The products are accumulated in the register of the (p, q)-th thread
according to the inner summation of Equation 7.3. Assuming that A is
stored in
column-major order, since consecutive warps of threads read consecutive
memory on the off-chip memory, the GPU recognizes these coherent access
patterns and will issue 16 64B coalesced memory transactions instead of
256 separate instructions, making this operation highly I/O-efficient. Also
note that every entry of A is used only once, hence there is no need to stage
the data onto shared memory.
Figure 7.5: Animation — Multiply and accumulate operation
Figure 7.6: Animation — contents of the registers of every thread are
copied into shared memory
7.5.3 Copy to Shared Memory
Since threads cannot access the registers of other threads, all the threads
within the same block write their accumulated products into a 16 × 16
array in shared memory. Figure 7.6 shows a serialized visualization of the
operation; in actual operation, the writes to shared memory occur
simultaneously across all 256 threads.
7.5.4 Final Summation
The first 16 threads perform the outer summation of Equation 7.3 based
on the inputs that have been copied into shared memory during the
previous step. The accumulated results are stored within the thread
registers.
Figure 7.7 shows the serialized operation of this step; in the actual
operation the 16 threads run concurrently.

Figure 7.7: Animation — the first 16 threads performing the final
summation

Figure 7.8: Animation — writing the final accumulated result to off-chip
memory
7.5.5 Writing Results to DDR Memory
Each of the 16 threads writes the final accumulated result to off-chip memory.
As the memory locations are consecutive, a single 64B coalesced memory
transaction is issued instead of 16 separate instructions. Figure 7.8 shows a
serialized animation of the operation. Note that in the actual hardware the
write operation occurs simultaneously.
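The five steps above can be emulated serially. The following sketch shrinks the 16 × 16 block to 4 × 4 and is illustration only: real CUDA threads execute each phase concurrently.

```python
T = 4  # stands in for the 16 x 16 thread block

def thread_block(M, v, bi):
    """Compute the length-T tile of y = Mv starting at row bi, as one block would."""
    acc = [[0.0] * T for _ in range(T)]            # per-thread registers
    for bk in range(0, len(v), T):
        shared_v = v[bk:bk + T]                    # 7.5.1: stage a tile of v
        for p in range(T):
            for q in range(T):                     # 7.5.2: thread (p, q) MACs
                acc[p][q] += M[bi + p][bk + q] * shared_v[q]
    shared = [row[:] for row in acc]               # 7.5.3: registers -> shared
    partial = [sum(shared[p]) for p in range(T)]   # 7.5.4: first T threads sum
    return partial                                 # 7.5.5: coalesced write-back
```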
7.6 Parallelism
Exposing parallelism serves two purposes. Firstly, it allows the CUDA
hardware to efficiently schedule sub-problems onto SMs that are blocked
on the currently executing warp, whether by pending memory transactions,
block synchronization or read-after-write register dependencies. Secondly,
code performance scales with the number of physical processors available
on the hardware.
There are two levels of parallelism exposed by how the problem has
been partitioned in section 7.4. At the kernel grid level, each thread block
progresses independently of one another. Within the underlying hardware
implementation, thread blocks are executed on SMs, the number of which
depends on the hardware version. Therefore the same code running on a
later generation of hardware achieves more speed-up, owing to the larger
number of physical SMs, than when executed on older hardware.
The second type of parallelism arises at the thread level. Every thread
within the block executes independently of one another, except when they
meet at synchronization points. Threads are mapped onto physical stream
processors within every SM, the actual number depending on the hardware
version. For example, the NVIDIA GT200 architecture contains 8 shader
processors and 2 special function units per SM, whereas the more advanced
GF100 series has 32 shaders and 4 special function cores. A block does not
require one physical shader per thread, because threads are executed in a
round-robin fashion.
One expects computations to be faster on an architecture which has more
processors per SM.
Although there are sections where only 16 threads participate in the
processing (subsection 7.5.1 and subsection 7.5.5), this does not imply that
other threads of the same block are idling. Instead, inactive threads are
swapped out and the multiprocessor schedules threads of other blocks to
run. Since the processor schedules threads in multiples of 32 [54, §2.1],
only 16 threads idle during these phases. This idling can be avoided by
allowing a full warp to stage data onto shared memory and write data to
device memory, albeit with additional code that does not translate into a
significant speed-up.
7.7 Hardware Benchmark
To see how well the proposed GEMV compares with cuBLAS, both codes
are run on the Amazon AWS EC2 cg1.4xlarge hardware instance, backed
by two Intel Xeon X5570 CPUs and two NVIDIA Tesla M2050 GPUs. The size
of A is fixed at 295 million entries. The aspect ratio of A, defined as the
width-to-height ratio, is varied from 1 to around 33. The thread block
aspect ratio, q/p, is varied between 1 and 8. For each configuration the
run-time average and standard deviation of 16 trials are tallied.
7.7.1 Hardware Environment
The NVIDIA Tesla M2050 GPU Computing Module features 448 thread
processors grouped into 14 SMs of 32 cores each, each running at a clock
speed of 1.15 GHz. The entire unit provides a peak processing power of
1.03 TFLOPS¹. A 3 GiB graphics memory services all the processors
through a 384-bit GDDR5 bus interface at a clock speed of 1.55 GHz,
giving an aggregate bandwidth of 148.4 GB/s. The hardware module is
offered by Amazon Web Services Elastic Compute Cloud (EC2) under the
cg1.4xlarge instance type.

¹ Counting a single-precision fused multiply–add as one operation
Figure 7.9: GEMV speedup of proposed over cuBLAS (plotted along the
z-axis) and the corresponding Zynq-7020 implementation (plotted as a
color map). Error bars annotate the 95% confidence interval at every data
point.
7.7.2 Results
Figure 7.9 shows the run-time speedup of the proposed method over
cuBLAS and over the implementation on the Zynq-7020. For all
configurations the proposed method consistently performs at least twice as
fast as cuBLAS, and as much as 2.5 times as fast for a thread block aspect
ratio of 1.23. The speedup factor over the FPGA implementation is an
impressive 702×. Performance is consistent over different matrix aspect
ratios.
The memory transfer efficiency is computed for all test cases based on
the maximum memory bandwidth of 148.4 GB/s advertised by NVIDIA.
From Figure 7.10, the proposed method consistently uses no less than 50%
of the
Figure 7.10: Memory transfer efficiency with respect to the advertised
peak memory bandwidth
memory bandwidth, with a peak utilization of 68%. A possible reason for
not achieving maximum throughput is the overhead of context switching
between different kernels on the same physical SM. The proposed method
is not compute-bound, because the peak computational efficiency is only
4.9% (Figure 7.11).
Figure 7.11: Computational efficiency with respect to the advertised peak
performance
Chapter 8
Conclusion
With advances in chip technology, parallel computing structures are
increasingly ubiquitous in modern embedded processors. Examples of
embedded architectures that already feature SIMD instructions include
ARM (NEON), the Power Architecture (AltiVec) and the Intel Atom
(SSE). Hybrid CPU-FPGA system-on-chips, such as the Xilinx Zynq-7000
Extensible Processing Platform and Altera's Hard Processor System, are
also becoming the norm. Therefore, algorithms intended for execution on
an embedded target should be designed with as much data-flow parallelism
as possible, so as to exploit such parallel hardware.
8.1 Proposed ℓ1-Solver and FPGA Implementation
Compared to state-of-the-art solvers, the proposed solver exhibits superior
run-time performance by formulating compute-intensive routines as
matrix-matrix and matrix-vector multiplications, both of which are
efficiently handled by BLAS libraries. Since these libraries are tuned to
use architecture-specific SIMD instructions, computations execute close to
peak efficiency. The program code running on the Cortex-A9 CPU is
economical enough to fit within the L1 cache, and its data structures
within the L2 cache. The bottleneck of the solver is implemented in
programmable logic, which achieves a seven-fold speed-up over software
running on the embedded processor. Without sacrificing run-time
performance, the embedded implementation on the Z-7020 is at least 114
times as power-efficient as the MATLAB prototype on the i7-2620M.
8.2 GPU Acceleration of GEMV
Also highlighted is a major limitation of the FPGA architecture: it
performs poorly on I/O-bound operations. In this respect, GPU
architectures have the upper hand due to their larger memory bandwidths.
The remaining bottleneck of the solver, namely the GEMV operation
Aᵀ(A(:,Ω)xΩ), has been accelerated on the NVIDIA Tesla M2050 GPU
Computing Module. The proposed parallel algorithm to speed up GEMV
is at least twice as fast as the proprietary cuBLAS library, and provides
an overall speedup of 702× over the respective FPGA implementation.
Bibliography
[1] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Yi Ma. Robust facerecognition via sparse representation. Pattern Analysis and MachineIntelligence, IEEE Transactions on, 31(2):210–227, 2009. ISSN 0162-8828. doi: 10.1109/TPAMI.2008.79.
[2] A. Wang, P.R. Gill, and A. Molnar. Fluorescent imaging and localiza-tion with angle sensitive pixel arrays in standard cmos. In Sensors,IEEE, pages 1706–1709, 2010. doi: 10.1109/ICSENS.2010.5689914.
[3] E. Elhamifar and R. Vidal. Sparse subspace clustering. In ComputerVision and Pattern Recognition. CVPR. IEEE Conference on, pages2790–2797, 2009. doi: 10.1109/CVPR.2009.5206547.
[4] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
[5] Tom Goldstein and Simon Setzer. High-order methods for basis pursuit.Methods, pages 1–17, 2011.
[6] Elaine T Hale, Wotao Yin, and Yin Zhang. Fixed-point continuationfor �1-minimization: Methodology and convergence. SIAM Journal onOptimization, 19(3):1107–1130, 2008.
[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. glmnet: Lassoand elastic-net regularized generalized linear models. R package version,1, 2009.
[8] M. A T Figueiredo, R.D. Nowak, and S.J. Wright. Gradient projectionfor sparse reconstruction: Application to compressed sensing and otherinverse problems. Selected Topics in Signal Processing, IEEE Journal of,1(4):586–597, 2007. ISSN 1932-4553. doi: 10.1109/JSTSP.2007.910281.
[9] M Salman Asif and Justin Romberg. Fast and accurate algorithmsfor re-weighted l1-norm minimization. arXiv preprint arXiv:1208.0651,2012.
[10] Kwangmoo Koh, S Kim, and S Boyd. l1_ls: A MATLAB solver for large-scale l1-regularized least squares problems. Stanford University, 2007.
[11] Y.C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matchingpursuit: recursive function approximation with applications to waveletdecomposition. In Signals, Systems and Computers, Proceedings of27th Asilomar Conference on, pages 40–44 vol.1, 1993. doi: 10.1109/ACSSC.1993.342465.
[12] Guy Narkiss and Michael Zibulevsky. Sequential subspace optimizationmethod for large-scale unconstrained problems. Technion-IIT, Depart-ment of Electrical Engineering, 2005.
[13] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozin-ski. Optimization with sparsity-inducing penalties. arXiv preprintarXiv:1108.0775, 2011.
[14] S.J. Wright, R.D. Nowak, and M. A T Figueiredo. Sparse reconstructionby separable approximation. Signal Processing, IEEE Transactionson, 57(7):2479–2493, 2009. ISSN 1053-587X. doi: 10.1109/TSP.2009.2016892.
[15] Stephen R Becker. Practical Compressed Sensing: modern data ac-quisition and signal processing. PhD thesis, California Institute ofTechnology, 2011.
[16] J.M. Bioucas-Dias and M. A T Figueiredo. A new twist: Two-stepiterative shrinkage/thresholding algorithms for image restoration. Im-age Processing, IEEE Transactions on, 16(12):2992–3004, 2007. ISSN1057-7149. doi: 10.1109/TIP.2007.909319.
[17] Junfeng Yang and Yin Zhang. Alternating direction algorithms for �1-problems in compressive sensing. SIAM journal on scientific computing,33(1):250–278, 2011.
[18] J. Wright, Yi Ma, J. Mairal, G. Sapiro, T.S. Huang, and ShuichengYan. Sparse representation for computer vision and pattern recognition.Proceedings of the IEEE, 98(6):1031–1044, June 2010. ISSN 0018-9219.doi: 10.1109/JPROC.2010.2044470.
[19] R. Baraniuk and P. Steeghs. Compressive radar imaging. In RadarConference, 2007 IEEE, pages 128–133, April 2007. doi: 10.1109/RADAR.2007.374203.
[20] Michael Lustig, David Donoho, and John M Pauly. Sparse mri: Theapplication of compressed sensing for rapid mr imaging. Magneticresonance in medicine, 58(6):1182–1195, 2007.
[21] E.J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles:exact signal reconstruction from highly incomplete frequency informa-tion. Information Theory, IEEE Transactions on, 52(2):489–509, Feb2006. ISSN 0018-9448. doi: 10.1109/TIT.2005.862083.
[22] Stephen P Boyd and Lieven Vandenberghe. Convex optimization.Cambridge university press, 2004.
[23] Scott Shaobing Chen, David L Donoho, and Michael A Saunders.Atomic decomposition by basis pursuit. SIAM journal on scientificcomputing, 20(1):33–61, 1998.
[24] Emmanuel J Candes, Michael B Wakin, and Stephen P Boyd. En-hancing sparsity by reweighted �1 minimization. Journal of FourierAnalysis and Applications, 14(5-6):877–905, 2008.
[25] Wotao Yin and Yin Zhang. Extracting salient features from less datavia l1-minimization. SIAG/OPT Views-and-News, 19(1):11–19, 2008.
[26] Hongjing Lu, Tungyou Lin, Alan LF Lee, Luminita A Vese, and Alan LYuille. Functional form of motion priors in human motion perception.In NIPS, pages 1495–1503, 2010.
[27] A.Y. Yang, S. Maji, C.M. Christoudias, T. Darrell, J. Malik, and S.S.Sastry. Multiple-view object recognition in band-limited distributedcamera networks. In Distributed Smart Cameras, 2009. ICDSC 2009.Third ACM/IEEE International Conference on, pages 1–8, Aug 2009.doi: 10.1109/ICDSC.2009.5289410.
[28] A. Wani and N. Rahnavard. Compressive sampling for energy efficientand loss resilient camera sensor networks. In MILITARY COMMUNI-CATIONS CONFERENCE, 2011 - MILCOM 2011, pages 1766–1771,Nov 2011. doi: 10.1109/MILCOM.2011.6127567.
[29] N. Katic, M.H. Kamal, M. Kilic, A. Schmid, P. Vandergheynst, andY. Leblebici. Power-efficient cmos image acquisition system based oncompressive sampling. In Circuits and Systems (MWSCAS), 2013IEEE 56th International Midwest Symposium on, pages 1367–1370,Aug 2013. doi: 10.1109/MWSCAS.2013.6674910.
[30] Z. Charbiwala, P. Martin, and M.B. Srivastava. Capmux: A scalableanalog front end for low power compressed sensing. In Green ComputingConference (IGCC), 2012 International, pages 1–10, June 2012. doi:10.1109/IGCC.2012.6322255.
[31] Ramy Hussein, Amr Mohamed, Masoud Alghoniemy, and Alaa Awad.Design and analysis of an adaptive compressive sensing architecture
for epileptic seizure detection. In Energy Aware Computing Systemsand Applications (ICEAC), 2013 4th Annual International Conferenceon, pages 141–146, Dec 2013. doi: 10.1109/ICEAC.2013.6737653.
[32] J. Chiang and R. Ward. Data reduction for wireless seizure de-tection systems. In Neural Engineering (NER), 2013 6th Interna-tional IEEE/EMBS Conference on, pages 48–52, Nov 2013. doi:10.1109/NER.2013.6695868.
[33] M. Shoaib, K.H. Lee, N.K. Jha, and N. Verma. A 0.6–107 μw energy-scalable processor for directly analyzing compressively-sensed eeg. Cir-cuits and Systems I: Regular Papers, IEEE Transactions on, PP(99):1–14, 2014. ISSN 1549-8328. doi: 10.1109/TCSI.2013.2285912.
[34] M. Shoaran, M.M. Lopez, V.S.R. Pasupureddi, Y. Leblebici, andA. Schmid. A low-power area-efficient compressive sensing approachfor multi-channel neural recording. In Circuits and Systems (ISCAS),2013 IEEE International Symposium on, pages 2191–2194, May 2013.doi: 10.1109/ISCAS.2013.6572310.
[35] S.A. Imtiaz, A. Casson, and E. Rodriguez-Villegas. Compressionin wearable sensor nodes: Impacts of node topology. BiomedicalEngineering, IEEE Transactions on, PP(99):1–1, 2013. ISSN 0018-9294. doi: 10.1109/TBME.2013.2293916.
[36] Qing Ling and Zhi Tian. Decentralized sparse signal recovery forcompressive sleeping wireless sensor networks. Signal Processing, IEEETransactions on, 58(7):3816–3827, July 2010. ISSN 1053-587X. doi:10.1109/TSP.2010.2047721.
[37] Peter M. Kogge and Harold S. Stone. A parallel algorithm for theefficient solution of a general class of recurrence equations. Computers,IEEE Transactions on, C-22(8):786–793, Aug 1973. ISSN 0018-9340.doi: 10.1109/TC.1973.5009159.
[38] C. S. Wallace. A suggestion for a fast multiplier. Electronic Computers,IEEE Transactions on, EC-13(1):14–17, Feb 1964. ISSN 0367-7508.doi: 10.1109/PGEC.1964.263830.
[39] Robert E Goldschmidt. Applications of division by convergence. PhDthesis, Massachusetts Institute of Technology, 1964.
[40] Emmanuel Candes and Justin Romberg. l1-magic: Recovery of sparse signals via convex programming. URL: www.acm.caltech.edu/l1magic/downloads/l1magic.pdf, 4, 2005.
[41] A. Maleki and D.L. Donoho. Optimally tuned iterative reconstruction algorithms for compressed sensing. Selected Topics in Signal Processing, IEEE Journal of, 4(2):330–341, 2010. ISSN 1932-4553. doi: 10.1109/JSTSP.2009.2039176.
[42] Jerome LVM Stanislaus and Tinoosh Mohsenin. Low-complexity fpga implementation of compressive sensing reconstruction. In International Conference on Computing, Networking and Communications, 2013.
[43] K. Karakus and H.A. Ilgin. Implementation of image reconstruction algorithm using compressive sensing in fpga. In Signal Processing and Communications Applications Conference (SIU), 20th, pages 1–4, 2012. doi: 10.1109/SIU.2012.6204682.
[44] H. Rabah, A. Amira, and A. Ahmad. Design and implementation of a fall detection system using compressive sensing and shimmer technology. In Microelectronics (ICM), 24th International Conference on, pages 1–4, 2012. doi: 10.1109/ICM.2012.6471399.
[45] P. Blache, H. Rabah, and A. Amira. High level prototyping and fpga implementation of the orthogonal matching pursuit algorithm. In Information Science, Signal Processing and their Applications (ISSPA), 11th International Conference on, pages 1336–1340, 2012. doi: 10.1109/ISSPA.2012.6310501.
[46] Jicheng Lu, Hao Zhang, and Huadong Meng. Novel hardware architecture of sparse recovery based on fpgas. In Signal Processing Systems (ICSPS), 2nd International Conference on, volume 1, pages V1–302–V1–306, 2010. doi: 10.1109/ICSPS.2010.5555628.
[47] Lin Bai, P. Maechler, M. Muehlberghuber, and H. Kaeslin. High-speed compressed sensing reconstruction on fpga using omp and amp. In Electronics, Circuits and Systems (ICECS), 19th IEEE International Conference on, pages 53–56, 2012. doi: 10.1109/ICECS.2012.6463559.
[48] A. Septimus and R. Steinberg. Compressive sampling hardware reconstruction. In Circuits and Systems (ISCAS), Proceedings of IEEE International Symposium on, pages 3316–3319, 2010. doi: 10.1109/ISCAS.2010.5537976.
[49] Y.V. Zakharov and V. Nascimento. Orthogonal matching pursuit with dcd iterations. Electronics Letters, 49(4):295–297, 2013.
[50] P.R. Gill, A. Wang, and A. Molnar. The in-crowd algorithm for fast basis pursuit denoising. Signal Processing, IEEE Transactions on, 59(10):4595–4605, 2011. ISSN 1053-587X. doi: 10.1109/TSP.2011.2161292.
[51] M.A.T. Figueiredo and R.D. Nowak. A bound optimization approach to wavelet-based image deconvolution. In Image Processing. ICIP. IEEE International Conference on, volume 2, pages II–782–5, 2005. doi: 10.1109/ICIP.2005.1530172.
[52] David Donoho and Jared Tanner. Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4273–4293, 2009.
[53] Zynq-7000 AP SoC Technical Reference Manual. Xilinx Inc., v1.6.1 edition, September 2013.
[54] CUDA C Best Practices Guide. Nvidia Corporation, Santa Clara,California, USA, 5.5 edition, July 2013.