NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Thesis Title: Phase Unwrapping on Reconfigurable Hardware and Graphics Processors
Author: Sherman Braganza
Department: Electrical and Computer Engineering
Approved for Thesis Requirements of the Master of Science Degree
Thesis Advisor: Prof. Miriam Leeser Date
Thesis Reader: Prof. Charles DiMarzio Date
Thesis Reader: Prof. David Kaeli Date
Department Chair: Prof. Ali Abur Date
Graduate School Notified of Acceptance:
Dean: Prof. David E. Luzzi Date
Copy Deposited in Library:
Reference Librarian Date
Phase Unwrapping on Reconfigurable Hardware and Graphics
Processors
A Thesis Presented
by
Sherman Braganza
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements for the degree of
Master of Science
in
Electrical Engineering
in the field of
Computer Engineering
Northeastern University
Boston, Massachusetts
August 2008
© Copyright 2008 by Sherman Braganza. All Rights Reserved.
Acknowledgement
I would like to thank my advisor Professor Miriam Leeser. Without the opportunities
and guidance she has provided, none of the work presented in this thesis would have
been possible. The patience, understanding, and technical advice she has given me
have been invaluable in my research and personal growth.
I would also like to thank my colleagues in the Reconfigurable Computing Lab at
Northeastern University. The friendly support they offered provided an enjoyable and
productive work environment that I will miss. I would also like to thank my family
and friends for their encouragement and support.
I would also like to thank Professor Charles DiMarzio and William Warger II
for their help with any questions that I had regarding the OQM microscope and
the data sets that they provided. Finally, I would like to acknowledge the sup-
port of CenSSIS, the Center for Subsurface Sensing and Imaging Systems, under the
Engineering Research Centers Program of the National Science Foundation (award
number EEC-9986821), without whose funding this would not have been possible.
Abstract
Phase unwrapping is the process of converting discontinuous phase data into a con-
tinuous image. This procedure is required by any imaging technology that uses phase
data such as MRI, SAR or OQM microscopy. Such algorithms often take a significant
amount of time to run on a general purpose computer, making it difficult to
process large quantities of information. This thesis focuses on implementing a specific
phase unwrapping algorithm known as Minimum LP norm unwrapping on a Field
Programmable Gate Array (FPGA) and a Graphics Processing Unit (GPU) for the
purpose of acceleration. The computation required involves a matrix preconditioner
(based on a DCT transform) and a conjugate gradient calculation along with a few
other matrix operations. These functions are partitioned to run on the host or the
accelerator depending on the capabilities of the accelerator. The tradeoffs between
the two platforms are analyzed and compared to a General Purpose Processor (GPP)
in terms of performance, power and cost.
Contents
1 Introduction 14
2 Background 16
2.1 The Keck Fusion Microscope . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Modalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Optical Quadrature Microscopy (OQM) . . . . . . . . . . . . 17
2.2 Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 WildStar II Pro PCI . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.3 NVIDIA GPUs - G80 . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Phase Unwrapping - Algorithms and Selection . . . . . . . . . . . . . 26
2.3.1 Path Following Algorithms . . . . . . . . . . . . . . . . . . . . 28
2.3.2 Minimum Norm Algorithms . . . . . . . . . . . . . . . . . . . 35
2.3.3 Choosing The Right Algorithm . . . . . . . . . . . . . . . . . 41
2.4 Bitwidth Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.2 GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3 Implementation 52
3.1 Experimental Platforms And Timing Profile . . . . . . . . . . . . . . 52
3.1.1 Host Machine Descriptions And Timing Profiles . . . . . . . . 53
3.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2.1 The Discrete Cosine Transform - An Overview . . . . . . . . . 55
3.2.2 Algorithm Details for the 1D DCT . . . . . . . . . . . . . . . 56
3.2.3 Algorithm Details for the 2D DCT . . . . . . . . . . . . . . . 59
3.2.4 Algorithm Details for the Conjugate Gradient . . . . . . . . . 60
3.3 The FPGA Implementation of the Preconditioner . . . . . . . . . . . 61
3.3.1 The One Dimensional DCT Transform . . . . . . . . . . . . . 62
3.3.2 The Two Dimensional DCT On The FPGA . . . . . . . . . . 69
3.3.3 Division And Scaling . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 GPU Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4.1 Preconditioner . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.2 Conjugate Gradient . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.5.1 Programmed IO . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.5.2 Direct Memory Access (DMA) . . . . . . . . . . . . . . . . . . 79
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4 Results 83
4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1.1 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.1.2 Benchmark Suite . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.2 FPGA Area Consumption . . . . . . . . . . . . . . . . . . . . 88
4.2.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.4 Cost Effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2.5 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5 Conclusion and Future Work 98
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Bibliography 101
List of Figures
2.1 Optical Quadrature Microscopy setup [39] . . . . . . . . . . . . . . . 18
2.2 Virtex II Pro - Architecture [43] . . . . . . . . . . . . . . . . . . . . . 21
2.3 An image of the Annapolis Wildstar II Pro PCI [3] . . . . . . . . . . 22
2.4 A block diagram showing the various components of the Wildstar II
Pro [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 NVIDIA G80 core - Architecture[27] . . . . . . . . . . . . . . . . . . 25
2.6 A wrapped image. Note the data range which lies between π and −π 28
2.7 The need for smart phase unwrapping algorithms a) A raster unwrap
using Matlab's 'unwrap' routine b) A minimum LP norm unwrap . . . 29
2.8 Goldstein’s algorithm on the two embryo sample . . . . . . . . . . . . 31
2.9 Quality Mapped algorithm on the two embryo sample . . . . . . . . . 32
2.10 Mask Cut algorithm on the two embryo sample . . . . . . . . . . . . 34
2.11 Flynn’s algorithm on the two embryo sample . . . . . . . . . . . . . . 35
2.12 Preconditioned Conjugate Gradient Algorithm pseudo-code . . . . . . 38
2.13 Preconditioned Conjugate gradient algorithm on the two embryo sample 39
2.14 Minimum LP Norm Algorithm pseudo-code . . . . . . . . . . . . . . 40
2.15 The Minimum LP Norm algorithm on the two embryo sample . . . . 41
2.16 The Multi-grid algorithm on the two embryo sample . . . . . . . . . . 42
2.17 Image produced by a double-precision unwrap . . . . . . . . . . . . . 44
2.18 Using a bitwidth of 27 . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.19 Using a bitwidth of 28 . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1 A comparison between the performance of the two machines . . . . . 54
3.2 Even extension around n=-0.5 and n=N-0.5 . . . . . . . . . . . . . . 57
3.3 PCG - Detailed pseudo-code[1] . . . . . . . . . . . . . . . . . . . . . . 61
3.4 Components and dataflow for the forward DCT transform . . . . . . 63
3.5 Components and dataflow for the inverse DCT transform . . . . . . . 64
3.6 Re-ordering pattern in a forward shuffle . . . . . . . . . . . . . . . . . 65
3.7 The rebuild component - Forward Transform . . . . . . . . . . . . . . 66
3.8 The 1D transform including dynamic scaling . . . . . . . . . . . . . . 67
3.9 A High Level Diagram of the Preconditioning Kernel . . . . . . . . . 70
3.10 The FSM controlling high level data-flow . . . . . . . . . . . . . . . . 71
3.11 The SRAM A FSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.12 Poisson Equation Calculation in the Frequency Domain . . . . . . . . 74
3.13 Implementation of the floating point divide and scale logic . . . . . . 75
4.1 Phase unwraps on a) The reference software implementation, b) The
GPU and c) The FPGA . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Differences in phase unwraps between software and a) The GPU and
b) The FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Speedup achieved using the FPGA versus the reference software im-
plementation on Machine 1 . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4 Speedup achieved using the GPU versus the reference software imple-
mentation on Machine 2 . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.5 Time to complete 200 iterations of the preconditioner on both platforms 92
4.6 A comparison showing the performance per dollar on three platforms 94
4.7 A comparison showing the total power consumption for the two machines 95
4.8 A comparison showing the power consumption difference between the
processor running in the idle state and executing the algorithm using
either a GPP or an accelerator . . . . . . . . . . . . . . . . . . . . . . 97
List of Tables
3.1 A comparison between the two platforms . . . . . . . . . . . . . . . . 53
4.1 FPGA area consumption for the single DCT core implementation . . 88
Chapter 1
Introduction
Computational accelerators are not a new idea, but the emergence of two platforms,
FPGAs and GPUs, is reshaping how developers involved in High
Performance Computing (HPC) solve problems. Both platforms are well suited for
highly parallel tasks since they both present architectures capable of exploiting data
parallelism. FPGAs provide complete control for the designer at the cost of imple-
mentation effort whereas GPUs provide a fixed memory hierarchy and architecture
to which designers must fit their algorithms. Many image processing algorithms map
well to either platform since such algorithms tend to parallelize well. In particu-
lar, two dimensional phase unwrapping has the potential to see significant speedup
through implementation on these platforms.
Our interest in two dimensional phase unwrapping stems from the research cur-
rently ongoing with the W.M. Keck 3-D Fusion microscope [18]. One of its modalities,
known as Optical Quadrature Microscopy (OQM), utilizes phase information to re-
construct an image of the sample under investigation. This phase information is
wrapped between π and −π and needs to be unwrapped before it can be of use.
In this thesis, we explore a variety of phase unwrapping algorithms and their
suitability for processing the data sets produced by the OQM mode of the Keck mi-
croscope. We select the Minimum LP norm algorithm as being the one that provides
the best results in terms of quality and then implement its kernel on both the FPGA
and the GPU platforms. We see an overall algorithm speedup of between 2x and 3x
for the FPGA and between 5x and 6x for the GPU as compared to the host machines
on which the accelerators are running. When compared to the host machine that
is currently being used (which takes two minutes per frame), we see an algorithm
speedup of 40x.
The rest of this thesis is arranged as follows. Chapter 2 presents details about
the Keck microscope as well as discussing our initial work in selecting the right phase
unwrapping algorithm. In this chapter we also discuss other related work in the
field and how our implementation compares to them. In Chapter 3 we present the
mathematical details behind our specific implementation of the preconditioner and
the conjugate gradient calculation and then proceed to discuss our implementation on
the FPGA and the GPU platforms. We also present details of the two host machines
themselves and benchmark their performance. In Chapter 4 we present our results
for the two platforms and finally in Chapter 5 we present our conclusions and discuss
potential future work.
Chapter 2
Background
This chapter describes the motivation behind our work and the related research that
has been published in the field. We start off by describing the Keck microscope and its
Optical Quadrature Microscopy modality and then talk about the FPGA and GPU
platforms. Next, we discuss our work in analyzing the results that the various phase
unwrapping algorithms produce for the OQM datasets and determine the appropriate
bitwidth for that algorithm. Finally, the state of the art in the development of related
FPGA and GPU implementations is discussed and some conclusions presented.
2.1 The Keck Fusion Microscope
This Master’s thesis research was motivated by the Keck Fusion microscope. In
order to generate useful images from one of its modalities, it is necessary to
perform a phase unwrapping calculation. This calculation takes approximately
two minutes per frame on the current platform. This leads to a bottleneck in terms
of processing time and thus it was deemed necessary to speed it up.
2.1.1 Modalities
The Keck microscope utilizes multiple modalities in order to generate a complete
fused image of the target. These include Differential Interference Contrast (DIC), Re-
flectance Confocal Microscopy (CRM), Laser Scanning Confocal Microscopy (LSCM),
Two-photon Microscopy (TPLSM) and the one that is of interest to us, Optical
Quadrature Microscopy. Further details about the microscope and the various modal-
ities can be found in [18]. OQM is described in further detail in Section 2.1.2.
2.1.2 Optical Quadrature Microscopy (OQM)
Optical quadrature microscopy is a detection technique for measuring phase and
amplitude changes to a sinusoidal signal. A diagram of the setup is shown in Figure
2.1. A signal from a HeNe laser is split into two components, reference and unknown.
The unknown signal passes through the sample. The known reference signal is split,
with one component being phase shifted by 90 degrees. The unknown signal is then
mixed separately with both components of the known reference signal. The merged
signal consisting of the unknown and the non-phase-shifted reference is referred to as the I
channel, or the in-phase signal, while the unknown signal mixed with the 90 degree
phase-shifted reference signal is referred to as the Q channel, or the quadrature signal.
By interpreting the I and Q signals as real and imaginary values of a complex number,
it is possible to find the amplitude and phase of the unknown signal.
These concepts of quadrature detection are applied to microscopy to create the
OQM mode of the Keck 3D Fusion Microscope. Since coherent (HeNe laser) detection
provides an effective gain of |Eref| × |Esig|, low levels of light can be used for
illumination, minimizing cell exposure/damage.
Figure 2.1: Optical Quadrature Microscopy setup [39]
OQM forms the motivation behind the phase unwrapping acceleration project as
it produces wrapped phase based images with data between π and −π. This data
needs to be unwrapped before it is usable. A software implementation
of the Minimum LP norm phase unwrapping algorithm in a mixture of C and Matlab
takes nearly two minutes to process a single frame. Speeding up the processing would
render the OQM modality much more useful in processing large stacks of images and
ultimately, having a near real-time version showing direct unwrapped output from
the microscope.
2.2 Platforms
The concept of using an external accelerator for application speedup is not a new
one. In the early days of PCs, it was common to have an extra socket for a math
coprocessor [40] that could be used to accelerate floating point computations. More
recently, platforms such as the Cell Broadband Engine have been used as accelerators
in petaflop supercomputers [6] such as the Roadrunner machine in order to reach new
levels of computing performance. FPGAs have also been used in machines such as
the Cray XD1 [7] and GPUs in systems like the Bull Novascale supercomputer [37].
Such systems are general accelerators in the sense that they apply to a wide range
of application domains. More specific accelerators such as Ageia’s PhysX physics
accelerator [41] also exist that target a restricted domain.
In the work presented in this thesis, a phase-unwrapping algorithm has been
implemented on two separate platforms: Field Programmable Gate Arrays (FPGAs)
and Graphics Processing Units (GPUs) and the results are compared to general
purpose processors.
2.2.1 FPGAs
Implementation of application-specific functionality in hardware can be performed
on either Application Specific Integrated Circuits (ASICs) or FPGAs. ASIC implementation
is generally expensive and thus is reserved for high-volume production.
ASICs are usually mass produced and cannot be reprogrammed. Gate arrays on
the other hand are programmable and exist in both volatile and non-volatile flavors.
Non-volatile types such as those that are mask-programmed are also not feasible for
prototypes or low volume production. Reprogrammable gate arrays are thus ideal
for prototyping applications.
SRAM-based programmable FPGAs (which are the target for this implementa-
tion) use look-up tables (LUTs) to implement combinational logic. These LUTs,
along with other primitives such as registers, multipliers and memory, are arranged
in a regular pattern in hardware with each component being connected by a pro-
grammable interconnect. An example FPGA architecture is shown in Figure 2.2.
This architecture allows for the implementation of arbitrary functionality limited
only by available area. By exploiting hardware based optimizations such as paral-
lelization and pipelining, designers can achieve high performance on such hardware.
For this research we use a Virtex II Pro FPGA chip which has 18-bit embedded
multipliers, on-chip RAM elements called BlockRAMs, and two embedded PowerPCs
which are not used in this implementation.
2.2.2 WildStar II Pro PCI
The specific Virtex II Pro based FPGA board on which our design was implemented
was the Wildstar II Pro by Annapolis. The Wildstar II Pro has dual Virtex II Pro
FPGA chips on the accelerator board. It was designed to speed up signal and image
processing applications. An image of the board is shown in Figure 2.3.
FPGA implementations usually achieve performance through a mixture of coarse
Figure 2.2: Virtex II Pro - Architecture [43]
and fine grained pipelining and parallelism. In order to maximize potential exploita-
tion of such characteristics, six DDR2 SRAM banks are provided for each FPGA,
resulting in a total of twelve banks. Each bank is 18 Mbits in size and is arranged as
524,288 36-bit words. The SRAM banks are all independently accessible via their own
36-bit wide buses, but are accessible as 72-bit words once passed
through the clock domain crossing logic. They run at a frequency governed by the
FPGA design speed although there are some setup latencies involved in any transfer.
The arrangement of the SRAM banks and FPGAs is shown in Figure 2.4.
Each FPGA also has a separate 64 MB DDR DRAM bank that remains unused in
the phase-unwrapping design, along with the differential pair and Rocket IO buses.
Also present is an independent bus for each FPGA to the PCI controller chip that
Figure 2.3: An image of the Annapolis Wildstar II Pro PCI [3]
interfaces with the bus to the host.
There are also multiple programmable clocks available on the chip, referred to as
PCLK, ICLK and MCLK. These correspond to the user-programmable clock, the
PCI controller bus clock and the memory clock, respectively.
Annapolis utilizes a bus architecture that they term the Local Address Data
(LAD) bus. This refers to the bus between the FPGA and the PCI controller chip.
It provides abstractions for DMA transfers and for register space read/writes (PIO).
This bus is what user generated designs interface with in order to transfer data to
and from the host.
Annapolis also provides a software API for dealing with issues such as data trans-
fer, setting up clock frequencies and other parameters of the board. These will be
described as necessary in the software section of the FPGA implementation details.
Figure 2.4: A block diagram showing the various components of the Wildstar II Pro[3]
2.2.3 NVIDIA GPUs - G80
The newest generation of DirectX 10 compatible GPU hardware supports general
purpose High Performance Computing (HPC). This was accomplished by replacing
the old graphics pipeline, composed of dedicated vertex and fragment shader units, with
unified computation units capable of handling either task. In this subsection we discuss
specific details of the architecture that enable the hardware to achieve speedup on
parallel workloads.
Hardware
NVIDIA’s first foray into explicitly supported general purpose graphics hardware was
the G80 architecture depicted in Figure 2.5. The G80 GPU consists of 128 processing
elements, each capable of operating on a separate single precision floating point datum
in parallel with the others. Groups of eight elements form a multiprocessor,
with each multiprocessor having its own shared memory. This shared memory is
used by threads on the same multiprocessor to share data. However, there exists no
way of rapidly transferring data from one multiprocessor to another (the application
programmer is required to go through main memory). Each multiprocessor has its
own set of registers, a constant cache and a texture cache. These different memory
types are optimized for different access types. There is also hardware thread control
that enables rapid thread switching and thus the hardware is optimized for dealing
with thousands of threads in parallel. This approach allows for the rapid computation
of massively parallel algorithms that exhibit high arithmetic intensity.
NVIDIA 8800 GTX
The 8800 GTX is a specific model in NVIDIA's G80 family of GPUs and represents
the second highest performing member of that family (the highest being the 8800
Ultra). It possesses a full complement of 128 stream processors operating at 1.35
GHz or twice the frequency of the rest of the core (which includes components such
as the Raster Operator etc.). The EVGA board that we use comes with 768 MB of
GDDR3 RAM operating at 900 MHz (effectively 1.8 GHz double pumped) with a
Figure 2.5: NVIDIA G80 core - Architecture[27]
total bandwidth to the GPU of 86.4 GB/s via a 384-bit wide bus. The bus to the
host is PCI-E x16 and offers a peak throughput of 4 GB/s.
Application Programmers Interface (API)
The paradigm around which General Purpose GPUs are based is the kernel-stream
concept [4] [42] [31]. This approach maps well to many highly parallel applications.
Here, a kernel running on multiple identical processors (usually arranged in a regular
array) operates on separate data independently. This is known as a Single Instruction
Multiple Data (SIMD) architecture. An example application that uses this streaming
method is scalar matrix multiplication. Each data point in the matrix can be read,
multiplied and written back independently.
In their API, called the Compute Unified Device Architecture (CUDA) API,
NVIDIA espouse a similar method which they label Single Instruction Multiple
Thread (SIMT); the difference being that SIMT instructions do not contain informa-
tion as to the width of the processor array. For the above example of scalar-matrix
multiplication, this would mean that each data point would be operated on by its
own thread.
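The scalar-matrix example can be written sequentially in C as below; under SIMT the loop disappears and each iteration would instead be executed by its own thread (a conceptual sketch, not the thesis code):

```c
/* Scale an n-element matrix (stored flat) by a constant.  Every element
 * is read, multiplied and written back independently of the others, so
 * the loop body maps directly onto one GPU thread per element. */
void scalar_matrix_mult(float *m, int n, float k)
{
    for (int idx = 0; idx < n; ++idx)
        m[idx] *= k;
}
```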
The CUDA API operates on a large number of threads, broken up into warps,
blocks and grids. A warp consists of 32 threads that are managed together on a
multiprocessor. A thread block is a larger group of threads that executes on a single
multiprocessor and it typically consists of multiple warps. A grid is a collection of
blocks and it operates on many multiprocessors. The threads per block and blocks
per grid parameters must be specified for each individual kernel, but the warp size is
fixed.
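One practical consequence of these launch parameters is the host-side calculation of blocks per grid, which must round up so that every data element is covered; a minimal sketch (the helper name is ours):

```c
/* Given a problem size and a chosen threads-per-block value, compute the
 * number of blocks per grid needed to cover every element, rounding up.
 * Kernels then guard against the few excess threads in the last block. */
int blocks_per_grid(int n_elements, int threads_per_block)
{
    return (n_elements + threads_per_block - 1) / threads_per_block;
}
```

For example, 1000 elements at 256 threads per block require 4 blocks, the last of which carries 24 idle threads.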
The CUDA API has two levels: a higher, abstract level and the Parallel
Thread eXecution (PTX) level. The two levels are mutually exclusive. The
higher level provides simpler abstractions whereas the PTX API gives the programmer
access to lower-level aspects of the GPU. All implementations discussed in this
thesis use the high-level API.
2.3 Phase Unwrapping - Algorithms and Selection
As a preliminary step to the implementation of a specific phase-unwrapping algorithm
on an accelerator (in this case an FPGA and a GPU), it was necessary for us to verify
the choice of phase unwrapping algorithm used. Our primary criterion was unwrap
quality and we tested each algorithm over widely differing datasets consisting of both
real embryo data and artificial targets such as glass beads in water or epoxy media,
and optical fibres. In this section we present the properties and the tradeoffs between
the different algorithms described by Ghiglia and Pritt [1] and our final analysis as to
which one produces the best results. Software implementations tailored to the OQM
data sets were provided by a previous student [35].
The idea behind most phase-unwrapping algorithms is that the correct unwrapped
phase varies slowly enough that the gradient between pixels is less than a half-cycle,
or π radians. If this assumption holds true, a wrapped signal may be unwrapped
by simply summing (integrating in the continuous domain) until a gradient of
magnitude greater than π is reached, at which point an integer multiple of 2π is
added to the phase and the summation continues. This is the only method for solving 1D phase-based data
sets. However, one problem with this approach is that if the data is sufficiently noisy,
spurious phase gradients greater than π are created. These large phase gradients
can lead to image corruption over large segments of the data. Lower levels of noise
(i.e. below π) also lead to an accumulation of error that eventually results in large
deviations near the end of the accumulation. Residues (discussed later in this
section) also contribute to incorrect unwraps. A wrapped data set is shown in Figure
2.6 and an example of both bad and good unwraps, performed using the raster-based
Matlab unwrap and Minimum LP norm unwrapping, is shown in Figure 2.7.
To solve this problem, various two-dimensional phase unwrapping algorithms have
been developed, each with differing tradeoffs in terms of quality and performance.
Figure 2.6: A wrapped image. Note the data range which lies between π and −π
2.3.1 Path Following Algorithms
Path following algorithms solve the noisy data problem by selecting the path over
which to integrate. Goldstein's algorithm, one of the most common path-following
algorithms, operates by identifying residues (points where the integral over a closed
four pixel loop is non-zero) and connecting them via branch cuts or paths along which
the integration path may not intersect.
One problem with Goldstein's algorithm is that it does not utilize all the data
Figure 2.7: The need for smart phase unwrapping algorithms: a) A raster unwrap using Matlab's 'unwrap' routine b) A minimum LP norm unwrap
available to guide the generation of branch cuts. By generating a map indicating the
quality of the data over the image, it is possible to unwrap instances that cannot
be unwrapped using Goldstein's algorithm. These quality maps may be user-supplied or
automatically generated using pseudo-correlation, the variance of phase derivatives or
the maximum phase gradient.
Quality maps may be combined with the branch cuts used in Goldstein's algorithm
to form a hybrid mask-cut method. The quality map is used to guide the placement
of branch-cuts. Another approach, proposed by Flynn, detects discontinuities, joins
them into loops and adds the correct multiple of 2π to each loop if the action removes more
discontinuities than it adds. Flynn's minimum discontinuity solution can also be used
with a quality map to generate higher quality solutions.
Goldstein’s Algorithm
The simplest algorithm in terms of computational complexity is Goldstein’s Branch
Cut Algorithm. It operates in the following way:
Step 1. Identify residues: This step is accomplished by integrating in a four
pixel loop starting at pixel p0. If the sum is 2π then the p0 is marked as having
a positive residue charge and if the sum is −2π then it is marked as having a
negative residue charge.
Step 2. Create Branch Cuts: This step operates by connecting residues together
by branch cuts until the sum of residue charges is zero.
Step 3. Integrate: Integration of the image is performed using a Breadth First
Search (BFS) exploration of all the pixels in the image. As each pixel is en-
countered, it is unwrapped, unless it lies on a branch cut. Next, segments of
the image that may have been isolated by branch cuts are unwrapped similarly.
Pixels on branch cuts are unwrapped separately at the end.
As can be inferred from the steps above, Goldstein's algorithm operates in O(N²)
time while consuming O(N²) space, where the input image is N × N.
The result of a phase unwrapping using Goldstein’s algorithm is shown in Figure
2.8. The problems with this method are immediately apparent. The segments of
the image that are of interest are mostly still wrapped with the amplitudes being
incorrect by a large margin.
Figure 2.8: Goldstein’s algorithm on the two embryo sample
Quality Maps
Quality maps are based on the concept of a user-supplied or auto-generated array that
defines the goodness of each phase value. These can be used to guide the unwrapping
since corrupted phase and residues usually have low quality values.
Unwrapping using quality maps works by first taking as an input the phase array
and the quality map. The quality map is either user input or based on the variance of
phase derivatives or the maximum phase gradient. The unwrapping is then performed
in a similar fashion as the Goldstein algorithm’s BFS exploration, except that the
adjoin list is not explored in FIFO order but according to quality. This leaves the
low quality pixels to be unwrapped at the very end.
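The only difference from Goldstein's BFS is the order in which the adjoin list is drained. A minimal C sketch of that ordering (a linear scan standing in for a real priority queue; names are illustrative, not from the thesis code):

```c
#include <assert.h>
#include <stddef.h>

/* One adjoin-list entry: a pixel index and its quality value. */
typedef struct { int index; float quality; } AdjoinEntry;

/* Remove and return the highest-quality entry; swap-delete with the
   last element keeps the list compact. */
AdjoinEntry pop_best(AdjoinEntry *list, size_t *n) {
    size_t best = 0;
    for (size_t k = 1; k < *n; k++)
        if (list[k].quality > list[best].quality) best = k;
    AdjoinEntry e = list[best];
    list[best] = list[--*n];
    return e;
}
```

Each call returns the current highest-quality pixel, so low-quality pixels naturally sink to the end of the unwrap order.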
The results of the quality mapped unwrapping are shown in Figure 2.9. There
are significant failures noticeable in the center of the lower embryo as well as around
the edges of both.
Figure 2.9: Quality Mapped algorithm on the two embryo sample
Mask Cut Algorithm
The Quality Map method does not explicitly use all the information available such
as residues. A hybrid approach called the Mask Cut Algorithm also exists. It uses
both quality maps and residues to guide the placement of branch cuts. It operates
as follows:
Step 1. Identify residues: This is performed as described in Section 2.3.1.
Step 2. Create mask cuts: This operates on the lowest quality pixels in the
image, gradually growing outwards using a BFS exploration technique and
marking the low quality pixels as being part of a mask cut once a residue is
encountered. The mask cut continues growing until the charge is balanced.
Step 3. Thin the mask cuts: Mask cuts tend to be thicker than branch cuts and
need to be thinned before integration. This step clears the mask on all mask
pixels that do not lie next to a residue and can be safely removed without
changing mask connectivity.
Step 4. Integrate: This is performed as described in Section 2.3.1.
For our datasets, the mask cut algorithm performs poorly as can be seen in Figure 2.10. The diagonal flecking and inconsistent phase changes render this technique
unusable for the OQM microscope.
Flynn’s Algorithm
One method of phase-unwrapping is to segment the image along lines of discontinuity
into regions where each region has the same integer multiple of 2π associated with
it. This approach fails in the presence of high noise values or residues. Flynn’s
algorithm only segments regions along lines of discontinuity if the process of doing so
and adding the appropriate 2π multiple removes more discontinuities than it adds.
The algorithm works as follows:
Figure 2.10: Mask Cut algorithm on the two embryo sample
Step 1. Compute jump counts: Here horizontal and vertical jump counts are
computed. A jump count is the integer k in the 2πk multiple associated with
a region.
Step 2. Scan nodes: Go over the set of nodes, adding edges and removing loops.
Terminate when a pass makes no changes.
Step 3. Compute unwrapped solution: The wrap counts are added to the
input phase data to get the final unwrapped solution.
There are various other optimizations performed on the image such as the inte-
gration of quality data that are not described here. From Figure 2.11 it can be seen
that Flynn’s algorithm provides a high quality solution, the best amongst the path
following algorithms discussed.
Figure 2.11: Flynn’s algorithm on the two embryo sample
2.3.2 Minimum Norm Algorithms
This set of phase-unwrapping algorithms seeks to generate an unwrapped phase whose
local phase derivatives match the measured derivatives as closely as possible. This
comparison can be defined as the difference between the two, raised to some power
p.
The simplest case is the unweighted least squares method where p = 2. This
family of methods uses Fourier or DCT techniques to solve the least squares problem
but are vulnerable to noise. The pre-conditioned conjugate gradient (PCG) technique
overcomes this problem by using quality maps to zero-weight noisy regions so that
the unwrapped solution is not corrupted. There also exists a weighted multi-grid
algorithm that uses a combination of fine and coarse grids to converge on a solution.
Finally, the Minimum LP Norm algorithm solves the phase unwrapping problem for
p = 0. In this situation, the algorithm minimizes the number of discontinuities in
the unwrapped solution without concern for the magnitude of these discontinuities.
This value of p generally produces the best solution. This algorithm can be used
with or without user-supplied weights and can also generate its own data-dependent
weights. It iterates the PCG algorithm, which in turn iterates the DCT algorithm.
This results in the Minimum LP Norm algorithm having among the highest costs of
all the algorithms in terms of runtime and memory.
Preconditioned Conjugate Gradient
The Preconditioned Conjugate Gradient (PCG) algorithm iterates the unweighted
least squares algorithm in order to perform a weighted phase unwrap. This un-
weighted least squares technique minimizes the difference between the discrete par-
tial derivatives of the wrapped phase data and the discrete partial derivatives of the
unwrapped solution. The solution φi,j that minimizes
\varepsilon^2 = \sum_{i=0}^{M-2}\sum_{j=0}^{N-2}\left(\phi_{i+1,j}-\phi_{i,j}-\Delta^x_{i,j}\right)^2 + \sum_{i=0}^{M-2}\sum_{j=0}^{N-2}\left(\phi_{i,j+1}-\phi_{i,j}-\Delta^y_{i,j}\right)^2 \quad (2.1)
forms the final solution, where \Delta^x_{i,j} represents the phase difference in the x
direction. This solution can be reduced to the discretized Poisson equation given by:
(\phi_{i+1,j} - 2\phi_{i,j} + \phi_{i-1,j}) + (\phi_{i,j+1} - 2\phi_{i,j} + \phi_{i,j-1}) = \rho_{i,j}, \quad (2.2)
where
\rho_{i,j} = (\Delta^x_{i,j} - \Delta^x_{i-1,j}) + (\Delta^y_{i,j} - \Delta^y_{i,j-1}).
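Computing the right-hand side ρ from the wrapped phase differences is a pointwise stencil; a C sketch (illustrative names, not the thesis code, with out-of-range neighbours treated as zero):

```c
#include <assert.h>

/* rho_at computes one sample of the discretized Poisson right-hand
   side from the wrapped phase differences dx (x direction) and dy
   (y direction), each stored row-major as M x N arrays.  Out-of-range
   neighbours are taken as zero. */
double rho_at(const double *dx, const double *dy,
              int M, int N, int i, int j) {
    (void)M;  /* M kept for symmetry with the equations; unused here */
    double dxp = dx[i * N + j];
    double dxm = (i > 0) ? dx[(i - 1) * N + j] : 0.0;
    double dyp = dy[i * N + j];
    double dym = (j > 0) ? dy[i * N + (j - 1)] : 0.0;
    return (dxp - dxm) + (dyp - dym);
}
```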
This can be solved in the frequency domain by constructing reflections in the
x and y directions (in order to fulfill boundary condition requirements) and then
applying a two-dimensional Fourier transform to the input data. Alternatively, a
two-dimensional Discrete Cosine Transform (DCT) can be used. Applying the Fourier
transform to Equation 2.2, and noting that \Phi and P represent the transformed
versions of \phi and \rho, we get:
\Phi_{m,n} = \frac{P_{m,n}}{2\cos(\pi m/M) + 2\cos(\pi n/N) - 4}. \quad (2.3)
The PCG algorithm uses Conjugate Gradient (CG) to solve the discretized Poisson
equation. The CG technique has rapid and robust convergence and is guaranteed to
converge in N iterations for an N × N matrix (barring roundoff error). However, the
actual number of iterations depends on the condition of the original matrix. If the
original matrix is close to the identity matrix, the iteration converges rapidly. In order
to achieve this condition, a preconditioning step is applied that solves an approximate
problem, the unweighted least-squares phase unwrap. After this, the usual CG
steps are performed. The algorithm is shown in the pseudocode in Figure 2.12.
Compute residual r_k of weighted phase Laplacians
Initialize solution phi to zero
for (k = 0 to MaxNumberOfIterations-1)
    Solve P z_k = r_k using unweighted phase unwrapping to get z_k
    Use the CG method to update phi using z_k
    if solution lies within predefined convergence bounds
        exit loop
end
Figure 2.12: Preconditioned Conjugate Gradient Algorithm pseudo-code
As can be seen in Figure 2.13, the PCG algorithm produces a smooth, continuous
result, but the image has a gradually increasing magnitude from right to left. This
affects both the foreground as well as the background, rendering the technique of
limited use. However, the conjugate gradient method presented here will be used
later for the Minimum LP Norm algorithm.
Minimum LP Norm
The Minimum LP Norm is similar to the PCG method since it also aims to minimize
the difference in gradients between the measured and calculated phases. However,
PCG uses p = 2, the least-squares norm, whereas the Minimum LP Norm algorithm
uses p = 0. This means that the Minimum LP Norm algorithm minimizes the
number of points where the gradients of the measured data differ from those of the
calculated solution, whereas the PCG algorithm minimizes the square of the differences,
which spreads the error so that the measured gradients rarely match the solution exactly.
For the Minimum LP Norm algorithm, we are trying to solve Equation 2.4.
Figure 2.13: Preconditioned Conjugate gradient algorithm on the two embryo sample
(\phi_{i+1,j}-\phi_{i,j})U_{i,j} + (\phi_{i,j+1}-\phi_{i,j})V_{i,j} - (\phi_{i,j}-\phi_{i-1,j})U_{i-1,j} - (\phi_{i,j}-\phi_{i,j-1})V_{i,j-1} = c(i,j) \quad (2.4)

where U and V are data-dependent weights and c is the weighted phase Laplacian
given by

c(i,j) = \Delta^x_{i,j}U(i,j) - \Delta^x_{i-1,j}U(i-1,j) + \Delta^y_{i,j}V(i,j) - \Delta^y_{i,j-1}V(i,j-1).
This equation can be rewritten as a matrix equation, as in Equation 2.5.
Qφ = c, (2.5)
which is solvable by the PCG method discussed in Section 2.3.2. A pseudocode
description of the algorithm is given in Figure 2.14. The full implementation also has
options for applying user-input or dynamically generated quality maps to the data.
Initialize solution phi_0 to zero
for (k = 0 to maxIterations)
    Compute residual R
    if R has no residues, exit loop
    Compute data-dependent weights U and V
    Compute weighted phase Laplacian c
    Subtract c from the weighted phase Laplacian of the current solution
        (the left side of the Lp norm equation)
    Solve Q phi = c with PCG
end
if (no residues in residual)
    Unwrap using Goldstein's algorithm
else
    Apply post-processing congruency operation
end
Figure 2.14: Minimum LP Norm Algorithm pseudo-code
The results of the Minimum LP Norm Algorithm are displayed in Figure 2.15. As
can be seen, it produces the best quality images thus far, slightly better than Flynn’s
method. There are, however, several incorrect areas such as the cell boundaries in the
lower embryo. This is partly because the algorithm reached its maximum number
of iterations without eliminating all residues from the residual.
Multi-Grid
Multi-grid methods enable the rapid solution of PDEs on large grids. They usually
operate as fast as Fourier methods but have the advantage that they can handle
non-power-of-two array sizes. These algorithms, while theoretically operating as
Figure 2.15: The Minimum LP Norm algorithm on the two embryo sample
fast as or faster than the PCG algorithm, fail to produce meaningful results for the
data produced by the OQM modality as shown in Figure 2.16 and are not discussed
further.
2.3.3 Choosing The Right Algorithm
The primary criterion by which these algorithms were judged was the quality of
their unwraps over a wide array of benchmarks, mostly consisting of real data sets
but occasionally with artificially constructed situations that posed challenging unwraps
(such as imaging tiny glass beads or optical fiber). Overall, we noted that the
path following algorithms operated quickly, but often had isolated sections that were
Figure 2.16: The Multi-grid algorithm on the two embryo sample
unwrapped poorly. The minimum norm algorithms had smooth solutions, but often
had large errors as in PCG. The Minimum LP Norm algorithm produced the best
overall solution at the expense of the greatest computation time. Thus the Minimum
LP norm algorithm was chosen as the algorithm to accelerate for this research.
2.4 Bitwidth Analysis
Implementing an algorithm with full floating point accuracy is usually not feasible
on either Digital Signal Processors (DSPs) or on Field Programmable Gate Arrays
(FPGAs) due to either a lack of support on the former or the formidable size re-
quirements on the latter platform. Thus before implementing any new floating point
algorithm on these platforms, it is important both to verify that the data can be
converted to fixed point and still have the algorithm operate accurately and also to
discover the minimum bitwidth that can be used. Finding this minimum bitwidth
can result in large area savings in hardware. It has been noted that for Conjugate
Gradient (which is used in the Minimum LP norm phase unwrapping algorithm that
we use), precision issues directly affect the number of iterations and hence lower pre-
cision can actually increase time to convergence. The importance of precision is one
reason why large CG calculations are usually computed in double precision and hence
difficult to implement on FPGAs without using mixed floating point precisions [36].
It is less important for the GPU implementation since GPUs support floating point
natively in hardware, albeit currently in single precision.
We implemented C code that performs a fixed point, bit-accurate calculation of
the preconditioning step of the Minimum LP norm phase unwrap. This was possible
since the operations performed by the preconditioner, which include the Discrete
Cosine Transform (DCT) and some intermediate floating point operations, are all
scalable with a possible loss of precision. For example if f(x) represents the DCT
and Poisson calculation, and if f(x) = V then f(Cx) = CV . This allows for the fixed
point implementation to be performed by multiplying the single precision floating
point input data x : −1 < x < 1, by a scaling factor C = 2p where p represents
the number of bits to be shifted. After the scaling, a cast to an integer data type
truncates all data after the decimal point. After the calculation of f is performed,
the results are then cast to floating point and scaled down. The FFT used in the
DCT C code implementation was KISS FFT [21], a simple open-source package that
supports both fixed and floating point.
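The scaling scheme can be sketched in a few lines of C; the function names and the shift parameter p are illustrative, not from the thesis code:

```c
#include <assert.h>
#include <stdint.h>

/* to_fixed scales a sample x in (-1, 1) by C = 2^p and truncates it
   to an integer; to_float undoes the scaling after the computation.
   Because f(Cx) = Cf(x) for the DCT and Poisson steps, the scaled
   integer pipeline produces (up to truncation error) the scaled
   floating-point result. */
int32_t to_fixed(float x, int p)   { return (int32_t)(x * (float)(1 << p)); }
float   to_float(int64_t v, int p) { return (float)v / (float)(1 << p); }
```

The truncation toward zero in to_fixed is the precision loss that the bitwidth experiments below quantify.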
The quality of the images produced by the different bitwidths was analyzed by
visual quality, by the absence of still-wrapped sections and discontinuous jumps, and
by the difference between the fixed point and the full double-precision implementation. We
tested it over a large number of data sets of which one is shown in Figure 2.17.
Figure 2.17: Image produced by a double-precision unwrap
The double embryo image shown was just one of the benchmark images used to
determine the optimal bitwidth. However, this image presents a challenging unwrap
that does not converge completely before reaching the maximum number of iterations.
Hence it represents one of the worst-case real-world images.
The 27-bit unwrap shown in Figure 2.18 has small isolated areas of very low
phase. This stretches the overall magnitude range of the data and causes wild visual
variation between this and the double-precision version.
Figure 2.18: Using a bitwidth of 27
This variation disappears with the 28-bit version seen in Figure 2.19. It does not
have the isolated low phase regions and so presents an unwrap very similar to that of
the double-precision version. However, the only FFT core that we had access to was
a 24-bit fixed-point core provided by Xilinx [45]. Since we knew that twenty-four bits
would be insufficient, we made use of the block floating point functionality present
in the core to give our data the greater dynamic range we knew was necessary.
2.5 Related Work
In this section we present work related to our research. The work discussed
pertaining to FPGAs and to GPUs differs in topic. This is because the FPGA implementation
presented in this thesis performs the preconditioning step of the PCG algorithm,
which, while the most computationally intensive part, is only a small section of the
Figure 2.19: Using a bitwidth of 28
overall algorithm. This preconditioning step involves a DCT and some floating point
computation. The GPU implementation on the other hand implements the entire
conjugate gradient calculation as well as the Minimum LP norm algorithm’s inner
loop.
2.5.1 FPGAs
There exist a large number of image processing applications that have been mapped
to FPGAs. We implement a large 2D DCT transform of size 1024× 512 along with
some floating point calculations. No 2D DCT transforms of this size were discovered
in the literature. However, the popularity of the JPEG and MPEG standards has
resulted in a proliferation of smaller hardware implementations of the DCT targeted
towards specific sizes, most commonly the 8x8 2D implementation since this is what
is required by the standard. This is usually accomplished by multiplying an 8x8 block
from the image data by an 8x8 coefficient matrix, resulting in an output matrix that
contains the component frequencies. This is a fairly expensive operation, requiring
4096 additions and another 4096 multiplications. There have been many publications
detailing the implementation of such 2D matrix multiplications using distributed
arithmetic (DA) which reduces the number of multiplications required, but which still
results in relatively large, low latency hardware. Woods et al. [30] used a combination
of 1D DCTs, distributed arithmetic, and transpose buffers on a Xilinx XC6264 to
generate such a design that utilized 30 percent of total chip area. Bukhari et al.
[17] investigated the implementation of a modified Loeffler algorithm for 8x8 DCTs
on various FPGAs. Siu and Chan proposed a multiplierless VLSI implementation[5]
for an 11x11 2D transform and many other variations on small 2D DCTs exist. Our
implementation differs from these in that our transform size is 1024×512 while using
a relatively small proportion of the chip.
Larger 2D DCTs can be implemented using 1D DCTs by taking advantage of the
DCT’s separability property. This is accomplished by first taking the DCT of all the
rows and then of all the columns. However, there do not exist many implementations
of large 1D DCTs for reconfigurable systems. In [44], an 8 to 32 point core with a
maximum of 24-bit precision was implemented using distributed arithmetic for the
vector-coefficient matrix multiplication. A 32 point, 24 bit instance of this core has
a latency of only 32 cycles, but consumes 10588 LUTs which makes the approach
impractical for large designs. In contrast, our approach is much more compact and
therefore enables larger transform sizes at the cost of higher latency. Leong [19]
implements a variable radix, bit serial DCT using a systolic array but only describes
area requirements for designs up to N=32 which consume between 457 and 1363
adders and have a high worst case error. In comparison, our approach supports
much larger transform sizes with demonstrated designs of up to 1024 points.
The Spiral project uses a heuristic algorithm to explore the DFT design space with
performance feedback to generate a hardware-software DFT implementation, but the
comparison is with a software implementation on the embedded PowerPC which is
severely outclassed by a modern desktop processor. The project does however include
a customizable 1D DCT implemented in Verilog that is available from their website
and is the only available large 1D-DCT found[8].
Unlike the DCT, there exist many implementations of the Fast Fourier Transform
(FFT) on FPGAs, some of which date back to 1995, as in the case of Shirazi et al.
[33]. They implemented a 2-D FFT, complex multiply, and a 2-D IFFT on the Splash-2
computing engine using a non-standard 18-bit floating point representation. The
hardware they target is the Xilinx 4010, and for the purposes of that application
they use 34 FPGA chips to achieve adequate throughput. The nature of their
application does not require sending data back to the host. Our implementation on
the other hand uses just a single, more modern FPGA and a higher accuracy, semi-
floating point representation. Dillon implements a high performance floating point
2D FFT that would greatly improve our own results if integrated into our design [9].
Bouridane et al. [14] implement a high performance 2D FFT as well, but on images
half the size of ours and with similar performance.
Conjugate Gradient calculations have been implemented on FPGAs although in
a hybrid manner with strongly coupled host-accelerator interactions. This is because
of the lack of space on a single FPGA to implement the entire CG kernel at a high
enough precision to ensure convergence. Prasanna et al. [38] implemented a
double precision hybrid CG solver in this manner but saw modest gains only in the
cases where limited cache sizes on the host forced page faults. If the data was already
in cache, significant slowdowns were measured. Strzodka et al.[36] investigated the
effects of using a mixture of precisions to achieve near double-precision accuracy on
FPGAs, but also noted slowdowns in performance.
2.5.2 GPUs
With the introduction of unified shaders with DirectX 10 and above, the GPGPU
community has witnessed an explosion in the number of suitable applications acceler-
ated on these platforms. The papers discussed here reflect only those closely related
to the higher level conjugate gradient kernel or preconditioning of the matrix which
is what was implemented on the GPU for this research.
Karasev et al. [16] use GPUs to implement 2D phase unwrapping on NVIDIA
GPUs using CG [24] and achieve a 35x speedup. They implement the weighted least
squares algorithm, similar to the PCG and multigrid algorithms discussed previously,
and apply it to Interferometric Synthetic Aperture Radar (IFSAR) data. They chose
to use multigrid and Gauss-Seidel iterations to solve the minimization problem and
compare their results to C and Matlab implementations. However, multigrid tech-
niques do not work on our datasets as has been shown previously in Cary Smith’s
research [35] as well as in the experiments described in Section 2.3.2. Their algorithm
also requires a very high number of iterations to converge (on the order of tens of
thousands), a known result of using Gauss-Seidel. In comparison, the PCG and
Minimum LP norm algorithms require tens or hundreds of iterations. Thus their
total computation time is greater than ours.
Bolz et al. [15] implement sparse matrix conjugate gradient solvers and multi-grid
solvers on GPUs. They achieve only modest speedups over CPUs as they work with
the ATI 9700 and Geforce FX generation of hardware and OpenGL and are thus
limited to working within the constraints of the classical graphics pipeline rather
than the unified shader architecture of the DirectX 10+ compatible video cards.
The 2D DCT required by the preconditioning step uses a 2D FFT, a complex
multiply and some reordering. NVIDIA provides a high performance FFT library
called CUFFT [25] that has been benchmarked by HP and shown to provide a 3x
speedup for large transform sizes[10] as measured against a highly optimized software
implementation on a multicore HP server available as part of Intel’s MKL library [12].
The point at which a GPU implementation becomes feasible lies between the 512×512
and 1024×1024 matrix sizes, which see speedups of about 0.9 (a slowdown) and
3.0 respectively in real-world scenarios. Our input data set is 1024×512 and we are
running on a single-core desktop machine, so we should expect to see some
improvement.
2.6 Conclusions
In this chapter, details of the Keck fusion microscope and the two implementation
platforms, FPGAs and GPUs, were provided. Next, the results of several phase
unwrapping algorithms were analyzed, and the Minimum LP Norm algorithm was
settled upon as the one producing the best results. The kernel of this algorithm was
implemented in fixed point in software, and a minimum usable bitwidth of 28 bits for
the FPGA implementation was decided on, which due to IP constraints was modified to
24 bits and an exponent. Finally, related work on both FPGA and GPU platforms
was presented and discussed.
In the next chapter we will discuss the algorithm used for the DCT and the
implementation of the preconditioner and conjugate gradient calculations on GPUs
and FPGAs.
Chapter 3
Implementation
In the previous chapter we discussed the implementation platforms and the various
phase unwrapping algorithms available in the literature. We settled on the Minimum
LP norm algorithm which iterates the Preconditioned Conjugate Gradient (PCG)
algorithm. PCG consists of two steps: a preconditioning step that consists
of a DCT and some floating point calculations to solve the Poisson equation, and a
conjugate gradient calculation that consists of a variety of matrix operations.
This chapter presents details of the implementation of the preconditioner and CG
calculations on the FPGA and the GPU as well as the timing profiles that prompted
the implementation of those specific sections. It also describes the algorithms used
for the DCT and discusses the various data transfer modes utilized by the various
platforms.
3.1 Experimental Platforms And Timing Profile
This section describes the performance of the reference Minimum LP Norm algorithm
running on the two different General Purpose Processors (GPPs) used in this thesis.
They provide the platforms against which speedup is measured. Also discussed are
details of the Wildstar II Pro platform and the Annapolis API as well as further
details of the 8800 GTX board provided by NVIDIA.
3.1.1 Host Machine Descriptions And Timing Profiles
As discussed in Section 4.1, two different host machines are used for the FPGA
and GPU due to their different bus requirements. The reference code is profiled
on both machines since they both have drastically different performance numbers
and represent technologies four years apart. A breakdown of the two machines is
presented in Table 3.1.
                    Machine 1 (2004)        Machine 2 (2008)
Processor           Pentium IV Xeon         Core 2 Duo (Penryn)
L1 Data Cache       1x16 KB                 2x32 KB
L2 Cache            1 MB                    6 MB
Frequency           3 GHz                   3 GHz
Number of Cores     1 core                  2 cores
RAM                 1 GB DDR2               4 GB DDR2
Front Side Bus      4x200 MHz               4x333 MHz
Video Card          NVIDIA Quadro           NVIDIA NVS 290 and 8800 GTX
OS                  Windows XP Pro          Windows XP Pro x64
Accelerator         Wildstar II Pro PCI-X   NVIDIA 8800 GTX
Table 3.1: A comparison between the two platforms
The two machines differ greatly in most aspects except for peak operating fre-
quency. Execution time on both platforms for the reference software implementation
of the Minimum LP norm algorithm is also fairly different and is shown in Figure 3.1.
The data set that produced these timing numbers was the glass bead data. In this
test, the algorithm runs until it reaches the maximum number of iterations and then
times out. The main reason for the performance difference is the size of the caches
since with Machine 2 an entire data set can fit in cache. It should be noted that
the reference software implementation runs only on the General Purpose Processor
(GPP) and utilizes only a single core. The basic datatype used is the single precision
float, although some key values are stored in double precision.
Figure 3.1: A comparison between the performance of the two machines
Disk IO takes approximately 1.3 seconds on Machine 1 and 700ms on Machine 2,
thus representing a fair amount of the overhead (computation not involving the PCG).
As is readily noticeable, computation time is dominated by the preconditioning step
of the PCG kernel (approximately 75 percent of the total time in both cases), with
the remainder mostly occupied in performing other sections of the conjugate gradient
calculations. The entire Preconditioned Conjugate Gradient algorithm (which forms
the core of the Minimum LP algorithm) takes up 94 percent of the total computation
time. Thus the preconditioner and conjugate gradient portions of the algorithm are
prime candidates for implementation on an accelerator.
3.2 Algorithms
This section describes the algorithms used in the FPGA and GPU implementations.
They are modifications of those used in the reference software implementation of the
phase unwrapping algorithm and are optimized for efficient reuse of existing cores.
We start by describing the DCT algorithm since it takes up most of the computation
time in the preconditioner.
3.2.1 The Discrete Cosine Transform - An Overview
The Discrete Cosine Transform (DCT) is used in a wide variety of applications such
as image and audio processing due to its compaction of energy into the lower fre-
quencies. This property is exploited to produce efficient frequency-based compression
methods in various image and audio codecs such as JPEG and MPEG. However, the
DCT is also used in other applications that require larger sized transforms such as
those using the Preconditioned Conjugate Gradient (PCG) technique in applications
like adaptive filtering [11] and phase unwrapping [1]. In this section we discuss an al-
gorithm, first developed by Makhoul [20], and an implementation of it that utilizes a
Fast Fourier Transform (FFT) core to compute a DCT without significantly increas-
ing overall latency as compared to just a FFT core. The advantage of this approach
is the ready availability of a large number of FFT cores in both fixed-point [45] and
floating-point [2] formats which can be easily dropped in with minimal modifications
to the overall design.
3.2.2 Algorithm Details for the 1D DCT
The general algorithm presented here was first discussed by Makhoul [20]. It is an
indirect algorithm for computing DCTs using FFTs and describes the method we
used to implement the DCT on an FPGA. The steps are presented along with their
correspondence to the computation done in hardware.
Given an input signal x(n), the DFT of that signal is given by:
X(k) = \sum_{n=0}^{N-1} x(n)\,e^{-j2\pi nk/N}, \qquad k = 0 \ldots N-1
The cosine transform can be viewed as the real part of X(k), which is equivalent
to saying that it is the Fourier transform of the even extension of x(n), given that
x(n) is causal (i.e. x(n) = 0 for n < 0). This is the inspiration for the usual technique
for implementing the DCT, which is by mirroring the set of real inputs and taking
the real DFT of the resulting sequence. This mirroring can be performed in any of
four ways: around the n=-0.5 and n=N-0.5 sample points, around n=0 and n=N,
around n=-0.5 and n=N, and finally around n=0 and n=N-0.5. All of these methods
result in slightly different DCTs. The most commonly used even extension is the one
depicted in Figure 3.2 and this will be the focus of the algorithm and implementation
presented.
Figure 3.2: Even extension around n=-0.5 and n=N-0.5
This category of DCT, obtained by taking the DFT of a 2N-point even extension,
is known as the DCT Type II, or DCT-II, and is defined as:
X(k) = 2\sum_{n=0}^{N-1} x(n)\cos\!\left(\frac{(2n+1)\pi k}{2N}\right), \qquad k = 0 \ldots N-1, \quad (3.1)
with the even extension defined as:
x'(n) = \begin{cases} x(n) & n = 0 \ldots N-1 \\ x(2N-n-1) & n = N \ldots 2N-1 \end{cases}
The DCT-II can be shown to be solvable via DFT by noting that:
X(k) = \sum_{n=0}^{2N-1} x'(n)\,e^{-j\pi nk/N}
     = \sum_{n=0}^{N-1} x(n)\,e^{-j\pi nk/N} + \sum_{n=N}^{2N-1} x(2N-n-1)\,e^{-j\pi nk/N}
     = e^{j\pi k/2N} \sum_{n=0}^{N-1} x(n)\left[e^{-j\pi nk/N}e^{-j\pi k/2N} + e^{j\pi nk/N}e^{j\pi k/2N}\right]
     = 2\,e^{j\pi k/2N} \sum_{n=0}^{N-1} x(n)\cos\!\left(\frac{(2n+1)\pi k}{2N}\right).
This is identical to the definition of the DCT in Equation 3.1 except for a
multiplicative factor of e^{j\pi k/2N}. A similar method can be used to write an IDCT in terms of
a length-2N complex IDFT. Full details can be found elsewhere [20].
The performance of the DCT in terms of latency and area can be further improved
such that an N-point real DFT/IDFT may be used. The method for
accomplishing this is outlined below.
A sequence v(n) can be constructed from x(n) such that it follows the restriction:
v(n) = \begin{cases} x(2n) & n = 0 \ldots \frac{N-1}{2} \\ x(2N-2n-1) & n = \frac{N+1}{2} \ldots N-1 \end{cases} \quad (3.2)
When the DFT of v(n) is computed and the result is multiplied by 2e^{-j\pi k/2N}, the
resulting sequence can be written as:
X(k) = 2\sum_{n=0}^{N-1} v(n)\cos\!\left(\frac{(4n+1)\pi k}{2N}\right), \qquad k = 0 \ldots N-1,
which is an alternative version of the DCT based on v(n).
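The reordering of Equation 3.2 is a simple permutation: even-indexed samples go to the front in order, odd-indexed samples to the back in reverse. A C sketch (the function name is illustrative):

```c
#include <assert.h>

/* dct_shuffle builds v(n) from x(n) per Equation 3.2: the even-indexed
   samples of x come first, then the odd-indexed samples in reverse. */
void dct_shuffle(const double *x, double *v, int N) {
    for (int n = 0; n <= (N - 1) / 2; n++)
        v[n] = x[2 * n];
    for (int n = (N + 1) / 2; n < N; n++)
        v[n] = x[2 * N - 2 * n - 1];
}
```

Taking the length-N DFT of v(n) and applying the twiddle factor then yields the DCT, as in the equation above.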
Again, for the IDCT, the real sequence X(k) can be rearranged to form a complex
Hermitian symmetric sequence V(k), where Hermitian symmetry is defined as
X(N-k) = X^*(k), and the sequence itself as:

V(k) = \frac{1}{2}\,e^{j\pi k/2N}\left[X(k) - jX(N-k)\right], \qquad k = 0 \ldots N-1.
An IDFT on V (k) generates the v(n) sequence described earlier, which can then be
rearranged to form x(n). This is the method used in the implementation discussed
in the later sections of this thesis. However, for both the size N DCT and IDCT
it should be noted that the input sequences are either entirely real or Hermitian
symmetric and can thus be computed using FFTs with a point size of N/2 [29, 20].
This can be done by setting alternating elements of v(n) to the real and imaginary
parts of a new sequence t. That is,
t(n) = v(2n) + jv(2n+1), \qquad n = 0 \ldots \frac{N}{2} - 1.
The DFT of this sequence can then be computed and the original V (k) extracted
by using:
V(k) = \frac{1}{2}\left[ T(k) + T^*\!\left(\tfrac{N}{2} - k\right) \right] - \frac{j}{2}\, e^{-j2\pi k/N} \left[ T(k) - T^*\!\left(\tfrac{N}{2} - k\right) \right].
This gives the original V (k) which can subsequently be used to generate X(k).
A similar method can be applied to the real IFFT to realize similar savings.
The implementation described in this thesis does not use the last FFT optimization
(which reduces the required transform size from N to N/2) because it involves an extra
multiplication step that would reduce the accuracy of the fixed-point results. For a
floating-point or low-precision implementation, however, this would be a feasible
optimization.
3.2.3 Algorithm Details for the 2D DCT
In addition to the 1D case Makhoul presented in [20], he also discussed a method of
performing 2D DCTs using 2D FFTs. This is the technique that is utilized for the
GPU computation since decomposing the matrix into 1D arrays would have a large
data transfer overhead and 2D FFT cores are easily available from NVIDIA [25].
Similar to the 1D DCT transform, the 2D DCT can be broken down into three
steps. First, a shuffle rearranges the matrix according to the equation below. Note that
x is the input matrix, v is the shuffled matrix, and N1 and N2 denote the dimensions
of the matrix.
v(n_1, n_2) = \begin{cases}
x(2n_1,\, 2n_2) & 0 \le n_1 \le \frac{N_1-1}{2},\; 0 \le n_2 \le \frac{N_2-1}{2} \\
x(2N_1-2n_1-1,\, 2n_2) & \frac{N_1+1}{2} \le n_1 \le N_1-1,\; 0 \le n_2 \le \frac{N_2-1}{2} \\
x(2n_1,\, 2N_2-2n_2-1) & 0 \le n_1 \le \frac{N_1-1}{2},\; \frac{N_2+1}{2} \le n_2 \le N_2-1 \\
x(2N_1-2n_1-1,\, 2N_2-2n_2-1) & \frac{N_1+1}{2} \le n_1 \le N_1-1,\; \frac{N_2+1}{2} \le n_2 \le N_2-1.
\end{cases}
The second step involves taking an N_1 \times N_2 2D FFT of v(n_1, n_2), thereby producing
V(k_1, k_2). This is equivalent to the 2D DCT after performing the computation given in
Equation 3.3. Note that the 2D DCT function is given by C(k_1, k_2).
C(k_1, k_2) = 2\,\mathrm{Re}\!\left( W_{4N_2}^{k_2} \left( W_{4N_1}^{k_1} V(k_1, k_2) + W_{4N_1}^{-k_1} V(N_1-k_1, k_2) \right) \right). \qquad (3.3)
A similar procedure applies to the 2D IDCT as detailed in [20].
3.2.4 Algorithm Details for the Conjugate Gradient
This section describes what happens within each iteration of the conjugate
gradient, which is implemented on the GPU. Further details can be found in [32] and
[1]. Pseudocode for the Preconditioned Conjugate Gradient (PCG) is presented in Figure
3.3.
Details of the preconditioning step of the PCG algorithm by means of an un-
weighted algorithm have already been given in Section 2.3.2. The remaining steps
are fairly straightforward matrix operations that adapt well to the highly parallel
nature of a GPU.
for k = 0 to MaxIterations−1
    Apply preconditioner to get z_k
    if k = 0 then
        p_1 = z_0
    else
        β_{k+1} = (r_k^T z_k) / (r_{k−1}^T z_{k−1})
        p_{k+1} = z_k + β_{k+1} p_k
    α_{k+1} = (r_k^T z_k) / (p_{k+1}^T Q p_{k+1})
    φ_{k+1} = φ_k + α_{k+1} p_{k+1}
    r_{k+1} = r_k − α_{k+1} Q p_{k+1}
    if norm(r_{k+1}) < ε norm(r_0) then exit loop
end loop

Figure 3.3: PCG - Detailed pseudo-code [1]
Not described in Figure 3.3 is the data centering operation which subtracts the
average from a matrix. We implemented this on the GPU as well. Calculating
the sum requires an accumulation, which can be tricky to parallelize well due to
dependencies.
3.3 The FPGA Implementation of the Preconditioner

Now that the algorithms for the DCT within the preconditioner and the conjugate
gradient have been presented, in this section we detail the implementation of the 2D
DCT and IDCT on the FPGA, along with the floating point calculations necessary for
the Poisson equation calculation. First, the initial 1D implementation is described,
then its extension to 2D. Finally, the method and components used to solve the
floating point equation are described.
We only implement the preconditioner on the FPGA since the area requirements
do not allow for the implementation of a conjugate gradient solver.
3.3.1 The One Dimensional DCT Transform
The description of the algorithm for both the forward and inverse DCT in Section
3.2.2 lends itself to a clearly defined component breakdown in terms of hardware.
For the DCT, the first component creates v(n) by reordering the input sequence
and writing it to memory. The second component is the FFT that transforms the
shuffled input data into the frequency domain. The last component multiplies the
output V(k) by 2e^{-j\pi k/2N} and extracts the desired output values from the complex FFT
output. Roughly the same components are required for the inverse DCT but in
reverse order. First, a multiplication of the re-arranged sequence Y(k) = X(k) − jX(N − k)
by (1/2)e^{j\pi k/2N} is performed. Then the data is passed through an
inverse FFT of size N, followed by the mapping of v(n) to x(n). This organization of
the components is depicted in Figure 3.4 and Figure 3.5 for the forward and inverse
transforms respectively.
Shuffle
As input data is sent sequentially into the DCT core, the first stage of processing
that occurs is the generation of the v(n) sequence. This occurs within the shuffle
component. The shuffle has a latency of one clock cycle and calculates output indices
based on the input index according to Equation 3.2. Since the shuffle component only
affects index values, all addition and subtraction performed within it is of bitwidth
log2 N. Based on these output indices, the sample value is written to block RAM in
shuffled order as shown in Figure 3.6. This step of writing to block RAM is necessary
since the FFT component takes in input in sequential order but the shuffle produces
output non-sequentially.

Figure 3.4: Components and dataflow for the forward DCT transform
For an inverse DCT shuffle, the FFT output data is re-arranged in the opposite
direction, forming x(n) from v(n). This is also written to block RAM before trans-
mitting back to the host since the data will not be generated in sequential order.
Fast Fourier Transform
The complex FFT used was provided by Xilinx LogiCORE and generated using Coregen [45].
It allows for a range of options, including a parameterized bit-width of 8 to
24 bits for both input and phase factors, the use of either block RAM or distributed
RAM, the choice of algorithm and rounding used, and the ability to set the output
ordering.

Figure 3.5: Components and dataflow for the inverse DCT transform
Because the FFT was used to implement large DCTs, it was necessary to set the
bitwidth to a large size to maximize precision. To this end a 24 bit signed input was
used along with a block floating-point exponent for each 1D transform completed.
This exponent field reduces the need to increase output bitwidth after each FFT.
Block RAM, a radix-4 block transform and bit reversed ordering were also selected.
Since the algorithm optimization for computing a real or Hermitian symmetric
FFT using a transform of length N/2 (as mentioned in the previous section) wasn’t
used due to precision issues, the imaginary input for the FFT was tied to zero. In
addition, the FFT core was set up to support both forward and inverse transforms
simultaneously as well as to have run-time configurable transform length.

Figure 3.6: Re-ordering pattern in a forward shuffle
Rebuild rotate
The rebuild rotate component implements the multiplication by 2e^{-j\pi k/2N} for the
forward transform and (1/2)e^{j\pi k/2N} for the inverse. These complex exponentials can be
converted to a format consisting of sines and cosines by using Euler's formula. For
example, the forward factor is the equivalent of 2(cos(−πk/2N) + j sin(−πk/2N)).
The Coregen Sine Cosine Lookup Table 5.0 component used has a mapping between
the input integer angle T and the calculated θ of θ = T · 2π / 2^{T\_WIDTH}. Thus, for
θ = −πk/2N and noting that 2^{T\_WIDTH} = N, the input angle works out to be −k/4
and k/4 for the forward and inverse transforms respectively. The sine and cosine of
the index k are generated and then multiplied by the results of the FFT using a complex
multiply with a latency of six cycles. This generates a 48 bit output, of which only the first
24 bits are retained.
The overall dataflow of this component is depicted in Figure 3.7. Note that for
the forward DCT transform only the real output of the complex multiplication is
used. The full functionality of the complex multiplier is retained however, since the
inverse transform requires it for the rebuild rotate as shown in Figure 3.5.
Figure 3.7: The rebuild component - Forward Transform
Dynamic Scaling
The hardware implementation was required to be as close as possible to a floating
point software implementation as detailed in the bitwidth analysis section. In order
to achieve this level of accuracy with a fixed point FFT with block floating point,
it was necessary to scale the input data on-the-fly to maximize available dynamic
range. The expanded dataflow diagram in Figure 3.8 shows the components used in
this procedure.
Figure 3.8: The 1D transform including dynamic scaling
The max tracker component receives incoming streaming floating point data
from SRAM and records the maximum exponent of the 1D frame. It does this by
using a comparator to see if the internally registered data is less than the incoming
value. If it is, the internal register is overwritten with the new value. The final
output of this component is what the entire data frame must be shifted by. This
output is calculated as 23− (MAX−126). The value 23 is used since the fixed point
FFT uses 24 bit data and the float to fixed point conversion will round numbers less
than one to zero. The 126 is to compensate for the exponent bias in IEEE compliant
floating point representation. Converting the register MAX into two’s complement
and simplifying, the calculation is performed as not(MAX) + 150.
The scale component takes in data from BRAM, and adjusts the floating point
exponent field according to the MAX value calculated above. Similarly, the rescale
component adjusts the data frame back to its original range by subtracting the MAX
value. The rescale component also adds the block exponent produced by the FFT
as well as a constant multiplicative power of two introduced by the algorithm. This
can be summarized as:
EXP_output = EXP_input + EXP_block − EXP_scale + EXP_constant,

where EXP_constant is 4 or 3 for a transform length of 1024 or 512 respectively.
Data Type Conversion
The software version of the preconditioning kernel deals with all its data as
single-precision 32-bit floating point values. However, given the limitation of having
only a fixed point 24-bit FFT available, some form of conversion between the two formats
is necessary.
The RCL VFLOAT library [28] contains parameterizable float to fixed point and
fixed to floating point units capable of performing the necessary conversions. Data
coming out of the scale component is streamed into the float to fixed point compo-
nent and then transferred to the FFT. The output of the FFT is converted back to
floating point, scaled and then stored. Thus the data-type conversion encapsulates
the core DCT logic, which is further encapsulated by the dynamic scaling logic.
3.3.2 The Two Dimensional DCT On The FPGA
A 2D DCT is required by the preconditioner since the input matrices are 2D arrays.
In this section we discuss how the 1D DCT presented in Section 3.3.1 is extended to
2D.
The DCT, similar to the FFT, is separable. This means that a two-dimensional
DCT can be constructed by performing the 1D DCT for each of the rows followed by
a 1D DCT of the columns of the resulting matrix. This technique is called the row-
column decomposition method.
The key components to extending the one-dimensional DCT discussed earlier into
two dimensions are exploiting the onboard SRAM banks to store entire images at
a time, and calculating the transpose of the matrix after each DCT iteration. This
eliminates much of the data transfer bottleneck involved in performing a 2D DCT
by transferring 1D DCT input data sets to the board sequentially. Now we transfer
a full 2D phase data set at once. A high level diagram of the components involved
as well as an approximation of the data flow is given in Figure 3.9.
Coarse Grained Controller
A high level Finite State Machine (FSM) based controller was implemented that co-
ordinates the action of other controllers and components in the design. This method
of implementation was important for debugging purposes as any high level stage
could be run independently by changing the sequence of execution in the FSM. The
flowchart in Figure 3.10 depicts the functioning of the state machine.

Figure 3.9: A High Level Diagram of the Preconditioning Kernel
On startup, the controller defaults to the IDLE state after which it resets itself
so that all registers are in a known state. Then, upon receiving a signal indicating
the start of input, it allows data to be written from the LAD bus via DMA or PIO
to SRAM A. Additional control signals are also sent via PIO. After the input stage,
execution can proceed along the full execution path or along special debug paths
depicted by the dotted line in Figure 3.10. This is set by the control signals sent by
the host on startup.
Normal execution proceeds as follows: First data is read from SRAM A and each
row is transformed and written to SRAM B at the transposed address in a streaming
fashion with signals setting transform direction to forward and orientation to row.
Next the transformation component is reset and the orientation changed to column.
Data is now streamed from SRAM B through the transform component, through the
floating point Poisson calculator and back into SRAM A at the transposed address. A
similar flow is used to inverse transform the data, but without the Poisson calculation.

Figure 3.10: The FSM controlling high level data-flow
The SRAM Controllers
Two separate SRAM controllers were implemented that independently control the
two SRAM banks used in this design. The behaviour of SRAM bank A is given in
Figure 3.11 with SRAM B following a similar format.
The SRAM A controller starts up in the IDLE state. Once the DMA transaction
initializes, it switches over to the PREWRITE state (which sets up the write signals)
and then to the WRITE2 state. It remains in WRITE2 until the entire data set has
been copied to RAM in data packed format. It then cycles through DONEWRITE
and NOP (inserting NOPs between read and write cycles is necessary for high clock
frequency transfers) and finally back to the IDLE state and PREREAD. The READ
state increments addresses while waiting for data on the SRAM to FPGA bus to
become valid. This usually takes nine cycles. Once data is valid it switches over to
READ2 and starts applying the DCT to the rows and writing to SRAM B (handled
independently by the SRAM B controller).

Figure 3.11: The SRAM A FSM
SRAM A is accessed again after the columns and floating point computation are
completed and need to be written back. At this point the controller transitions to
the WRITE1 state and writes the transposed results back to memory. Note that
neither SRAM A nor SRAM B take into account the direction of the transform as
the data flow through SRAM does not depend on transform direction. Direction only
influences the DCT core and the enable signal on the floating point calculation core.
SRAM B operates in a very similar manner to SRAM A except that it only
interfaces with the DCT core and not with the DMA controller. Thus its design is
simpler with fewer states.
Calculating The Transpose
The purpose of the transpose component is to calculate the write address of data
coming out of the DCT component. These addresses should flip the matrix along
the diagonal. This can be accomplished by switching the row and column indices,
or rather, since the SRAM memory is linearly addressed, by multiplying the column
index by the length of a row and adding the row index. In equation form:
write_addr = (col_addr / 2) × row_length + row_addr,

where row_length = 512 or 1024 depending on orientation.
The column address is divided by 2 by dropping the last bit. This truncation is
necessary for the data-packing of two output values into each SRAM word to occur.
Additionally, since row length is a power of two, the multiplication and addition is
accomplished by appending the row index to the column index. This arrangement
requires minimal resources and is accomplished within a single cycle.
3.3.3 Division And Scaling
In the preconditioner, there is a floating point computation of the Poisson equation
that occurs between the forward 2D DCT and the inverse 2D DCT. The section of
code that performs the division and scaling that characterizes the Poisson equation
in the frequency domain is given in Figure 3.12:
for (j = 0; j < 1024; j++)
    for (i = 0; i < 512; i++)
    {
        if (i == 0 && j == 0)
            array[0] = 0;
        else
            array[j*512+i] = array[j*512+i] / (4 - 2*cos(i*pi/511)
                                                 - 2*cos(j*pi/1023));
    }
Figure 3.12: Poisson Equation Calculation in the Frequency Domain
This segment of code scales the image by a factor:
4 − 2 cos(i·π/511) − 2 cos(j·π/1023).
There are two efficient ways to perform this calculation. The first is precomputing
the factors for the entire 1024 by 512 image. This is too large to load into BRAM
and so must be stored in SRAM. It will also have to be loaded into memory at
startup, although this added latency can be amortized over multiple transforms, as
long as the FPGA accelerator board is not reset. This requires added complexity
to the SRAM memory design, but does not require the two floating point add units
which are needed for the second method described next.
The second approach is to precompute only the 2 cos(j·π/1023) and the 2 cos(i·π/511)
terms and store them into BRAM, since they occupy a relatively small amount of
space. These initial values can be integrated into the FPGA bitstream and thus
do not need to be loaded onto the board after initialization. As mentioned in the
previous paragraph, the drawback to this is the addition of two floating point adders.
This method was chosen because of the relatively low area requirements as well as
the lower complexity of the required controller.
Both approaches require a floating point divide and the second requires floating
point addition. This is provided by the Xilinx floating point operator[46] which
supports both functions. Two add units and one divide unit were instantiated and
connected as detailed in Figure 3.13.
Figure 3.13: Implementation of the floating point divide and scale logic
3.4 GPU Implementation
The GPU implementation was developed for Machine 2 (Machine 2 is described in
Section 3.1.1). It uses a combination of NVIDIA supplied libraries and some custom
kernels to implement the preconditioner and the conjugate gradient calculation on
the GPU, along with kernels from the minimum LP norm calculation. The entire set of LP
norm and PCG algorithm kernels was implemented on the GPU, as this eliminates
much of the data transfer between successive iterations compared to implementing
only the preconditioner.
3.4.1 Preconditioner
The preconditioner uses a 2D DCT/IDCT and some floating point computation in
order to transform the input matrix into a form that converges rapidly when the
conjugate gradient method is used to solve the equations presented in Section 2.3.2.
The algorithm used for the DCT is described in Section 3.2.3. It focuses on the
reuse of an existing 2D FFT, in our case, the highly optimized CUFFT provided
by NVIDIA [25]. CUFFT provides a complex Fourier transform that leverages the
floating point capabilities and the parallelism available to GPUs to rapidly compute
1D, 2D or 3D transforms. Like the popular FFTW library [23] it uses a plan based
approach to setting up and executing FFTs. However, it does not possess the same
degree of flexibility as FFTW, such as supporting real-to-real transforms or DCTs.
Hence it was necessary to implement kernels that performed the two-dimensional
shuffle and complex multiplication necessary to convert between the FFT and DCT.
The FFT calculation is the most time consuming part of the DCT computation,
followed by the shuffle. This shuffle reorders the input matrix in four different ways
depending on the location of the individual data point. This procedure is not compute
bound and is limited by the performance of the memory bus and the efficiency of
scatter/gather operations.
The complex multiply step represents a straightforward kernel that is easily par-
allelized since each matrix value can be operated on independently. This was im-
plemented in the standard CUDA way of thread-per-pixel which involved assigning a
thread to each matrix data point.
3.4.2 Conjugate Gradient
The steps in the Conjugate Gradient calculation were presented in detail in Section
3.2.4. The actual implementation has a few optimizations that trade off memory for
computation, but mostly follows the pseudocode faithfully.
There are several operations that show up repeatedly, such as pointwise matrix
multiplication/addition and matrix accumulation. The pointwise functions parallelize
extremely well since there exist no dependencies between data points. To implement
these functions, a similar method to the complex multiplication discussed in Section
3.4.1 was used.
The accumulation kernel was somewhat more complicated since there are dependencies
inherent in accumulation, and so it cannot be parallelized to the same
degree. In stream processing terminology, this operation is called reduction, since
the number of threads goes from N² for an N×N matrix down to 1. This operation was
used frequently, so it was necessary to optimize it carefully. Techniques described
elsewhere [22] were used to implement conflict-free sequential addressing, maximal
thread utilization and to completely unroll loops.
The Qp_{k+1} calculation, as discussed in Section 2.3.2, where the matrix Q contains
the weights, was implemented using the thread-per-pixel method,
since each data point can be operated on independently. However, this kernel takes
a performance hit due to the integer calculation of indices to ensure that they fulfil
boundary conditions. This non-regularity of the indexing prevents some optimization
of the code, but in practice the performance is not affected significantly since this
kernel does not consume as much of the total processing time as accumulation or the
FFT.
3.5 Data Transfer
In many applications that are not arithmetically intensive, data transfer times to the
accelerator dominate the total computation time. For this reason it is necessary to
discuss the impact of data transfer as well as to characterize the latencies involved
for the data sets in question.
The machine on which the Annapolis Wildstar II Pro is installed utilizes a 100
MHz PCI-X bus to communicate with the accelerator board. This is a parallel bus
which has a peak throughput of 6.4 Gbits per second for a 64-bit bus.
The machine on which the GPU is installed uses a PCI-E x16 bus to transfer data
which supports a throughput of 32 Gbits per second or 4 Gbytes per second, which
is around five times faster than the PCI-X version.
3.5.1 Programmed IO
The Programmed IO mode (PIO) only exists for the FPGA implementation. The
PIO method of data transfer is provided by the Annapolis LAD bus component. The
LAD bus provides an abstraction for dealing with the bus to the PCI controller chip
and removes some of the timing related control. DMA is also implemented using the
LAD bus abstraction.
The LAD bus allows for an address range to map directly to a register file on
the FPGA. This is typically used for transferring control data although it can also
be used to transfer information in small chunks that are typically equal to or less
than the size of the register file. The ease of implementation, however, led to this
being the first method of data transfer from the host to the board. The drawback of
this method is that it requires a large amount of chip area for the register file and
the addressing logic, as well as having a high transfer latency once hand-shaking is
enabled. The area overhead for a bank of 64 32-bit words can amount to over ten
percent of the entire chip area on a Virtex II Pro 70.
In our final implementation, PIO is still utilized to transfer control and debug
information. However, all data transfer occurs through DMA.
3.5.2 Direct Memory Access (DMA)
Direct Memory Access (DMA) is a method by which data can be transferred using a
separate DMA controller, thus requiring no intervention by the main processor apart
from specifying the size and location of data in memory. It allows for large block
transfers of data while freeing up the processor for other tasks that can be run in
parallel.
DMA transfers were used in both the GPU and FPGA implementations, although
across different buses. The two implementations are described below.
DMA on the FPGA
Many of the lower level details of working with the PCI bus are abstracted away
through the use of the LAD bus and the PCI controller chip. However, there still
exist control sequences for initialization and bus read/write that require some im-
plementation effort. For the purposes of this application, we implemented a generic
reusable core that operates using both ICLK and PCLK and handles all cross do-
main data transfer transparently. It also supplies debug information through the
core-to-dma interface that the DMA controller gets using PIO.
The interface works as follows: The user specifies a source and destination memory
address in the host C code, along with the data transfer size. The FPGA receives
this data and stores it during a DMA initialize period and then gets ready to receive.
As data comes streaming in, it is buffered in a dual ported asynchronous BlockRAM
(to handle the cross-domain clocking issue) and the data avail line goes high. After
this, data is output to the core phase unwrapping design.
Writes back to the host are handled similarly. The wr en line is set high and
then data is streamed to the DMA controller. It crosses through an asynchronous
BlockRAM and is output to the host. Once the transaction is completed, an interrupt
is set to signal to the host that new data is available in memory.
The PCI-X bus is capable of 800 MB/s, however the FPGA chip on the board
has a 32-bit bus to the PCI-X controller chip (since the 64-bit PCI-X bus is split
between the two FPGAs on the Wildstar II Pro) that operates at the same frequency
as the PE clock (although phase shifted). This results in data transfer speed in our
application being dominated by the PE consumption rate.
DMA on the GPU
DMA is the only method available for block data transfers from host memory to the
onboard GDDR3 memory on the 8800GTX GPU board. It is used to transfer the
eight floating point data arrays and one character array. This corresponds to a total
of 16.5 MB of data. On our PCI-E x16 bus this takes 4.125 ms for a one way transfer.
There are two methods available for implementing DMA transfers using CUDA.
The first stalls the processor while the transfer occurs, but does not require the use of
pinned (or reserved) memory. This is useful in memory intensive applications. The
second does not stall the processor, but requires that memory be set aside exclusively
for DMA. We implemented the second method since the phase-unwrapping uses a
small amount of memory relative to the memory available on the system.
3.6 Conclusion
In this chapter, the algorithms and their implementation in FPGAs and GPUs were
described. The various methods of data transfer and some of the choices regarding
implementation were also discussed.
In the next chapter, we will present the results of our implementation and directly
compare and contrast the two platforms in terms of our chosen metrics.
Chapter 4
Results
Previous chapters presented the background and implementation of the phase un-
wrapping algorithm on two separate platforms, FPGAs and GPUs. This chapter
presents the results of the experiments performed on those two platforms. We start
by describing the experimental setup and how we verified the results of both imple-
mentations. Next we present the benchmark suite that we used for testing and then
continue on to present the results. We look at three common metrics: performance,
performance per dollar and power consumption.
4.1 Experimental Setup
The platform for which the FPGA implementation was designed is the Annapolis
WildStar II Pro [3]. Synthesis was performed using Synplicity Pro 8.8 with pipelin-
ing and resource sharing enabled. Place and route was performed using the Xilinx
Foundation tools 7.1i. Version 1.1 of the CUDA SDK was used for GPU development.
CHAPTER 4. RESULTS 84
4.1.1 Verification
It was important to verify both the FPGA and GPU versions, as both implementations
use lower precision than the reference software implementation (whose
parameters are described in Section 3.1.1). The reference implementation uses mixed
single and double precision data types for different parts of the conjugate gradient
calculation with the preconditioner performed entirely in single precision. As men-
tioned previously, the FPGA implementation uses a mixed fixed and floating point
implementation for the preconditioner whereas the GPU version implements the full
LP norm calculation in single precision. It has also been documented that the Con-
jugate Gradient method is highly sensitive to precision [36]. Thus it was important
to verify that our implementation provided sufficiently accurate results.
We used two criteria to decide upon the accuracy of our solution. First we looked
at the results produced and compared them to the original implementation visually.
The two accelerated results shown in Figure 4.1 were almost identical with only minor
variation. Figure 4.1 depicts the original software unwrap, the GPU unwrapped
version and the FPGA unwraps. In Figure 4.2 we show the difference between the
accelerated versions and the reference implementation. Both of these show only
minimal variation. The image used for verification is the glass bead sample test that
produces over fifteen thousand residues.
The second metric that we used, as a rough guide as to the quality of the unwrap
while the unwrapping process was underway, was the number of residues eliminated
after each stage of the PCG calculation. In this case the FPGA and software im-
plementations were essentially identical in terms of the initial number of residues
detected due to only the preconditioner being implemented in hardware. However,
the lack of double precision elements in the CG part of the GPU implementation
caused some miscomparisons against the threshold at which a residue is identified.
Thus the GPU implementation had slightly more residues (around one to ten more)
at the start of the unwrap, but the FPGA had slightly more at the end of the un-
wrap due to the use of mixed-precision. These resulted in minimal differences in the
unwraps. In the case of the glass bead, this meant that initially, 15909 residues were
detected for the GPU version rather than 15907 for the FPGA and software versions,
and the final solution had 4 residues for the GPU versus 6 for the software and 8 for
the FPGA versions.
Figure 4.1: Phase unwraps on a) the reference software implementation, b) the GPU and c) the FPGA
Figure 4.2: Differences in phase unwraps between software and a) the GPU and b) the FPGA
4.1.2 Benchmark Suite
The benchmark suite that we used consisted of three images that encompassed the
range of possible datasets. First, a single mouse embryo image posed an unwrap that
converged to zero residues within 7 iterations of the LP norm or 140 iterations of the
PCG (note that LP norm uses PCG as detailed in Section 2.3.2). Next, the glass
bead sample iterates until it reaches the maximum number of iterations, currently
set at 10 LP norm iterations, each of which iterates the PCG core 20 times. Last
is the double embryo image which takes the full number of iterations as well, but
converges more slowly than the glass bead. Unlike the single mouse embryo which
converges with zero residues and the glass bead that ends with six, the double embryo
terminates with 11 residues remaining.
Each iteration of the LP norm takes the same amount of time; hence the glass
bead and the double embryo unwraps take the same amount of time, whereas the
single embryo image takes 70 percent of their execution time since it iterates
only seven of the ten times.
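The iteration structure described above can be sketched as a driver loop. This is a structural illustration only; `solve_pcg` and `count_residues` are hypothetical stand-ins for the actual kernels.

```python
def lp_norm_unwrap(phase, solve_pcg, count_residues,
                   max_lp_iters=10, pcg_iters=20):
    # Outer LP norm loop: each iteration re-solves with PCG and stops
    # early once the unwrap is residue-free.
    surface = phase
    for lp_iter in range(1, max_lp_iters + 1):
        surface = solve_pcg(surface, iters=pcg_iters)
        if count_residues(surface) == 0:
            return surface, lp_iter
    return surface, max_lp_iters
```

With these defaults, a dataset that converges on the seventh LP norm iteration (like the single embryo) runs 7 x 20 = 140 PCG iterations, while one that never converges (like the glass bead) runs the full 10 x 20 = 200.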
4.2 Results
Now that experimental parameters such as the benchmark suite and verification
procedures have been presented, we discuss performance as measured by three
metrics. First is processing time, or how quickly we arrive at the solution using
the GPU and the FPGA. Next is the cost of the two platforms relative to their
processing power. Last of all we discuss power consumption.
4.2.1 Experiments
In the following experiments, we measure performance (by means of timing profiles
on various sections of the code), cost-effectiveness (by generating the comparison of
performance per unit cost) and finally power (by measuring the current draw at the
wall while running the algorithm for a high number of iterations). All timing numbers
are given in seconds, and power in watts. Note that performance per dollar uses
the inverse of the preconditioner processing time as the performance number; this
is roughly proportional to floating-point operations per second (FLOPS). Machine
1 and Machine 2 are detailed and profiled in Section 3.1.1 and hold the FPGA
accelerator and the GPU board respectively.
4.2.2 FPGA Area Consumption
The area consumption statistics for the single DCT core implementation on the
Annapolis Wildstar II Pro, which uses a Virtex II Pro FPGA, are presented in
Table 4.1.
Component     Available    Used    Percentage
Multipliers         328      48        14 %
BlockRAMs           328      28         8 %
PowerPCs              2       0         0 %
Slices            33088   11328        34 %
Table 4.1: FPGA area consumption for the single DCT core implementation
Up to two more DCT cores could be implemented on the Virtex II Pro, since
current slice usage (the constraining factor) is largely consumed by data
transfer and control; the DCT core itself occupies relatively little area. The
data transfer and control logic does not need to be replicated when adding
cores, so there is room for a maximum of three cores on the FPGA.
4.2.3 Performance
Figure 4.3 shows the speedup when running the preconditioner on the FPGA board
with one DCT core, versus running the entire program on the GPP in software on
Machine 1. These timing numbers are for the glass bead dataset and represent the
sum of 200 iterations of the PCG kernel.
Execution time goes down from an average of 95 seconds for the software version to
40.5 seconds for the FPGA accelerated version, which corresponds to a 2.35x speedup.
These are complete algorithm numbers including full disk IO, data transfer, and
all related costs. Once the overhead, or non-kernel-related functionality, is
removed from the timing, leaving just the preconditioner, we see a 3.76x speedup,
corresponding to a GPP computation time of 74 s and an FPGA time of 19.7 s.
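For concreteness, the two speedup figures quoted above follow directly from the measured times:

```python
# Measured times in seconds for the glass bead dataset (200 PCG iterations).
sw_total,  fpga_total  = 95.0, 40.5   # whole application, incl. disk IO
sw_kernel, fpga_kernel = 74.0, 19.7   # preconditioner kernel only

app_speedup    = sw_total / fpga_total     # ~2.35x
kernel_speedup = sw_kernel / fpga_kernel   # ~3.76x
```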
Figure 4.3: Speedup achieved using the FPGA versus the reference software implementation on Machine 1
Figure 4.4 shows the algorithm speedup when running the LP norm kernel on the
GPU. This includes the preconditioner and the conjugate gradient calculations. It
was executed on Machine 2, whose parameters can be found in Section 3.1.1. These
numbers were generated for the glass bead dataset and thus represent the sum of 200
iterations of the PCG kernel.
The overall algorithm speedup including disk IO and all data transfer is 5.24x.
The section seeing greatest acceleration is the preconditioner which is sped up by
a factor of 9.3x to a time of 1.2s for 200 iterations. One of the reasons why we
see this level of application acceleration is that there is no host-accelerator data
transfer occurring for the preconditioner since the data transfer occurs once per
whole PCG iteration for the GPU implementation. All data transfer occurs over the
GPU memory-processor bus which has a bandwidth of 86.4 GB/s. All sections of the
algorithm see speedups with the exception of disk IO and the overhead calculations.
The preconditioner still takes the majority of the calculation time, but now disk IO
comes a close second. Any further speedup will be limited by the fact that disk
IO cannot be significantly accelerated.
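The disk IO ceiling is an instance of Amdahl's law: once the kernel is fast, the unaccelerated fraction dominates total runtime. A minimal sketch of the bound (the fractions in the usage note are illustrative, not measured values from this work):

```python
def amdahl_speedup(serial_fraction, kernel_speedup):
    # Overall speedup when only the accelerable (1 - serial_fraction)
    # share of the runtime is sped up by kernel_speedup.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / kernel_speedup)
```

For example, if disk IO were 20 percent of the total runtime, even an infinitely fast kernel could deliver no more than a 5x overall speedup.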
Figure 4.5 attempts to compare the FPGA implementation to the GPU imple-
mentation while equalizing other parameters. This necessitates some extrapolation
of the implementation.
The solid bars represent actual synthesized results. One finding from the single
core implementation was that there is sufficient space to implement up to three
cores on the FPGA without significant increases in complexity or any stalling.
We depict that result in the third column.
The reason why the FPGA implementation was synthesized for a Virtex 5 as well
as the Virtex II Pro was that the G80 GPU is a relatively recent product released
in November 2006. The Virtex 5 was released around the same time and thus they
Figure 4.4: Speedup achieved using the GPU versus the reference software implementation on Machine 2
both represent technologies from the same era. The Virtex 2 Pro was released in
2002 which is significantly older. The Virtex 5 implementation operates at 238 MHz,
or about twice the clock speed of the Virtex II Pro. The timing of the FFT core
used also ensures that there is enough computation time to allow three DCT cores
to operate in parallel without significantly changing the core design. Lower logic
utilization on the Virtex 5 due to embedded DSP slices also ensures that there will
be sufficient available area. The last column gives the execution time excluding
data transfer, because for the GPU, no host-to-board data transfer occurs for
the preconditioner; thus this last column models only execution time. The nature
of each result is given in parentheses below the column label, indicating
whether it represents synthesized or projected performance; columns with no
label are real, measured performance.
Even in this last case, the GPU outperforms the FPGA by a factor of 1.9x. This
is due to the high degree of parallelism and the high clock frequency the GPU
achieves in single-precision floating point, which are not attainable on the
FPGA.
Figure 4.5: Time to complete 200 iterations of the preconditioner on both platforms
This section only compares the performance of the preconditioner on the FPGA
versus the GPU since only the preconditioner was implemented on the FPGA. Im-
plementing the entire Conjugate Gradient calculation on the FPGA was infeasible
since it requires single precision accuracy at the very least and implementing the sort
of matrix operations present in the CG calculation in single precision would require
many FPGAs. The best scenario would be a hybrid system like that presented in
[38], which actually saw slowdowns in some non-cache-optimal cases. The possibility
of running one computational kernel, storing the results in off-chip SRAM and re-
programming the FPGA with another kernel is also infeasible given that an FPGA
requires on the order of 110 ms to upload a bitstream (this was timed on the Wildstar
II Pro). The long reprogramming latency is due to the size of the bitstream (over
three megabytes for a V2P70), the serial nature of bitstream loading and the slow
write but fast read nature of the FPGA SRAM LUTs.
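A back-of-the-envelope check, using the 110 ms load time measured above and the 200 PCG iterations of the benchmark runs, makes the infeasibility concrete: swapping bitstreams between two kernels every iteration would cost more in reprogramming alone than the entire 40.5 s accelerated run.

```python
reconfig_s     = 0.110   # measured bitstream load time on the Wildstar II Pro
iterations     = 200     # PCG iterations per glass bead unwrap
swaps_per_iter = 2       # load preconditioner, then load CG, each iteration

reprogram_overhead = reconfig_s * swaps_per_iter * iterations   # 44 s
```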
4.2.4 Cost Effectiveness
In this section we examine the cost effectiveness of the two platforms. Figure
4.6 presents our results. The cost metric is of some importance to the development
of the OQM modality of the Keck microscope, since the goal is to eventually have
the technology commercialized.
The GPU is a mass-market, consumer-level product and, as such, is available
for between five and six hundred dollars for a high-end model. FPGA accelerator
boards are relatively low-volume products and sell for much more, in the range
of ten thousand dollars for a high-end model. In the RCL lab, a new machine with
a Virtex
5 accelerator board was recently purchased for twelve thousand dollars (including the
cost of the machine). This machine is shown as Machine 3 in Figure 4.6. Machine
2 cost approximately twenty thousand (and has two FPGAs) and Machine 1 cost
about twenty two hundred. Both Machine 1 and 2 are the same as those presented
in Section 3.1.1. Performance is based on the reciprocal of the time to complete the
same phase-unwrapping algorithm with identical data sets on both platforms.
Figure 4.6: A comparison showing the performance per dollar on three platforms
The mass-produced GPU clearly wins out when cost is taken into account, since
it performs more than twice as fast as the Virtex 5 at a fraction of the price.
4.2.5 Power
The last metric that we discuss is power. This metric has always been important
in the embedded space, but is becoming increasingly important in modern day HPC
applications since the cost of cooling and powering a cluster over the life of the
hardware can be a significant fraction of the cost of the hardware itself [34]. In
Figure 4.7 we present total power consumption for the two platforms while running
the implementation. This was measured at the wall using a meter. In Figure 4.8 we
show the difference between idle consumption and processing consumption for the
two platforms. This is the actual power consumption of processing the phase unwrap
data. Note that the accelerators consume minimal power when not being used.
Figure 4.7 largely reflects the difference in host processor architectures
rather than accelerator power consumption. Machine 1 contains a Xeon with a peak
power consumption rating of 103 W [13], whereas Machine 2 has a Core 2 Duo with
a peak power consumption rating of 65 W. This shows up clearly on the graphs.
Figure 4.7: A comparison showing the total power consumption for the two machines
Figure 4.8 shows the differences in power consumption between running the phase
unwrapping algorithm on the GPP and on the accelerator for the two machines.
The first pair of columns show the difference between idle power consumption and
processing power consumption when using the GPP. The second pair of columns show
the difference between idle power consumption and processing power consumption
when using the accelerator. The FPGA consumes substantially less power than
both the GPU and the software versions. Thus for Machine 1, power consumption
is actually lowered by 25W while processing is sped up when implemented on an
FPGA. Machine 2 sees a 69W increase in power consumption by running the phase
unwrapping on the GPU.
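Whether the GPU's extra 69 W actually costs more energy depends on runtime as well as power: energy to solution is the product of the two. A small sketch of that accounting (the wattages and runtimes in the usage note are illustrative assumptions, not measurements from this work):

```python
def energy_to_solution_j(idle_w, processing_delta_w, runtime_s):
    # Wall energy for one run, in joules: the baseline (idle) draw plus
    # the processing increment, integrated over the runtime.
    return (idle_w + processing_delta_w) * runtime_s
```

With illustrative numbers, an accelerator that adds 69 W but halves a 20 s runtime still wins on energy: (100 + 69) x 10 = 1690 J versus (100 + 0) x 20 = 2000 J for the unaccelerated run.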
4.3 Summary
In this chapter we presented the results of our experiments. The GPU outperforms
the FPGA by a significant margin and also wins in terms of cost, because the
chip is mass produced. The FPGA is more power efficient, however, which may
be a consideration if phase unwrapping ever needs to be performed in an embedded
fashion. In the next chapter we present some final conclusions drawn from our results
and discuss future work.
Figure 4.8: A comparison showing the power consumption difference between the processor running in the idle state and executing the algorithm using either a GPP or an accelerator
Chapter 5
Conclusion and Future Work
5.1 Conclusion
The type of computational accelerator that is used in an application depends on
the algorithmic nature of the application itself as well as on its intended usage. For
example, a high-performance GPU would be a bad match for an embedded application
and a low-power FPGA would be a bad match for an HPC application. In the context
of the Keck microscope, the high performance and low cost of the GPU platform
makes the most sense since power and area are not major concerns. If development
on the microscope progresses to the point where an embedded accelerator is needed
then the feasibility of an FPGA should be revisited.
The accuracy of both platforms varied as well. The need for greater than single
precision in the conjugate gradient calculation was evidenced by the differing number
of residues detected versus the software implementation. However, this difference is
minimal enough not to significantly affect the results. The preconditioner has
less stringent accuracy requirements, although since the conjugate gradient
depends on its output, it still needs close to single precision. Again, the
results of our mixed fixed- and single-precision FPGA implementation indicate
that the differences compared to the software version are negligible.
The raw computational power of the GPU surpasses that of a comparable FPGA
platform for the preconditioner by a factor of almost two. In addition, the relatively
long time to reprogram an FPGA eliminates its viability as an accelerator for the
entire conjugate gradient application (multiple FPGAs streaming data between them
would have to be used instead which would be prohibitively expensive in terms of
cost and power). There is a limited amount of speedup that the FPGA can produce
by accelerating only the preconditioner and not the conjugate gradient calculations
as well.
5.2 Future Work
This phase unwrapping project will eventually be integrated into the registration
and preprocessing steps used by the OQM microscope, thus eliminating the need
for disk IO. This will bring the effective phase unwrapping time to 2.3 seconds. In
addition, newer GPUs with faster and wider buses as well as more stream
processors [26] that support the CUDA API are currently available. These will
produce significant performance improvements with minimal, or possibly no,
changes to the code.
Further exploration of the feasibility of conjugate gradient on FPGAs would also
be of use to phase unwrapping and other applications. Finding the right balance
between computation offloaded to the accelerator and computation run on the GPP
would be a valuable problem to solve, and one that would change with each new
generation of hardware.
Finally, various optimizations could be made to increase the clock frequency and
the number of cores for the FPGA designs discussed in this thesis. It is also possible
to split the design, including the conjugate gradient calculation, over multiple FP-
GAs. Similarly, optimizations exist that could lower the execution time for the GPU
implementation as well as parallelize the execution over multiple GPUs. Exploration
of these possibilities would push the boundaries of what is currently achievable
in hardware and could lead to valuable results not only for those interested in
phase unwrapping, but also for those working in the HPC and image processing
fields.
Bibliography
[1] Dennis C. Ghiglia and Mark D. Pritt. Two-Dimensional Phase Unwrapping: Theory, Algorithms and Software. Wiley Inter-Science, 605 Third Avenue, New York, NY, 10158-0012, 1998.
[2] 4DSP Inc. IEEE-754 compliant floating-point FFT core for FPGA. http://www.4dsp.com/fft.htm, Last accessed March 2007.
[3] Annapolis Micro. Annapolis Micro Systems Inc. - Wildstar II Pro PCI.http://www.annapmicro.com/wsiippci.html, Last accessed July 2008.
[4] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Transactions on Graphics, 23(3):777–786, August 2004.
[5] Y. Chan and W. Siu. On the realization of discrete cosine transform using the distributed arithmetic. IEEE Transactions on Circuits and Systems, 39(9):705–712, Sept 1992.
[6] C. H. Crawford, P. Henning, M. Kistler, and C. Wright. Accelerating Computing with the Cell Broadband Engine Processor. In Proceedings of the 2008 conference on Computing Frontiers, pages 3–12, 2008.
[7] Cray. Cray XD1 Datasheet. http://www.cray.com/downloads/Cray XD1Datasheet.pdf, Last accessed July 2008.
[8] P. D’Alberto, P. Milder, A. Sandryhaila, F. Franchetti, J. Hoe, J. Moura, and M. Puschel. Generating FPGA-accelerated DFT libraries. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM’07), pages 173–184, 2007.
[9] T. Dillon. Two Virtex-II FPGAs deliver fastest, cheapest, best high-performance image processing system. In Xilinx Xcell J., pages 70–73, 2001.
[10] HP. Accelerating HPC using GPUs. http://www.hp.com/techservers/hpccn/hpccollaboration/ADCatalyst/downloads/accelerating-HPCUsing-GPUs.pdf,Last accessed July 2008.
[11] A. Hull and W. Jenkins. Preconditioned conjugate gradient methods for adaptive filtering. In IEEE International Symposium on Circuits and Systems, pages 540–543, June 1991.
[12] Intel. Intel math kernel library 10.0. http://www.intel.com/cd/software/products/asmo-na/eng/307757.htm, Last accessed July 2008.
[13] Intel. Intel Xeon Processor 3 GHz. http://processorfinder.intel.com/details.aspx?sSpec=SL7DW, Last accessed July 2008.
[14] I.S. Uzun and A. Amira and A. Bouridane. FPGA Implementations Of Fast Fourier Transform For Real-Time Signal And Image Processing. In Proceedings of the IEEE Conference On Vision, Image And Signal Processing, volume 152, pages 283–296, June 2005.
[15] Jeff Bolz and Ian Farmer and Eitan Grinspun and Peter Schroder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Transactions on Graphics, 22(3):917–924, July 2003.
[16] Karasev, P.A. and Campbell, D.P. and Richards, M.A. Obtaining a 35x Speedup in 2D Phase Unwrapping Using Commodity Graphics Processors. In Radar Conference, 2007 IEEE, pages 574–578, April 2007.
[17] Khurram Bukhari, Georgi Kuzmanov and Stamatis Vassiliadis. DCT and IDCT Implementations on Different FPGA Technologies. In Program for Research on Integrated Systems and Circuits (ProRISC), pages 232–235, November 2002.
[18] G. Laevsky, W. C. W. II, M. Rajadhyaksha, and C. A. DiMarzio. Multi-Modal Microscope for Biomedical Research. In Life Science Systems and Applications Workshop, pages 1–2, July 2006.
[19] M. P. Leong and Philip H. W. Leong. A Variable-Radix Digit-Serial Design Methodology and its Application to the Discrete Cosine Transform. IEEE Transactions on Very Large Scale Integrated Systems, 11(1):90–104, Feb 2003.
[20] J. Makhoul. A Fast Cosine Transform in One and Two Dimensions. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(1):27–34, February 1980.
[21] Mark Borgerding. Kiss FFT. http://sourceforge.net/projects/kissfft/, Last accessed July 2008.
[22] Mark Harris. Optimizing Parallel Reduction in CUDA. http://developer.download.nvidia.com/compute/cuda/11/Website/projects/reduction/doc/reduction.pdf, Last accessed July 2008.
[23] Matteo Frigo and Steven G. Johnson. The Design and Implementation of FFTW3. In Proceedings of the IEEE, volume 93, pages 216–231, Feb 2005.
[24] NVIDIA. Cg - Reference Manual. http://developer.download.nvidia.com/cg/Cg 2.0/2.0.0015/Cg-2.0 May2008 ReferenceManual.pdf, Last accessed July 2008.
[25] NVIDIA. CUFFT Library. http://developer.download.nvidia.com/compute/cuda/1 1/CUFFT Library 1.1.pdf, Last accessed July 2008.
[26] NVIDIA. GeForce GTX 280. http://www.nvidia.com/object/geforcegtx 280.html, Last accessed July 2008.
[27] NVIDIA. NVIDIA CUDA Programming Guide. http://developer.download.nvidia.com/compute/cuda/1 1/NVIDIA CUDA Programming Guide 1.1.pdf, Last accessed July 2008.
[28] Pavle Belanovic. Library of Parameterized Hardware Modules for Floating-Point Arithmetic with An Example Application. Masters Thesis, Northeastern University, June, 2002.
[29] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1986.
[30] R. Woods and D. Trainor and J.P. Heron. Applying an XC6200 to real-time image processing. IEEE Design and Test of Computers, 15(1):30–38, Jan-Mar 1998.
[31] Rapidmind. RAPIDMIND: Product resources. http://www.rapidmind.net/resources.php, Last accessed July 2008.
[32] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, August 1994. http://www.cs.cmu.edu/ quake-papers/painless-conjugate-gradient.pdf, Last accessed July 2008.
[33] N. Shirazi, A. Abbot, and P. Athanas. Implementation of a 2-D Fast Fourier Transform on FPGA-Based Custom Computing Machines. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM’95), pages 155–163, April 1995.
[34] Shushant Sharma and Chung-Hsing Hsu and Wu-chun Feng. Making a Case for a Green500 List. In 20th International Parallel and Distributed Processing Symposium (IPDPS) Workshop on High-Performance, Power-Aware Computing (HP-PAC), April 2006.
[35] C. Smith. Phase unwrapping algorithms. Masters Thesis, Northeastern University, 2004.
[36] R. Strzodka and D. Goddeke. Pipelined Mixed Precision Algorithms on FPGAs for Fast and Accurate PDE Solvers from Low Precision Components. In IEEE Proceedings on Field-Programmable Custom Computing Machines, 2006, pages 259–270, 2006.
[37] T. Valich. GPU supercomputer: Nvidia Tesla cards to debut in Bull system. http://www.tomshardware.com/news/nvidia-graphics-supercomputer,5219.html, Last accessed July, 2008.
[38] V.K. Prasanna and G.R. Morris and R. D. Anderson. A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer. In Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 3–12, 2006.
[39] W. Warger. Cell counting using the OQM modality. Masters Thesis, Northeastern University, June, 2006.
[40] Warren E. Ferguson Jr. Selecting Math Coprocessors. IEEE Spectrum, pages 38–41, July 1991.
[41] S. Wasson. Ageia’s PhysX physics processing unit. http://techreport.com/articles.x/10223, Last accessed July 2008.
[42] William Thies and Michal Karczmarek and Saman Amarasinghe. StreamIt: A Language for Streaming Applications. In Proceedings of the 11th International Conference on Compiler Construction, volume 2304, pages 179–196, 2002.
[43] Xilinx. Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet. http://www.xilinx.com/support/documentation/data sheets/ds083.pdf, Last accessed July 2008.
[44] Xilinx Inc. 1-D Discrete Cosine Transform (DCT) V2.1. http://www.xilinx.com/ipcenter/ catalog/logicore/docs/da 1d dct.pdf, Last accessed March 2007.
[45] Xilinx Inc. Fast Fourier Transform 3.2. http://www.xilinx.com/ipcenter/catalog/logicore/docs/xfft.pdf, Last accessed March 2007.
[46] Xilinx Inc. Floating-Point Operator v1.0. http://www.xilinx.com/bvdocs/ipcenter/data sheet/floating point.pdf, Last accessed October 2007.