NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Thesis Title: Phase Unwrapping on Reconfigurable Hardware and Graphics Processors
Author: Sherman Braganza
Department: Electrical and Computer Engineering
Approved for Thesis Requirements of the Master of Science Degree
Thesis Advisor: Prof. Miriam Leeser Date
Thesis Reader: Prof. Charles DiMarzio Date
Thesis Reader: Prof. David Kaeli Date
Department Chair: Prof. Ali Abur Date
Graduate School Notified of Acceptance:
Dean: Prof. David E. Luzzi Date
Copy Deposited in Library:
Reference Librarian Date
Phase Unwrapping on Reconfigurable Hardware and Graphics
Processors
A Thesis Presented
by
Sherman Braganza
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements for the degree of
Master of Science
in
Electrical Engineering
in the field of
Computer Engineering
Northeastern University
Boston, Massachusetts
August 2008
© Copyright 2008 by Sherman Braganza. All Rights Reserved.
Acknowledgement
I would like to thank my advisor Professor Miriam Leeser. Without the opportunities
and guidance she has provided, none of the work presented in this thesis would have
been possible. The patience, understanding, and technical advice she has given me
have been invaluable in my research and personal growth.
I would also like to thank my colleagues in the Reconfigurable Computing Lab at
Northeastern University. The friendly support they offered provided an enjoyable and
productive work environment that I will miss. I would also like to thank my family
and friends for their encouragement and support.
I would also like to thank Professor Charles DiMarzio and William Warger II
for their help with any questions that I had regarding the OQM microscope and
the data sets that they provided. Finally, I would like to acknowledge the sup-
port of CenSSIS, the Center for Subsurface Sensing and Imaging Systems, under the
Engineering Research Centers Program of the National Science Foundation (award
number EEC-9986821), without whose funding this would not have been possible.
Abstract
Phase unwrapping is the process of converting discontinuous phase data into a con-
tinuous image. This procedure is required by any imaging technology that uses phase
data such as MRI, SAR or OQM microscopy. Such algorithms often take a significant
amount of time to run on a general purpose computer, making it difficult to
process large quantities of information. This thesis focuses on implementing a specific
phase unwrapping algorithm known as Minimum LP norm unwrapping on a Field
Programmable Gate Array (FPGA) and a Graphics Processing Unit (GPU) for the
purpose of acceleration. The computation required involves a matrix preconditioner
(based on a DCT transform) and a conjugate gradient calculation along with a few
other matrix operations. These functions are partitioned to run on the host or the
accelerator depending on the capabilities of the accelerator. The tradeoffs between
the two platforms are analyzed and compared to a General Purpose Processor (GPP)
in terms of performance, power and cost.
Contents
1 Introduction 14
2 Background 16
2.1 The Keck Fusion Microscope . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Modalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Optical Quadrature Microscopy (OQM) . . . . . . . . . . . . 17
2.2 Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 WildStar II Pro PCI . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.3 NVIDIA GPUs - G80 . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Phase Unwrapping - Algorithms and Selection . . . . . . . . . . . . . 26
2.3.1 Path Following Algorithms . . . . . . . . . . . . . . . . . . . . 28
2.3.2 Minimum Norm Algorithms . . . . . . . . . . . . . . . . . . . 35
2.3.3 Choosing The Right Algorithm . . . . . . . . . . . . . . . . . 41
2.4 Bitwidth Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.2 GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3 Implementation 52
3.1 Experimental Platforms And Timing Profile . . . . . . . . . . . . . . 52
3.1.1 Host Machine Descriptions And Timing Profiles . . . . . . . . 53
3.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2.1 The Discrete Cosine Transform - An Overview . . . . . . . . . 55
3.2.2 Algorithm Details for the 1D DCT . . . . . . . . . . . . . . . 56
3.2.3 Algorithm Details for the 2D DCT . . . . . . . . . . . . . . . 59
3.2.4 Algorithm Details for the Conjugate Gradient . . . . . . . . . 60
3.3 The FPGA Implementation of the Preconditioner . . . . . . . . . . . 61
3.3.1 The One Dimensional DCT Transform . . . . . . . . . . . . . 62
3.3.2 The Two Dimensional DCT On The FPGA . . . . . . . . . . 69
3.3.3 Division And Scaling . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 GPU Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4.1 Preconditioner . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.2 Conjugate Gradient . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.5.1 Programmed IO . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.5.2 Direct Memory Access (DMA) . . . . . . . . . . . . . . . . . . 79
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4 Results 83
4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1.1 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.1.2 Benchmark Suite . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.2 FPGA Area Consumption . . . . . . . . . . . . . . . . . . . . 88
4.2.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.4 Cost Effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2.5 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5 Conclusion and Future Work 98
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Bibliography 101
List of Figures
2.1 Optical Quadrature Microscopy setup [39] . . . . . . . . . . . . . . . 18
2.2 Virtex II Pro - Architecture [43] . . . . . . . . . . . . . . . . . . . . . 21
2.3 An image of the Annapolis Wildstar II Pro PCI [3] . . . . . . . . . . 22
2.4 A block diagram showing the various components of the Wildstar II
Pro [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 NVIDIA G80 core - Architecture[27] . . . . . . . . . . . . . . . . . . 25
2.6 A wrapped image. Note the data range which lies between π and −π 28
2.7 The need for smart phase unwrapping algorithms a) A raster unwrap
using Matlab's 'unwrap' routine b) A minimum LP norm unwrap . . . 29
2.8 Goldstein’s algorithm on the two embryo sample . . . . . . . . . . . . 31
2.9 Quality Mapped algorithm on the two embryo sample . . . . . . . . . 32
2.10 Mask Cut algorithm on the two embryo sample . . . . . . . . . . . . 34
2.11 Flynn’s algorithm on the two embryo sample . . . . . . . . . . . . . . 35
2.12 Preconditioned Conjugate Gradient Algorithm pseudo-code . . . . . . 38
2.13 Preconditioned Conjugate gradient algorithm on the two embryo sample 39
2.14 Minimum LP Norm Algorithm pseudo-code . . . . . . . . . . . . . . 40
2.15 The Minimum LP Norm algorithm on the two embryo sample . . . . 41
2.16 The Multi-grid algorithm on the two embryo sample . . . . . . . . . . 42
2.17 Image produced by a double-precision unwrap . . . . . . . . . . . . . 44
2.18 Using a bitwidth of 27 . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.19 Using a bitwidth of 28 . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1 A comparison between the performance of the two machines . . . . . 54
3.2 Even extension around n=-0.5 and n=N-0.5 . . . . . . . . . . . . . . 57
3.3 PCG - Detailed pseudo-code[1] . . . . . . . . . . . . . . . . . . . . . . 61
3.4 Components and dataflow for the forward DCT transform . . . . . . 63
3.5 Components and dataflow for the inverse DCT transform . . . . . . . 64
3.6 Re-ordering pattern in a forward shuffle . . . . . . . . . . . . . . . . . 65
3.7 The rebuild component - Forward Transform . . . . . . . . . . . . . . 66
3.8 The 1D transform including dynamic scaling . . . . . . . . . . . . . . 67
3.9 A High Level Diagram of the Preconditioning Kernel . . . . . . . . . 70
3.10 The FSM controlling high level data-flow . . . . . . . . . . . . . . . . 71
3.11 The SRAM A FSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.12 Poisson Equation Calculation in the Frequency Domain . . . . . . . . 74
3.13 Implementation of the floating point divide and scale logic . . . . . . 75
4.1 Phase unwraps on a) The reference software implementation, b) The
GPU and c) The FPGA . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Differences in phase unwraps between software and a) The GPU and
b) The FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Speedup achieved using the FPGA versus the reference software im-
plementation on Machine 1 . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4 Speedup achieved using the GPU versus the reference software imple-
mentation on Machine 2 . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.5 Time to complete 200 iterations of the preconditioner on both platforms 92
4.6 A comparison showing the performance per dollar on three platforms 94
4.7 A comparison showing the total power consumption for the two machines 95
4.8 A comparison showing the power consumption difference between the
processor running in the idle state and executing the algorithm using
either a GPP or an accelerator . . . . . . . . . . . . . . . . . . . . . . 97
List of Tables
3.1 A comparison between the two platforms . . . . . . . . . . . . . . . . 53
4.1 FPGA area consumption for the single DCT core implementation . . 88
Chapter 1
Introduction
Computational accelerators are not a new idea, but the emergence of two platforms,
FPGAs and GPUs, is reshaping how developers involved in High
Performance Computing (HPC) solve problems. Both platforms are well suited for
highly parallel tasks since they both present architectures capable of exploiting data
parallelism. FPGAs provide complete control for the designer at the cost of imple-
mentation effort whereas GPUs provide a fixed memory hierarchy and architecture
to which designers must fit their algorithms. Many image processing algorithms map
well to either platform since such algorithms tend to parallelize well. In particu-
lar, two dimensional phase unwrapping has the potential to see significant speedup
through implementation on these platforms.
Our interest in two dimensional phase unwrapping stems from the research cur-
rently ongoing with the W.M. Keck 3-D Fusion microscope [18]. One of its modalities,
known as Optical Quadrature Microscopy (OQM), utilizes phase information to re-
construct an image of the sample under investigation. This phase information is
wrapped between π and −π and needs to be unwrapped before it can be of use.
In this thesis, we explore a variety of phase unwrapping algorithms and their
suitability for processing the data sets produced by the OQM mode of the Keck mi-
croscope. We select the Minimum LP norm algorithm as being the one that provides
the best results in terms of quality and then implement its kernel on both the FPGA
and the GPU platforms. We see an overall algorithm speedup of between 2x and 3x
for the FPGA and between 5x and 6x for the GPU as compared to the host machines
on which the accelerators are running. When compared to the host machine that
is currently being used (which takes two minutes per frame), we see an algorithm
speedup of 40x.
The rest of this thesis is arranged as follows. Chapter 2 presents details about
the Keck microscope as well as discussing our initial work in selecting the right phase
unwrapping algorithm. In this chapter we also discuss other related work in the
field and how our implementation compares to them. In Chapter 3 we present the
mathematical details behind our specific implementation of the preconditioner and
the conjugate gradient calculation and then proceed to discuss our implementation on
the FPGA and the GPU platforms. We also present details of the two host machines
themselves and benchmark their performance. In Chapter 4 we present our results
for the two platforms and finally in Chapter 5 we present our conclusions and discuss
potential future work.
Chapter 2
Background
This chapter describes the motivation behind our work and the related research that
has been published in the field. We start off by describing the Keck microscope and its
Optical Quadrature Microscopy modality and then talk about the FPGA and GPU
platforms. Next, we discuss our work in analyzing the results that the various phase
unwrapping algorithms produce for the OQM datasets and determine the appropriate
bitwidth for that algorithm. Finally, the state of the art in the development of related
FPGA and GPU implementations is discussed and some conclusions presented.
2.1 The Keck Fusion Microscope
This Master’s thesis research was motivated by the Keck Fusion microscope. In
order to generate useful images from one of its modalities, it is necessary to
perform a phase unwrapping calculation. This calculation takes approximately
two minutes per frame on the current platform. This leads to a bottleneck in terms
of processing time and thus it was deemed necessary to speed it up.
2.1.1 Modalities
The Keck microscope utilizes multiple modalities in order to generate a complete
fused image of the target. These include Differential Interference Contrast (DIC), Re-
flectance Confocal Microscopy (CRM), Laser Scanning Confocal Microscopy (LSCM),
Two-photon Microscopy (TPLSM) and the one that is of interest to us, Optical
Quadrature Microscopy. Further details about the microscope and the various modal-
ities can be found in [18]. OQM is described in further detail in Section 2.1.2.
2.1.2 Optical Quadrature Microscopy (OQM)
Optical quadrature microscopy is a detection technique for measuring phase and
amplitude changes to a sinusoidal signal. A diagram of the setup is shown in Figure
2.1. A signal from a HeNe laser is split into two components, reference and unknown.
The unknown signal passes through the sample. The known reference signal is split,
with one component being phase shifted by 90 degrees. The unknown signal is then
mixed separately with both components of the known reference signal. The merged
signal consisting of the unknown and the non-phase-shifted reference is referred to as the I
channel, or the in-phase signal, while the unknown signal mixed with the 90 degree
phase-shifted reference signal is referred to as the Q channel, or the quadrature signal.
By interpreting the I and Q signals as real and imaginary values of a complex number,
it is possible to find the amplitude and phase of the unknown signal.
These concepts of quadrature detection are applied to microscopy to create the
OQM mode of the Keck 3D Fusion Microscope. Since coherent (HeNe laser) detection
provides an effective gain of |Eref| × |Esig|, low levels of light can be used for
illumination, minimizing cell exposure/damage.
Figure 2.1: Optical Quadrature Microscopy setup [39]
OQM forms the motivation behind the phase unwrapping acceleration project as
it produces wrapped phase based images with data between π and −π. This data
needs to be unwrapped before it is usable. A software implementation
of the Minimum LP norm phase unwrapping algorithm in a mixture of C and Matlab
takes nearly two minutes to process a single frame. Speeding up the processing would
render the OQM modality much more useful in processing large stacks of images and
ultimately, having a near real-time version showing direct unwrapped output from
the microscope.
2.2 Platforms
The concept of using an external accelerator for application speedup is not a new
one. In the early days of PCs, it was common to have an extra socket for a math
coprocessor [40] that could be used to accelerate floating point computations. More
recently, platforms such as the Cell Broadband Engine have been used as accelerators
in petaflop supercomputers [6] such as the Roadrunner machine in order to reach new
levels of computing performance. FPGAs have also been used in machines such as
the Cray XD1 [7] and GPUs in systems like the Bull Novascale supercomputer [37].
Such systems are general accelerators in the sense that they apply to a wide range
of application domains. More specific accelerators such as Ageia’s PhysX physics
accelerator [41] also exist that target a restricted domain.
In the work presented in this thesis, a phase-unwrapping algorithm has been
implemented on two separate platforms: Field Programmable Gate Arrays (FPGAs)
and Graphics Processing Units (GPUs) and the results are compared to general
purpose processors.
2.2.1 FPGAs
Implementation of application-specific functionality in hardware can be performed
on either Application Specific Integrated Circuits (ASICs) or FPGAs. ASIC implementation
is generally expensive and thus is reserved for high-volume production.
ASICs are usually mass produced and cannot be reprogrammed. Gate arrays on
the other hand are programmable and exist in both volatile and non-volatile flavors.
Non-volatile types such as those that are mask-programmed are also not feasible for
prototypes or low volume production. Reprogrammable gate arrays are thus ideal
for prototyping applications.
SRAM-based programmable FPGAs (which are the target for this implementa-
tion) use look-up tables (LUTs) to implement combinational logic. These LUTs,
along with other primitives such as registers, multipliers and memory, are arranged
in a regular pattern in hardware with each component being connected by a pro-
grammable interconnect. An example FPGA architecture is shown in Figure 2.2.
This architecture allows for the implementation of arbitrary functionality limited
only by available area. By exploiting hardware based optimizations such as paral-
lelization and pipelining, designers can achieve high performance on such hardware.
For this research we use a Virtex II Pro FPGA chip which has 18-bit embedded
multipliers, on-chip RAM elements called BlockRAMs, and two embedded PowerPCs
which are not used in this implementation.
2.2.2 WildStar II Pro PCI
The specific Virtex II Pro based FPGA board on which our design was implemented
was the Wildstar II Pro by Annapolis. The Wildstar II Pro has dual Virtex II Pro
FPGA chips on the accelerator board. It was designed to speed up signal and image
processing applications. An image of the board is shown in Figure 2.3.
FPGA implementations usually achieve performance through a mixture of coarse
Figure 2.2: Virtex II Pro - Architecture [43]
and fine grained pipelining and parallelism. In order to maximize potential exploita-
tion of such characteristics, six DDR2 SRAM banks are provided for each FPGA,
resulting in a total of twelve banks. Each bank is 18 Mbits in size and is arranged as
524,288 36-bit words. The SRAM banks are all independently accessible via their own
36-bit wide buses, but are accessible as 72-bit words once passed
through the clock domain crossing logic. They run at a frequency governed by the
FPGA design speed although there are some setup latencies involved in any transfer.
The arrangement of the SRAM banks and FPGAs is shown in Figure 2.4.
Each FPGA also has a separate 64 MB DDR DRAM bank that remains unused in
the phase-unwrapping design, along with the differential pair and Rocket IO buses.
Also present is an independent bus for each FPGA to the PCI controller chip that
Figure 2.3: An image of the Annapolis Wildstar II Pro PCI [3]
interfaces with the bus to the host.
There are also multiple programmable clocks available on the chip, referred to as
PCLK, ICLK and MCLK. These correspond to the user-programmable clock, the
PCI controller bus clock and the memory clock, respectively.
Annapolis utilizes a bus architecture that they term the Local Address Data
(LAD) bus. This refers to the bus between the FPGA and the PCI controller chip.
It provides abstractions for DMA transfers and for register space read/writes (PIO).
This bus is what user generated designs interface with in order to transfer data to
and from the host.
Annapolis also provides a software API for dealing with issues such as data trans-
fer, setting up clock frequencies and other parameters of the board. These will be
described as necessary in the software section of the FPGA implementation details.
Figure 2.4: A block diagram showing the various components of the Wildstar II Pro[3]
2.2.3 NVIDIA GPUs - G80
The newest generation of DirectX 10 compatible GPU hardware supports general
purpose High Performance Computing (HPC). This was accomplished by replacing
the old graphics pipeline, composed of dedicated vertex and fragment shader units, with
unified computation units capable of handling either task. In this subsection we discuss
specific details of the architecture that enable the hardware to achieve speedup on
parallel workloads.
Hardware
NVIDIA’s first foray into explicitly supported general purpose graphics hardware was
the G80 architecture depicted in Figure 2.5. The G80 GPU consists of 128 processing
elements, each capable of operating on a separate single precision floating point datum
in parallel with the others. Groups of eight elements form a multiprocessor,
with each multiprocessor having its own shared memory. This shared memory is
used by threads on the same multiprocessor to share data. However, there exists no
way of rapidly transferring data from one multiprocessor to another (the application
programmer is required to go through main memory). Each multiprocessor has its
own set of registers, a constant cache and a texture cache. These different memory
types are optimized for different access types. There is also hardware thread control
that enables rapid thread switching and thus the hardware is optimized for dealing
with thousands of threads in parallel. This approach allows for the rapid computation
of massively parallel algorithms that exhibit high arithmetic intensity.
NVIDIA 8800 GTX
The 8800 GTX is a specific model in NVIDIA's G80 family of GPUs and represents
the second highest performing member of that family (the highest being the 8800
Ultra). It possesses a full complement of 128 stream processors operating at 1.35
GHz or twice the frequency of the rest of the core (which includes components such
as the Raster Operator etc.). The EVGA board that we use comes with 768 MB of
GDDR3 RAM operating at 900 MHz (effectively 1.8 GHz double pumped) with a
Figure 2.5: NVIDIA G80 core - Architecture[27]
total bandwidth to the GPU of 86.4 GB/s via a 384-bit wide bus. The bus to the
host is PCI-E x16 and offers a peak throughput of 4 GB/s.
Application Programmers Interface (API)
The paradigm around which General Purpose GPUs are based is the kernel-stream
concept [4] [42] [31]. This approach maps well to many highly parallel applications.
Here, a kernel running on multiple identical processors (usually arranged in a regular
array) operates on separate data independently. This is known as a Single Instruction
Multiple Data (SIMD) architecture. An example application that uses this streaming
method is scalar matrix multiplication. Each data point in the matrix can be read,
multiplied and written back independently.
In their API, called the Compute Unified Device Architecture (CUDA) API,
NVIDIA espouse a similar method which they label Single Instruction Multiple
Thread (SIMT); the difference being that SIMT instructions do not contain informa-
tion as to the width of the processor array. For the above example of scalar-matrix
multiplication, this would mean that each data point would be operated on by its
own thread.
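The scalar-matrix example can be written sequentially in C as below; under SIMT the loop disappears and each iteration would instead be executed by its own thread (a conceptual sketch, not the thesis code):

```c
/* Scale an n-element matrix (stored flat) by a constant.  Every element
 * is read, multiplied and written back independently of the others, so
 * the loop body maps directly onto one GPU thread per element. */
void scalar_matrix_mult(float *m, int n, float k)
{
    for (int idx = 0; idx < n; ++idx)
        m[idx] *= k;
}
```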
The CUDA API operates on a large number of threads, broken up into warps,
blocks and grids. A warp consists of 32 threads that are managed together on a
multiprocessor. A thread block is a larger group of threads that executes on a single
multiprocessor and it typically consists of multiple warps. A grid is a collection of
blocks and it operates on many multiprocessors. The threads per block and blocks
per grid parameters must be specified for each individual kernel, but the warp size is
fixed.
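One practical consequence of these launch parameters is the host-side calculation of blocks per grid, which must round up so that every data element is covered; a minimal sketch (the helper name is ours):

```c
/* Given a problem size and a chosen threads-per-block value, compute the
 * number of blocks per grid needed to cover every element, rounding up.
 * Kernels then guard against the few excess threads in the last block. */
int blocks_per_grid(int n_elements, int threads_per_block)
{
    return (n_elements + threads_per_block - 1) / threads_per_block;
}
```

For example, 1000 elements at 256 threads per block require 4 blocks, the last of which carries 24 idle threads.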
The CUDA API has two levels: a higher, abstract level and the Parallel
Thread eXecution (PTX) level. The two levels are mutually exclusive. The
higher level provides simpler abstractions whereas the PTX API gives the programmer
access to lower-level aspects of the GPU. All implementations discussed in this
thesis use the high-level API.
2.3 Phase Unwrapping - Algorithms and Selection
As a preliminary step to the implementation of a specific phase-unwrapping algorithm
on an accelerator (in this case an FPGA and a GPU), it was necessary for us to verify
the choice of phase unwrapping algorithm used. Our primary criterion was unwrap
quality and we tested each algorithm over widely differing datasets consisting of both
real embryo data and artificial targets such as glass beads in water or epoxy media,
and optical fibres. In this section we present the properties and the tradeoffs between
the different algorithms described by Ghiglia and Pritt [1] and our final analysis as to
which one produces the best results. Software implementations tailored to the OQM
data sets were provided by a previous student [35].
The idea behind most phase-unwrapping algorithms is that the correct unwrapped
phase varies slowly enough that the gradient between pixels is less than a half-cycle,
or π radians. If this assumption holds true, a wrapped signal may be unwrapped
by simply summing (integrating in the continuous domain) until a gradient of
magnitude greater than π is reached, at which point an integer multiple of 2π is
added to the phase and the summation continues. This is the only method for solving 1D phase-based data
sets. However, one problem with this approach is that if the data is sufficiently noisy,
spurious phase gradients greater than π are created. These large phase gradients
can lead to image corruption over large segments of the data. Lower levels of noise
(i.e. below π) also lead to an accumulation of error that eventually results in large
deviations near the end of the accumulation. Residues (discussed later in this
section) also contribute to incorrect unwraps. A wrapped data set is shown in Figure
2.6 and an example of both bad and good unwraps, performed using the raster-based
Matlab unwrap and Minimum LP norm unwrapping, is shown in Figure 2.7.
To solve this problem, various two-dimensional phase unwrapping algorithms have
been developed, each with differing tradeoffs in terms of quality and performance.
Figure 2.6: A wrapped image. Note the data range which lies between π and −π
2.3.1 Path Following Algorithms
Path following algorithms solve the noisy data problem by selecting the path over
which to integrate. Goldstein's algorithm, one of the most common path-following
algorithms, operates by identifying residues (points where the integral over a closed
four pixel loop is non-zero) and connecting them via branch cuts or paths along which
the integration path may not intersect.
One problem with Goldstein's algorithm is that it does not utilize all the data
Figure 2.7: The need for smart phase unwrapping algorithms: a) A raster unwrap using Matlab's 'unwrap' routine b) A minimum LP norm unwrap
available to guide the generation of branch cuts. By generating a map indicating the
quality of the data over the image, it is possible to unwrap instances that cannot
be unwrapped using Goldstein's algorithm. These quality maps may be user-supplied or
automatically generated using pseudo-correlation, the variance of phase derivatives or
the maximum phase gradient.
Quality maps may be combined with the branch cuts used in Goldstein's algorithm
to form a hybrid mask-cut method. The quality map is used to guide the placement
of branch-cuts. Another approach, proposed by Flynn, detects discontinuities, joins
them into loops and adds the correct multiple of 2π to each loop if the action removes more
discontinuities than it adds. Flynn's minimum discontinuity solution can also be used
with a quality map to generate higher quality solutions.
Goldstein’s Algorithm
The simplest algorithm in terms of computational complexity is Goldstein’s Branch
Cut Algorithm. It operates in the following way:
Step 1. Identify residues: This step is accomplished by integrating in a four
pixel loop starting at pixel p0. If the sum is 2π then the p0 is marked as having
a positive residue charge and if the sum is −2π then it is marked as having a
negative residue charge.
Step 2. Create Branch Cuts: This step operates by connecting residues together
by branch cuts until the sum of residue charges is zero.
Step 3. Integrate: Integration of the image is performed using a Breadth First
Search (BFS) exploration of all the pixels in the image. As each pixel is en-
countered, it is unwrapped, unless it lies on a branch cut. Next, segments of
the image that may have been isolated by branch cuts are unwrapped similarly.
Pixels on branch cuts are unwrapped separately at the end.
As can be inferred from the steps above, Goldstein's algorithm operates in O(N²)
time while consuming O(N²) space, where the input image is N × N.
The result of a phase unwrapping using Goldstein’s algorithm is shown in Figure
2.8. The problems with this method are immediately apparent. The segments of
the image that are of interest are mostly still wrapped with the amplitudes being
incorrect by a large margin.
Figure 2.8: Goldstein’s algorithm on the two embryo sample
Quality Maps
Quality maps are based on the concept of a user-supplied or auto-generated array that
defines the goodness of each phase value. These can be used to guide the unwrapping
since corrupted phase and residues usually have low quality values.
Unwrapping using quality maps works by first taking as an input the phase array
and the quality map. The quality map is either user input or based on the variance of
phase derivatives or the maximum phase gradient. The unwrapping is then performed
in a similar fashion as the Goldstein algorithm’s BFS exploration, except that the
adjoin list is not explored in FIFO order but according to quality. This leaves the
low quality pixels to be unwrapped at the very end.
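The only difference from Goldstein's BFS is the order in which the adjoin list is drained. A minimal C sketch of that ordering (a linear scan standing in for a real priority queue; names are illustrative, not from the thesis code):

```c
#include <assert.h>
#include <stddef.h>

/* One adjoin-list entry: a pixel index and its quality value. */
typedef struct { int index; float quality; } AdjoinEntry;

/* Remove and return the highest-quality entry; swap-delete with the
   last element keeps the list compact. */
AdjoinEntry pop_best(AdjoinEntry *list, size_t *n) {
    size_t best = 0;
    for (size_t k = 1; k < *n; k++)
        if (list[k].quality > list[best].quality) best = k;
    AdjoinEntry e = list[best];
    list[best] = list[--*n];
    return e;
}
```

Each call returns the current highest-quality pixel, so low-quality pixels naturally sink to the end of the unwrap order.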
The results of the quality mapped unwrapping are shown in Figure 2.9. There
are significant failures noticeable in the center of the lower embryo as well as around
the edges of both.
Figure 2.9: Quality Mapped algorithm on the two embryo sample
Mask Cut Algorithm
The Quality Map method does not explicitly use all the information available such
as residues. A hybrid approach called the Mask Cut Algorithm also exists. It uses
both quality maps and residues to guide the placement of branch cuts. It operates
as follows:
Step 1. Identify residues: This is performed as described in Section 2.3.1.
Step 2. Create mask cuts: This operates on the lowest quality pixels in the
image, gradually growing outwards using a BFS exploration technique and
marking the low quality pixels as being part of a mask cut once a residue is
encountered. The mask cut continues growing until the charge is balanced.
Step 3. Thin the mask cuts: Mask cuts tend to be thicker than branch cuts and
need to be thinned before integration. This step clears the mask on all mask
pixels that do not lie next to a residue and can be safely removed without
changing mask connectivity.
Step 4. Integrate: This is performed as described in Section 2.3.1.
For our datasets, the mask cut algorithm performs poorly as can be seen in Figure 2.10. The diagonal flecking and inconsistent phase changes render this technique
unusable for the OQM microscope.
Flynn’s Algorithm
One method of phase-unwrapping is to segment the image along lines of discontinuity
into regions where each region has the same integer multiple of 2π associated with
it. This approach fails in the presence of high noise values or residues. Flynn’s
algorithm only segments regions along lines of discontinuity if the process of doing so
and adding the appropriate 2π multiple removes more discontinuities than it adds.
The algorithm works as follows:
Figure 2.10: Mask Cut algorithm on the two embryo sample
Step 1. Compute jump counts: Here horizontal and vertical jump counts are
computed. A jump count is the integer k in the 2πk multiple associated with
a region.
Step 2. Scan nodes: Go over the set of nodes, adding edges and removing loops.
Terminate when a pass makes no changes.
Step 3. Compute unwrapped solution: The wrap counts are added to the
input phase data to get the final unwrapped solution.
There are various other optimizations performed on the image such as the inte-
gration of quality data that are not described here. From Figure 2.11 it can be seen
that Flynn’s algorithm provides a high quality solution, the best amongst the path
following algorithms discussed.
Figure 2.11: Flynn’s algorithm on the two embryo sample
2.3.2 Minimum Norm Algorithms
This set of phase-unwrapping algorithms seeks to generate an unwrapped phase whose
local phase derivatives match the measured derivatives as closely as possible. This
comparison can be defined as the difference between the two, raised to some power
p.
The simplest case is the unweighted least squares method where p = 2. This
family of methods uses Fourier or DCT techniques to solve the least squares problem
but are vulnerable to noise. The pre-conditioned conjugate gradient (PCG) technique
overcomes this problem by using quality maps to zero-weight noisy regions so that
the unwrapped solution is not corrupted. There also exists a weighted multi-grid
algorithm that uses a combination of fine and coarse grids to converge on a solution.
Finally, the Minimum LP Norm algorithm solves the phase unwrapping problem for
p = 0. In this situation, the algorithm minimizes the number of discontinuities in
the unwrapped solution without concern for the magnitude of these discontinuities.
This value of p generally produces the best solution. This algorithm can be used
with or without user-supplied weights and can also generate its own data-dependent
weights. It iterates the PCG algorithm, which in turn iterates the DCT algorithm.
This results in the Minimum LP Norm algorithm having among the highest costs of
all the algorithms in terms of runtime and memory.
Preconditioned Conjugate Gradient
The Preconditioned Conjugate Gradient (PCG) algorithm iterates the unweighted
least squares algorithm in order to perform a weighted phase unwrap. This un-
weighted least squares technique minimizes the difference between the discrete par-
tial derivatives of the wrapped phase data and the discrete partial derivatives of the
unwrapped solution. The solution φi,j that minimizes
\varepsilon^2 = \sum_{i=0}^{M-2}\sum_{j=0}^{N-2}\left(\phi_{i+1,j}-\phi_{i,j}-\Delta^x_{i,j}\right)^2 + \sum_{i=0}^{M-2}\sum_{j=0}^{N-2}\left(\phi_{i,j+1}-\phi_{i,j}-\Delta^y_{i,j}\right)^2 \quad (2.1)
forms the final solution, where \Delta^x_{i,j} represents the phase difference in the x
direction. This solution can be reduced to the discretized Poisson equation given by:
(\phi_{i+1,j} - 2\phi_{i,j} + \phi_{i-1,j}) + (\phi_{i,j+1} - 2\phi_{i,j} + \phi_{i,j-1}) = \rho_{i,j}, \quad (2.2)
where
\rho_{i,j} = (\Delta^x_{i,j} - \Delta^x_{i-1,j}) + (\Delta^y_{i,j} - \Delta^y_{i,j-1}).
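Computing the right-hand side ρ from the wrapped phase differences is a pointwise stencil; a C sketch (illustrative names, not the thesis code, with out-of-range neighbours treated as zero):

```c
#include <assert.h>

/* rho_at computes one sample of the discretized Poisson right-hand
   side from the wrapped phase differences dx (x direction) and dy
   (y direction), each stored row-major as M x N arrays.  Out-of-range
   neighbours are taken as zero. */
double rho_at(const double *dx, const double *dy,
              int M, int N, int i, int j) {
    (void)M;  /* M kept for symmetry with the equations; unused here */
    double dxp = dx[i * N + j];
    double dxm = (i > 0) ? dx[(i - 1) * N + j] : 0.0;
    double dyp = dy[i * N + j];
    double dym = (j > 0) ? dy[i * N + (j - 1)] : 0.0;
    return (dxp - dxm) + (dyp - dym);
}
```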
This can be solved in the frequency domain by constructing reflections in the
x and y directions (in order to fulfill boundary condition requirements) and then
applying a two-dimensional Fourier transform to the input data. Alternatively, a
two-dimensional Discrete Cosine Transform (DCT) can be used. Applying the Fourier
transform to Equation 2.2, and noting that \Phi and P represent the transformed
versions of \phi and \rho, we get:
\Phi_{m,n} = \frac{P_{m,n}}{2\cos(\pi m/M) + 2\cos(\pi n/N) - 4}. \quad (2.3)
The PCG algorithm uses Conjugate Gradient (CG) to solve the discretized Poisson
equation. The CG technique has rapid and robust convergence and is guaranteed to
converge in N iterations for an N × N matrix (barring roundoff error). However, the
actual number of iterations depends on the condition of the original matrix. If the
original matrix is close to the identity matrix, the iteration converges rapidly. In order
to achieve this condition, a preconditioning step is applied that solves an approximate
problem, the unweighted least-squares phase unwrap. After this, the usual CG
steps are performed. The algorithm is shown in the pseudocode in Figure 2.12.
Compute residual r_k of weighted phase Laplacians
Initialize solution phi to zero
for (k = 0 to MaxNumberOfIterations-1)
    Solve P z_k = r_k using unweighted phase unwrapping to get z_k
    Use the CG method to update phi using z_k
    if solution lies within predefined convergence bounds
        exit loop
end
Figure 2.12: Preconditioned Conjugate Gradient Algorithm pseudo-code
As can be seen in Figure 2.13, the PCG algorithm produces a smooth, continuous
result, but the image has a gradually increasing magnitude from right to left. This
affects both the foreground as well as the background, rendering the technique of
limited use. However, the conjugate gradient method presented here will be used
later for the Minimum LP Norm algorithm.
Minimum LP Norm
The Minimum LP Norm is similar to the PCG method since it also aims to minimize
the difference in gradients between the measured and calculated phases. However,
PCG uses p = 2, the least-squares norm, whereas the Minimum LP Norm algorithm
uses p = 0. This means that the Minimum LP Norm algorithm minimizes the
number of points where the gradients of the measured data differ from those of the
calculated solution, whereas the PCG algorithm minimizes the square of the differences,
which spreads the error so that the measured gradients rarely match the solution exactly.
For the Minimum LP Norm algorithm, we are trying to solve Equation 2.4.
Figure 2.13: Preconditioned Conjugate gradient algorithm on the two embryo sample
(\phi_{i+1,j}-\phi_{i,j})U_{i,j} + (\phi_{i,j+1}-\phi_{i,j})V_{i,j} - (\phi_{i,j}-\phi_{i-1,j})U_{i-1,j} - (\phi_{i,j}-\phi_{i,j-1})V_{i,j-1} = c(i,j) \quad (2.4)

where U and V are data-dependent weights and c is the weighted phase Laplacian
given by

c(i,j) = \Delta^x_{i,j}U(i,j) - \Delta^x_{i-1,j}U(i-1,j) + \Delta^y_{i,j}V(i,j) - \Delta^y_{i,j-1}V(i,j-1).
This equation can be rewritten as a matrix equation, as in Equation 2.5.
Qφ = c, (2.5)
which is solvable by the PCG method discussed in Section 2.3.2. A pseudocode
description of the algorithm is given in Figure 2.14. The full implementation also has
options for applying user-input or dynamically generated quality maps to the data.
Initialize solution phi_0 to zero
for (k = 0 to maxIterations)
    Compute residual R
    if R has no residues, exit loop
    Compute data-dependent weights U and V
    Compute weighted phase Laplacian c
    Subtract c from the weighted phase Laplacian of the current solution
        (the left side of the Lp norm equation)
    Solve Q phi = c with PCG
end
if (no residues in residual)
    Unwrap using Goldstein's algorithm
else
    Apply post-processing congruency operation
end
Figure 2.14: Minimum LP Norm Algorithm pseudo-code
The results of the Minimum LP Norm Algorithm are displayed in Figure 2.15. As
can be seen, it produces the best quality images thus far, slightly better than Flynn’s
method. There are, however, several incorrect areas such as the cell boundaries in the
lower embryo. This is partly because the algorithm reached its maximum number
of iterations without eliminating all residues from the residual.
Multi-Grid
Multi-grid methods enable the rapid solution of PDEs on large grids. They usually
operate as fast as Fourier methods but have the advantage that they can handle
non-power-of-two array sizes. These algorithms, while theoretically operating as
Figure 2.15: The Minimum LP Norm algorithm on the two embryo sample
fast as or faster than the PCG algorithm, fail to produce meaningful results for the
data produced by the OQM modality as shown in Figure 2.16 and are not discussed
further.
2.3.3 Choosing The Right Algorithm
The primary criterion by which these algorithms were judged was the quality of
their unwraps over a wide array of benchmarks, mostly consisting of real data sets
but occasionally with artificially constructed situations that posed challenging unwraps
(such as imaging tiny glass beads or optical fiber). Overall, we noted that the
path following algorithms operated quickly, but often had isolated sections that were
Figure 2.16: The Multi-grid algorithm on the two embryo sample
unwrapped poorly. The minimum norm algorithms had smooth solutions, but often
had large errors as in PCG. The Minimum LP Norm algorithm produced the best
overall solution at the expense of the greatest computation time. Thus the Minimum
LP norm algorithm was chosen as the algorithm to accelerate for this research.
2.4 Bitwidth Analysis
Implementing an algorithm with full floating point accuracy is usually not feasible
on either Digital Signal Processors (DSPs) or on Field Programmable Gate Arrays
(FPGAs) due to either a lack of support on the former or the formidable size re-
quirements on the latter platform. Thus before implementing any new floating point
algorithm on these platforms, it is important both to verify that the data can be
converted to fixed point and still have the algorithm operate accurately and also to
discover the minimum bitwidth that can be used. Finding this minimum bitwidth
can result in large area savings in hardware. It has been noted that for Conjugate
Gradient (which is used in the Minimum LP norm phase unwrapping algorithm that
we use), precision issues directly affect the number of iterations and hence lower pre-
cision can actually increase time to convergence. The importance of precision is one
reason why large CG calculations are usually computed in double precision and hence
difficult to implement on FPGAs without using mixed floating point precisions [36].
It is less important for the GPU implementation since GPUs support floating point
natively in hardware, albeit currently in single precision.
We implemented C code that performs a fixed point, bit-accurate calculation of
the preconditioning step of the Minimum LP norm phase unwrap. This was possible
since the operations performed by the preconditioner, which include the Discrete
Cosine Transform (DCT) and some intermediate floating point operations, are all
scalable with a possible loss of precision. For example if f(x) represents the DCT
and Poisson calculation, and if f(x) = V then f(Cx) = CV . This allows for the fixed
point implementation to be performed by multiplying the single precision floating
point input data x : −1 < x < 1, by a scaling factor C = 2p where p represents
the number of bits to be shifted. After the scaling, a cast to an integer data type
truncates all data after the decimal point. After the calculation of f is performed,
the results are then cast to floating point and scaled down. The FFT used in the
DCT C code implementation was KISS FFT [21], a simple open-source package that
supports both fixed and floating point.
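The scaling scheme can be sketched in a few lines of C; the function names and the shift parameter p are illustrative, not from the thesis code:

```c
#include <assert.h>
#include <stdint.h>

/* to_fixed scales a sample x in (-1, 1) by C = 2^p and truncates it
   to an integer; to_float undoes the scaling after the computation.
   Because f(Cx) = Cf(x) for the DCT and Poisson steps, the scaled
   integer pipeline produces (up to truncation error) the scaled
   floating-point result. */
int32_t to_fixed(float x, int p)   { return (int32_t)(x * (float)(1 << p)); }
float   to_float(int64_t v, int p) { return (float)v / (float)(1 << p); }
```

The truncation toward zero in to_fixed is the precision loss that the bitwidth experiments below quantify.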
The quality of the images produced by the different bitwidths was analyzed by
visual quality, by the absence of still-wrapped sections and discontinuous jumps, and
by the difference between the fixed point and the full double-precision implementation. We
tested it over a large number of data sets of which one is shown in Figure 2.17.
Figure 2.17: Image produced by a double-precision unwrap
The double embryo image shown was just one of the benchmark images used to
determine the optimal bitwidth. However, this image presents a challenging unwrap
that does not converge completely before reaching the maximum number of iterations.
Hence it represents one of the worst-case real-world images.
The 27-bit unwrap shown in Figure 2.18 has small isolated areas of very low
phase. This stretches the overall magnitude range of the data and causes wild visual
variation between this and the double-precision version.
Figure 2.18: Using a bitwidth of 27
This variation disappears with the 28-bit version seen in Figure 2.19. It does not
have the isolated low phase regions and so presents an unwrap very similar to that of
the double-precision version. However, the only FFT core that we had access to was
a 24-bit fixed-point core provided by Xilinx [45]. Since we knew that twenty-four bits
would be insufficient, we made use of the block floating point functionality present
in the core to give our data the greater dynamic range we knew was necessary.
2.5 Related Work
In this section we present work related to our research. The work discussed
pertaining to FPGAs and to GPUs differs in topic. This is because the FPGA implementation
presented in this thesis performs the preconditioning step of the PCG algorithm,
which, while the most computationally intensive part, is only a small section of the
Figure 2.19: Using a bitwidth of 28
overall algorithm. This preconditioning step involves a DCT and some floating point
computation. The GPU implementation on the other hand implements the entire
conjugate gradient calculation as well as the Minimum LP norm algorithm’s inner
loop.
2.5.1 FPGAs
There exist a large number of image processing applications that have been mapped
to FPGAs. We implement a large 2D DCT transform of size 1024× 512 along with
some floating point calculations. No 2D DCT transforms of this size were discovered
in the literature. However, the popularity of the JPEG and MPEG standards has
resulted in a proliferation of smaller hardware implementations of the DCT targeted
towards specific sizes, most commonly the 8x8 2D implementation since this is what
is required by the standard. This is usually accomplished by multiplying an 8x8 block
from the image data by an 8x8 coefficient matrix, resulting in an output matrix that
contains the component frequencies. This is a fairly expensive operation, requiring
4096 additions and another 4096 multiplications. There have been many publications
detailing the implementation of such 2D matrix multiplications using distributed
arithmetic (DA) which reduces the number of multiplications required, but which still
results in relatively large, low latency hardware. Woods et al. [30] used a combination
of 1D DCTs, distributed arithmetic, and transpose buffers on a Xilinx XC6264 to
generate such a design that utilized 30 percent of total chip area. Bukhari et al.
[17] investigated the implementation of a modified Loeffler algorithm for 8x8 DCTs
on various FPGAs. Siu and Chan proposed a multiplierless VLSI implementation[5]
for an 11x11 2D transform and many other variations on small 2D DCTs exist. Our
implementation differs from these in that our transform size is 1024×512 while using
a relatively small proportion of the chip.
Larger 2D DCTs can be implemented using 1D DCTs by taking advantage of the
DCT’s separability property. This is accomplished by first taking the DCT of all the
rows and then of all the columns. However, there do not exist many implementations
of large 1D DCTs for reconfigurable systems. In [44], an 8 to 32 point core with a
maximum of 24-bit precision was implemented using distributed arithmetic for the
vector-coefficient matrix multiplication. A 32 point, 24 bit instance of this core has
a latency of only 32 cycles, but consumes 10588 LUTs which makes the approach
impractical for large designs. In contrast, our approach is much more compact and
therefore enables larger transform sizes at the cost of higher latency. Leong [19]
implements a variable radix, bit serial DCT using a systolic array but only describes
area requirements for designs up to N=32 which consume between 457 and 1363
adders and have a high worst case error. In comparison, our approach supports
much larger transform sizes with demonstrated designs of up to 1024 points.
The Spiral project uses a heuristic algorithm to explore the DFT design space with
performance feedback to generate a hardware-software DFT implementation, but the
comparison is with a software implementation on the embedded PowerPC which is
severely outclassed by a modern desktop processor. The project does however include
a customizable 1D DCT implemented in Verilog that is available from their website
and is the only available large 1D-DCT found[8].
Unlike the DCT, there exist many implementations of the Fast Fourier Transform
(FFT) on FPGAs, some of which date back to 1995, as in the case of Shirazi et al.
[33]. They implemented a 2-D FFT, complex multiply, and a 2-D IFFT on the Splash-2
computing engine using a non-standard 18-bit floating point representation. The
hardware they target is the Xilinx 4010, and for the purposes of that application
they use 34 FPGA chips to achieve adequate throughput. The nature of their
application does not require sending data back to the host. Our implementation on
the other hand uses just a single, more modern FPGA and a higher accuracy, semi-
floating point representation. Dillon implements a high performance floating point
2D FFT that would greatly improve our own results if integrated into our design [9].
Bouridane et al. [14] implement a high performance 2D FFT as well, but on images
half the size of ours and with similar performance.
Conjugate Gradient calculations have been implemented on FPGAs although in
a hybrid manner with strongly coupled host-accelerator interactions. This is because
of the lack of space on a single FPGA to implement the entire CG kernel at a high
enough precision to ensure convergence. Prasanna et al. [38] implemented a
double precision hybrid CG solver in this manner but saw modest gains only in the
cases where limited cache sizes on the host forced page faults. If the data was already
in cache, significant slowdowns were measured. Strzodka et al.[36] investigated the
effects of using a mixture of precisions to achieve near double-precision accuracy on
FPGAs, but also noted slowdowns in performance.
2.5.2 GPUs
With the introduction of unified shaders with DirectX 10 and above, the GPGPU
community has witnessed an explosion in the number of suitable applications acceler-
ated on these platforms. The papers discussed here reflect only those closely related
to the higher level conjugate gradient kernel or preconditioning of the matrix which
is what was implemented on the GPU for this research.
Karasev et al. [16] use GPUs to implement 2D phase unwrapping on NVIDIA
GPUs using CG [24] and achieve a 35x speedup. They implement the weighted least
squares algorithm, similar to the PCG and multigrid algorithms discussed previously,
and apply it to Interferometric Synthetic Aperture Radar (IFSAR) data. They chose
to use multigrid and Gauss-Seidel iterations to solve the minimization problem and
compare their results to C and Matlab implementations. However, multigrid tech-
niques do not work on our datasets as has been shown previously in Cary Smith’s
research [35] as well as in the experiments described in Section 2.3.2. Their algorithm
also requires a very high number of iterations to converge (on the order of tens of
thousands), a known result of using Gauss-Seidel. In comparison, the PCG and
Minimum LP norm algorithms require tens or hundreds of iterations. Thus their
total computation time is greater than ours.
Bolz et al. [15] implement sparse matrix conjugate gradient solvers and multi-grid
solvers on GPUs. They achieve only modest speedups over CPUs as they work with
the ATI 9700 and Geforce FX generation of hardware and OpenGL and are thus
limited to working within the constraints of the classical graphics pipeline rather
than the unified shader architecture of the DirectX 10+ compatible video cards.
The 2D DCT required by the preconditioning step uses a 2D FFT, a complex
multiply and some reordering. NVIDIA provides a high performance FFT library
called CUFFT [25] that has been benchmarked by HP and shown to provide a 3x
speedup for large transform sizes[10] as measured against a highly optimized software
implementation on a multicore HP server available as part of Intel’s MKL library [12].
The point at which a GPU implementation becomes feasible lies between the 512×512
and 1024×1024 matrix sizes, which see speedups of about 0.9 (a slowdown) and
3.0 respectively in real-world scenarios. Our input data set is 1024×512 and we are
running on a single-core desktop machine, so we should expect to see some
improvement.
2.6 Conclusions
In this chapter, details of the Keck fusion microscope and the two implementation
platforms, FPGAs and GPUs, were provided. Next, the results of several phase
unwrapping algorithms were analyzed, and the Minimum LP Norm algorithm was
settled upon as the one producing the best results. The kernel of this algorithm was
implemented in fixed point in software, and a minimum usable bitwidth of 28 bits for
the FPGA implementation was decided on, which due to IP constraints was modified to
24 bits and an exponent. Finally, related work on both FPGA and GPU platforms
was presented and discussed.
In the next chapter we will discuss the algorithm used for the DCT and the
implementation of the preconditioner and conjugate gradient calculations on GPUs
and FPGAs.
Chapter 3
Implementation
In the previous chapter we discussed the implementation platforms and the various
phase unwrapping algorithms available in the literature. We settled on the Minimum
LP norm algorithm which iterates the Preconditioned Conjugate Gradient (PCG)
algorithm. PCG consists of two steps: a preconditioning step that consists
of a DCT and some floating point calculations to solve the Poisson equation, and a
conjugate gradient calculation that consists of a variety of matrix operations.
This chapter presents details of the implementation of the preconditioner and CG
calculations on the FPGA and the GPU as well as the timing profiles that prompted
the implementation of those specific sections. It also describes the algorithms used
for the DCT and discusses the various data transfer modes utilized by the various
platforms.
3.1 Experimental Platforms And Timing Profile
This section describes the performance of the reference Minimum LP Norm algorithm
running on the two different General Purpose Processors (GPPs) used in this thesis.
They provide the platforms against which speedup is measured. Also discussed are
details of the Wildstar II Pro platform and the Annapolis API as well as further
details of the 8800 GTX board provided by NVIDIA.
3.1.1 Host Machine Descriptions And Timing Profiles
As discussed in Section 4.1, two different host machines are used for the FPGA
and GPU due to their different bus requirements. The reference code is profiled
on both machines since they both have drastically different performance numbers
and represent technologies four years apart. A breakdown of the two machines is
presented in Table 3.1.
                    Machine 1 (2004)        Machine 2 (2008)
Processor           Pentium IV Xeon         Core 2 Duo (Penryn)
L1 Data Cache       1x16 KB                 2x32 KB
L2 Cache            1 MB                    6 MB
Frequency           3 GHz                   3 GHz
Number of Cores     1 core                  2 cores
RAM                 1 GB DDR2               4 GB DDR2
Front Side Bus      4x200 MHz               4x333 MHz
Video Card          NVIDIA Quadro           NVIDIA NVS 290 and 8800 GTX
OS                  Windows XP Pro          Windows XP Pro x64
Accelerator         Wildstar II Pro PCI-X   NVIDIA 8800 GTX
Table 3.1: A comparison between the two platforms
The two machines differ greatly in most aspects except for peak operating fre-
quency. Execution time on both platforms for the reference software implementation
of the Minimum LP norm algorithm is also fairly different and is shown in Figure 3.1.
The data set that produced these timing numbers was the glass bead data. In this
test, the algorithm runs until it reaches the maximum number of iterations and then
times out. The main reason for the performance difference is the size of the caches
since with Machine 2 an entire data set can fit in cache. It should be noted that
the reference software implementation runs only on the General Purpose Processor
(GPP) and utilizes only a single core. The basic datatype used is the single precision
float, although some key values are stored in double precision.
Figure 3.1: A comparison between the performance of the two machines
Disk IO takes approximately 1.3 seconds on Machine 1 and 700ms on Machine 2,
thus representing a fair amount of the overhead (computation not involving the PCG).
As is readily noticeable, computation time is dominated by the preconditioning step
of the PCG kernel (approximately 75 percent of the total time in both cases), with
the remainder mostly occupied in performing other sections of the conjugate gradient
calculations. The entire Preconditioned Conjugate Gradient algorithm (which forms
the core of the Minimum LP algorithm) takes up 94 percent of the total computation
time. Thus the preconditioner and conjugate gradient portions of the algorithm are
prime candidates for implementation on an accelerator.
3.2 Algorithms
This section describes the algorithms used in the FPGA and GPU implementations.
They are modifications of those used in the reference software implementation of the
phase unwrapping algorithm and are optimized for efficient reuse of existing cores.
We start by describing the DCT algorithm since it takes up most of the computation
time in the preconditioner.
3.2.1 The Discrete Cosine Transform - An Overview
The Discrete Cosine Transform (DCT) is used in a wide variety of applications such
as image and audio processing due to its compaction of energy into the lower fre-
quencies. This property is exploited to produce efficient frequency-based compression
methods in various image and audio codecs such as JPEG and MPEG. However, the
DCT is also used in other applications that require larger sized transforms such as
those using the Preconditioned Conjugate Gradient (PCG) technique in applications
like adaptive filtering [11] and phase unwrapping [1]. In this section we discuss an al-
gorithm, first developed by Makhoul [20], and an implementation of it that utilizes a
Fast Fourier Transform (FFT) core to compute a DCT without significantly increas-
ing overall latency as compared to just a FFT core. The advantage of this approach
is the ready availability of a large number of FFT cores in both fixed-point [45] and
floating-point [2] formats which can be easily dropped in with minimal modifications
to the overall design.
3.2.2 Algorithm Details for the 1D DCT
The general algorithm presented here was first discussed by Makhoul [20]. It is an
indirect algorithm for computing DCTs using FFTs and describes the method we
used to implement the DCT on an FPGA. The steps are presented along with their
correspondence to the computation done in hardware.
Given an input signal x(n), the DFT of that signal is given by:
X(k) = \sum_{n=0}^{N-1} x(n)\,e^{-j2\pi nk/N}, \qquad k = 0 \ldots N-1
The cosine transform can be viewed as the real part of X(k), which is equivalent
to saying that it is the Fourier transform of the even extension of x(n), given that
x(n) is causal (i.e. x(n) = 0 for n < 0). This is the inspiration for the usual technique
for implementing the DCT, which is by mirroring the set of real inputs and taking
the real DFT of the resulting sequence. This mirroring can be performed in any of
four ways: around the n=-0.5 and n=N-0.5 sample points, around n=0 and n=N,
around n=-0.5 and n=N, and finally around n=0 and n=N-0.5. All of these methods
result in slightly different DCTs. The most commonly used even extension is the one
depicted in Figure 3.2 and this will be the focus of the algorithm and implementation
presented.
Figure 3.2: Even extension around n=-0.5 and n=N-0.5
This category of DCT, obtained by taking the DFT of a 2N-point even extension,
is known as the DCT Type II, or DCT-II, and is defined as:
X(k) = 2\sum_{n=0}^{N-1} x(n)\cos\!\left(\frac{(2n+1)\pi k}{2N}\right), \qquad k = 0 \ldots N-1, \quad (3.1)
with the even extension defined as:
x'(n) = \begin{cases} x(n) & n = 0 \ldots N-1 \\ x(2N-n-1) & n = N \ldots 2N-1 \end{cases}
The DCT-II can be shown to be solvable via DFT by noting that:
X(k) = \sum_{n=0}^{2N-1} x'(n)\,e^{-j\pi nk/N}
     = \sum_{n=0}^{N-1} x(n)\,e^{-j\pi nk/N} + \sum_{n=N}^{2N-1} x(2N-n-1)\,e^{-j\pi nk/N}
     = e^{j\pi k/2N} \sum_{n=0}^{N-1} x(n)\left[e^{-j\pi nk/N}e^{-j\pi k/2N} + e^{j\pi nk/N}e^{j\pi k/2N}\right]
     = 2\,e^{j\pi k/2N} \sum_{n=0}^{N-1} x(n)\cos\!\left(\frac{(2n+1)\pi k}{2N}\right).
This is identical to the definition of the DCT in Equation 3.1 except for a
multiplicative factor of e^{j\pi k/2N}. A similar method can be used to write an IDCT in terms of
a length-2N complex IDFT. Full details can be found elsewhere [20].
The performance of the DCT in terms of latency and area can be further improved
such that an N-point real DFT/IDFT may be used. The method for
accomplishing this is outlined below.
A sequence v(n) can be constructed from x(n) such that it follows the restriction:
v(n) = \begin{cases} x(2n) & n = 0 \ldots \frac{N-1}{2} \\ x(2N-2n-1) & n = \frac{N+1}{2} \ldots N-1 \end{cases} \quad (3.2)
When the DFT of v(n) is computed and the result is multiplied by 2e^{-j\pi k/2N}, the
resulting sequence can be written as:
X(k) = 2\sum_{n=0}^{N-1} v(n)\cos\!\left(\frac{(4n+1)\pi k}{2N}\right), \qquad k = 0 \ldots N-1,
which is an alternative version of the DCT based on v(n).
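The reordering of Equation 3.2 is a simple permutation: even-indexed samples go to the front in order, odd-indexed samples to the back in reverse. A C sketch (the function name is illustrative):

```c
#include <assert.h>

/* dct_shuffle builds v(n) from x(n) per Equation 3.2: the even-indexed
   samples of x come first, then the odd-indexed samples in reverse. */
void dct_shuffle(const double *x, double *v, int N) {
    for (int n = 0; n <= (N - 1) / 2; n++)
        v[n] = x[2 * n];
    for (int n = (N + 1) / 2; n < N; n++)
        v[n] = x[2 * N - 2 * n - 1];
}
```

Taking the length-N DFT of v(n) and applying the twiddle factor then yields the DCT, as in the equation above.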
Again, for the IDCT, the real sequence X(k) can be rearranged to form a complex
Hermitian symmetric sequence V(k), where Hermitian symmetry is defined as
X(N-k) = X^*(k), and the sequence itself as:

V(k) = \frac{1}{2}\,e^{j\pi k/2N}\left[X(k) - jX(N-k)\right], \qquad k = 0 \ldots N-1.
An IDFT on V (k) generates the v(n) sequence described earlier, which can then be
rearranged to form x(n). This is the method used in the implementation discussed
in the later sections of this thesis. However, for both the size N DCT and IDCT
it should be noted that the input sequences are either entirely real or Hermitian
symmetric and can thus be computed using FFTs with a point size of N/2 [29, 20].
This can be done by setting alternating elements of v(n) to the real and imaginary
parts of a new sequence t. That is,
t(n) = v(2n) + jv(2n+1), \qquad n = 0 \ldots \frac{N}{2} - 1.
The DFT of this sequence can then be computed and the original V (k) extracted
by using:
V(k) = \frac{1}{2}\left[ T(k) + T^*\!\left(\tfrac{N}{2} - k\right) \right] - \frac{j}{2}\, e^{-j2\pi k/N} \left[ T(k) - T^*\!\left(\tfrac{N}{2} - k\right) \right].
This gives the original V (k) which can subsequently be used to generate X(k).
A similar method can be applied to the real IFFT to realize similar savings.
The implementation described in this thesis does not use the last FFT optimization
(which reduces the required transform size from N to N/2) because it involves an extra
multiplication step that would reduce the accuracy of the fixed-point results. For a
floating-point or low-precision implementation, however, this would be a feasible
optimization.
3.2.3 Algorithm Details for the 2D DCT
In addition to the 1D case Makhoul presented in [20], he also discussed a method of
performing 2D DCTs using 2D FFTs. This is the technique that is utilized for the
GPU computation since decomposing the matrix into 1D arrays would have a large
data transfer overhead and 2D FFT cores are easily available from NVIDIA [25].
Similar to the 1D DCT transform, the 2D DCT can be broken down into three
steps. First, a shuffle rearranges the matrix according to the equation below. Note that
x is the input matrix, v is the shuffled matrix, and N1 and N2 denote the dimensions
of the matrix.
v(n_1, n_2) = \begin{cases}
x(2n_1,\, 2n_2) & 0 \le n_1 \le \frac{N_1-1}{2},\; 0 \le n_2 \le \frac{N_2-1}{2} \\
x(2N_1-2n_1-1,\, 2n_2) & \frac{N_1+1}{2} \le n_1 \le N_1-1,\; 0 \le n_2 \le \frac{N_2-1}{2} \\
x(2n_1,\, 2N_2-2n_2-1) & 0 \le n_1 \le \frac{N_1-1}{2},\; \frac{N_2+1}{2} \le n_2 \le N_2-1 \\
x(2N_1-2n_1-1,\, 2N_2-2n_2-1) & \frac{N_1+1}{2} \le n_1 \le N_1-1,\; \frac{N_2+1}{2} \le n_2 \le N_2-1.
\end{cases}
The second step involves taking an N_1 \times N_2 2D FFT of v(n_1, n_2), thereby producing
V(k_1, k_2). This is equivalent to the 2D DCT after performing the computation given in
Equation 3.3. Note that the 2D DCT function is given by C(k_1, k_2).
C(k_1, k_2) = 2\,\mathrm{Re}\!\left( W_{4N_2}^{k_2} \left( W_{4N_1}^{k_1} V(k_1, k_2) + W_{4N_1}^{-k_1} V(N_1-k_1, k_2) \right) \right). \qquad (3.3)
A similar procedure applies to the 2D IDCT as detailed in [20].
3.2.4 Algorithm Details for the Conjugate Gradient
This section describes what happens within each iteration of the conjugate
gradient, which is implemented on the GPU. Further details can be found in [32] and
[1]. Pseudocode for the Preconditioned Conjugate Gradient (PCG) is presented in Figure
3.3.
Details of the preconditioning step of the PCG algorithm by means of an un-
weighted algorithm have already been given in Section 2.3.2. The remaining steps
are fairly straightforward matrix operations that adapt well to the highly parallel
nature of a GPU.
for k = 0 to MaxIterations−1
    Apply preconditioner to get z_k
    if k = 0 then
        p_1 = z_0
    else
        β_{k+1} = (r_k^T z_k) / (r_{k−1}^T z_{k−1})
        p_{k+1} = z_k + β_{k+1} p_k
    α_{k+1} = (r_k^T z_k) / (p_{k+1}^T Q p_{k+1})
    φ_{k+1} = φ_k + α_{k+1} p_{k+1}
    r_{k+1} = r_k − α_{k+1} Q p_{k+1}
    if norm(r_{k+1}) < ε norm(r_0) then exit loop
end loop

Figure 3.3: PCG - Detailed pseudo-code [1]
Not described in Figure 3.3 is the data centering operation which subtracts the
average from a matrix. We implemented this on the GPU as well. Calculating
the sum requires an accumulation, which can be tricky to parallelize well due to
dependencies.
3.3 The FPGA Implementation of the Preconditioner

Now that the algorithms for the DCT within the preconditioner and the conjugate
gradient have been presented, in this section we detail the implementation of the 2D
DCT and IDCT on the FPGA, along with the floating point calculations necessary for
the Poisson equation calculation. First, the initial 1D implementation is described,
then its extension to 2D. Finally, the method and components used to solve the
floating point equation are described.
We only implement the preconditioner on the FPGA since the area requirements
do not allow for the implementation of a conjugate gradient solver.
3.3.1 The One Dimensional DCT Transform
The description of the algorithm for both the forward and inverse DCT in Section
3.2.2 lends itself to a clearly defined component breakdown in terms of hardware.
For the DCT, the first component creates v(n) by reordering the input sequence
and writing it to memory. The second component is the FFT that transforms the
shuffled input data into the frequency domain. The last component multiplies the
output V(k) by 2e^{-j\pi k/2N} and extracts the desired output values from the complex FFT
output. Roughly the same components are required for the inverse DCT but in
reverse order. First, a multiplication of the re-arranged sequence Y(k) = X(k) − jX(N − k)
by (1/2)e^{j\pi k/2N} is performed. Then the data is passed through an
inverse FFT of size N, followed by the mapping of v(n) to x(n). This organization of
the components is depicted in Figure 3.4 and Figure 3.5 for the forward and inverse
transforms respectively.
Shuffle
As input data is sent sequentially into the DCT core, the first stage of processing
that occurs is the generation of the v(n) sequence. This occurs within the shuffle
component. The shuffle has a latency of one clock cycle and calculates output indices
based on the input index according to Equation 3.2. Since the shuffle component only
affects index values, all addition and subtraction performed within it is of bitwidth
log2 N. Based on these output indices, the sample value is written to block RAM in
shuffled order as shown in Figure 3.6. This step of writing to block RAM is necessary
since the FFT component takes in input in sequential order but the shuffle produces
output non-sequentially.

Figure 3.4: Components and dataflow for the forward DCT transform
For an inverse DCT shuffle, the FFT output data is re-arranged in the opposite
direction, forming x(n) from v(n). This is also written to block RAM before trans-
mitting back to the host since the data will not be generated in sequential order.
Fast Fourier Transform
The complex FFT used was provided by Xilinx LogiCORE and generated using Coregen [45].
It allows for a range of options, including a parameterized bit-width of 8 to
24 bits for both input and phase factors, the use of either block RAM or distributed
RAM, the choice of algorithm and rounding used, and the ability to set the output
ordering.

Figure 3.5: Components and dataflow for the inverse DCT transform
Because the FFT was used to implement large DCTs, it was necessary to set the
bitwidth to a large size to maximize precision. To this end a 24 bit signed input was
used along with a block floating-point exponent for each 1D transform completed.
This exponent field reduces the need to increase output bitwidth after each FFT.
Block RAM, a radix-4 block transform and bit reversed ordering were also selected.
Since the algorithm optimization for computing a real or Hermitian symmetric
FFT using a transform of length N/2 (as mentioned in the previous section) wasn’t
used due to precision issues, the imaginary input for the FFT was tied to zero. In
addition, the FFT core was set up to support both forward and inverse transforms
simultaneously as well as to have run-time configurable transform length.

Figure 3.6: Re-ordering pattern in a forward shuffle
Rebuild rotate
The rebuild rotate component implements the multiplication by 2e^{-j\pi k/2N} for the
forward transform and (1/2)e^{j\pi k/2N} for the inverse. These complex exponentials can be
converted to a format consisting of sines and cosines by using Euler's formula. For
example, the forward factor is the equivalent of 2(cos(−πk/2N) + j sin(−πk/2N)).
The Coregen Sine Cosine Lookup Table 5.0 component used has a mapping between
the input integer angle T and the calculated θ of θ = T · 2π / 2^{T\_WIDTH}. Thus, for
θ = −πk/2N and noting that 2^{T\_WIDTH} = N, the input angle works out to be −k/4
and k/4 for the forward and inverse transforms respectively. The sine and cosine of
the index k are generated and then multiplied by the results of the FFT using a complex
multiply with a latency of six cycles. This generates a 48 bit output, of which only the first
24 bits are retained.
The overall dataflow of this component is depicted in Figure 3.7. Note that for
the forward DCT transform only the real output of the complex multiplication is
used. The full functionality of the complex multiplier is retained however, since the
inverse transform requires it for the rebuild rotate as shown in Figure 3.5.
Figure 3.7: The rebuild component - Forward Transform
Dynamic Scaling
The hardware implementation was required to be as close as possible to a floating
point software implementation as detailed in the bitwidth analysis section. In order
to achieve this level of accuracy with a fixed point FFT with block floating point,
it was necessary to scale the input data on-the-fly to maximize available dynamic
range. The expanded dataflow diagram in Figure 3.8 shows the components used in
this procedure.
Figure 3.8: The 1D transform including dynamic scaling
The max tracker component receives incoming streaming floating point data
from SRAM and records the maximum exponent of the 1D frame. It does this by
using a comparator to see if the internally registered data is less than the incoming
value. If it is, the internal register is overwritten with the new value. The final
output of this component is what the entire data frame must be shifted by. This
output is calculated as 23− (MAX−126). The value 23 is used since the fixed point
FFT uses 24 bit data and the float to fixed point conversion will round numbers less
than one to zero. The 126 is to compensate for the exponent bias in IEEE compliant
floating point representation. Converting the register MAX into two’s complement
and simplifying, the calculation is performed as not(MAX) + 150.
The scale component takes in data from BRAM, and adjusts the floating point
exponent field according to the MAX value calculated above. Similarly, the rescale
component adjusts the data frame back to its original range by subtracting the MAX
value. The rescale component also adds the block exponent produced by the FFT
as well as a constant multiplicative power of two introduced by the algorithm. This
can be summarized as:
EXP_output = EXP_input + EXP_block − EXP_scale + EXP_constant,

where EXP_constant is 4 or 3 for a transform length of 1024 or 512 respectively.
Data Type Conversion
The software version of the preconditioning kernel deals with all its data as
single-precision 32-bit floating point values. However, given the limitation of having
only a fixed point 24-bit FFT available, some form of conversion between the two formats
is necessary.
The RCL VFLOAT library [28] contains parameterizable float to fixed point and
fixed to floating point units capable of performing the necessary conversions. Data
coming out of the scale component is streamed into the float to fixed point compo-
nent and then transferred to the FFT. The output of the FFT is converted back to
floating point, scaled and then stored. Thus the data-type conversion encapsulates
the core DCT logic, which is further encapsulated by the dynamic scaling logic.
3.3.2 The Two Dimensional DCT On The FPGA
A 2D DCT is required by the preconditioner since the input matrices are 2D arrays.
In this section we discuss how the 1D DCT presented in Section 3.3.1 is extended to
2D.
The DCT, similar to the FFT, is separable. This means that a two-dimensional
DCT can be constructed by performing the 1D DCT for each of the rows followed by
a 1D DCT of the columns of the resulting matrix. This technique is called the row-
column decomposition method.
The key components to extending the one-dimensional DCT discussed earlier into
two dimensions are exploiting the onboard SRAM banks to store entire images at
a time, and calculating the transpose of the matrix after each DCT iteration. This
eliminates much of the data transfer bottleneck involved in performing a 2D DCT
by transferring 1D DCT input data sets to the board sequentially. Now we transfer
a full 2D phase data set at once. A high level diagram of the components involved
as well as an approximation of the data flow is given in Figure 3.9.
Coarse Grained Controller
A high level Finite State Machine (FSM) based controller was implemented that co-
ordinates the action of other controllers and components in the design. This method
of implementation was important for debugging purposes as any high level stage
could be run independently by changing the sequence of execution in the FSM. The
flowchart in Figure 3.10 depicts the functioning of the state machine.

Figure 3.9: A High Level Diagram of the Preconditioning Kernel
On startup, the controller defaults to the IDLE state after which it resets itself
so that all registers are in a known state. Then, upon receiving a signal indicating
the start of input, it allows data to be written from the LAD bus via DMA or PIO
to SRAM A. Additional control signals are also sent via PIO. After the input stage,
execution can proceed along the full execution path or along special debug paths
depicted by the dotted line in Figure 3.10. This is set by the control signals sent by
the host on startup.
Normal execution proceeds as follows: First data is read from SRAM A and each
row is transformed and written to SRAM B at the transposed address in a streaming
fashion with signals setting transform direction to forward and orientation to row.
Next the transformation component is reset and the orientation changed to column.
Data is now streamed from SRAM B through the transform component, through the
floating point Poisson calculator and back into SRAM A at the transposed address. A
similar flow is used to inverse transform the data, but without the Poisson calculation.

Figure 3.10: The FSM controlling high level data-flow
The SRAM Controllers
Two separate SRAM controllers were implemented that independently control the
two SRAM banks used in this design. The behaviour of SRAM bank A is given in
Figure 3.11 with SRAM B following a similar format.
The SRAM A controller starts up in the IDLE state. Once the DMA transaction
initializes, it switches over to the PREWRITE state (which sets up the write signals)
and then to the WRITE2 state. It remains in WRITE2 until the entire data set has
been copied to RAM in data packed format. It then cycles through DONEWRITE
and NOP (inserting NOPs between read and write cycles is necessary for high clock
frequency transfers) and finally back to the IDLE state and PREREAD. The READ
state increments addresses while waiting for data on the SRAM to FPGA bus to
become valid. This usually takes nine cycles. Once data is valid it switches over to
READ2 and starts applying the DCT to the rows and writing to SRAM B (handled
independently by the SRAM B controller).

Figure 3.11: The SRAM A FSM
SRAM A is accessed again after the columns and floating point computation are
completed and need to be written back. At this point the controller transitions to
the WRITE1 state and writes the transposed results back to memory. Note that
neither SRAM A nor SRAM B take into account the direction of the transform as
the data flow through SRAM does not depend on transform direction. Direction only
influences the DCT core and the enable signal on the floating point calculation core.
SRAM B operates in a very similar manner to SRAM A except that it only
interfaces with the DCT core and not with the DMA controller. Thus its design is
simpler with fewer states.
Calculating The Transpose
The purpose of the transpose component is to calculate the write address of data
coming out of the DCT component. These addresses should flip the matrix along
the diagonal. This can be accomplished by switching the row and column indices,
or rather, since the SRAM memory is linearly addressed, by multiplying the column
index by the length of a row and adding the row index. In equation form:
write_addr = (col_addr / 2) × row_length + row_addr,

where row_length = 512 or 1024 depending on orientation.
The column address is divided by 2 by dropping the last bit. This truncation is
necessary for the data-packing of two output values into each SRAM word to occur.
Additionally, since row length is a power of two, the multiplication and addition is
accomplished by appending the row index to the column index. This arrangement
requires minimal resources and is accomplished within a single cycle.
3.3.3 Division And Scaling
In the preconditioner, there is a floating point computation of the Poisson equation
that occurs between the forward 2D DCT and the inverse 2D DCT. The section of
code that performs the division and scaling that characterizes the Poisson equation
in the frequency domain is given in Figure 3.12:
for (j = 0; j < 1024; j++)
    for (i = 0; i < 512; i++)
    {
        if (i == 0 && j == 0)
            array[0] = 0;
        else
            array[j*512+i] = array[j*512+i] / (4 - 2*cos(i*pi/511)
                                                 - 2*cos(j*pi/1023));
    }
Figure 3.12: Poisson Equation Calculation in the Frequency Domain
This segment of code scales the image by a factor:
4 − 2 cos(i·π/511) − 2 cos(j·π/1023).
There are two efficient ways to perform this calculation. The first is precomputing
the factors for the entire 1024 by 512 image. This is too large to load into BRAM
and so must be stored in SRAM. It will also have to be loaded into memory at
startup, although this added latency can be amortized over multiple transforms, as
long as the FPGA accelerator board is not reset. This requires added complexity
to the SRAM memory design, but does not require the two floating point add units
which are needed for the second method described next.
The second approach is to precompute only the 2 cos(j·π/1023) and the 2 cos(i·π/511)
terms and store them into BRAM, since they occupy a relatively small amount of
space. These initial values can be integrated into the FPGA bitstream and thus
do not need to be loaded onto the board after initialization. As mentioned in the
previous paragraph, the drawback to this is the addition of two floating point adders.
This method was chosen because of the relatively low area requirements as well as
the lower complexity of the required controller.
Both approaches require a floating point divide and the second requires floating
point addition. This is provided by the Xilinx floating point operator[46] which
supports both functions. Two add units and one divide unit were instantiated and
connected as detailed in Figure 3.13.
Figure 3.13: Implementation of the floating point divide and scale logic
3.4 GPU Implementation
The GPU implementation was developed for Machine 2 (Machine 2 is described in
Section 3.1.1). It uses a combination of NVIDIA supplied libraries and some custom
kernels to implement the preconditioner and the conjugate gradient calculation on
the GPU, along with kernels from the minimum LP norm calculation. The entire set of LP
norm and PCG algorithm kernels was implemented on the GPU, as this eliminates
much of the data transfer between successive iterations compared to implementing
only the preconditioner.
3.4.1 Preconditioner
The preconditioner uses a 2D DCT/IDCT and some floating point computation in
order to transform the input matrix into a form that converges rapidly when the
conjugate gradient method is used to solve the equations presented in Section 2.3.2.
The algorithm used for the DCT is described in Section 3.2.3. It focuses on the
reuse of an existing 2D FFT, in our case, the highly optimized CUFFT provided
by NVIDIA [25]. CUFFT provides a complex Fourier transform that leverages the
floating point capabilities and the parallelism available to GPUs to rapidly compute
1D, 2D or 3D transforms. Like the popular FFTW library [23] it uses a plan based
approach to setting up and executing FFTs. However, it does not possess the same
degree of flexibility as FFTW, such as supporting real-to-real transforms or DCTs.
Hence it was necessary to implement kernels that performed the two-dimensional
shuffle and complex multiplication necessary to convert between the FFT and DCT.
The FFT calculation is the most time consuming part of the DCT computation,
followed by the shuffle. This shuffle reorders the input matrix in four different ways
depending on the location of the individual data point. This procedure is not compute
bound and is limited by the performance of the memory bus and the efficiency of
scatter/gather operations.
The complex multiply step represents a straightforward kernel that is easily par-
allelized since each matrix value can be operated on independently. This was im-
plemented in the standard CUDA way of thread-per-pixel which involved assigning a
thread to each matrix data point.
3.4.2 Conjugate Gradient
The steps in the Conjugate Gradient calculation were presented in detail in Section
3.2.4. The actual implementation has a few optimizations that trade off memory for
computation, but mostly follows the pseudocode faithfully.
There are several operations that show up repeatedly, such as pointwise matrix
multiplication/addition and matrix accumulation. The pointwise functions parallelize
extremely well since there exist no dependencies between data points. To implement
these functions, a similar method to the complex multiplication discussed in Section
3.4.1 was used.
The accumulation kernel was somewhat more complicated since there are dependencies
inherent in accumulation, and so it cannot be parallelized to the same
degree. In stream processing terminology, this operation is called reduction, since
the number of threads goes from N² for an N×N matrix down to 1. This operation was
used frequently, so it was necessary to optimize it carefully. Techniques described
elsewhere [22] were used to implement conflict-free sequential addressing, maximal
thread utilization and to completely unroll loops.
The Qp_{k+1} calculation, as discussed in Section 2.3.2, where the matrix Q contains
the weights, was implemented using the thread-per-pixel method,
since each data point can be operated on independently. However, this kernel takes
a performance hit due to the integer calculation of indices to ensure that they fulfil
boundary conditions. This non-regularity of the indexing prevents some optimization
of the code, but in practice the performance is not affected significantly since this
kernel does not consume as much of the total processing time as accumulation or the
FFT.
3.5 Data Transfer
In many applications that are not arithmetically intensive, data transfer times to the
accelerator dominate the total computation time. For this reason it is necessary to
discuss the impact of data transfer as well as to characterize the latencies involved
for the data sets in question.
The machine on which the Annapolis Wildstar II Pro is installed utilizes a 100
MHz PCI-X bus to communicate with the accelerator board. This is a parallel bus
which has a peak throughput of 6.4 Gbits per second for a 64-bit bus.
The machine on which the GPU is installed uses a PCI-E x16 bus to transfer data
which supports a throughput of 32 Gbits per second or 4 Gbytes per second, which
is around five times faster than the PCI-X version.
3.5.1 Programmed IO
The Programmed IO mode (PIO) only exists for the FPGA implementation. The
PIO method of data transfer is provided by the Annapolis LAD bus component. The
LAD bus provides an abstraction for dealing with the bus to the PCI controller chip
and removes some of the timing related control. DMA is also implemented using the
LAD bus abstraction.
The LAD bus allows for an address range to map directly to a register file on
the FPGA. This is typically used for transferring control data although it can also
be used to transfer information in small chunks that are typically equal to or less
than the size of the register file. The ease of implementation, however, led to this
being the first method of data transfer from the host to the board. The drawback of
this method is that it requires a large amount of chip area for the register file and
the addressing logic, as well as having a high transfer latency once hand-shaking is
enabled. The area overhead for a bank of 64 32-bit words can amount to over ten
percent of the entire chip area on a Virtex II Pro 70.
In our final implementation, PIO is still utilized to transfer control and debug
information. However, all data transfer occurs through DMA.
3.5.2 Direct Memory Access (DMA)
Direct Memory Access (DMA) is a method by which data can be transferred using a
separate DMA controller, thus requiring no intervention by the main processor apart
from specifying the size and location of data in memory. It allows for large block
transfers of data while freeing up the processor for other tasks that can be run in
parallel.
DMA transfers were used in both the GPU and FPGA implementations, although
across different buses. The two implementations are described below.
DMA on the FPGA
Many of the lower level details of working with the PCI bus are abstracted away
through the use of the LAD bus and the PCI controller chip. However, there still
exist control sequences for initialization and bus read/write that require some im-
plementation effort. For the purposes of this application, we implemented a generic
reusable core that operates using both ICLK and PCLK and handles all cross do-
main data transfer transparently. It also supplies debug information through the
core-to-dma interface that the DMA controller gets using PIO.
The interface works as follows: The user specifies a source and destination memory
address in the host C code, along with the data transfer size. The FPGA receives
this data and stores it during a DMA initialize period and then gets ready to receive.
As data comes streaming in, it is buffered in a dual ported asynchronous BlockRAM
(to handle the cross-domain clocking issue) and the data avail line goes high. After
this, data is output to the core phase unwrapping design.
Writes back to the host are handled similarly. The wr en line is set high and
then data is streamed to the DMA controller. It crosses through an asynchronous
BlockRAM and is output to the host. Once the transaction is completed, an interrupt
is set to signal to the host that new data is available in memory.
The PCI-X bus is capable of 800 MB/s, however the FPGA chip on the board
has a 32-bit bus to the PCI-X controller chip (since the 64-bit PCI-X bus is split
between the two FPGAs on the Wildstar II Pro) that operates at the same frequency
as the PE clock (although phase shifted). This results in data transfer speed in our
application being dominated by the PE consumption rate.
DMA on the GPU
DMA is the only method available for block data transfers from host memory to the
onboard GDDR3 memory on the 8800GTX GPU board. It is used to transfer the
eight floating point data arrays and one character array. This corresponds to a total
of 16.5 MB of data. On our PCI-E x16 bus this takes 4.125 ms for a one way transfer.
There are two methods available for implementing DMA transfers using CUDA.
The first stalls the processor while the transfer occurs, but does not require the use of
pinned (or reserved) memory. This is useful in memory intensive applications. The
second does not stall the processor, but requires that memory be set aside exclusively
for DMA. We implemented the second method since the phase-unwrapping uses a
small amount of memory relative to the memory available on the system.
3.6 Conclusion
In this chapter, the algorithms and their implementation in FPGAs and GPUs were
described. The various methods of data transfer and some of the choices regarding
implementation were also discussed.
In the next chapter, we will present the results of our implementation and directly
compare and contrast the two platforms in terms of our chosen metrics.
Chapter 4
Results
Previous chapters presented the background and implementation of the phase un-
wrapping algorithm on two separate platforms, FPGAs and GPUs. This chapter
presents the results of the experiments performed on those two platforms. We start
by describing the experimental setup and how we verified the results of both imple-
mentations. Next we present the benchmark suite that we used for testing and then
continue on to present the results. We look at three common metrics: performance,
performance per dollar and power consumption.
4.1 Experimental Setup
The platform for which the FPGA implementation was designed is the Annapolis
WildStar II Pro [3]. Synthesis was performed using Synplicity Pro 8.8 with pipelin-
ing and resource sharing enabled. Place and route was performed using the Xilinx
Foundation tools 7.1i. Version 1.1 of the CUDA SDK was used for GPU development.
CHAPTER 4. RESULTS 84
4.1.1 Verification
It was important to verify both the FPGA and GPU versions, as both implementations
use lower precision than the reference software implementation (whose
parameters are described in Section 3.1.1). The reference implementation uses mixed
single and double precision data types for different parts of the conjugate gradient
calculation with the preconditioner performed entirely in single precision. As men-
tioned previously, the FPGA implementation uses a mixed fixed and floating point
implementation for the preconditioner whereas the GPU version implements the full
LP norm calculation in single precision. It has also been documented that the Con-
jugate Gradient method is highly sensitive to precision [36]. Thus it was important
to verify that our implementation provided sufficiently accurate results.
We used two criteria to decide upon the accuracy of our solution. First we looked
at the results produced and compared them to the original implementation visually.
The two accelerated results shown in Figure 4.1 were almost identical with only minor
variation. Figure 4.1 depicts the original software unwrap, the GPU unwrapped
version and the FPGA unwraps. In Figure 4.2 we show the difference between the
accelerated versions and the reference implementation. Both of these show only
minimal variation. The image used for verification is the glass bead sample test that
produces over fifteen thousand residues.
The second metric that we used, as a rough guide as to the quality of the unwrap
while the unwrapping process was underway, was the number of residues eliminated
after each stage of the PCG calculation. In this case the FPGA and software im-
plementations were essentially identical in terms of the initial number of residues
detected due to only the preconditioner being implemented in hardware. However,
the lack of double precision elements in the CG part of the GPU implementation
caused some miscomparisons against the threshold at which a residue is identified.
Thus the GPU implementation had slightly more residues (around one to ten more)
at the start of the unwrap, but the FPGA had slightly more at the end of the un-
wrap due to the use of mixed-precision. These resulted in minimal differences in the
unwraps. In the case of the glass bead, this meant that initially, 15909 residues were
detected for the GPU version rather than 15907 for the FPGA and software versions,
and the final solution had 4 residues for the GPU versus 6 for the software and 8 for
the FPGA versions.
Figure 4.1: Phase unwraps on a) the reference software implementation, b) the GPU and c) the FPGA
Figure 4.2: Differences in phase unwraps between software and a) the GPU and b) the FPGA
4.1.2 Benchmark Suite
The benchmark suite that we used consisted of three images that encompassed the
range of possible datasets. First, a single mouse embryo image posed an unwrap that
converged to zero residues within 7 iterations of the LP norm or 140 iterations of the
PCG (note that LP norm uses PCG as detailed in Section 2.3.2). Next, the glass
bead sample iterates until it reaches the maximum number of iterations, currently
set at 10 LP norm iterations, each of which iterates the PCG core 20 times. Last
is the double embryo image which takes the full number of iterations as well, but
converges more slowly than the glass bead. Unlike the single mouse embryo which
converges with zero residues and the glass bead that ends with six, the double embryo
terminates with 11 residues remaining.
Each iteration of the LP norm takes the same amount of time; hence the glass
bead and the double embryo unwraps take the same amount of time, whereas the
single embryo image takes 70 percent of their execution time since it iterates
only seven of the ten times.
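The iteration structure described above can be sketched as a driver loop. This is a structural illustration only; `solve_pcg` and `count_residues` are hypothetical stand-ins for the actual kernels.

```python
def lp_norm_unwrap(phase, solve_pcg, count_residues,
                   max_lp_iters=10, pcg_iters=20):
    # Outer LP norm loop: each iteration re-solves with PCG and stops
    # early once the unwrap is residue-free.
    surface = phase
    for lp_iter in range(1, max_lp_iters + 1):
        surface = solve_pcg(surface, iters=pcg_iters)
        if count_residues(surface) == 0:
            return surface, lp_iter
    return surface, max_lp_iters
```

With these defaults, a dataset that converges on the seventh LP norm iteration (like the single embryo) runs 7 x 20 = 140 PCG iterations, while one that never converges (like the glass bead) runs the full 10 x 20 = 200.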
4.2 Results
Now that experimental parameters such as the benchmark suite and verification
procedures have been presented, we discuss performance as measured by three
metrics. First is processing time, or how quickly we arrive at the solution using
the GPU and the FPGA. Next is the cost of the two platforms relative to their
processing power. Last of all we discuss power consumption.
4.2.1 Experiments
In the following experiments, we measure performance (by means of timing profiles
on various sections of the code), cost-effectiveness (by generating the comparison of
performance per unit cost) and finally power (by measuring the current draw at the
wall while running the algorithm for a high number of iterations). All timing numbers
are given in seconds, and power in watts. Note that performance per dollar uses
the inverse of the preconditioner processing time as the performance number; this
is roughly proportional to floating-point operations per second (FLOPS). Machine
1 and Machine 2 are detailed and profiled in Section 3.1.1 and hold the FPGA
accelerator and the GPU board respectively.
4.2.2 FPGA Area Consumption
The area consumption statistics for the single DCT core implementation on the
Annapolis Wildstar II Pro, which uses a Virtex II Pro FPGA, are presented in
Table 4.1.
Component     Available    Used    Percentage
Multipliers         328      48        14 %
BlockRAMs           328      28         8 %
PowerPCs              2       0         0 %
Slices            33088   11328        34 %
Table 4.1: FPGA area consumption for the single DCT core implementation
Up to two more DCT cores could be implemented on the Virtex II Pro, since
current slice usage (the constraining factor) is largely consumed by data
transfer and control; the DCT core itself occupies relatively little area. The
data transfer and control logic does not need to be replicated when adding
cores, so there is room for a maximum of three cores on the FPGA.
4.2.3 Performance
Figure 4.3 shows the speedup when running the preconditioner on the FPGA board
with one DCT core, versus running the entire program on the GPP in software on
Machine 1. These timing numbers are for the glass bead dataset and represent the
sum of 200 iterations of the PCG kernel.
Execution time goes down from an average of 95 seconds for the software version to
40.5 seconds for the FPGA accelerated version, which corresponds to a 2.35x speedup.
These are complete algorithm numbers including full disk IO, data transfer, and
all related costs. Once the overhead, or non-kernel-related functionality, is
removed from the timing, leaving just the preconditioner, we see a 3.76x speedup,
corresponding to a GPP computation time of 74 s and an FPGA time of 19.7 s.
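For concreteness, the two speedup figures quoted above follow directly from the measured times:

```python
# Measured times in seconds for the glass bead dataset (200 PCG iterations).
sw_total,  fpga_total  = 95.0, 40.5   # whole application, incl. disk IO
sw_kernel, fpga_kernel = 74.0, 19.7   # preconditioner kernel only

app_speedup    = sw_total / fpga_total     # ~2.35x
kernel_speedup = sw_kernel / fpga_kernel   # ~3.76x
```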
Figure 4.3: Speedup achieved using the FPGA versus the reference software implementation on Machine 1
Figure 4.4 shows the algorithm speedup when running the LP norm kernel on the
GPU. This includes the preconditioner and the conjugate gradient calculations. It
was executed on Machine 2, whose parameters can be found in Section 3.1.1. These
numbers were generated for the glass bead dataset and thus represent the sum of 200
iterations of the PCG kernel.
The overall algorithm speedup including disk IO and all data transfer is 5.24x.
The section seeing greatest acceleration is the preconditioner which is sped up by
a factor of 9.3x to a time of 1.2s for 200 iterations. One of the reasons why we
see this level of application acceleration is that there is no host-accelerator data
transfer occurring for the preconditioner since the data transfer occurs once per
whole PCG iteration for the GPU implementation. All data transfer occurs over the
GPU memory-processor bus which has a bandwidth of 86.4 GB/s. All sections of the
algorithm see speedups with the exception of disk IO and the overhead calculations.
The preconditioner still takes the majority of the calculation time, but now disk IO
comes a close second. Any further speedup will be limited by the fact that disk
IO cannot be significantly accelerated.
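The disk IO ceiling is an instance of Amdahl's law: once the kernel is fast, the unaccelerated fraction dominates total runtime. A minimal sketch of the bound (the fractions in the usage note are illustrative, not measured values from this work):

```python
def amdahl_speedup(serial_fraction, kernel_speedup):
    # Overall speedup when only the accelerable (1 - serial_fraction)
    # share of the runtime is sped up by kernel_speedup.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / kernel_speedup)
```

For example, if disk IO were 20 percent of the total runtime, even an infinitely fast kernel could deliver no more than a 5x overall speedup.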
Figure 4.5 attempts to compare the FPGA implementation to the GPU imple-
mentation while equalizing other parameters. This necessitates some extrapolation
of the implementation.
The solid bars represent actual synthesized results. One finding from the single
core implementation was that there is sufficient space to implement up to three
cores on the FPGA without significant increases in complexity or any stalling.
We depict that result in the third column.
The reason why the FPGA implementation was synthesized for a Virtex 5 as well
as the Virtex II Pro was that the G80 GPU is a relatively recent product released
in November 2006. The Virtex 5 was released around the same time and thus they
Figure 4.4: Speedup achieved using the GPU versus the reference software implementation on Machine 2
both represent technologies from the same era. The Virtex 2 Pro was released in
2002 which is significantly older. The Virtex 5 implementation operates at 238 MHz,
or about twice the clock speed of the Virtex II Pro. The timing of the FFT core
used also ensures that there is enough computation time to allow three DCT cores
to operate in parallel without significantly changing the core design. Lower logic
utilization on the Virtex 5 due to embedded DSP slices also ensures that there will
be sufficient available area. The last column gives the execution time excluding
data transfer, because for the GPU, no host-to-board data transfer occurs for
the preconditioner; thus this last column models only execution time. The nature
of each result is given in parentheses below the column label, indicating
whether it represents synthesized or projected performance; columns with no
label are real, measured performance.
Even in this last case, the GPU outperforms the FPGA by a factor of 1.9x. This
is due to the high degree of parallelism and the high clock frequency the GPU
achieves in single-precision floating point, which are not attainable on the
FPGA.
Figure 4.5: Time to complete 200 iterations of the preconditioner on both platforms
This section only compares the performance of the preconditioner on the FPGA
versus the GPU since only the preconditioner was implemented on the FPGA. Im-
plementing the entire Conjugate Gradient calculation on the FPGA was infeasible
since it requires single precision accuracy at the very least and implementing the sort
of matrix operations present in the CG calculation in single precision would require
many FPGAs. The best scenario would be a hybrid system like that presented in
[38], which actually saw slowdowns in some non-cache-optimal cases. The possibility
of running one computational kernel, storing the results in off-chip SRAM and re-
programming the FPGA with another kernel is also infeasible given that an FPGA
requires on the order of 110 ms to upload a bitstream (this was timed on the Wildstar
II Pro). The long reprogramming latency is due to the size of the bitstream (over
three megabytes for a V2P70), the serial nature of bitstream loading and the slow
write but fast read nature of the FPGA SRAM LUTs.
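A back-of-the-envelope check, using the 110 ms load time measured above and the 200 PCG iterations of the benchmark runs, makes the infeasibility concrete: swapping bitstreams between two kernels every iteration would cost more in reprogramming alone than the entire 40.5 s accelerated run.

```python
reconfig_s     = 0.110   # measured bitstream load time on the Wildstar II Pro
iterations     = 200     # PCG iterations per glass bead unwrap
swaps_per_iter = 2       # load preconditioner, then load CG, each iteration

reprogram_overhead = reconfig_s * swaps_per_iter * iterations   # 44 s
```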
4.2.4 Cost Effectiveness
In this section we examine the cost effectiveness of the two platforms. Figure
4.6 presents our results. The cost metric is of some importance to the development
of the OQM modality of the Keck microscope, since the goal is to eventually have
the technology commercialized.
The GPU is a mass-market, consumer-level product and, as such, is available
for between five and six hundred dollars for a high-end model. FPGA accelerator
boards are relatively low-volume products and sell for much more, in the range
of ten thousand dollars for a high-end model. In the RCL lab, a new machine with
a Virtex
5 accelerator board was recently purchased for twelve thousand dollars (including the
cost of the machine). This machine is shown as Machine 3 in Figure 4.6. Machine
2 cost approximately twenty thousand (and has two FPGAs) and Machine 1 cost
about twenty two hundred. Both Machine 1 and 2 are the same as those presented
in Section 3.1.1. Performance is based on the reciprocal of the time to complete the
same phase-unwrapping algorithm with identical data sets on both platforms.
Figure 4.6: A comparison showing the performance per dollar on three platforms
The mass-produced GPU clearly wins out when cost is taken into account, since
it performs more than twice as fast as the Virtex 5 at a fraction of the price.
4.2.5 Power
The last metric that we discuss is power. This metric has always been important
in the embedded space, but is becoming increasingly important in modern day HPC
applications since the cost of cooling and powering a cluster over the life of the
hardware can be a significant fraction of the cost of the hardware itself [34]. In
Figure 4.7 we present total power consumption for the two platforms while running
the implementation. This was measured at the wall using a meter. In Figure 4.8 we
show the difference between idle consumption and processing consumption for the
two platforms. This is the actual power consumption of processing the phase unwrap
data. Note that the accelerators consume minimal power when not being used.
Figure 4.7 largely reflects the difference in host processor architectures
rather than accelerator power consumption. Machine 1 contains a Xeon with a peak
power consumption rating of 103 W [13], whereas Machine 2 has a Core 2 Duo with
a peak power consumption rating of 65 W. This shows up clearly on the graphs.
Figure 4.7: A comparison showing the total power consumption for the two machines
Figure 4.8 shows the differences in power consumption between running the phase
unwrapping algorithm on the GPP and on the accelerator for the two machines.
The first pair of columns show the difference between idle power consumption and
processing power consumption when using the GPP. The second pair of columns show
the difference between idle power consumption and processing power consumption
when using the accelerator. The FPGA consumes substantially less power than
both the GPU and the software versions. Thus for Machine 1, power consumption
is actually lowered by 25W while processing is sped up when implemented on an
FPGA. Machine 2 sees a 69W increase in power consumption by running the phase
unwrapping on the GPU.
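Whether the GPU's extra 69 W actually costs more energy depends on runtime as well as power: energy to solution is the product of the two. A small sketch of that accounting (the wattages and runtimes in the usage note are illustrative assumptions, not measurements from this work):

```python
def energy_to_solution_j(idle_w, processing_delta_w, runtime_s):
    # Wall energy for one run, in joules: the baseline (idle) draw plus
    # the processing increment, integrated over the runtime.
    return (idle_w + processing_delta_w) * runtime_s
```

With illustrative numbers, an accelerator that adds 69 W but halves a 20 s runtime still wins on energy: (100 + 69) x 10 = 1690 J versus (100 + 0) x 20 = 2000 J for the unaccelerated run.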
4.3 Summary
In this chapter we presented the results of our experiments. The GPU outperforms
the FPGA by a significant margin and also wins in terms of cost, because the
chip is mass produced. The FPGA is more power efficient, however, which may
be a consideration if phase unwrapping ever needs to be performed in an embedded
fashion. In the next chapter we present some final conclusions drawn from our results
and discuss future work.
Figure 4.8: A comparison showing the power consumption difference between the processor running in the idle state and executing the algorithm using either a GPP or an accelerator
Chapter 5
Conclusion and Future Work
5.1 Conclusion
The type of computational accelerator that is used in an application depends on
the algorithmic nature of the application itself as well as on its intended usage. For
example, a high-performance GPU would be a bad match for an embedded application
and a low-power FPGA would be a bad match for an HPC application. In the context
of the Keck microscope, the high performance and low cost of the GPU platform
makes the most sense since power and area are not major concerns. If development
on the microscope progresses to the point where an embedded accelerator is needed
then the feasibility of an FPGA should be revisited.
The accuracy of both platforms varied as well. The need for greater than single
precision in the conjugate gradient calculation was evidenced by the differing number
of residues detected versus the software implementation. However, this difference is
minimal enough not to significantly affect the results. The preconditioner has
less stringent accuracy requirements, although since the conjugate gradient
depends on its output, it still needs close to single precision. Again, the
results of our mixed fixed- and single-precision FPGA implementation indicate
that the differences compared to the software version are negligible.
The raw computational power of the GPU surpasses that of a comparable FPGA
platform for the preconditioner by a factor of almost two. In addition, the relatively
long time to reprogram an FPGA eliminates its viability as an accelerator for the
entire conjugate gradient application (multiple FPGAs streaming data between them
would have to be used instead which would be prohibitively expensive in terms of
cost and power). There is a limited amount of speedup that the FPGA can produce
by accelerating only the preconditioner and not the conjugate gradient calculations
as well.
5.2 Future Work
This phase unwrapping project will eventually be integrated into the registration
and preprocessing steps used by the OQM microscope, thus eliminating the need
for disk IO. This will bring the effective phase unwrapping time to 2.3 seconds. In
addition, newer GPUs with faster and wider buses as well as more stream
processors [26] that support the CUDA API are currently available. These will
produce significant performance improvements with minimal, or possibly no,
changes to the code.
Further exploration of the feasibility of conjugate gradient on FPGAs would also
be of use to phase unwrapping and other applications. Finding the right balance
between computation offloaded to the accelerator and computation run on the GPP
would be a valuable problem to solve, and one that would change with each new
generation of hardware.
Finally, various optimizations could be made to increase the clock frequency and
the number of cores for the FPGA designs discussed in this thesis. It is also possible
to split the design, including the conjugate gradient calculation, over multiple FP-
GAs. Similarly, optimizations exist that could lower the execution time for the GPU
implementation as well as parallelize the execution over multiple GPUs. Exploration
of these possibilities would push the boundaries of what is currently achievable
in hardware and could lead to valuable results not only for those interested in
phase unwrapping, but also for those working in the HPC and image processing
fields.
Bibliography
[1] Dennis C. Ghiglia and Mark D. Pritt. Two-Dimensional Phase Unwrapping: Theory, Algorithms and Software. Wiley Inter-Science, 605 Third Avenue, New York, NY, 10158-0012, 1998.
[2] 4DSP Inc. IEEE-754 compliant floating-point FFT core for FPGA. http://www.4dsp.com/fft.htm, Last accessed March 2007.
[3] Annapolis Micro. Annapolis Micro Systems Inc. - Wildstar II Pro PCI.http://www.annapmicro.com/wsiippci.html, Last accessed July 2008.
[4] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Transactions on Graphics, 23(3):777–786, August 2004.
[5] Y. Chan and W. Siu. On the realization of discrete cosine transform using the distributed arithmetic. IEEE Transactions on Circuits and Systems, 39(9):705–712, Sept 1992.
[6] C. H. Crawford, P. Henning, M. Kistler, and C. Wright. Accelerating Computing with the Cell Broadband Engine Processor. In Proceedings of the 2008 conference on Computing Frontiers, pages 3–12, 2008.
[7] Cray. Cray XD1 Datasheet. http://www.cray.com/downloads/Cray XD1Datasheet.pdf, Last accessed July 2008.
[8] P. D’Alberto, P. Milder, A. Sandryhaila, F. Franchetti, J. Hoe, J. Moura, and M. Puschel. Generating FPGA-accelerated DFT libraries. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM’07), pages 173–184, 2007.
[9] T. Dillon. Two Virtex-II FPGAs deliver fastest, cheapest, best high-performance image processing system. In Xilinx Xcell J., pages 70–73, 2001.
[10] HP. Accelerating HPC using GPUs. http://www.hp.com/techservers/hpccn/hpccollaboration/ADCatalyst/downloads/accelerating-HPCUsing-GPUs.pdf,Last accessed July 2008.
[11] A. Hull and W. Jenkins. Preconditioned conjugate gradient methods for adaptive filtering. In IEEE International Symposium on Circuits and Systems, pages 540–543, June 1991.
[12] Intel. Intel math kernel library 10.0. http://www.intel.com/cd/software/products/asmo-na/eng/307757.htm, Last accessed July 2008.
[13] Intel. Intel Xeon Processor 3 GHz. http://processorfinder.intel.com/details.aspx?sSpec=SL7DW, Last accessed July 2008.
[14] I.S. Uzun and A. Amira and A. Bouridane. FPGA Implementations Of Fast Fourier Transform For Real-Time Signal And Image Processing. In Proceedings of the IEEE Conference On Vision, Image And Signal Processing, volume 152, pages 283–296, June 2005.
[15] Jeff Bolz and Ian Farmer and Eitan Grinspun and Peter Schroder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Transactions on Graphics, 22(3):917–924, July 2003.
[16] Karasev, P.A. and Campbell, D.P. and Richards, M.A. Obtaining a 35x Speedup in 2D Phase Unwrapping Using Commodity Graphics Processors. In Radar Conference, 2007 IEEE, pages 574–578, April 2007.
[17] Khurram Bukhari, Georgi Kuzmanov and Stamatis Vassiliadis. DCT and IDCT Implementations on Different FPGA Technologies. In Program for Research on Integrated Systems and Circuits (ProRISC), pages 232–235, November 2002.
[18] G. Laevsky, W. C. W. II, M. Rajadhyaksha, and C. A. DiMarzio. Multi-Modal Microscope for Biomedical Research. In Life Science Systems and Applications Workshop, pages 1–2, July 2006.
[19] M. P. Leong and Philip H. W. Leong. A Variable-Radix Digit-Serial Design Methodology and its Application to the Discrete Cosine Transform. IEEE Transactions on Very Large Scale Integrated Systems, 11(1):90–104, Feb 2003.
[20] J. Makhoul. A Fast Cosine Transform in One and Two Dimensions. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(1):27–34, February 1980.
[21] Mark Borgerding. Kiss FFT. http://sourceforge.net/projects/kissfft/, Last accessed July 2008.
[22] Mark Harris. Optimizing Parallel Reduction in CUDA. http://developer.download.nvidia.com/compute/cuda/11/Website/projects/reduction/doc/reduction.pdf, Last accessed July 2008.
[23] Matteo Frigo and Steven G. Johnson. The Design and Implementation of FFTW3. In Proceedings of the IEEE, volume 93, pages 216–231, Feb 2005.
[24] NVIDIA. Cg - Reference Manual. http://developer.download.nvidia.com/cg/Cg 2.0/2.0.0015/Cg-2.0 May2008 ReferenceManual.pdf, Last accessed July 2008.
[25] NVIDIA. CUFFT Library. http://developer.download.nvidia.com/compute/cuda/1 1/CUFFT Library 1.1.pdf, Last accessed July 2008.
[26] NVIDIA. GeForce GTX 280. http://www.nvidia.com/object/geforcegtx 280.html, Last accessed July 2008.
[27] NVIDIA. NVIDIA CUDA Programming Guide. http://developer.download.nvidia.com/compute/cuda/1 1/NVIDIA CUDA Programming Guide 1.1.pdf, Last accessed July 2008.
[28] Pavle Belanovic. Library of Parameterized Hardware Modules for Floating-Point Arithmetic with An Example Application. Masters Thesis, Northeastern University, June, 2002.
[29] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1986.
[30] R. Woods and D. Trainor and J.P. Heron. Applying an XC6200 to real-time image processing. IEEE Design and Test of Computers, 15(1):30–38, Jan-Mar 1998.
[31] Rapidmind. RAPIDMIND: Product resources. http://www.rapidmind.net/resources.php, Last accessed July 2008.
[32] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, August 1994. http://www.cs.cmu.edu/ quake-papers/painless-conjugate-gradient.pdf, Last accessed July 2008.
[33] N. Shirazi, A. Abbot, and P. Athanas. Implementation of a 2-D Fast Fourier Transform on FPGA-Based Custom Computing Machines. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM’95), pages 155–163, April 1995.
[34] Shushant Sharma and Chung-Hsing Hsu and Wu-chun Feng. Making a Case for a Green500 List. In 20th International Parallel and Distributed Processing Symposium (IPDPS) Workshop on High-Performance, Power-Aware Computing (HP-PAC), April 2006.
[35] C. Smith. Phase unwrapping algorithms. Masters Thesis, Northeastern University, 2004.
[36] R. Strzodka and D. Goddeke. Pipelined Mixed Precision Algorithms on FPGAs for Fast and Accurate PDE Solvers from Low Precision Components. In IEEE Proceedings on Field-Programmable Custom Computing Machines, 2006, pages 259–270, 2006.
[37] T. Valich. GPU supercomputer: Nvidia Tesla cards to debut in Bull system. http://www.tomshardware.com/news/nvidia-graphics-supercomputer,5219.html, Last accessed July, 2008.
[38] V.K. Prasanna and G.R. Morris and R. D. Anderson. A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer. In Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 3–12, 2006.
[39] W. Warger. Cell counting using the OQM modality. Masters Thesis, Northeastern University, June, 2006.
[40] Warren E. Ferguson Jr. Selecting Math Coprocessors. IEEE Spectrum, pages 38–41, July 1991.
[41] S. Wasson. Ageia’s PhysX physics processing unit. http://techreport.com/articles.x/10223, Last accessed July 2008.
[42] William Thies and Michal Karczmarek and Saman Amarasinghe. StreamIt: A Language for Streaming Applications. In Proceedings of the 11th International Conference on Compiler Construction, volume 2304, pages 179–196, 2002.
[43] Xilinx. Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet. http://www.xilinx.com/support/documentation/data sheets/ds083.pdf, Last accessed July 2008.
[44] Xilinx Inc. 1-D Discrete Cosine Transform (DCT) V2.1. http://www.xilinx.com/ipcenter/ catalog/logicore/docs/da 1d dct.pdf, Last accessed March 2007.
[45] Xilinx Inc. Fast Fourier Transform 3.2. http://www.xilinx.com/ipcenter/catalog/logicore/docs/xfft.pdf, Last accessed March 2007.
[46] Xilinx Inc. Floating-Point Operator v1.0. http://www.xilinx.com/bvdocs/ipcenter/data sheet/floating point.pdf, Last accessed October 2007.