


Comparison of Processing Performance and Architectural Efficiency Metrics for FPGAs and GPUs in 3D Ultrasound Computer Tomography

Matthias Birk, Matthias Balzer, Nicole Ruiter and Juergen Becker
Karlsruhe Institute of Technology (KIT)

Karlsruhe, Germany
Email: {matthias.birk, matthias.balzer, nicole.ruiter, becker}@kit.edu

Abstract—With the rise of heterogeneous computing architectures, application developers are confronted with a multitude of hardware platforms and the challenge of identifying the most suitable processing platform for their application. Strong competitors for the acceleration of 3D Ultrasound Computer Tomography, a medical imaging method for early breast cancer diagnosis, are GPU and FPGA devices. In this work, we evaluate processing performance and efficiency metrics for current FPGA and GPU devices. We compare top-notch devices from the 40 nm generation as well as FPGA and GPU devices that draw the same amount of power. For our two benchmark algorithms, the results show that if power consumption is not considered, the GPU and the FPGA deliver both a similar processing performance and a similar processing efficiency per transistor. However, if the power budget is limited to a similar value, the FPGA performs between six and eight times better than the GPU.

Keywords-FPGA; GPU; medical imaging; performance comparison; performance analysis;

I. INTRODUCTION

In the days of heterogeneous accelerated computing, application developers in academia and industry are at the same time offered and burdened with a variety of target platforms. When choosing a platform, several aspects have to be considered, most prominently processing performance and energy efficiency. Through time-consuming and error-prone hand-crafted evaluations, developers have to choose a suitable platform that meets the specifications of their target application, and all the while they face the risk of selecting a suboptimal processing platform.

Since the introduction of CUDA in 2006 and the subsequent rise of general-purpose GPU computing, GPUs and FPGA-based systems in particular are strong competitors. Since then, performance comparisons of FPGAs against GPUs have been published for a range of diverse application domains in order to determine the individual strengths and weaknesses of each architecture. However, the existing performance and power evaluations are often unbalanced, as not enough care has been taken in selecting the respective FPGA

This work has been funded by Deutsche Forschungsgemeinschaft.

and GPU counterparts. For example, in [1]–[3] the processing performance of multiple FPGA devices is compared against a single GPU, so that the resulting values give a misleading impression of the actual device-per-device value. In [4]–[6] single devices are used, but the FPGA and GPU do not originate from the same generation, i.e. the semiconductor process in which the devices are fabricated. For a meaningful comparison, performance values would need to be normalized to a common denominator in this case. The same is true in [7]–[9], where the target devices are of the same generation but of different grade, i.e. not the fastest, and in case of the FPGA also largest, available devices are chosen.

Similarly, if power is considered, as in [10]–[12], the consumed energy as the product of power and execution time is reported for unbalanced devices. Here too, execution times for top-notch devices should be measured and related to their power consumption. Furthermore, as this comparison neglects overheads created by external cooling and assumes the necessary power to be always available, a comparison of devices that draw the same amount of power is of interest.

Addressing the shortcomings above, we compare GPU and FPGA devices for two benchmark algorithms from 3D Ultrasound Computer Tomography, a medical imaging method for early breast cancer diagnosis with high computational needs. We compare processing performance for the largest devices available in the mature 40 nm generation as well as for devices with a similar power consumption. In order to compare the differing FPGA and GPU computing architectures more fundamentally, we determine area efficiency (performance per transistor), power efficiency (performance per Watt) as well as computational efficiency (performance per cycle) from the values above.

II. ARCHITECTURE SELECTION AND DESCRIPTION

For our performance comparison, we chose an Nvidia GeForce GTX 580 GPU, based on the Fermi micro architecture [13]. This is Nvidia's highest-performance single-chip GPU in the 40 nm generation. As an FPGA counterpart in the same generation, we selected a Xilinx Virtex-6 SX475T-2 FPGA [14], which is the largest available FPGA in

978-1-4673-2921-7/12/$31.00 © 2012 IEEE


Table I
GPU ARCHITECTURE CHARACTERISTICS

Feature         GTX 580    GT 540M
Graphic Chip    GF 110     GF 108
Clock Rate      1.54 GHz   1.34 GHz
Number of SMs   16         2
Number of CCs   512        96

the Xilinx SXT series. We decided to use an FPGA from this series as it is optimized for Ultra High-Performance Digital Signal Processing [14] and is therefore the adequate candidate for a performance comparison. With a complexity of 3 billion (bi.) transistors in case of the GPU [15] and 2.5 bi. for the FPGA [16], both devices are also technologically comparable.

However, the GTX 580 GPU is specified with a power consumption of 244 W. So in order to select a GPU with a power consumption similar to the V6SX475T-2 FPGA, we estimated the FPGA's power consumption with the Xilinx Power Estimator (XPE) [17]. We parametrized a well-filled design with a similar distribution of device elements and clock frequencies as in our designs. This gave an approximate power consumption of 24 W, to which we added 6 W for the external DDR3 RAM [18] as well as 5 W for further peripherals. This results in an overall power consumption of about 35 W for the whole FPGA system, which is equal to that of the Nvidia GeForce GT 540M, a GPU that is also based on the Fermi micro architecture and fabricated in the 40 nm process.

In the following subsections, we introduce the further characteristics of the targeted FPGA and GPU devices. In case of the FPGA, we present the PCI-Express framework design used to connect the algorithmic modules to the outer system. Thus, the FPGA board is connected to a host PC and operated in the same configuration as the GPUs.

A. GPUs: Nvidia GTX 580 and GT540M

As both GPUs derive from the Fermi micro architecture and only use different graphic chips, their structural composition is mainly identical. Both consist of a certain number of execution units (Cuda Cores, CCs) grouped into Streaming Multiprocessors (SMs); see Tab. I for the exact numbers per GPU. Besides the CCs, each SM contains 16 Load/Store units, a register file and special function units (SFUs) for the computation of transcendental functions. Furthermore, each SM has a texture and constant cache as well as a 64 kByte local memory common to all its CCs. This local memory is partly configurable for use as scratch-pad memory or as L1 cache. There is also a unified Level-2 cache, which is shared by all SMs and transparently accelerates all accesses to the external global memory.

Figure 1. Block diagram of the device architecture on the Xilinx Virtex-6 SX475T. The communication system consists of a PCIe endpoint and an AXI subsystem, which includes an AXI-Lite and an AXI-Full bus infrastructure as well as a DDR3 controller connected as a slave interface. The PCIe endpoint serves as bus master.

We programmed all GPU kernels in CUDA C using version 4.2. Nvidia's CUDA programming framework matches the underlying hardware architecture. It uses a single-instruction-multiple-threads (SIMT) strategy and makes use of extensive hardware multi-threading in order to hide memory access latencies. In this way, latency is traded for throughput. Following the SM and CC hierarchy, a scalar kernel is launched on a specified number of threads, which are further organized into thread-blocks. Only the global memory is accessible from the host CPU. Within the GPU device, registers are private to a single thread and data in shared memory can be accessed by all threads within the same block. In contrast, the global memory is accessible by all threads.

B. FPGA: Xilinx Virtex-6 SX475T

The overall FPGA communication system architecture, see Fig. 1, consists of a PCI-Express subsystem and an AMBA AXI bus subsystem. The AXI subsystem includes an AXI-Lite and an AXI-Full bus infrastructure and also includes the memory controller for an external 2 GByte DDR3 module connected as a slave to the AXI bus. All necessary AXI cores have been created using the Xilinx EDK environment.

The PCIe subsystem is originally based on the design¹ of Marcus et al. [19]. We further optimized the design in order to meet the 8-lane, generation 1 timing and to provide an improved FPGA-to-host throughput by pipelined read transactions. Furthermore, we included an AXI-Lite (control) as well as an AXI-Full (data) bus master. Thus, the host PC is able to access the connected AXI-Lite slaves via memory-mapped reads and writes. The AXI-Full slaves can be read from and written to via DMA transactions. In order to achieve the necessary PCIe throughput, the endpoint runs at 250 MHz and interfaces to the AXI subsystem running at 125 MHz, which doubles the data-path width from 64 to 128 bit.
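As a quick plausibility check (a sketch added here, not from the paper), the datapath widths and clocks on both sides of this clock-domain crossing carry the same bandwidth, which matches the raw payload rate of an 8-lane Gen-1 link:

```python
# Bandwidth must match across the 250 MHz / 125 MHz clock-domain crossing.
pcie_bits_per_s = 250e6 * 64    # endpoint side: 250 MHz, 64-bit datapath
axi_bits_per_s = 125e6 * 128    # AXI side: 125 MHz, 128-bit datapath
assert pcie_bits_per_s == axi_bits_per_s   # 16 Gbit/s on both sides

# An 8-lane Gen-1 link runs at 2.5 GT/s per lane with 8b/10b encoding,
# i.e. 8 * 2.5e9 * 8/10 = 16 Gbit/s of payload bandwidth -- the internal
# datapath is sized to sustain exactly that rate.
link_bits_per_s = 8 * 2.5e9 * 8 / 10
assert link_bits_per_s == pcie_bits_per_s
```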

III. APPLICATION DESCRIPTION

The chosen benchmark algorithms originate from the medical imaging domain and in particular from three-dimensional

¹ Available online at http://opencores.com/project,pcie_sg_dma


Figure 2. Schematic drawing of an USCT measurement (top) and an image of our prototype system with covers removed (bottom). During a measurement, the breast hangs freely into the water-filled basin. The surface of this basin is equipped with more than 2000 ultrasound transducers.

ultrasound computer tomography (3D USCT) [20]. 3D USCT is a new medical imaging method for early breast cancer diagnosis. Fig. 2 shows a schematic drawing of the measurement setup and an image of the current experimental device. During a measurement, the breast is suspended into a water-filled basin and surrounded by thousands of ultrasound transducers, i.e. emitters and receivers. The emitters sequentially send an ultrasonic wave front, which interacts with the breast tissue and is recorded by the receivers as a pressure variation over time. These signals, also called A-Scans, are sampled and stored for all available emitter-receiver combinations. This results in millions of A-Scans and gigabytes of raw data, which are recorded during the measurement of a single breast.

The recorded data then needs to be processed off-line in order to reconstruct volume images. The processing chain can be separated into multiple tasks. In this work, we focus on signal processing, which directly operates on the acquired raw data, and on the core element of the actual image reconstruction. The signal processing sequence, called Adapted Matched Filtering (AMF) [21], is applied in order to improve the contrast in the reconstructed images. The Synthetic Aperture Focusing Technique (SAFT) [22] reconstruction algorithm then exploits the pressure-over-time information in the A-Scans in order to create reflectivity images.

The following subsections introduce our benchmark algorithms with the details necessary for an understanding of the work at hand. A more thorough description of AMF and SAFT can be found in [23] and [24], respectively.

A. Signal Processing: AMF

Every A-Scan consists of 3000 discrete pressure samples over time. The AMF can be performed separately and independently per A-Scan, without any inter-A-Scan communication. It consists of multiple stages: Firstly, we correlate the raw A-Scan A_RAW with the matched filter signal MF. The

MF is the expected waveform at the receivers and is, in our case, a frequency-coded chirp. The MF correlation kernel consists of 128 samples, and by a time reversal of MF we convert the correlation into a convolution. Secondly, we generate the signal envelope A_ENV of the matched-filtered signal A_MF by an approximated Hilbert transform [25], i.e. the input signal A_MF is first convolved with a special set of HIL coefficients, in our case 79 samples. The result A_HIL is then taken as the imaginary part and the input signal A_MF as the real part of an intermediate signal. This complex signal is known as the analytic signal, and its absolute value is equal to the envelope of the input signal A_MF.

In the next step, we reduce the envelope signal A_ENV to its local maxima A_PEAK by simply comparing each sample's value with the adjacent signal values. Here, the signal value at peaks is retained and all other samples are set to zero. Finally, we convolve with an optimal pulse OP in order to produce the output signal A_OP. This final convolution shapes the output signal optimally for the subsequent SAFT image reconstruction [26]. The applied OP convolution kernel consists of 19 samples. The signal processing steps are summarized in equations (1), where ∗ denotes the convolution operator and localmax(·) the peak function described above.

A_MF = MF ∗ A_RAW
A_HIL = HIL ∗ A_MF
A_ENV = ‖A_MF + j·A_HIL‖ = √(A_MF² + A_HIL²)        (1)
A_PEAK = localmax(A_ENV)
A_OP = OP ∗ A_PEAK
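The AMF chain of equations (1) can be sketched as a toy reference implementation. This is a minimal sketch: the filter coefficients (MF, HIL, OP) and the boundary handling are placeholders, not the coefficients used in the actual system.

```python
import math

def convolve(signal, kernel):
    """Convolution, cropped to the input length (simplified boundary handling)."""
    n, m = len(signal), len(kernel)
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for j in range(m):
            if 0 <= i - j < n:
                acc += kernel[j] * signal[i - j]
        out[i] = acc
    return out

def localmax(signal):
    """Keep samples that are local peaks; set everything else to zero."""
    out = [0.0] * len(signal)
    for i in range(1, len(signal) - 1):
        if signal[i] > signal[i - 1] and signal[i] > signal[i + 1]:
            out[i] = signal[i]
    return out

def amf(a_raw, mf, hil, op):
    a_mf = convolve(a_raw, mf)                 # matched filtering
    a_hil = convolve(a_mf, hil)                # approximated Hilbert transform
    a_env = [math.sqrt(x * x + y * y)          # envelope = |analytic signal|
             for x, y in zip(a_mf, a_hil)]
    a_peak = localmax(a_env)                   # reduce to local maxima
    return convolve(a_peak, op)                # optimal-pulse shaping
```

With identity-like placeholder filters, an impulse passes through unchanged, which makes the data flow of the five stages easy to trace.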

B. Image Reconstruction: SAFT

The basic idea of the SAFT reconstruction is to accumulate many low-quality images, recorded by transducers at different geometric positions, in order to create one high-resolution image. The assumption is that if emitter i (at position x_i) sends an ultrasonic wavefront, which is scattered at an arbitrary volume position x, then the echo is recorded at receiver k (at position x_k) at a specific point in time. This point in time is given by the distance emitter-scatterer-receiver and the speed of sound on the travelled path.

I(x) = Σ_{∀(i,k)} A( t = (‖x_i − x‖ + ‖x_k − x‖) / c(i,k) )        (2)

‖a − b‖ = √((a_x − b_x)² + (a_y − b_y)² + (a_z − b_z)²)
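Equation (2) can be sketched as a straightforward delay-and-sum loop. This is an illustrative sketch only: the data layout and a single constant speed of sound c are assumptions (the paper uses a per-pair value c(i,k)).

```python
import math

def dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((ax - bx) ** 2 for ax, bx in zip(a, b)))

def saft(voxels, ascans, sample_rate, c):
    """Accumulate A-Scan amplitudes at the expected echo arrival times (eq. 2).

    voxels:      list of (x, y, z) grid positions
    ascans:      dict mapping (emitter_pos, receiver_pos) -> list of samples
    sample_rate: samples per second
    c:           assumed constant speed of sound
    """
    image = [0.0] * len(voxels)
    for (xi, xk), samples in ascans.items():
        for v, x in enumerate(voxels):
            t = (dist(xi, x) + dist(xk, x)) / c   # travel time emitter-voxel-receiver
            idx = int(round(t * sample_rate))     # nearest sample index
            if 0 <= idx < len(samples):
                image[v] += samples[idx]
    return image
```

Each A-Scan sample is thereby spread over the ellipsoid of voxels sharing the same travel time, and true scatterers emerge from the superposition.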

The SAFT reconstruction thus assumes each position within the USCT device to be a possible scattering position. Therefore, the imaged region-of-interest (ROI) is divided into a discrete grid of voxels (volume pixels). Following


equation (2), for each particular voxel position x, the expected signal arrival times for all recorded emitter-receiver combinations (i, k) are calculated. Then, the A-Scans' amplitude values at these time indices t′ are loaded and accumulated as the voxel intensity value I(x). By doing so, each A-Scan sample contributes to all voxels on the surface of an ellipsoid, but as a consequence of the superposition of a huge number of A-Scans, the true scattering positions are highlighted most.

IV. IMPLEMENTATION

A. Signal Processing on GPUs

Following the description in section III-A, we divided the main AMF signal processing sequence into four separate kernels. The first kernel correlates the A-Scan with the matched filter; the second kernel applies the approximated Hilbert transform and calculates the absolute value of the analytic signal. The subsequent third kernel determines local maxima and the final kernel convolves with the optimal pulse. We implemented two additional helper kernels, one before and one after the actual processing. These are used to remove the offset of the raw signal and to scale the output signal, respectively. As every A-Scan sample is handled by a separate thread and successive threads process successive samples of an A-Scan, the accesses for loading and storing data to and from global memory are aligned and can be coalesced [13].

Each of the processing kernels first reads its inputs from global into shared memory, then processes the loaded data set and stores the results back. However, as the maximum number of 1024 threads per block is smaller than the number of samples per A-Scan (3000), we had to divide the processing of a complete A-Scan into multiple blocks and added so-called ghost zones. Therefore, each thread-block has a certain overlap with its neighbouring blocks. The extent of this overlap is given by the filter width of the particular processing step. Threads within these overlapping regions only read from global memory, but do not participate in processing. However, using shared memory still proved to be more efficient than directly operating on global memory.

In order to minimize this inefficiency in thread usage, we decided to implement a separate kernel for each processing step. We then experimentally determined the optimal block sizes individually for each kernel.
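The ghost-zone decomposition described above can be sketched as follows. The block size and the symmetric halo derived from the filter width are illustrative assumptions, not the tuned values from the paper.

```python
def ghost_blocks(n_samples, block_size, filter_width):
    """Partition an A-Scan into blocks with overlapping ghost zones.

    Each block loads `halo` extra samples on each side so that the
    convolution at its edges has all inputs available in shared memory;
    the halo threads only load data and do not produce output.
    Returns (load_start, load_end, compute_start, compute_end) tuples.
    """
    halo = filter_width // 2
    blocks = []
    for start in range(0, n_samples, block_size):
        end = min(start + block_size, n_samples)
        load_start = max(0, start - halo)       # ghost zone on the left
        load_end = min(n_samples, end + halo)   # ghost zone on the right
        blocks.append((load_start, load_end, start, end))
    return blocks
```

For a 3000-sample A-Scan, a 1024-thread block and a 128-tap filter, this yields three blocks whose compute ranges tile the scan while their load ranges overlap by the halo width.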

B. Signal Processing on FPGA

The AMF processing module implements a control and status interface via an AXI-Lite slave module. During processing, an external bus master has to write the raw A-Scans to the fully bi-directional AXI-Full slave interface and, after some initial delay, to read back the processed data. The overall processing module consists of eight fully-pipelined AMF units, so that the overall processing throughput of eight

Figure 3. Block diagram of the AMF processing module clocked at 125 MHz. It consists of 8 distinct processing units as well as AXI-Lite and AXI-Full slave interfaces to connect to the outer system in Fig. 1. The fully-pipelined module has a throughput of eight samples per clock cycle, which is equal to the peak bus throughput.

samples per cycle is equal to the peak bus throughput. A simplified block diagram is given in Fig. 3.

In this configuration, every eighth A-Scan is given to a particular processing unit, and as the incoming bus words have a width of 128 bit, they are first buffered in the input FIFO. From there, single samples of 16 bit are fed into the scaleScan unit, where the varying signal offset is determined and subtracted. There, the samples are also scaled so that the maximum value of each A-Scan uses the full available range. The resulting samples are then given one after the other to the AMF unit, where they are streamed through the processing modules in the order given in section III-A. The processed samples are then given to the rescaleScan unit and rescaled to compensate the input scaling. In the output FIFO, the samples are buffered and packed into bus words of 128 bit.
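The scaleScan/rescaleScan pair can be sketched in integer arithmetic. The 16-bit sample width is from the text; the mean-based offset estimate and the floating-point scale factor are simplifying assumptions, since the hardware works in fixed-point arithmetic.

```python
FULL_SCALE = 2 ** 15 - 1  # signed 16-bit sample range

def scale_scan(samples):
    """scaleScan: subtract the signal offset and scale to the full range.

    Returns the scaled samples plus the offset and factor needed to
    undo the transformation later.
    """
    offset = sum(samples) // len(samples)          # simple offset estimate (assumption)
    centered = [s - offset for s in samples]
    peak = max(abs(s) for s in centered) or 1      # guard against all-zero scans
    factor = FULL_SCALE / peak
    return [int(round(s * factor)) for s in centered], offset, factor

def rescale_scan(samples, factor):
    """rescaleScan: compensate the input scaling after processing."""
    return [int(round(s / factor)) for s in samples]
```

Scaling each A-Scan to full range before filtering keeps the fixed-point pipeline's dynamic range usage independent of the raw signal amplitude.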

The complete module runs at 125 MHz. Tab. II shows the total number of device elements as well as their occupation by the AMF processing system. We implemented all occurring convolutions (MF, HIL, OP) via finite impulse response (FIR) filters and generated these with the Xilinx CORE Generator. In case of the MF and OP filters, the coefficients are reloadable via the control interface, so that the module can be adapted to different experiments and reconstruction scenarios. As there are non-linear operations like the square (SQR) and square root (SQRT) within the processing chain and the MF output signal is needed to compose the analytic signal, we had to implement separate FIR filters for each operation; a consolidation into a combined filter function is not possible.

C. Image Reconstruction on GPUs

We parallelized the SAFT reconstruction voxel-wise, so that each thread reconstructs one voxel. Each thread-block consists of a plane of 16 × 16 voxels. Within the kernel, we loop over the A-Scans, and in the loop body we calculate the distances emitter-voxel and voxel-receiver, sum them up and multiply with the precomputed inverse speed-of-sound value (see equation (2)). Then, the determined sample's value is loaded and locally accumulated. As the sample's memory index depends on the calculation above, the A-Scans'


Table II
VIRTEX-6 SX475T DEVICE OCCUPATION

Element          Total    AMF system   SAFT system
Logic Slices     74 400   41 %         78 %
DSP48E1 Slices   2 016    66 %         50 %
BlockRAM         1 064    7 %          61 %

Figure 4. Block diagram of the SAFT processing module clocked at 125 MHz and consisting of the saftController and 64 parallel saftUnits. The saftController connects to the outer system, see Fig. 1, via an AXI-Lite slave and an AXI-Full master interface. The saftController is responsible for controlling and monitoring the processing flow, whereas the fully-pipelined saftUnits reconstruct two voxels per unit, or 128 voxels in total, per clock cycle.

samples are loaded in a non-deterministic manner. But as the samples are only used once per thread, it is most efficient to load them directly from global memory via the GPUs' texture fetching units [24].

When exiting the loop, the resulting voxel value is stored back to global memory, and after the reconstruction is completed, the voxel field is transferred from global GPU memory to host memory.

D. Image Reconstruction on FPGA

The overall structure of the SAFT module can be divided into a control part and a computational part, see also Fig. 4: The saftController is on the one hand responsible for the overall processing flow (controlFSM) and for providing the computational part (saftUnits) with the emitter and receiver coordinates of the currently processed A-Scan (geoCoordMem). For this, it implements the host-accessible control and status registers via an AXI-Lite slave module. On the other hand, it also loads A-Scans (ascanLoad) and reads or writes the voxel field (voxelRdWr) from and to the DDR3 memory via an AXI-Full master module. The computational part consists of 64 parallel saftUnits, where each contains a voxelCoordGen, two sampleUnits, a scanBuffer and a voxelBuffer.

During processing, all available A-Scans are used one after the other to reconstruct an image slice of 64 × 1024 = 65,536 voxels, which are internally buffered in the voxelBuffer modules. For each A-Scan, the processing begins with the supplied start voxel x_P, and the voxelCoordGen generates two successive voxel coordinates in each clock cycle. These are then fed into the two sampleUnits. There, the sample indices are determined according to equation (2), i.e. the distances emitter-voxel and voxel-receiver are calculated, summed up and then multiplied with the precomputed inverse speed-of-sound value. The indices are then used to address a true dual-ported BlockRAM holding the current A-Scan in the scanBuffer. Finally, the loaded sample values are accumulated in the voxelBuffer, which contains two 64-bit adders as well as a deep FIFO that buffers an image slice of 1024 voxels.

The complete saftModule runs at 125 MHz. Tab. II gives an overview of the total number of device elements as well as their percentage occupation by the SAFT processing system. The processing is deeply pipelined and has a throughput of 2 voxels per saftUnit, or 128 voxels in total, per cycle. The A-Scan loading from DDR3 memory is double-buffered, so that the successive scan is always preloaded while the current scan is being processed. However, at image dimensions larger than the internally buffered 64k voxels, the voxel set has to be exchanged with the next slice, i.e. the current voxel set is written to DDR3 memory and optionally the successive set is read in. During this voxel exchange phase, computing is paused. To enable both the exchange functionality and a scalable system design, all 64 FIFOs within the voxelBuffer modules are chained, so that they externally behave like a single FIFO.
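The stated design figures can be cross-checked arithmetically (a sketch of the peak rate, ignoring the paused voxel-exchange phases):

```python
clock_hz = 125e6              # saftModule clock
units = 64                    # parallel saftUnits
voxels_per_unit_cycle = 2     # fully-pipelined throughput per unit

peak = clock_hz * units * voxels_per_unit_cycle   # voxel updates per second
assert peak == 16e9                               # 16 GVoxAScans/s peak

# The measured 15.7 GVoxAScans/s (Table III) reaches about 98% of this
# peak; the remainder is lost to the paused voxel-exchange phases.
measured = 15.7e9
assert 0.97 < measured / peak < 1.0
```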

V. RESULTS AND DISCUSSION

We operated both the GPUs and the FPGA as hardware accelerators connected via PCI-Express to a host PC. For each device, we copied the A-Scans to the accelerator and read back the processed A-Scans (signal processing) or the volume data (image reconstruction). We used a double-buffering strategy and chose the problem sizes large enough to saturate performance on each device. We then measured execution time, including the actual calculations as well as the transfers via PCI-Express. However, due to the applied double-buffering strategy, and as all implementations besides the signal processing on the FPGA are compute-bound, the transfers can mostly be neglected. In case of the single communication-bound implementation, we differentiate between the performance of the PCI-Express system configuration and the performance of the pure FPGA.

In case of the FPGA, we had to measure execution times with scaled-down designs on a ML605 evaluation board with a Virtex-6 LX240T FPGA. However, we implemented the critical processing path exactly as in the larger V6SX475T FPGA, so that the resulting performance values are valid. For the SAFT reconstruction, this means that we implemented only 32 instead of 64 SAFT units, but as the idle time during the voxel exchange phase hinders a linear performance


Table III
PERFORMANCE VALUES

Algorithm             GTX 580   GT 540M   V6SX475T
AMF [kAScans/s]       368       41.1      238 / 319 †
SAFT [GVoxAScans/s]   17.2      2.64      15.7

† The lower value denotes the system configuration (comm-bound), the higher value the pure FPGA performance (comp-bound).

scaling, we read in and wrote out double the amount of data. In case of the AMF signal processing, we likewise implemented eight AMF units, but with reduced filter sizes. Although this influences latency, the throughput in the measured saturation region is not affected. On the other hand, we implemented four regular filters. In that case, performance is also compute-bound and we are able to deduce the pure FPGA performance.

A. Performance Comparison

As the performance metric for the AMF signal processing, we use the number of processed A-Scans per time unit (AScans/s). For the SAFT reconstruction, we determine the number of reconstructed voxels multiplied by the number of used A-Scans per time unit (VoxAScans/s). Table III gives an overview of the performance results. For the AMF signal processing, the GTX 580 is 1.5-fold faster than the V6SX475T if the bandwidth limitation by PCI-Express is considered. However, this performance gap is reduced to only 15% if the pure FPGA performance is compared. We obtained a similar value of a 10% speed-up of the GPU for the SAFT reconstruction. When limiting the input power budget to a similar value, the FPGA device performs nearly six times faster (comm-bound) and over seven times faster (comp-bound) than the GT 540M for the AMF signal processing. Likewise, the SAFT reconstruction on the FPGA is six times faster than on the GT 540M.

A direct comparison of the high-performance and the low-power GPU shows a performance difference of a factor of 9 for the AMF signal processing and of a factor of 6.5 for the SAFT reconstruction. While the value for the SAFT reconstruction is in the region of the difference in peak performances (a factor of 6.1), the AMF signal processing does not scale down well with the shrinking device size. A possible explanation for this is the difference in the GF108 and GF110 micro architectures: When increasing the number of Cuda Cores per Streaming Multiprocessor (see Tab. I), Nvidia added a superscalar execution model to the GF108 chip. In the absence of instruction-level parallelism within a thread, as is the case in the convolutions that dominate the AMF signal processing, this feature leaves 1/3 of the Cuda Cores underutilized. The corrected difference of peak values between the GTX 580 and the GT 540M for this scenario is a factor of 9.2, which is in the range of the AMF signal processing ratio.
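The ratios quoted in this subsection follow directly from Table III; as a sketch, they can be recomputed as follows:

```python
# Performance values from Table III
amf = {"GTX 580": 368, "GT 540M": 41.1, "V6SX475T comm": 238, "V6SX475T comp": 319}
saft = {"GTX 580": 17.2, "GT 540M": 2.64, "V6SX475T": 15.7}

assert round(amf["GTX 580"] / amf["V6SX475T comm"], 1) == 1.5      # 1.5-fold (comm-bound)
assert round((amf["GTX 580"] / amf["V6SX475T comp"] - 1) * 100) == 15  # ~15% gap (comp-bound)
assert round((saft["GTX 580"] / saft["V6SX475T"] - 1) * 100) == 10     # ~10% gap
assert round(amf["V6SX475T comm"] / amf["GT 540M"], 1) == 5.8      # nearly six times
assert amf["V6SX475T comp"] / amf["GT 540M"] > 7                   # over seven times
assert round(saft["V6SX475T"] / saft["GT 540M"], 1) == 5.9         # about six times
assert round(amf["GTX 580"] / amf["GT 540M"]) == 9                 # GPU vs GPU: factor 9
assert round(saft["GTX 580"] / saft["GT 540M"], 1) == 6.5          # GPU vs GPU: factor 6.5
```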

Table IV
EFFICIENCY VALUES: GTX 580 VS. V6SX475T

                 AMF [kAScans/s]        SAFT [GVoxAScans/s]
Performance      GTX 580   V6SX475T    GTX 580   V6SX475T
per Watt         1.51      9.1         0.0705    0.449
per bi. cycles   239       2548        11.2      125
per bi. trans.   122.5     127.4       5.73      6.28

B. Architectural Efficiency

For the comparison of architectural efficiency, we argue that the pure FPGA performance of the AMF signal processing has more significance and use only this value. Furthermore, as the performance scales with GPU size, we limit the discussion to the top-notch device comparison, i.e. GTX 580 GPU vs. V6SX475T FPGA.

The calculated efficiency values can be found in Tab. IV. As a consequence of similar performance values and differing power consumptions, the FPGA's processing performance per Watt is over six times better than the GPU's for both benchmark algorithms. The performance per cycle of the FPGA is even more than 10-fold higher than for the GPU. Thus, pipelined processing in a configurable data path is, in terms of cycles, much more efficient than software-based processing on a fixed microarchitecture. However, this configurability comes at the cost of a much reduced maximum clock frequency. In order to compare these two processing approaches on a fundamental level, we normalize the measured performance to the number of transistors per device (area efficiency). Interestingly, area efficiency is in the same range for the GTX 580 GPU and the V6SX475T FPGA: it differs by only 4% for the signal processing and 10% for the SAFT reconstruction.
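These comparisons can be recomputed directly from the AMF column of Table IV (per-Watt, per-billion-cycles, and per-billion-transistors values):

```python
# Recomputing the efficiency comparisons from the AMF column of Table IV.
gtx580 = {"per_watt": 1.51, "per_cycle": 239,  "per_trans": 122.5}
fpga   = {"per_watt": 9.1,  "per_cycle": 2548, "per_trans": 127.4}

per_watt_ratio  = fpga["per_watt"] / gtx580["per_watt"]        # ~6x
per_cycle_ratio = fpga["per_cycle"] / gtx580["per_cycle"]      # ~10.7x
per_trans_gap   = fpga["per_trans"] / gtx580["per_trans"] - 1  # ~4 %

print(f"per Watt: {per_watt_ratio:.1f}x")
print(f"per cycle: {per_cycle_ratio:.1f}x")
print(f"per transistor: {per_trans_gap:.0%}")
```

The same arithmetic applied to the SAFT column yields the 10% area-efficiency gap quoted in the text.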

VI. CONCLUSIONS AND OUTLOOK

In this work, we investigated the processing performance and architectural efficiency of processing algorithms taken from 3D Ultrasound Computer Tomography on GPUs and an FPGA. We selected devices of similar characteristics and compared both top-notch devices from the 40 nm generation and devices with the same power budget. The results show that for our applications, if power consumption is not an issue, the GPU has a higher performance. In contrast, if power input and cooling requirements play a role, the FPGA is by far superior to a comparable GPU.

When calculating efficiency values for the top-notch devices, both the energy efficiency and the computational efficiency are vastly in favour of the FPGA. The area efficiency of both architectures is yet comparable. It is also interesting to note that the slight performance difference between the top-notch devices is very much alike for both algorithms, although their characteristics vary greatly. Similarly, the difference between the two benchmark algorithms is, for all three efficiency metrics, in the order of only 5%.
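The ~5% consistency claim can be checked from Table IV: for each of the three metrics, form the FPGA/GPU ratio separately for AMF and SAFT and compare the two ratios.

```python
# FPGA/GPU efficiency ratios per metric, computed from Table IV,
# separately for the two benchmark algorithms.
amf  = {"per_watt": 9.1 / 1.51,     "per_cycle": 2548 / 239,
        "per_trans": 127.4 / 122.5}
saft = {"per_watt": 0.449 / 0.0705, "per_cycle": 125 / 11.2,
        "per_trans": 6.28 / 5.73}

# Relative difference of the two ratios per metric; all land near 5%.
for metric in amf:
    rel_diff = abs(amf[metric] - saft[metric]) / amf[metric]
    print(f"{metric}: {rel_diff:.1%}")
```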


In future work, we plan to use the set of metrics on a larger group of algorithms in order to identify their variability. Furthermore, we want to extend our investigations to further types of heterogeneous target devices in order to determine application characteristics which are suitable for a certain type of accelerator.

REFERENCES

[1] B. V. Essen, C. Macaraeg, M. Gokhale, and R. Prenger, "Accelerating a Random Forest Classifier: Multi-Core, GP-GPU, or FPGA?" in Proc. of 20th Annual International Symposium on FCCM, 2012, pp. 232–239.

[2] D. Jones, A. Powell, C.-S. Bouganis, and P. Cheung, "GPU Versus FPGA for High Productivity Computing," in International Conference on Field Programmable Logic and Applications (FPL), 2010, pp. 119–124.

[3] S. Kestur, J. Davis, and O. Williams, "BLAS Comparison on FPGA, CPU and GPU," in IEEE Computer Society Annual Symposium on VLSI, 2010, pp. 288–293.

[4] R. Weber, A. Gothandaraman, R. Hinde, and G. Peterson, "Comparing hardware accelerators in scientific applications: A case study," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 58–68, 2011.

[5] R. Arce-Nazario and J. Ortiz-Ubarri, "Enumeration of Costas arrays using GPUs and FPGAs," in International Conference on Reconfigurable Computing and FPGAs, 2011, pp. 462–467.

[6] I. Pechan and B. Feher, "Molecular docking on FPGA and GPU platforms," in International Conference on Field Programmable Logic and Applications, 2011, pp. 474–477.

[7] D. Theodoropoulos, G. Kuzmanov, and G. Gaydadjiev, "Multi-Core Platforms for Beamforming and Wave Field Synthesis," IEEE Transactions on Multimedia, vol. 13, no. 2, pp. 235–245, 2011.

[8] K. Theoharoulis, C. Antoniadis, N. Bellas, and C. Antonopoulos, "Implementation and performance analysis of SEAL encryption on FPGA, GPU and multi-core processors," in 19th Annual International Symposium on FCCM, 2011, pp. 65–68.

[9] N. Kapre and A. DeHon, "Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors," in International Conference on Field Programmable Logic and Applications, 2009, pp. 65–72.

[10] J. Fowers, G. Brown, P. Cooke, and G. Stitt, "A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA '12, 2012, pp. 47–56.

[11] B. Duan, W. Wang, X. Li, C. Zhang, P. Zhang, and N. Sun, "Floating-point mixed-radix FFT core generation for FPGA and comparison with GPU and CPU," in International Conference on Field-Programmable Technology, 2011, pp. 1–6.

[12] H. Hussain, K. Benkrid, A. Erdogan, and H. Seker, "Highly Parameterized K-means Clustering on FPGAs: Comparative Results with GPPs and GPUs," in International Conference on Reconfigurable Computing and FPGAs, 2011, pp. 475–480.

[13] J. Nickolls and W. Dally, "The GPU Computing Era," IEEE Micro, vol. 30, no. 2, pp. 56–69, 2010.

[14] Xilinx Corporation, "Xilinx Virtex-6 Documentation," Tech. Rep. [Online]. Available: http://www.xilinx.com/support/documentation/virtex-6.htm

[15] Nvidia Corporation, "NVIDIA GeForce GTX 580 GPU Data Sheet," Tech. Rep.

[16] Xilinx Corporation, "Xilinx Xcell Journal 2010 Customer Innovation Special Issue," Tech. Rep. [Online]. Available: http://www.xilinx.com/publications/archives/xcell/Xcell-customer-innovation-2010.pdf

[17] Xilinx Corporation, "Xilinx XPower Estimator User Guide UG440 (v13.4)," Tech. Rep.

[18] Micron Corporation, "Calculating Memory System Power for DDR3 TN-41-01," Tech. Rep. [Online]. Available: http://www.micron.com/products/support/power-calc

[19] G. Marcus, W. Gao, A. Kugel, and R. Manner, "The MPRACE framework: An open source stack for communication with custom FPGA-based accelerators," in Southern Conference on Programmable Logic, 2011, pp. 155–160.

[20] N. V. Ruiter, G. Goebel, L. Berger, M. Zapf, and H. Gemmeke, "Realization of an optimized 3D USCT," vol. 7968, no. 1, p. 796805, 2011.

[21] N. V. Ruiter, G. F. Schwarzenberg, M. Zapf, and H. Gemmeke, "Improvement of 3D Ultrasound Computer Tomography Images by Signal Pre-Processing," in Proc. of IEEE Ultrasonics Symposium, 2008, pp. 852–855.

[22] S. Doctor, T. Hall, and L. Reid, "SAFT: the evolution of a signal processing technology for ultrasonic testing," NDT International, vol. 19, no. 3, pp. 163–167, 1986.

[23] M. Birk, S. Koehler, M. Balzer, M. Huebner, N. Ruiter, and J. Becker, "FPGA-based embedded signal processing for 3-D ultrasound computer tomography," IEEE Transactions on Nuclear Science, vol. 58, no. 4, pp. 1647–1651, 2011.

[24] M. Birk, M. Zapf, M. Balzer, N. Ruiter, and J. Becker, "A comprehensive comparison of GPU- and FPGA-based acceleration of reflection image reconstruction for 3D ultrasound computer tomography," Journal of Real-Time Image Processing, in press.

[25] S. L. Hahn, Hilbert Transforms in Signal Processing, ser. The Artech House Signal Processing Library. Artech House, 1996.

[26] S. J. Norton and M. Linzer, "Ultrasonic reflectivity tomography: Reconstruction with circular transducer arrays," Ultrasonic Imaging, vol. 1, no. 2, pp. 154–184, 1979.