Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Fisnik Kraja, Alin Murarasu, Georg Acher, Arndt Bode
Chair of Computer Architecture, Technische Universität München, Germany
2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana

DESCRIPTION

Presented at the 2012 IEEE Aerospace Conference

TRANSCRIPT

Page 1: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Fisnik Kraja, Alin Murarasu, Georg Acher, Arndt Bode
Chair of Computer Architecture, Technische Universität München, Germany

2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana

Page 2: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

The main points

• The motivation statement
• Description of the SAR 2DFMFI application
• Description of the benchmarked architecture
• Results of sequential optimizations and thread parallelization on the CPU
• Porting SAR Image Reconstruction to CUDA
• Comparison of CPU and GPU results
• Summary and conclusions

2/24/2012 2

Page 3: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Motivation

• On-board space-based processing should be increased.

• Future space applications have high performance requirements:
  – HRWS SAR: 1 Tera FLOPS, 603.1 Gbit/s throughput

• Heterogeneous (CPU+GPU) architectures might be the solution.

• Novel accelerator designs integrate CPUs and graphics processing modules in one chip.


Page 4: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

SAR Image Reconstruction

Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors.

SAR Sensor Processing (SSP): the reconstructed SAR image is obtained by applying 2D Fourier Matched Filtering and Interpolation (2DFMFI) to the raw data.

Problem sizes (Raw Data -> Reconstructed Image):

SCALE | mc   | n     | m     | nx
10    | 1600 | 3290  | 3808  | 2474
20    | 3200 | 6460  | 7616  | 4926
30    | 4800 | 9630  | 11422 | 7380
60    | 9600 | 19140 | 22844 | 14738


Page 5: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

SAR Sensor Processing Profiling

SSP Processing Step | Computation Type | Execution Time (%) | Size & Layout
1. Filter the echoed signal | 1d_Fw_FFT | 1.1 | [mc x n]
2. Transposition | | 0.3 | [n x mc]
3. Signal compression along slow-time | CEXP, MAC | 1.1 | [n x mc]
4. Narrow-bandwidth polar format reconstruction along slow-time | 1d_Fw_FFT | 0.5 | [n x mc]
5. Zero-pad the spatial frequency domain's compressed signal | | 0.4 | [n x mc]
6. Transform back the zero-padded spatial spectrum | 1d_Bw_FFT | 5.2 | [n x m]
7. Slow-time decompression | CEXP, MAC | 2.3 | [n x m]
8. Digitally-spotlighted SAR signal spectrum | 1d_Fw_FFT | 5.2 | [n x m]
9. Generate the Doppler domain representation of the reference signal's complex conjugate | CEXP, MAC | 3.4 | [n x m]
10. Circumvent edge processing effects | 2D-FFT_shift | 0.4 | [n x m]
11. 2D interpolation from a wedge to a rectangular area: input[n x m] -> output[nx x m] | MAC, Sin, Cos | 69 | [nx x m]
12. Transform from the Doppler domain image into a spatial domain image: IFFT[nx x m] -> Transpose -> FFT[m x nx] | 1d_Bw_FFT, 1d_Bw_FFT | 10 | [m x nx]
13. Transform into a viewable image | CABS | 1.1 | [m x nx]


Page 6: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

The Benchmarked Architecture

• The dual-socket ccNUMA node:
  – 2 Intel Nehalem CPUs (4 cores each) @ 2.13 GHz
  – 2 x 6 GB = 12 GB shared memory
  – 32 nm process
  – Board TDP = 120 W

• 2 accelerators, each with an NVIDIA Tesla C2070 GPU:
  – 14 Streaming Multiprocessors, 448 scalar cores @ 1.15 GHz
  – 6 GB of GDDR5 memory (5.25 GB available if ECC is enabled)
  – 40 nm process
  – Board TDP = 238 W

• CPUs and GPUs communicate through an Input/Output Controller over PCI Express 2.0 (up to 36 lanes).

Page 7: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

CPU Sequential Optimizations

Elapsed time in seconds and speedup over the GCC 4.6 -O0 baseline:

Configuration        | Elapsed time (s) | Speedup
GCC 4.6, -O0         | 1606.70          | 1
GCC 4.6, -O1         | 1241.03          | 1.294650
GCC 4.6, -O2         | 1201.60          | 1.337133
GCC 4.6, -O3         | 1208.66          | 1.329323
ICC 12.0, -O0        | 1060.80          | 1.514611
ICC 12.0, -O1        | 861.50           | 1.865002
ICC 12.0, -O2        | 773.50           | 2.077181
ICC 12.0, -O3        | 761.30           | 2.110468
Vectorization, SSE4  | 751.90           | 2.136853
Vectorization, F_t   | 582.83           | 2.756721
Vectorization, cexp  | 562.90           | 2.854325
FFTW, MEA            | 537.41           | 2.989709

Page 8: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

CPU Thread Parallelization

• The vectorized code is:
  – 27% faster in sequential execution
  – 16% faster in parallel execution

Configuration            | Elapsed time (s) | Elapsed time, vect. (s) | Speedup     | Speedup (vect.)
Sequential (best)        | 733.50           | 537.41                  | 1           | 1
OpenMP, fftw_threads     | 183.50           | 161.97                  | 3.997275204 | 3.317960116
OpenMP, 8 threads        | 122.50           | 103.06                  | 5.987755102 | 5.214535222
OpenMP, 16 threads (HT)  | 100.70           | 84.36                   | 7.284011917 | 6.370436226

• A very well optimized sequential code impacts the scalability of the application.

Page 9: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Introduction to CUDA

• CUDA kernels are executed by parallel threads.

• A group of threads forms a thread block.
  – Shared memory among the threads in one block.
  – Exploiting the locality of the algorithms ensures performance.

• Thread blocks are mapped to SMs in warps (32 threads) that receive the same instruction (SIMD).
  – Branches impact the efficiency of the SIMD units.

• The limited amount of device memory brings the need for slow PCIe communications.


Page 10: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Porting SAR Application to CUDA

• 2D data tiling for loops:
  – Tile elements are computed by a block of threads.
  – Thread (tx, ty) in block (bx, by) computes row (by*TILE_DIM+ty) and column (bx*TILE_DIM+tx) of the data set.
  – The tiling technique increases the number of active blocks, thereby increasing the level of occupancy.
  – On the Tesla C2070 device: max 1024 threads per block, so TILE_DIM=32 (32x32=1024).


Page 11: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

CUDA Implementation Discussions

• The CUFFT library provides a simple interface for computing parallel FFTs.
  – Batch execution for multiple 1-dimensional transforms.
  – Drawback: the memory needed on the host side increases with the size of the transform and the number of configured transforms in the batch.

• Operations missing in CUDA:
  – Library functions like cexp() and cabs()
  – Atomic operations on floating-point variables

• Transcendental instructions execute efficiently on Special Function Units (SFUs):
  – sine
  – cosine
  – square root


Page 12: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Performance Results

• CPU vs GPU:
  – Better performance on the GPU
  – Better power efficiency on the CPU

• Small scale vs large scale:
  – For small-scale images (SCALE < 20), the data set fits completely in the GPU memory.
  – For large-scale images (SCALE > 30), the data set does not fit in the GPU memory.

Speedup over the sequential CPU version:

         | CPU_Seq | CPU 8 Threads | CPU 16 Threads | GPU
Scale=10 | 1       | 7.9474        | 8.8247         | 11.0488
Scale=20 | 1       | 7.6237        | 8.1752         | 10.6159
Scale=30 | 1       | 6.0354        | 7.0146         | 10.2855
Scale=60 | 1       | 5.2145        | 6.3704         | 10.2364
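The fits-in-memory claim can be sanity-checked from the problem sizes on slide 4, assuming one single-precision complex sample (8 bytes) per grid point (the actual number of working buffers in the application may differ):

```c
/* GiB occupied by one [rows x cols] array of single-precision complex samples. */
static double carray_gib(long rows, long cols) {
    return (double)rows * (double)cols * 8.0 / (1024.0 * 1024.0 * 1024.0);
}
```

At SCALE=60, a single [n x m] = [19140 x 22844] buffer already takes about 3.26 GiB, so a few working copies exceed the 5.25 GB available with ECC enabled; at SCALE=10 ([3290 x 3808]) one buffer stays under 0.1 GiB.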

Page 13: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Using both CPU and GPU for processing

• Programming heterogeneous systems is impacted by:
  – Data dependencies
  – Scheduling algorithms
  – System resources

• Frequent transfers between CPU and GPU should be avoided.

• Profiling is needed to identify the parts of the code that will benefit from executing on the GPU.

• In our case, it was decided to execute on the GPU only the interpolation loop (70% of the total execution time), in order to avoid transfers in steps like:
  – FFT_SHIFT
  – Transposition
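Offloading only a fraction of the work bounds the achievable speedup. A quick Amdahl-style check, using the 70% figure above (an illustrative helper, not from the paper; fraction p runs on the GPU with speedup s, the remainder on k CPU threads, ignoring transfer costs):

```c
/* Amdahl-style bound for a hybrid run: fraction p accelerated by factor s,
 * the remaining (1 - p) parallelized over k CPU threads. */
static double hybrid_speedup(double p, double s, double k) {
    return 1.0 / ((1.0 - p) / k + p / s);
}
```

With a serial remainder (k = 1), the bound stays below 1/(1 - 0.7) ≈ 3.3 no matter how fast the GPU is, which is why the remaining steps must stay thread-parallel on the CPU.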


Page 14: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Using Multiple GPU Devices

• OpenMP + CUDA: one OpenMP thread per device.
  – Separate GPU context per thread.
  – Each thread independently calls memory management functions and CUDA kernels.

• 2 approaches:
  – The same image is reconstructed by 2 GPUs.
    • Bottlenecks in the QPI (remote accesses) and PCIe links.
  – Separate images are reconstructed on 2 separate GPUs (pipelined version).
    • Reduced CPU <-> GPU data transfers.


Page 15: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Results Updated

Speedup over the sequential CPU version:

         | CPU_Seq | CPU 8 Threads | CPU 16 Threads | GPU     | GPU+CPU | 2 GPUs  | 2 GPUs Pipelined
Scale=10 | 1       | 7.9474        | 8.8247         | 11.0488 | 10.3740 | 2.3472  | 5.3086
Scale=20 | 1       | 7.6237        | 8.1752         | 10.6159 | 11.7166 | 5.7588  | 11.5306
Scale=30 | 1       | 6.0354        | 7.0146         | 10.2855 | 11.6952 | 8.8412  | 13.4404
Scale=60 | 1       | 5.5136        | 6.5883         | 10.2364 | 12.5270 | 11.3020 | 17.4938


Page 16: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Summary and Conclusions

• Porting the SAR application to CUDA requires knowledge of the underlying hardware and of the CUDA paradigm.

• For the SAR application, GPUs offer better performance than CPUs, but CPUs are more power efficient.

• Heterogeneous computing improves performance, but the Performance/Watt ratio is impacted by the number of CPU <-> GPU transfers.

• Static scheduling of CUDA kernels offers no flexibility in heterogeneous computing environments.

• When using multiple GPU devices, it is very important to reduce the number of CPU <-> GPU and GPU <-> GPU transfers.


Page 17: Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

Thank You!

Questions?

Fisnik Kraja
Chair of Computer Architecture
Technische Universität München
[email protected]