Design and Implementation of GPU-based SAR Image Processor

Najeeb Ahmad, Master Thesis Presentation, May 2012. Supervisor: Dr. Sun Jinping. Design and Implementation of GPU-based SAR Image Processor. School of Electronic Information Engineering, Beihang University, Beijing, China.

Upload: najeeb-ahmad

Post on 11-Apr-2017


TRANSCRIPT

Page 1: Design and implementation of GPU-based SAR image processor

Design and Implementation of GPU-based SAR Image Processor
Najeeb Ahmad
Master Thesis Presentation, May 2012
Supervisor: Dr. Sun Jinping
School of Electronic Information Engineering, Beihang University, Beijing, China

Page 2: Design and implementation of GPU-based SAR image processor

Contents
1. Introduction
2. GPU Computing
3. SAR Processing
4. Implementation
5. Conclusion & Future Work

Page 3: Design and implementation of GPU-based SAR image processor

1. Introduction
• Problem
• Motivation
• Objective
• Methodology

Page 4: Design and implementation of GPU-based SAR image processor

PROBLEM
Synthetic aperture radar (SAR) data processing is computationally intensive and time consuming on conventional CPUs. Given the growing use of GPUs for scientific computing, the task is to accelerate the simplified range Doppler SAR processing algorithm on a GPU using modern GPGPU technology, aiming for real-time or near-real-time performance, and to evaluate the GPU's suitability for SAR processing.

Page 5: Design and implementation of GPU-based SAR image processor

MOTIVATION
• Computationally intensive and time-consuming nature of SAR processing algorithms.
• Inherent parallelism in most SAR processing algorithms.
• Advent of modern GPGPU technology and availability of commodity GPUs as general-purpose computation engines.
• Architectural parallelism and ample hardware resources in modern GPUs, which make them especially useful for handling large data volumes and for parallel SAR algorithm implementation.

Page 6: Design and implementation of GPU-based SAR image processor

OBJECTIVE
To implement and accelerate the simplified range Doppler SAR processing algorithm on a modern NVIDIA Tesla GPU using CUDA and MATLAB-GPU capabilities.
The research explores the following areas:
• Algorithm adaptation for parallel implementation.
• Suitability of MATLAB for algorithm implementation.
• Suitability of CUDA for algorithm implementation.
• Comparison of CPU, CUDA and MATLAB-GPU implementations.
• GPU as a SAR processing platform.

Page 7: Design and implementation of GPU-based SAR image processor

METHODOLOGY
• Algorithm implementation and verification on an Intel Xeon CPU using MATLAB.
• Identification of parallelizable portions of the algorithm.
• Algorithm implementation on the Tesla C1060 GPU using MATLAB's native GPU capabilities.
• Algorithm implementation on the Tesla C1060 GPU using CUDA.
• Analysis of the CPU, MATLAB-GPU and CUDA implementations.

Page 8: Design and implementation of GPU-based SAR image processor

2. GPU Computing
• Introduction to GPU Computing
• GPGPU: Brief History
• NVIDIA CUDA
• Writing efficient code

Page 9: Design and implementation of GPU-based SAR image processor

Introduction to GPU Computing
• Use of Graphics Processing Units (GPUs) for general-purpose computing applications.
• CPU: one, four or eight cores. Capable of handling a few threads. Suitable for serial code.
• GPU: hundreds of cores. Capable of handling hundreds of threads. Suitable for parallel code.

Page 10: Design and implementation of GPU-based SAR image processor

Introduction to GPU Computing
• GPU computing model: a heterogeneous computing model that employs both CPU and GPU, with serial computing on the CPU and parallel computing on the GPU.

Page 11: Design and implementation of GPU-based SAR image processor

GPGPU: Brief History
• First use of the GPU as a general-purpose computing device around 1999-2000, using graphics APIs. Large performance gains were observed, but the approach remained unpopular because programming was tedious.
• Introduction of NVIDIA's "CUDA" and AMD's "Stream Computing" in 2007 marked the beginning of the modern GPGPU era. Other vendors introduced their own GPGPU systems.
• NVIDIA's CUDA is gaining popularity due to its maturity and performance.

Page 12: Design and implementation of GPU-based SAR image processor

NVIDIA CUDA
• Compute Unified Device Architecture.
• Comprises an Instruction Set Architecture (ISA) and a parallel compute engine in the GPU, programmable with high-level languages extended for GPU computing.
• The CUDA framework has two parts: hardware and software. From the software perspective, CUDA means C/C++ and FORTRAN extended to support GPU computing.
• CUDA is a "Single Instruction Multiple Thread" (SIMT) architecture.

Page 13: Design and implementation of GPU-based SAR image processor

CUDA Hardware
• Streaming multiprocessor (SM): basic computing unit of the GPU. Comprises eight streaming processors (SPs) and memory. GPUs differ in the number of SMs and in SP clock frequency.
[Figure: one streaming multiprocessor with eight SPs, two SFUs, a multithreaded instruction unit (MT IU) and shared memory]

Page 14: Design and implementation of GPU-based SAR image processor

CUDA Memory Architecture
• Understanding the memory architecture is critical for writing efficient CUDA programs.
• All CUDA-enabled hardware has the following types of memory:
  • Global memory.
  • Shared memory and registers.
  • Texture memory and texture cache.
  • Constant memory and constant cache.
  • Local memory for register spilling.
[Figure: GPU with streaming multiprocessors SM 1 to SM n, each containing SPs, shared memory, a texture cache and a constant cache, all connected to global memory (RAM), which also holds local, texture and constant memory]

Page 15: Design and implementation of GPU-based SAR image processor

NVIDIA Tesla C1060 GPU
PCI Express 2.0 compliant computing processor board based on the NVIDIA Tesla T10 graphics processing unit, targeted at HPC applications.
Feature highlights:
• 30 SMs = 240 SPs.
• SP clock = 1.296 GHz.
• 4 GB GDDR3 memory with 102 GB/s bandwidth.
• IEEE 754 single- and double-precision floating point compliant.
• 933 GFLOPS single-precision and 78 GFLOPS double-precision performance.
• Compute capability: 1.3.
• Supported by MATLAB for GPU computing.
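The core count and the peak GFLOPS figures on this slide can be cross-checked with simple arithmetic. The sketch below is only a sanity check: the per-clock issue rates (three single-precision operations per SP per clock, and one double-precision fused multiply-add unit per SM) are assumptions that are not stated on the slide.

```python
# Cross-check of the Tesla C1060 headline figures quoted on the slide.
SM_COUNT = 30          # streaming multiprocessors (from the slide)
SP_PER_SM = 8          # streaming processors per SM (from the slide)
SP_CLOCK_GHZ = 1.296   # SP clock (from the slide)

# Assumed issue rates, NOT stated on the slide:
SP_FLOPS_PER_CLOCK = 3  # multiply-add plus one extra multiply per SP
DP_UNITS_PER_SM = 1     # one double-precision unit per SM
DP_FLOPS_PER_CLOCK = 2  # fused multiply-add counts as two operations

sp_cores = SM_COUNT * SP_PER_SM
peak_sp_gflops = sp_cores * SP_FLOPS_PER_CLOCK * SP_CLOCK_GHZ
peak_dp_gflops = SM_COUNT * DP_UNITS_PER_SM * DP_FLOPS_PER_CLOCK * SP_CLOCK_GHZ

print(sp_cores, round(peak_sp_gflops, 2), round(peak_dp_gflops, 2))
```

Under these assumptions the results round to the 240 SPs, 933 GFLOPS and 78 GFLOPS listed above.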

Page 16: Design and implementation of GPU-based SAR image processor

CUDA Programming Model
• At its core are thread groups, shared memory and barrier synchronization.
• Provides coarse-grained data and task parallelism and fine-grained data and thread parallelism, giving expressivity and scalability.
• Thread hierarchy: grid, blocks, threads.
• Kernels: functions executed on the device (GPU) in parallel threads.
• CUDA provides APIs to launch kernels in parallel threads and to synchronize them.
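The grid/block/thread hierarchy can be illustrated without a GPU. The following is a minimal serial emulation in Python, not CUDA code: the function names are hypothetical, and the loops stand in for threads that would run in parallel on the device. It shows how each thread derives a global index from its block index and thread index, and why a bounds guard is needed when the data size is not a multiple of the block size.

```python
def global_thread_ids(grid_dim, block_dim):
    """Yield (block_idx, thread_idx, global_id) for a 1-D launch.

    Serial stand-in for a kernel launch; on a GPU these run in parallel.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            yield block_idx, thread_idx, block_idx * block_dim + thread_idx


def vector_scale(data, factor, block_dim=4):
    """Emulate a kernel in which each thread scales one element."""
    n = len(data)
    grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n
    out = list(data)
    for _, _, gid in global_thread_ids(grid_dim, block_dim):
        if gid < n:  # guard: the last block may have idle threads
            out[gid] = data[gid] * factor
    return out


result = vector_scale([1.0, 2.0, 3.0, 4.0, 5.0], 2.0)
print(result)
```

Here five elements with a block size of four need two blocks, so three of the eight emulated threads do no work.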

Page 17: Design and implementation of GPU-based SAR image processor

Processing Flow
• Copy input data from CPU to GPU memory.
• Load the GPU program and execute it, caching results on the device.
• Copy results from GPU to CPU.
[Figure: host (CPU and RAM) connected to device (GPU with global, constant and texture memory)]

Page 18: Design and implementation of GPU-based SAR image processor

Writing Efficient Code
High priority considerations
• Minimum CPU-GPU transfers.
• Use of coalesced data transfers.
• Use of shared memory instead of global memory whenever possible.
• Avoiding different execution paths within a warp.
Medium priority considerations
• Access to shared memory should be planned to avoid serialization.
• Redundant data transfers from global memory should be avoided.

Page 19: Design and implementation of GPU-based SAR image processor

Writing Efficient Code
Medium priority considerations (continued)
• Threads per block should be a multiple of 32.
• Use of the fast math library whenever possible.
Low priority considerations
• Use of zero-copy operations.
• For kernels with long argument lists, some arguments should be placed in constant memory.
• Expensive modulo and division operations should be avoided in favor of shift operations whenever possible.
• Automatic conversion of double to float should be avoided.
• Loop unrolling should be used whenever possible.
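Two of the considerations above lend themselves to a short illustration: replacing division and modulo by a power of two with a shift and a mask, and rounding a thread count up to a multiple of the warp size of 32. The sketch below uses Python only to show the arithmetic; the helper names are hypothetical, and the shift/mask identity holds for non-negative integers.

```python
WARP_SIZE = 32


def div_mod_pow2(x, shift):
    """Quotient and remainder of non-negative x by 2**shift via shift and mask."""
    return x >> shift, x & ((1 << shift) - 1)


def round_up_to_warp(threads):
    """Smallest multiple of the warp size that is >= threads."""
    return ((threads + WARP_SIZE - 1) // WARP_SIZE) * WARP_SIZE


# The shift/mask form agrees with ordinary division and modulo by 32.
for x in range(1000):
    assert div_mod_pow2(x, 5) == (x // 32, x % 32)

print(div_mod_pow2(100, 5), round_up_to_warp(100))
```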

Page 20: Design and implementation of GPU-based SAR image processor

3. SAR Processing
• What is Synthetic Aperture Radar
• SAR Processing
• Processing Algorithms
• Basic RDA
• Simplified RDA

Page 21: Design and implementation of GPU-based SAR image processor

What is Synthetic Aperture Radar
• An active microwave remote sensing imaging system.
• Employs the long-range propagation characteristics of radar and complex signal processing techniques to produce high-resolution images.
• High resolution is achieved by synthesizing a long antenna aperture through signal processing techniques.
Pros (in comparison with optical systems):
• All-weather, day-and-night operation.
• No effects from atmospheric constituents.
• Sensitivity to dielectric properties (can image ice, biomass etc.).
• Sensitivity to surface roughness (oceans, wind speed etc.).

Page 22: Design and implementation of GPU-based SAR image processor

What is Synthetic Aperture Radar
Pros (continued):
• Accurate measurement of distance.
• Sensitivity to man-made objects.
• Sensitivity to target structure.
• Subsurface penetration.
Cons:
• Complex interactions (difficult to visualize and understand).
• Speckle effects (difficult visual interpretation).
• Topographic effects.

Page 23: Design and implementation of GPU-based SAR image processor

SAR Processing
• A set of procedures to obtain an interpretable image from raw data that is spread in the azimuth and range directions.
• In range, the data is spread by the duration of the transmitted FM pulse.
• In azimuth, the data is spread by the duration for which a point target is illuminated by the radar beam.
• SAR processing compresses this data, taking into account range cell migration, earth curvature, earth rotation and aircraft/spacecraft attitude noise, to produce the final image.
• Given the nature of the SAR system and its signals, signal processing rather than image processing provides the appropriate tools for SAR processing.

Page 24: Design and implementation of GPU-based SAR image processor

SAR Processing Algorithms
Mainstream SAR processing algorithms include:
• Range Doppler algorithm (RDA): high-resolution images for low squint and for relatively small aperture sizes. Very popular.
• Chirp scaling algorithm (CSA): two-dimensional operations with range independence, followed by range corrections in the range Doppler domain.
• Omega-K algorithm (ωKA): efficient and accurate in the two-dimensional frequency domain.
• SPECAN algorithm: good for medium- to low-resolution requirements.

Page 25: Design and implementation of GPU-based SAR image processor

Range Doppler Algorithm
Versions of range Doppler:
• Basic RDA
• RDA with accurate SRC
• RDA with approximate SRC
• Simplified range Doppler

Page 26: Design and implementation of GPU-based SAR image processor

Basic RDA
Processing chain: Raw data → Range compression → Azimuth FFT → RCMC → Azimuth compression → Azimuth IFFT and look summation → Final image
• Range compression: range FFT, matched filter multiply, range IFFT.
• Azimuth FFT: data in range Doppler domain.
• RCMC: interpolation operation in range Doppler domain.
• Azimuth compression: azimuth matched filter multiply.
• Azimuth IFFT: brings the signal back into the time domain.
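The range compression step (range FFT, matched filter multiply, range IFFT) can be sketched on a toy signal. The example below is an illustration under stated assumptions, not the thesis implementation: it uses a slow pure-Python DFT in place of an FFT library, a hypothetical 16-sample linear FM pulse, and a single noise-free echo delayed by 20 samples. After compression the energy of the spread pulse collapses to a peak at the echo delay.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct O(N^2) DFT; stands in for an FFT library call."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def chirp(num_samples, rate):
    """Baseband linear FM pulse exp(j*pi*rate*t^2), t in samples about zero."""
    half = num_samples // 2
    return [cmath.exp(1j * math.pi * rate * (m - half) ** 2)
            for m in range(num_samples)]


def range_compress(echo, pulse):
    """FFT, multiply by conjugate pulse spectrum (matched filter), IFFT."""
    padded = pulse + [0j] * (len(echo) - len(pulse))
    echo_f = dft(echo)
    filt_f = [p.conjugate() for p in dft(padded)]
    return dft([e * h for e, h in zip(echo_f, filt_f)], inverse=True)


pulse = chirp(16, 0.05)   # hypothetical pulse length and FM rate
delay = 20                # hypothetical target delay in samples
echo = [0j] * 64
for m, s in enumerate(pulse):
    echo[delay + m] = s   # echo is a delayed copy of the pulse

compressed = range_compress(echo, pulse)
peak_index = max(range(len(compressed)), key=lambda i: abs(compressed[i]))
print(peak_index)
```

On a GPU each range line would be compressed independently, which is what makes this step well suited to parallel execution.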

Page 27: Design and implementation of GPU-based SAR image processor

Simplified RDA
For narrower swath widths and medium resolution requirements, RCM can be assumed independent of range.
Processing chain: Raw data → Pre-filtering → Range compression → Azimuth FFT → RCMC → Range IFFT → Azimuth compression → Azimuth IFFT and look summation → Final image
• Pre-filtering: removes the Doppler centroid.
• Range compression: range FFT, matched filter multiply (no range IFFT).
• Azimuth FFT: both range and azimuth in frequency domain.
• RCMC: RCM phase function multiply with each range line.
• Range IFFT: data in range Doppler domain.
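The reason RCMC reduces to a phase function multiply here is the Fourier shift theorem: multiplying the range spectrum of a line by a linear phase ramp shifts the line in range. The sketch below demonstrates this on a toy range line with a hypothetical constant migration of three samples; in the real algorithm the migration depends on azimuth frequency and the radar parameters, which are not modelled here. A pure-Python DFT again stands in for an FFT.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct O(N^2) DFT; stands in for an FFT library call."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def correct_migration(line, migration_samples):
    """Shift a range line toward earlier samples using a phase ramp."""
    n = len(line)
    spectrum = dft(line)
    ramp = []
    for k in range(n):
        freq = k if k < n / 2 else k - n  # signed frequency index
        ramp.append(cmath.exp(2j * math.pi * freq * migration_samples / n))
    return dft([s * r for s, r in zip(spectrum, ramp)], inverse=True)


line = [0j] * 16
line[8] = 1 + 0j                  # target displaced from cell 5 to cell 8
corrected = correct_migration(line, 3)
peak = max(range(len(corrected)), key=lambda i: abs(corrected[i]))
print(peak)
```

Because the same ramp applies to the whole range line, the correction is a single element-wise multiply per line rather than an interpolation.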

Page 28: Design and implementation of GPU-based SAR image processor

4. Implementation
• Hardware resources
• Software resources
• CPU Implementation
• MATLAB GPU Implementation
• CUDA Implementation
• Result Comparison

Page 29: Design and implementation of GPU-based SAR image processor

Hardware resources
GPU
• Name: NVIDIA Tesla C1060
• Number of cores: 240
• SP clock: 1.296 GHz
• Memory: 4 GB GDDR3
• Maximum memory bandwidth: 102 GB/s
• Memory interface: 512 bit, PCI Express
• GFLOPS: 933 single precision, 78 double precision
CPU
• Name: Intel Xeon E5504
• CPU clock: 2 GHz
• Number of cores: 4
• System memory: 4 GB
• DDR3 clock: 800 MHz
• Maximum memory bandwidth: 19.2 GB/s
• Memory type: DDR3 PC3
• PCI slot: PCI Express

Page 30: Design and implementation of GPU-based SAR image processor

Software resources
CPU
• Windows 7 Ultimate 64-bit
• MATLAB release 2010b
• Visual Studio 2008 SP1
GPU
• CUDA Toolkit 4.1
• MATLAB release 2010b
• NVIDIA Parallel Nsight
• Visual Profiler
• CUDA MEMCHECK
• CUFFT library

Page 31: Design and implementation of GPU-based SAR image processor

RADARSAT-I Data
• CEOS format.
• Raw data must be extracted from the CEOS data before the SAR processing algorithm can be applied.
Table: RADARSAT-I data parameters
Parameter | Value | Units
Sampling rate | 32.317 | MHz
Range FM rate | 0.72135 | MHz/µs
Pulse duration | 41.74 | µs
Radar frequency | 5.3 | GHz
Radar wavelength | 0.05657 | m
Pulse repetition frequency | 1256.98 | Hz
Effective radar velocity | 7062 | m/s
Azimuth FM rate | 1733 | Hz/s
Doppler centroid | -6900 | Hz
[Figure: CEOS data → CEOS data extraction utility → raw SAR data]
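Several quantities follow directly from the parameter table and are useful when sizing the range matched filter. The sketch below derives the wavelength, the chirp bandwidth, the number of samples per transmitted pulse and the range oversampling ratio. The speed-of-light constant and the variable names are additions for illustration; the input values are the ones listed in the table.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0   # physical constant, not on the slide

sampling_rate_hz = 32.317e6              # from the parameter table
range_fm_rate_hz_per_s = 0.72135e12      # 0.72135 MHz/us expressed in Hz/s
pulse_duration_s = 41.74e-6
radar_frequency_hz = 5.3e9

wavelength_m = SPEED_OF_LIGHT_M_PER_S / radar_frequency_hz
chirp_bandwidth_hz = range_fm_rate_hz_per_s * pulse_duration_s
samples_per_pulse = sampling_rate_hz * pulse_duration_s
oversampling_ratio = sampling_rate_hz / chirp_bandwidth_hz

print(round(wavelength_m, 5))
print(round(chirp_bandwidth_hz / 1e6, 3), "MHz")
print(round(samples_per_pulse, 1), "samples")
print(round(oversampling_ratio, 3))
```

The computed wavelength agrees with the tabulated 0.05657 m to within rounding, the chirp bandwidth comes out at roughly 30.1 MHz, and each pulse spans roughly 1349 samples.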

Page 32: Design and implementation of GPU-based SAR image processor

SAR Processing GUI
Functions
• CEOS data extraction.
• MATLAB-CPU SAR processing.
• MATLAB-GPU SAR processing.
• CUDA input/output manipulation.
• CUDA program execution.

Page 33: Design and implementation of GPU-based SAR image processor

CPU Implementation
• Implemented using MATLAB.
• FFT/IFFT using standard MATLAB functions.

Page 34: Design and implementation of GPU-based SAR image processor

CPU Processed SAR image

A 2048 x 4096 SAR image using CPU based implementation

Page 35: Design and implementation of GPU-based SAR image processor

MATLAB-GPU Implementation
• MATLAB has supported GPU computing since release 2010b.
• Implemented using native MATLAB-GPU functions only (no CUDA kernel calls).
• A vectorization strategy is employed to implement vector-matrix multiplications on the GPU.
• All FFT/IFFTs are performed using MATLAB-GPU FFT/IFFT support functions.
[Figure: three matrices, each shown as Column 1, Column 2, ..., Column n, illustrating the vectorized vector-matrix multiplication]
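One plausible reading of the vectorization strategy is that a filter vector is replicated into every column of a matrix of the same size as the data, so that the vector-matrix multiplication becomes a single element-wise product. The sketch below illustrates that reading with plain Python lists; it is an assumption about the approach rather than the thesis code, and the helper names are hypothetical.

```python
def replicate_columns(vector, num_columns):
    """Build a matrix whose every column is a copy of `vector`."""
    return [[value] * num_columns for value in vector]


def elementwise_multiply(a, b):
    """Element-wise product of two equally sized matrices."""
    return [[x * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


data = [[1, 2, 3],
        [4, 5, 6]]            # 2 samples per column, 3 columns
filter_vector = [10, 100]     # one filter coefficient per sample

replicated = replicate_columns(filter_vector, num_columns=3)
product = elementwise_multiply(data, replicated)
print(product)
```

The replicated matrix is as large as the data itself, which is consistent with the extra memory demand of vectorization noted in the conclusions.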

Page 36: Design and implementation of GPU-based SAR image processor

MATLAB-GPU Implementation
• GPU memory constraints limit the maximum image size that can be processed.
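A rough footprint estimate shows where such a limit comes from. The sketch below computes the size of one 2048 x 4096 complex array and how many such arrays fit into 4 GB of device memory. The element sizes (8 bytes for complex single, 16 bytes for complex double) are assumptions about the data type used, and the estimate ignores FFT work buffers and other overhead.

```python
MIB = 1024 ** 2
DEVICE_MEMORY_BYTES = 4 * 1024 ** 3   # 4 GB device memory, treated as 4 GiB


def array_bytes(rows, cols, bytes_per_element):
    """Storage needed for one dense rows x cols array."""
    return rows * cols * bytes_per_element


def arrays_that_fit(rows, cols, bytes_per_element):
    """How many such arrays fit in device memory, ignoring overhead."""
    return DEVICE_MEMORY_BYTES // array_bytes(rows, cols, bytes_per_element)


single_mib = array_bytes(2048, 4096, 8) / MIB    # assumed complex single
double_mib = array_bytes(2048, 4096, 16) / MIB   # assumed complex double
fit_double = arrays_that_fit(2048, 4096, 16)

print(single_mib, double_mib, fit_double)
```

Since vectorized code holds the data, the replicated filter matrix and the result at the same time, the usable image size is well below what a single array would suggest.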

Page 37: Design and implementation of GPU-based SAR image processor

MATLAB-GPU Implementation
• Speedups of up to 21 times achieved compared with the CPU implementation.

Page 38: Design and implementation of GPU-based SAR image processor

MATLAB-GPU Implementation

A 2048 x 4096 SAR image using MATLAB-GPU based implementation

Page 39: Design and implementation of GPU-based SAR image processor

MATLAB-GPU Implementation
Advantages
• Quick and easy to implement.
• Sufficient speedups obtained with little effort.
• Little knowledge of GPU hardware and no knowledge of optimization techniques required.
Disadvantages
• Currently, only a limited number of MATLAB functions are supported on the GPU.
• Not all overloads of a function are available for the GPU.
• Less control over hardware resources and memory.
• Few optimization options.

Page 40: Design and implementation of GPU-based SAR image processor

CUDA Implementation
Strategy
• Signal data read as a binary file.
• Vectors and matched filters calculated on the CPU.
• Vectors and signal data transferred to the GPU.
• The following kernels executed in order on the GPU:
  • Pre-filtering kernel
  • Range compression kernel
  • RCMC kernel
  • Azimuth compression kernel
  • Image pixel calculation kernel
• Data transferred from GPU to CPU and saved on disk as an image.

Page 41: Design and implementation of GPU-based SAR image processor

Optimization considerations
• Chosen block size = 8 × 8 = 64 threads. Conforms with memory coalescing requirements.
• Constant variables stored in constant memory.
• Local variables and phase functions calculated inside the kernel whenever possible to reduce global memory access.
• CPU-GPU data transfer kept to a minimum by transferring data from CPU to GPU at the beginning and from GPU to CPU at the end of the algorithm.
• CUFFT's cufftPlanMany() plan used for FFT/IFFTs along data columns.
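For the 8 × 8 block size above, the grid dimensions needed to cover an image follow from ceiling division. The sketch below works this out for the 2048 x 4096 image size used in the results; mapping the x dimension to columns and the y dimension to rows is an assumption, since the slide does not state the layout, and the function name is hypothetical.

```python
WARP_SIZE = 32


def launch_config(rows, cols, block=(8, 8)):
    """Grid size (x, y) and threads per block needed to cover rows x cols.

    Assumes block x covers columns and block y covers rows.
    """
    grid_x = (cols + block[0] - 1) // block[0]
    grid_y = (rows + block[1] - 1) // block[1]
    return (grid_x, grid_y), block[0] * block[1]


grid, threads_per_block = launch_config(2048, 4096)
warps_per_block = threads_per_block // WARP_SIZE
total_threads = grid[0] * grid[1] * threads_per_block

print(grid, threads_per_block, warps_per_block, total_threads)
```

With 64 threads per block, each block is exactly two warps, in line with the earlier guideline that threads per block should be a multiple of 32.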

Page 42: Design and implementation of GPU-based SAR image processor

CUDA Implementation Results

A 2048 x 4096 SAR image using CUDA based implementation

Page 43: Design and implementation of GPU-based SAR image processor

CUDA Implementation Results

Page 44: Design and implementation of GPU-based SAR image processor

CUDA Implementation Results

Page 45: Design and implementation of GPU-based SAR image processor

CUDA/MATLAB-GPU/MATLAB-CPU Computation Time Comparison

Page 46: Design and implementation of GPU-based SAR image processor

MATLAB-GPU/CUDA Computation Time Comparison

Page 47: Design and implementation of GPU-based SAR image processor

MATLAB-GPU/CUDA speedup comparison
• Speedups of up to 53 times achieved with CUDA, compared with a maximum speedup of 21 times with MATLAB-GPU.

Page 48: Design and implementation of GPU-based SAR image processor

5. Conclusions & Future Work

Page 49: Design and implementation of GPU-based SAR image processor

Conclusions
Feasibility of GPU for SAR processing
• The amount of data, the computational effort and the inherent algorithm parallelism make SAR processing well suited to the GPU.
• The Tesla C1060 GPU offers enough memory to handle various common SAR image sizes.
• Cooling the GPU may be a challenge in some environments.
• The scalability of CUDA will be an advantage when porting existing SAR code to newer GPUs.
• GPUs might not be suitable where customizable hardware is required or where military hardware standards must be adhered to.

Page 50: Design and implementation of GPU-based SAR image processor

Conclusions
MATLAB-GPU based SAR processing
• Significant speedups compared with the CPU.
• Quick and easy to implement.
• Has some limitations:
  • Currently has limited function support for the GPU. Expected to improve with future MATLAB releases.
  • The vectorization strategy needs more memory. Future releases promise to remove the need for vectorization (e.g. bsxfun in release 2012a).
  • Less control over GPU resources (memory etc.).
CUDA SAR processing
• CUDA: flexible and scalable, with a minimal learning curve.
• More control over GPU resources.
• Optimization strategies can be applied.
• Faster and more memory efficient than the MATLAB implementation.

Page 51: Design and implementation of GPU-based SAR image processor

Conclusions
Downsides of GPU
• Significant testing and verification effort might be required if the GPU hardware has to be upgraded (because the old hardware has become obsolete).
• The proprietary nature of CUDA might be problematic if the company discontinues CUDA or its support.

Page 52: Design and implementation of GPU-based SAR image processor

Future work
• CUDA kernels can be called from MATLAB code using MATLAB's CUDA kernel calling support.
• The MATLAB GPU implementation can be improved as newer and better functions become available.
• A C/C++ based CPU implementation can be developed to better judge MATLAB-CPU/CUDA performance.
• Other SAR processing algorithms can be implemented using the framework laid out in this project.

Page 53: Design and implementation of GPU-based SAR image processor

Q & A

Page 54: Design and implementation of GPU-based SAR image processor

Thank You