General Purpose Computing using Graphics Hardware

Hanspeter Pfister, Harvard University


Page 1: General Purpose Computing using Graphics Hardware

General Purpose Computing using Graphics Hardware

Hanspeter Pfister, Harvard University

Page 2: General Purpose Computing using Graphics Hardware

2

Acknowledgements

• Won-Ki Jeong, Harvard University
• Kayvon Fatahalian, Stanford University

Page 3: General Purpose Computing using Graphics Hardware

3

GPU (Graphics Processing Unit)

• PC hardware dedicated to 3D graphics
  – Massively parallel SIMD processor

• Performance pushed by game industry

NVIDIA SLI System

Page 4: General Purpose Computing using Graphics Hardware

4

GPGPU

• General-purpose computation on the GPU
  – Started in the computer graphics research community
  – Maps computational problems onto the graphics rendering pipeline

Image courtesy Jens Krueger, Aaron Lefohn, and Won-Ki Jeong

Page 5: General Purpose Computing using Graphics Hardware

5

Why GPU for computing?

• GPU is fast
  – Massively parallel
    • CPU: ~4 cores (16 SIMD lanes) @ 3.2 GHz (Intel Quad Core)
    • GPU: ~30 cores (240 SIMD lanes) @ 1.3 GHz (NVIDIA GT200)
  – High memory bandwidth
• Programmable
  – NVIDIA CUDA, DirectX Compute Shader, OpenCL
• High-precision floating point support
  – 64-bit floating point (IEEE 754)
• Inexpensive desktop supercomputer
  – NVIDIA Tesla C1060: ~1 TFLOPS @ ~$1,000

Page 6: General Purpose Computing using Graphics Hardware

6

FLOPS

Image Courtesy NVIDIA

Page 7: General Purpose Computing using Graphics Hardware

7

Memory Bandwidth

Image Courtesy NVIDIA

Page 8: General Purpose Computing using Graphics Hardware

8

GPGPU Biomedical Examples

Level-Set Segmentation (Lefohn et al.)

EM Image Processing (Jeong et al.)
Image Registration (Strzodka et al.)

CT/MRI Reconstruction (Sumanaweera et al.)

Page 9: General Purpose Computing using Graphics Hardware

9

Overview

1. GPU Architecture Overview
2. GPU Programming Overview
   – Programming Model
   – NVIDIA CUDA
   – OpenCL
3. Application Example
   – CUDA ITK

Page 10: General Purpose Computing using Graphics Hardware

1. GPU Architecture Overview

Kayvon Fatahalian, Stanford University

10

Page 11: General Purpose Computing using Graphics Hardware

11

What’s in a GPU?

[Block diagram: eight Compute Cores, texture (Tex) units, Input Assembly, Rasterizer, Output Blend, Video Decode, and a Work Distributor.]

Heterogeneous chip multi-processor (highly tuned for graphics). Which parts are HW and which are SW?

Page 12: General Purpose Computing using Graphics Hardware

12

CPU-“style” cores

[Diagram: a CPU-style core with Fetch/Decode, ALU (Execute), and Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache.]

Page 13: General Purpose Computing using Graphics Hardware

13

Slimming down

[Diagram: the slimmed-down core keeps only Fetch/Decode, ALU (Execute), and Execution Context.]

Idea #1: Remove components that help a single instruction stream run fast.

Page 14: General Purpose Computing using Graphics Hardware

14

Two cores (two threads in parallel)

[Diagram: two slimmed-down cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context, each running one thread of the compiled diffuseShader below.]

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

Page 15: General Purpose Computing using Graphics Hardware

15

Four cores (four threads in parallel)

[Diagram: four slimmed-down cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context.]

Page 16: General Purpose Computing using Graphics Hardware

16

Sixteen cores (sixteen threads in parallel)

[Diagram: sixteen slimmed-down cores, each with its own ALU.]

16 cores = 16 simultaneous instruction streams

Page 17: General Purpose Computing using Graphics Hardware

17

Instruction stream sharing

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

But… many threads should be able to share an instruction stream!

Page 18: General Purpose Computing using Graphics Hardware

18

Recall: simple processing core

[Diagram: Fetch/Decode, ALU (Execute), Execution Context.]

Page 19: General Purpose Computing using Graphics Hardware

19

Add ALUs

Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs: SIMD processing.

[Diagram: one Fetch/Decode unit feeding ALU 1 through ALU 8, eight execution contexts (Ctx), and Shared Ctx Data.]

Page 20: General Purpose Computing using Graphics Hardware

20

Modifying the code

Original compiled shader:

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

Processes one thread using scalar ops on scalar registers.

Page 21: General Purpose Computing using Graphics Hardware

21

Modifying the code

New compiled shader:

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul vec_o0, vec_r0, vec_r3
VEC8_mul vec_o1, vec_r1, vec_r3
VEC8_mul vec_o2, vec_r2, vec_r3
VEC8_mov vec_o3, l(1.0)

Processes 8 threads using vector ops on vector registers.

Page 22: General Purpose Computing using Graphics Hardware

22

Modifying the code

[Diagram: the same VEC8_diffuseShader; threads 1 through 8 map one-to-one onto ALU 1 through ALU 8, sharing a single Fetch/Decode unit and the Shared Ctx Data.]

Page 23: General Purpose Computing using Graphics Hardware

23

128 threads in parallel

16 cores = 128 ALUs = 16 simultaneous instruction streams

Page 24: General Purpose Computing using Graphics Hardware

24

But what about branches?

[Diagram: time (clocks) runs downward; ALU 1 through ALU 8 execute the shader below in lockstep.]

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

Page 25: General Purpose Computing using Graphics Hardware

25

But what about branches?

[Same diagram: after evaluating the condition, the eight lanes hold the mask T T T F F F F F.]

Page 26: General Purpose Computing using Graphics Hardware

26

But what about branches?

[Same diagram: lanes whose mask is F idle while the T lanes execute the if-branch, and vice versa for the else-branch.]

Not all ALUs do useful work! Worst case: 1/8 performance.

Page 27: General Purpose Computing using Graphics Hardware

27

But what about branches?

[Same diagram: both sides of the branch execute one after the other before all lanes resume the unconditional code.]

Page 28: General Purpose Computing using Graphics Hardware

28

Clarification

• Option 1: Explicit vector instructions
  – Intel/AMD x86 SSE, Intel Larrabee
• Option 2: Scalar instructions, implicit HW vectorization
  – HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
  – NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures

SIMD processing does not imply SIMD instructions.

In practice: 16 to 64 threads share an instruction stream.

Page 29: General Purpose Computing using Graphics Hardware

29

Stalls!

Texture access latency = 100s to 1000s of cycles.

We have removed the fancy caches and logic that help avoid stalls.

Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.

Page 30: General Purpose Computing using Graphics Hardware

30

But we have LOTS of independent threads.

Idea #3: Interleave processing of many threads on a single core to avoid stalls caused by high-latency operations.

Page 31: General Purpose Computing using Graphics Hardware

31

Hiding stalls

[Diagram: time (clocks) runs downward; threads 1 through 8 occupy the core, which has one Fetch/Decode unit, eight ALUs, and Shared Ctx Data.]

Page 32: General Purpose Computing using Graphics Hardware

32

Hiding stalls

[Diagram: the context storage is split into four groups (threads 1–8, 9–16, 17–24, 25–32) that take turns on the same eight ALUs.]

Page 33: General Purpose Computing using Graphics Hardware

33

Hiding stalls

[Timeline: group 1 (threads 1–8) runs until it stalls; the core then switches to a runnable group.]

Page 34: General Purpose Computing using Graphics Hardware

34

Hiding stalls

[Timeline: while group 1 is stalled, groups 2 through 4 run in turn.]

Page 35: General Purpose Computing using Graphics Hardware

35

Hiding stalls

[Timeline: each of the four groups alternates between stalled and runnable; whenever one group stalls, another runnable group fills the ALUs.]

Page 36: General Purpose Computing using Graphics Hardware

36

Throughput!

[Timeline: each group starts, runs, stalls, resumes, and finishes later than it would have running alone.]

Increase the run time of one group to maximize the throughput of many groups.

Page 37: General Purpose Computing using Graphics Hardware

37

Storing contexts

[Diagram: Fetch/Decode, eight ALUs, and a 32 KB pool of context storage.]

Page 38: General Purpose Computing using Graphics Hardware

38

Twenty small contexts

[Diagram: the context pool divided into twenty small contexts (1 through 20).]

(maximal latency-hiding ability)

Page 39: General Purpose Computing using Graphics Hardware

39

Twelve medium contexts

[Diagram: the context pool divided into twelve medium contexts (1 through 12).]

Page 40: General Purpose Computing using Graphics Hardware

40

Four large contexts

[Diagram: the context pool divided into four large contexts (1 through 4).]

(low latency-hiding ability)

Page 41: General Purpose Computing using Graphics Hardware

41

GPU block diagram key

• = single “physical” instruction stream fetch/decode (functional unit control)
• = SIMD programmable functional unit (FU), control shared with other functional units; one functional unit may contain multiple 32-bit “ALUs”
• = execution context storage
• = fixed-function unit
• = 32-bit mul-add unit
• = 32-bit multiply unit

Page 42: General Purpose Computing using Graphics Hardware

42

Example: NVIDIA GeForce GTX 280

• NVIDIA-speak:
  – 240 stream processors
  – “SIMT execution” (automatic HW-managed sharing of the instruction stream)
• Generic speak:
  – 30 processing cores
  – 8 SIMD functional units per core
  – 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock)
  – Best case: 240 mul-adds + 240 muls per clock
  – 1.3 GHz clock
  – 30 × 8 × (2 + 1) × 1.3 ≈ 933 GFLOPS
• Mapping data-parallelism to the chip:
  – Instruction stream shared across 32 threads
  – 8 threads run on the 8 SIMD functional units in one clock

Page 43: General Purpose Computing using Graphics Hardware

43

GTX 280 core

[Block diagram: ten texture (Tex) units, Zcull/Clip/Rast, Output Blend, a Work Distributor, and the array of processing cores.]

Page 44: General Purpose Computing using Graphics Hardware

44

Example: ATI Radeon 4870

• AMD/ATI-speak:
  – 800 stream processors
  – Automatic HW-managed sharing of the scalar instruction stream (like “SIMT”)
• Generic speak:
  – 10 processing cores
  – 16 SIMD functional units per core
  – 5 mul-adds per functional unit (5 × 2 = 10 flops/clock)
  – Best case: 800 mul-adds per clock
  – 750 MHz clock
  – 10 × 16 × 5 × 2 × 0.75 = 1.2 TFLOPS
• Mapping data-parallelism to the chip:
  – Instruction stream shared across 64 threads
  – 16 threads run on the 16 SIMD functional units in one clock

Page 45: General Purpose Computing using Graphics Hardware

45

ATI Radeon 4870 core

[Block diagram: Zcull/Clip/Rast, Output Blend, a Work Distributor, ten texture (Tex) units, and the array of processing cores.]

Page 46: General Purpose Computing using Graphics Hardware

46

Summary: three key ideas

1. Use many “slimmed down” cores running in parallel
2. Pack cores full of ALUs by sharing an instruction stream across groups of threads
   – Option 1: Explicit SIMD vector instructions
   – Option 2: Implicit sharing managed by hardware
3. Avoid latency stalls by interleaving the execution of many groups of threads
   – When one group stalls, work on another group

Page 47: General Purpose Computing using Graphics Hardware

2. GPU Programming Models

Programming Model
NVIDIA CUDA

OpenCL

47

Page 48: General Purpose Computing using Graphics Hardware

48

Task parallelism

• Distribute tasks across processors based on their dependencies
• Coarse-grain parallelism

[Diagram: a task dependency graph of Tasks 1 through 9 and its assignment across 3 processors (P1, P2, P3) over time.]

Page 49: General Purpose Computing using Graphics Hardware

49

Data parallelism

• Run a single kernel over many elements
  – Each element is updated independently
  – The same operation is applied to each element
• Fine-grain parallelism
  – Many lightweight threads, cheap context switching
  – Maps well to an ALU-heavy architecture: the GPU

[Diagram: one kernel applied to data elements P1 … Pn in parallel.]

Page 50: General Purpose Computing using Graphics Hardware

GPU-friendly Problems

• Data-parallel processing
• High arithmetic intensity
  – Keeps the GPU busy all the time
  – Computation offsets memory latency
• Coherent data access
  – Access large chunks of contiguous memory
  – Exploit fast on-chip shared memory

50

Page 51: General Purpose Computing using Graphics Hardware

The Algorithm Matters

• Jacobi: parallelizable

for(int i=0; i<num; i++)
{
    vnp1[i] = (vn[i-1] + vn[i+1])/2.0;   // vn = old iterate, vnp1 = new iterate (boundary handling omitted)
}

• Gauss-Seidel: difficult to parallelize

for(int i=0; i<num; i++)
{
    v[i] = (v[i-1] + v[i+1])/2.0;        // v[i] depends on the v[i-1] just updated in this sweep
}
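
On the GPU, the Jacobi update becomes one thread per element; a minimal CUDA sketch (the kernel and array names are illustrative, not from the slides, and boundary elements are simply skipped):

__global__ void jacobiStep(const float* vn, float* vnp1, int num)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i > 0 && i < num-1)                        // interior points only
        vnp1[i] = (vn[i-1] + vn[i+1]) / 2.0f;     // reads only the old iterate, so threads are independent
}

Gauss-Seidel has no such one-thread-per-element mapping, because each v[i] depends on the v[i-1] written earlier in the same sweep.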

51

Page 52: General Purpose Computing using Graphics Hardware

Example: Reduction

• Serial version (O(N))

for(int i=1; i<N; i++)
{
    v[0] += v[i];
}

• Parallel version (O(log N))

width = N/2;
while(width >= 1)
{
    for(int i=0; i<width; i++)
    {
        v[i] += v[i+width];   // computed in parallel
    }
    width /= 2;
}
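
A hedged sketch of how one step of this pattern maps onto a CUDA block using shared memory (names are illustrative; it assumes blockDim.x is a power of two, and each block writes one partial sum that the host or a second kernel pass adds up):

// One block reduces blockDim.x elements of 'in' into one partial sum.
__global__ void reduceSum(const float* in, float* out, int n)
{
    extern __shared__ float sdata[];                 // size = blockDim.x floats, set at launch
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;             // load one element into shared memory
    __syncthreads();

    for(int width = blockDim.x / 2; width > 0; width /= 2)
    {
        if(tid < width)
            sdata[tid] += sdata[tid + width];        // same pairwise add as above
        __syncthreads();
    }

    if(tid == 0)
        out[blockIdx.x] = sdata[0];                  // one partial sum per block
}

// launch example: reduceSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);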

52

Page 53: General Purpose Computing using Graphics Hardware

53

GPU programming languages

• Using graphics APIs
  – GLSL, Cg, HLSL
• Computing-specific APIs
  – DX11 Compute Shaders
  – NVIDIA CUDA
  – OpenCL

Page 54: General Purpose Computing using Graphics Hardware

54

NVIDIA CUDA

• C-extension programming language
  – No graphics API needed
  – Supports debugging tools
• Extensions / API
  – Function qualifiers: __global__, __device__, __host__
  – Variable qualifiers: __shared__, __constant__
  – Low-level functions
    • cudaMalloc(), cudaFree(), cudaMemcpy(), …
    • __syncthreads(), atomicAdd(), …
• Program types
  – Device program (kernel): runs on the GPU
  – Host program: runs on the CPU and calls device programs
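
A small illustrative sketch putting these qualifiers together (scale, clampToUnit, and scaleAndClamp are invented names for the example, not part of CUDA or of the slides):

__constant__ float scale;                       // constant memory, set from the host

__device__ float clampToUnit(float x)           // callable only from device code
{
    return fminf(fmaxf(x, 0.0f), 1.0f);
}

__global__ void scaleAndClamp(const float* in, float* out, int n)   // kernel: the device program
{
    __shared__ float tile[256];                 // shared memory, visible to one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
    {
        tile[threadIdx.x] = in[i] * scale;
        out[i] = clampToUnit(tile[threadIdx.x]);
    }
}

// host side: cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));
//            scaleAndClamp<<<numBlocks, 256>>>(d_in, d_out, n);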

Page 55: General Purpose Computing using Graphics Hardware

55

CUDA Programming Model

• Kernel
  – GPU program that runs on a thread grid
• Thread hierarchy
  – Grid: a set of blocks
  – Block: a set of threads
  – Grid size × block size = total number of threads

[Diagram: a grid of blocks 1 … n, each block containing threads that all run the same kernel.]

Page 56: General Purpose Computing using Graphics Hardware

56

CUDA Memory Structure

• Memory hierarchy
  – PC memory: off-card
  – GPU global memory: off-chip / on-card
  – Shared memory / registers / cache: on-chip
• The host can read/write global memory
• Threads within a block communicate through shared memory

[Diagram: PC memory (DRAM) ↔ GPU global memory (DRAM, on the graphics card) ↔ GPU shared memory (on-chip) ↔ ALUs in the GPU core.]

Page 57: General Purpose Computing using Graphics Hardware

57

Synchronization

• Threads in the same block can communicate using shared memory
• No HW global synchronization function yet
• __syncthreads()
  – Barrier for threads within the current block only
• __threadfence()
  – Flushes global memory writes to make them visible to all threads
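
A minimal sketch of the block barrier in use (an illustrative kernel, assuming a block size of 256 and an array length that is a multiple of 256):

__global__ void reverseWithinBlock(float* data)
{
    __shared__ float tile[256];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = data[i];                    // each thread writes one element of shared memory
    __syncthreads();                      // barrier: all writes to tile[] are now visible within the block

    data[i] = tile[blockDim.x - 1 - t];   // safe to read an element written by another thread
}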

Page 58: General Purpose Computing using Graphics Hardware

58

Example: CPU Vector Addition

// Pair-wise addition of vector elements
// CPU version : serial add
void vectorAdd(float* iA, float* iB, float* oC, int num)
{
    for(int i=0; i<num; i++)
    {
        oC[i] = iA[i] + iB[i];
    }
}

Page 59: General Purpose Computing using Graphics Hardware

59

Example: CUDA Vector Addition

// Pair-wise addition of vector elements
// CUDA version : one thread per addition
__global__ void vectorAdd(float* iA, float* iB, float* oC)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
}

Page 60: General Purpose Computing using Graphics Hardware

60

Example: CUDA Host Code

float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// … initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );
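
The slide stops at the kernel launch; the usual follow-up, sketched here with the same runtime calls, is to copy the result back to the host and release the memory:

// copy the result back to the host
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// clean up
cudaFree(d_A);  cudaFree(d_B);  cudaFree(d_C);
free(h_A);  free(h_B);  free(h_C);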

Page 61: General Purpose Computing using Graphics Hardware

61

OpenCL (Open Computing Language)

• First industry standard for a computing language
  – Based on the C language
  – Platform independent
    • NVIDIA, ATI, Intel, …
• Data- and task-parallel compute model
  – Uses all computational resources in the system
    • CPU, GPU, …
  – Work-item: same as a thread / fragment / etc.
  – Work-group: a group of work-items
    • Work-items in the same work-group can communicate
  – Multiple work-groups execute in parallel

Page 62: General Purpose Computing using Graphics Hardware

62

OpenCL program structure

• Host program (CPU)
  – Platform layer
    • Query compute devices
    • Create context
  – Runtime
    • Create memory objects
    • Compile and create kernel program objects
    • Issue commands (i.e., kernel launches) to the command queue
    • Synchronize commands
    • Clean up OpenCL resources
• Kernel (CPU, GPU)
  – C-like code with some extensions
  – Runs on the compute device

Page 63: General Purpose Computing using Graphics Hardware

63

CUDA vs. OpenCL comparison

• Conceptually almost identical
  – Work-item == thread
  – Work-group == block
  – Similar memory model
    • Global, local, shared memory
  – Kernel, host program
• CUDA is highly optimized, but only for NVIDIA GPUs
• OpenCL can be used widely, on any GPUs/CPUs
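
For comparison, the CUDA vector-addition kernel from page 59 written as an OpenCL kernel (a sketch; get_global_id(0) plays the role of threadIdx.x + blockDim.x * blockIdx.x):

__kernel void vectorAdd(__global const float* iA,
                        __global const float* iB,
                        __global float* oC)
{
    int idx = get_global_id(0);    // global work-item index
    oC[idx] = iA[idx] + iB[idx];
}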

Page 64: General Purpose Computing using Graphics Hardware

64

Implementation status of OpenCL

• Specification 1.0 released by Khronos
• NVIDIA released a Beta 1.2 driver and SDK
  – Available to registered GPU computing developers
• Apple will include it in Mac OS X Snow Leopard
  – Q3 2009
  – NVIDIA and ATI GPUs, Intel CPUs for Mac
• More companies will join

Page 65: General Purpose Computing using Graphics Hardware

GPU optimization tips: configuration

• Identify the bottleneck
  – Compute- or bandwidth-bound (use the profiler)
  – Focus on the most expensive but parallelizable parts (Amdahl's law)
• Maximize parallel execution
  – Use large inputs (many threads)
  – Avoid divergent execution
  – Use limited resources efficiently
    • Minimize shared memory / register use

65

Page 66: General Purpose Computing using Graphics Hardware

GPU optimization tips: memory

• Memory access: the most important optimization
  – Minimize device-to-host memory overhead
    • Overlap kernels with memory copies (asynchronous copy)
  – Avoid shared memory bank conflicts
  – Coalesce global memory accesses
  – Texture or constant memory can help (cached)

[Diagram: PC memory (DRAM) ↔ GPU global memory (DRAM, on the graphics card) ↔ GPU shared memory (on-chip) ↔ ALUs.]
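
A sketch of what coalescing means in practice (two illustrative kernels; in the first, neighboring threads touch neighboring addresses, in the second they do not):

__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
        out[i] = in[i];        // thread k touches element k: one wide, coalesced transaction per warp
}

__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if(i < n)
        out[i] = in[i];        // neighboring threads are 'stride' elements apart: many narrow transactions
}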

66

Page 67: General Purpose Computing using Graphics Hardware

67

GPU optimization tips: instructions

• Use less expensive operators
  – Division: 32 cycles, multiplication: 4 cycles
    • Multiply by 0.5 instead of dividing by 2.0
  – Atomic operators are expensive
    • Possible race conditions
  – Double precision is much slower than float
  – Use less accurate floating point instructions when possible
    • __sinf(), __expf(), __powf()
• Eliminate unnecessary instructions
  – Loop unrolling
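
A small illustrative kernel combining these tips (the computation itself is made up for the example):

__global__ void shadePixels(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
    {
        float v = x[i] * 0.5f;          // multiply by 0.5 instead of dividing by 2.0

        float s = 0.0f;
        #pragma unroll                  // ask the compiler to unroll this fixed-length loop
        for(int k = 1; k <= 4; k++)
            s += __sinf(v * k);         // fast, less accurate single-precision intrinsic

        x[i] = s;
    }
}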

Page 68: General Purpose Computing using Graphics Hardware

3. Application Example

CUDA ITK

68

Page 69: General Purpose Computing using Graphics Hardware

ITK image filters implemented using CUDA

• Convolution filters
  – Mean filter
  – Gaussian filter
  – Derivative filter
  – Hessian of Gaussian filter
• Statistical filter
  – Median filter
• PDE-based filter
  – Anisotropic diffusion filter

69

Page 70: General Purpose Computing using Graphics Hardware

70

CUDA ITK

• CUDA code is integrated into ITK
  – Transparent to ITK users
  – No need to modify existing code that uses the ITK library
• Check the environment variable ITK_CUDA
  – Entry point
    • GenerateData() or ThreadedGenerateData()
  – If ITK_CUDA == 0
    • Execute the original ITK code
  – If ITK_CUDA == 1
    • Execute the CUDA code
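
A hypothetical illustration of that dispatch (not the actual CUDA ITK source; UseCuda is an invented helper):

#include <cstdlib>
#include <cstring>

// returns true when the ITK_CUDA environment variable is set to "1"
bool UseCuda()
{
    const char* flag = std::getenv("ITK_CUDA");
    return flag != 0 && std::strcmp(flag, "1") == 0;
}

// Inside a filter's GenerateData() / ThreadedGenerateData(), the idea is roughly:
//     if (UseCuda())  run the CUDA implementation;
//     else            run the original ITK CPU code;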

Page 71: General Purpose Computing using Graphics Hardware

71

Convolution filters

• Weighted sum of neighbors
  – For a size-n filter, each pixel is reused n times
• Non-separable filter (anisotropic)
  – Reuse data through shared memory
• Separable filter (Gaussian)
  – N-dimensional convolution = N separate 1D convolutions

[Diagram: the image convolved with a 1D kernel along each axis in turn.]

Page 72: General Purpose Computing using Graphics Hardware

72

Naïve C/CUDA implementation

• Read from the input image whenever it is needed

int xdim, ydim;    // size of input image
float *in, *out;   // input/output image of size xdim*ydim
float *w;          // convolution kernel of size n*m

for(int x=0; x<xdim; x++)
{
    for(int y=0; y<ydim; y++)
    {
        // compute the convolution for pixel (x,y); boundary handling omitted
        float sum = 0;
        for(int sx=x-n/2; sx<=x+n/2; sx++)
        {
            for(int sy=y-m/2; sy<=y+m/2; sy++)
            {
                int wx = sx - x + n/2;
                int wy = sy - y + m/2;
                sum += w[wy*n + wx] * in[sy*xdim + sx];   // load from global memory, n*m times per pixel
            }
        }
        out[y*xdim + x] = sum;
    }
}

// n*m global memory loads per pixel, for xdim*ydim pixels

Page 73: General Purpose Computing using Graphics Hardware

73

Improved CUDA convolution filter

• For a size n*m filter, each pixel is reused n*m times
  – Save n*m-1 global memory loads per pixel by using shared memory

__global__ void cudaConvolutionFilter2DKernel(float* in, float* out, float* w)
{
    // copy this block's pixels from global to shared memory: slow loads, done only once
    // (sharedmem[] is a __shared__ array; the indexing is elided on the slide)
    sharedmem[] = in[][];

    __syncthreads();

    // sum neighbor pixel values
    float _sum = 0;
    for(uint j=threadIdx.y; j<threadIdx.y + m; j++)
    {
        for(uint i=threadIdx.x; i<threadIdx.x + n; i++)
        {
            uint wx = i - threadIdx.x;
            uint wy = j - threadIdx.y;
            // load from shared memory (fast), n*m times
            _sum += w[wy*n + wx] * sharedmem[j*sharedmemdim.x + i];
        }
    }
    out[...] = _sum;   // write the result for this thread's pixel (indexing elided on the slide)
}

Page 74: General Purpose Computing using Graphics Hardware

74

CUDA Gaussian filter

• Apply a 1D convolution filter along each axis
  – Use temporary buffers: ping-pong rendering

// temp[0], temp[1] : temporary buffers for intermediate results
void cudaDiscreteGaussianImageFilter(float* in, float* out, float stddev)
{
    // create the Gaussian weights
    w = ComputeGaussKernel(stddev);

    temp[0] = in;

    // call the 1D convolution CUDA kernel with the Gaussian weights, once per axis,
    // ping-ponging between temp[0] and temp[1]
    dim3 G, B;
    int i;
    for(i=0; i<dimension; i++)
    {
        cudaConvolutionFilter1DKernel<<<G,B>>>(temp[i%2], temp[(i+1)%2], w);
    }

    out = temp[i%2];   // the buffer holding the final result
}

Page 75: General Purpose Computing using Graphics Hardware

Median filter

• Viola et al. [VIS 03]
  – Find the median by bisection of histogram bins
  – log(# bins) iterations
    • 8-bit pixels: log(256) = 8 iterations

[Diagram: four bisection steps narrowing the intensity range around an example pixel neighborhood.]

// copy the current block from global to shared memory
min = 0;
max = 255;
pivot = (min+max)/2.0f;
for(i=0; i<8; i++)
{
    count = 0;
    for(j=0; j<kernelsize; j++)
    {
        if(kernel[j] > pivot) count++;
    }
    if(count < kernelsize/2) max = floor(pivot);
    else                     min = ceil(pivot);
    pivot = (min + max)/2.0f;
}
return floor(pivot);

75

Page 76: General Purpose Computing using Graphics Hardware

76

Perona & Malik anisotropic diffusion

• Nonlinear diffusion
  – Adaptive smoothing based on the magnitude of the gradient
  – Preserves edges (high gradient)
• Numerical solution
  – Euler explicit integration (iterative method)
  – Finite differences for derivative computation

[Images: input image, linear diffusion, P & M diffusion]

Page 77: General Purpose Computing using Graphics Hardware

77

Performance

• Convolution filters
  – Mean filter: ~140x
  – Gaussian filter: ~60x
  – Derivative filter
  – Hessian of Gaussian filter
• Statistical filter
  – Median filter: ~25x
• PDE-based filter
  – Anisotropic diffusion filter: ~70x

Page 78: General Purpose Computing using Graphics Hardware

78

CUDA ITK

• Source code available at
  – http://sourceforge.net/projects/cudaitk/

Page 79: General Purpose Computing using Graphics Hardware

79

CUDA ITK Future Work

• ITK GPU image class
  – Reduce CPU-to-GPU memory I/O
  – Pipelining support
• Native interface for GPU code
  – Similar to ThreadedGenerateData(), but for GPU threads
• Numerical library (vnl)
• Out-of-GPU-core / GPU-cluster processing
  – Processing large images (10–100 terabytes)
• GPU platform-independent implementation
  – OpenCL could be a solution

Page 80: General Purpose Computing using Graphics Hardware

80

Conclusions

• GPU computing delivers high performance
  – Many scientific computing problems are parallelizable
  – More consistency/stability in HW and SW
    • The main GPU architecture is mature
    • An industry-wide programming standard now exists (OpenCL)
  – Better support and tools are available
    • C-based language, compiler, and debugger
• Issues
  – Not every problem is suitable for GPUs
    • Re-engineering of algorithms/software is required
  – Unclear future performance growth of GPU hardware
    • Intel's Larrabee

Page 81: General Purpose Computing using Graphics Hardware

thrust

• thrust: a library of data-parallel algorithms and data structures for CUDA, with an interface similar to the C++ Standard Template Library
• C++ template metaprogramming automatically chooses the fastest code path at compile time

Page 82: General Purpose Computing using Graphics Hardware

thrust::sort

#include <thrust/host_vector.h>

#include <thrust/device_vector.h>

#include <thrust/generate.h>

#include <thrust/sort.h>

#include <cstdlib>

int main(void)

{

// generate random data on the host

thrust::host_vector<int> h_vec(1000000);

thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer to device and sort

thrust::device_vector<int> d_vec = h_vec;

// sort 140M 32b keys/sec on GT200

thrust::sort(d_vec.begin(), d_vec.end());

return 0;
}

http://thrust.googlecode.com
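
thrust also provides other data-parallel primitives; for example, the reduction from page 52 can be written with thrust::reduce (a sketch; by default it sums the elements):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main(void)
{
    // one million ones on the device
    thrust::device_vector<int> d_vec(1000000, 1);

    // data-parallel sum, computed on the GPU
    int sum = thrust::reduce(d_vec.begin(), d_vec.end());

    return (sum == 1000000) ? 0 : 1;
}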