Parallelizing and Optimizing Programs for GPU Acceleration using CUDA

Martin Burtscher, Department of Computer Science

Page 1: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

Parallelizing and Optimizing Programs for GPU Acceleration using CUDA

Martin Burtscher, Department of Computer Science

Page 2: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

2

CUDA Optimization Tutorial Martin Burtscher

[email protected] http://www.cs.txstate.edu/~burtscher/

Tutorial slides http://www.cs.txstate.edu/~burtscher/tutorials/COT5/slides.pptx

CUDA Optimization Tutorial

Page 3: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

3

High-end CPU-GPU Comparison

                          Intel Xeon E5-2687W    NVIDIA Kepler GTX 680
Cores                     8 (superscalar)        1536 (simple)
Active threads            2 per core             ~11 per core
Frequency                 3.1 GHz                1.0 GHz
Peak performance (SP)     397 GFlop/s            3090 GFlop/s
Peak memory bandwidth     51 GB/s                192 GB/s
Maximum power             150 W                  195 W*
Price                     $1900                  $500*
Release date              March 2012             March 2012

*entire card

CUDA Optimization Tutorial

Page 4: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

4

GPU Advantages

Performance: 8x as many operations executed per second
Main memory bandwidth: 4x as many bytes transferred per second
Cost-, energy-, and size-efficiency: 29x as much performance per dollar, 6x as much performance per watt, 11x as much performance per area

(based on peak values)

CUDA Optimization Tutorial

Page 5: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

5

GPU Disadvantages Clearly, we should be using GPUs all the time

So why aren’t we?

GPUs can only execute some types of code fast Need lots of data parallelism, data reuse, & regularity

GPUs are harder to program and tune than CPUs In part because of poor tool support In part because of their architecture In part because of poor support for irregular codes

CUDA Optimization Tutorial

Page 6: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

6

Outline (Part I) Introduction GPU programming N-body example Porting and tuning Other considerations Conclusions

CUDA Optimization Tutorial


Page 7: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

7

CUDA Programming Model

Non-graphics programming: uses the GPU as a massively parallel co-processor
SIMT (single-instruction multiple-threads) model: thousands of threads needed for full efficiency

C/C++ with extensions
  Function launch: calling functions on the GPU
  Memory management: GPU memory allocation, copying data to/from the GPU
  Declaration qualifiers: device, shared, local, etc.
  Special instructions: barriers, fences, etc.
  Keywords: threadIdx, blockIdx

CUDA Optimization Tutorial

[figure: CPU and GPU connected by the PCI-Express bus]

Page 8: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

8

Calling GPU Kernels

Kernels are functions that run on the GPU
  Callable by CPU code
  CPU can continue processing while the GPU runs the kernel

KernelName<<<m, n>>>(arg1, arg2, ...);

Launch configuration (programmer selectable): the GPU spawns m blocks with n threads each (i.e., m*n threads total) that run a copy of the same function
Normal function parameters are passed conventionally, but the GPU has a different address space, so never pass CPU pointers
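As a concrete illustration (a minimal sketch, not from the original slides; the kernel and array names are made up), a complete launch looks like this:

  #include <cstdio>

  __global__ void FillKernel(int n, int *data) {     // runs on the GPU
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // global thread index
    if (i < n) data[i] = i;
  }

  int main() {
    const int n = 8;
    int host[n], *dev;
    cudaMalloc((void**)&dev, sizeof(int) * n);
    FillKernel<<<2, 4>>>(n, dev);                    // m = 2 blocks, n = 4 threads per block
    // the CPU could do other work here; the launch is asynchronous
    cudaMemcpy(host, dev, sizeof(int) * n, cudaMemcpyDeviceToHost);  // implicitly waits for the kernel
    for (int i = 0; i < n; i++) printf("%d ", host[i]);
    cudaFree(dev);
    return 0;
  }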

CUDA Optimization Tutorial

Page 9: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

9

GPU Architecture GPUs consist of Streaming Multiprocessors (SMs)

1 to 30 SMs per chip (run blocks) SMs contain Processing Elements (PEs)

8, 32, or 192 PEs per SM (run threads)

CUDA Optimization Tutorial

[figure, adapted from NVIDIA: many SMs, each with its own shared memory, all connected to one global memory]

Page 10: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

10

Block Scalability Hardware can assign blocks to SMs in any order

A kernel with enough blocks scales across GPUs Not all blocks may be resident at the same time

CUDA Optimization Tutorial

[figure, adapted from NVIDIA: the same kernel with blocks 0-7 runs on a GPU with 2 SMs as four waves of two blocks over time, and on a GPU with 4 SMs as two waves of four blocks]

Page 11: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

11

GPU Memories Separate from CPU memory

CPU can access GPU’s global & constant mem. via PCIe bus

Requires slow explicit transfer Visible GPU memory types

Registers (per thread) Local mem. (per thread) Shared mem. (per block)

Software-controlled cache Global mem. (per kernel) Constant mem. (read only)

CUDA Optimization Tutorial

[figure, adapted from NVIDIA: within the GPU, each block has per-thread registers and its own shared memory (SRAM); all blocks access global + local memory (DRAM) and read-only constant memory (DRAM, cached), which the CPU also accesses; communication between blocks is slow]

Page 12: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

12

SM Internals (Fermi and Kepler) Caches

Software-controlled shared memory Hardware-controlled incoherent L1 data cache 64 kB combined size, can be split 16/48, 32/32, 48/16

Synchronization support Fast hardware barrier within block (__syncthreads()) Fence instructions: memory consistency & coherency

Special operations Thread voting (warp-based reduction operations)
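For instance (a sketch using the standard CUDA runtime call for Fermi/Kepler devices; the tutorial itself does not show this code), the shared-memory/L1 split can be requested per kernel:

  // favor 48 kB of software-controlled shared memory (16 kB L1) for a shared-memory-heavy kernel
  cudaFuncSetCacheConfig(ForceCalcKernel, cudaFuncCachePreferShared);
  // or favor a 48 kB L1 data cache instead
  cudaFuncSetCacheConfig(ForceCalcKernel, cudaFuncCachePreferL1);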

CUDA Optimization Tutorial

Page 13: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

13

Block and Thread Allocation Limits

Blocks are assigned to SMs until the first limit is reached; threads are assigned to PEs

Hardware limits
  8 or 16 active blocks/SM
  1024, 1536, or 2048 resident threads/SM
  512 or 1024 threads/block
  16k, 32k, or 64k registers/SM
  16 kB or 48 kB shared memory per SM
  2^16-1 or 2^31-1 blocks/kernel

CUDA Optimization Tutorial

[figure, adapted from NVIDIA: two SMs (SM 0, SM 1), each with an instruction unit (MT IU), PEs, and shared memory; blocks of threads t0, t1, t2, ..., tm are assigned to each SM]

Page 14: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

14

Warp-based Execution 32 contiguous threads form a warp

Execute same instruction in same cycle (or disabled) Warps are scheduled out-of-order with respect to each

other to hide latencies

Thread divergence Some threads in warp jump to different PC than others Hardware runs subsets of warp until they re-converge Results in reduction of parallelism (performance loss)

CUDA Optimization Tutorial

Page 15: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

15

Thread Divergence

Non-divergent code (threads 0-31 of a warp all take the same branch):

  if (threadID >= 32) {
    some_code;
  } else {
    other_code;
  }

Divergent code (within the first warp, threads 0-12 and 13-31 take different branches, so each subset is disabled while the other executes):

  if (threadID >= 13) {
    some_code;
  } else {
    other_code;
  }

Adapted from NVIDIA

CUDA Optimization Tutorial

Page 16: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

16

Parallel Memory Accesses Coalesced main memory access (16/32x faster)

Under some conditions, HW combines multiple (half) warp memory accesses into a single coalesced access

CC 1.3: 64-byte aligned 64-byte line (any permutation) CC 2.x+3.0: 128-byte aligned 128-byte line (cached)

Bank-conflict-free shared memory access (16/32) No superword alignment or contiguity requirements

CC 1.3: 16 different banks per half warp or same word CC 2.x+3.0 : 32 different banks + 1-word broadcast each
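To make these access patterns concrete, here is a small sketch (not from the slides; kernel and parameter names are illustrative) contrasting a coalesced and a strided global-memory read:

  __global__ void CoalescedRead(const float *in, float *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];                    // adjacent threads read adjacent words: one coalesced access per warp
  }

  __global__ void StridedRead(const float *in, float *out, int n, int stride) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i * stride < n) out[i] = in[i * stride];  // adjacent threads read words far apart: many separate accesses
  }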

CUDA Optimization Tutorial

Page 17: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

17

Coalesced Main Memory Accesses

[figures from NVIDIA: an access pattern that is served by a single coalesced access, and patterns that are served by one and two coalesced accesses*]

CUDA Optimization Tutorial

Page 18: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

18

Outline Introduction GPU programming N-body example Porting and tuning Other considerations Conclusions

CUDA Optimization Tutorial


Page 19: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

19

N-body Simulation Time evolution of physical system

System consists of bodies “n” is the number of bodies Bodies interact via pair-wise forces

Many systems can be modeled in this way Star/galaxy clusters (gravitational force) Particles (electric force, magnetic force)

CUDA Optimization Tutorial


Page 20: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

20

Simple N-body Algorithm

Algorithm:
  Initialize body masses, positions, and velocities
  Iterate over time steps {
    Accumulate forces acting on each body
    Update body positions and velocities based on force
  }
  Output result

More sophisticated n-body algorithms exist Barnes Hut algorithm (covered in Part II) Fast Multipole Method (FMM)

CUDA Optimization Tutorial

Page 21: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

21

Key Loops (Pseudo Code)

  bodySet = ...;                    // input
  for timestep do {                 // sequential
    foreach Body b1 in bodySet {    // O(n^2) parallel
      foreach Body b2 in bodySet {
        if (b1 != b2) {
          b1.addInteractionForce(b2);
        }
      }
    }
    foreach Body b in bodySet {     // O(n) parallel
      b.Advance();
    }
  }
  // output result

CUDA Optimization Tutorial

Page 22: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

22

Force Calculation C Code

  struct Body {
    float mass, posx, posy, posz;              // mass and 3D position
    float velx, vely, velz, accx, accy, accz;  // 3D velocity & acceleration
  } *body;

  for (i = 0; i < nbodies; i++) {
    . . .
    for (j = 0; j < nbodies; j++) {
      if (i != j) {
        dx = body[j].posx - px;                     // delta x
        dy = body[j].posy - py;                     // delta y
        dz = body[j].posz - pz;                     // delta z
        dsq = dx*dx + dy*dy + dz*dz;                // distance squared
        dinv = 1.0f / sqrtf(dsq + epssq);           // inverse distance
        scale = body[j].mass * dinv * dinv * dinv;  // scaled force
        ax += dx * scale;                           // accumulate x contribution of accel
        ay += dy * scale; az += dz * scale;         // ditto for y and z
      }
    }
    . . .
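In equation form (my own restatement, assuming the usual softened gravitational interaction that this loop implements, with epssq playing the role of $\varepsilon^2$ and the gravitational constant folded into the masses), the acceleration accumulated for body $i$ is

$$\vec{a}_i = \sum_{j \neq i} \frac{m_j\,(\vec{r}_j - \vec{r}_i)}{\left(\lVert \vec{r}_j - \vec{r}_i \rVert^2 + \varepsilon^2\right)^{3/2}}$$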

CUDA Optimization Tutorial

Page 23: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

23

Outline Introduction GPU programming N-body example Porting and tuning Other considerations Conclusions

CUDA Optimization Tutorial

Page 24: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

24

N-body Algorithm Suitability for GPU Lots of data parallelism

Force calculations are independent Should be able to keep SMs and PEs busy

Sufficient memory access regularity All force calculations access body data in same order* Should have lots of coalesced memory accesses

Sufficient code regularity All force calculations are identical* There should be little thread divergence

Plenty of data reuse O(n2) operations on O(n) data CPU/GPU transfer time is insignificant

CUDA Optimization Tutorial

Page 25: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

25

C to CUDA Conversion Two CUDA kernels

Force calculation Advance position and velocity

Benefits
  Force calculation requires over 99.9% of the runtime, making it the primary target for acceleration
  The advancing kernel is unimportant to the runtime, but porting it too keeps the data on the GPU during the entire simulation
  Minimizes GPU/CPU transfers

CUDA Optimization Tutorial

Page 26: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

26

C to CUDA Conversion

  __global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {  // __global__ indicates a GPU kernel that the CPU can call
    . . .
  }
  __global__ void AdvancingKernel(int nbodies, struct Body *body, ...) {
    . . .
  }

  int main(...) {
    Body *body, *bodyl;  // separate address spaces, need two pointers
    . . .
    cudaMalloc((void**)&bodyl, sizeof(Body)*nbodies);                       // allocate memory on GPU
    cudaMemcpy(bodyl, body, sizeof(Body)*nbodies, cudaMemcpyHostToDevice);  // copy CPU data to GPU
    for (timestep = ...) {
      ForceCalcKernel<<<1, 1>>>(nbodies, bodyl, ...);   // call GPU kernel with 1 block and 1 thread per block
      AdvancingKernel<<<1, 1>>>(nbodies, bodyl, ...);
    }
    cudaMemcpy(body, bodyl, sizeof(Body)*nbodies, cudaMemcpyDeviceToHost);  // copy GPU data back to CPU
    cudaFree(bodyl);
    . . .
  }

CUDA Optimization Tutorial

Page 27: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

27

Evaluation Methodology Systems and compilers

CC 1.3: Quadro FX 5800, nvcc 3.2 30 SMs, 240 PEs, 1.3 GHz, 30720 resident threads

CC 2.0: Tesla C2050, nvcc 3.2 14 SMs, 448 PEs, 1.15 GHz, 21504 resident threads

CC 3.0: GeForce GTX 680, nvcc 4.2 8 SMs, 1536 PEs, 1.0 GHz, 16384 resident threads

Inputs and metric 1k, 10k, or 100k star clusters (Plummer model) Median runtime of three experiments, excluding I/O

CUDA Optimization Tutorial

Page 28: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

28

1-Thread Performance Problem size

n=10000, step=1 n=10000, step=1 n=3000, step=1

Slowdown rel. to CPU CC 1.3: 72.4 CC 2.0: 36.7 CC 3.0: 68.1(Note: comparing different GPUs to

different CPUs)

Performance 1 thread is one to two

orders of magnitude slower on GPU than CPU

Reasons No caches (CC 1.3) Not superscalar Slower clock frequency No SMT latency hiding

CUDA Optimization Tutorial

Page 29: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

29

Using N Threads Approach

Eliminate outer loop Instantiate n copies of inner loop, one per body

Threading Blocks can only hold 512 or 1024 threads

Up to 8/16 blocks can be resident in an SM at a time SM can hold 1024, 1536, or 2048 threads We use 256 threads per block (works for all of our GPUs)

Need multiple blocks Last block may not need all of its threads

CUDA Optimization Tutorial

Page 30: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

30

Using N Threads

  __global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {
    // the outer "for (i = 0; i < nbodies; i++)" loop is eliminated
    i = threadIdx.x + blockIdx.x * blockDim.x;  // compute i
    if (i < nbodies) {  // in case last block is only partially used
      for (j = ...) {
        . . .
      }
    }
  }
  __global__ void AdvancingKernel(int nbodies, ...)  // same changes

  #define threads 256
  int main(...) {
    . . .
    int blocks = (nbodies + threads - 1) / threads;  // compute block count
    for (timestep = ...) {
      ForceCalcKernel<<<blocks, threads>>>(nbodies, bodyl, ...);  // was <<<1, 1>>>
      AdvancingKernel<<<blocks, threads>>>(nbodies, bodyl, ...);
    }
  }

CUDA Optimization Tutorial

Page 31: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

31

N Thread Speedup Relative to 1 GPU thread

CC 1.3: 7781 (240 PEs) CC 2.0: 6495 (448 PEs) CC 3.0: 12150 (1536 PEs)

Relative to 1 CPU thread CC 1.3: 107.5 CC 2.0: 176.7 CC 3.0: 176.2

Performance Speedup much higher

than number of PEs(32, 14.5, and 7.9 times)

Due to SMT latency hiding

Per-core performance CPU core delivers up to

4.4, 5, and 8.7 times as much performance as a GPU core (PE)

CUDA Optimization Tutorial

Page 32: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

32

[figure: structs in an array (AoS) versus scalar arrays (SoA)]

Using Scalar Arrays Data structure conversion

Arrays of structs are bad for coalescing Bodies’ elements (e.g., mass fields) are not adjacent

Optimize data structure Use multiple scalar arrays, one per field (need 10) Results in code bloat but often much better speed

CUDA Optimization Tutorial

Page 33: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

33

Using Scalar Arrays

  __global__ void ForceCalcKernel(int nbodies, float *mass, ...) {
    // change all "body[k].blah" to "blah[k]"
  }
  __global__ void AdvancingKernel(int nbodies, float *mass, ...) {
    // change all "body[k].blah" to "blah[k]"
  }

  int main(...) {
    float *mass, *posx, *posy, *posz, *velx, *vely, *velz, *accx, *accy, *accz;
    float *massl, *posxl, *posyl, *poszl, *velxl, *velyl, *velzl, ...;
    mass = (float *)malloc(sizeof(float) * nbodies);  // etc.
    . . .
    cudaMalloc((void**)&massl, sizeof(float)*nbodies);  // etc.
    cudaMemcpy(massl, mass, sizeof(float)*nbodies, cudaMemcpyHostToDevice);  // etc.
    for (timestep = ...) {
      ForceCalcKernel<<<blocks, threads>>>(nbodies, massl, posxl, ...);
      AdvancingKernel<<<blocks, threads>>>(nbodies, massl, posxl, ...);
    }
    cudaMemcpy(mass, massl, sizeof(float)*nbodies, cudaMemcpyDeviceToHost);  // etc.
    . . .
  }

CUDA Optimization Tutorial

Page 34: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

34

Scalar Array Speedup Problem size

n=100000, step=1 n=100000, step=1 n=300000, step=1

Relative to struct CC 1.3: 0.83 CC 2.0: 0.96 CC 3.0: 0.82

Performance Threads access same

memory locations, not adjacent ones

Always combined but not really coalesced access

Slowdowns may be due to DRAM page/TLB misses

Scalar arrays Still needed (see later)

CUDA Optimization Tutorial

Page 35: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

35

Constant Kernel Parameters Kernel parameters

Lots of parameters due to scalar arrays All but one parameter never change their value

Constant memory “Pass” parameters only once Copy them into GPU’s constant memory

Performance implications Reduced parameter passing overhead Constant memory has hardware cache

CUDA Optimization Tutorial

Page 36: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

36

Constant Kernel Parameters

  __constant__ int nbodiesd;
  __constant__ float dthfd, epssqd;
  __constant__ float *massd, *posxd, ...;

  __global__ void ForceCalcKernel(int step) {
    // rename affected variables (add "d" to name)
  }

  __global__ void AdvancingKernel() {
    // rename affected variables (add "d" to name)
  }

  int main(...) {
    . . .
    cudaMemcpyToSymbol(massd, &massl, sizeof(void *));  // etc.
    . . .
    for (timestep = ...) {
      ForceCalcKernel<<<blocks, threads>>>(step);   // launch configuration as in the N-threads version
      AdvancingKernel<<<blocks, threads>>>();
    }
    . . .
  }

CUDA Optimization Tutorial

Page 37: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

37

Constant Mem. Parameter Speedup Problem size

n=1000, step=10000 n=1000, step=10000 n=3000, step=10000

Speedup CC 1.3: 1.015 CC 2.0: 1.016 CC 3.0: 0.971

Performance Minimal perf. impact May be useful for very

short kernels that are often invoked

Benefit Less shared memory

used on CC 1.3 devices

CUDA Optimization Tutorial

Page 38: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

38

Using the RSQRTF Instruction Slowest kernel operation

Computing one over the square root is very slow GPU has slightly imprecise but fast 1/sqrt instruction

(frequently used in graphics code to calculate inverse of distance to a point)

IEEE floating-point accuracy compliance CC 1.x is not entirely compliant CC 2.x and above are compliant but also offer faster

non-compliant instructions

CUDA Optimization Tutorial

Page 39: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

39

Using the RSQRT Instruction

  for (i = 0; i < nbodies; i++) {
    . . .
    for (j = 0; j < nbodies; j++) {
      if (i != j) {
        dx = body[j].posx - px;
        dy = body[j].posy - py;
        dz = body[j].posz - pz;
        dsq = dx*dx + dy*dy + dz*dz;
        dinv = rsqrtf(dsq + epssq);  // replaces: dinv = 1.0f / sqrtf(dsq + epssq);
        scale = body[j].mass * dinv * dinv * dinv;
        ax += dx * scale; ay += dy * scale; az += dz * scale;
      }
    }
    . . .
  }

CUDA Optimization Tutorial

Page 40: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

40

RSQRT Speedup Problem size

n=100000, step=1 n=100000, step=1 n=300000, step=1

Speedup CC 1.3: 0.99 CC 2.0: 1.83 CC 3.0: 1.64

Performance No change for CC 1.3

Compiler automatically uses less precise RSQRTF as most FP ops are not fully precise anyhow

83% speedup for CC 2.0 Over entire application Compiler defaults to

precise instructions Explicit use of RSQRTF

indicates imprecision okay

CUDA Optimization Tutorial

Page 41: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

41

Using 2 Loops to Avoid If Statement

"if (i != j)" causes thread divergence

Break the loop into two loops to avoid the if statement; the original inner loop is:

  for (j = 0; j < nbodies; j++) {
    if (i != j) {
      dx = body[j].posx - px;
      dy = body[j].posy - py;
      dz = body[j].posz - pz;
      dsq = dx*dx + dy*dy + dz*dz;
      dinv = rsqrtf(dsq + epssq);
      scale = body[j].mass * dinv * dinv * dinv;
      ax += dx * scale; ay += dy * scale; az += dz * scale;
    }
  }

CUDA Optimization Tutorial

Page 42: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

42

Using 2 Loops to Avoid If Statement

  for (j = 0; j < i; j++) {
    dx = body[j].posx - px;
    dy = body[j].posy - py;
    dz = body[j].posz - pz;
    dsq = dx*dx + dy*dy + dz*dz;
    dinv = rsqrtf(dsq + epssq);
    scale = body[j].mass * dinv * dinv * dinv;
    ax += dx * scale; ay += dy * scale; az += dz * scale;
  }
  for (j = i+1; j < nbodies; j++) {
    dx = body[j].posx - px;
    dy = body[j].posy - py;
    dz = body[j].posz - pz;
    dsq = dx*dx + dy*dy + dz*dz;
    dinv = rsqrtf(dsq + epssq);
    scale = body[j].mass * dinv * dinv * dinv;
    ax += dx * scale; ay += dy * scale; az += dz * scale;
  }

CUDA Optimization Tutorial

Page 43: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

43

Loop Duplication Speedup Problem size

n=100000, step=1 n=100000, step=1 n=300000, step=1

Speedup CC 1.3: 0.55 CC 2.0: 1.00 CC 3.0: 1.00

Performance No change for 2.0 & 3.0

Divergence moved to loop 45% slowdown for CC 1.3

Unclear why

Discussion Not a useful optimization Code bloat A little divergence is okay

(only 1 in 3125 iterations)

CUDA Optimization Tutorial

Page 44: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

44

Blocking using Shared Memory Code is memory bound

Each warp streams in all bodies’ masses and positions Block inner loop

Read block of mass & position info into shared mem Requires barriers (fast hardware barrier within SM)

Advantage A lot fewer main memory accesses Remaining main memory accesses are fully coalesced

(due to usage of scalar arrays)

CUDA Optimization Tutorial

Page 45: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

45

Blocking using Shared Memory

  __shared__ float posxs[THREADS], posys[THREADS], poszs[THREADS], masss[THREADS];
  j = 0;
  for (j1 = 0; j1 < nbodiesd; j1 += THREADS) {  // first part of loop
    idx = tid + j1;
    if (idx < nbodiesd) {  // each thread copies 4 words (fully coalesced)
      posxs[tid] = posxd[idx];
      posys[tid] = posyd[idx];
      poszs[tid] = poszd[idx];
      masss[tid] = massd[idx];
    }
    __syncthreads();  // wait for all copying to be done
    bound = min(nbodiesd - j1, THREADS);
    for (j2 = 0; j2 < bound; j2++, j++) {  // second part of loop
      if (i != j) {
        dx = posxs[j2] - px;
        dy = posys[j2] - py;
        dz = poszs[j2] - pz;
        dsq = dx*dx + dy*dy + dz*dz;
        dinv = rsqrtf(dsq + epssqd);
        scale = masss[j2] * dinv * dinv * dinv;
        ax += dx * scale; ay += dy * scale; az += dz * scale;
      }
    }
    __syncthreads();  // wait for all force calculations to be done
  }

CUDA Optimization Tutorial

Page 46: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

46

Blocking Speedup Problem size

n=100000, step=1 n=100000, step=1 n=300000, step=1

Speedup CC 1.3: 3.7 CC 2.0: 1.1 CC 3.0: 1.6

Performance Great speedup for CC 1.3 Some speedup for others

Has hardware data cache

Discussion Very important

optimization for memory bound code

Even with L1 cache

CUDA Optimization Tutorial

Page 47: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

47

Loop Unrolling CUDA compiler

Generally good at unrolling loops with fixed bounds Does not unroll inner loop of our example code

Use pragma to unroll (and pad arrays):

  #pragma unroll 8
  for (j2 = 0; j2 < bound; j2++, j++) {
    if (i != j) {
      dx = posxs[j2] - px;
      dy = posys[j2] - py;
      dz = poszs[j2] - pz;
      dsq = dx*dx + dy*dy + dz*dz;
      dinv = rsqrtf(dsq + epssqd);
      scale = masss[j2] * dinv * dinv * dinv;
      ax += dx * scale; ay += dy * scale; az += dz * scale;
    }
  }

CUDA Optimization Tutorial

Page 48: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

48

Loop Unrolling Speedup Problem size

n=100000, step=1 n=100000, step=1 n=300000, step=1

Speedup CC 1.3: 1.07 CC 2.0: 1.16 CC 3.0: 1.07

Performance Noticeable speedup All three GPUs

Discussion Can be useful May increase register

usage, which may lower maximum number of threads per block and result in slowdown

CUDA Optimization Tutorial

Page 49: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

49

CC 2.0 Absolute Performance

Problem size: n=100000, step=1
Runtime: 612 ms
FP operations: 326.7 GFlop/s
Main memory throughput: 1.035 GB/s

Not peak performance
  Only 32% of 1030 GFlop/s; peak assumes an FMA every cycle
  Inner loop: 3 sub (1c), 3 fma (1c), 1 rsqrt (8c), 3 mul (1c), 3 fma (1c) = 20 cycles for 20 Flop
  63% of the realistic peak of 515.2 GFlop/s, which assumes no non-FP operations
  With int ops: 31 cycles for 20 Flop, i.e., 99% of the actual peak of 330.45 GFlop/s
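As a rough sanity check on those peak numbers (my own arithmetic, using the Tesla C2050 configuration from the evaluation slide: 448 PEs at 1.15 GHz):

$$448 \times 1.15\,\text{GHz} \times 2\,\tfrac{\text{Flop}}{\text{cycle}} \approx 1030\ \text{GFlop/s (FMA every cycle)}$$
$$448 \times 1.15 \times 1 \approx 515.2\ \text{GFlop/s (one Flop per cycle)}$$
$$515.2 \times \tfrac{20}{31} \approx 332\ \text{GFlop/s, close to the quoted 330.45 GFlop/s once integer ops are counted}$$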

CUDA Optimization Tutorial

Page 50: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

50

Eliminating the If Statement Algorithmic optimization

Potential softening parameter avoids division by zero If-statement is not necessary and can be removed

Eliminates thread divergence

  for (j2 = 0; j2 < bound; j2++, j++) {
    // the "if (i != j)" test is removed; when i == j, dx = dy = dz = 0 and epssqd keeps dinv finite
    dx = posxs[j2] - px;
    dy = posys[j2] - py;
    dz = poszs[j2] - pz;
    dsq = dx*dx + dy*dy + dz*dz;
    dinv = rsqrtf(dsq + epssqd);
    scale = masss[j2] * dinv * dinv * dinv;
    ax += dx * scale; ay += dy * scale; az += dz * scale;
  }

CUDA Optimization Tutorial

Page 51: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

51

If Elimination Speedup Problem size

n=100000, step=1 n=100000, step=1 n=300000, step=1

Speedup CC 1.3: 1.38 CC 2.0: 1.54 CC 3.0: 1.64

Performance Large speedup All three GPUs

Discussion No thread divergence Allows compiler to

schedule code much better

CUDA Optimization Tutorial

Page 52: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

52

Rearranging Terms Generated code is suboptimal

Compiler does not emit as many fused multiply-add (FMA) instructions as it could

Rearrange terms in expressions to help compiler Need to check generated assembly code

  for (j2 = 0; j2 < bound; j2++, j++) {
    dx = posxs[j2] - px;
    dy = posys[j2] - py;
    dz = poszs[j2] - pz;
    dsq = dx*dx + (dy*dy + (dz*dz + epssqd));  // replaces: dsq = dx*dx + dy*dy + dz*dz;
    dinv = rsqrtf(dsq);                        // replaces: dinv = rsqrtf(dsq + epssqd);
    scale = masss[j2] * dinv * dinv * dinv;
    ax += dx * scale; ay += dy * scale; az += dz * scale;
  }

CUDA Optimization Tutorial

Page 53: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

53

FMA Speedup Problem size

n=100000, step=1 n=100000, step=1 n=300000, step=1

Speedup CC 1.3: 1.03 CC 2.0: 1.05 CC 3.0: 1.06

Performance Small speedup All three GPUs

Discussion Seemingly needless

transformations can make a difference

CUDA Optimization Tutorial

Page 54: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

54

Higher Unroll Factor Problem size

n=100000, step=1 n=100000, step=1 n=300000, step=1

Speedup CC 1.3: 1.01 CC 2.0: 1.04 CC 3.0: 0.93

Unroll 128 times Avoid looping overhead Now that there are no ifs

Performance Little speedup/slowdown

Discussion Carefully choose unroll

factor (manually tune)

CUDA Optimization Tutorial

Page 55: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

55

Compiler Flags Problem size

n=100000, step=1 n=100000, step=1 n=300000, step=1

Speedup CC 1.3: 1.00 CC 2.0: 1.18 CC 3.0: 1.15

-use_fast_math “-ftz=true” suffices

(flush denormals to zero) Makes SP FP operations

faster except on CC 1.3 Performance

Significant speedup Discussion

Use faster but less precise operations when prudent

CUDA Optimization Tutorial

Page 56: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

56

Final Absolute Performance CC 2.0 Fermi GTX 480

Problem size n=100000, step=1

Runtime 296.1 ms

FP operations 675.6 GFlop/s (SP) 66% of peak performance 261.1 GFlops/s (DP)

Main mem throughput 2.139 GB/s

CC 3.0 Kepler GTX 680 Problem size

n=300000, step=1 Runtime

1073 ms FP operations

1677.6 GFlop/s (SP) 54% of peak performance 88.7 GFlops/s (DP)

Main mem throughput 5.266 GB/s

CUDA Optimization Tutorial

Page 57: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

57

Outline Introduction GPU programming N-body example Porting and tuning Other considerations Conclusions

CUDA Optimization Tutorial


Page 58: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

58

Things to Consider

Minimize PCIe transfers
  Implementing the entire algorithm on the GPU, even including some slow serial code sections, might be an overall win
  Can stream data to/from the GPU while computing (see the sketch below)

Locks and synchronization
  Lightweight locks & fast barriers are possible within an SM, but are slow across different SMs
  L1 data caches are not coherent
  Use volatile & fences to avoid deadlocks
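A minimal sketch of such streaming (not from the slides; the kernel name, sizes, and variables are placeholders) using one CUDA stream to overlap a transfer with other work:

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  float *h_buf, *d_buf;
  cudaMallocHost((void**)&h_buf, bytes);   // pinned host memory, required for truly asynchronous copies
  cudaMalloc((void**)&d_buf, bytes);
  cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);  // copy runs in the background
  SomeKernel<<<blocks, threads, 0, stream>>>(d_buf);  // ordered after the copy within the same stream
  // ... CPU work or transfers in other streams can proceed here ...
  cudaStreamSynchronize(stream);           // wait for the copy and kernel to finish
  cudaStreamDestroy(stream);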

CUDA Optimization Tutorial

Page 59: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

59

Warp-based Execution

  // wrong on GPU, correct on CPU
  do {
    cnt = 0;
    if (ready[i] != 0) cnt++;
    if (ready[j] != 0) cnt++;
  } while (cnt < 2);
  ready[k] = 1;

  // correct
  do {
    cnt = 0;
    if (ready[i] != 0) cnt++;
    if (ready[j] != 0) cnt++;
    if (cnt == 2) ready[k] = 1;
  } while (cnt < 2);

Problem Thread divergence Loop exiting threads wait

for other threads in warp to also exit

“ready[k] = 1” is not executed until all threads in warp are done with loop

Possible deadlock

CUDA Optimization Tutorial

Page 60: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

60

Hybrid Execution CPU always needed for program launch and I/O

CPU much faster on serial program segments GPU 10 times faster than CPU on parallel code

Running 10% of problem on CPU is hardly worthwhile Complicates programming and requires data transfer

Best CPU data structure is often not best for GPU PCIe bandwidth much lower than GPU bandwidth

1.6 to 6.5 GB/s versus 192 GB/s But can send data while CPU & GPU are computing Merging CPU and GPU on same die (e.g., AMD’s

Fusion APU) makes finer grain switching possible

CUDA Optimization Tutorial

Page 61: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

61

Outline Introduction GPU programming N-body example Porting and tuning Other considerations Conclusions

CUDA Optimization Tutorial


Page 62: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

62

Summary and Conclusions (Part I) Step-by-step porting and tuning of CUDA code

Example: n-body simulation

GPUs have very powerful hardware But only exploitable with some codes Even harder to program and optimize than CPU hardware

CUDA Optimization Tutorial

Page 63: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

Parallelizing and Optimizing Programs for GPU Acceleration using CUDA (Part II)

Martin BurtscherDepartment of Computer Science

Page 64: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

64

Mapping Regular Code to GPUs Regular codes

Operate on array- and matrix-based data structures Exhibit mostly strided memory access patterns Have relatively predictable control flow (control flow

behavior is mainly determined by input size) Largely independent computations

Many regular codes are easy to port to GPUs E.g., matrix codes executing many ops/word

Dense matrix operations (level 2 and 3 BLAS) Stencil codes (PDE solvers)

CUDA Optimization Tutorial


Page 65: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

65

Mapping Irregular Code to GPUs Irregular codes

Build, traverse, and update dynamic data structures (trees, graphs, linked lists, priority queues, etc.)

Exhibit pointer-chasing memory access patterns Have complex control flow (control flow behavior

depends on input values and changes dynamically) Many important scientific programs are irregular

E.g., n-body simulation, data clustering, SAT solving, social networks, discrete-event simulation, meshing, …

Need case studies on how to best map irregular codes

CUDA Optimization Tutorial

Page 66: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

66

Example: N-body Simulation Irregular Barnes Hut algorithm

Repeatedly builds unbalanced tree and performs complex traversals on it

Our implementation Designed for GPUs (not just port of CPU code) First GPU implementation of entire BH algorithm

Results GPU is 21 times faster than CPU (6 cores) on this code

CUDA Optimization Tutorial

Page 67: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

67

Outline Introduction Barnes Hut algorithm CUDA implementation Experimental results Conclusions

CUDA Optimization Tutorial


Page 68: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

68

Barnes Hut Idea

Precise force calculation
  Requires O(n^2) operations (O(n^2) body pairs)
  Computationally intractable for large n

Barnes and Hut (1986)
  Algorithm to approximately compute forces
  Bodies' initial position & velocity are also approximate
  Requires only O(n log n) operations
  Idea is to "combine" far away bodies
  Error should be small because force ∝ 1/distance^2
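For reference (standard Barnes Hut practice; the slides only say "far enough away"), a cell of side length $s$ whose center of mass lies at distance $d$ from the body is usually treated as far enough away when

$$\frac{s}{d} < \theta$$

for some accuracy parameter $\theta$ (often around 0.5); the octree traversal in the force-computation step uses a test of this kind.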

CUDA Optimization Tutorial

Page 69: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

69

Barnes Hut Algorithm

Set bodies' initial position and velocity
Iterate over time steps:
  1. Compute bounding box around bodies
  2. Subdivide space until at most one body per cell
     (record this spatial hierarchy in an octree)
  3. Compute mass and center of mass of each cell
  4. Compute force on bodies by traversing the octree
     (stop a traversal path when encountering a leaf (body) or an internal node (cell) that is far enough away)
  5. Update each body's position and velocity

CUDA Optimization Tutorial

Page 70: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

70

Build Tree (Level 1)


Compute bounding box around all bodies → tree root

CUDA Optimization Tutorial

Page 71: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

71

Build Tree (Level 2)


Subdivide space until at most one body per cell

CUDA Optimization Tutorial

Page 72: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

72

Build Tree (Level 3)


Subdivide space until at most one body per cell

CUDA Optimization Tutorial

Page 73: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

73

Build Tree (Level 4)


Subdivide space until at most one body per cell

CUDA Optimization Tutorial

Page 74: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

74

Build Tree (Level 5)


Subdivide space until at most one body per cell

CUDA Optimization Tutorial

Page 75: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

75

Compute Cells’ Center of Mass

For each internal cell, compute sum of mass and weighted averageof position of all bodies in subtree; example shows two cells only

CUDA Optimization Tutorial


Page 76: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

76

Compute Forces

Compute force, for example, acting upon green body

CUDA Optimization Tutorial


Page 77: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

77

Compute Force (short distance)

Scan tree depth first from left to right; green portion already completed

CUDA Optimization Tutorial


Page 78: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

78

Compute Force (down one level)

Red center of mass is too close, need to go down one level

CUDA Optimization Tutorial


Page 79: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

79

Compute Force (long distance)

Blue center of mass is far enough away

CUDA Optimization Tutorial


Page 80: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

80

Compute Force (skip subtree)

Therefore, entire subtree rooted in the blue cell can be skipped

CUDA Optimization Tutorial


Page 81: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

81

Pseudocode

  bodySet = ...
  foreach timestep do {
    bounding_box = new Bounding_Box();
    foreach Body b in bodySet { bounding_box.include(b); }
    octree = new Octree(bounding_box);
    foreach Body b in bodySet { octree.Insert(b); }
    cellList = octree.CellsByLevel();
    foreach Cell c in cellList { c.Summarize(); }
    foreach Body b in bodySet { b.ComputeForce(octree); }
    foreach Body b in bodySet { b.Advance(); }
  }

CUDA Optimization Tutorial


Page 82: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

82

Complexity and Parallelism

  bodySet = ...
  foreach timestep do {                // O(n log n) + ordered sequential
    bounding_box = new Bounding_Box();
    foreach Body b in bodySet {        // O(n) parallel reduction
      bounding_box.include(b);
    }
    octree = new Octree(bounding_box);
    foreach Body b in bodySet {        // O(n log n) top-down tree building
      octree.Insert(b);
    }
    cellList = octree.CellsByLevel();
    foreach Cell c in cellList {       // O(n) + ordered bottom-up traversal
      c.Summarize();
    }
    foreach Body b in bodySet {        // O(n log n) fully parallel
      b.ComputeForce(octree);
    }
    foreach Body b in bodySet {        // O(n) fully parallel
      b.Advance();
    }
  }

CUDA Optimization Tutorial

Page 83: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

83

Outline Introduction Barnes Hut algorithm CUDA implementation Experimental results Conclusions

CUDA Optimization Tutorial

Page 84: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

84

Efficient GPU Code Large amounts of data parallelism Coalesced main memory accesses Little thread divergence Relatively little synchronization between blocks Little CPU/GPU data transfer Efficient use of shared memory

CUDA Optimization Tutorial


Page 85: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

85


Main BH Implementation Challenges Uses irregular tree-based data structure

Initially little parallelism Little coalescing Load imbalance

Complex recursive traversals Recursion not well supported Lots of thread divergence

Memory-bound pointer-chasing operations Not enough computation to hide latency

CUDA Optimization Tutorial

Page 86: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

86

Six GPU Kernels

  Read initial data and transfer to GPU
  for each timestep do {
    1. Compute bounding box around bodies (not irregular)
    2. Build hierarchical decomposition, i.e., octree
    3. Summarize body information in internal octree nodes
    4. Approximately sort bodies by spatial location (optional)
    5. Compute forces acting on each body with help of octree
    6. Update body positions and velocities (not irregular)
  }
  Transfer result from GPU and output

CUDA Optimization Tutorial

Page 87: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

87

Global Optimizations Make code iterative (recursion not supported*) Keep data on GPU between kernel calls Use array elements instead of heap nodes

One aligned array per field for coalesced accesses
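A tiny sketch (illustrative names and sizes, not the tutorial's actual code) of what "array elements instead of heap nodes" with index arithmetic can look like:

  // one array per field; tree nodes are integer indices instead of pointers
  float posx[MAXNODES], posy[MAXNODES], posz[MAXNODES], mass[MAXNODES];
  int   child[8 * MAXNODES];          // 8 child indices per octree cell, negative means empty

  int c = child[8 * node + octant];   // follow a child "pointer" by index arithmetic
  if (c >= 0) {
    float cx = posx[c];               // each field lives in its own aligned array -> coalesced accesses
  }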

CUDA Optimization Tutorial

[figure: data layout transformation from objects on the heap, to objects in an array, to one array per field]

Page 88: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

88

[figure: example octree with bodies b0..ba and cells c0..c5; in the node arrays, bodies occupy fixed slots from one end (b0 b1 ... ba) while cells are allocated from the other end (... c5 c4 c3 c2 c1 c0)]

Global Optimizations (cont.) Maximize thread count (round down to warp size) Maximize resident block count (all SMs filled) Pass kernel parameters through constant memory Use special allocation order Alias arrays (56 B/node) Use index arithmetic Persistent blocks & threads Unroll loops over children

CUDA Optimization Tutorial

Page 89: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

89


Kernel 1: Bounding Box (Regular) Optimizations

Fully coalesced Fully cached No bank conflicts Minimal divergence Built-in min and max 2 red/mem, 6 red/bar Bodies load balanced 512*3 threads per SM

CUDA Optimization Tutorial

Reduction operation

Page 90: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

90


Kernel 2: Build Octree (Irregular) Optimizations

Only lock leaf “pointers” Lock-free fast path Light-weight lock release No re-traverse after lock

acquire failure Combined memory fence Re-compute position

during traversal Separate init kernels 512*3 threads per SM

Top-down tree building

CUDA Optimization Tutorial

Page 91: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

91

Kernel 2: Build Octree (cont.)

  // initialize
  cell = find_insertion_point(body);  // no locks, cache cell
  child = get_insertion_index(cell, body);
  if (child != locked) {  // skip atomic if already locked
    if (child == null) {  // fast path (frequent)
      if (null == atomicCAS(&cell[child], null, body)) {  // lock-free insertion
        // move on to next body
      }
    } else {
      if (child == atomicCAS(&cell[child], child, lock)) {  // acquire lock
        // build subtree with new and existing body
        flag = true;
      }
    }
  }
  __syncthreads();   // optional barrier
  __threadfence();   // make data visible
  if (flag) {
    cell[child] = new_subtree;  // insert subtree and release lock
    // move on to next body
  }

CUDA Optimization Tutorial

Page 92: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

92

[figure: the cell array is filled in allocation order and scanned in the opposite direction for the bottom-up traversal]

Kernel 3: Summarize Subtrees (Irreg.)

Bottom-up tree traversal

Optimizations Scan avoids deadlock Use mass as flag + fence

No locks, no atomics Use wait-free first pass Cache the ready info Piggyback on traversal

Count bodies in subtrees No parent “pointers” 128*6 threads per SM

CUDA Optimization Tutorial

Page 93: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

93

Kernel 4: Sort Bodies (Irregular)

Top-down tree traversal

Optimizations (Similar to Kernel 3) Scan avoids deadlock Use data field as flag

No locks, no atomics Use counts from Kernel 3 Piggyback on traversal

Move nulls to back Throttle warps with

optional barrier 64*6 threads per SM

CUDA Optimization Tutorial


Page 94: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

94

Kernel 5: Force Calculation (Irregular)

Multiple prefix traversals

Optimizations Group similar work together

Uses sorting to minimize size of prefix union in each warp

Early out (nulls in back) Traverse whole union to avoid

divergence (warp voting) Lane 0 controls iteration stack

for entire warp (fits in shmem) Minimize volatile accesses Use fast 1/sqrtf instruction Cache tree-level-based data 256*5 threads per SM

CUDA Optimization Tutorial

Page 95: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

95

Architectural Support Coalesced memory accesses & lockstep execution

All threads in warp read same tree node at same time Only one mem access per warp instead of 32 accesses

Warp-based execution Enables data sharing in warps w/o synchronization

RSQRTF instruction Quickly computes good approximation of 1/sqrtf(x)

Warp voting instructions Quickly perform reduction operations within a warp
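As a small illustration (my own sketch; the actual kernels are more involved), warp-vote intrinsics of that generation let all threads of a warp agree on a condition without shared memory or barriers:

  int far_enough = ...;          // per-thread predicate (placeholder), e.g., "this cell is far enough away"
  if (__all(far_enough)) {
    // the whole warp uses the cell's summarized center of mass and skips the subtree
  } else {
    // the whole warp descends into the subtree together, avoiding divergence
  }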

CUDA Optimization Tutorial

Page 96: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science


Kernel 6: Advance Bodies (Regular) Optimizations

Fully coalesced, no divergence Load balanced, 1024*1 threads per SM

Straightforward streaming

CUDA Optimization Tutorial

Page 97: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

97

Outline Introduction Barnes Hut algorithm CUDA implementation Experimental results Conclusions

CUDA Optimization Tutorial

[figure: log-scale plot of runtime per timestep [s] versus number of bodies (10,000 to 10,000,000) for CPUbh, GPUbh, and GPUsq]

Page 98: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

98

Evaluation Methodology Implementations

CUDA/GPU: Barnes Hut and O(n2) algorithms OpenMP/CPU: Barnes Hut algorithm (derived from CUDA) Pthreads/CPU: Barnes Hut algorithm (SPLASH-2 suite)

Systems and compilers nvcc 4.0 (-O3 -arch=sm_20 -ftz=true*) GeForce GTX 480, 1.4 GHz, 15 SMs, 32 cores per SM gcc 4.1.2 (-O3 -fopenmp* -ffast-math*) Xeon X5690, 3.46 GHz, 6 cores, 2 threads per core

Inputs and metric 5k, 50k, 500k, and 5M star clusters (Plummer model) Best runtime of three experiments, excluding I/O

CUDA Optimization Tutorial

Page 99: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

99

Nodes Touched per Activity (5M Input) Kernel “activities”

K1: pair reduction K2: tree insertion K3: bottom-up step K4: top-down step K5: prefix traversal K6: integration step

Max tree depth ≤ 22 Cells have 3.1 children

Prefix ≤ 6,315 nodes(≤ 0.1% of 7.4 million)

BH algorithm & sorting to min. union work well

CUDA Optimization Tutorial

Neighborhood size (nodes touched per activity):

            min    avg      max
  kernel 1  1      2.0      2
  kernel 2  2      13.2     22
  kernel 3  2      4.1      9
  kernel 4  2      4.1      9
  kernel 5  818    4,117.0  6,315
  kernel 6  1      1.0      1

Page 100: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

100

Available Amorphous Data Parallelism

Almost every “round” has lots of activities without data dependencies that can be processed in parallel

CUDA Optimization Tutorial

[figure: available amorphous data parallelism per kernel and round (bounding box, tree building, summarization, sorting, force calc., integration) for 5,000, 50,000, 500,000, and 5,000,000 bodies; available parallelism spans roughly 1 to 10,000,000 on a log scale]

Page 101: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

101

Runtime Comparison GPU BH inefficiency

5k input too small for 5,760 to 23,040 threads

BH vs. O(n2) algorithm O(n2) faster with fewer

than about 15k bodies GPU (5M input)

21.1x faster than OpenMP 23.2x faster than Pthreads

CUDA Optimization Tutorial

[figure: log-scale plot of runtime per timestep [ms] versus number of bodies (5,000 to 5,000,000) for GPU CUDA (Barnes Hut), GPU O(n^2), CPU OpenMP, and CPU Pthreads]

Page 102: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

102

Kernel Performance for 5M Input $200 GPU delivers 228 GFlops/s on irregular code

GPU chip is 2.7 to 23.5 times faster than CPU chip

GPU hardware is better suited for BH than CPU hw But difficult and very time consuming to program

CUDA Optimization Tutorial

Per-kernel performance on the GTX 480 (Barnes Hut and O(n^2)):

                kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6  BarnesHut  O(n^2)
  GFlop/s       71.6      5.8       2.5       n/a       240.6     33.5      228.4      897.0
  GB/s          142.9     26.8      10.6      12.8      8.0       133.9     8.8        2.8
  runtime [ms]  0.4       44.6      28.0      14.2      1641.2    2.2       1730.6     557421.5

Runtime [ms] per kernel, non-compliant fast single-precision version (left) and IEEE 754-compliant double-precision version (right):

                single precision                               double precision
                k1     k2     k3     k4    k5        k6        k1    k2     k3     k4    k5        k6
  X5690 CPU     5.5    185.7  75.8   52.1  38,540.3  16.4      10.3  193.1  101.0  51.6  47,706.4  33.1
  GTX 480 GPU   0.4    44.6   28.0   14.2  1,641.2   2.2       0.8   46.7   31.0   14.2  5,177.1   4.2
  CPU/GPU       13.1   4.2    2.7    3.7   23.5      7.3       12.7  4.1    3.3    3.6   9.2       7.9

Page 103: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

103

Kernel Speedups Optimizations that are generally applicable

Optimizations for irregular kernels

CUDA Optimization Tutorial

Speedups from generally applicable optimizations:

             avoid     rsqrtf   recalc.  thread   full multi-
             volatile  instr.   data     voting   threading
  50,000     1.14x     1.43x    0.99x    2.04x    20.80x
  500,000    1.19x     1.47x    1.32x    2.49x    27.99x
  5,000,000  1.18x     1.46x    1.69x    2.47x    28.85x

Speedups from optimizations for irregular kernels:

             throttling  wait-free  combined   sorting of  sync'ed
             barrier     pre-pass   mem fence  bodies      execution
  50,000     0.97x       1.02x      1.54x      3.60x       6.23x
  500,000    1.03x       1.21x      1.57x      6.28x       8.04x
  5,000,000  1.04x       1.31x      1.50x      8.21x       8.60x

Page 104: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

104

Outline Introduction Barnes Hut algorithm CUDA implementation Experimental results Conclusions

CUDA Optimization Tutorial

Page 105: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

105

Optimization Summary Reduce main memory accesses

Share data within warp, combine memory fences & traversals, re-compute data, avoid volatile accesses

Minimize thread divergence Group similar work together, force synchronicity

Implement entire algorithm on and for GPU Avoid data transfers & data structure inefficiencies,

wait-free pre-pass, scan entire prefix union

CUDA Optimization Tutorial

Page 106: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

106

Optimization Summary (cont.) Exploit hardware features

Fast synchronization & thread startup, special instrs., coalesced memory accesses, even lockstep execution

Use light-weight locking and synchronization Minimize locks, reuse fields, and use fence + store ops

Maximize parallelism Parallelize every step within and across SMs

CUDA Optimization Tutorial

Page 107: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

107

CPU/GPU Implementation Comparison

Irregular CPU code:
  Dynamically (incrementally) allocated shared data structures
  Structure-based shared data structures
  Logical lock-based implementation
  Global/local worklists
  Recursive or iterative implementation

Irregular GPU code:
  Statically (wholly) allocated shared data structures
  Multiple-array-based shared data structures
  Lock-free implementation
  (Implicit) local worklists
  Iterative implementation

CUDA Optimization Tutorial

Page 108: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

108

Useful GPU Hardware Features

Wide parallelism: great for exploiting large amounts of parallelism
Massive multithreading: ideal for hiding the latency of irregular memory accesses
Fast thread startup: essential when launching thousands of threads
Shared memory: fast data sharing, useful for local worklists
HW support for reduction and synchronization: makes otherwise costly operations very fast
Coalesced accesses: memory access combining is useful in irregular codes
Lockstep execution: can share data without explicit synchronization and allows consolidating iteration stacks

CUDA Optimization Tutorial

Page 109: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

109

Challenges with GPUs

Warp-based execution: often requires sorting of work or an algorithm change
Data structure layout: the best layout for the CPU differs from the best layout for the GPU; SoA can be tedious to code and deal with (parameter passing)
Separate memory space: slow transfers; pack/unpack data
Incoherent L1 caches: may need to explicitly manage data (fences)
Poor recursion support: need to make code iterative and maintain explicit iteration stacks
Thread and block counts: the hierarchy complicates implementation; optimal counts have to be (auto-)tuned

CUDA Optimization Tutorial

Page 110: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

110

Running Irregular Algorithms on GPUs

Mandatory
  Need vast amounts of data parallelism
  Can do large chunks of computation on GPU

Very important
  Cautious implementation
  Data structures can be expressed through fixed arrays
  Uses local worklists that can be statically populated

Important
  Scheduling is independent of previous activities
  Easy to sort activities by similarity (if needed)

Beneficial
  Easy to express iteratively
  Has statically known range of neighborhoods
  Data structure size (or bound) can be determined based on input

CUDA Optimization Tutorial

Page 111: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

111

Conclusions

Irregularity does not necessarily prevent high performance on GPUs
Entire Barnes Hut algorithm implemented on GPU
  Builds and traverses an unbalanced octree
  GPU is 21.1 times (float) and 9.1 times (double) faster than a high-end 6-core Xeon
Code directly for the GPU, do not merely adjust CPU code
  Requires different data and code structures
  Benefits from different algorithmic modifications

CUDA Optimization Tutorial

Page 112: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

112

Acknowledgments

Hardware: NVIDIA Corp. and Intel Corp.
Funding: NVIDIA Corp. and Texas State University
OpenMP code: Ricardo Alves (Universidade do Minho, Portugal)
Collaborator: Keshav Pingali (University of Texas at Austin)

CUDA Optimization Tutorial

Page 113: Parallelizing and Optimizing Programs for GPU Acceleration using CUDA Martin Burtscher Department of Computer Science

113

CUDA Optimization Tutorial Martin Burtscher

[email protected] http://www.cs.txstate.edu/~burtscher/

Barnes Hut CUDA code http://www.gpucomputing.net/?q=node/1314

Tutorial slides http://www.cs.txstate.edu/~burtscher/tutorials/COT5/slides.pptx

CUDA Optimization Tutorial