TRANSCRIPT
Parallelizing and Optimizing Programs for GPU Acceleration using CUDA
Martin Burtscher
Department of Computer Science
[email protected]
http://www.cs.txstate.edu/~burtscher/
Tutorial slides: http://www.cs.txstate.edu/~burtscher/tutorials/COT5/slides.pptx
High-end CPU-GPU Comparison

                          Xeon E5-2687W      Kepler GTX 680
  Cores                   8 (superscalar)    1536 (simple)
  Active threads          2 per core         ~11 per core
  Frequency               3.1 GHz            1.0 GHz
  Peak performance (SP)   397 GFlop/s        3090 GFlop/s
  Peak mem. bandwidth     51 GB/s            192 GB/s
  Maximum power           150 W              195 W*
  Price                   $1900              $500*

Release dates: Xeon March 2012, Kepler March 2012. *entire card
GPU Advantages
Performance: 8x as many operations executed per second.
Main memory bandwidth: 4x as many bytes transferred per second.
Cost-, energy-, and size-efficiency: 29x as much performance per dollar, 6x as much performance per watt, 11x as much performance per area.
(based on peak values)
GPU Disadvantages
Clearly, we should be using GPUs all the time. So why aren't we?
GPUs can only execute some types of code fast: they need lots of data parallelism, data reuse, & regularity.
GPUs are harder to program and tune than CPUs: in part because of poor tool support, in part because of their architecture, and in part because of poor support for irregular codes.
Outline (Part I)
Introduction, GPU programming, N-body example, Porting and tuning, Other considerations, Conclusions
CUDA Programming Model
Non-graphics programming: uses the GPU as a massively parallel co-processor.
SIMT (single-instruction multiple-threads) model: thousands of threads are needed for full efficiency.
C/C++ with extensions:
  Function launch: calling functions on the GPU
  Memory management: GPU memory allocation, copying data to/from the GPU
  Declaration qualifiers: device, shared, local, etc.
  Special instructions: barriers, fences, etc.
  Keywords: threadIdx, blockIdx
[Figure: GPU and CPU connected by the PCI-Express bus.]
Calling GPU Kernels
Kernels are functions that run on the GPU, callable by CPU code; the CPU can continue processing while the GPU runs the kernel.

  KernelName<<<m, n>>>(arg1, arg2, ...);

Launch configuration (programmer selectable): the GPU spawns m blocks with n threads each (i.e., m*n threads total) that run a copy of the same function. Normal function parameters are passed conventionally, but the GPU has a different address space, so CPU pointers should never be passed.
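To make the launch mechanics concrete, here is a minimal sketch (the kernel name, array, and scale factor are illustrative, not part of the tutorial's code):

__global__ void ScaleKernel(int n, float *data, float factor)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;  // unique global thread index
  if (i < n) data[i] *= factor;                   // guard: last block may be partially used
}

// host side: data_d must point to GPU memory (e.g., from cudaMalloc)
int threads = 256;
int blocks = (n + threads - 1) / threads;  // enough blocks to cover all n elements
ScaleKernel<<<blocks, threads>>>(n, data_d, 0.5f);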
GPU Architecture
GPUs consist of Streaming Multiprocessors (SMs): 1 to 30 SMs per chip (run blocks).
SMs contain Processing Elements (PEs): 8, 32, or 192 PEs per SM (run threads).
[Figure: multiple SMs, each with its own shared memory, connected to a common global memory. Adapted from NVIDIA.]
Block Scalability
Hardware can assign blocks to SMs in any order. A kernel with enough blocks scales across GPUs; not all blocks may be resident at the same time.
[Figure: the same 8-block kernel (Block 0 through Block 7) runs as four waves of two blocks on a GPU with 2 SMs and as two waves of four blocks on a GPU with 4 SMs. Adapted from NVIDIA.]
GPU Memories
Separate from CPU memory: the CPU can access the GPU's global & constant memory via the PCIe bus, which requires slow explicit transfers.
Visible GPU memory types:
  Registers (per thread)
  Local memory (per thread)
  Shared memory (per block): a software-controlled cache
  Global memory (per kernel)
  Constant memory (read only)
Communication between blocks (through global memory) is slow.
[Figure: per-thread registers, per-block shared memory (SRAM), global + local memory (DRAM), and constant memory (DRAM, cached) accessible from the CPU. Adapted from NVIDIA.]
SM Internals (Fermi and Kepler)
Caches: software-controlled shared memory plus a hardware-controlled incoherent L1 data cache; 64 kB combined size, which can be split 16/48, 32/32, or 48/16.
Synchronization support: fast hardware barrier within a block (__syncthreads()); fence instructions for memory consistency & coherence.
Special operations: thread voting (warp-based reduction operations).
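On Fermi/Kepler, the shared-memory/L1 split can be requested per kernel from the host. A minimal sketch using the standard runtime call (the kernel name is illustrative):

// ask for the 48 kB shared memory / 16 kB L1 configuration for this kernel
cudaFuncSetCacheConfig(ForceCalcKernel, cudaFuncCachePreferShared);
// alternatives: cudaFuncCachePreferL1 (16/48) or cudaFuncCachePreferEqual (32/32)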
Block and Thread Allocation Limits
Blocks are assigned to SMs until the first limit is reached; threads are assigned to PEs.
Hardware limits:
  8/16 active blocks per SM
  1024, 1536, or 2048 resident threads per SM
  512 or 1024 threads per block
  16k, 32k, or 64k registers per SM
  16 kB or 48 kB shared memory per SM
  2^16-1 or 2^31-1 blocks per kernel
[Figure: threads t0, t1, t2, ..., tm grouped into blocks and assigned to SM 0 and SM 1. Adapted from NVIDIA.]
Warp-based Execution
32 contiguous threads form a warp; they execute the same instruction in the same cycle (or are disabled). Warps are scheduled out of order with respect to each other to hide latencies.
Thread divergence: some threads in a warp jump to a different PC than others; the hardware runs subsets of the warp until they re-converge. This results in a reduction of parallelism (performance loss).
Thread Divergence

Non-divergent code (all threads 0..31 of a warp take the same branch):

if (threadID >= 32) {
  some_code;
} else {
  other_code;
}

Divergent code (within a warp, threads 0..12 and 13..31 take different branches; each subset is disabled while the other runs):

if (threadID >= 13) {
  some_code;
} else {
  other_code;
}

(Adapted from NVIDIA.)
Parallel Memory Accesses
Coalesced main memory access (16/32x faster): under some conditions, the hardware combines multiple (half-)warp memory accesses into a single coalesced access.
  CC 1.3: 64-byte aligned 64-byte line (any permutation)
  CC 2.x+3.0: 128-byte aligned 128-byte line (cached)
Bank-conflict-free shared memory access (16/32x): no superword alignment or contiguity requirements.
  CC 1.3: 16 different banks per half warp, or all threads reading the same word
  CC 2.x+3.0: 32 different banks + 1-word broadcast each
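A hedged sketch of the difference (array names are illustrative): consecutive threads touching consecutive words let the hardware combine a warp's reads into one transaction, whereas strided accesses touch many lines and cannot be combined:

__global__ void AccessPatterns(int n, float *in, float *out)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n) {
    float v = in[i];          // coalesced: a warp reads 32 consecutive words
    // float w = in[32 * i];  // strided: every thread touches a different line
    out[i] = v;               // coalesced store
  }
}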
[Figure: Coalesced Main Memory Accesses: access patterns yielding a single coalesced access vs. one and two coalesced accesses. From NVIDIA.]
Outline
Introduction, GPU programming, N-body example, Porting and tuning, Other considerations, Conclusions
N-body Simulation
Time evolution of a physical system: the system consists of bodies ("n" is the number of bodies) that interact via pair-wise forces.
Many systems can be modeled in this way: star/galaxy clusters (gravitational force), particles (electric force, magnetic force).
Simple N-body Algorithm

Initialize body masses, positions, and velocities
Iterate over time steps {
  Accumulate forces acting on each body
  Update body positions and velocities based on forces
}
Output result

More sophisticated n-body algorithms exist: the Barnes Hut algorithm (covered in Part II) and the Fast Multipole Method (FMM).
Key Loops (Pseudo Code)

bodySet = ...;                          // input
for timestep do {                       // sequential
  foreach Body b1 in bodySet {          // O(n^2) parallel
    foreach Body b2 in bodySet {
      if (b1 != b2) {
        b1.addInteractionForce(b2);
      }
    }
  }
  foreach Body b in bodySet {           // O(n) parallel
    b.Advance();
  }
}
// output result
Force Calculation C Code

struct Body {
  float mass, posx, posy, posz;              // mass and 3D position
  float velx, vely, velz, accx, accy, accz;  // 3D velocity & acceleration
} *body;

for (i = 0; i < nbodies; i++) {
  . . .
  for (j = 0; j < nbodies; j++) {
    if (i != j) {
      dx = body[j].posx - px;  // delta x
      dy = body[j].posy - py;  // delta y
      dz = body[j].posz - pz;  // delta z
      dsq = dx*dx + dy*dy + dz*dz;                // distance squared
      dinv = 1.0f / sqrtf(dsq + epssq);           // inverse distance
      scale = body[j].mass * dinv * dinv * dinv;  // scaled force
      ax += dx * scale;  // accumulate x contribution of accel
      ay += dy * scale;  az += dz * scale;  // ditto for y and z
    }
  }
  . . .
Outline
Introduction, GPU programming, N-body example, Porting and tuning, Other considerations, Conclusions
N-body Algorithm Suitability for GPU
Lots of data parallelism: force calculations are independent; should be able to keep SMs and PEs busy.
Sufficient memory access regularity: all force calculations access body data in the same order*; should have lots of coalesced memory accesses.
Sufficient code regularity: all force calculations are identical*; there should be little thread divergence.
Plenty of data reuse: O(n^2) operations on O(n) data; CPU/GPU transfer time is insignificant.
C to CUDA Conversion
Two CUDA kernels: force calculation, and advancing positions and velocities.
Benefits: the force calculation requires over 99.9% of the runtime, making it the primary target for acceleration. The advancing kernel is unimportant to the runtime, but porting it allows all data to be kept on the GPU during the entire simulation, minimizing GPU/CPU transfers.
C to CUDA Conversion

__global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {  // __global__ indicates a GPU kernel that the CPU can call
  . . .
}
__global__ void AdvancingKernel(int nbodies, struct Body *body, ...) {
  . . .
}

int main(...) {
  Body *body, *bodyl;  // separate address spaces, need two pointers
  . . .
  cudaMalloc((void**)&bodyl, sizeof(Body)*nbodies);  // allocate memory on GPU
  cudaMemcpy(bodyl, body, sizeof(Body)*nbodies, cudaMemcpyHostToDevice);  // copy CPU data to GPU
  for (timestep = ...) {
    ForceCalcKernel<<<1, 1>>>(nbodies, bodyl, ...);  // call GPU kernel with 1 block and 1 thread per block
    AdvancingKernel<<<1, 1>>>(nbodies, bodyl, ...);
  }
  cudaMemcpy(body, bodyl, sizeof(Body)*nbodies, cudaMemcpyDeviceToHost);  // copy GPU data back to CPU
  cudaFree(bodyl);
  . . .
}
Evaluation Methodology
Systems and compilers:
  CC 1.3: Quadro FX 5800, nvcc 3.2; 30 SMs, 240 PEs, 1.3 GHz, 30720 resident threads
  CC 2.0: Tesla C2050, nvcc 3.2; 14 SMs, 448 PEs, 1.15 GHz, 21504 resident threads
  CC 3.0: GeForce GTX 680, nvcc 4.2; 8 SMs, 1536 PEs, 1.0 GHz, 16384 resident threads
Inputs and metric: 1k, 10k, or 100k star clusters (Plummer model); median runtime of three experiments, excluding I/O.
1-Thread Performance
Problem size: n=10000, step=1 (CC 1.3 and 2.0); n=3000, step=1 (CC 3.0)
Slowdown relative to CPU: CC 1.3: 72.4; CC 2.0: 36.7; CC 3.0: 68.1 (note: comparing different GPUs to different CPUs)
Performance: 1 thread is one to two orders of magnitude slower on the GPU than on the CPU.
Reasons: no caches (CC 1.3), not superscalar, slower clock frequency, no SMT latency hiding.
Using N Threads
Approach: eliminate the outer loop and instantiate n copies of the inner loop, one per body.
Threading: blocks can only hold 512 or 1024 threads; up to 8/16 blocks can be resident in an SM at a time; an SM can hold 1024, 1536, or 2048 threads. We use 256 threads per block (works for all of our GPUs). Multiple blocks are needed, and the last block may not need all of its threads.
Using N Threads

__global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {
  i = threadIdx.x + blockIdx.x * blockDim.x;  // compute i (replaces the outer loop)
  if (i < nbodies) {  // in case last block is only partially used
    for (j = ...) {
      . . .
    }
  }
}
__global__ void AdvancingKernel(int nbodies, ...)  // same changes

#define threads 256
int main(...) {
  . . .
  int blocks = (nbodies + threads - 1) / threads;  // compute block count
  for (timestep = ...) {
    ForceCalcKernel<<<blocks, threads>>>(nbodies, bodyl, ...);  // was <<<1, 1>>>
    AdvancingKernel<<<blocks, threads>>>(nbodies, bodyl, ...);
  }
}
N Thread Speedup
Relative to 1 GPU thread: CC 1.3: 7781 (240 PEs); CC 2.0: 6495 (448 PEs); CC 3.0: 12150 (1536 PEs)
Relative to 1 CPU thread: CC 1.3: 107.5; CC 2.0: 176.7; CC 3.0: 176.2
Performance: the speedup is much higher than the number of PEs (32, 14.5, and 7.9 times) due to SMT latency hiding.
Per-core performance: a CPU core delivers up to 4.4, 5, and 8.7 times as much performance as a GPU core (PE).
Using Scalar Arrays
Data structure conversion: arrays of structs are bad for coalescing because the bodies' elements (e.g., the mass fields) are not adjacent in memory.
Optimize the data structure: use multiple scalar arrays, one per field (need 10). This results in code bloat but often much better speed.
[Figure: structs in array vs. scalar arrays memory layout.]
Using Scalar Arrays

__global__ void ForceCalcKernel(int nbodies, float *mass, ...) {
  // change all “body[k].blah” to “blah[k]”
}
__global__ void AdvancingKernel(int nbodies, float *mass, ...) {
  // change all “body[k].blah” to “blah[k]”
}

int main(...) {
  float *mass, *posx, *posy, *posz, *velx, *vely, *velz, *accx, *accy, *accz;
  float *massl, *posxl, *posyl, *poszl, *velxl, *velyl, *velzl, ...;
  mass = (float *)malloc(sizeof(float) * nbodies);  // etc.
  . . .
  cudaMalloc((void**)&massl, sizeof(float)*nbodies);  // etc.
  cudaMemcpy(massl, mass, sizeof(float)*nbodies, cudaMemcpyHostToDevice);  // etc.
  for (timestep = ...) {
    ForceCalcKernel<<<blocks, threads>>>(nbodies, massl, posxl, ...);
    AdvancingKernel<<<blocks, threads>>>(nbodies, massl, posxl, ...);
  }
  cudaMemcpy(mass, massl, sizeof(float)*nbodies, cudaMemcpyDeviceToHost);  // etc.
  . . .
}
Scalar Array Speedup
Problem size: n=100000, step=1 (CC 1.3 and 2.0); n=300000, step=1 (CC 3.0)
Relative to struct: CC 1.3: 0.83; CC 2.0: 0.96; CC 3.0: 0.82
Performance: threads access the same memory locations, not adjacent ones, so the accesses are always combined but not really coalesced. The slowdowns may be due to DRAM page/TLB misses.
Scalar arrays are still needed, though (see later).
Constant Kernel Parameters
Kernel parameters: lots of parameters due to the scalar arrays, and all but one parameter never change their value.
Constant memory: “pass” the parameters only once by copying them into the GPU's constant memory.
Performance implications: reduced parameter-passing overhead, and constant memory has a hardware cache.
Constant Kernel Parameters

__constant__ int nbodiesd;
__constant__ float dthfd, epssqd;
__constant__ float *massd, *posxd, ...;

__global__ void ForceCalcKernel(int step) {
  // rename affected variables (add “d” to the name)
}

__global__ void AdvancingKernel() {
  // rename affected variables (add “d” to the name)
}

int main(...) {
  . . .
  cudaMemcpyToSymbol(massd, &massl, sizeof(void *));  // etc.
  . . .
  for (timestep = ...) {
    ForceCalcKernel<<<blocks, threads>>>(step);  // step is the only remaining parameter
    AdvancingKernel<<<blocks, threads>>>();
  }
  . . .
}
Constant Mem. Parameter Speedup
Problem size: n=1000, step=10000 (CC 1.3 and 2.0); n=3000, step=10000 (CC 3.0)
Speedup: CC 1.3: 1.015; CC 2.0: 1.016; CC 3.0: 0.971
Performance: minimal impact; may be useful for very short kernels that are invoked often.
Benefit: less shared memory used on CC 1.3 devices.
Using the RSQRTF Instruction
Slowest kernel operation: computing one over the square root is very slow. The GPU has a slightly imprecise but fast 1/sqrt instruction (frequently used in graphics code to calculate the inverse of the distance to a point).
IEEE floating-point accuracy compliance: CC 1.x is not entirely compliant; CC 2.x and above are compliant but also offer faster non-compliant instructions.
Using the RSQRT Instruction

for (i = 0; i < nbodies; i++) {
  . . .
  for (j = 0; j < nbodies; j++) {
    if (i != j) {
      dx = body[j].posx - px;
      dy = body[j].posy - py;
      dz = body[j].posz - pz;
      dsq = dx*dx + dy*dy + dz*dz;
      dinv = rsqrtf(dsq + epssq);  // was: dinv = 1.0f / sqrtf(dsq + epssq);
      scale = body[j].mass * dinv * dinv * dinv;
      ax += dx * scale;  ay += dy * scale;  az += dz * scale;
    }
  }
  . . .
}
RSQRT Speedup
Problem size: n=100000, step=1 (CC 1.3 and 2.0); n=300000, step=1 (CC 3.0)
Speedup: CC 1.3: 0.99; CC 2.0: 1.83; CC 3.0: 1.64
Performance: no change for CC 1.3, where the compiler automatically uses the less precise RSQRTF since most FP operations are not fully precise anyhow. 83% speedup over the entire application for CC 2.0: its compiler defaults to precise instructions, and the explicit use of RSQRTF indicates that imprecision is okay.
Using 2 Loops to Avoid If Statement
“if (i != j)” causes thread divergence. Breaking the loop into two loops avoids the if statement.

Before:

for (j = 0; j < nbodies; j++) {
  if (i != j) {
    dx = body[j].posx - px;  dy = body[j].posy - py;  dz = body[j].posz - pz;
    dsq = dx*dx + dy*dy + dz*dz;
    dinv = rsqrtf(dsq + epssq);
    scale = body[j].mass * dinv * dinv * dinv;
    ax += dx * scale;  ay += dy * scale;  az += dz * scale;
  }
}

After:

for (j = 0; j < i; j++) {
  dx = body[j].posx - px;  dy = body[j].posy - py;  dz = body[j].posz - pz;
  dsq = dx*dx + dy*dy + dz*dz;
  dinv = rsqrtf(dsq + epssq);
  scale = body[j].mass * dinv * dinv * dinv;
  ax += dx * scale;  ay += dy * scale;  az += dz * scale;
}
for (j = i+1; j < nbodies; j++) {
  dx = body[j].posx - px;  dy = body[j].posy - py;  dz = body[j].posz - pz;
  dsq = dx*dx + dy*dy + dz*dz;
  dinv = rsqrtf(dsq + epssq);
  scale = body[j].mass * dinv * dinv * dinv;
  ax += dx * scale;  ay += dy * scale;  az += dz * scale;
}
Loop Duplication Speedup
Problem size: n=100000, step=1 (CC 1.3 and 2.0); n=300000, step=1 (CC 3.0)
Speedup: CC 1.3: 0.55; CC 2.0: 1.00; CC 3.0: 1.00
Performance: no change for 2.0 & 3.0 (the divergence is merely moved to the loop); 45% slowdown for CC 1.3 (unclear why).
Discussion: not a useful optimization; it causes code bloat, and a little divergence is okay (only 1 in 3125 iterations).
Blocking using Shared Memory
The code is memory bound: each warp streams in all bodies' masses and positions.
Block the inner loop: read a block of mass & position info into shared memory. This requires barriers (fast hardware barrier within an SM).
Advantage: a lot fewer main memory accesses, and the remaining main memory accesses are fully coalesced (due to the use of scalar arrays).
Blocking using Shared Memory

__shared__ float posxs[THREADS], posys[THREADS], poszs[THREADS], masss[THREADS];
j = 0;
for (j1 = 0; j1 < nbodiesd; j1 += THREADS) {  // first part of loop
  idx = tid + j1;
  if (idx < nbodiesd) {  // each thread copies 4 words (fully coalesced)
    posxs[tid] = posxd[idx];
    posys[tid] = posyd[idx];
    poszs[tid] = poszd[idx];
    masss[tid] = massd[idx];
  }
  __syncthreads();  // wait for all copying to be done
  bound = min(nbodiesd - j1, THREADS);
  for (j2 = 0; j2 < bound; j2++, j++) {  // second part of loop
    if (i != j) {
      dx = posxs[j2] - px;  dy = posys[j2] - py;  dz = poszs[j2] - pz;
      dsq = dx*dx + dy*dy + dz*dz;
      dinv = rsqrtf(dsq + epssqd);
      scale = masss[j2] * dinv * dinv * dinv;
      ax += dx * scale;  ay += dy * scale;  az += dz * scale;
    }
  }
  __syncthreads();  // wait for all force calculations to be done
}
Blocking Speedup
Problem size: n=100000, step=1 (CC 1.3 and 2.0); n=300000, step=1 (CC 3.0)
Speedup: CC 1.3: 3.7; CC 2.0: 1.1; CC 3.0: 1.6
Performance: great speedup for CC 1.3; some speedup for the others, which have a hardware data cache.
Discussion: a very important optimization for memory-bound code, even with an L1 cache.
Loop Unrolling
The CUDA compiler is generally good at unrolling loops with fixed bounds, but it does not unroll the inner loop of our example code. Use a pragma to unroll (and pad the arrays):

#pragma unroll 8
for (j2 = 0; j2 < bound; j2++, j++) {
  if (i != j) {
    dx = posxs[j2] - px;  dy = posys[j2] - py;  dz = poszs[j2] - pz;
    dsq = dx*dx + dy*dy + dz*dz;
    dinv = rsqrtf(dsq + epssqd);
    scale = masss[j2] * dinv * dinv * dinv;
    ax += dx * scale;  ay += dy * scale;  az += dz * scale;
  }
}
Loop Unrolling Speedup
Problem size: n=100000, step=1 (CC 1.3 and 2.0); n=300000, step=1 (CC 3.0)
Speedup: CC 1.3: 1.07; CC 2.0: 1.16; CC 3.0: 1.07
Performance: noticeable speedup on all three GPUs.
Discussion: can be useful, but may increase register usage, which may lower the maximum number of threads per block and result in a slowdown.
CC 2.0 Absolute Performance
Problem size: n=100000, step=1
Runtime: 612 ms
FP operations: 326.7 GFlop/s
Main mem throughput: 1.035 GB/s

Not peak performance: only 32% of 1030 GFlop/s, but that peak assumes an FMA every cycle.
The inner loop executes 3 sub (1c), 3 fma (1c), 1 rsqrt (8c), 3 mul (1c), 3 fma (1c) = 20 cycles for 20 Flop, i.e., 63% of the realistic peak of 515.2 GFlop/s (which assumes no non-FP operations).
Counting the int operations as well gives 31 cycles for 20 Flop: 99% of the actual peak of 330.45 GFlop/s.
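A short worked check of the realistic-peak figure, assuming (as the cycle counts above imply) one FP instruction per PE per cycle on the C2050's 448 PEs at 1.15 GHz:

  448 PEs x 1.15 GHz x (20 Flop / 20 cycles) = 515.2 GFlop/s
  515.2 GFlop/s x (20 cycles / 31 cycles) ≈ 332 GFlop/s

The slide's 330.45 GFlop/s "actual peak" presumably reflects a slightly different instruction count; the arithmetic above is only meant to show where the numbers come from.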
Eliminating the If Statement
Algorithmic optimization: the softening parameter (epssqd) already avoids division by zero. For i == j, all deltas are zero, so the contribution is zero anyway, meaning the if statement is not necessary and can be removed. This eliminates thread divergence.

for (j2 = 0; j2 < bound; j2++, j++) {
  // the "if (i != j)" test is removed
  dx = posxs[j2] - px;  dy = posys[j2] - py;  dz = poszs[j2] - pz;
  dsq = dx*dx + dy*dy + dz*dz;
  dinv = rsqrtf(dsq + epssqd);
  scale = masss[j2] * dinv * dinv * dinv;
  ax += dx * scale;  ay += dy * scale;  az += dz * scale;
}
If Elimination Speedup
Problem size: n=100000, step=1 (CC 1.3 and 2.0); n=300000, step=1 (CC 3.0)
Speedup: CC 1.3: 1.38; CC 2.0: 1.54; CC 3.0: 1.64
Performance: large speedup on all three GPUs.
Discussion: no thread divergence, and the compiler can schedule the code much better.
Rearranging Terms
The generated code is suboptimal: the compiler does not emit as many fused multiply-add (FMA) instructions as it could. Rearranging the terms in expressions helps the compiler; one needs to check the generated assembly code.

for (j2 = 0; j2 < bound; j2++, j++) {
  dx = posxs[j2] - px;  dy = posys[j2] - py;  dz = poszs[j2] - pz;
  dsq = dx*dx + (dy*dy + (dz*dz + epssqd));  // was: dsq = dx*dx + dy*dy + dz*dz;
  dinv = rsqrtf(dsq);                        // was: dinv = rsqrtf(dsq + epssqd);
  scale = masss[j2] * dinv * dinv * dinv;
  ax += dx * scale;  ay += dy * scale;  az += dz * scale;
}
FMA Speedup
Problem size: n=100000, step=1 (CC 1.3 and 2.0); n=300000, step=1 (CC 3.0)
Speedup: CC 1.3: 1.03; CC 2.0: 1.05; CC 3.0: 1.06
Performance: small speedup on all three GPUs.
Discussion: seemingly needless transformations can make a difference.
Higher Unroll Factor
Problem size: n=100000, step=1 (CC 1.3 and 2.0); n=300000, step=1 (CC 3.0)
Unroll 128 times: avoids looping overhead, now that there are no ifs.
Speedup: CC 1.3: 1.01; CC 2.0: 1.04; CC 3.0: 0.93
Performance: little speedup or slowdown.
Discussion: carefully choose the unroll factor (manually tune).
Compiler Flags
Problem size: n=100000, step=1 (CC 1.3 and 2.0); n=300000, step=1 (CC 3.0)
-use_fast_math: “-ftz=true” suffices (flush denormals to zero); makes SP FP operations faster except on CC 1.3.
Speedup: CC 1.3: 1.00; CC 2.0: 1.18; CC 3.0: 1.15
Performance: significant speedup.
Discussion: use faster but less precise operations when prudent.
Final Absolute Performance
CC 2.0 Fermi GTX 480:
  Problem size: n=100000, step=1
  Runtime: 296.1 ms
  FP operations: 675.6 GFlop/s (SP), 66% of peak performance; 261.1 GFlop/s (DP)
  Main mem throughput: 2.139 GB/s
CC 3.0 Kepler GTX 680:
  Problem size: n=300000, step=1
  Runtime: 1073 ms
  FP operations: 1677.6 GFlop/s (SP), 54% of peak performance; 88.7 GFlop/s (DP)
  Main mem throughput: 5.266 GB/s
Outline
Introduction, GPU programming, N-body example, Porting and tuning, Other considerations, Conclusions
Things to Consider
Minimize PCIe transfers: implementing the entire algorithm on the GPU, even some slow serial code sections, might be an overall win. Data can also be streamed to/from the GPU while computing (see the sketch below).
Locks and synchronization: lightweight locks & fast barriers are possible within an SM but slow across different SMs. The L1 data caches are not coherent; use volatile & fences to avoid deadlocks.
Warp-based Execution

// wrong on GPU, correct on CPU
do {
  cnt = 0;
  if (ready[i] != 0) cnt++;
  if (ready[j] != 0) cnt++;
} while (cnt < 2);
ready[k] = 1;

// correct
do {
  cnt = 0;
  if (ready[i] != 0) cnt++;
  if (ready[j] != 0) cnt++;
  if (cnt == 2) ready[k] = 1;
} while (cnt < 2);

Problem: thread divergence. Threads that exit the loop wait for the other threads in the warp to also exit, so “ready[k] = 1” is not executed until all threads in the warp are done with the loop. This can deadlock.
Hybrid Execution
The CPU is always needed for program launch and I/O, and it is much faster on serial program segments.
If the GPU is 10 times faster than the CPU on parallel code, running 10% of the problem on the CPU is hardly worthwhile: it complicates programming and requires data transfers, and the best CPU data structure is often not the best one for the GPU.
PCIe bandwidth is much lower than GPU memory bandwidth (1.6 to 6.5 GB/s versus 192 GB/s), but data can be sent while the CPU & GPU are computing. Merging the CPU and GPU onto the same die (e.g., AMD's Fusion APU) makes finer-grain switching possible.
Outline
Introduction, GPU programming, N-body example, Porting and tuning, Other considerations, Conclusions
Summary and Conclusions (Part I)
Step-by-step porting and tuning of CUDA code, using an n-body simulation as the example.
GPUs have very powerful hardware, but it is only exploitable with some codes, and it is even harder to program and optimize for than CPU hardware.
Parallelizing and Optimizing Programs for GPU Acceleration using CUDA (Part II)
Martin Burtscher
Department of Computer Science
Mapping Regular Code to GPUs
Regular codes operate on array- and matrix-based data structures, exhibit mostly strided memory access patterns, have relatively predictable control flow (control flow behavior is mainly determined by the input size), and consist of largely independent computations.
Many regular codes are easy to port to GPUs, e.g., matrix codes executing many ops/word: dense matrix operations (level 2 and 3 BLAS) and stencil codes (PDE solvers).
Mapping Irregular Code to GPUs
Irregular codes build, traverse, and update dynamic data structures (trees, graphs, linked lists, priority queues, etc.), exhibit pointer-chasing memory access patterns, and have complex control flow (control flow behavior depends on input values and changes dynamically).
Many important scientific programs are irregular: e.g., n-body simulation, data clustering, SAT solving, social networks, discrete-event simulation, meshing, …
We need case studies on how to best map irregular codes.
Example: N-body Simulation
The irregular Barnes Hut algorithm repeatedly builds an unbalanced tree and performs complex traversals on it.
Our implementation was designed for GPUs (not just a port of CPU code) and is the first GPU implementation of the entire BH algorithm.
Results: the GPU is 21 times faster than the CPU (6 cores) on this code.
Outline (Part II)
Introduction, Barnes Hut algorithm, CUDA implementation, Experimental results, Conclusions
Barnes Hut Idea
Precise force calculation requires O(n^2) operations (O(n^2) body pairs), which is computationally intractable for large n.
Barnes and Hut (1986) devised an algorithm to approximately compute the forces; the bodies' initial positions & velocities are also approximate. It requires only O(n log n) operations. The idea is to “combine” far-away bodies; the error should be small because force ∝ 1/distance^2.
Barnes Hut Algorithm
Set the bodies' initial positions and velocities, then iterate over time steps:
1. Compute the bounding box around the bodies
2. Subdivide space until there is at most one body per cell; record this spatial hierarchy in an octree
3. Compute the mass and center of mass of each cell
4. Compute the force on the bodies by traversing the octree; stop a traversal path when encountering a leaf (body) or an internal node (cell) that is far enough away
5. Update each body's position and velocity
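Not spelled out on the slide, but standard in Barnes Hut: “far enough away” is usually decided by an opening-angle test. A cell of side length s whose center of mass is at distance d from the body is used as a single pseudo-body if

  s / d < θ

where θ is a user-chosen accuracy parameter (often around 0.5); otherwise the traversal descends into the cell's children.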
Build Tree (Level 1)
Compute the bounding box around all bodies → tree root.
[Figure: scattered bodies enclosed by one bounding box.]
Build Tree (Level 2)
Subdivide space until there is at most one body per cell.
[Figure: the root cell split into four quadrants.]
Build Tree (Level 3)
Subdivide space until there is at most one body per cell.
[Figure: crowded quadrants split again.]
Build Tree (Level 4)
Subdivide space until there is at most one body per cell.
[Figure: further subdivision of the remaining crowded cells.]
Build Tree (Level 5)
Subdivide space until there is at most one body per cell.
[Figure: final subdivision; every cell now contains at most one body.]
Compute Cells’ Center of Mass
For each internal cell, compute the sum of the masses and the weighted average of the positions of all bodies in its subtree; the example shows two cells only.
[Figure: two internal cells annotated with their centers of mass.]
Compute Forces
Compute the force acting upon, for example, the green body.
[Figure: one body highlighted in the spatial decomposition.]
Compute Force (short distance)
Scan the tree depth first from left to right; the green portion is already completed.
[Figure: traversal progress highlighted in the decomposition.]
Compute Force (down one level)
The red center of mass is too close, so the traversal needs to go down one level.
[Figure: the too-close cell is opened and its children are visited.]
Compute Force (long distance)
The blue center of mass is far enough away.
[Figure: the far-away cell's center of mass is used directly.]
Compute Force (skip subtree)
Therefore, the entire subtree rooted in the blue cell can be skipped.
[Figure: the skipped subtree of the spatial decomposition.]
Pseudocode

bodySet = ...
foreach timestep do {
  bounding_box = new Bounding_Box();
  foreach Body b in bodySet {
    bounding_box.include(b);
  }
  octree = new Octree(bounding_box);
  foreach Body b in bodySet {
    octree.Insert(b);
  }
  cellList = octree.CellsByLevel();
  foreach Cell c in cellList {
    c.Summarize();
  }
  foreach Body b in bodySet {
    b.ComputeForce(octree);
  }
  foreach Body b in bodySet {
    b.Advance();
  }
}
Complexity and Parallelism

bodySet = ...
foreach timestep do {                  // O(n log n) + ordered sequential
  bounding_box = new Bounding_Box();
  foreach Body b in bodySet {          // O(n) parallel reduction
    bounding_box.include(b);
  }
  octree = new Octree(bounding_box);
  foreach Body b in bodySet {          // O(n log n) top-down tree building
    octree.Insert(b);
  }
  cellList = octree.CellsByLevel();
  foreach Cell c in cellList {         // O(n) + ordered bottom-up traversal
    c.Summarize();
  }
  foreach Body b in bodySet {          // O(n log n) fully parallel
    b.ComputeForce(octree);
  }
  foreach Body b in bodySet {          // O(n) fully parallel
    b.Advance();
  }
}
Outline (Part II)
Introduction, Barnes Hut algorithm, CUDA implementation, Experimental results, Conclusions
Efficient GPU Code
Large amounts of data parallelism; coalesced main memory accesses; little thread divergence; relatively little synchronization between blocks; little CPU/GPU data transfer; efficient use of shared memory.
Main BH Implementation Challenges
Uses an irregular tree-based data structure: initially little parallelism, little coalescing, load imbalance.
Complex recursive traversals: recursion is not well supported; lots of thread divergence.
Memory-bound pointer-chasing operations: not enough computation to hide the latency.
Six GPU Kernels

Read initial data and transfer to GPU
for each timestep do {
  1. Compute bounding box around bodies (not irregular)
  2. Build hierarchical decomposition, i.e., octree
  3. Summarize body information in internal octree nodes
  4. Approximately sort bodies by spatial location (optional)
  5. Compute forces acting on each body with help of octree
  6. Update body positions and velocities (not irregular)
}
Transfer result from GPU and output
Global Optimizations
Make the code iterative (recursion is not supported*).
Keep data on the GPU between kernel calls.
Use array elements instead of heap nodes, with one aligned array per field for coalesced accesses.
[Figure: objects on heap → objects in array → fields in arrays.]
Global Optimizations (cont.)
Maximize the thread count (round down to warp size). Maximize the resident block count (fill all SMs). Pass kernel parameters through constant memory. Use a special allocation order: bodies occupy fixed slots at the front of the arrays (b0, b1, ..., ba) while cells (c5, c4, ..., c0) are allocated from the back toward the front. Alias arrays (56 B/node). Use index arithmetic. Use persistent blocks & threads. Unroll loops over children. A sketch of the resulting index-based node layout follows below.
[Figure: one array holding bodies from the front and cells from the back, with the cell allocation direction reversed.]
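A minimal sketch of what the index-based node layout might look like (field and helper names are illustrative, not the tutorial's actual code): bodies occupy indices 0..nbodies-1, cells share the same arrays from the top down, and child “pointers” are plain integers.

__constant__ float *massd, *posxd, *posyd, *poszd;  // one array per field, shared by bodies & cells
__constant__ int   *childd;                         // 8 child indices per cell; -1 means null

__device__ void VisitChildren(int cell, int nbodies)
{
  #pragma unroll 8
  for (int k = 0; k < 8; k++) {
    int ch = childd[cell * 8 + k];  // index arithmetic instead of pointer chasing
    if (ch >= 0) {
      if (ch < nbodies) {
        // ch is a body (leaf)
      } else {
        // ch is an internal cell
      }
    }
  }
}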
Kernel 1: Bounding Box (Regular)
A reduction operation over the body positions.
[Figure: warps reduce values through shared memory in several barrier-separated rounds; per-block results are then combined through main memory.]
Optimizations: fully coalesced, fully cached, no bank conflicts, minimal divergence, built-in min and max, 2 red/mem, 6 red/bar, bodies load balanced, 512*3 threads per SM. A hedged sketch of such a reduction follows below.
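A minimal sketch of a shared-memory min-reduction in the spirit of this kernel (names are illustrative; the real kernel tracks min and max in all three dimensions at once; requires <cfloat> for FLT_MAX and a power-of-two block size):

__global__ void BoundingBoxKernel(int nbodies, const float *posxd, float *minxd)
{
  __shared__ float sminx[512];                    // one slot per thread in the block
  int tid = threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  float val = FLT_MAX;
  for (int i = tid + blockIdx.x * blockDim.x; i < nbodies; i += stride)
    val = fminf(val, posxd[i]);                   // grid-stride loop: load balanced, coalesced
  sminx[tid] = val;
  __syncthreads();

  for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
    if (tid < s) sminx[tid] = fminf(sminx[tid], sminx[tid + s]);
    __syncthreads();
  }
  if (tid == 0) minxd[blockIdx.x] = sminx[0];     // per-block partial result; combined in a final step
}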
Kernel 2: Build Octree (Irregular)
Top-down tree building.
Optimizations: only lock leaf “pointers”; lock-free fast path; lightweight lock release; no re-traversal after a lock-acquire failure; combined memory fence; re-compute the position during traversal; separate init kernels; 512*3 threads per SM.
Kernel 2: Build Octree (cont.)

// initialize
cell = find_insertion_point(body);               // no locks, cache the cell
child = get_insertion_index(cell, body);
if (child != locked) {                           // skip atomic if already locked
  if (child == null) {                           // fast path (frequent)
    if (null == atomicCAS(&cell[child], null, body)) {  // lock-free insertion
      // move on to next body
    }
  } else {
    if (child == atomicCAS(&cell[child], child, lock)) { // acquire lock
      // build subtree with new and existing body
      flag = true;
    }
  }
}
__syncthreads();    // optional barrier
__threadfence();    // make data visible
if (flag) {
  cell[child] = new_subtree;  // insert subtree and release lock
  // move on to next body
}
Kernel 3: Summarize Subtrees (Irregular)
Bottom-up tree traversal.
[Figure: cells are scanned in the direction opposite to the allocation direction, so children are visited before their parents.]
Optimizations: the scan order avoids deadlock; use the mass as a ready flag + fence (no locks, no atomics); use a wait-free first pass; cache the ready info; piggyback counting the bodies in subtrees on the traversal; no parent “pointers”; 128*6 threads per SM. A hedged sketch of the mass-as-flag idea follows below.
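A heavily hedged sketch of the mass-as-flag idea (illustrative names; assumes cell masses are initialized to a negative sentinel so that a non-negative mass means “this child is already summarized”):

// body of the per-cell retry loop in the summarization kernel
float cm = 0.0f, px = 0.0f;
bool ready = true;
for (int k = 0; k < 8; k++) {
  int ch = childd[cell * 8 + k];
  if (ch >= 0) {
    float m = massd[ch];                      // volatile read in the real code
    if (m < 0.0f) { ready = false; break; }   // child not summarized yet: retry later
    cm += m;  px += m * posxd[ch];            // accumulate mass and weighted position (x shown)
  }
}
if (ready) {
  posxd[cell] = px / cm;   // center of mass
  __threadfence();         // make the position visible before the flag
  massd[cell] = cm;        // writing a non-negative mass releases the “flag”
}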
Kernel 4: Sort Bodies (Irregular)
Top-down tree traversal.
Optimizations (similar to Kernel 3): the scan order avoids deadlock; use a data field as the flag (no locks, no atomics); use the counts from Kernel 3; piggyback on the traversal; move nulls to the back; throttle warps with an optional barrier; 64*6 threads per SM.
[Figure: same allocation-direction vs. scan-direction diagram as for Kernel 3.]
Kernel 5: Force Calculation (Irregular)
Multiple prefix traversals.
Optimizations: group similar work together, using the sorting to minimize the size of the prefix union in each warp; early out (nulls are in the back); traverse the whole union to avoid divergence (warp voting); lane 0 controls the iteration stack for the entire warp (it fits in shared memory); minimize volatile accesses; use the fast 1/sqrtf instruction; cache tree-level-based data; 256*5 threads per SM.
Architectural Support
Coalesced memory accesses & lockstep execution: all threads in a warp read the same tree node at the same time, so only one memory access per warp is needed instead of 32.
Warp-based execution: enables data sharing within warps without synchronization.
RSQRTF instruction: quickly computes a good approximation of 1/sqrtf(x).
Warp voting instructions: quickly perform reduction operations within a warp (see the sketch below).
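A minimal sketch of warp voting in the style of the force-calculation kernel (Fermi/Kepler-era intrinsic; newer CUDA versions use __all_sync; the condition is illustrative):

// every thread tests whether the current cell is far enough away;
// __all() is true only if the predicate holds for ALL threads in the warp
if (__all(dsq >= cutoff_sq)) {
  // the whole warp agrees: use this cell's center of mass and skip its subtree
} else {
  // at least one thread must descend, so the entire warp descends together,
  // avoiding divergence inside the traversal
}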
Kernel 6: Advance Bodies (Regular)
Straightforward streaming.
[Figure: warps streaming body data from main memory.]
Optimizations: fully coalesced, no divergence, load balanced, 1024*1 threads per SM.
Outline (Part II)
Introduction, Barnes Hut algorithm, CUDA implementation, Experimental results, Conclusions
[Figure: log-log plot of runtime per timestep [s] vs. number of bodies (10,000 to 10,000,000) for CPUbh, GPUbh, and GPUsq.]
Evaluation Methodology
Implementations: CUDA/GPU: Barnes Hut and O(n^2) algorithms; OpenMP/CPU: Barnes Hut algorithm (derived from CUDA); Pthreads/CPU: Barnes Hut algorithm (SPLASH-2 suite).
Systems and compilers: nvcc 4.0 (-O3 -arch=sm_20 -ftz=true*) on a GeForce GTX 480, 1.4 GHz, 15 SMs, 32 cores per SM; gcc 4.1.2 (-O3 -fopenmp* -ffast-math*) on a Xeon X5690, 3.46 GHz, 6 cores, 2 threads per core.
Inputs and metric: 5k, 50k, 500k, and 5M star clusters (Plummer model); best runtime of three experiments, excluding I/O.
Nodes Touched per Activity (5M Input)
Kernel “activities”: K1: pair reduction; K2: tree insertion; K3: bottom-up step; K4: top-down step; K5: prefix traversal; K6: integration step.
Max tree depth ≤ 22; cells have 3.1 children on average.
Prefix ≤ 6,315 nodes (≤ 0.1% of 7.4 million).
The BH algorithm & sorting to minimize the prefix union work well.

Neighborhood size (nodes touched):
             min      avg     max
  kernel 1     1      2.0       2
  kernel 2     2     13.2      22
  kernel 3     2      4.1       9
  kernel 4     2      4.1       9
  kernel 5   818  4,117.0   6,315
  kernel 6     1      1.0       1
Available Amorphous Data Parallelism
Almost every “round” has lots of activities without data dependencies that can be processed in parallel.
[Figure: available parallelism (log scale, 1 to 10,000,000) per kernel and round: bounding box (k1.1 to k1.23), tree building (k2.1 to k2.19), summarization (k3.1 to k3.20), sorting (k4.1 to k4.20), force calc. (k5.1), and integration (k6.1), for 5,000, 50,000, 500,000, and 5,000,000 bodies.]
Runtime Comparison
GPU BH inefficiency: the 5k input is too small for 5,760 to 23,040 threads.
BH vs. O(n^2) algorithm: the O(n^2) version is faster with fewer than about 15k bodies.
GPU (5M input): 21.1x faster than OpenMP, 23.2x faster than Pthreads.
[Figure: runtime per timestep [ms] vs. number of bodies (5,000 to 5,000,000) for GPU CUDA, GPU O(n^2), CPU OpenMP, and CPU Pthreads.]
Kernel Performance for 5M Input
A $200 GPU delivers 228 GFlop/s on this irregular code. The GPU chip is 2.7 to 23.5 times faster than the CPU chip. The GPU hardware is better suited for BH than the CPU hardware, but it is difficult and very time consuming to program.

GTX 480, single precision:
                kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6  BarnesHut    O(n^2)
  GFlop/s           71.6       5.8       2.5       n/a     240.6      33.5      228.4     897.0
  GB/s             142.9      26.8      10.6      12.8       8.0     133.9        8.8       2.8
  runtime [ms]       0.4      44.6      28.0      14.2    1641.2       2.2     1730.6  557421.5

Runtime [ms] per kernel, CPU vs. GPU:
               non-compliant fast single-precision      IEEE 754-compliant double-precision
                 k1     k2     k3    k4        k5   k6    k1     k2     k3    k4        k5   k6
  X5690 CPU     5.5  185.7   75.8  52.1  38,540.3 16.4  10.3  193.1  101.0  51.6  47,706.4 33.1
  GTX 480 GPU   0.4   44.6   28.0  14.2   1,641.2  2.2   0.8   46.7   31.0  14.2   5,177.1  4.2
  CPU/GPU      13.1    4.2    2.7   3.7      23.5  7.3  12.7    4.1    3.3   3.6       9.2  7.9
Kernel Speedups
Optimizations that are generally applicable:

             avoid volatile  rsqrtf instr.  recalc. data  thread voting  full multithreading
  50,000          1.14x          1.43x         0.99x          2.04x           20.80x
  500,000         1.19x          1.47x         1.32x          2.49x           27.99x
  5,000,000       1.18x          1.46x         1.69x          2.47x           28.85x

Optimizations for irregular kernels:

             throttling barrier  wait-free pre-pass  combined mem fence  sorting of bodies  sync'ed execution
  50,000           0.97x               1.02x               1.54x              3.60x              6.23x
  500,000          1.03x               1.21x               1.57x              6.28x              8.04x
  5,000,000        1.04x               1.31x               1.50x              8.21x              8.60x
Outline (Part II)
Introduction, Barnes Hut algorithm, CUDA implementation, Experimental results, Conclusions
Optimization Summary
Reduce main memory accesses: share data within a warp, combine memory fences & traversals, re-compute data, avoid volatile accesses.
Minimize thread divergence: group similar work together, force synchronicity.
Implement the entire algorithm on and for the GPU: avoid data transfers & data structure inefficiencies; use a wait-free pre-pass; scan the entire prefix union.
Optimization Summary (cont.)
Exploit hardware features: fast synchronization & thread startup, special instructions, coalesced memory accesses, even lockstep execution.
Use lightweight locking and synchronization: minimize locks, reuse fields, and use fence + store operations.
Maximize parallelism: parallelize every step within and across SMs.
CPU/GPU Implementation Comparison
Irregular CPU code: dynamically (incrementally) allocated shared data structures; structure-based shared data structures; logical lock-based implementation; global/local worklists; recursive or iterative implementation.
Irregular GPU code: statically (wholly) allocated shared data structures; multiple-array-based shared data structures; lock-free implementation; (implicit) local worklists; iterative implementation.
Useful GPU Hardware Features
Wide parallelism: great for exploiting large amounts of parallelism.
Massive multithreading: ideal for hiding the latency of irregular memory accesses.
Fast thread startup: essential when launching thousands of threads.
Shared memory: fast data sharing; useful for local worklists.
HW support for reduction and synchronization: makes otherwise costly operations very fast.
Coalesced accesses: memory access combining is useful in irregular codes.
Lockstep execution: can share data without explicit synchronization; allows iteration stacks to be consolidated.
Challenges with GPUs
Warp-based execution: often requires sorting of work or an algorithm change.
Data structure layout: the best layout for the CPU differs from the best layout for the GPU; SoA can be tedious to code and deal with (parameter passing).
Separate memory space: slow transfers; pack/unpack data.
Incoherent L1 caches: may need to explicitly manage data (fences).
Poor recursion support: need to make the code iterative and maintain explicit iteration stacks.
Thread and block counts: the hierarchy complicates the implementation; optimal counts have to be (auto-)tuned.
Running Irregular Algorithms on GPUs
Mandatory: needs vast amounts of data parallelism; can do large chunks of computation on the GPU.
Very important: cautious implementation; data structures can be expressed through fixed arrays; uses local worklists that can be statically populated.
Important: scheduling is independent of previous activities; easy to sort activities by similarity (if needed).
Beneficial: easy to express iteratively; has a statically known range of neighborhoods; the data structure size (or a bound) can be determined based on the input.
Conclusions
Irregularity does not necessarily prevent high performance on GPUs: the entire Barnes Hut algorithm was implemented on the GPU, building and traversing an unbalanced octree, and the GPU is 21.1 times (float) and 9.1 times (double) faster than a high-end 6-core Xeon.
Code directly for the GPU; do not merely adjust CPU code. This requires different data and code structures and benefits from different algorithmic modifications.
Acknowledgments
Hardware: NVIDIA Corp. and Intel Corp.
Funding: NVIDIA Corp. and Texas State University
OpenMP code: Ricardo Alves (Universidade do Minho, Portugal)
Collaborator: Keshav Pingali (University of Texas at Austin)
CUDA Optimization Tutorial
Martin Burtscher
[email protected]
http://www.cs.txstate.edu/~burtscher/
Barnes Hut CUDA code: http://www.gpucomputing.net/?q=node/1314
Tutorial slides: http://www.cs.txstate.edu/~burtscher/tutorials/COT5/slides.pptx