GPU PROGRAMMING
Assignment 4
• Consists of two programming assignments
  • Concurrency
  • GPU programming
• Requires a computer with a CUDA/OpenCL/DirectCompute-compatible GPU
• Due Jun 07
• We have no final exams
GPU Resources
• Download the CUDA toolkit from the web
• Very good textbook: Programming Massively Parallel Processors by Wen-mei Hwu and David Kirk
• Available at http://courses.engr.illinois.edu/ece498/al/Syllabus.html
Acknowledgments
• Slides and material from Wen-mei Hwu (UIUC) and David Kirk (NVIDIA)
Why GPU Programming
• More processing power + higher memory bandwidth
• GPU in every PC and workstation – massive volume and potential impact

Current CPU
• 4 cores, each with 4-float-wide SIMD, 3 GHz
• 48-96 GFlops, 2x HyperThreaded
• 64 kB L1 cache per core, 20 GB/s to memory
• $200, 200 W
[Figure: four CPU cores sharing an L2 cache]
Current GPU
• 32 cores, each 32-float-wide SIMD, 1 GHz
• 1 TeraFlop, 32x "HyperThreaded"
• 64 kB L1 cache per core, 150 GB/s to memory
• $200, 200 W
[Figure: 32 SIMD cores sharing an L2 cache]
Bandwidth and Capacity
• CPU (~50 GFlops) ↔ CPU RAM (4-6 GB): ~10 GB/s
• GPU (~1 TFlop) ↔ GPU RAM (1 GB): ~100 GB/s
• CPU ↔ GPU: ~1 GB/s
• All values are approximate
CUDA
• "Compute Unified Device Architecture"
• General-purpose programming model
  • User kicks off batches of threads on the GPU
  • GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
  • Compute-oriented drivers, language, and tools
• Driver for loading computation programs onto the GPU
Languages with Similar Capabilities
• CUDA, OpenCL, DirectCompute
• You are free to use any of the above for assignment 4
• I will focus on CUDA for the rest of the lecture
• The same abstractions are present in all three, with different (and confusing) names
CUDA Programming Model
• The GPU is a compute device that:
  • Is a coprocessor to the CPU (the host)
  • Has its own DRAM (device memory)
  • Runs many threads in parallel
• GPU program = kernel
• Differences between GPU and CPU threads
  • GPU threads are extremely lightweight – very little creation overhead
  • GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
A CUDA Program
1. Host performs some CPU computation
2. Host copies input data into the device
3. Host instructs the device to execute a kernel
4. Device executes the kernel and produces results
5. Host copies the results back from the device
6. Go to step 1
A minimal end-to-end sketch of this flow appears below.
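The following is a minimal sketch of the six steps, not from the original slides: a hypothetical kernel (ScaleKernel) that doubles an array on the GPU. All names and sizes are illustrative.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void ScaleKernel(float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) data[tid] *= 2.0f;                   // step 4: device computes
}

int main() {
    const int n = 1024;
    const int size = n * sizeof(float);
    float h[n];
    for (int i = 0; i < n; i++) h[i] = (float)i;      // step 1: CPU computation

    float* d;
    cudaMalloc((void**)&d, size);
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);   // step 2: copy input in
    ScaleKernel<<<n / 256, 256>>>(d, n);              // step 3: launch kernel
    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);   // step 5: copy results back
    cudaFree(d);
    printf("h[10] = %f\n", h[10]);                    // prints 20.0
    return 0;
}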
CUDA Kernel is an SPMD Program
• SPMD = Single Program Multiple Data
• All threads run the same code
• Each thread uses its id to
  • Operate on different memory addresses
  • Make control decisions

Kernel:
  …
  i = input[tid];
  o = f(i);
  output[tid] = o;
  …
CUDA Kernel is an SPMD Program (cont.)
• All threads run the same code
• Each thread uses its id to
  • Operate on different memory addresses
  • Make control decisions
• Difference with SIMD
  • Threads can execute different control flow
  • At a performance cost

Kernel:
  …
  i = input[tid];
  if (i % 2 == 0) o = f(i);
  else            o = g(i);
  output[tid] = o;
  …

A concrete CUDA version of this kernel is sketched below.
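A minimal sketch of the even/odd kernel above; f and g are stand-ins (doubling and negating here are assumed examples, not from the slides).

__global__ void EvenOdd(const int* input, int* output) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = input[tid];
    int o;
    if (i % 2 == 0) o = i * 2;   // f(i): assumed example
    else            o = -i;      // g(i): assumed example
    output[tid] = o;             // threads of one warp that take different
                                 // arms are serialized, as noted above
}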
Threads Organization
• Kernel threads = grid of thread blocks (1D or 2D)
• Thread block = array of threads (1D, 2D, or 3D)
• Simplifies memory addressing for multidimensional data
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid contains Blocks (0,0), (1,0), (0,1), (1,1), and each block contains Threads (0,0), (1,0), (0,1), (1,1)]
A sketch of computing a global index from these coordinates follows.
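A minimal sketch (assumed example, not from the slides) of how a thread in a 2D grid of 2D blocks derives the matrix element it owns from its built-in coordinates:

__global__ void Index2D(float* out, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // global x coordinate
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // global y coordinate
    if (row < width && col < width)
        out[row * width + col] = (float)(row * width + col);
}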
Threads within a Block
• Execute in lock step
• Can share memory
• Can synchronize with each other
[Figure: a CUDA thread block with thread ids 0, 1, 2, 3, …, m, all running the same thread program. Courtesy: John Nickolls, NVIDIA]
A small sketch using these features follows.
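A minimal sketch (assumed example): threads of one 256-thread block reverse an array using shared memory and a barrier, exercising both sharing and synchronization.

__global__ void ReverseBlock(float* data) {
    __shared__ float tmp[256];
    int t = threadIdx.x;
    tmp[t] = data[t];          // each thread loads one element
    __syncthreads();           // wait until the whole block has loaded
    data[t] = tmp[255 - t];    // safely read an element loaded by another thread
}
// Launch as ReverseBlock<<<1, 256>>>(d_data);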
CUDA Function Declarations

                                    Executed on the:   Only callable from the:
__device__ float DeviceFunc()       device             device
__global__ void  KernelFunc()       device             host
__host__   float HostFunc()         host               host

• __global__ defines a kernel function; must return void
• __device__ and __host__ can be used together
A small example combining these qualifiers follows.
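A minimal sketch of combining the qualifiers (Square and SquareAll are illustrative names): one function compiled for both host and device, plus a kernel that calls the device version.

__host__ __device__ float Square(float x) { return x * x; }

__global__ void SquareAll(float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) data[tid] = Square(data[tid]);  // device-side call
}
// On the host, Square(3.0f) calls the CPU-compiled version of the same function.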
CUDA Function Declarations (cont.)
• __device__ functions cannot have their address taken
• For functions executed on the device:
  • No recursion
  • No static variable declarations inside the function
  • No variable number of arguments
Putting it all together

__global__ void KernelFunc(…);

dim3 DimGrid(100, 50);    // 5000 thread blocks
dim3 DimBlock(4, 8, 8);   // 256 threads per block
KernelFunc<<<DimGrid, DimBlock>>>(...);
CUDA Memory Model
• Registers: read/write per thread
• Local memory: read/write per thread
• Shared memory: read/write per block
• Global memory: read/write per grid
• Constant memory: read-only, per grid
• Texture memory: read-only, per grid
[Figure: each thread has its own registers; threads in a block share per-block shared memory; all blocks in the grid access global, constant, and texture memory, which the host can also read and write]
Memory Access Efficiency
• Registers: fast
• Local memory: not cached -> slow; registers spill into local memory
• Shared memory: on chip -> fast
• Global memory: not cached -> slow
• Constant memory: cached – fast if there is good reuse
• Texture memory: cached – fast if there is good reuse
CUDA Variable Type Qualifiers

Variable declaration                       Memory    Scope   Lifetime
__device__ __local__    int LocalVar;      local     thread  thread
__device__ __shared__   int SharedVar;     shared    block   block
__device__              int GlobalVar;     global    grid    application
__device__ __constant__ int ConstantVar;   constant  grid    application

• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
  • Except arrays, which reside in local memory
A sketch placing a variable in each space follows.
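A minimal sketch (assumed example; coeff, Demo, and tile are illustrative names) showing where each declaration lives:

__constant__ float coeff[16];             // constant memory, read-only on device

__global__ void Demo(float* gdata) {      // gdata points into global memory
    __shared__ float tile[64];            // shared memory: one copy per block
    int tid = threadIdx.x;                // automatic scalar -> register
    float local_arr[8];                   // automatic array -> local memory
    local_arr[0] = coeff[0];              // read from constant memory
    tile[tid] = gdata[tid] + local_arr[0];
    gdata[tid] = tile[tid];               // write back to global memory
}
// Launch as Demo<<<grid, 64>>>(d_data);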
Variable Type Restrictions
• Pointers can only point to memory allocated or declared in global memory:
  • Allocated on the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
  • Obtained as the address of a global variable: float* ptr = &GlobalVar;
Simple Example: Matrix Multiplication
Matrix Multiplication
• P = M * N, each of size WIDTH x WIDTH
• Simple strategy
  • One thread calculates one element of P
  • M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each WIDTH x WIDTH]
GPU Matrix Multiplication: Host

float *M, *N, *P;        // host matrices
float *Md, *Nd, *Pd;     // device matrices
int width;
int size = width * width * sizeof(float);

cudaMalloc(&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMalloc(&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
cudaMalloc(&Pd, size);

// call kernel

cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
cudaFree(Md); cudaFree(Nd); cudaFree(Pd);

A sketch of checking these calls for errors follows.
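Every CUDA runtime call above returns a cudaError_t; the slides omit the checks for brevity. A common pattern (the helper name Check is illustrative) is a small sketch like this:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call failed.
static void Check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}
// Usage: Check(cudaMalloc((void**)&Md, size), "cudaMalloc Md");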
GPU Matrix Multiplication: Host
• How many threads do we need? One thread per element of P, i.e., WIDTH x WIDTH threads.
[Figure: matrices M, N, and P, each WIDTH x WIDTH]
GPU Matrix Multiplication: Host

dim3 dimGrid(1, 1);
dim3 dimBlock(width, width);
MatrixMul<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

[Figure: matrices M, N, and P, each WIDTH x WIDTH]
GPU Matrix Multiplication: Kernel

__global__ void MatrixMul(
    float* Md, float* Nd,
    float* Pd, int width)
{
    int tx = threadIdx.x;  // short form
    int ty = threadIdx.y;  // short form
    Pd[ty*width + tx] = …
}

[Figure: thread (tx, ty) computes the Pd element at row ty, column tx]
GPU Matrix Multiplication: Kernel

__global__ void MatrixMul(…)
{
    float r = 0;
    for (int k = 0; k < width; k++) {
        r += Md[ty*width + k] * Nd[k*width + tx];
    }
    Pd[ty*width + tx] = r;
}

[Figure: thread (tx, ty) walks across row ty of Md and down column tx of Nd]
Only One Thread Block Used
• One block of threads computes matrix Pd
• Each thread computes one element of Pd
• Each thread
  • Loads a row of matrix Md
  • Loads a column of matrix Nd
  • Performs one multiply and one addition for each pair of Md and Nd elements
• Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1 contains a single Block 1; thread (2, 2) combines a row of Md with a column of Nd, e.g. (3, 2, 5, 4) · (2, 4, 2, 6) = 48, to produce one Pd element]
How about performance on G80?
• All threads access global memory for their input matrix elements
• Compute: 346.5 GFLOPS
• Memory bandwidth: 86.4 GB/s
How about performance on G80? (cont.)
• All threads access global memory for their input matrix elements
  • Two memory accesses (8 bytes) per floating-point multiply-add
  • 4 B of memory traffic per FLOP
  • 4 * 346.5 = 1386 GB/s required to achieve the peak FLOP rating
  • 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS
G80 Example: Executing Thread Blocks
• Threads are assigned to Streaming Multiprocessors (SMs) at block granularity
  • Up to 8 blocks per SM, as resources allow
  • An SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) * 3 blocks
    • Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
  • SM maintains thread/block id #s
  • SM manages/schedules thread execution
[Figure: blocks are distributed across SMs; each SM holds several blocks' threads t0, t1, t2, …, tm, with its own shared memory and MT issue unit]
G80 Example: Thread Scheduling
• Each block is executed as 32-thread warps
  • Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  • Each block is divided into 256/32 = 8 warps
  • There are 8 * 3 = 24 warps
[Figure: a streaming multiprocessor with an instruction L1 cache, instruction fetch/dispatch, 8 SPs, 2 SFUs, and shared memory, executing warps (t0 … t31) from Block 1 and Block 2]
A small helper that does this warp arithmetic appears below.
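A tiny host-side sketch of the warp arithmetic above (WARP_SIZE is 32 on G80; the function name is illustrative):

const int WARP_SIZE = 32;

// Warps per block, rounding up for partially filled warps.
int WarpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + WARP_SIZE - 1) / WARP_SIZE;
}
// Example: 3 blocks of 256 threads -> 3 * WarpsPerBlock(256) = 24 warps.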
SM Warp Scheduling
• SM hardware implements zero-overhead warp scheduling
  • Warps whose next instruction has its operands ready for consumption are eligible for execution
  • Eligible warps are selected for execution by a prioritized scheduling policy
  • All threads in a warp execute the same instruction when it is selected
[Figure: over time, the SM's multithreaded warp scheduler interleaves, e.g., warp 1 instruction 42, warp 3 instruction 95, warp 8 instructions 11 and 12, then warp 3 instruction 96]
G80 Block Granularity Considerations
• For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks? (Recall: each SM can take at most 8 blocks and at most 768 threads.)
  • For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
  • For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.
  • For 32x32, we have 1024 threads per block. Not even one block fits into an SM!
The sketch below mechanizes this reasoning.
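A small sketch of the occupancy arithmetic above; the 8-block and 768-thread limits are the G80 figures from this slide, and ThreadsPerSM is an illustrative name.

#include <stdio.h>

int ThreadsPerSM(int threadsPerBlock) {
    const int maxBlocks = 8, maxThreads = 768;      // G80 per-SM limits
    if (threadsPerBlock > maxThreads) return 0;     // block does not fit at all
    int blocks = maxThreads / threadsPerBlock;      // limited by thread count
    if (blocks > maxBlocks) blocks = maxBlocks;     // limited by block count
    return blocks * threadsPerBlock;
}

int main() {
    printf("%d\n", ThreadsPerSM(8 * 8));    // 512  (block-count limited)
    printf("%d\n", ThreadsPerSM(16 * 16));  // 768  (full capacity)
    printf("%d\n", ThreadsPerSM(32 * 32));  // 0    (does not fit)
    return 0;
}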
A Common Programming Strategy
• Global memory resides in device memory (DRAM) – much slower access than shared memory
• So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  • Partition data into subsets that fit into shared memory
  • Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    • Copying results from shared memory to global memory
A Common Programming Strategy (cont.)
• Constant memory also resides in device memory (DRAM) – much slower access than shared memory
  • But… cached!
  • Highly efficient access for read-only data
• Carefully divide data according to access patterns
  • Read-only -> constant memory (very fast if in cache)
  • Read/write, shared within a block -> shared memory (very fast)
  • Read/write within each thread -> registers (very fast)
  • Read/write inputs/results -> global memory (very slow)
Idea: Use Shared Memory to Reuse Global Memory Data
• Each input element is read by WIDTH threads
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
• Tiled algorithms
[Figure: matrices M, N, and P, each WIDTH x WIDTH, with thread (tx, ty) highlighted]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
[Figure: Pd is partitioned into TILE_WIDTH x TILE_WIDTH sub-matrices Pdsub; block (bx, by) computes one Pdsub, with thread (tx, ty) computing one of its elements from a TILE_WIDTH-wide strip of Md and Nd]
A Small Example: 2x2 Tiling of P
[Figure: a 4x4 Pd tiled into 2x2 blocks; the tile containing Pd(0,0), Pd(1,0), Pd(0,1), Pd(1,1) is computed from the corresponding elements of Md (Md(0,0) through Md(3,1)) and Nd (Nd(0,0) through Nd(1,3))]
Every Md and Nd element is used exactly twice in generating a 2x2 tile of P

Access order (each row is one step of the inner loop):

P0,0 (thread 0,0)   P1,0 (thread 1,0)   P0,1 (thread 0,1)   P1,1 (thread 1,1)
M0,0 * N0,0         M0,0 * N1,0         M0,1 * N0,0         M0,1 * N1,0
M1,0 * N0,1         M1,0 * N1,1         M1,1 * N0,1         M1,1 * N1,1
M2,0 * N0,2         M2,0 * N1,2         M2,1 * N0,2         M2,1 * N1,2
M3,0 * N0,3         M3,0 * N1,3         M3,1 * N0,3         M3,1 * N1,3
Breaking Md and Nd into Tiles
• Break up the inner-product loop of each thread into phases
• At the beginning of each phase, load the Md and Nd elements that everyone needs during the phase into shared memory
• Everyone accesses the Md and Nd elements from shared memory during the phase
[Figure: the same 2x2 tiling of Pd, with the Md and Nd tiles needed in each phase highlighted]
Tiled Kernel

__global__ void Tiled(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // compute Pvalue (next slide)
    Pd[Row*Width + Col] = Pvalue;
}
Tiled Kernel: Computing Pvalue

    //…
    float Pvalue = 0;
    // Loop over the Md and Nd tiles required
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
    //…
CUDA Code – Kernel Execution Configuration

// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
First-order Size Considerations in G80
• Each thread block should have many threads
  • TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
  • A 1024*1024 Pd gives 64*64 = 4096 thread blocks
  • TILE_WIDTH of 16 gives each SM 3 blocks, 768 threads (full capacity)
• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8192 mul/add operations
  • Memory bandwidth is no longer a limiting factor
Tiled Multiply
• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH
• Each thread computes one element of Pdsub
[Figure: in phase m, block (bx, by) loads the m-th TILE_WIDTH x TILE_WIDTH tiles of Md and Nd and accumulates their product into Pdsub]
G80 Shared Memory and Threading
• Each SM in G80 has 16 kB shared memory
  • SM size is implementation dependent!
  • For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2 kB of shared memory
  • The shared memory can potentially hold up to 8 thread blocks actively executing
    • This allows up to 8*512 = 4096 pending loads (2 per thread, 256 threads per block)
    • The threading model limits the number of thread blocks to 3, so shared memory is not the limiting factor here
  • The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8 kB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
  • The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
Parallel Memory Architecture
• In a parallel machine, many threads access memory
  • Therefore, memory is divided into banks
  • Essential to achieve high bandwidth
• Each bank can service one address per cycle
  • A memory can service as many simultaneous accesses as it has banks
• Multiple simultaneous accesses to a bank result in a bank conflict
  • Conflicting accesses are serialized
[Figure: banks 0 through 15]
Bank Addressing Examples
• No bank conflicts: linear addressing, stride == 1
  [Figure: threads 0-15 access banks 0-15 in order]
• No bank conflicts: random 1:1 permutation
  [Figure: threads 0-15 each access a distinct bank]
Bank Addressing Examples (cont.)
• 2-way bank conflicts: linear addressing, stride == 2
  [Figure: pairs of threads map onto the same even-numbered bank]
• 8-way bank conflicts: linear addressing, stride == 8
  [Figure: 8 threads map onto bank 0 and 8 threads onto bank 8]
How Addresses Map to Banks on G80
• Each bank has a bandwidth of 32 bits per clock cycle
• Successive 32-bit words are assigned to successive banks
• G80 has 16 banks
  • So bank = (word address) % 16
  • Same as the size of a half-warp
    • No bank conflicts between different half-warps, only within a single half-warp
A sketch of this mapping follows.
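A tiny sketch of the G80 mapping above (assumes 16 banks and 4-byte words; the function name is illustrative):

int BankOf(unsigned byteAddress) {
    unsigned wordAddress = byteAddress / 4;  // successive words -> successive banks
    return wordAddress % 16;
}
// Example: shared[threadIdx.x] with stride 1 lands each thread of a
// half-warp in a different bank, so there is no conflict.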
Shared Memory Bank Conflicts
• Shared memory is as fast as registers if there are no bank conflicts
• The fast case:
  • If all threads of a half-warp access different banks, there is no bank conflict
  • If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
• The slow case:
  • Bank conflict: multiple threads in the same half-warp access the same bank
  • Must serialize the accesses
  • Cost = max # of simultaneous accesses to a single bank
Linear Addressing
• Given:
  __shared__ float shared[256];
  float foo = shared[baseIndex + s * threadIdx.x];
• This is only bank-conflict-free if s shares no common factors with the number of banks
  • 16 on G80, so s must be odd
[Figure: s=1 maps threads 0-15 onto distinct banks 0-15; s=3 also permutes threads 0-15 onto distinct banks]
Control Flow Instructions
• Main performance concern with branching is divergence
  • Threads within a single warp take different paths
  • Different execution paths are serialized in G80
    • The control paths taken by the threads in a warp are traversed one at a time until there are no more
• A common case: avoid divergence when the branch condition is a function of the thread ID
  • Example with divergence:
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
  • Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • Also creates two different control paths for threads in a block
    • Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path
A kernel-level sketch of the two cases follows.