Lecture 5: GPU Programming · 2018-12-26 · CSE599W: Spring 2018
![Page 1: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/1.jpg)
Lecture 5: GPU Programming
CSE599W: Spring 2018
![Page 2: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/2.jpg)
Typical Deep Learning System Stack
Gradient Calculation (Differentiation API)
Computational Graph Optimization and Execution
Runtime Parallel Scheduling
GPU Kernels, Optimizing Device Code
Programming API
Accelerators and Hardwares
User API
System Components
Architecture
High level Packages
![Page 4: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/4.jpg)
Overview
● GPU architecture
● CUDA programming model
● Case study of efficient GPU kernels
![Page 5: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/5.jpg)
CPU vs GPU
● CPU: every instruction goes through fetch, decode, ALU, and write back between input and output. Spending a full fetch / decode / write-back path on each scalar operation costs too much overhead in compute resources and energy efficiency.
● CPU vector operations (SSE / AVX): several ALUs share one fetch / decode / write-back path.
● GPU: a specialized accelerator in which many more ALUs share one fetch / decode / write-back path.
![Page 10: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/10.jpg)
Streaming Multiprocessor (SM)
SP float core
DP float core
Decode and schedule the next instructions
Multiple caches
Registers
Load/store memory
Special function unit
![Page 11: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/11.jpg)
GPU Architecture
![Page 12: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/12.jpg)
Theoretical peak FLOPS comparison
* https://github.com/oxford-cs-deepnlp-2017/lectures/blob/master/Lecture%206%20-%20Nvidia%20RNNs%20and%20GPUs.pdf
![Page 13: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/13.jpg)
Memory Hierarchy
● CPU memory hierarchy: each core has its own registers, L1 cache, and L2 cache; the cores share the L3 cache and DRAM.
● GPU memory hierarchy: each SM has its own registers, L1 cache / shared memory, and read-only cache; the SMs share the L2 cache and GPU DRAM.
Intel Xeon E7-8870v4
● Cores: 20
● Reg / core: ??
● L1 / core: 32 KB
● L2 / core: 256 KB
● L3 cache: 50 MB
● DRAM: 100s GB
● Price: $12,000
Titan X Pascal
● SMs: 28
● Cores / SM: 128
● Reg / SM: 256 KB
● L1 / SM: 48 KB
● Sharedmem / SM: 64 KB
● L2 cache: 3 MB
● GPU DRAM: 12 GB
● Price: $1,200
Observations
● The GPU has more registers than L1 cache: 256 KB of registers per SM against 48 KB of L1.
● Shared memory is an L1-level cache controlled by the programmer.
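The per-SM numbers quoted for the Titan X can be read from whatever GPU is installed. The following is a minimal sketch, assuming the CUDA runtime API is available; `GpuMemorySummary` and `queryGpuMemory` are hypothetical names introduced for illustration, while the fields read from `cudaDeviceProp` are part of the runtime API.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper struct for this sketch (not part of the CUDA API).
struct GpuMemorySummary {
  int smCount;              // number of streaming multiprocessors
  int warpSize;             // threads per warp
  int regsPerSm;            // 32-bit registers per SM
  std::size_t sharedPerSm;  // shared memory per SM, in bytes
  int l2Bytes;              // L2 cache size, in bytes
  std::size_t globalBytes;  // GPU DRAM size, in bytes
  bool ok;                  // false if the query failed
};

// Reads the memory-hierarchy sizes of the given device.
GpuMemorySummary queryGpuMemory(int device) {
  GpuMemorySummary s = {};
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return s;
  s.smCount = prop.multiProcessorCount;
  s.warpSize = prop.warpSize;
  s.regsPerSm = prop.regsPerMultiprocessor;
  s.sharedPerSm = prop.sharedMemPerMultiprocessor;
  s.l2Bytes = prop.l2CacheSize;
  s.globalBytes = prop.totalGlobalMem;
  s.ok = true;
  return s;
}
```

Calling `queryGpuMemory(0)` and multiplying `regsPerSm` by 4 bytes gives the register file size per SM, which can then be compared with `sharedPerSm`.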
![Page 19: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/19.jpg)
GPU Memory Latency
(for Nvidia Maxwell architecture)
● Registers: read 0 cycles / read-after-write ~20 cycles
● L1/texture cache: 92 cycles
● Shared memory: 28 cycles
● Constant L1 cache: 28 cycles
● L2 cache: 200 cycles
● DRAM: 350 cycles
* http://lpgpu.org/wp/wp-content/uploads/2013/05/poster_andresch_acaces2014.pdf
![Page 20: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/20.jpg)
Memory bandwidth comparison
* https://github.com/oxford-cs-deepnlp-2017/lectures/blob/master/Lecture%206%20-%20Nvidia%20RNNs%20and%20GPUs.pdf
![Page 21: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/21.jpg)
Nvidia GPU Comparison

| GPU | Tesla K40 (2014) | Titan X (2015) | Titan X (2016) |
|---|---|---|---|
| Architecture | Kepler GK110 | Maxwell GM200 | Pascal GP102 |
| Number of SMs | 15 | 24 | 28 |
| CUDA cores | 2880 (192 * 15 SMs) | 3072 (128 * 24 SMs) | 3584 (128 * 28 SMs) |
| Max clock rate | 875 MHz | 1177 MHz | 1531 MHz |
| FP32 GFLOPS | 5040 | 7230 | 10970 |
| 32-bit registers / SM | 64K (256 KB) | 64K (256 KB) | 64K (256 KB) |
| Shared memory / SM | 16 KB / 48 KB | 96 KB | 64 KB |
| L2 cache (shared by all SMs) | 1.5 MB | 3 MB | 3 MB |
| Global DRAM | 12 GB | 12 GB | 12 GB |
| Power | 235 W | 250 W | 250 W |
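The FP32 GFLOPS row can be reproduced from the core counts and clock rates in the same table. The worked arithmetic below assumes each CUDA core retires one fused multiply-add (counted as 2 floating-point operations) per cycle; that counting rule is an assumption about how the figures were derived, not something the slide states.

```latex
\text{peak FP32 GFLOPS} \approx 2 \times N_{\text{cores}} \times f_{\text{clock}}\,[\text{GHz}]

\begin{aligned}
\text{Tesla K40:}      &\quad 2 \times 2880 \times 0.875 = 5040 \\
\text{Titan X (2015):} &\quad 2 \times 3072 \times 1.177 \approx 7231 \\
\text{Titan X (2016):} &\quad 2 \times 3584 \times 1.531 \approx 10974
\end{aligned}
```

The results agree with the table's 5040, 7230, and 10970 up to rounding.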
![Page 22: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/22.jpg)
CUDA Programming Model
![Page 23: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/23.jpg)
Programming model: SIMT
● SIMT: Single Instruction, Multiple Threads
● The programmer writes code for a single thread as a simple C program.
○ All threads execute the same code, but can take different paths.
● Threads are grouped into a block.
○ Threads within the same block can synchronize execution.
● Blocks are grouped into a grid.
○ Blocks are scheduled independently on the GPU and can be executed in any order.
● A kernel is executed as a grid of blocks of threads.
(Figure: a grid made of block 0 to block 3; each thread block is made of threads.)
![Page 26: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/26.jpg)
Kernel Execution
● A kernel (grid) of 8 blocks, Block 0 to Block 7, is spread over however many SMs the GPU has.
○ GPU with 2 SMs: SM 0 runs blocks 0, 2, 4, 6; SM 1 runs blocks 1, 3, 5, 7.
○ GPU with 4 SMs: SM 0 runs blocks 0, 4; SM 1 runs blocks 1, 5; SM 2 runs blocks 2, 6; SM 3 runs blocks 3, 7.
● Each block is executed by one SM and does not migrate.
● Several concurrent blocks can reside on one SM, depending on the block's memory requirement and the SM's memory resources.
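How many blocks fit on one SM can be queried rather than guessed. The following is a minimal sketch, assuming the CUDA runtime's occupancy API; `stageAndDouble` and `residentBlocksPerSm` are hypothetical names used only for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative kernel (hypothetical, not from the lecture): stages data in
// dynamically sized shared memory, then doubles each element.
__global__ void stageAndDouble(float* data, int n) {
  extern __shared__ float staging[];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    staging[threadIdx.x] = data[i];
    data[i] = 2.0f * staging[threadIdx.x];
  }
}

// Returns how many blocks of stageAndDouble can be resident on one SM at the
// same time for the given block size and dynamic shared memory per block,
// or -1 if the query fails.
int residentBlocksPerSm(int threadsPerBlock, std::size_t sharedBytesPerBlock) {
  int numBlocks = 0;
  cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &numBlocks, stageAndDouble, threadsPerBlock, sharedBytesPerBlock);
  return err == cudaSuccess ? numBlocks : -1;
}
```

Comparing `residentBlocksPerSm(256, 1024)` with `residentBlocksPerSm(256, 40 * 1024)` shows the effect described on the slide: a block that asks for more shared memory leaves room for fewer concurrent blocks on the same SM.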
![Page 27: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/27.jpg)
Kernel Execution
● A warp consists of 32 threads.
○ A warp is the basic scheduling unit in kernel execution.
● A thread block is divided into 32-thread warps.
● Each cycle, a warp scheduler selects one ready warp and dispatches it to the CUDA cores for execution.
(Figure: the thread block pool of SM0 holds thread block 1 to thread block n, each split into warps.)
![Page 28: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/28.jpg)
Control flow
100: ...
101: if (condition) {
102: ...
103: } else {
104: ...
105: }
(Figure: over time, one warp steps through pc: 100, pc: 101, pc: 102, pc: 104, pc: 105. When its threads disagree on the condition, the warp executes both branches one after the other, with only the threads that took each branch active.)
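Whether a warp actually splits at a branch can be observed with a warp vote. The following is a minimal sketch, assuming CUDA 9 or later (for `__ballot_sync`) and a launch with one full warp per block; `countTakenPerWarp` and `takenPerWarp` are hypothetical names. A warp whose count is strictly between 0 and 32 has to run both the `if` and the `else` side.

```cuda
#include <cuda_runtime.h>
#include <vector>

// For each warp, count how many of its 32 lanes would take the "if" side.
// Launched with exactly 32 threads per block, so block index == warp index.
__global__ void countTakenPerWarp(const int* flags, int* taken) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  bool condition = flags[i] != 0;
  // Bit k of the ballot is set if lane k of this warp has condition == true.
  unsigned ballot = __ballot_sync(0xffffffffu, condition);
  if (threadIdx.x == 0) taken[blockIdx.x] = __popc(ballot);
}

// Host wrapper: flags.size() must be a non-zero multiple of 32,
// otherwise an empty vector is returned.
std::vector<int> takenPerWarp(const std::vector<int>& flags) {
  const int kWarp = 32;
  int n = static_cast<int>(flags.size());
  if (n == 0 || n % kWarp != 0) return {};
  int nwarps = n / kWarp;
  int *d_flags = nullptr, *d_taken = nullptr;
  cudaMalloc(&d_flags, n * sizeof(int));
  cudaMalloc(&d_taken, nwarps * sizeof(int));
  cudaMemcpy(d_flags, flags.data(), n * sizeof(int), cudaMemcpyHostToDevice);
  countTakenPerWarp<<<nwarps, kWarp>>>(d_flags, d_taken);
  std::vector<int> taken(nwarps);
  cudaMemcpy(taken.data(), d_taken, nwarps * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_flags);
  cudaFree(d_taken);
  return taken;
}
```

A condition that alternates between neighboring threads makes every warp diverge, while a condition that is uniform across each group of 32 consecutive threads lets every warp follow a single path.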
![Page 33: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/33.jpg)
Thread Hierarchy & Memory Hierarchy
● thread: registers & local memory
● thread block: shared memory
● grid (block 0 to block 3): global memory
(Figure: GPU memory hierarchy. Each SM holds registers, L1 cache / shared memory, and a read-only cache; the SMs share the L2 cache and GPU DRAM.)
![Page 34: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/34.jpg)
Example: Vector Add
// compute vector sum C = A + B
void vecAdd_cpu(const float* A, const float* B, float* C, int n) {
  for (int i = 0; i < n; ++i)
    C[i] = A[i] + B[i];
}
__global__ void vecAddKernel(const float* A, const float* B, float* C, int n) {
  // compute the global index
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {
    C[i] = A[i] + B[i];
  }
}
Suppose each block only includes 4 threads: blockDim.x = 4
global index: 0 1 2 3 4 5 6 7 8 9 10 11
threadIdx.x:  0 1 2 3 0 1 2 3 0 1 2 3
blockIdx.x:   0 0 0 0 1 1 1 1 2 2 2 2
Each thread performs only one pair-wise addition.
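The index table above can be reproduced on a GPU by letting every thread record its own `blockIdx.x` and `threadIdx.x` at its global index. This is a minimal sketch; `recordIndices`, `IndexTable`, and `buildIndexTable` are hypothetical names introduced for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Each thread writes its block index and thread index at its global index.
__global__ void recordIndices(int* blockOf, int* threadOf, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {
    blockOf[i] = blockIdx.x;
    threadOf[i] = threadIdx.x;
  }
}

// Hypothetical helper struct holding the two rows of the table.
struct IndexTable {
  std::vector<int> blockOf;
  std::vector<int> threadOf;
};

// Builds the table for n global indices with the given block size.
IndexTable buildIndexTable(int n, int threadsPerBlock) {
  IndexTable table;
  if (n <= 0 || threadsPerBlock <= 0) return table;
  table.blockOf.resize(n);
  table.threadOf.resize(n);
  std::size_t bytes = static_cast<std::size_t>(n) * sizeof(int);
  int *d_block = nullptr, *d_thread = nullptr;
  cudaMalloc(&d_block, bytes);
  cudaMalloc(&d_thread, bytes);
  int nblocks = (n + threadsPerBlock - 1) / threadsPerBlock;
  recordIndices<<<nblocks, threadsPerBlock>>>(d_block, d_thread, n);
  cudaMemcpy(table.blockOf.data(), d_block, bytes, cudaMemcpyDeviceToHost);
  cudaMemcpy(table.threadOf.data(), d_thread, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_block);
  cudaFree(d_thread);
  return table;
}
```

With `buildIndexTable(12, 4)`, global index 5 belongs to block 1 as thread 1, matching 4 * 1 + 1 = 5.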
![Page 38: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/38.jpg)
Example: Vector Add (Host)
#define THREADS_PER_BLOCK 512
void vecAdd(const float* A, const float* B, float* C, int n) {
  float *d_A, *d_B, *d_C;
  size_t size = n * sizeof(float);
  // allocate device buffers and copy the inputs to the GPU
  cudaMalloc((void **) &d_A, size);
  cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
  cudaMalloc((void **) &d_B, size);
  cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
  cudaMalloc((void **) &d_C, size);
  // round up so that every element gets a thread
  int nblocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
  // launch the GPU kernel asynchronously
  vecAddKernel<<<nblocks, THREADS_PER_BLOCK>>>(d_A, d_B, d_C, n);
  // this copy waits for the kernel to finish before reading d_C
  cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
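Because the launch is asynchronous, a failure inside the kernel does not show up at the launch line itself. The following is a minimal sketch of the same host routine with error checking added; the `CUDA_CHECK` macro and the `vecAddChecked` / `vecAddKernelChecked` names are hypothetical, while `cudaGetLastError` and `cudaDeviceSynchronize` are standard runtime calls.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper macro: abort with a message if a CUDA call fails.
#define CUDA_CHECK(call)                                           \
  do {                                                             \
    cudaError_t err_ = (call);                                     \
    if (err_ != cudaSuccess) {                                     \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                   cudaGetErrorString(err_), __FILE__, __LINE__);  \
      std::exit(EXIT_FAILURE);                                     \
    }                                                              \
  } while (0)

__global__ void vecAddKernelChecked(const float* A, const float* B, float* C, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) C[i] = A[i] + B[i];
}

void vecAddChecked(const float* A, const float* B, float* C, int n) {
  if (n <= 0) return;
  const int threadsPerBlock = 512;
  std::size_t size = static_cast<std::size_t>(n) * sizeof(float);
  float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
  CUDA_CHECK(cudaMalloc(&d_A, size));
  CUDA_CHECK(cudaMalloc(&d_B, size));
  CUDA_CHECK(cudaMalloc(&d_C, size));
  CUDA_CHECK(cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice));
  CUDA_CHECK(cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice));
  int nblocks = (n + threadsPerBlock - 1) / threadsPerBlock;
  vecAddKernelChecked<<<nblocks, threadsPerBlock>>>(d_A, d_B, d_C, n);
  CUDA_CHECK(cudaGetLastError());       // errors detected at launch time
  CUDA_CHECK(cudaDeviceSynchronize());  // wait for the kernel; reports run-time errors
  CUDA_CHECK(cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost));
  CUDA_CHECK(cudaFree(d_A));
  CUDA_CHECK(cudaFree(d_B));
  CUDA_CHECK(cudaFree(d_C));
}
```

The explicit `cudaDeviceSynchronize()` is not needed for correctness here, since the following `cudaMemcpy` already waits for the kernel; it is kept so that a kernel failure is reported at a distinct line.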
![Page 40: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/40.jpg)
Example: Sliding Window Sum
● Consider computing the sum of a sliding window over a vector
○ Each output element is the sum of input elements within a radius
○ Example: image blur kernel
● If radius is 3, each output element is sum of 7 input elements
(Figure: each output element is computed from the 7 input elements centered on it.)
![Page 41: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/41.jpg)
A naive implementation
#define RADIUS 3
__global__ void windowSumNaiveKernel(const float* A, float* B, int n) {
int out_index = blockDim.x * blockIdx.x + threadIdx.x;
int in_index = out_index + RADIUS;
if (out_index < n) {
float sum = 0.;
for (int i = -RADIUS; i <= RADIUS; ++i) {
sum += A[in_index + i];
}
B[out_index] = sum;
}
}
![Page 42: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/42.jpg)
A naive implementation
void windowSum(const float* A, float* B, int n) {
float *d_A, *d_B;
int size = n * sizeof(float);
cudaMalloc((void **) &d_A, (n + 2 * RADIUS) * sizeof(float));
cudaMemset(d_A, 0, (n + 2 * RADIUS) * sizeof(float));
cudaMemcpy(d_A + RADIUS, A, size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &d_B, size);
dim3 threads(THREADS_PER_BLOCK, 1, 1);
dim3 blocks((n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, 1, 1);
windowSumNaiveKernel<<<blocks, threads>>>(d_A, d_B, n);
cudaMemcpy(B, d_B, size, cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B);
}
![Page 43: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/43.jpg)
How to improve it?
● How many times is each input element loaded?
○ Each input element is read 7 times!
○ Neighboring threads read most of the same elements.
● How can we avoid redundant reading of data?
![Page 45: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/45.jpg)
Sharing data between threads within a block
● A thread block first cooperatively loads the needed input data into the shared memory.
input
output
Computed by block 1
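A rough count shows what the cooperative load saves. The arithmetic below assumes RADIUS $R = 3$ and a block size of $T = 512$ threads (the THREADS_PER_BLOCK value used in the earlier host code), and it ignores hardware caches, which in practice absorb part of the naive kernel's repeated reads.

```latex
\begin{aligned}
\text{naive: global loads per output} &= 2R + 1 = 7 \\
\text{shared: global loads per output} &= \frac{T + 2R}{T} = \frac{518}{512} \approx 1.012 \\
\text{reduction in global loads} &= \frac{7 \times 512}{518} \approx 6.9\times
\end{aligned}
```

Each block loads its $T$ inputs plus $R$ halo elements on each side once, and the 7 reads per output are then served from shared memory.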
![Page 46: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/46.jpg)
Kernel with shared memory
__global__ void windowSumKernel(const float* A, float* B, int n) {
  __shared__ float temp[THREADS_PER_BLOCK + 2 * RADIUS];
  int base = blockDim.x * blockIdx.x;  // first output index of this block
  int out_index = base + threadIdx.x;
  int in_index = out_index + RADIUS;
  int local_index = threadIdx.x + RADIUS;
  // number of outputs this block produces (the last block may be partial)
  int num = min(THREADS_PER_BLOCK, n - base);
  if (out_index < n) {
    temp[local_index] = A[in_index];
  }
  if (threadIdx.x < RADIUS) {
    temp[local_index - RADIUS] = A[in_index - RADIUS];  // left halo
    temp[local_index + num] = A[in_index + num];        // right halo
  }
  // every thread of the block must reach the barrier, so it sits outside the branches
  __syncthreads();
  if (out_index < n) {
    float sum = 0.;
    for (int i = -RADIUS; i <= RADIUS; ++i) {
      sum += temp[local_index + i];
    }
    B[out_index] = sum;
  }
}
![Page 48: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/48.jpg)
Performance comparison
Demo!
Code: https://github.com/dlsys-course/examples/blob/master/cuda/window_sum.cu
![Page 49: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/49.jpg)
Case study of efficient GPU kernels
![Page 50: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/50.jpg)
Case study: GEMM
C = A x B
A: MxK matrix
B: KxN matrix
C: MxN matrix
● Workload of a thread block: each thread block computes a b x b area of C, which needs b rows of A and b columns of B.
● The block walks along K in strips of width s: a b x s strip of A and an s x b strip of B are loaded from global memory into shared memory.
● Suppose each thread block has t * t threads, bt = b / t. Each thread accumulates a bt x bt part of the C tile in registers.
● Strip data in shared memory is loaded cooperatively by the threads that use it (in the figure, by both thread 1 and thread 2).
![Page 54: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/54.jpg)
Case study: GEMM pseudocode
block_dim: <M / b, N / b>
thread_dim: <t, t>
// thread function
__global__ void SGEMM(float *A, float *B, float *C, int b, int s) {
  __shared__ float sA[2][b][s], sB[2][s][b]; // shared by a thread block
  float rC[bt][bt] = {0}; // thread local buffer, in the registers
  Cooperative fetch first strip from A, B to sA[0], sB[0]
  __syncthreads();
  for (k = 0; k < K / s; k += 1) {
    Cooperative fetch next strip from A, B to sA[(k+1)%2], sB[(k+1)%2]
    __syncthreads();
    for (kk = 0; kk < s; kk += 1) {
      for (j = 0; j < bt; j += 1) { // unroll loop
        for (i = 0; i < bt; i += 1) { // unroll loop
          rC[j][i] += sA[k%2][threadIdx.x*bt+j][kk]*sB[k%2][kk][threadIdx.y*bt+i];
        }
      }
    }
  }
  Write rC back to C
}
Run in parallel
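The pseudocode above can be turned into a compilable kernel once b, s, and t are fixed. The following is a simplified sketch, not the lecture's implementation: it keeps the shared-memory strips and the per-thread register tile but uses a single buffer instead of the double-buffered `sA[2]` / `sB[2]`, maps `threadIdx.y` to rows, and assumes M and N are multiples of b and K is a multiple of s. The tile sizes and the names `sgemmTiled` and `sgemm` are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Assumed tile sizes for this sketch (the lecture leaves b, s, t symbolic).
constexpr int TILE_B = 32;              // b: each block computes a b x b tile of C
constexpr int STRIP_S = 8;              // s: strip width along K
constexpr int THREADS_T = 8;            // t: t x t threads per block
constexpr int BT = TILE_B / THREADS_T;  // bt = b / t

// C (M x N) = A (M x K) * B (K x N), all row-major.
__global__ void sgemmTiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
  __shared__ float sA[TILE_B][STRIP_S];
  __shared__ float sB[STRIP_S][TILE_B];
  float rC[BT][BT] = {};  // per-thread accumulators, kept in registers

  const int tid = threadIdx.y * THREADS_T + threadIdx.x;
  const int rowBase = blockIdx.y * TILE_B;  // first row of this block's C tile
  const int colBase = blockIdx.x * TILE_B;  // first column of this block's C tile

  for (int k0 = 0; k0 < K; k0 += STRIP_S) {
    // Cooperative fetch: the t*t threads stride over the b*s elements of each strip.
    for (int e = tid; e < TILE_B * STRIP_S; e += THREADS_T * THREADS_T) {
      sA[e / STRIP_S][e % STRIP_S] = A[(rowBase + e / STRIP_S) * K + k0 + e % STRIP_S];
      sB[e / TILE_B][e % TILE_B] = B[(k0 + e / TILE_B) * N + colBase + e % TILE_B];
    }
    __syncthreads();  // both strips are complete before anyone reads them
    for (int kk = 0; kk < STRIP_S; ++kk)
      for (int j = 0; j < BT; ++j)
        for (int i = 0; i < BT; ++i)
          rC[j][i] += sA[threadIdx.y * BT + j][kk] * sB[kk][threadIdx.x * BT + i];
    __syncthreads();  // all reads finish before the next fetch overwrites the strips
  }
  // Write rC back to C.
  for (int j = 0; j < BT; ++j)
    for (int i = 0; i < BT; ++i)
      C[(rowBase + threadIdx.y * BT + j) * N + colBase + threadIdx.x * BT + i] = rC[j][i];
}

// Host wrapper. Returns an empty vector if the size assumptions do not hold.
std::vector<float> sgemm(const std::vector<float>& A, const std::vector<float>& B,
                         int M, int N, int K) {
  if (M <= 0 || N <= 0 || K <= 0) return {};
  if (M % TILE_B != 0 || N % TILE_B != 0 || K % STRIP_S != 0) return {};
  std::size_t bytesA = sizeof(float) * M * K;
  std::size_t bytesB = sizeof(float) * K * N;
  std::size_t bytesC = sizeof(float) * M * N;
  float *dA = nullptr, *dB = nullptr, *dC = nullptr;
  cudaMalloc(&dA, bytesA);
  cudaMalloc(&dB, bytesB);
  cudaMalloc(&dC, bytesC);
  cudaMemcpy(dA, A.data(), bytesA, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B.data(), bytesB, cudaMemcpyHostToDevice);
  dim3 threads(THREADS_T, THREADS_T);
  dim3 blocks(N / TILE_B, M / TILE_B);
  sgemmTiled<<<blocks, threads>>>(dA, dB, dC, M, N, K);
  std::vector<float> C(static_cast<std::size_t>(M) * N);
  cudaMemcpy(C.data(), dC, bytesC, cudaMemcpyDeviceToHost);
  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
  return C;
}
```

The second `__syncthreads()` is what the single-buffer version pays for its simplicity; double buffering, as in the pseudocode, is a way to overlap the next fetch with the current strip's computation.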
![Page 55: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/55.jpg)
Case study: GEMM
For more details, see:
● http://homes.cs.washington.edu/~tws10/cse599i/CSE%20599%20I%20Accelerated%20Computing%20-%20Programming%20GPUs%20Lecture%204.pdf
● Lai, Junjie, and André Seznec. "Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs." Code Generation and Optimization (CGO), 2013 IEEE/ACM International Symposium on. IEEE, 2013.
![Page 56: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/56.jpg)
Case study: Reduction Sum
http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
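For orientation, a basic shared-memory tree reduction looks roughly as follows. This is a minimal sketch of one simple variant, not the optimized versions developed in the linked slides; it assumes a power-of-two block size, finishes the sum of the per-block partial results on the host, and uses the hypothetical names `blockSumKernel` and `reduceSum`.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int REDUCE_THREADS = 256;  // assumed block size; must be a power of two

// Each block sums up to REDUCE_THREADS inputs and writes one partial sum.
__global__ void blockSumKernel(const float* in, float* partial, int n) {
  __shared__ float sdata[REDUCE_THREADS];
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;
  sdata[tid] = (i < n) ? in[i] : 0.0f;  // pad the last block with zeros
  __syncthreads();
  // Halve the number of active threads each step; thread tid adds the
  // element that is `stride` positions away.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) sdata[tid] += sdata[tid + stride];
    __syncthreads();
  }
  if (tid == 0) partial[blockIdx.x] = sdata[0];
}

// Host wrapper: reduces within blocks on the GPU, then adds the per-block
// partial sums on the CPU.
float reduceSum(const std::vector<float>& input) {
  int n = static_cast<int>(input.size());
  if (n == 0) return 0.0f;
  int nblocks = (n + REDUCE_THREADS - 1) / REDUCE_THREADS;
  float *d_in = nullptr, *d_partial = nullptr;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_partial, nblocks * sizeof(float));
  cudaMemcpy(d_in, input.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  blockSumKernel<<<nblocks, REDUCE_THREADS>>>(d_in, d_partial, n);
  std::vector<float> partial(nblocks);
  cudaMemcpy(partial.data(), d_partial, nblocks * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_partial);
  float total = 0.0f;
  for (float p : partial) total += p;
  return total;
}
```

The barrier sits outside the `if (tid < stride)` test so that every thread of the block reaches it on each step.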
![Page 57: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/57.jpg)
Tips for high performance
● Use existing libraries, which are highly optimized, e.g. cuBLAS, cuDNN.
● Use nvprof or nvvp (visual profiler) to debug performance.
● Use a high-level language to write GPU kernels.
![Page 58: Lecture 5: GPU Programming · 2018-12-26 · Lecture 5: GPU Programming CSE599W: Spring 2018. Typical Deep Learning System Stack Gradient Calculation (Differentiation API) Computational](https://reader035.vdocument.in/reader035/viewer/2022063003/5f753f8ac93bd630c038b5d5/html5/thumbnails/58.jpg)
References
● CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/