TRANSCRIPT
Most of the slides are from: GTC’18 – S81006 and GTC’19 – S9234
Mohammad Motamedi, NVIDIA
VOLTA V100
Per Streaming Multiprocessor
• 64 FP32 lanes
• 32 FP64 lanes
• 64 INT32 lanes
• 16 SFU lanes (transcendentals)
• 32 LD/ST lanes (global/local/shared memory)
• 8 Tensor Cores
• 4 TEX lanes
[Diagram: 80 SMs, each with its own L1, sharing an L2 cache and DRAM]
SM Resources
Thread blocks require registers and shared memory.
The SM can schedule any resident warp without a context switch.
Volta limits per SM:
• Register file size: 256 KB.
• Maximum shared memory: 96 KB.
• Maximum number of threads: 2048.
• Maximum number of warps: 64.
• Maximum number of blocks: 32.
Performance Constraints
Compute bound – saturates the compute units
• Reduce the number of instructions executed
• Vector types, intrinsic operations, Tensor Cores, FMAs.
Bandwidth bound – saturates memory bandwidth
• Optimize the access pattern
• Use lower precision
Latency bound
• Increase the number of instructions / memory accesses in flight
Little's Law for Escalators
Our escalator parameters
• 1 person per step
• A step arrives every 2 seconds → bandwidth: 0.5 person/s
• 20 steps tall → latency: 40 seconds
Little's Law for Escalators
• One person at a time? Achieved bandwidth = 0.025 person/s.
• To saturate bandwidth, one person must arrive with every step, so we need 20 persons in flight.
• In general: we need bandwidth × latency persons in flight.
(A step arrives every 2 seconds → bandwidth 0.5 person/s; 20 steps tall → latency 40 seconds.)
Optimization Goals
Saturating the compute units
• a = b + c → arithmetic intensity = 1 / (3 × 4) F/B
• GPU's capability = 15 TFLOP/s ÷ 900 GB/s = 16.6 F/B
• It is hard to saturate the compute units.
• One can improve throughput by saturating the memory bandwidth instead.
Saturating the memory bandwidth
• Hiding the latency is the key to achieving this goal.
Memory Bandwidth
[Chart: % of peak bandwidth vs. bytes in flight per SM, for V100]
Volta reaches 90% of peak bandwidth with ~6 KB of data in flight per SM.
Little's Law: Takeaways
Instructions in flight = instructions in flight per thread × threads executing
The first factor is instruction-level parallelism; the second is occupancy.
Occupancy
Higher occupancy hides latency: the SM has more candidate warps to schedule while other warps are waiting for their instructions to complete.
Occupancy = achieved number of threads per SM / maximum number of threads per SM
• Use the profiler to compute it.
Achieved occupancy vs. theoretical occupancy: you need to run enough thread blocks to fill all the SMs.
Be mindful of diminishing returns.
Instruction Issue
Instructions are issued in order.
If an instruction is not eligible, it stalls the warp.
An instruction is eligible for issue if both are true:
• A pipeline is available for execution
  Some pipelines need multiple cycles to issue a warp.
• All the arguments are ready
  An argument is not ready if a previous instruction has not produced it yet.
Instruction Issue Example

```cuda
__global__ void kernel(float *a, float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] += a[i] * b[i];
}
```

Generated SASS:

```
LDG.E R2, [R2];
LDG.E R4, [R4];
LDG.E R9, [R6];
FFMA R9, R2, R4, R9;   // stall! waits on the three loads
STG.E [R6], R9;        // stall! waits on the FFMA
```

12 B / thread in flight.
Computing 2 Values per Thread

```cuda
__global__ void kernel(float2 *a, float2 *b, float2 *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i].x += a[i].x * b[i].x;
    c[i].y += a[i].y * b[i].y;
}
```

Generated SASS:

```
LDG.E.64 R2, [R2];
LDG.E.64 R4, [R4];
LDG.E.64 R6, [R8];
FFMA R7, R3, R5, R7;   // stall! waits on the loads
FFMA R6, R2, R4, R6;   // independent of the previous FFMA
STG.E.64 [R8], R6;     // stall! waits on both FFMAs
```

24 B / thread in flight; 2 independent instructions.
Control Flow: Blocks of Threads, Warps
• Single Instruction Multiple Threads (SIMT) model.
• CUDA hierarchy: Grid → Blocks → Threads.
• One warp = 32 threads.
• Why does it matter? Many optimizations depend on behavior at the warp level.
Control Flow: Mapping Threads
• Thread blocks can have a 1D, 2D, or 3D shape. Threads are linear in hardware.
• Every 32 consecutive threads belong to the same warp.
[Diagram: an 80-thread block, 40 threads in X and 2 rows of threads in Y]
Control Flow: Mapping Threads
• Thread blocks can have a 1D, 2D, or 3D shape. Threads are linear in hardware.
• Every 32 consecutive threads belong to the same warp.
[Diagram: the 40×2 block is covered by 3 warps (96 threads); the 3rd warp has 16 inactive threads]
Control Flow: Divergence
• Different warps can execute different code: no impact on performance. Each warp maintains its own program counter.
• A different code path inside the same warp? Threads that do not participate are masked out, but the whole warp executes both sides of the branch.
Control Flow: Divergence Example

```cuda
A;
if (threadIdx.y == 0)
    B;
else
    C;
D;
```

[Diagram, built up over slides 18–28: the 40×2 block, threadIdx.x from 0 to 39 and threadIdx.y from 0 to 1, split into warps 1, 2, and 3. Over time, the warps execute:]
• Warp 1 (all threadIdx.y == 0): A, B, D — no divergence.
• Warp 2 (mixed y == 0 and y == 1): A, B, C, D — 2-way divergence; both sides of the branch execute with part of the warp masked out.
• Warp 3 (all threadIdx.y == 1): A, C, D — no divergence.
Control Flow
Case Study: Thread Divergence
Problem: 2D convolution with padding on an input image of size 1024×1024.
• Divergence in the first and last rows: 0.
• Divergence in the other rows: a 2-way divergence in the first and last warp of each row.
• Total warps with divergence: 1022 × 2 = 2044.
• About 6% of threads will have a 2-way divergence.
Control Flow: Takeaways
• Not every branch is a divergence.
• Minimize thread divergence inside a warp.
• Divergence between warps is fine.
• Maximize "useful" cycles for each thread: maximize the number of threads executing and minimize thread divergence.
• Do not call a warp-wide instruction on a divergent branch (e.g. __syncthreads()).
Intrinsic Functions
• Fast but less accurate math intrinsic functions are available.
• Two ways to use the intrinsic functions:
  • Whole file: compile with --use_fast_math.
  • Individual calls, e.g. __sinf(x), __logf(x), __fdividef(x,y).
• The programming guide has a list of intrinsic functions and their impact on accuracy.
Tensor Cores (Volta/Turing)
• Dedicated matrix multiplication pipeline.
• Input precision: FP16.
• Peak performance: 125 TFLOP/s.
• Used in cuBLAS, cuDNN, CUTLASS. Optimized libraries can reach ~90% of peak.
• Exposed in CUDA.
Tensor Cores: Using Them in Your CUDA Code
Warp Matrix Multiply Add (WMMA)
• Warp-wide macro-instructions.
• All threads in the warp must be active.
• Performs matrix multiplication on 16×16 tiles (8×32×16 and 32×8×16 tiles also available).
• D = A × B + C. A and B: FP16 only. C and D: the same type as each other, either FP16 or FP32.
Tensor Cores: Typical Use
Each warp processes a 16×16 output tile.
Each warp:
• Loops over all input tiles Ak and Bk: C = C + Ak × Bk.
• Writes the output tile.
A block can compute several tiles, with inputs staged in shared memory.
Tensor Cores Case Study: Deep Learning
Scaling is required in the backward pass.
Source: https://arxiv.org/pdf/1710.03740.pdf
Volta's Memory System (V100)
• 80 Streaming Multiprocessors, each with a 256 KB register file (20 MB total).
• Unified shared memory / L1 cache: 128 KB per SM, variable split (~10 MB total, 14 TB/s).
• 6 MB L2 cache (2.5 TB/s read, 1.6 TB/s write).
• 16/32 GB HBM2 (900 GB/s), "free" ECC.
• Connected to the host via PCIe.
Memory System: Shared Memory Split
By default, the driver uses the configuration that maximizes occupancy.

Shared memory / L1 splits:

| Volta | Turing |
|---|---|
| 96 KB / 32 KB | 64 KB / 32 KB |
| 64 KB / 64 KB | 32 KB / 64 KB |
| 32 KB / 96 KB | |
| 16 KB / 112 KB | |
| 8 KB / 120 KB | |
| 0 KB / 128 KB | |

Examples:
• 0 KB shared memory per block, other resources allow 16 blocks/SM: Volta config 0/128 → 16 blocks/SM; Turing config 32/64 → 16 blocks/SM.
• 40 KB shared memory per block, other resources allow 4 blocks/SM: Volta config 96/32 → 2 blocks/SM; Turing config 64/32 → 1 block/SM.
Cache Lines and Sectors
Memory is accessed at sector granularity when moving data between L1, L2, and DRAM.
• Sector size in Maxwell, Pascal, and Volta: 32 B.
• Sector size in Kepler and before: variable (32 B or 128 B).
A cache line is 128 bytes, made of 4 sectors, and is 128-byte aligned.
Memory Reads: Getting Data from Global Memory
• Check whether the data is in L1 (if not, check L2).
• Check whether the data is in L2 (if not, fetch it from DRAM).
• Unit of data moved: sectors.
Memory Writes
• Before Volta, writes were not cached in L1. On Volta and later, L1 caches writes.
• L1 is write-through: writes go to L1 AND L2.
• L2 is write-back: data is flushed to DRAM only when needed.
• Partial writes are supported (a masked portion of a sector), but the behavior can change with ECC on/off.
L1, L2 Caches: Why Does the GPU Have Caches?
In general, not for cache blocking:
• 100s to 1000s of threads run per SM, and tens of thousands of threads share the L2 cache, so L1 and L2 are small per thread. E.g. at 2048 threads/SM with 80 SMs: 64 bytes of L1 and 38 bytes of L2 per thread.
• Running at lower occupancy increases the bytes of cache per thread.
• Shared memory is usually a better option for caching data explicitly: it is user managed, with no evictions out of your control.
L1, L2 Caches: Why Does the GPU Have Caches?
Caches on GPUs are useful for:
• "Smoothing" irregular, unaligned access patterns.
• Caching common data accessed by many threads.
• Faster register spills and local memory.
• Fast atomics.
• Code that does not use shared memory (naïve code).
Access Patterns: Warps and Sectors
For each warp: how many sectors are needed? It depends on the addresses, the active threads, and the access size. Natural element sizes: 1 B, 2 B, 4 B, 8 B, 16 B.
[Diagram: a warp of 32 threads making aligned 4-byte element accesses → 4 sectors]
Access Patterns: Warps and Sectors
[Diagram: 8-byte element access → 8 sectors]
Examples of 8-byte elements: long long, int2, double, float2.
46
Access PatternsWarps and Sectors
0 32 64 96 128 160 224 256 320288192 352
Memory Addresses
WARP
0 314-Byte access4 sectors
Access Patterns: Warps and Sectors
[Diagram: unaligned 4-byte access → 5 sectors]
128 bytes requested; 160 bytes read (80% efficiency).
Access Patterns: Warps and Sectors
[Diagram: unaligned 4-byte access → 5 sectors, with the next warp starting where this one ended]
With more than 1 warp per block, the extra sector might be found in L1 or L2.
Access Patterns: Warps and Sectors
[Diagram: all threads access the same address → 1 sector]
Access Patterns: Warps and Sectors
[Diagram: strided 4-byte access, one element per sector → 32 sectors]
128 bytes requested; 1024 bytes transferred. Only a few bytes of each sector are used, wasting lots of bandwidth.
Access Patterns: Takeaways
• Know your access patterns.
• Use the profiler (metrics, counters) to check how many sectors are moved. Is that what you expect? Is it optimal?
• Using the largest type possible (e.g. float4) will maximize the number of sectors moved per instruction.
Shared Memory Addressing
• Number of banks: 32; bank width: 4 B.
• Mapping addresses to banks: successive 4-byte words go to successive banks.
• Bank index computation examples:
  • (4-byte word index) % 32
  • ((byte address) / 4) % 32
• An 8-byte word spans two successive banks.
Logical View of Shared Memory Banks
[Diagram, with 4-byte data: words 0 … 31 laid out across Bank-0 … Bank-31 at byte addresses 0, 4, 8, …, 124; word 32 wraps back to Bank-0 at byte address 128, word 33 to Bank-1, and so on]
Shared Memory Bank Conflicts
A bank conflict occurs when, inside a warp, 2 or more threads access different 4-byte words in the same bank. Think: 2 or more threads access different "rows" of the same bank.
N-way bank conflict: N threads in a warp conflict.
• Increases latency.
• Worst case: a 32-way conflict → 31 replays.
• Each replay adds a few cycles of latency.
There is no bank conflict if:
• Several threads access the same 4-byte word.
• Several threads access different bytes of the same 4-byte word.
No Bank Conflicts
[Diagrams, slides 55–58: threads T-0 … T-31 each access a word in a distinct bank (or the same word), so every access completes in a single pass]
2-way Bank Conflict
[Diagrams, slides 59–60: two threads access different words in the same bank, requiring one replay]
61
3-way Bank Conflict
[Diagram: same bank layout; three threads in the warp access different 4B words in the same bank, serializing the warp's access into three transactions.]
62
Bank Conflict and Access Phase
4B or smaller words
• Process addresses of all threads in a warp in a single phase.
8B words are accessed in 2 phases
• Process addresses of the first 16 threads in a warp.
• Process addresses of the second 16 threads in a warp.
16B words are accessed in 4 phases
• Each phase processes a quarter of a warp.
Bank conflicts occur only between threads in the same phase.
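The bank rule behind these slides can be sketched in a few lines of host code (a minimal model for illustration only; `bank_of` is a made-up helper, not a CUDA API):

```cuda
// Minimal model of Volta shared-memory banking: 32 banks, each 4 B wide.
#include <cstdio>

constexpr int NUM_BANKS = 32;

// The 4 B word containing this byte address lives in this bank.
int bank_of(unsigned byte_address) {
    return (byte_address / 4) % NUM_BANKS;
}

int main() {
    // Stride-1 float accesses: thread t touches byte t*4, so each of the
    // 32 threads lands in a different bank -> no conflict.
    for (int t = 0; t < 32; t++)
        printf("T-%d -> bank %d\n", t, bank_of(t * 4));

    // Stride-2 float accesses: thread t touches byte t*8, so threads t and
    // t+16 map to the same bank -> a 2-way conflict on every bank.
    printf("T-0 -> bank %d, T-16 -> bank %d\n",
           bank_of(0 * 8), bank_of(16 * 8));
    return 0;
}
```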
63
8B words, No Conflicts
[Diagram: each 8B word spans two adjacent banks; Phase 1 serves threads T-0…T-15 and Phase 2 serves T-16…T-31, and within each phase no two threads touch the same bank.]
64
8B words, 2-way Conflict
[Diagram: Phase 1 (T-0…T-15) has a 2-way conflict on one bank; Phase 2 (T-16…T-31) has no conflict.]
65
Case Study: Matrix Transpose
Staged via SMEM to coalesce GMEM addresses
32x32 blocks, single-precision values.
32x32 array in shared memory.
Initial implementation:
A warp reads a row from GMEM, writes to a row of SMEM.
Synchronize the threads in a block.
A warp reads a column from SMEM, writes to a row in GMEM.
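The initial implementation described above can be sketched as a kernel (a sketch under assumptions: 32x32 thread blocks, matrix dimensions that are multiples of 32; `TILE` and the kernel name are illustrative):

```cuda
#define TILE 32

// Naive staged transpose: both GMEM accesses are coalesced, but the
// SMEM column read suffers a 32-way bank conflict.
__global__ void transpose_naive(float *out, const float *in, int n)
{
    __shared__ float sm[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // A warp reads a row from GMEM, writes it to a row of SMEM.
    sm[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();   // the whole tile must be in SMEM before reading

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;

    // A warp reads a column of SMEM (32-way conflict), writes a row of GMEM.
    out[ty * n + tx] = sm[threadIdx.x][threadIdx.y];
}
```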
66
Case Study: Matrix Transpose
32x32 SMEM array (e.g. __shared__ float sm[32][32])
Warp accesses a row : No conflict.
Warp accesses a column : 32-way conflict.
[Diagram: the 32x32 array mapped onto banks 0–31; each array column resides entirely in one bank. The number identifies which warp is accessing data; the color indicates in which bank the data resides.]
67
Case Study: Matrix Transpose
Solution: add a column for padding: 32x33 (e.g. __shared__ float sm[32][33])
Warp accesses a row or a column: no conflict.
[Diagram: with the padding column, consecutive rows are offset by one bank, so an array column now spans all 32 banks. The number identifies which warp is accessing data; the color indicates in which bank the data resides.]
Speedup: 1.3x
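The padded variant can be sketched as a complete kernel (a sketch under assumptions: 32x32 thread blocks, matrix dimensions that are multiples of 32; `TILE` and the kernel name are illustrative):

```cuda
#define TILE 32

// Staged transpose with a padded SMEM tile: the extra column shifts each
// row by one bank, so reading a tile column touches 32 different banks.
__global__ void transpose_padded(float *out, const float *in, int n)
{
    __shared__ float sm[TILE][TILE + 1];   // 32x33: one padding column

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    sm[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;

    // Column read: bank = (33*threadIdx.x + threadIdx.y) % 32
    //            = (threadIdx.x + threadIdx.y) % 32, distinct per lane.
    out[ty * n + tx] = sm[threadIdx.x][threadIdx.y];
}
```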
68
Summary: Shared Memory
Shared memory is a precious resource
Very high bandwidth (14 TB/s), much lower latency than Global Memory.
Data is programmer-managed, no evictions by hardware.
Volta: up to 96KB of shared memory per thread block.
4B granularity.
Performance issues to look out for
Bank conflicts add latency and reduce throughput.
Use profiling tools to identify bank conflicts.
69
2D Stencil Experiment
index = iy * nx + ix;
res = coef[0] * in[index];
for (i = 1; i <= RADIUS; i++)
    res += coef[i] * (in[index - i] + in[index + i] +
                      in[index - i * nx] + in[index + i * nx]);
out[index] = res;
with and without Shared Memory
With Shared Memory:
• Load the input array and halos in shared memory
• __syncthreads()
• Compute the stencil from the shared memory
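The three steps above can be sketched as a kernel (a sketch, not the benchmarked code: `BDIMX`/`BDIMY`, the tile name, and the halo loading scheme are assumptions, and boundary handling at the domain edges is omitted):

```cuda
#define RADIUS 4
#define BDIMX  32
#define BDIMY  32

__constant__ float coef[RADIUS + 1];

__global__ void stencil_smem(float *out, const float *in, int nx, int ny)
{
    // Interior tile plus a halo of RADIUS cells on each side.
    __shared__ float tile[BDIMY + 2 * RADIUS][BDIMX + 2 * RADIUS];

    int ix = blockIdx.x * BDIMX + threadIdx.x;
    int iy = blockIdx.y * BDIMY + threadIdx.y;
    int sx = threadIdx.x + RADIUS;
    int sy = threadIdx.y + RADIUS;
    int index = iy * nx + ix;

    // Load the input array and halos into shared memory.
    tile[sy][sx] = in[index];
    if (threadIdx.x < RADIUS) {
        tile[sy][sx - RADIUS] = in[index - RADIUS];
        tile[sy][sx + BDIMX]  = in[index + BDIMX];
    }
    if (threadIdx.y < RADIUS) {
        tile[sy - RADIUS][sx] = in[index - RADIUS * nx];
        tile[sy + BDIMY][sx]  = in[index + BDIMY * nx];
    }
    __syncthreads();

    // Compute the stencil entirely from shared memory.
    float res = coef[0] * tile[sy][sx];
    for (int i = 1; i <= RADIUS; i++)
        res += coef[i] * (tile[sy][sx - i] + tile[sy][sx + i] +
                          tile[sy - i][sx] + tile[sy + i][sx]);
    out[index] = res;
}
```

Corner cells of the halo are not loaded because this cross-shaped stencil never reads them.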
70
2D – Small stencils
Relative speed of L1 implementation versus Smem implementation
RADIUS=1: Volta 103%, Pascal 78%
RADIUS=2: Volta 102%, Pascal 55%
L1 implementation is faster than Shared Memory on Volta!
71
2D – Larger Stencils
Relative speed of L1 implementation versus Smem implementation
RADIUS=4:  Volta 94%, Pascal 40%
RADIUS=8:  Volta 95%, Pascal 32%
RADIUS=16: Volta 79%, Pascal 29%
The Shared Memory implementation is always faster for larger stencils.
72
Local Memory
• Registers that do not fit are spilled to local memory (which lives in device memory).
• Volta's max registers per thread: 255.
• Option to use a smaller number of registers: -maxrregcount.
• Developers might use it to increase occupancy.
• Issues
• Increased memory traffic.
• Increased instruction counts.
• Solutions
• Increase the register count, increase the L1 cache size, use non-caching loads carefully.
73
Constant Memory
• Globally-scoped arrays qualified with __constant__.
• Total constant data size limited to 64 KB.
• Throughput = 4B per clock per SM (ideal if entire warp reads the same address).
• Can be used directly in arithmetic instructions (saving registers).
• Example use : Stencil coefficients.
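The stencil-coefficient use case can be sketched as follows (a minimal sketch; RADIUS, the coefficient values, and the kernel/function names are illustrative):

```cuda
#include <cuda_runtime.h>

#define RADIUS 4

// Globally-scoped array qualified with __constant__ (64 KB total budget).
__constant__ float coef[RADIUS + 1];

__global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // Every thread in the warp reads the same address: the ideal
        // broadcast case, and coef[0] is used directly in the multiply
        // without occupying a register.
        out[i] = coef[0] * in[i];
}

// Host side: constant memory is filled with cudaMemcpyToSymbol.
void setup()
{
    float h_coef[RADIUS + 1] = {0.5f, 0.2f, 0.1f, 0.05f, 0.025f};
    cudaMemcpyToSymbol(coef, h_coef, sizeof(h_coef));
}
```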
74
Running Faster
• Start with something simple. Simplicity and massive parallelism are key.
• Profile first, then optimize. Do not over-optimize (opportunity cost).
• Use libraries as building blocks whenever possible – e.g. CUB for parallel primitives (reduce/scan/sort/partition/…) at warp/block/device level.
• Other libraries: cuBLAS, cuFFT, CUTLASS, cuSOLVER.
76
Shared Memory
Scratch-pad memory on each SM.
User-managed cache; the hardware does not evict data.
Data written to shared memory stays there until the user overwrites it.
Usage
Storing frequently accessed data, to reduce DRAM accesses.
Communication among threads of a block.
Performance benefits compared to DRAM
Latency: 20x–40x lower. Bandwidth: ~15x higher. Access granularity: 4x finer (4B vs. 32B).
77
Volta Shared Memory
• Shared Memory Size Per Block
• Default: 48 KB, maximum: 96 KB.
• Shared Memory Bandwidth
• Number of banks: 32; bank width: 4B.
• Bandwidth per SM: 128 B/clock.
• Total bandwidth (80 SMs): 10 KB/clock ≈ 14 TB/s.
78
Shared Memory Instruction Operation
Threads in a warp provide addresses.
HW determines into which 4-byte words the addresses fall.
Reads (LDS):
Fetch the data, distribute the requested bytes among threads.
Multicast capable.
Writes (STS):
Multiple threads writing the same address: race condition (one write wins; which one is undefined).
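Both behaviors can be shown in one small kernel (an illustrative sketch; the kernel name is made up):

```cuda
__global__ void smem_ops(float *out)
{
    __shared__ float s[32];

    s[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();

    // LDS multicast: all 32 threads read the same 4 B word; the hardware
    // fetches it once and broadcasts it, so this is conflict-free.
    float broadcast = s[0];

    // STS race: every thread stores to the same address. One store wins,
    // but which thread's value survives is undefined.
    s[0] = (float)threadIdx.x;
    __syncthreads();

    out[threadIdx.x] = broadcast + s[0];
}
```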