(1) isa execution and occupancy ©sudhakar yalamanchili unless otherwise noted
DESCRIPTION
(3) Reading Assignment Kirk and Hwu, “Programming Massively Parallel Processors: A Hands on Approach,”, Chapter 6.3 CUDA Programming Guide Get the CUDA Occupancy Calculator Find at nvidia.com Use web version at 3TRANSCRIPT
![Page 1: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/1.jpg)
(1)
ISA Execution and Occupancy
©Sudhakar Yalamanchili unless otherwise noted
![Page 2: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/2.jpg)
(2)
Objectives
• Understand the sequencing of instructions in a kernel through a set of scalar pipelines (width =warp)
• Have a basic, high level understanding of instruction fetch, decode, issue, and memory access
![Page 3: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/3.jpg)
(3)
Reading Assignment
• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands on Approach,”, Chapter 6.3
• CUDA Programming Guide http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
• Get the CUDA Occupancy Calculator Find at nvidia.com Use web version at http://lxkarthi.github.io/cuda-calculator/
3
![Page 4: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/4.jpg)
(4)
Recap__global__void vecAddKernel(float *A_d, float *B_d, float *C_d, int n){ int i = blockIdx.x * blockDim.x + threadIdx.x;
if( i<n ) C_d[i] = A_d[i]+B_d[i];}
__host__Void vecAdd(){ dim3 DimGrid = (ceil(n/256,1,1); dim3 DimBlock = (256,1,1); vecAddKernel<<<DimGrid,DimBlock>>>(A_d,B_d,C_d,n);}
KernelBlk 0 Blk N-1• • •
GPUM0
RAM
Mk• • •
Schedule onto multiprocessors
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois,
Urbana-Champaign
![Page 5: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/5.jpg)
(5)
Kernel LaunchCUDA streams
Host Processor
HW Queues
Kernel Management Unit (Device)
• Commands by host issued through streams Kernels in the same stream executed sequentially Kernels in different streams may be executed concurrently
• Streams are mapped to hardware queues in the device in the kernel management unit (KMU) Multiple streams mapped to each queue serializes some
kernels• Kernel launch distributes thread blocks to SMs
Kernel dispatch to SMs
![Page 6: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/6.jpg)
(6)
TBs, Warps, & Scheduling
• Imagine one TB has 64 threads or 2 warps
• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels
Register FileCores
L1 Cache/Shared Memory
Register FileCores
L1 Cache/Shared Memory
Register FileCores
L1 Cache/Shared Memory
SMXs
……
……
SMX0 SMX1 SMX12
Thread Blocks
![Page 7: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/7.jpg)
(7)
TBs, Warps, & Utilization
• One TB has 64 threads or 2 warps
• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels
Register FileCores
L1 Cache/Shared Memory
Register FileCores
L1 Cache/Shared Memory
Register FileCores
L1 Cache/Shared Memory
SMXs
……
……
SMX0 SMX1 SMX12
Thread Blocks
![Page 8: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/8.jpg)
(8)
TBs, Warps, & Utilization
• One TB has 64 threads or 2 warps
• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels
Register FileCores
L1 Cache/Shared Memory
Register FileCores
L1 Cache/Shared Memory
Register FileCores
L1 Cache/Shared Memory
SMXs
……
……
SMX0 SMX1 SMX12
Thread Blocks
![Page 9: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/9.jpg)
(9)
TBs, Warps, & Utilization
• One TB has 64 threads or 2 warps
• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels
Register FileCores
L1 Cache/Shared Memory
Register FileCores
L1 Cache/Shared Memory
Register FileCores
L1 Cache/Shared Memory
SMXs
……
……
SMX0 SMX1 SMX12
Thread Blocks Remaining TBs are queued
![Page 10: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/10.jpg)
(10)
NVIDIA GK110 (Keplar)Thread Block Scheduler
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/
![Page 11: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/11.jpg)
(11)
SMX Organization
Multiple Warp Schedulers
192 cores – 6 clusters of 32 cores each
64K 32-bit registers
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/
![Page 12: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/12.jpg)
(12)
Execution Sequencing in a Single SM
• Cycle level view of the execution of a GPU kernel
Warp 6
Warp 1Warp 2
Decode
RFPRF
D-CacheDataAll Hit?
Writeback
Pending Warps
scalarPipeline
scalarpipeline
scalarpipeline
Issue
I-Buffer
I-Fetch
Miss?
Execution Sequencing?
![Page 13: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/13.jpg)
(13)
Example: VectorAdd on GPU
__global__ vector_add ( float *a, float *b, float *c, int N) {int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index < N) c[index] = a[index]+b[index];}
setp.lt.s32 %p, %r5, %rd4; //r5 = index, rd4 = N@p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; //r6 = &a[index]ld.global.f32 %f2, [%r7]; //r7 = &b[index]add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3; //r8 = &c[index]
L2:ret;
PTX (Assembly):CUDA:
![Page 14: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/14.jpg)
(14)
Example: VectorAdd on GPU
• N=8, 8 Threads, 1 block, warp size = 4
• 1 SM, 4 Cores
• Pipeline: Fetch:
o One instruction from each warpo Round-robin through all warps
Execution:o In-order execution within warps
With proper data forwarding 1 Cycle each stage
• How many warps?
![Page 15: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/15.jpg)
(15)
FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
Execution Sequence
![Page 16: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/16.jpg)
(16)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
setp W0 FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
Execution Sequence (cont.)
![Page 17: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/17.jpg)
(17)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
setp W1 setp W0
FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
Execution Sequence (cont.)
![Page 18: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/18.jpg)
(18)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
bra W0setp W1
FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
setp W0 setp W0 setp W0 setp W0
Execution Sequence (cont.)
![Page 19: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/19.jpg)
(19)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
@p bra W1@p bra W0
FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
setp W1 setp W1 setp W1 setp W1
setp W0 setp W0 setp W0 setp W0
Execution Sequence (cont.)
![Page 20: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/20.jpg)
(20)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
bra L2@p bra W1
FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
bra W0 bra W0 bra W0 bra W0
setp W1 setp W1 setp W1 setp W1
setp W0 setp W0 setp W0 setp W0
Execution Sequence (cont.)
![Page 21: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/21.jpg)
(21)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
bra L2 FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
bra W1 bra W1 bra W1 bra W1
bra W0 bra W0 bra W0 bra W0
setp W1 setp W1 setp W1 setp W1
Execution Sequence (cont.)
![Page 22: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/22.jpg)
(22)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
ld W0 FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
bra W1 bra W1 bra W1 bra W1
bra W0 bra W0 bra W0 bra W0
Execution Sequence (cont.)
![Page 23: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/23.jpg)
(23)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
ld W1 FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WBbra W1 bra W1 bra W1 bra W1
ld W0
Execution Sequence (cont.)
![Page 24: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/24.jpg)
(24)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
ld W0 FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
ld W1ld W0 ld W0 ld W0 ld W0
Execution Sequence (cont.)
![Page 25: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/25.jpg)
(25)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
ld W1 FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
ld W0ld W1 ld W1 ld W1 ld W1
ld W0 ld W0 ld W0 ld W0
Execution Sequence (cont.)
![Page 26: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/26.jpg)
(26)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
add W0 FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
ld W1
ld W1 ld W1 ld W1 ld W1
ld W0 ld W0 ld W0 ld W0
ld W0 ld W0 ld W0 ld W0
Execution Sequence (cont.)
![Page 27: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/27.jpg)
(27)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
add W1 FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WBld W1 ld W1 ld W1 ld W1
ld W0 ld W0 ld W0 ld W0
ld W1 ld W1 ld W1 ld W1add W0
Execution Sequence (cont.)
![Page 28: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/28.jpg)
(28)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
st W0 FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WBld W0 ld W0 ld W0 ld W0
ld W1 ld W1 ld W1 ld W1
add W1add W0 add W0 add W0 add W0
Execution Sequence (cont.)
![Page 29: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/29.jpg)
(29)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
st W1 FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WBld W1 ld W1 ld W1 ld W1
add W0 add W0 add W0 add W0
add W1 add W1 add W1 add W1st W0
Execution Sequence (cont.)
![Page 30: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/30.jpg)
(30)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
ret FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WBadd W0 add W0 add W0 add W0
add W1 add W1 add W1 add W1
st W1st W0 st W0 st W0 st W0
Execution Sequence (cont.)
![Page 31: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/31.jpg)
(31)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
ret FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WBadd W1 add W1 add W1 add W1
ret
st W0 st W0 st W0 st W0
st W1 st W1 st W1 st W1
Execution Sequence (cont.)
![Page 32: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/32.jpg)
(32)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
ret FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WBst W0 st W0 st W0 st W0
st W1 st W1 st W1 st W1
ret ret ret ret
Execution Sequence (cont.)
![Page 33: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/33.jpg)
(33)
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Warp0 Warp1
FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WBst W1 st W1 st W1 st W1
ret ret ret ret
ret ret ret ret
Execution Sequence (cont.)
![Page 34: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/34.jpg)
(34)
Warp0 Warp1
FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WBret ret ret ret
ret ret ret ret
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Execution Sequence (cont.)
![Page 35: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/35.jpg)
(35)
Warp0 Warp1
FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WBret ret ret ret
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Execution Sequence (cont.)
![Page 36: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/36.jpg)
(36)
Warp0 Warp1
FEDE
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
EXE
MEM
WB
setp.lt.s32 %p, %r5, %rd4; @p bra L1;bra L2;
L1:ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2;
st.global.f32 [%r8], %f3;
L2:ret;
Execution Sequence (cont.)
Note: Idealized execution without memory or special function delays
![Page 37: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/37.jpg)
(37)
Fine Grained Multithreading
• First introduced in the Denelcor HEP (1980’s)
• Can eliminate data bypassing in an instruction stream Maintain separation between successive instructions in a
warp
• Simplifies dependency between successive instructions E.g., have only one instruction in the pipeline at a time
• What happens when warp size > #lanes? Why have such a relationship?
![Page 38: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/38.jpg)
(38)
Multi-Cycle Dispatch
scalarpipeline
scalarpipeline
scalarpipeline
scalarpipeline
scalarpipeline
scalarpipeline
scalarpipeline
scalarpipeline
Cycle 1
Cycle 2
dispatch
dispatch
One warp
warp
Issue the warp over multiple cycles
NB: Fetch BW
![Page 39: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/39.jpg)
(39)
SIMD vs. SIMT
SISD SIMD
MISD MIMD
Data Streams
Inst
ruct
ion
Stre
ams
Register File
+
Loosely synchronized threadsMultiple threads
Synchronous operation
RFRF RF RF
Single Scalar Thread
SIMT
Flynn Taxonomy
e.g., pthreads
e.g., SSE/AVX
e.g., PTX, HSA
![Page 40: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/40.jpg)
(40)
Comparison
Register File
+
RFRF RF RFMIMD SIMD SIMT
•Multiple independent threads and explicit synchronization
•TLP, DLP, ILP, SPMD
•Single thread + vector operations
•DLP
•Easily embedded in sequential code
•Multiple, synchronous threads
•TLP, DLP, ILP (more recent)
•Explicitly parallel
![Page 41: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/41.jpg)
(41)
Performance Metrics: Occupancy
• Performance implications of programming model properties Warps, thread blocks, register usage
• Capture which resources can be dynamically shared, and how to reason about resource demands of a CUDA kernel Enable device-specific online tuning of kernel parameters
41
![Page 42: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/42.jpg)
(42)
CUDA Occupancy
• Occupancy = (#Active Warps) /(#MaximumActive Warps) Measure of how well you are using max capacity
• Limits on the numerator? Registers/thread Shared memory/thread block Number of scheduling slots: thread blocks or warps
• Limits on the denominator? Memory bandwidth Scheduler slots
• What is the performance impact of varying kernel resource demands?
![Page 43: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/43.jpg)
(43)
Performance Goals
Core
Util
izatio
n
BW U
tiliza
tion
What are the limiting Factors?
![Page 44: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/44.jpg)
(44)
Resource Limits on Occupancy
SM Scheduler
Kernel Distributor
SM SM SM SM
DRAM
Limits the #threads
Limits the #thread blocks
Warp Schedulers
Warp ContextWarp ContextWarp ContextWarp Context
Register File
SPSPSPSP
SPSPSPSP
SPSPSPSP
SPSPSPSP
TB 0
Thread Block Control
L1/Shared MemoryLimits the #thread
blocks
Limits the #threads
SM – Stream MultiprocessorSP – Stream Processor
![Page 45: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/45.jpg)
(45)
Programming Model AttributesGrid 1
Block (0, 0)
Block (1, 1)
Block (1, 0)
Block (0, 1)
Block (1,1)
Thread(0,0,0)Thread
(0,1,3)Thread(0,1,0)
Thread(0,1,1)
Thread(0,1,2)
Thread(0,0,0)
Thread(0,0,1)
Thread(0,0,2)
Thread(0,0,3)
(1,0,0) (1,0,1) (1,0,2) (1,0,3)
• #SMs• Max Threads/Block• Max Threads/Dim• Warp size
• What do we need to know about target processor?
• How do these map to kernel configuration parameters that we can control?
![Page 46: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/46.jpg)
(46)
Impact of Thread Block Size
• Consider Fermi with 1536 threads/SM With 512 threads/block, can only up to 3 thread blocks
executing at an instant With 128 threads/block 12 thread blocks per SM Consider how many instructions can be in flight?
• Consider limit of 8 thread blocks/SM? Only 1024 active threads at a time Occupancy = 0.666
• To maximize utilization, thread block size should balance demand for thread blocks vs. thread slots
![Page 47: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/47.jpg)
(47)
Impact of #Registers Per Thread
• Assume 10 registers/thread and a thread block size of 256
• Number of registers per SM = 16K
• A thread block requires 2560 registers for a maximum of 6 thread blocks per SM Uses all 1536 thread slots
• What is the impact of increasing number of registers by 2? Granularity of management is a thread block! Loss of concurrency of 256 threads!
![Page 48: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/48.jpg)
(48)
Impact of Shared Memory
• Shared memory is allocated per thread block Can limit the number of thread blocks executing concurrently
per SM
• As we change the gridDim and blockDim parameters how does demand change for shared memory, number of thread slots, or number of thread block slots?
![Page 49: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/49.jpg)
(49)
Thread Granularity
• How fine grained should threads be? Coarser grain threads
o Increased register pressure, shared memory pressure
o Lower pressure on thread block slots and thread slots
• Merge threads blocks
• Share row values Reduce memory BW
d_N
d_M d_P
Merge these two threads
![Page 50: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/50.jpg)
(50)
Balance
#Threads/Block
#Thread Blocks
Shared memory/T
hread block
#Registers/Thread
• Navigate the tradeoffs to maximize core utilization and memory bandwidth utilization for the target device
• Goal: Increase occupancy until one or the other is saturated
![Page 51: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/51.jpg)
(51)
Performance Tuning
• Auto-tuning to maximize performance for a device
• Query device properties to tune kernel parameters prior to launch
• Tune kernel properties to maximize efficiency of execution
• http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/index.html
![Page 52: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/52.jpg)
(52)
Device Properties
![Page 53: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/53.jpg)
(53)
Summary
• Sequence warps through the scalar pipelines
• Overlap execution from multiple warps to hide memory latency (or special functions)
• Out of order execution not shown need scoreboard for correctness
• Occupancy calculator as a first order estimate of correctness
![Page 54: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted](https://reader033.vdocument.in/reader033/viewer/2022052308/5a4d1ad37f8b9ab05997206f/html5/thumbnails/54.jpg)
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
Questions?