(1)
ISA Execution and Occupancy
©Sudhakar Yalamanchili unless otherwise noted
(2)
Objectives
• Understand the sequencing of instructions in a kernel through a set of scalar pipelines (width = warp size)
• Have a basic, high-level understanding of instruction fetch, decode, issue, and memory access
(3)
Reading Assignment
• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6.3
• CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
• Get the CUDA Occupancy Calculator: find it at nvidia.com, or use the web version at http://lxkarthi.github.io/cuda-calculator/
(4)
Recap

__global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

__host__ void vecAdd() {
    dim3 DimGrid(ceil(n / 256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}
[Figure: the kernel’s thread blocks Blk 0 … Blk N-1 are scheduled onto the GPU’s multiprocessors M0 … Mk, which share device RAM]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois,
Urbana-Champaign
(5)
Kernel Launch
[Figure: CUDA streams flow from the host processor into HW queues in the Kernel Management Unit (device), which dispatches kernels to the SMs]
• Commands by the host are issued through streams (see the sketch below)
   Kernels in the same stream are executed sequentially
   Kernels in different streams may be executed concurrently
• Streams are mapped to hardware queues in the device’s kernel management unit (KMU)
   Mapping multiple streams to the same queue serializes some kernels
• Kernel launch distributes thread blocks to SMs
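A minimal sketch of these stream semantics (the kernels k1/k2, sizes, and launch shapes are illustrative, not from the slides):

#include <cuda_runtime.h>

// Toy kernels (hypothetical) used only to illustrate stream behavior.
__global__ void k1(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
__global__ void k2(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Same stream => issue order preserved; different streams => the KMU may
    // map them to different HW queues and run them concurrently.
    k1<<<(n + 255) / 256, 256, 0, s0>>>(x, n);
    k2<<<(n + 255) / 256, 256, 0, s1>>>(y, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(x);
    cudaFree(y);
    return 0;
}

Whether k1 and k2 actually overlap depends on free SM resources and on how many hardware queues the device provides.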
(6)
TBs, Warps, & Scheduling
• Imagine one TB has 64 threads, i.e., 2 warps
• On a K20c GK110 GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels
[Figure: thread blocks are distributed across SMX0, SMX1, …, SMX12; each SMX contains a register file, cores, and L1 cache/shared memory]
(9)
TBs, Warps, & Utilization
• One TB has 64 threads, i.e., 2 warps
• On a K20c GK110 GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels
[Figure: thread blocks fill SMX0, SMX1, …, SMX12; remaining TBs are queued until resources become available]
(10)
NVIDIA GK110 (Kepler): Thread Block Scheduler
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/
(11)
SMX Organization
Multiple Warp Schedulers
192 cores – 6 clusters of 32 cores each
64K 32-bit registers
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/
(12)
Execution Sequencing in a Single SM
• Cycle level view of the execution of a GPU kernel
[Figure: cycle-level SM pipeline — I-Fetch → I-Buffer → Decode → Issue → register file (RF) read → parallel scalar pipelines → D-cache access (all hit? / miss?) → writeback, with pending warps (Warp 1, Warp 2, …, Warp 6) waiting to issue]
Execution Sequencing?
(13)
Example: VectorAdd on GPU
CUDA:
__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N) c[index] = a[index] + b[index];
}

PTX (assembly):
    setp.lt.s32 %p, %r5, %rd4;   // r5 = index, rd4 = N
@p  bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];    // r6 = &a[index]
    ld.global.f32 %f2, [%r7];    // r7 = &b[index]
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;    // r8 = &c[index]
L2: ret;
(14)
Example: VectorAdd on GPU
• N = 8, 8 threads, 1 block, warp size = 4
• 1 SM, 4 cores
• Pipeline:
   Fetch:
   o One instruction from each warp
   o Round-robin through all warps
   Execution:
   o In-order execution within warps
   o With proper data forwarding, 1 cycle per stage
• How many warps? (8 threads / 4 threads per warp = 2 warps)
(15)
Execution Sequence
    setp.lt.s32 %p, %r5, %rd4;
@p  bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];
    ld.global.f32 %f2, [%r7];
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;
L2: ret;
[Animation, slides 15–36: Warp0 and Warp1 advance cycle by cycle through the 4-lane SM pipeline (FE, DE, EXE, MEM, WB). One instruction is fetched per cycle, round-robin between the warps — setp W0, setp W1, @p bra W0, @p bra W1, bra W0, bra W1, ld W0, ld W1, ld W0, ld W1, add W0, add W1, st W0, st W1, ret W0, ret W1 — and each fetched instruction then occupies all four lanes of each subsequent stage for one cycle until both warps drain from the pipeline.]
Note: Idealized execution without memory or special function delays
(37)
Fine Grained Multithreading
• First introduced in the Denelcor HEP (1980s)
• Can eliminate data bypassing in an instruction stream
   Maintain separation between successive instructions in a warp
• Simplifies dependencies between successive instructions
   E.g., have only one instruction per warp in the pipeline at a time
• What happens when warp size > #lanes? Why have such a relationship?
(38)
Multi-Cycle Dispatch
[Figure: one warp dispatched onto 8 scalar pipelines over multiple cycles — one group of lanes dispatched in cycle 1, the next in cycle 2, …]
• Issue the warp over multiple cycles
• NB: fetch bandwidth
(39)
SIMD vs. SIMT
[Figure: Flynn taxonomy — instruction streams × data streams gives SISD, SIMD, MISD, MIMD]
• SIMD (e.g., SSE/AVX): a single scalar thread operating on a wide register file; synchronous operation
• SIMT (e.g., PTX, HSA): multiple, loosely synchronized threads, each with its own register file (RF)
• MIMD (e.g., pthreads): multiple independent instruction streams
(40)
Comparison
[Figure: MIMD as independent threads with private register files; SIMD as one thread over a wide register file; SIMT as per-thread register files on lock-step lanes]
• MIMD: multiple independent threads and explicit synchronization; TLP, DLP, ILP, SPMD
• SIMD: single thread + vector operations; DLP; easily embedded in sequential code
• SIMT: multiple, synchronous threads; TLP, DLP, and (more recently) ILP; explicitly parallel
(41)
Performance Metrics: Occupancy
• Performance implications of programming model properties
   Warps, thread blocks, register usage
• Capture which resources can be dynamically shared, and how to reason about the resource demands of a CUDA kernel
   Enables device-specific online tuning of kernel parameters
(42)
CUDA Occupancy
• Occupancy = (#active warps) / (#maximum active warps)
   A measure of how much of the SM’s capacity you are using
• Limits on the numerator?
   Registers/thread
   Shared memory/thread block
   Number of scheduling slots: thread blocks or warps
• Limits on the denominator?
   Memory bandwidth
   Scheduler slots
• What is the performance impact of varying kernel resource demands? (See the sketch below.)
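As a runtime counterpart to the spreadsheet calculator, here is a minimal sketch that computes this ratio for a given kernel and block size (it reuses the vecAddKernel from the recap slide; cudaOccupancyMaxActiveBlocksPerMultiprocessor is part of the CUDA runtime API):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAddKernel(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    int blockSize = 256, numBlocks = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Max resident blocks of this kernel per SM at this block size, given
    // its register and shared memory demands (0 bytes dynamic shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, vecAddKernel,
                                                  blockSize, 0);

    int activeWarps = numBlocks * blockSize / prop.warpSize;          // numerator
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;  // denominator
    printf("Occupancy: %.2f\n", (double)activeWarps / maxWarps);
    return 0;
}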
(43)
Performance Goals
[Figure: two axes of performance — core utilization and memory BW utilization]
What are the limiting factors?
(44)
Resource Limits on Occupancy
[Figure: the Kernel Distributor feeds the SM Scheduler, which dispatches thread blocks (TB 0, …) to SMs backed by DRAM. Within an SM: the warp schedulers and warp contexts limit the #threads; the register file limits the #threads; thread block control limits the #thread blocks; L1/shared memory limits the #thread blocks.]
SM – Streaming Multiprocessor; SP – Streaming Processor
(45)
Programming Model Attributes
[Figure: Grid 1 is a 2×2 array of thread blocks, Block (0,0) … Block (1,1); each block, e.g., Block (1,1), is a 3D array of threads, Thread (0,0,0) … Thread (1,0,3)]
• What do we need to know about the target processor?
   #SMs
   Max threads/block
   Max threads/dim
   Warp size
• How do these map to kernel configuration parameters that we can control? (A query sketch follows.)
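A minimal sketch of querying exactly these attributes before launch (cudaGetDeviceProperties and the fields shown are part of the CUDA runtime API):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    printf("#SMs:              %d\n", prop.multiProcessorCount);
    printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads/dim:   %d x %d x %d\n", prop.maxThreadsDim[0],
           prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Warp size:         %d\n", prop.warpSize);
    return 0;
}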
(46)
Impact of Thread Block Size
• Consider Fermi with 1536 thread slots per SM
   With 512 threads/block, only up to 3 thread blocks can execute at any instant
   With 128 threads/block, up to 12 thread blocks per SM
   Consider how many instructions can be in flight
• Now consider the limit of 8 thread blocks/SM: at 128 threads/block, only 8 × 128 = 1024 threads are active at a time, so occupancy = 1024/1536 = 0.666
• To maximize utilization, thread block size should balance demand for thread block slots vs. thread slots
(47)
Impact of #Registers Per Thread
• Assume 10 registers/thread and a thread block size of 256
• Number of registers per SM = 16K
• A thread block requires 2560 registers, so a maximum of 6 thread blocks fit per SM
   Uses all 1536 thread slots (6 × 256)
• What is the impact of increasing the register count by 2 per thread? Each block now needs 12 × 256 = 3072 registers, and only 5 blocks fit in 16K registers
   The granularity of management is a thread block, so this costs 256 threads of concurrency! (See the sketch below.)
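One way to act on this: __launch_bounds__ lets the programmer declare the intended launch configuration so the compiler can cap registers per thread, and compiling with nvcc -Xptxas -v reports the resulting register usage. A sketch, assuming we want 6 blocks of 256 threads resident per SM (the ~10-register budget in the comment follows from the slide’s 16K-register example; the compiler’s actual allocation is a heuristic):

// Hint to the compiler: at most 256 threads/block, and at least 6 blocks
// should be resident per SM, so registers/thread must stay near
// 16K / (6 * 256) = ~10 on the SM described above.
__global__ void __launch_bounds__(256, 6)
vecAddKernel(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}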
(48)
Impact of Shared Memory
• Shared memory is allocated per thread block
   Can limit the number of thread blocks executing concurrently per SM
• As we change the gridDim and blockDim parameters, how does the demand change for shared memory, thread slots, and thread block slots? (A sketch follows.)
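A minimal sketch of how per-block shared memory enters the launch configuration (the kernel name and sizes are illustrative; dynamic shared memory is sized by the third <<<>>> parameter):

__global__ void copyThroughShared(float *out, const float *in, int n) {
    extern __shared__ float buf[];  // sized at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = buf[threadIdx.x];
}

// Launch with 256 threads/block and 256 * 4 B = 1 KB of dynamic shared
// memory per block. With, say, 48 KB of shared memory per SM, shared memory
// alone would allow 48 resident blocks, so here thread slots, not shared
// memory, are the binding limit:
// copyThroughShared<<<numBlocks, 256, 256 * sizeof(float)>>>(d_out, d_in, n);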
(49)
Thread Granularity
• How fine grained should threads be? Coarser-grain threads mean:
   o Increased register pressure and shared memory pressure
   o Lower pressure on thread block slots and thread slots
• Merge thread blocks
• Share row values to reduce memory BW
[Figure: matrix multiply d_M × d_N = d_P — two adjacent output threads read the same row of d_M; merging the two threads reuses that row]
(50)
Balance
[Figure: four interacting resources — #threads/block, #thread blocks, shared memory/thread block, #registers/thread]
• Navigate the tradeoffs to maximize core utilization and memory bandwidth utilization for the target device
• Goal: increase occupancy until one or the other is saturated
(51)
Performance Tuning
• Auto-tuning to maximize performance for a device
• Query device properties to tune kernel parameters prior to launch (see the sketch below)
• Tune kernel properties to maximize efficiency of execution
• http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/index.html
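A minimal sketch of such pre-launch tuning, reusing the vecAddKernel above (cudaOccupancyMaxPotentialBlockSize is part of the CUDA runtime, CUDA 6.5 and later):

#include <cuda_runtime.h>

__global__ void vecAddKernel(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// Choose the block size that maximizes occupancy for this kernel on the
// current device, then derive the grid size from the problem size.
void launchTuned(float *A, float *B, float *C, int n) {
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       vecAddKernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;
    vecAddKernel<<<gridSize, blockSize>>>(A, B, C, n);
}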
(52)
Device Properties
(53)
Summary
• Warps are sequenced through the scalar pipelines
• Execution from multiple warps is overlapped to hide the latency of memory accesses (or special functions)
• Out-of-order execution across warps is not shown; a scoreboard is needed for correctness
• The occupancy calculator gives a first-order estimate of utilization
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
Questions?