ME964 High Performance Computing for Engineering Applications
Execution Model and Its Hardware Support
Sept. 25, 2008
Before we get started… Last Time
The CUDA execution model; wrapped up the overview of the CUDA API
Read CUDA Programming Guide 1.1 (for next Tu)
Today: review of concepts discussed over the previous two lectures; more on the CUDA execution model and its hardware support
Focus on thread scheduling. HW4 assigned, due on Thursday, Oct. 2 at 11:59 PM: timing kernel call overhead, matrix-matrix multiplication (tiled, arbitrary-size matrices), and a vector reduction operation
Please note: on Nov. 11 and 13 we'll have a guest lecturer, Dr. Darius Buntinas, of Argonne National Lab. His lectures will cover MPI, a different parallel computational model. The two lectures will run *two* hours long
You’ll get a free Tu or Th afterwards…
Why Use the GPU for Computing?
The GPU has evolved into a flexible and powerful processor: it's programmable using high-level languages (soon also in FORTRAN), it supports 32-bit floating point precision and double precision (2.0), and it is capable of GFLOP-level number-crunching speed.
There is a GPU in each of today's PCs and workstations.
What is Driving this Evolution?
The GPU is specialized for compute-intensive, highly data-parallel computation (owing to its graphics rendering origin). More transistors can be devoted to data processing rather than data caching and flow control.
The fast-growing video game industry exerts strong economic pressure that forces constant innovation
[Figure: CPU vs. GPU transistor allocation – the CPU devotes much of its die area to control logic and cache, while the GPU devotes it to ALUs for data processing; both are backed by DRAM]
ALU – Arithmetic Logic Unit
A digital circuit that performs arithmetic and logical operations; the fundamental building block of a processing unit (CPU and GPU).
A and B are the operands (the data, coming from input registers), F is an operator ("+", "-", etc.) specified by the control unit, R is the result, stored in an output register, and D is an output flag passed back to the control unit.
Some Useful Information on Tools (short detour)
Compilation
Any source file containing CUDA language extensions must be compiled with nvcc. You spot such a file by its .cu suffix.
nvcc is a compile driver: it works by invoking all the necessary tools and compilers, like cudacc, g++, cl, etc. Assignment: read the nvcc document available on the class website.
nvcc can output: C code, which must then be compiled with the rest of the application using another tool; assembly code (PTX); or directly object code.
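For illustration only (the file names are hypothetical; the flags are standard nvcc options), a typical sequence might look like:

  nvcc -c matrixMul.cu -o matrixMul.o     # compile the CUDA source straight to an object file
  nvcc -ptx matrixMul.cu                  # or emit PTX assembly instead
  nvcc matrixMul.o main.o -o matrixMul    # let nvcc drive the final link of the application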
Linking
Any executable with CUDA code requires two dynamic libraries:
The CUDA runtime library (cudart)
The CUDA core library (cuda)
Debugging Using the Device Emulation Mode
An executable compiled in device emulation mode (using the nvcc -deviceemu flag) runs entirely on the host using the CUDA runtime.
No need for any device or CUDA driver.
Each device thread is emulated with a host thread
For your assignments: in Developer Studio project select the “EmuDebug” or “EmuRelease” build configurations
When running in device emulation mode, one can:
Use host native debug support (breakpoints, variable QuickWatch and edit, etc.)
Access any device-specific data from host code and vice versa
Call any host function from device code (e.g., printf) and vice versa
Detect deadlock situations caused by improper usage of __syncthreads
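As a minimal sketch (the kernel and variable names are hypothetical), the following kernel calls the host function printf from device code, which only works when the executable is built in device emulation mode:

  #include <stdio.h>

  __global__ void debugKernel(int *data)
  {
      // Legal only under -deviceemu: each device thread is emulated by a
      // host thread, so host functions such as printf can be called here.
      printf("thread %d sees value %d\n", threadIdx.x, data[threadIdx.x]);
  }

  // Host side: debugKernel<<< 1, 8 >>>(d_data);  // one block of 8 threads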
Device Emulation Mode Pitfalls
Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results
Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
Results of floating-point computations will differ slightly because of different compiler outputs and instruction sets, and the use of extended precision for intermediate results on the host.
There are various options to force strict single precision on the host
End: Information on Tools. Begin: Discussion on Block/Thread Scheduling
Review: The CUDA Programming Model
GPU architecture paradigm: Single Instruction Multiple Data (SIMD). CUDA perspective: Single Program Multiple Threads.
What's the overall software (application) development model? A CUDA application is an integrated CPU + GPU C program: serial C code executes on the CPU, while parallel kernel C code executes on the GPU in thread blocks.
[Figure: CPU serial code alternating with GPU parallel kernels – KernelA<<< nBlkA, nTidA >>>(args) executes as Grid 0, more CPU serial code follows, then KernelB<<< nBlkB, nTidB >>>(args) executes as Grid 1]
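A minimal sketch of this structure (kernel names, launch dimensions, and arguments are hypothetical, following the figure above):

  __global__ void KernelA(float *data) { /* parallel work on the GPU */ }
  __global__ void KernelB(float *data) { /* more parallel work on the GPU */ }

  int main()
  {
      float *d_data;
      cudaMalloc((void**)&d_data, 1024 * sizeof(float));

      // ... serial C code executes on the CPU ...
      KernelA<<< 4, 256 >>>(d_data);   // Grid 0: 4 blocks of 256 threads each
      // ... more serial C code on the CPU ...
      KernelB<<< 8, 128 >>>(d_data);   // Grid 1: 8 blocks of 128 threads each

      cudaFree(d_data);
      return 0;
  }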
Execution Configuration: Grids and Blocks (Review)
A kernel is executed as a grid of blocks of threads
All threads in a kernel can access several device data memory spaces
A block [of threads] is a batch of threads that can cooperate with each other by:
Synchronizing their execution, for hazard-free shared memory accesses
Efficiently sharing data through a low-latency shared memory
Threads from two different blocks cannot cooperate!!!
This has important software design implications
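A minimal sketch of within-block cooperation (the array names and the 256-thread block size are assumptions): threads of one block stage data in shared memory and synchronize before using it; no equivalent mechanism exists across blocks.

  __global__ void reverseBlock(float *d_in, float *d_out)
  {
      __shared__ float tile[256];      // low-latency, per-block shared memory
      int t = threadIdx.x;             // assumes the block holds exactly 256 threads

      tile[t] = d_in[blockIdx.x * blockDim.x + t];
      __syncthreads();                 // all threads of the block wait here

      // Now it is safe to read what other threads of the same block wrote
      d_out[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
  }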
[Figure: the host launches Kernel 1 on Grid 1 (blocks (0,0) through (2,1)) and Kernel 2 on Grid 2 on the device; one block, Block (1, 1), is expanded to show its 5x3 arrangement of threads (0,0) through (4,2). Courtesy: NVIDIA]
CUDA Thread Block: Review
In relation to a Block, the programmer decides:
Block size: from 1 to 512 concurrent threads
Block dimension (shape): 1D, 2D, or 3D; # of threads in each dimension
All threads in a Block execute the same thread code
Threads have thread id numbers within the Block
Threads share data and synchronize while doing their share of the work
The thread program uses the thread id to select work and address shared data
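For example (a minimal sketch with hypothetical names), each thread combines its block and thread ids into a global index and works on exactly one array element:

  __global__ void scaleArray(float *d_a, float alpha, int n)
  {
      // Global index of this thread: which element of d_a it owns
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          d_a[i] = alpha * d_a[i];
  }

  // Host side: enough 256-thread blocks to cover all n elements
  // scaleArray<<< (n + 255) / 256, 256 >>>(d_a, 2.0f, n);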
[Figure: a CUDA Thread Block – threads with ids 0, 1, 2, 3, …, m, all running the same thread code. Courtesy: John Nickolls, NVIDIA]
GeForce-8 Series HW Overview
[Figure: GeForce-8 hardware hierarchy – the Stream Processor Array is built from Texture Processor Clusters (TPC); each TPC contains two Stream Multiprocessors (SM) plus texture hardware (TEX); each SM contains 8 Scalar Processors (SP), 2 Special Function Units (SFU), instruction fetch/dispatch logic, instruction L1 and data L1 caches, and shared memory]
CUDA Processor Terminology
SPA – Stream Processor Array (variable across the GeForce 8-series; 8 TPCs in the GeForce 8800 GTX)
TPC – Texture Processor Cluster (2 SM + TEX)
SM – Stream Multiprocessor (8 SP): a multi-threaded processor core, the fundamental processing unit for a CUDA thread block
SP – Scalar [Stream] Processor: the scalar ALU for a single CUDA thread
Stream Multiprocessor (SM)
A Stream Multiprocessor (SM) contains 8 Scalar Processors (SP) and 2 Special Function Units (SFU); it's where a block lands for execution.
Multi-threaded instruction dispatch: 1 to 768 (!) threads active, with shared instruction fetch per 32 threads
20+ GFLOPS on G80
16 KB shared memory
DRAM texture and memory access
[Figure: Stream Multiprocessor – 8 SPs, 2 SFUs, instruction fetch/dispatch, instruction L1 and data L1 caches, and shared memory]
Scheduling on the HW
Grid is launched on the SPA
Thread Blocks are serially distributed to all the SMs
Potentially >1 Thread Block per SM
Each SM launches Warps of Threads
SM schedules and executes Warps that are ready to run
As Warps and Thread Blocks complete, resources are freed
SPA can launch next Block[s] in line
NOTE: Two levels of scheduling:
For running [desirably] a large number of blocks on a small number of SMs (16/14/etc.)
For running up to 24 warps of threads on the 8 SPs available on each SM
[Figure: the same grid/block/thread hierarchy as before – the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device, with each block made up of individual threads]
SM Executes Blocks
Threads are assigned to SMs in Block granularity
Up to 8 Blocks to each SM (doesn’t mean you’ll have eight though…)
An SM in G80 can take up to 768 threads. This is 24 warps (occupancy calculator!!). It could be 256 threads/block * 3 blocks, or 128 threads/block * 6 blocks, etc.
Threads run concurrently but time slicing is involved
The SM assigns/maintains thread id #s; the SM manages/schedules thread execution
[Figure: blocks of threads (t0, t1, t2, …, tm) assigned to SM 0 and SM 1; each SM has its own multithreaded instruction issue unit (MT IU), SPs, and shared memory, backed by a texture L1 cache, an L2 cache, and device memory]
Thread Scheduling/Execution
Each Thread Block is divided into 32-thread Warps. This is an implementation decision, not part of the CUDA programming model.
Warps are the basic scheduling units in SM
If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM?
Each Block is divided into 256/32 = 8 Warps
There are 8 * 3 = 24 Warps. At any point in time, only *one* of the 24 Warps will be selected for instruction fetch and execution.
[Figure: warps of threads (t0 … t31) from Block 1 and Block 2 resident together on one Streaming Multiprocessor]
SM Warp Scheduling
SM hardware implements zero-overhead Warp scheduling
Warps whose next instruction has its operands ready for consumption are eligible for execution
Eligible Warps are selected for execution based on a prioritized scheduling policy
All threads in a Warp execute the same instruction when selected
4 clock cycles needed to dispatch the same instruction for all threads in a Warp in G80
Side comment: suppose your code has one global memory access every four instructions. Since dispatching one instruction for a Warp takes 4 cycles, one Warp covers only 4 * 4 = 16 cycles of work between memory accesses, so at least 200 / 16 = 12.5, i.e. a minimum of 13 Warps, are needed to fully tolerate a 200-cycle memory latency.
[Figure: the SM multithreaded warp scheduler interleaving warps over time – e.g., warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 35, warp 8 instruction 12, …, warp 3 instruction 36]
SM Instruction Buffer – Warp Scheduling
Fetch one warp instruction/cycle from instruction L1 cache into any instruction buffer slot
Issue one "ready-to-go" warp instruction every 4 cycles, from any warp's instruction buffer slot; operand scoreboarding is used to prevent hazards
Issue selection based on round-robin/age of warp
SM broadcasts the same instruction to 32 Threads of a Warp
[Figure: SM pipeline – instruction L1 cache (I$), multithreaded instruction buffer, register file (RF), constant cache (C$), shared memory, operand select, and the MAD and SFU execution units]
Scoreboarding
All register operands of all instructions in the Instruction Buffer are scoreboarded. Status becomes "ready" after the needed values are deposited. This prevents hazards; cleared instructions are eligible for issue.
Decoupled memory/processor pipelines: any thread can continue to issue instructions until scoreboarding prevents issue.
[Figure: scoreboarding timeline (TB = Thread Block, W = Warp) – instructions from warps of TB1, TB2, and TB3 are interleaved over time; whenever a warp stalls (e.g., TB1 W1, TB2 W1, TB3 W2), the scheduler switches to another ready warp]
Granularity Considerations
For Matrix Multiplication, should I use 8X8, 16X16 or 32X32 tiles?
For 8X8, we have 64 threads per Block. Since each SM can take up to 768 threads, it could take up to 12 Blocks. However, because each SM can only take up to 8 Blocks, only 512 threads will go into each SM!
For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.
For 32X32, we have 1024 threads per Block. This is not an option anyway (we need at most 512 threads per block, and at most 768 per SM).
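A minimal launch-configuration sketch for the 16X16 choice (the kernel name, matrix pointers, and the assumption that width is a multiple of 16 are hypothetical):

  #define TILE_WIDTH 16

  void launchMatrixMul(float *Md, float *Nd, float *Pd, int width)
  {
      dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // 16 x 16 = 256 threads per block
      dim3 dimGrid(width / TILE_WIDTH, width / TILE_WIDTH);   // one block per output tile
      MatrixMulKernel<<< dimGrid, dimBlock >>>(Md, Nd, Pd, width);
  }

  // Each SM can then hold 3 such blocks (3 * 256 = 768 threads), reaching full occupancy.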
How would you scale up the GPU?
Scaling up here means beefing it up. Two issues: as a company, you don't want to rock the boat a lot when scaling up, and you don't want to have legacy code re-written to take advantage of new HW.
You can beef up the memory, not discussed here
Increase the number of TPCs: easy to do, basically more HW. Implications on our side: if you have enough blocks, you rise with the tide too.
Increase the number of SMs in each TPC: easy to do, basically more HW. Implications on our side: if you have enough blocks, you rise with the tide too.
Increase the number of SPs per SM: this is tricky, you'd have to fiddle with the control unit of the SM. The Warp size would change, and most likely this would require more threads per block to be efficient, but that requires more memory on the chip (shared & registers). It snowballs; this is probably going to stay like this for a while…
New GT200 GPU Architecture
[Figure: the GT200 Stream Processor Array, organized into 10 Texture Processing Clusters, each a group of SMs]
G80 – up to 8 TPCs in the SPA; GT200 – 10 TPCs in the SPA
End Discussion on Block/Thread Scheduling
Begin Discussion on Memory Access