
Introduction to CUDA (1 of n*)

Joseph Kider, University of Pennsylvania, CIS 565 - Spring 2011

* Where n is 2 or 3

Agenda

GPU architecture review
CUDA

First of two or three dedicated classes

Acknowledgements

Many slides are from Kayvon Fatahalian's From Shader Code to a Teraflop: How GPU Shader Cores Work:

http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf

David Kirk and Wen-mei Hwu's UIUC course: http://courses.engr.illinois.edu/ece498/al/

GPU Architecture Review

GPUs are: parallel, multithreaded, many-core

GPUs have: tremendous computational horsepower and high memory bandwidth

GPU Architecture Review

GPUs are specialized for compute-intensive, highly parallel computation - graphics!

Transistors are devoted to processing, not to data caching and flow control

GPU Architecture Review

Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf

Transistor Usage

Slides from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf

Threading Hardware in G80


Sources

Slides by ECE 498 AL: Programming Massively Parallel Processors (Wen-mei Hwu); John Nickolls, NVIDIA

[Pipeline diagram: 3D Application or Game → 3D API (OpenGL or Direct3D) → GPU Front End → Primitive Assembly → Rasterization and Interpolation → Raster Operations → Frame Buffer; vertex and fragment processing stages transform pre-transformed vertices and fragments along the way; CPU-GPU boundary at AGP/PCIe]

Fixed-function pipeline

[Same pipeline diagram, now with programmable vertex and fragment processors]

Programmable pipeline

[Same pipeline diagram, now with a unified vertex, fragment, and geometry processor]

Unified Programmable pipeline

General Diagram (6800/NV40)

TurboCache

Uses PCI-Express bandwidth to render directly to system memory
Card needs less memory
Performance boost while lowering cost
TurboCache Manager dynamically allocates from main memory
Local memory used to cache data and to deliver peak performance when needed

Page 5: Introduction to GPU architecture review CUDA (1 of n*)cis565/Lectures2011/Lecture9.pdf · Kayvon Fatahalian's From Shader Code to a Teraflop: ...  ... (3, 1

NV40 Vertex Processor

An NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle

NV40 Fragment Processors

Early termination from mini z-buffer and z-buffer checks; the resulting sets of 4 pixels (quads) are passed on to the fragment units

Why NV40 series was better

Massive parallelism
Scalability

Lower-end products have fewer pixel pipes and fewer vertex shader units

Computation power: 222 million transistors
First to comply with Microsoft's DirectX 9 spec

Dynamic Branching in pixel shaders

Dynamic Branching

Helps detect if a pixel needs shading
Instruction flow handled in groups of pixels
Specify branch granularity (the number of consecutive pixels that take the same branch)
Better distribution of blocks of pixels between the different quad engines

General Diagram (7800/G70)

General Diagram (6800/NV40)


GeForce Go 7800 – Power Issues

Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
Dynamic clock scaling can run as slow as 16 MHz

This is true for the engine, memory, and pixel clocks
Heavier use of clock gating than the desktop version
Runs at voltages lower than any other mobile performance part
Regardless, you won't get much battery-based runtime for a 3D game

[GeForce 7800 GTX block diagram: 8 vertex engines, triangle setup/raster, Z-cull, shader instruction dispatch, 24 pixel shaders, fragment crossbar, 16 raster operation pipelines, and 4 memory partitions]

GeForce 7800 GTX Parallelism

[G80 graphics-mode block diagram: host and input assembler feed vertex, geometry, and pixel thread issue units and setup/raster/Z-cull; an array of SP pairs with L1 caches and texture filter (TF) units forms the thread processor, backed by L2 caches and framebuffer (FB) partitions]

The future of GPUs is programmable processing, so build the architecture around the processor

G80 – Graphics Mode

G80 CUDA mode – A Device Example

Processors execute computing threads
New operating mode/HW interface for computing

[G80 CUDA-mode block diagram: host and input assembler feed a thread execution manager; an array of processors with parallel data caches and texture units, plus load/store paths to global memory]

The GPU has evolved into a very flexible and powerful processor:

It's programmable using high-level languages
It supports 32-bit floating point precision
It offers lots of GFLOPS:

GPU in every PC and workstation

[GFLOPS-over-time chart; G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

Why Use the GPU for Computing?


What is Behind such an Evolution?

The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about)

So, more transistors can be devoted to data processing rather than data caching and flow control

The fast-growing video game industry exerts strong economic pressure that forces constant innovation

[Diagram: the CPU devotes much of its die to control logic and cache next to DRAM; the GPU devotes most of its die to ALUs next to DRAM]

What is (Historical) GPGPU?

General Purpose computation using GPU and graphics API in applications other than 3D graphics

GPU accelerates critical path of application

Data parallel algorithms leverage GPU attributes:
Large data arrays, streaming throughput
Fine-grain SIMD parallelism
Low-latency floating point (FP) computation

Applications – see GPGPU.org: game effects (FX) physics, image processing

Previous GPGPU Constraints

Dealing with graphics API

Working with the corner cases of the graphics API

Addressing modes: limited texture size/dimension

Shader capabilities: limited outputs

Instruction sets: lack of integer & bit ops

Communication limited: between pixels; scatter a[i] = p

[Diagram of the legacy GPGPU (fragment program) model: per-thread input registers, temp registers, and output registers; per-shader constants; per-context texture; FB memory]

An Example of Physical Reality Behind CUDA

CPU (host) alongside a GPU with local DRAM (device)

Arrays of Parallel Threads

• A CUDA kernel is executed by an array of threads
– All threads run the same code (SPMD)
– Each thread has an ID that it uses to compute memory addresses and make control decisions

[Figure: threads 0-7 of a thread block, each running the same kernel body; the same code executes in Thread Block 0 through Thread Block N - 1]

float x = input[threadID];
float y = func(x);
output[threadID] = y;

Thread Blocks: Scalable Cooperation

Divide monolithic thread array into multiple blocks

Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
Threads in different blocks cannot cooperate

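To make this concrete, here is a minimal sketch of how the per-thread code from the figure above would appear as a full kernel; the names (applyFunc, input, output, n) and the launch configuration are illustrative, not from the slides.

// Minimal SPMD sketch: every thread runs the same code and uses its ID
// to pick its data element (names and launch configuration are illustrative).
__global__ void applyFunc(const float* input, float* output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (threadID < n)                                       // guard threads past the end of the data
    {
        float x = input[threadID];
        float y = 2.0f * x;                                 // stand-in for func(x)
        output[threadID] = y;
    }
}

// Host side: enough 256-thread blocks to cover n elements
// applyFunc<<<(n + 255) / 256, 256>>>(d_input, d_output, n);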


Thread Batching: Grids and Blocks

A kernel is executed as a grid of thread blocks

All threads share data memory space

A thread block is a batch of threads that can cooperate with each other by:

Synchronizing their execution, for hazard-free shared memory accesses

Efficiently sharing data through a low latency shared memory

Two threads from two different blocks cannot cooperate

[Figure: the host launches Kernel 1 as Grid 1 (blocks (0,0) through (2,1)) and Kernel 2 as Grid 2; Block (1,1) is expanded into a 5x3 array of threads (0,0) through (4,2). Courtesy: NVIDIA]
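A minimal sketch of how the grid and blocks in the figure above would be declared at launch time; the kernel name and arguments are placeholders.

dim3 dimGrid(3, 2);                      // Grid 1: 3 x 2 blocks, as in the figure
dim3 dimBlock(5, 3);                     // each block: 5 x 3 threads
// kernel1<<<dimGrid, dimBlock>>>(...);  // placeholder kernel and arguments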

Block and Thread IDs

Threads and blocks have IDs, so each thread can decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D

Simplifies memory addressing when processing multidimensional data

Image processing
Solving PDEs on volumes
…
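For example, a 2D image can be addressed directly from the block and thread IDs; a sketch (image and imageWidth are illustrative names):

int col = blockIdx.x * blockDim.x + threadIdx.x;   // x pixel coordinate
int row = blockIdx.y * blockDim.y + threadIdx.y;   // y pixel coordinate
// float pixel = image[row * imageWidth + col];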


CUDA Device Memory Space Overview

Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory

[Figure: device memory spaces - each thread has registers and local memory, each block has shared memory, and the grid has global, constant, and texture memory]

The host can R/W global, constant, and texture memories
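A sketch of how these memory spaces appear in CUDA C; the kernel and variable names are illustrative.

__constant__ float coeff[16];                     // per-grid constant memory (read-only in kernels)

__global__ void memorySpaces(float* globalData)   // globalData points into per-grid global memory
{
    int i = threadIdx.x;                          // i lives in a per-thread register
    __shared__ float tile[256];                   // per-block shared memory, visible to the whole block
    tile[i] = globalData[i] * coeff[0];           // read global and constant memory
    globalData[i] = tile[i];                      // write back to global memory
}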

Global, Constant, and Texture Memories (Long Latency Accesses)

Global memory
Main means of communicating R/W data between host and device
Contents visible to all threads

Texture and Constant Memories

Constants initialized by host
Contents visible to all threads



Block IDs and Thread IDs

Each thread uses IDs to decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D

Simplifies memory addressing when processing multidimensional data

Image processing
Solving PDEs on volumes

CUDA Memory Model Overview

Global memory
Main means of communicating R/W data between host and device
Contents visible to all threads
Long latency access

We will focus on global memory for now

[Figure: host beside a device grid; each block has shared memory and per-thread registers, and all blocks access global memory]


Parallel Computing on a GPU

8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications

Available in laptops, desktops, and clusters

GPU parallelism is doubling every year
Programming model scales transparently

Programmable in C with CUDA tools

GeForce 8800

Tesla S870

Tesla D870

Single-Program Multiple-Data (SPMD)

CUDA integrated CPU + GPU application C program

Serial C code executes on the CPU
Parallel kernel C code executes on GPU thread blocks

[Figure: CPU serial code, then GPU parallel kernel KernelA<<< nBlk, nTid >>>(args) running as Grid 0, then more CPU serial code, then KernelB<<< nBlk, nTid >>>(args) running as Grid 1]
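A sketch of the integrated CPU + GPU pattern shown above; KernelA, KernelB, nBlk, nTid, and the empty argument lists follow the slide's notation and are placeholders.

__global__ void KernelA() { /* parallel work for Grid 0 */ }
__global__ void KernelB() { /* parallel work for Grid 1 */ }

int main()
{
    int nBlk = 4, nTid = 256;            // illustrative launch configuration
    // serial C code executes on the CPU
    KernelA<<< nBlk, nTid >>>();         // parallel kernel launches Grid 0 on the GPU
    cudaDeviceSynchronize();
    // more serial C code on the CPU
    KernelB<<< nBlk, nTid >>>();         // second kernel launches Grid 1
    cudaDeviceSynchronize();
    return 0;
}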


Grids and Blocks

A kernel is executed as a grid of thread blocks

All threads share global memory space

A thread block is a batch of threads that can cooperate with each other by:

Synchronizing their execution using a barrier
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate

CUDA Thread Block

Programmer declares a (Thread) Block:
Block size: 1 to 512 concurrent threads
Block shape: 1D, 2D, or 3D
Block dimensions in threads

All threads in a Block execute the same thread program
Threads share data and synchronize while doing their share of the work
Threads have thread id numbers within the Block

[Figure: a thread block with thread IDs 0, 1, 2, 3, …, m running a thread program]

Courtesy: John Nickolls, NVIDIA

GeForce-8 Series HW Overview

[GeForce-8 series diagram: a Streaming Processor Array of Texture Processor Clusters; each TPC pairs a TEX unit with two SMs; each SM contains instruction fetch/dispatch, instruction L1 and data L1 caches, shared memory, 8 SPs, and 2 SFUs]

CUDA Processor Terminology

SPA – Streaming Processor Array (variable across GeForce 8-series; 8 TPCs in GeForce 8800)

TPC – Texture Processor Cluster (2 SMs + TEX)

SM – Streaming Multiprocessor (8 SPs); multi-threaded processor core; the fundamental processing unit for a CUDA thread block

SP – Streaming Processor; scalar ALU for a single CUDA thread


Streaming Multiprocessor (SM)

8 Streaming Processors (SP)
2 Super Function Units (SFU)

Multi-threaded instruction dispatch

1 to 512 threads active
Shared instruction fetch per 32 threads
Covers latency of texture/memory loads

20+ GFLOPS
16 KB shared memory
Texture and global memory access


G80 Thread Computing Pipeline

Processors execute computing threads
Alternative operating mode specifically for computing


Generates thread grids based on kernel calls

Thread Life Cycle in HW

Grid is launched on the SPA
Thread Blocks are serially distributed to all the SMs

Potentially >1 Thread Block per SM

Each SM launches Warps of Threads

2 levels of parallelism
SM schedules and executes Warps that are ready to run
As Warps and Thread Blocks complete, resources are freed and new Thread Blocks can be distributed


SM Executes Blocks

Threads are assigned to SMs in Block granularity

Up to 8 Blocks to each SM as resources allow
An SM in G80 can take up to 768 threads

Could be 256 (threads/block) * 3 blocks Or 128 (threads/block) * 6 blocks, etc.

Threads run concurrently

[Figure: blocks of threads t0…tm assigned to SM 0 and SM 1; each SM has an MT issue unit, SPs, and shared memory, backed by a texture L1 cache, TF units, an L2 cache, and memory]

Thread Scheduling/Execution

Each Thread Block is divided into 32-thread Warps

This is an implementation decision, not part of the CUDA programming model

Warps are the scheduling units in an SM
If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM?

Each Block is divided into 256/32 = 8 WarpsThere are 8 * 3 = 24 Warps

[Figure: Block 1 warps and Block 2 warps, each warp consisting of threads t0 t1 t2 … t31]


SM Warp Scheduling

SM hardware implements zero-overhead Warp scheduling

Warps whose next instruction has its operands ready for consumption are eligible for execution
Eligible Warps are selected for execution on a prioritized scheduling policy
All threads in a Warp execute the same instruction when selected

4 clock cycles needed to dispatch the same instruction for all threads in a Warp in G80

[Figure: SM multithreaded warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]


SM Instruction Buffer – Warp Scheduling

Fetch one warp instruction/cycle from the instruction L1 cache into any instruction buffer slot

Issue one "ready-to-go" warp instruction/cycle

From any warp/instruction buffer slot
Operand scoreboarding used to prevent hazards

Issue selection based on round-robin/age of warp
SM broadcasts the same instruction to the 32 threads of a Warp

[Figure: SM datapath - instruction L1 cache (I$), multithreaded instruction buffer, register file (RF), constant cache (C$ L1), shared memory, operand select, MAD and SFU units]

Scoreboarding

All register operands of all instructions in the Instruction Buffer are scoreboarded

An instruction becomes ready after the needed values are deposited
Prevents hazards
Cleared instructions are eligible for issue

Decoupled Memory/Processor pipelines
Any thread can continue to issue instructions until scoreboarding prevents issue
Allows Memory/Processor ops to proceed in the shadow of other waiting Memory/Processor ops

Granularity Considerations

For Matrix Multiplication, should I use 4X4, 8X8, 16X16 or 32X32 tiles?

For 4X4, we have 16 threads per block, Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, thus there will be only 128 threads in each SM!

There are 8 warps but each warp is only half full.

For 8X8, we have 64 threads per Block. Since each SM can take up to 768 threads, it could take up to 12 Blocks. However, each SM can only take up to 8 Blocks, so only 512 threads will go into each SM!

There are 16 warps available for scheduling in each SM
Each warp spans four slices in the y dimension

For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.

There are 24 warps available for scheduling in each SM
Each warp spans two slices in the y dimension

For 32X32, we would have 1,024 threads per Block, which exceeds the 512-thread limit per block, so this tile size is not even an option

Memory Hardware in G80

CUDA Device Memory Space: Review

Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory


The host can R/W global, constant, and texture memories

Parallel Memory Sharing

Local Memory: per-thread

Private per thread
Auto variables, register spill

Shared Memory: per-Block

Shared by threads of the same block
Inter-thread communication

Global Memory: per-application

Shared by all threads
Inter-Grid communication

[Figure: a thread with its local memory, a block with its shared memory, and sequential grids in time (Grid 0, Grid 1, …) sharing global memory]


SM Memory Architecture

Threads in a block share data & results

In Memory and Shared Memory
Synchronize at barrier instruction

Per-Block Shared Memory Allocation

Keeps data close to processor


SM Register File

Register File (RF)
32 KB (8K entries) for each SM in G80

TEX pipe can also read/write the RF
2 SMs share 1 TEX

Load/Store pipe can also read/write RF


Programmer View of Register File

There are 8192 registers in each SM in G80

This is an implementation decision, not part of CUDA
Registers are dynamically partitioned across all blocks assigned to the SM
Once assigned to a block, a register is not accessible by threads in other blocks

[Figure: the register file split across 4 blocks vs. 3 blocks]

Matrix Multiplication Example

If each Block has 16X16 threads and each thread uses 10 registers, how many threads can run on each SM?

Each block requires 10*256 = 2560 registers
8192 = 3 * 2560 + change
So, three blocks can run on an SM as far as registers are concerned

How about if each thread increases the use of registers by 1?

Each Block now requires 11*256 = 2816 registers
8192 < 2816 * 3, so only two blocks can run on an SM, cutting the number of concurrent threads by a third

More on Dynamic Partitioning

Dynamic partitioning gives more flexibility to compilers/programmers

One can run a smaller number of threads that require many registers each or a large number of threads that require few registers each

This allows for finer grain threading than traditional CPU threading models.

The compiler can trade off between instruction-level parallelism and thread-level parallelism

Let’s program this thing!


GPU Computing History

2001/2002 – researchers see GPU as data-parallel coprocessor

The GPGPU field is born
2007 – NVIDIA releases CUDA

CUDA – Compute Unified Device Architecture
GPGPU shifts to GPU Computing

2008 – Khronos releases the OpenCL specification

CUDA Abstractions

A hierarchy of thread groups
Shared memories
Barrier synchronization

CUDA Terminology

Host – typically the CPU
Code written in ANSI C

Device – typically the GPU (data-parallel)
Code written in extended ANSI C

Host and device have separate memories

CUDA Program

Contains both host and device code

CUDA Terminology

Kernel – data-parallel function
Invoking a kernel creates lightweight threads on the device

Threads are generated and scheduled with hardware

Does a kernel remind you of a shader in OpenGL?

CUDA Kernels

Executed N times in parallel by N different CUDA threads

[Code figure annotated with three callouts: declaration specifier, execution configuration, and thread ID]
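The annotated code figure was not captured; a minimal sketch in its spirit, marking the three callouts (the declaration specifier, the execution configuration, and the thread ID). VecAdd, A, B, C, and N are illustrative names.

__global__ void VecAdd(const float* A, const float* B, float* C)  // __global__ is the declaration specifier
{
    int i = threadIdx.x;   // thread ID selects this thread's element
    C[i] = A[i] + B[i];
}

// VecAdd<<<1, N>>>(A, B, C);   // <<<1, N>>> is the execution configuration: 1 block of N threads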

CUDA Program Execution

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf


Thread Hierarchies

Grid – one or more thread blocks
1D or 2D

Block – array of threads
1D, 2D, or 3D
Each block in a grid has the same number of threads
Each thread in a block can:

Synchronize
Access shared memory

Thread Hierarchies

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Thread Hierarchies

Block – 1D, 2D, or 3D
Example: index into vector, matrix, volume

Thread Hierarchies

Thread ID: scalar thread identifier
Thread Index: threadIdx

1D: Thread ID == Thread Index
2D with size (Dx, Dy):

Thread ID of index (x, y) == x + y Dx

3D with size (Dx, Dy, Dz):
Thread ID of index (x, y, z) == x + y Dx + z Dx Dy
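In code, those formulas become (a sketch using the built-in threadIdx and blockDim variables):

int tid1D = threadIdx.x;
int tid2D = threadIdx.x + threadIdx.y * blockDim.x;
int tid3D = threadIdx.x + threadIdx.y * blockDim.x
          + threadIdx.z * blockDim.x * blockDim.y;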

Thread Hierarchies

[Figure: a single thread block, a 2D block, and its 2D index]

Thread Hierarchies

Thread Block
Group of threads

G80 and GT200: up to 512 threads
Fermi: up to 1024 threads

Reside on the same processor core
Share the memory of that core



Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Thread Hierarchies

Block Index: blockIdx
Dimension: blockDim

1D or 2D

Thread Hierarchies

2D Thread Block

16x16 threads per block

Thread Hierarchies

Example: N = 32
16x16 threads per block (independent of N)

threadIdx ([0, 15], [0, 15])

2x2 thread blocks in the grid
blockIdx ([0, 1], [0, 1])
blockDim = 16

i = [0, 1] * 16 + [0, 15]
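A sketch of how those pieces combine inside a kernel for the N = 32 example (i, j, and N are illustrative names):

int i = blockIdx.x * blockDim.x + threadIdx.x;   // column index in [0, 31]
int j = blockIdx.y * blockDim.y + threadIdx.y;   // row index in [0, 31]
// element (j, i) of an N x N array stored row-major:
// float value = data[j * N + i];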

Thread Hierarchies

Thread blocks execute independently
In any order: in parallel or in series
Scheduled in any order by any number of cores

Allows code to scale with core count

Thread Hierarchies

Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf


Thread Hierarchies

Threads in a block
Share (limited) low-latency memory
Synchronize execution

To coordinate memory accesses
__syncthreads()

Barrier – threads in a block wait until all threads reach it
Lightweight

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
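A minimal sketch of the barrier in use, reversing one 256-element tile through shared memory; the kernel and array names are illustrative, and it assumes a launch with 256 threads per block.

__global__ void reverseTile(float* data)
{
    __shared__ float tile[256];              // low-latency memory shared by the block
    int i = threadIdx.x;
    tile[i] = data[i];
    __syncthreads();                         // barrier: wait until every thread has written its element
    data[i] = tile[blockDim.x - 1 - i];      // now safe to read an element written by another thread
}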

CUDA Memory Transfers

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

CUDA Memory Transfers

Host can transfer to/from device
Global memory
Constant memory

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

CUDA Memory Transfers

cudaMalloc()
Allocate global memory on the device

cudaFree()
Frees memory

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Pointer to device memory


CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Size in bytes
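The code figures these callouts refer to were not captured; a sketch of the calls they annotate (the pointer name d_data and the element count are illustrative):

float* d_data = 0;
int n = 1024;
cudaMalloc((void**)&d_data, n * sizeof(float));   // d_data: pointer to device memory; second argument: size in bytes
// ... use d_data in kernels ...
cudaFree(d_data);                                 // free the device global memory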

CUDA Memory Transfers

cudaMemcpy()
Memory transfer

Host to host
Host to device
Device to host
Device to device

[Figure: host memory and device global memory connected by the transfer]

Does this remind you of VBOs in OpenGL?
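A sketch of the host-to-device case (h_data and d_data are illustrative; the other directions use the matching cudaMemcpy* constants):

float h_data[1024];
// ... fill h_data on the host ...
cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);   // destination, source, bytes, direction
// later, read results back:
// cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);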


Note: cudaMemcpy() blocks the host until the transfer completes; asynchronous transfers use cudaMemcpyAsync() with streams


CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

[Figure: a host-to-device transfer into device global memory]

CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

[Figure: the cudaMemcpy arguments annotated - destination on the device, source on the host]

CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf


Matrix Multiply

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

P = M * N
Assume M and N are square for simplicity
Is this data-parallel?

Matrix Multiply

1,000 x 1,000 matrix
1,000,000 dot products

Each requires 1,000 multiplies and 1,000 adds

Matrix Multiply: CPU Implementation

Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt

void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j)
        {
            float sum = 0;
            for (int k = 0; k < width; ++k)
            {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}


Matrix Multiply: CUDA Skeleton

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Matrix Multiply: CUDA Skeleton

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Matrix Multiply: CUDA Skeleton

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Matrix Multiply

Step 1
Add CUDA memory transfers to the skeleton

Matrix Multiply: Data Transfer

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Allocate input

Matrix Multiply: Data Transfer

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Allocate output


Matrix Multiply: Data Transfer

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Matrix Multiply: Data Transfer

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Read back from device

Matrix Multiply: Data Transfer

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Does this remind you of GPGPU with GLSL?

Matrix Multiply

Step 2
Implement the kernel in CUDA C

Matrix Multiply: CUDA Kernel

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Accessing a matrix, so using a 2D block

Matrix Multiply: CUDA Kernel

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Each thread computes one output element
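The kernel itself appears only as a code figure above; a sketch consistent with the approach the slides describe (one thread per output element, a 2D block, and the Md/Nd/Pd naming used later):

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    int row = threadIdx.y;               // 2D block: one thread per output element
    int col = threadIdx.x;

    float sum = 0;
    for (int k = 0; k < width; ++k)      // dot product of one row of Md and one column of Nd
        sum += Md[row * width + k] * Nd[k * width + col];

    Pd[row * width + col] = sum;         // each thread writes only its own element, so no locks are needed
}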


Matrix Multiply: CUDA Kernel

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Where did the two outer for loops in the CPU implementation go?

Matrix Multiply: CUDA Kernel

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

No locks or synchronization, why?

Matrix Multiply

Step 3
Invoke the kernel in CUDA C

Matrix Multiply: Invoke Kernel

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

One block with width by width threads
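A sketch of that invocation, assuming the MatrixMulKernel and device pointers from the sketches above:

dim3 dimGrid(1, 1);                                        // a single block...
dim3 dimBlock(width, width);                               // ...with width x width threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width); // Md, Nd, Pd are device pointers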

Matrix Multiply


One Block of threads computes matrix Pd

Each thread computes one element of Pd

Each thread
Loads a row of matrix Md
Loads a column of matrix Nd
Performs one multiply and addition for each pair of Md and Nd elements
Compute to off-chip memory access ratio close to 1:1 (not very high)

Size of matrix limited by the number of threads allowed in a thread block

[Figure: within Grid 1, Block 1, thread (2, 2) computes one element of Pd (value 48) as the dot product of a row of Md (3 2 5 4) and a column of Nd (2 4 2 6); WIDTH is the matrix dimension]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt

Matrix Multiply

What is the major performance problem with our implementation?
What is the major limitation?