5/1/2007
1
Real-Time Graphics Architecture
Lecture 9: Programming GPUs
p
Kurt Akeley
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Kurt Akeley
Pat Hanrahanhttp://graphics.stanford.edu/cs448-07-spring/
Programming GPUs
Outline
History
Contemporary GPUsProgramming model
Implementation
GPGPU: non-graphical computation using GPUs
Caveat
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
This is not a programming tutorial
You will gain some understanding of the architectural issues that underlie programmable GPUs
5/1/2007
2
GPU programmability is not new
Essentially all GPUs are (and have been) programmable
The questions are:
Who gets to program them?
And when?
Example systems from SGI’s early history:
Geometry Engine (1981)
GE4 (1987)
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
GE4 (1987)
VGXT Image Engine (1991)
Dynamic code generation
Geometry engine (1981)
Four 32-bit floating-point units
Jim-Clark format
2’s-complement e and m2 s complement e and m
Programmed using John Hennessy’s SLIM(Stanford Language for the Implementation of Microcode)
~500-term PLA, hard-wired at fabrication time
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Programmed by Jim Clark and Marc Hannah
5/1/2007
3
SGI GE4 (1987)
Weitek 3332 floating-point ALU
~4k ~64-bit lines of microcode in SRAM
Programmed in low-level assembly code
No run-time code changes were made (though they could have been)
Programmed by several SGI employees, including Kurt
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
p y , g
Wire-wrapped GE4 prototype
VGXT image engine (1991)
Tiny code space to implement
Texture filter and application
Frame buffer operations
Code overlays loaded on-the-fly
Code sequences generated prior to run-time
Fields (e.g., blend function) filled in at run time
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
5/1/2007
4
Dynamic code generation
SUN graphics card
Bresenham lines
Abrash talk …
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Application programmability is not new
GPUs have a long history of application programmability
Example systems:
Ikonas graphics system (1978)
Trancept TAAC-1 (1987)
Multi-pass vector processing (2000)
Register combiners (2001)
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
5/1/2007
5
Ikonas graphics system (1978)
Multi-board system
32-bit integer processor (16x16 multiply)
16 bit graphics processor (16 pixel parallel render)16-bit graphics processor (16 pixel parallel render)
Hardware Z-buffer
…
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Image credit: www.virhistory.com/ikonas/ikonas.html
Trancept TAAC-1 (1987)
C-language microcode compiler
200-bit microcode word
Sold to Sun Microsystems in 1987
“Customers got a C language microcode compiler and were strongly encouraged (!) to develop their own code” –Nick England, www virhistory com/trancept/
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
www.virhistory.com/trancept/trancept.html
Image credit: www.virhistory.com/trancept/trancept.html
5/1/2007
6
Multi-pass vector processing (2000)
#include “marble.h”surface marble() {varying color a;uniform string fx;uniform float x; x = ½;fx = “noisebw.tx”;
Treat OpenGL as a very long instruction word
Compute vector stylefx noisebw.tx ;FB = texture(tx, scale(x,x,x));repeat(3) {
x = x * 0.5;FB *= 0.5;FB += texture(tx, scale(x,x,x));
}FB = lookup(FB,tab);a = FB;FB = diffuse;FB *= a;FB += environment(“env”);
}
Apply inst. to all pixels
Build up final image in many passes
Peercy, Olano, Airey, and Ungar, Interactive Multi-Pass Programmable Shading, SIGGRAPH 2000
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
g,
(Figure adapted from the SIGGRAPH paper)
Register combiners (2001)
ARB_texture_env_combine
“New texture environment function COMBINE ARB allows texture combiner COMBINE_ARB allows texture combiner operations, including REPLACE, MODULATE, ADD, … , INTERPOLATE.”
ARB_texture_env_combine
“Adds the capability to use the texture color from other texture units as sources to the COMBINE ARB environment function ”
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
COMBINE_ARB environment function.”
ARB_texture_env_dot3
Adds a dot-product operation to the texture combiner operations.
5/1/2007
7
Summary of history
General-purpose processor with attached unitsCould not compete in performance/cost ratio“Heathkit” marketing
Multi-pass vector processingLow performanceWasteful use of bandwidth (against the trends)
Register combinersRube Goldberg design qualityHopelessly increasing complexity
P bl i li t
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Programmable pipeline stagesNot accessible to application programmersReason (at SGI) was
Unable to keep the programming model consistent from product to product, andUnable to use software to hide the inconsistencies
Contemporary GPU Programming Model
Application-programmable pipeline stages
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
5/1/2007
8
Programming model (Direct3D 10 pipeline)
Input Assembler
S l T
VertexesTraditional graphics pipeline
(except location of geometry shader)
Three application-
Memory
Rasterizer
Vertex ShaderSampler
ConstantTexture
Geometry ShaderSampler
ConstantTexture
Three applicationprogrammable stages operate on independent vertexes, primitives, and pixel fragments
All stages share same programming model, and (most) resources
Single shared memory, but
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Output Merger
Pixel ShaderSampler
ConstantTexture
Target
Single shared memory, but memory objects remain distinct
Application programmable
Direct3D 10 programmable core
Input registers
(16) 4x32b
Textures
(128)
Complete instruction set
Floating-point operations
Integer operations (new)
From previous stage
Temp registers
(32/4096) 4x32b
FP unit Special
Const buffers
(16) 4096x32b
Samplers
(16)
Int ALU
Integer operations (new)
Branching operations
Unlimited code length
Vector instructions
Lots of temporary state
Independent operation
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
4x32b Function
Output registers
(16) 4x32b
4x32bEach vertex, or
Each primitive, or
Each pixel fragment
To next stage
5/1/2007
9
Vector instructions
Vector vs. scalar
4-component vector arithmetic is typical of gfx
But as shaders get longer, an increasing fraction of their code is scalar
Some GPUs support dual-issue split vectors:
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Programming interface
Assembly code
Low level and error prone
Not available in Direct3D 10
PARAM mat[4] = { state.matrix.modelview };DP4 result.x, vertex, mat[0];DP4 result.y, vertex, mat[1];DP4 result.z, vertex, mat[2];
C-like shading language
Cg: layered above assembly language
HLSL, OpenGL Shading Language: intrinsic
Fixed-function stages
L t h li ti
DP4 result.w, vertex, mat[3];
Vertex transformation(ARB_vertex_program)
void transform(float4 position : POSITION,
out float4 oPosition : POSITION,
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Lost when a application-defined shader is used
Not included in Direct3D 10
Were to be omitted from OpenGL 2.0
uniform float4x4 modelView)
{
oPosition = mul(modelView, position);
}
Vertex transformation(Cg)
5/1/2007
10
Parallel coding complexity is minimized
Complexities are handled by the implementation:
DependenciesdOrdering
Sorting
ScalabilityComputation
Bandwidth
Load balancing (much more implementation work)
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Load balancing (much more implementation work)
All threads are independent. Message passing is
Not required
Not allowed
Parallel coding complexity is minimized
Correctness constraints include:
Cannot modify textures directly from shaders
Cannot bind a target as both source and sinkTexture and render target
Vertex buffer and stream output
Cannot “union” different data formats
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
5/1/2007
11
Performance considerations
Programmer can largely ignore:
Memory latency
Floating-point cost (same as integer)
Special function cost
Texture filtering cost
Programmer should pay attention to:
Divergent branching
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Divergent branching
Number of registers used
Size of storage formats (bandwidth)
Re-specification of Z values (defeats Z cull)
Contemporary GPU Implementation
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
5/1/2007
12
GeForce 7800Parallelism
Cull / Clip / Setup
Shader Instruction DispatchZ-Cull
Host / FW / VTF
Figure courtesy of NVIDIA
L2 Tex
Fragment Crossbar
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
MemoryPartition
MemoryPartition
MemoryPartition
MemoryPartition
DRAM(s) DRAM(s) DRAM(s) DRAM(s)
SIMD control
SIMD = Single Instruction, Multiple Data
One or more SIMD blocks; each block includes many data paths controlled by a single sequencer
Data Path
Sequencer
by a single sequencer
SIMD benefits:
Reduced circuitry
Synchronization of eventsReduced instruction-fetch bandwidth
Better texture cache utilization
Data Path
Data Path
Data Path
Generic SIMD
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
SIMD liabilities:
Branch divergence reduces performance
Branch convergence is complicated to implement
Geometry Engine
5/1/2007
13
Multiple threads hide memory latency
Scheduler maintains a queue of ready-to-run threads
When a thread blocks, another is started
Blocking determination may beg y
Simple (any attempted memory access), or
Advanced (block only on data dependencies)
Thread context storage is significant – product of
Number of processors,
Threads per processor, and
State per thread (registers pc )
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
State per thread (registers, pc, …)
Fixed-size memory pool may be insufficient for large thread contexts
Large state/thread too few threads/processor
Local registers may not be “free”
Latency hiding consumes memory
Latency hiding mechanisms consume memory
Cache (CPU, ~texture)Rasterizer
Fragment fifo (texture pre-fetching)
Shader context (threaded execution)
GPUs devote less die area to cache than CPUs
B t th l ti i
Request FIFO
FIFO
Rasterizer
Reorder Buffer
Texture
Texturememory
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
But the real question is: what area is devoted to latency-hiding memory?
Hint: ask the invited experts
Texture Apply
Filter
From lecture 7
5/1/2007
14
GeForce 8800 thread scheduling
Threads are scheduled in SIMD blocks
4 threads per quad (2x2 fragment unit)
16 threads per “warp” (NVIDIA GeForce 8800)16 threads per “warp” (NVIDIA GeForce 8800)
Thread scheduling is critical and complicated
David Kirk assessment of 8800 load balancing: “3x more difficult overall”
Sort-last-fragment, with load balancing forVertex processors
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
p
Geometry (primitive) processors
Pixel fragment processors
Reference: Mike Shebanow, ECE 498 AL: Programming Massively Parallel Computers, Lecture 12
Thread scheduling complexity
Two parts to scheduling decisions:
What class of thread should be run?Vertex, primitive, fragment
When should a thread (group) be blocked?When a memory access is attempted?
When a data dependency is reached?
Data-dependent blocking is worthwhile, because it conserves thread-context memory space
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
conserves thread-context memory space
Has been an ATI advantage over NVIDIA
Did this contribute to 8800 scheduling difficulty?
5/1/2007
15
Abstraction distance
Actual instruction set may not match “advertised” one
Was true when assembly language was supportedl blComplex post-assembly run-time optimization
Layered languages (like Cg) 2-layer optimization
Even easier with Direct3D 10
Example: use scalar processor to emulate vector operations
L b t ti di t ’t ti l i th ’90
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Large abstraction distance wasn’t practical in the ’90s
CPUs are much faster now
Software technology has come a long way
Why isn’t Output Merger programmable?
OM is highly optimized
Memory access is the GPU limiter in many cases
Input Assembler
S l T
Vertexes
many cases
OM has data hazards
All programmable interfaces are memory read-only
Memory
Rasterizer
Vertex ShaderSampler
ConstantTexture
Geometry ShaderSampler
ConstantTexture
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Output Merger
Pixel ShaderSampler
ConstantTexture
TargetApplication programmable
5/1/2007
16
GPGPU
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Performance motivates interest in GPGPU
7x performance advantage to GPUs today:
47 GFLOPS Intel core2 extreme e680047 GFLOPS – Intel core2 extreme e6800
345 GFLOPS – NVIDIA GeForce 8800 GTX
350 GFLOPS – ATI X1900
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
5/1/2007
17
Rate-of-change of performance is the key
CPU and GPU rates differ:
GPU: ~2x CAGR
CPU: <1 5x CAGR recently much lessCPU: <1.5x CAGR, recently much less
Will this 10x/decade disparity persist in the multi-core era?
OP
S
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
GFL
O
G80 = GeForce 8800 GTX
G71 = GeForce 7900 GTX
G70 = GeForce 7800 GTX
NV40 = GeForce 6800 Ultra
NV35 = GeForce FX 5950 Ultra
NV30 = GeForce FX 5800
NVIDIA Corporation 2006
Additional GPGPU motivations
Visible success stories
Folding@home gets 20-40x speedup over PCs
Matrix multiply achieves 120 GFLOPS on ATI X1900Matrix multiply achieves 120 GFLOPS on ATI X1900
Ubiquity of GPU hardware
Every PC has a GPU of some kind
This is a first for an attached hardware accelerator
Evolving GPU architecture
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Rich instruction set
Highly parallel
Improving GPU CPU data transfer rates
…
5/1/2007
18
GPUs are high-performance multiprocessors
GeForce 8800 block diagram (NVIDIA Corporation 2006)
S t / R t / ZC llD t A bl
Host
SP SP
TF
Thre
ad P
roce
ssor
Vtx Thread Issue
Setup / Rstr / ZCull
Geom Thread Issue Pixel Thread Issue
Data Assembler
SP SP
TF
SP SP
TF
SP SP
TF
SP SP
TF
SP SP
TF
SP SP
TF
SP SP
TF
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
L2
FB
L1 TL1 L1 L1 L1 L1 L1 L1
L2
FB
L2
FB
L2
FB
L2
FB
L2
FB
Scatter, Gather, and Reduction
Frame buffer mindset: fragment associated with a specific region of memory
Early programmable GPUs (2001) Today’s GPUs (2007)
Gather Texture fetch Texture fetch
S tt U f t t
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Scatter -- Uses fragment sort
Reduction -- Occlusion query
5/1/2007
19
GPGPU concerns
Application specificity. Requires:
High arithmetic intensity (compute / bandwidth)
High branching coherence
Ability to decompose to 1000s of threads
Computation quality
No memory error detection/correction
No double precision
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
No double precision
Incomplete (but improving) IEEE single precision
Possible end of divergence of CPU and GPU performance
Abysmal history of attached processor success
How will CPUs and GPUs coexist?
Possible outcomes:
CPU instruction set additions
Heterogeneous multi-coreShared memory
Differentiated memory
Generalized GPUs
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
What do/will we mean by GPU?
Anything graphics related?
Massively multi-threaded to hide memory latency?
5/1/2007
20
Summary
Long history of GPU programmability, but current situation (app-programmable pipeline stages) is new
Current GPUs are easily programmed (for graphics) and Current GPUs are easily programmed (for graphics) and give programmers amazing power
Current GPUs are high-performance multiprocessors
Exciting things are happening in computer architecture, and GPUs will play an important role
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Suggested readings
Wen-Mei Hwu and David Kirk, ECE 498 AL: Programming Massively Parallel Computers,
http://courses.ece.uiuc.edu/ece498/al/index.html
Peercy Olano, Airey, and Ungar, Interactive Multi-Pass Programmable Shading, SIGGRAPH 2000.
Fernando, GPU Gems, Addison Wesley, 2004.
Pharr and Fernando, GPU Gems 2, Addison Wesley, 2005.
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
5/1/2007
21
Real-Time Graphics Architecture
Lecture 9: Programming GPUs
p
Kurt Akeley
CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007
Kurt Akeley
Pat Hanrahanhttp://graphics.stanford.edu/cs448-07-spring/