Download - Real-Time Graphics Architecturegraphics.stanford.edu/courses/cs448-07-spring/lectures/gpu/gpu.pdf · Real-Time Graphics Architecture Lecture 9: Programming GPUs Kurt Akeley CS448

5/1/2007

1

Real-Time Graphics Architecture

Lecture 9: Programming GPUs

p

Kurt Akeley

CS448 Lecture 9 Kurt Akeley, Pat Hanrahan, Spring 2007

Kurt Akeley

Pat Hanrahanhttp://graphics.stanford.edu/cs448-07-spring/

Programming GPUs

Outline

History

Contemporary GPUsProgramming model

Implementation

GPGPU: non-graphical computation using GPUs

Caveat


This is not a programming tutorial

You will gain some understanding of the architectural issues that underlie programmable GPUs

5/1/2007

2

GPU programmability is not new

Essentially all GPUs are (and have been) programmable

The questions are:

Who gets to program them?

And when?

Example systems from SGI’s early history:

Geometry Engine (1981)

GE4 (1987)


GE4 (1987)

VGXT Image Engine (1991)

Dynamic code generation

Geometry engine (1981)

Four 32-bit floating-point units

Jim-Clark format

2’s-complement e and m2 s complement e and m

Programmed using John Hennessy’s SLIM(Stanford Language for the Implementation of Microcode)

~500-term PLA, hard-wired at fabrication time


Programmed by Jim Clark and Marc Hannah

5/1/2007

3

SGI GE4 (1987)

Weitek 3332 floating-point ALU

~4k ~64-bit lines of microcode in SRAM

Programmed in low-level assembly code

No run-time code changes were made (though they could have been)

Programmed by several SGI employees, including Kurt


p y , g

Wire-wrapped GE4 prototype

VGXT image engine (1991)

Tiny code space to implement

Texture filter and application

Frame buffer operations

Code overlays loaded on-the-fly

Code sequences generated prior to run-time

Fields (e.g., blend function) filled in at run time


5/1/2007

4

Dynamic code generation

SUN graphics card

Bresenham lines

Abrash talk …


Application programmability is not new

GPUs have a long history of application programmability

Example systems:

Ikonas graphics system (1978)

Trancept TAAC-1 (1987)

Multi-pass vector processing (2000)

Register combiners (2001)


5/1/2007

5

Ikonas graphics system (1978)

Multi-board system

32-bit integer processor (16x16 multiply)

16 bit graphics processor (16 pixel parallel render)16-bit graphics processor (16 pixel parallel render)

Hardware Z-buffer

…


Image credit: www.virhistory.com/ikonas/ikonas.html

Trancept TAAC-1 (1987)

C-language microcode compiler

200-bit microcode word

Sold to Sun Microsystems in 1987

“Customers got a C language microcode compiler and were strongly encouraged (!) to develop their own code” –Nick England, www virhistory com/trancept/


www.virhistory.com/trancept/trancept.html

Image credit: www.virhistory.com/trancept/trancept.html

5/1/2007

6

Multi-pass vector processing (2000)

#include “marble.h”surface marble() {varying color a;uniform string fx;uniform float x; x = ½;fx = “noisebw.tx”;

Treat OpenGL as a very long instruction word

Compute vector stylefx noisebw.tx ;FB = texture(tx, scale(x,x,x));repeat(3) {

x = x * 0.5;FB *= 0.5;FB += texture(tx, scale(x,x,x));

}FB = lookup(FB,tab);a = FB;FB = diffuse;FB *= a;FB += environment(“env”);

}

Apply inst. to all pixels

Build up final image in many passes

Peercy, Olano, Airey, and Ungar, Interactive Multi-Pass Programmable Shading, SIGGRAPH 2000


g,

(Figure adapted from the SIGGRAPH paper)

Register combiners (2001)

ARB_texture_env_combine

“New texture environment function COMBINE ARB allows texture combiner COMBINE_ARB allows texture combiner operations, including REPLACE, MODULATE, ADD, … , INTERPOLATE.”

ARB_texture_env_combine

“Adds the capability to use the texture color from other texture units as sources to the COMBINE ARB environment function ”


COMBINE_ARB environment function.”

ARB_texture_env_dot3

Adds a dot-product operation to the texture combiner operations.

5/1/2007

7

Summary of history

General-purpose processor with attached unitsCould not compete in performance/cost ratio“Heathkit” marketing

Multi-pass vector processingLow performanceWasteful use of bandwidth (against the trends)

Register combinersRube Goldberg design qualityHopelessly increasing complexity

P bl i li t


Programmable pipeline stagesNot accessible to application programmersReason (at SGI) was

Unable to keep the programming model consistent from product to product, andUnable to use software to hide the inconsistencies

Contemporary GPU Programming Model

Application-programmable pipeline stages


5/1/2007

8

Programming model (Direct3D 10 pipeline)

Input Assembler

S l T

VertexesTraditional graphics pipeline

(except location of geometry shader)

Three application-

Memory

Rasterizer

Vertex ShaderSampler

ConstantTexture

Geometry ShaderSampler

ConstantTexture

Three applicationprogrammable stages operate on independent vertexes, primitives, and pixel fragments

All stages share same programming model, and (most) resources

Single shared memory, but


Output Merger

Pixel ShaderSampler

ConstantTexture

Target

Single shared memory, but memory objects remain distinct

Application programmable

Direct3D 10 programmable core

Input registers

(16) 4x32b

Textures

(128)

Complete instruction set

Floating-point operations

Integer operations (new)

From previous stage

Temp registers

(32/4096) 4x32b

FP unit Special

Const buffers

(16) 4096x32b

Samplers

(16)

Int ALU

Integer operations (new)

Branching operations

Unlimited code length

Vector instructions

Lots of temporary state

Independent operation


4x32b Function

Output registers

(16) 4x32b

4x32bEach vertex, or

Each primitive, or

Each pixel fragment

To next stage

5/1/2007

9

Vector instructions

Vector vs. scalar

4-component vector arithmetic is typical of gfx

But as shaders get longer, an increasing fraction of their code is scalar

Some GPUs support dual-issue split vectors:


Programming interface

Assembly code

Low level and error prone

Not available in Direct3D 10

PARAM mat[4] = { state.matrix.modelview };DP4 result.x, vertex, mat[0];DP4 result.y, vertex, mat[1];DP4 result.z, vertex, mat[2];

C-like shading language

Cg: layered above assembly language

HLSL, OpenGL Shading Language: intrinsic

Fixed-function stages

L t h li ti

DP4 result.w, vertex, mat[3];

Vertex transformation(ARB_vertex_program)

void transform(float4 position : POSITION,

out float4 oPosition : POSITION,


Lost when a application-defined shader is used

Not included in Direct3D 10

Were to be omitted from OpenGL 2.0

uniform float4x4 modelView)

{

oPosition = mul(modelView, position);

}

Vertex transformation(Cg)

5/1/2007

10

Parallel coding complexity is minimized

Complexities are handled by the implementation:

DependenciesdOrdering

Sorting

ScalabilityComputation

Bandwidth

Load balancing (much more implementation work)


Load balancing (much more implementation work)

All threads are independent. Message passing is

Not required

Not allowed

Parallel coding complexity is minimized

Correctness constraints include:

Cannot modify textures directly from shaders

Cannot bind a target as both source and sinkTexture and render target

Vertex buffer and stream output

Cannot “union” different data formats


5/1/2007

11

Performance considerations

Programmer can largely ignore:

Memory latency

Floating-point cost (same as integer)

Special function cost

Texture filtering cost

Programmer should pay attention to:

Divergent branching


Divergent branching

Number of registers used

Size of storage formats (bandwidth)

Re-specification of Z values (defeats Z cull)

Contemporary GPU Implementation


5/1/2007

12

GeForce 7800Parallelism

Cull / Clip / Setup

Shader Instruction DispatchZ-Cull

Host / FW / VTF

Figure courtesy of NVIDIA

L2 Tex

Fragment Crossbar


MemoryPartition

MemoryPartition

MemoryPartition

MemoryPartition

DRAM(s) DRAM(s) DRAM(s) DRAM(s)

SIMD control

SIMD = Single Instruction, Multiple Data

One or more SIMD blocks; each block includes many data paths controlled by a single sequencer

Data Path

Sequencer

by a single sequencer

SIMD benefits:

Reduced circuitry

Synchronization of eventsReduced instruction-fetch bandwidth

Better texture cache utilization

Data Path

Data Path

Data Path

Generic SIMD


SIMD liabilities:

Branch divergence reduces performance

Branch convergence is complicated to implement

Geometry Engine

5/1/2007

13

Multiple threads hide memory latency

Scheduler maintains a queue of ready-to-run threads

When a thread blocks, another is started

Blocking determination may beg y

Simple (any attempted memory access), or

Advanced (block only on data dependencies)

Thread context storage is significant – product of

Number of processors,

Threads per processor, and

State per thread (registers pc )


State per thread (registers, pc, …)

Fixed-size memory pool may be insufficient for large thread contexts

Large state/thread too few threads/processor

Local registers may not be “free”

Latency hiding consumes memory

Latency hiding mechanisms consume memory

Cache (CPU, ~texture)Rasterizer

Fragment fifo (texture pre-fetching)

Shader context (threaded execution)

GPUs devote less die area to cache than CPUs

B t th l ti i

Request FIFO

FIFO

Rasterizer

Reorder Buffer

Texture

Texturememory


But the real question is: what area is devoted to latency-hiding memory?

Hint: ask the invited experts

Texture Apply

Filter

From lecture 7

5/1/2007

14

GeForce 8800 thread scheduling

Threads are scheduled in SIMD blocks

4 threads per quad (2x2 fragment unit)

16 threads per “warp” (NVIDIA GeForce 8800)16 threads per “warp” (NVIDIA GeForce 8800)

Thread scheduling is critical and complicated

David Kirk assessment of 8800 load balancing: “3x more difficult overall”

Sort-last-fragment, with load balancing forVertex processors


p

Geometry (primitive) processors

Pixel fragment processors

Reference: Mike Shebanow, ECE 498 AL: Programming Massively Parallel Computers, Lecture 12

Thread scheduling complexity

Two parts to scheduling decisions:

What class of thread should be run?Vertex, primitive, fragment

When should a thread (group) be blocked?When a memory access is attempted?

When a data dependency is reached?

Data-dependent blocking is worthwhile, because it conserves thread-context memory space


conserves thread-context memory space

Has been an ATI advantage over NVIDIA

Did this contribute to 8800 scheduling difficulty?

5/1/2007

15

Abstraction distance

Actual instruction set may not match “advertised” one

Was true when assembly language was supportedl blComplex post-assembly run-time optimization

Layered languages (like Cg) 2-layer optimization

Even easier with Direct3D 10

Example: use scalar processor to emulate vector operations

L b t ti di t ’t ti l i th ’90


Large abstraction distance wasn’t practical in the ’90s

CPUs are much faster now

Software technology has come a long way

Why isn’t Output Merger programmable?

OM is highly optimized

Memory access is the GPU limiter in many cases

Input Assembler

S l T

Vertexes

many cases

OM has data hazards

All programmable interfaces are memory read-only

Memory

Rasterizer

Vertex ShaderSampler

ConstantTexture

Geometry ShaderSampler

ConstantTexture


Output Merger

Pixel ShaderSampler

ConstantTexture

TargetApplication programmable

5/1/2007

16

GPGPU


Performance motivates interest in GPGPU

7x performance advantage to GPUs today:

47 GFLOPS Intel core2 extreme e680047 GFLOPS – Intel core2 extreme e6800

345 GFLOPS – NVIDIA GeForce 8800 GTX

350 GFLOPS – ATI X1900


5/1/2007

17

Rate-of-change of performance is the key

CPU and GPU rates differ:

GPU: ~2x CAGR

CPU: <1 5x CAGR recently much lessCPU: <1.5x CAGR, recently much less

Will this 10x/decade disparity persist in the multi-core era?

OP

S


GFL

O

G80 = GeForce 8800 GTX



NV40 = GeForce 6800 Ultra

NV35 = GeForce FX 5950 Ultra

NV30 = GeForce FX 5800

NVIDIA Corporation 2006

Additional GPGPU motivations

Visible success stories

Folding@home gets 20-40x speedup over PCs

Matrix multiply achieves 120 GFLOPS on ATI X1900Matrix multiply achieves 120 GFLOPS on ATI X1900

Ubiquity of GPU hardware

Every PC has a GPU of some kind

This is a first for an attached hardware accelerator

Evolving GPU architecture


Rich instruction set

Highly parallel

Improving GPU CPU data transfer rates

…

5/1/2007

18

GPUs are high-performance multiprocessors

GeForce 8800 block diagram (NVIDIA Corporation 2006)

S t / R t / ZC llD t A bl

Host

SP SP

TF

Thre

ad P

roce

ssor

Vtx Thread Issue

Setup / Rstr / ZCull

Geom Thread Issue Pixel Thread Issue

Data Assembler

SP SP

TF

SP SP

TF

SP SP

TF

SP SP

TF

SP SP

TF

SP SP

TF

SP SP

TF


L2

FB

L1 TL1 L1 L1 L1 L1 L1 L1

L2

FB

L2

FB

L2

FB

L2

FB

L2

FB

Scatter, Gather, and Reduction

Frame buffer mindset: fragment associated with a specific region of memory

Early programmable GPUs (2001) Today’s GPUs (2007)

Gather Texture fetch Texture fetch

S tt U f t t


Scatter -- Uses fragment sort

Reduction -- Occlusion query

5/1/2007

19

GPGPU concerns

Application specificity. Requires:

High arithmetic intensity (compute / bandwidth)

High branching coherence

Ability to decompose to 1000s of threads

Computation quality

No memory error detection/correction

No double precision


No double precision

Incomplete (but improving) IEEE single precision

Possible end of divergence of CPU and GPU performance

Abysmal history of attached processor success

How will CPUs and GPUs coexist?

Possible outcomes:

CPU instruction set additions

Heterogeneous multi-coreShared memory

Differentiated memory

Generalized GPUs


What do/will we mean by GPU?

Anything graphics related?

Massively multi-threaded to hide memory latency?

5/1/2007

20

Summary

Long history of GPU programmability, but current situation (app-programmable pipeline stages) is new

Current GPUs are easily programmed (for graphics) and Current GPUs are easily programmed (for graphics) and give programmers amazing power

Current GPUs are high-performance multiprocessors

Exciting things are happening in computer architecture, and GPUs will play an important role


Suggested readings

Wen-Mei Hwu and David Kirk, ECE 498 AL: Programming Massively Parallel Computers,

http://courses.ece.uiuc.edu/ece498/al/index.html

Peercy Olano, Airey, and Ungar, Interactive Multi-Pass Programmable Shading, SIGGRAPH 2000.

Fernando, GPU Gems, Addison Wesley, 2004.

Pharr and Fernando, GPU Gems 2, Addison Wesley, 2005.


5/1/2007

21

Real-Time Graphics Architecture

Lecture 9: Programming GPUs

p

Kurt Akeley


Kurt Akeley

Pat Hanrahanhttp://graphics.stanford.edu/cs448-07-spring/

Download - Real-Time Graphics Architecturegraphics.stanford.edu/courses/cs448-07-spring/lectures/gpu/gpu.pdf · Real-Time Graphics Architecture Lecture 9: Programming GPUs Kurt Akeley CS448

Top Related