
Upload: martina-lamb

Post on 24-Dec-2015


Page 1: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

NVIDIA GeForce

Ryan Hendrixson

Ryan Schubert

Allison Walthall

Page 2: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

What Does a GPU Actually Do?

Historically, from:
– Acting simply as a frame buffer
– Doing vertex transformations and pixel color calculations
– Now even programmable

In the simplest sense, a modern GPU implements a 3D rendering pipeline

Page 3: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

3D Rendering Pipeline (direct illumination)

3D Geometric Primitives → Modeling Transformation → Lighting → Viewing Transformation → Projection Transformation → Clipping → Scan Conversion → Image

This is a pipelined sequence of operations to draw a 3D primitive into a 2D image

Page 4: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

3D Rendering Pipeline (direct illumination)

3D Geometric Primitives
Modeling Transformation: Transform into 3D world coordinate system
Lighting
Viewing Transformation
Projection Transformation
Clipping
Scan Conversion
Image

Page 5: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

3D Rendering Pipeline (direct illumination)

3D Geometric Primitives
Modeling Transformation: Transform into 3D world coordinate system
Lighting: Illuminate according to lighting and reflectance
Viewing Transformation
Projection Transformation
Clipping
Scan Conversion
Image

Page 6: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

3D Rendering Pipeline (direct illumination)

3D Geometric Primitives
Modeling Transformation: Transform into 3D world coordinate system
Lighting: Illuminate according to lighting and reflectance
Viewing Transformation: Transform into 3D camera coordinate system
Projection Transformation
Clipping
Scan Conversion
Image

Page 7: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

3D Rendering Pipeline (direct illumination)

3D Geometric Primitives
Modeling Transformation: Transform into 3D world coordinate system
Lighting: Illuminate according to lighting and reflectance
Viewing Transformation: Transform into 3D camera coordinate system
Projection Transformation: Transform into 2D screen coordinate system
Clipping
Scan Conversion
Image

Page 8: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

3D Rendering Pipeline (direct illumination)

3D Geometric Primitives
Modeling Transformation: Transform into 3D world coordinate system
Lighting: Illuminate according to lighting and reflectance
Viewing Transformation: Transform into 3D camera coordinate system
Projection Transformation: Transform into 2D screen coordinate system
Clipping: Clip primitives outside camera’s view
Scan Conversion
Image

Page 9: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

3D Rendering Pipeline (direct illumination)

3D Geometric Primitives
Modeling Transformation: Transform into 3D world coordinate system
Lighting: Illuminate according to lighting and reflectance
Viewing Transformation: Transform into 3D camera coordinate system
Projection Transformation: Transform into 2D screen coordinate system
Clipping: Clip primitives outside camera’s view
Scan Conversion: Draw pixels
Image
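
The transformation stages of this pipeline can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: lighting and clipping are left out, the matrices are plain nested lists, and every function name here (`mat_vec`, `translation`, `perspective`, `to_screen`, `project_vertex`) is hypothetical rather than taken from any graphics API.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    """4x4 matrix that moves a point by (tx, ty, tz)."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def perspective(d):
    """Pinhole projection onto a plane at distance d; the camera looks down -z."""
    return [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, -1.0 / d, 0]]

def to_screen(x, y, width, height):
    """Map x, y in [-1, 1] to pixel coordinates with the origin at the top-left."""
    return ((x + 1) * 0.5 * width, (1 - y) * 0.5 * height)

def project_vertex(p, model, view, proj, width, height):
    world = mat_vec(model, p + [1.0])   # modeling transformation: object -> world
    camera = mat_vec(view, world)       # viewing transformation: world -> camera
    clip = mat_vec(proj, camera)        # projection transformation
    x, y = clip[0] / clip[3], clip[1] / clip[3]  # perspective divide
    return to_screen(x, y, width, height)

identity = translation(0, 0, 0)
# A vertex at the object origin, placed 5 units in front of the camera,
# lands in the center of a 640x480 image.
print(project_vertex([0.0, 0.0, 0.0], translation(0, 0, -5), identity,
                     perspective(1.0), 640, 480))  # (320.0, 240.0)
```

Scan conversion would then fill in the pixels between the projected vertices of each primitive.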

Page 10: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Modern OpenGL Pipeline

Programmable Vertex Processor
Programmable Fragment (Pixel) Processor

CPU: Application
GPU: Vertex Processor → Assembly & Rasterization → Pixel Processor → Video Memory (Textures)

Data flow: Vertices (3D) → Xformed, Lit Vertices (2D) → Fragments (pre-pixels) → Final pixels (Color, Depth)
Graphics State
Render-to-texture

Page 11: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

OpenGL vs. DirectX

OpenGL
– Just graphics
– Standard C interfaces
– State machine
– Multiple platforms
– Academic use

DirectX
– Graphics, multimedia, etc.
– C++ interfaces
– Object oriented
– Windows
– PC games

Page 12: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Possible GPU Performance Bottlenecks

CPU/Bus Bound
– Simply not able to send enough vertices to the card to keep it busy
Vertex Bound
– Vertex processing engine is fully loaded, while the fragment engine is just waiting and grabbing data as soon as it’s ready
Pixel Bound
– The fragment engine is fully loaded, causing the vertex engine to have to wait before sending more data
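
One way to read the three cases above: the pipeline runs at the rate of its slowest stage. The sketch below is a deliberately simplified model; the stage names and the frames-per-second figures are made-up illustration values, not measurements of any card.

```python
def find_bottleneck(stage_rates):
    """Return (stage, rate) for the slowest stage, which limits the whole pipeline."""
    stage = min(stage_rates, key=stage_rates.get)
    return stage, stage_rates[stage]

# Hypothetical rates: frames/second each stage could sustain on its own.
rates = {"cpu/bus": 50.0, "vertex": 30.0, "pixel": 80.0}
print(find_bottleneck(rates))  # ('vertex', 30.0): this workload is vertex bound
```

In practice the bound is found experimentally, for example by shrinking the render resolution (less pixel work) or simplifying the geometry (less vertex work) and watching whether the frame rate changes.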

Page 13: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Early History

NVIDIA founded in 1993
1997: RIVA
1998: RIVA TNT
1999: GeForce 256 (NV10)

Page 14: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce 256 (NV10)

Lighting and transformation
DDR and SDR
HDTV compliant
Hardware alpha-blending
4 pixel pipelines at 120 MHz
Fill Rate: 480 Megapixels/second
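
The fill-rate figure follows from the pipeline count and the clock, assuming each pipeline emits one pixel per clock cycle (an assumption of this sketch; the helper name is hypothetical).

```python
def fill_rate_millions(pipelines, clock_mhz):
    """Peak fill rate in millions of pixels (or texels) per second,
    assuming one result per pipeline per clock cycle."""
    return pipelines * clock_mhz

print(fill_rate_millions(4, 120))   # 480, the GeForce 256 figure
print(fill_rate_millions(24, 430))  # 10320, i.e. about 10.3 billion per second
```

The second line applies the same arithmetic to the 7800 GTX numbers given later in the deck (24 pixel pipelines at 430 MHz) and is consistent with the 10.3 Gigatexel fill rate quoted there.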

Page 15: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce2

2000: GeForce 2 GTS:
– Doubled the pixel fill rate
– Quadrupled the texel fill rate
– Increased clock speed
– Multi-texturing
– S3TC, MPEG-2, FSAA

Page 16: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Anti-Aliasing

[Figure: side-by-side comparison, Without Anti-Aliasing and With Anti-Aliasing]

Page 17: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce2

2000: GeForce 2 MX
– Cut the number of pixel pipelines in half, making it cost effective
– TwinView
– Compatible with Macs

Page 18: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce2

Jan 2001: Apple selected GeForce2 MX as default high-end graphics solution for Power Mac G4
August 2000: GeForce2 Ultra
November 2000: GeForce2 Go
December 2000: NVIDIA buys 3DFX

Page 19: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce3

2001: GeForce3 (NV20)
– 240 MHz Core/500 MHz Memory
– 57 million transistors
– 46-76 Gigaflops
– Vertex shader technology
– Pixel shader technology
– LightSpeed Memory architecture

Page 20: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall
Page 21: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

LightSpeed Memory Architecture

Page 22: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce4

2002: GeForce4 Ti (NV25) and MX (NV17)
– Ti:
  4200, 4400, 4600, and 4800 versions
  63 million transistors
  Chip clock 225-300 MHz
  Memory clock 500-650 MHz
  75-100 million vertices/second

Page 23: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce FX

November 2002: GeForce FX (NV30)
– 16 variations for different price ranges
– 125 million transistors
– 8 pixels/clock
– 1 TMU/pipe (16 textures/unit)
– 128-bit memory interface
– 128 MB/256 MB memory size support

Page 24: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce 6 series

GeForce 6 series (NV40)
– 6200; 6600 GT and Ultra; 6800 GT, Ultra, and Ultra Extreme
– Core clock speed 450 MHz
– Memory clock speed 600 MHz
– Vertex shader units: 6 4-wide fp32 vector MADDs per clock cycle
– Pixel shader units: 16 4-wide fp32 vector MADDs per clock cycle

Page 25: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce 6 series

Superscalar 16-pipe architecture
CineFX 3.0 engine
All operations done in FP32 precision per component
200 Gigaflops (compare this to the Itanium’s 6.4 Gigaflops)

Page 26: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

General Diagram (6800/NV40)

Page 27: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

TurboCache

Uses PCI-Express bandwidth to render directly to system memory
Card needs less memory
Performance boost while lowering cost
TurboCache Manager dynamically allocates from main memory
Local memory used to cache data and to deliver peak performance when needed

Page 28: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

TurboCache

Page 29: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

NV40 Vertex Processor

An NV40 vertex processor is able to execute one vector operation (up to four FP32 components), one scalar FP32 operation, and make one access to the texture per clock cycle

Page 30: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

NV40 Fragment Processors

Early termination from mini z buffer and z buffer checks; resulting sets of 4 pixels (quads) passed on to fragment units

Page 31: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Programmable 2D and Video Processor

Can be used for video decoding and coding (IDCT, deinterlacing, color model transformations, etc.)

Page 32: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Why NV40 series was better

Massive parallelism
Scalability
– Lower end products have fewer pixel pipes and fewer vertex shader units
Computation Power
– 222 million transistors
– First to comply with Microsoft’s DirectX 9.0c (Shader Model 3.0) spec
Dynamic Branching in pixel shaders

Page 33: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Dynamic Branching

Helps detect if pixel needs shading
Instruction flow handled in groups of pixels
Specify branch granularity (the number of consecutive pixels that take the same branch)
Better distribution of blocks of pixels between the different quad engines
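
Why granularity matters can be shown with a toy cost model. The sketch assumes that a group of pixels can skip the expensive shading path only when no pixel in the group needs it; the cost values and the function name are invented for illustration and do not describe the NV40's actual scheduling.

```python
def shader_cost(needs_shading, granularity, cheap_cost=1, expensive_cost=10):
    """Total cost when pixels are processed in groups of `granularity`.

    A group takes the cheap path only if none of its pixels needs shading;
    otherwise every pixel in the group pays the expensive cost.
    """
    total = 0
    for start in range(0, len(needs_shading), granularity):
        group = needs_shading[start:start + granularity]
        per_pixel = expensive_cost if any(group) else cheap_cost
        total += per_pixel * len(group)
    return total

# 16 pixels, every fourth one needs the expensive path.
mask = [True, False, False, False] * 4
print(shader_cost(mask, granularity=1))  # 52: only the 4 pixels that need it pay
print(shader_cost(mask, granularity=4))  # 160: every group contains one such pixel
```

Under this model, finer granularity and spatially coherent branches both reduce wasted work.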

Page 34: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Dynamic Branching

Page 35: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce 7 series

7800 GT
– $449
– 7 vertex units
– 20 pixel pipelines
– Clock speed 400 MHz
– Memory clock speed 500 MHz

7800 GTX
– $600
– 8 vertex units
– 24 pixel pipelines
– Clock speed 430 MHz
– Memory clock speed 600 MHz

Page 36: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce 7800

302 million transistors
200 Gigaflops of multiply/add calculations
128-bit floating point precision through the entire rendering pipeline
Fill Rate: 10.3 Gigatexels/second
860 million vertices/sec

Page 37: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce 7800

Page 38: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall
Page 39: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall
Page 40: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

ALU Units in Pixel Processor

Sub-unit 1:
– NV40: fetches texture data and can issue a MUL vector instruction or use its mini-ALU to issue a non-vector instruction
– G70: same but also can issue a multiply/add
Sub-unit 2:
– NV40: can issue a multiply/add vector instruction or use its own mini-ALU to issue a non-vector instruction
– G70: same

Page 41: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce 6 vs. GeForce 7

ALU Units
– G70: 24 ALU Units
– NV40: 16 ALU Units
Register file: same size
Texture samplers are the same, but when fetching large textures in preparation for filtering, G70's samplers have less latency pulling those textures out of memory

Page 42: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce 6 vs. GeForce 7 (speculative)

Increased L2 texture cache (to around 12KB)
Better cache re-use with larger textures, decompressing those larger textures into L1 faster
Possibly offering more granularity in cache access by the GPU, to reduce texture bandwidth, speeding up rendering.

Page 43: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce 6 vs. GeForce 7

33% more vertex units, each with more performance
Improved vertex fetch unit (unconfirmed by Nvidia)
Triangle setup and rasteriser optimized via the use of a new raster pattern (again unconfirmed by Nvidia)

Page 44: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

General Diagram (7800/G70)

Page 45: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

32-bit IEEE floating-point throughout pipeline (NV40)

Framebuffer
Textures
Fragment processor
Vertex processor
Interpolants
GeForce 7800 (G70) supports 128 bit through entire pipeline!

Page 46: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Hardware supports several other data types

Fragment processor also supports:
– 16-bit “half” floating point
– 12-bit fixed point
– These may be faster than 32-bit on some HW
Framebuffer/textures also support:
– Large variety of fixed-point formats
– E.g., classical 8-bit per component
– These formats use less memory bandwidth than FP32
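
The bandwidth point can be made concrete by counting bytes per four-component (RGBA) pixel. The sketch uses Python's standard `struct` module purely as a size calculator; the dictionary, the helper name, and the 1024×768 example are illustrative assumptions, and the 12-bit fixed-point format is omitted because `struct` has no matching type.

```python
import struct

# Bytes needed for one RGBA pixel in each component format.
BYTES_PER_RGBA_PIXEL = {
    "fp32": struct.calcsize("<4f"),         # four 32-bit floats
    "fp16": struct.calcsize("<4e"),         # four 16-bit "half" floats
    "8-bit fixed": struct.calcsize("<4B"),  # four unsigned bytes
}

def framebuffer_bytes(width, height, fmt):
    """Memory touched when every pixel of a width x height buffer is written once."""
    return width * height * BYTES_PER_RGBA_PIXEL[fmt]

for fmt in BYTES_PER_RGBA_PIXEL:
    print(fmt, framebuffer_bytes(1024, 768, fmt))
# fp32 12582912
# fp16 6291456
# 8-bit fixed 3145728
```

An FP32 buffer therefore moves four times as many bytes per pass as the classical 8-bit format.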

Page 47: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

How are current GPUs different from CPUs?

GPU is a stream processor
Multiple programmable processing units
Connected by data flows

[Diagram: Application → Vertex Processor → Assembly & Rasterization → Fragment Processor → Framebuffer Operations → Framebuffer, with Textures as a separate memory block]

Page 48: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

How are current GPUs different from CPUs?

Optimized for 4-vector arithmetic
– Useful for graphics – colors, vectors, texcoords
– Easy way to get high performance/cost
– SIMD/MIMD

Page 49: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GPU Memory Model vs CPU’s

Much more restricted memory access
– Allocate/free memory only before computation
– Limited memory access during computation (kernel)
Registers
– Read/write
Local memory
– Does not exist
Global memory
– Read-only during computation
– Write-only at end of computation (pre-computed address)
Disk access
– Does not exist
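
These restrictions amount to a "gather but no scatter" programming style: a kernel may read any input location, but its result goes only to the output address it was launched for. The following CPU-side sketch mimics that style; `run_kernel` and `box_blur_1d` are hypothetical names, and nested Python lists stand in for textures.

```python
def run_kernel(kernel, inputs, width, height):
    """Run `kernel` once per output position.

    The kernel can read (gather) from anywhere in `inputs`, which it must treat
    as read-only; its return value is written to its own (x, y) and nowhere else.
    """
    return [[kernel(inputs, x, y) for x in range(width)] for y in range(height)]

def box_blur_1d(inputs, x, y):
    """Average a pixel with its left and right neighbors (clamped at the edges)."""
    row = inputs["image"][y]
    left = row[max(x - 1, 0)]
    right = row[min(x + 1, len(row) - 1)]
    return (left + row[x] + right) / 3.0

image = [[0, 3, 6]]
print(run_kernel(box_blur_1d, {"image": image}, 3, 1))  # [[1.0, 3.0, 5.0]]
```

Because no kernel invocation writes to another's output, all invocations are independent and can run in parallel.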

Page 50: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GPU Memory Model

Where is GPU Data Stored?
– Vertex buffer
– Frame buffer
– Texture

[Diagram: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture is read by the Fragment Processor and, on VS 3.0 GPUs, also by the Vertex Processor]

Page 51: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GPGPU and Motivation

GPUs are fast…
– Itanium: 6.4 GFLOPS
– GeForce 7800: 200 GFLOPS
GPUs are getting faster, faster
– CPUs: annual growth ≈ 1.5× → decade growth ≈ 60×
– GPUs: annual growth > 2.0× → decade growth > 1000×
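
The decade figures are the annual factors compounded over ten years, which a two-line check confirms (the function name is only for illustration).

```python
def decade_growth(annual_factor, years=10):
    """Compound an annual growth factor over the given number of years."""
    return annual_factor ** years

print(decade_growth(1.5))  # 57.6650390625, roughly the 60x quoted for CPUs
print(decade_growth(2.0))  # 1024.0, hence "> 1000x" for GPUs
```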

Page 52: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Motivation: Computational Power

[Figure: chart with GPU and CPU series. Courtesy Naga Govindaraju]

Page 53: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GPGPU

Good for inherently parallel applications
Rapidly evolving ISA and HW architecture
– Largely secret
Can’t simply “port” code written for the CPU!

Page 54: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Programs are Shaders

Bound by the specific hardware profile:
– E.g. different cards have different supported hardware, OpenGL has different restrictions than DirectX, etc.
Hardware profiles change relatively drastically as new GPUs are developed
– But typically new profiles only add features, so there is generally still backwards compatibility (but not always)

Page 55: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Vertex processor

256 instructions per program originally (effectively higher with branching)
– Now up to 65535 instructions
Executes on all vertices
Outputs new vertices or texture coordinates, etc.

Page 56: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Fragment Processor Flow Chart

Page 57: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Fragment processor has flexible texture mapping

Memory is accessible through texture reads
Texture reads are just another instruction
Allows computed texture coordinates, nested to arbitrary depth
Allows multiple uses of a single texture unit

Page 58: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Additional fragment processor capabilities

Read access to window-space position
Read/write access to fragment Z
Built-in derivative instructions
– Partial derivatives w.r.t. screen-space x or y
– Useful for anti-aliasing
Conditional fragment-kill instruction
Multiple FP formats supported

Page 59: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Fragment processor limitations

Originally no branching
– Now supports dynamic branching (but it’s still costly)
No indexed reads from registers
– Use texture reads instead
No memory writes

Page 60: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Branching Instruction Costs (GeForce 6800)

Page 61: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Fragment shaders

Originally very limited in size (only 96 instructions), now expanded to 65535+ instructions
New cards support dynamic branching (but it still incurs some performance penalty)
Now have the ability to output to multiple render targets

Page 62: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

CineFX 4.0 Engine

A redesigned vertex shader unit reduces the time to set up and perform geometry processing.
A new pixel shader unit design can carry out twice as many floating-point operations and greatly accelerates other mathematical operations to increase throughput.
An advanced texture unit incorporates new hardware algorithms and better caching to speed filtering and blending operations.

Page 63: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Vertex Shaders

The 7800 has 8 vertex shaders
The Triangle Setup stage turns the vertex points into a triangle
It also determines mathematically the rasterization for each triangle
Accelerating triangle setup increases the total throughput of the 3D pipeline

Page 64: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Theoretical Rasterization Pattern of a Triangle

Page 65: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

New Pixel Shader – MADD

Multiply and Accumulate are commonly used math functions in 3D graphics
MADD stands for Multiply-ADD operations
The 7800 can do twice as many MADD operations as previous GPUs could
This allows developers to create much more complex visual effects
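
A 4-wide MADD computes a × b + c on each of four components at once. The sketch below spells that out on plain tuples; the function name and the example values (a color scaled by a light intensity with an ambient term added) are illustrative only.

```python
def madd4(a, b, c):
    """Component-wise a * b + c on 4-component vectors (one vector MADD)."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))

color = (2.0, 4.0, 6.0, 1.0)    # RGBA surface color
light = (0.5, 0.5, 0.5, 1.0)    # per-channel light intensity
ambient = (1.0, 1.0, 1.0, 0.0)  # constant ambient contribution
print(madd4(light, color, ambient))  # (2.0, 3.0, 4.0, 1.0)
```

Each such instruction performs four multiplies and four adds, which is why MADD throughput is a common basis for peak floating-point figures.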

Page 66: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Transparency Adaptive Supersampling

Takes extra passes over thin-lined objects such as chain-link fences or trees to enhance quality
Pixels inside of a polygon are usually not touched by anti-aliasing methods
With this, a key set is devised, and those pixels are anti-aliased, creating a smoother image.

Page 67: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Transparency Adaptive Supersampling

Page 68: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Transparency Adaptive Multisampling

Higher levels of performance, because it uses one texel to determine other subpixel values
Not as high quality

Page 69: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall
Page 70: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Supporting the Future

The 7800 is already set up to support the new Microsoft Longhorn OS with some of the following advancements
– Video post-processing
– Real-time desktop compositing
– Seamless multiple 3D applications
– Accelerated antialiased text rendering
– Special effects and animation

Page 71: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Accelerated Graphics Port (AGP)

AGP is superior to PCI because it provides a dedicated pathway between the slot and the processor
Uses sideband addressing
PCI must load a texture from the hard drive into the system’s RAM, then from the RAM into the GPU framebuffer
AGP can read textures directly from system RAM by “tricking” the CPU into believing the textures are in the framebuffer, when they are really in system memory

Page 72: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

PCI Express

Based on the PCI system, allowing for backwards compatibility
Uses 1-bit, bi-directional lanes (PCI used a bus)
Each lane can support 250 MB/s in each direction (4 GB/s total for an x16 link)
– AGP is only 2 GB/s
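
The 4 GB/s total is the per-lane rate multiplied by the 16 lanes of a graphics slot, counting one direction. A quick check, with a hypothetical helper name:

```python
def pcie_bandwidth_mb_s(lanes, per_lane_mb_s=250):
    """Peak one-direction bandwidth of a first-generation PCI Express link, in MB/s."""
    return lanes * per_lane_mb_s

print(pcie_bandwidth_mb_s(16))  # 4000 MB/s, i.e. about 4 GB/s for an x16 slot
print(pcie_bandwidth_mb_s(1))   # 250 MB/s for a single lane
```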

Page 73: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Scalable Link Interface (SLI)

Takes advantage of the PCI Express bus, which will allow more than one discrete graphics device on the same PCI host
Allows two of the same GeForce GPUs to run on one machine, thus “sharing” the load.
There are two modes for this
– Split-frame Rendering (SFR)
– Alternate-frame Rendering (AFR)

Page 74: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall
Page 75: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Split-frame Rendering

Has each GPU render a portion of the screen, split horizontally
No extra latency
Not necessarily evenly split
– SFR is load shared, so it splits up the frame by the amount of work, not the size
A large amount of overhead is involved, causing a max speed up of around 1.8 times
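
Splitting by work rather than by size can be sketched as choosing the scanline where the accumulated rendering cost is closest to half of the total. The per-row costs and the function name below are invented for illustration; this is not NVIDIA's actual load-balancing algorithm, which the slides do not describe.

```python
def balanced_split(row_costs):
    """Return the number of rows to give the top GPU so that the top and bottom
    parts of the frame carry as nearly equal work as possible."""
    total = sum(row_costs)
    best_row, best_diff = 0, total
    running = 0
    for row, cost in enumerate(row_costs):
        running += cost
        diff = abs(total - 2 * running)  # |bottom work - top work|
        if diff < best_diff:
            best_row, best_diff = row + 1, diff
    return best_row

# Hypothetical per-row costs: the bottom of the frame is much more expensive.
print(balanced_split([1, 1, 1, 1, 2, 6]))  # 5: top GPU gets 5 rows, bottom GPU gets 1
print(balanced_split([1, 1, 1, 1, 1, 1]))  # 3: uniform work gives an even split
```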

Page 76: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Alternate-frame Rendering

Avoids all the overhead problems of SFR
Many buffer swaps
Reliant on the speed of the processor
Can cause latency issues
Recommended mode by NVIDIA

Page 77: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce Go 7800 GTX

The mobile version of the 7800 GTX
Everything from the desktop release has been carried over to this
Can switch between x1 and x16 lanes of PCI Express
Uses PowerMizer 6.0, which allows this chip to operate in the same power envelope as its predecessor, the 6800

Page 78: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall
Page 79: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall
Page 80: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

GeForce Go 7800 – Power Issues

Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
Dynamic clock scaling can run as slow as 16 MHz
– This is true for the engine, memory, and pixel clocks
Heavier use of clock gating than the desktop version
Runs at voltages lower than any other mobile performance part
Regardless, you won’t get much battery-based runtime for a 3D game

Page 81: NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

Questions?