NVIDIA GeForce
Ryan Hendrixson
Ryan Schubert
Allison Walthall
What Does a GPU Actually Do?
Historically, from:
– Acting simply as a frame buffer
– Doing vertex transformations and pixel color calculations
– Now even programmable
In the simplest sense, a modern GPU implements a 3D rendering pipeline
3D Rendering Pipeline (direct illumination)
[Diagram: 3D Geometric Primitives → Modeling Transformation → Lighting → Viewing Transformation → Projection Transformation → Clipping → Scan Conversion → Image]
This is a pipelined sequence of operations to draw a 3D primitive into a 2D image
3D Rendering Pipeline (direct illumination)
3D Geometric Primitives
– Modeling Transformation: transform into 3D world coordinate system
– Lighting: illuminate according to lighting and reflectance
– Viewing Transformation: transform into 3D camera coordinate system
– Projection Transformation: transform into 2D screen coordinate system
– Clipping: clip primitives outside camera’s view
– Scan Conversion: draw pixels
Image
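The transformation stages of this pipeline (modeling, viewing, projection, clipping, and the final mapping to screen coordinates) can be sketched for a single vertex. This is a minimal illustration, not how any particular GPU implements it: the helper names, the OpenGL-style perspective matrix, the camera placement, and the 640x480 viewport are all assumptions, and lighting and scan conversion are left out.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    """4x4 translation matrix (hypothetical helper for this sketch)."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective matrix; the camera looks down the -z axis."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [[f / aspect, 0, 0, 0],
            [0, f, 0, 0],
            [0, 0, (far + near) / (near - far), 2 * far * near / (near - far)],
            [0, 0, -1, 0]]

def inside_clip_volume(clip):
    """Clipping test: keep the point only if it lies inside the view volume."""
    x, y, z, w = clip
    return w > 0 and all(-w <= c <= w for c in (x, y, z))

def to_screen(clip, width, height):
    """Perspective divide, then map [-1, 1] to pixel coordinates."""
    x, y, _, w = clip
    return ((x / w + 1) * 0.5 * width, (1 - y / w) * 0.5 * height)

def project_vertex(vertex, model, view, proj, width, height):
    world = mat_vec(model, vertex + [1.0])   # modeling transformation
    camera = mat_vec(view, world)            # viewing transformation
    clip = mat_vec(proj, camera)             # projection transformation
    if not inside_clip_volume(clip):         # clipping
        return None
    return to_screen(clip, width, height)    # 2D screen coordinates

model = translation(0, 0, -5)            # object placed 5 units in front of the camera
view = translation(0, 0, 0)              # camera at the world origin
proj = perspective(90, 1.0, 0.1, 100.0)

pixel = project_vertex([0.0, 0.0, 0.0], model, view, proj, 640, 480)
behind = project_vertex([0.0, 0.0, 10.0], model, view, proj, 640, 480)
print(pixel, behind)
```

The first vertex lands in the center of the viewport; the second ends up behind the camera and is clipped, so it never reaches scan conversion.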
Modern OpenGL Pipeline
Programmable Vertex Processor
Programmable Fragment (Pixel) Processor
[Diagram: Application (CPU) → Vertices (3D) → Vertex Processor → Xformed, Lit Vertices (2D) → Assembly & Rasterization → Fragments (pre-pixels) → Pixel Processor → Final pixels (Color, Depth) → Video Memory (Textures). Everything after the Application runs on the GPU. Other labels: Graphics State, Render-to-texture.]
OpenGL vs. DirectX
OpenGL
– Just graphics
– Standard C interfaces
– State machine
– Multiple platforms
– Academic use
DirectX
– Graphics, multimedia, etc.
– C++ interfaces
– Object oriented
– Windows
– PC games
Possible GPU Performance Bottlenecks
CPU/Bus Bound
– Simply not able to send enough vertices to the card to keep it busy
Vertex Bound
– Vertex processing engine is fully loaded, while the fragment engine is just waiting and grabbing data as soon as it’s ready
Pixel Bound
– The fragment engine is fully loaded, causing the vertex engine to have to wait before sending more data
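One way to read the three cases: a pipeline runs only as fast as its slowest stage. The sketch below assumes you already know how many frames per second each stage could sustain on its own; the stage names and numbers are made up for illustration.

```python
def find_bottleneck(stage_rates):
    """Return (stage, rate) for the slowest stage, which caps the whole pipeline."""
    stage = min(stage_rates, key=stage_rates.get)
    return stage, stage_rates[stage]

# Hypothetical frames per second each stage could sustain in isolation
rates = {"cpu/bus": 40.0, "vertex": 75.0, "pixel": 25.0}
stage, rate = find_bottleneck(rates)
print(f"{stage} bound: at most {rate} frames/s")
```

With these numbers the application is pixel bound, so sending fewer vertices would not help.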
Early History
NVIDIA founded in 1993
1997: RIVA
1998: RIVA TNT
1999: GeForce 256 (NV10)
GeForce 256 (NV10)
Lighting and transformation
DDR and SDR
HDTV compliant
Hardware alpha-blending
4 pixel pipelines at 120 MHz
Fill Rate: 480 Megapixels/second
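The fill rate quoted for the GeForce 256 follows from the pipeline count and the clock. A quick check, assuming each pipeline writes one pixel per clock (the function name is only for this sketch):

```python
def fill_rate_mpixels(pixel_pipelines, core_clock_mhz, pixels_per_pipe_per_clock=1):
    """Peak fill rate in megapixels per second (theoretical upper bound)."""
    return pixel_pipelines * core_clock_mhz * pixels_per_pipe_per_clock

geforce_256 = fill_rate_mpixels(4, 120)
print(geforce_256)  # 480
```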
GeForce2
2000: GeForce2 GTS:
– Doubled the pixel fill rate
– Quadrupled the texel fill rate
– Increased clock speed
– Multi-texturing
– S3TC, MPEG-2, FSAA
Anti-Aliasing
[Figure: side-by-side comparison. Left: Without Anti-Aliasing. Right: With Anti-Aliasing]
GeForce2
2000: GeForce2 MX
– Cut the number of pixel pipelines by 2, making it cost effective
– TwinView
– Compatible with Macs
GeForce2
Jan 2001: Apple selected GeForce2 MX as default high-end graphics solution for Power Mac G4
August 2000: GeForce2 Ultra
November 2000: GeForce2 Go
December 2000: NVIDIA buys 3DFX
GeForce3
2001: GeForce3 (NV20)
– 240 MHz Core/500 MHz Memory
– 57 million transistors
– 46-76 Gigaflops
– Vertex shader technology
– Pixel shader technology
– LightSpeed Memory architecture
[Diagram: LightSpeed Memory Architecture]
GeForce4
2002: GeForce4 Ti (NV25) and MX (NV17)
– Ti: 4200, 4400, 4600, and 4800 versions
– 63 million transistors
– Chip clock 225-300 MHz
– Memory clock 500-650 MHz
– 75-100 million vertices/second
GeForce FX
November 2002: GeForce FX (NV30)
– 16 variations for different price ranges
– 125 million transistors
– 8 pixels/clock
– 1 TMU/pipe (16 textures/unit)
– 128-bit memory interface
– 128 MB/256 MB memory size support
GeForce 6 series
GeForce 6 series (NV40)
– 6200; 6600 GT and Ultra; 6800 GT, Ultra, and Ultra Extreme
– Core clock speed 450 MHz
– Memory clock speed 600 MHz
– Vertex shader units: 6 4-wide FP32 vector MADDs per clock cycle
– Pixel shader units: 16 4-wide FP32 vector MADDs per clock cycle
GeForce 6 series
Super scalar 16 pipe architecture
CineFX 3.0 engine
All operations done in FP32 precision per component
200 Gigaflops (compare this to the Itanium’s 6.4 Gigaflops)
General Diagram (6800/NV40)
TurboCache
Uses PCI-Express bandwidth to render directly to system memory
Card needs less memory
Performance boost while lowering cost
TurboCache Manager dynamically allocates from main memory
Local memory used to cache data and to deliver peak performance when needed
[Figure: TurboCache]
NV40 Vertex Processor
An NV40 vertex processor is able to execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle
NV40 Fragment Processors
Early termination from mini z buffer and z buffer checks; resulting sets of 4 pixels (quads) passed on to fragment units
Programmable 2D and Video Processor
Can be used for video decoding and coding (IDCT, deinterlacing, color model transformations, etc.)
Why NV40 series was better
Massive parallelism
Scalability
– Lower end products have fewer pixel pipes and fewer vertex shader units
Computation Power
– 222 million transistors
– First to comply with Microsoft’s DirectX 9 spec
Dynamic Branching in pixel shaders
Dynamic Branching
Helps detect if pixel needs shading
Instruction flow handled in groups of pixels
Specify branch granularity (the number of consecutive pixels that take the same branch)
Better distribution of blocks of pixels between the different quad engines
[Figure: Dynamic Branching]
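Why branch granularity matters can be shown with a toy cost model. It assumes pixels run in lockstep groups: a group whose pixels all agree pays for one side of the branch, while a group that disagrees pays for both. The cost numbers and the function are hypothetical and do not model NV40 timing.

```python
def branch_cost(takes_branch, granularity, cost_taken, cost_skipped):
    """Total instruction cost for a run of pixels under a lockstep model.

    Pixels are processed in groups of `granularity`. A uniform group pays for
    one side of the branch; a divergent group pays for both sides.
    """
    total = 0
    for start in range(0, len(takes_branch), granularity):
        group = takes_branch[start:start + granularity]
        if all(group):
            per_pixel = cost_taken
        elif not any(group):
            per_pixel = cost_skipped
        else:
            per_pixel = cost_taken + cost_skipped
        total += per_pixel * len(group)
    return total

# 16 pixels: the first 6 need the expensive shading path, the rest do not.
mask = [True] * 6 + [False] * 10
coarse = branch_cost(mask, granularity=8, cost_taken=20, cost_skipped=2)
fine = branch_cost(mask, granularity=2, cost_taken=20, cost_skipped=2)
print(coarse, fine)  # 192 140
```

The finer granularity wastes less work here because fewer groups straddle the boundary between shaded and unshaded pixels.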
GeForce 7 series
7800 GT
– $449
– 7 vertex units
– 20 pixel pipelines
– Clock speed 400 MHz
– Memory clock speed 500 MHz
7800 GTX
– $600
– 8 vertex units
– 24 pixel pipelines
– Clock speed 430 MHz
– Memory clock speed 600 MHz
GeForce 7800
302 million transistors
200 Gigaflops of multiply/add calculations
128-bit floating point precision through the entire rendering pipeline
Fill Rate: 10.3 Gigatexels/second
860 million vertices/sec
[Figure: GeForce 7800]
ALU Units in Pixel Processor
Sub-unit 1:
– NV40: handles texture data and can issue a MUL vector instruction or use its mini-ALU to issue a non-vector instruction
– G70: same, but can also issue a multiply/add
Sub-unit 2:
– NV40: can issue a multiply/add vector instruction or use its own mini-ALU to issue a non-vector instruction
– G70: same
GeForce 6 vs. GeForce 7
ALU Units
– G70: 24 ALU Units
– NV40: 16 ALU Units
Register file: same size
Texture samplers the same but when fetching large textures in preparation for filtering, G70's samplers have less latency pulling those textures out of memory
GeForce 6 vs. GeForce 7 (speculative)
Increased L2 texture cache (to around 12KB)
Better cache re-use with larger textures, decompressing those larger textures into L1 faster
Possibly offering more granularity in cache access by the GPU, to reduce texture bandwidth, speeding up rendering.
GeForce 6 vs. GeForce 7
33% more vertex units, each with more performance
Improved vertex fetch unit (unconfirmed by NVIDIA)
Triangle setup and rasterizer optimized via the use of a new raster pattern (again unconfirmed by NVIDIA)
General Diagram (7800/G70)
32-bit IEEE floating-point throughout pipeline (NV40)
– Framebuffer
– Textures
– Fragment processor
– Vertex processor
– Interpolants
GeForce 7800 (G70) supports 128 bit through entire pipeline!
Hardware supports several other data types
Fragment processor also supports:
– 16-bit “half” floating point
– 12-bit fixed point
– These may be faster than 32-bit on some HW
Framebuffer/textures also support:
– Large variety of fixed-point formats
– E.g., classical 8-bit per component
– These formats use less memory bandwidth than FP32
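The bandwidth point can be made concrete by packing one four-component (RGBA) value in three of these formats. This is a rough sketch using Python's `struct` module on the CPU; the sample color and the rounding used for the 8-bit case are assumptions, and the 12-bit fixed-point format is omitted.

```python
import struct

rgba = (1.0, 0.5, 0.25, 1.0)  # sample color, chosen to be exact in FP16

fp32 = struct.pack("<4f", *rgba)               # 4 x 32-bit floats = 128 bits
fp16 = struct.pack("<4e", *rgba)               # 4 x 16-bit "half" floats = 64 bits
unorm8 = bytes(round(c * 255) for c in rgba)   # classical 8 bits per component

sizes = {"fp32": len(fp32), "fp16": len(fp16), "8-bit": len(unorm8)}
print(sizes)  # {'fp32': 16, 'fp16': 8, '8-bit': 4}
```

The 16 bytes for FP32 are where the "128 bit" figure comes from: four components at 32 bits each. The 8-bit format moves a quarter of that per pixel.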
How are current GPUs different from CPUs?
GPU is a stream processor
– Multiple programmable processing units
– Connected by data flows
[Diagram: Application → Vertex Processor → Assembly & Rasterization → Fragment Processor → Framebuffer Operations → Framebuffer; Textures]
How are current GPUs different from CPUs?
Optimized for 4-vector arithmetic
– Useful for graphics: colors, vectors, texcoords
– Easy way to get high performance/cost
– SIMD/MIMD
GPU Memory Model vs CPU’s
Much more restricted memory access
– Allocate/free memory only before computation
– Limited memory access during computation (kernel)
Registers
– Read/write
Local memory
– Does not exist
Global memory
– Read-only during computation
– Write-only at end of computation (pre-computed address)
Disk access
– Does not exist
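These restrictions amount to a "gather, but no scatter" model: a kernel may read any address in global memory, but its one result goes to an address fixed before it runs. A minimal CPU-side imitation, where `run_kernel` and `blur3` are hypothetical names used only for this sketch:

```python
def run_kernel(kernel, inputs, output_size):
    """Apply `kernel` once per output element, fragment-program style.

    The kernel may read (gather) from any input address, but its single
    result is written to the pre-computed address `i`; it cannot scatter.
    """
    read_only = tuple(inputs)  # global memory is read-only during the pass
    return [kernel(i, read_only) for i in range(output_size)]

def blur3(i, src):
    """Average an element with its two neighbours (clamped at the edges)."""
    left = src[max(i - 1, 0)]
    right = src[min(i + 1, len(src) - 1)]
    return (left + src[i] + right) / 3.0

result = run_kernel(blur3, [0.0, 3.0, 6.0, 9.0], 4)
print(result)  # [1.0, 3.0, 6.0, 8.0]
```

An algorithm that wants to write to a computed address has to be restructured so that each output element gathers what it needs instead.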
GPU Memory Model
Where is GPU Data Stored?
– Vertex buffer
– Frame buffer
– Texture
[Diagram: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture feeds the Fragment Processor and, on VS 3.0 GPUs, the Vertex Processor]
GPGPU and Motivation
GPUs are fast…
– Itanium: 6.4 GFLOPS
– GeForce 7800: 200 GFLOPS
GPUs are getting faster, faster
– CPUs: annual growth 1.5× → decade growth 60×
– GPUs: annual growth > 2.0× → decade growth > 1000×
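The decade figures are just the annual factors compounded over ten years, which is easy to check:

```python
def decade_growth(annual_factor, years=10):
    """Compound an annual growth factor over a number of years."""
    return annual_factor ** years

cpu = decade_growth(1.5)   # about 57.7, which the slide rounds to 60x
gpu = decade_growth(2.0)   # 1024.0, consistent with "> 1000x"
print(f"CPU: {cpu:.1f}x per decade, GPU: {gpu:.0f}x per decade")
```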
Motivation: Computational Power
[Chart: GPU vs. CPU computational power. Courtesy Naga Govindaraju]
GPGPU
Good for inherently parallel applications
Rapidly evolving ISA and HW architecture
– Largely secret
Can’t simply “port” code written for the CPU!
Programs are Shaders
Bound by the specific hardware profile:
– E.g. different cards have different supported hardware, OpenGL has different restrictions than DirectX, etc.
Hardware profiles change relatively drastically as new GPUs are developed
– But typically new profiles only add features, so there is generally still backwards compatibility (but not always)
Vertex processor
256 instructions per program originally (effectively higher with branching)
– Now up to 65535 instructions
Executes on all vertices
Outputs new vertices or texture coordinates, etc.
[Figure: Fragment Processor Flow Chart]
Fragment processor has flexible texture mapping
Memory is accessible through texture reads
Texture reads are just another instruction
Allows computed texture coordinates, nested to arbitrary depth
Allows multiple uses of a single texture unit
Additional fragment processor capabilities
Read access to window-space position
Read/write access to fragment Z
Built-in derivative instructions
– Partial derivatives w.r.t. screen-space x or y
– Useful for anti-aliasing
Conditional fragment-kill instruction
Multiple FP formats supported
Fragment processor limitations
Originally no branching
– Now support dynamic branching (but it’s still costly)
No indexed reads from registers
– Use texture reads instead
No memory writes
[Figure: Branching Instruction Costs (GeForce 6800)]
Fragment shaders
Originally very limited in size (only 96 instructions), now expanded to 65535+ instructions
New cards support dynamic branching (but it still incurs some performance penalty)
Now have the ability to output to multiple render targets
CineFX 4.0 Engine
A redesigned vertex shader unit reduces the time to set up and perform geometry processing.
A new pixel shader unit design can carry out twice as many floating-point operations and greatly accelerates other mathematical operations to increase throughput.
An advanced texture unit incorporates new hardware algorithms and better caching to speed filtering and blending operations.
Vertex Shaders
The 7800 has 8 vertex shaders
The Triangle Setup stage turns the vertex points into a triangle
It also determines mathematically the rasterization for each triangle
Accelerating triangle setup increases the total throughput of the 3D pipeline
[Figure: Theoretical Rasterization Pattern of a Triangle]
New Pixel Shader – MADD
Multiply and Accumulate are commonly used math functions in 3D graphics
MADD stands for Multiply-ADD operations
The 7800 can do twice as many MADD operations as previous GPUs could
This allows developers to create much more complex visual effects
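To see why MADD throughput matters so much, note that a 4x4 matrix-vector transform, the workhorse of vertex processing, reduces to a chain of four 4-wide multiply-adds. The sketch below assumes the matrix is stored as columns; `madd4` and `transform` are illustrative names, not hardware instructions.

```python
def madd4(a, b, c):
    """Component-wise multiply-add on 4-vectors: a * b + c."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))

def transform(matrix_columns, v):
    """Matrix-vector product built from four 4-wide MADDs."""
    acc = (0.0, 0.0, 0.0, 0.0)
    for column, component in zip(matrix_columns, v):
        acc = madd4(column, (component,) * 4, acc)
    return acc

# Translation by (2, 3, 4), written as matrix columns
columns = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (2, 3, 4, 1)]
moved = transform(columns, (1, 1, 1, 1))
print(moved)  # (3.0, 4.0, 5.0, 1.0)
```

Lighting dot products and color blends decompose the same way, so doubling the MADD rate speeds up most shader arithmetic.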
Transparency Adaptive Supersampling
Takes extra passes of thin-lined objects such as chain-link fences or trees to enhance quality
Pixels inside of a polygon are usually not touched by anti-aliasing methods
With this, a key set is devised, and those pixels are anti-aliased, creating a smoother image.
[Figure: Transparency Adaptive Supersampling]
Transparency Adaptive Multisampling
Higher levels of performance, because it uses one texel to determine other subpixel values
Not as high quality
Supporting the Future
The 7800 is already set up to support the new Microsoft Longhorn OS with some of the following advancements
– Video post-processing
– Real-time desktop compositing
– Seamless multiple 3D applications
– Accelerated antialiased text rendering
– Special effects and animation
Accelerated Graphics Port (AGP)
AGP is superior to PCI because it provides a dedicated pathway between the slot and the processor
Uses sideband addressing
PCI must load a texture from the hard drive into the system’s RAM, then from the RAM into the GPU framebuffer
AGP can read textures directly from system RAM by “tricking” the CPU into believing the textures are in the framebuffer, when they are really in system memory
PCI Express
Based on the PCI system, allowing for backwards compatibility
Uses 1-bit, bi-directional lanes (PCI used a bus)
Each lane can support 250 MB/s in each direction (4 GB/s total for x16)
– AGP is only 2 GB/s
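The 4 GB/s total is what a 16-lane slot gives at 250 MB/s per lane. A quick check, where the 16-lane width is an assumption based on the x16 slots graphics cards use and the figure is per direction:

```python
PCIE_LANE_MB_S = 250  # per lane, per direction

def pcie_bandwidth_gb_s(lanes):
    """Peak one-direction bandwidth of a PCI Express link in GB/s."""
    return lanes * PCIE_LANE_MB_S / 1000

x16 = pcie_bandwidth_gb_s(16)
x1 = pcie_bandwidth_gb_s(1)
print(x16, x1)  # 4.0 0.25
```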
Scalable Link Interface (SLI)
Takes advantage of the PCI Express bus, which will allow more than one discrete graphics device on the same PCI host
Allows two of the same GeForce GPUs to run on one machine, thus “sharing” load.
There are two modes for this
– Split-frame Rendering (SFR)
– Alternate-frame Rendering (AFR)
Split-frame Rendering
Has each GPU render a portion of the screen, split horizontally
No extra latency
Not necessarily evenly split
– SFR is load shared, so it splits up the frame by the amount of work, not the size
A large amount of overhead is involved, causing a max speed up of around 1.8 times
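Splitting by work rather than size can be sketched as choosing the scanline where the rendering cost above and below is closest to equal. The per-row costs are hypothetical inputs; how the driver actually measures work is not described here.

```python
def split_row(row_costs):
    """Pick the scanline where the frame is divided between two GPUs.

    Returns the index of the first row given to the second GPU, chosen so the
    rendering cost above and below the split is as even as possible.
    """
    total = sum(row_costs)
    best_row, best_gap = 0, total
    above = 0
    for row, cost in enumerate(row_costs):
        gap = abs(total - 2 * above)  # |below - above| when splitting before `row`
        if gap < best_gap:
            best_row, best_gap = row, gap
        above += cost
    return best_row

# 10 rows: cheap sky on top (cost 1 each), expensive ground below (cost 3 each)
costs = [1] * 6 + [3] * 4
print(split_row(costs))  # 7
```

An even split by size would cut at row 5; the load-shared split gives the first GPU seven rows and the second only three, and each ends up with a cost of 9.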
Alternate-frame Rendering
Avoids all the overhead problems of SFR
Many buffer swaps
Reliant on the speed of the processor
Can cause latency issues
Recommended mode by NVIDIA
GeForce Go 7800 GTX
The mobile version of the 7800 GTX
Everything from the desktop release has been carried over to this
Can switch between x1 and x16 lanes of PCI Express
Uses PowerMizer 6.0, which allows this chip to operate in the same envelope as its predecessor, the 6800
GeForce Go 7800 – Power Issues
Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
Dynamic clock scaling can run as slow as 16 MHz
– This is true for the engine, memory, and pixel clocks
Heavier use of clock gating than the desktop version
Runs at voltages lower than any other mobile performance part
Regardless, you won’t get much battery-based runtime for a 3D game
Questions?