commodity components 1 greg humphreys october 10, 2002
TRANSCRIPT
Commodity Components 1
Greg Humphreys, October 10, 2002

Life Is But a (Graphics) Stream
[Diagram: parallel clients (C) feeding transformation nodes (T) feeding servers (S)]
“There’s a bug somewhere…”
You Spent How Much Money On What Exactly?
Several Years of Failed Experiments

[Timeline: 1975-1980, 1980-1982, 1982-1986, 1986-Present]
The Problem

Scalable graphics solutions are rare and expensive
Commodity technology is getting faster, but it tends not to scale
Cluster graphics solutions have been inflexible
Big Models

Scans of Saint Matthew (386 MPolys) and the David (2 GPolys). Stanford Digital Michelangelo Project
Large Displays

Window system and large-screen interaction metaphors. François Guimbretière, Stanford University HCI group
Modern Graphics Architecture

NVIDIA GeForce4 Ti 4600. Amazing technology:
• 4.8 Gpix/sec (antialiased)
• 136 Mtri/sec
• Programmable pipeline stages
Capabilities increasing at roughly 225% per year

But it doesn't scale:
• Triangle rate is bus-limited: 136 Mtri/sec mostly unachievable
• Display resolution growing very slowly

Result: there is a serious gap between dataset complexity and processing power
GeForce4 Die Plot Courtesy NVIDIA
Why Clusters?

Commodity parts
• Complete graphics pipeline on a single chip
• Extremely fast product cycle
Flexibility
• Configurable building blocks
Cost
• Driven by consumer demand
• Economies of scale
Availability
• Insufficient demand for "big iron" solutions
• Little or no ongoing innovation in graphics "supercomputers"
Stanford/DOE Visualization Cluster

32 nodes, each with graphics
Compaq SP750
• Dual 800 MHz PIII Xeon
• i840 logic
• 256 MB memory
• 18 GB disk
• 64-bit 66 MHz PCI
• AGP-4x
Graphics
• 16 NVIDIA Quadro2 Pro
• 16 NVIDIA GeForce 3
Network
• Myrinet (LANai 7, ~100 MB/sec)
Virginia Cluster?

Ideas
Technology for driving tiled displays
• Unmodified applications
• Efficient network usage
Scalable rendering rates on clusters
• 161 Mtri/sec at interactive rates to a display wall
• 1.6 Gvox/sec at interactive rates to a single display
Cluster graphics as stream processing
• Virtual graphics interface
• Flexible mechanism for non-invasive transformations
The Name Game

One idea, two systems:
WireGL
• Sort-first parallel rendering for tiled displays
• Released to the public in 2000
Chromium
• General stream processing framework
• Multiple parallel rendering architectures
• Open-source project started in June 2001
• Alpha release September 2001
• Beta release April 2002
• 1.0 release September 2002
General Approach

Replace the system's OpenGL driver
• Industry-standard API
• Support existing, unmodified applications
Manipulate streams of API commands
• Route commands over a network
• Track state!
• Render commands using graphics hardware
Allow parallel applications to issue OpenGL
• Constrain ordering between multiple streams
Cluster Graphics

• Raw scalability is easy (just add more pipelines)
• One of our goals is to expose that scalability to an application

[Diagram: four independent App → Graphics Hardware → Display pipelines]
Cluster Graphics

• Graphics hardware is indivisible
• Each graphics pipeline is managed by a network server

[Diagram: four App → Server → Display pipelines]
Cluster Graphics

• Flexible number of clients, servers and displays
• Compute limited = more clients
• Graphics limited = more servers
• Interface/network limited = more of both

[Diagram: several apps fanning out to several servers, some of which drive displays]
Output Scalability

Larger displays with unmodified applications
Other possibilities: broadcast, ring network

[Diagram: one app sorting to many servers, each driving one display tile]
Protocol Design

1 byte of overhead per function call. Opcodes are packed separately from their data:

glColor3f( 1.0, 0.5, 0.5 );   →  COLOR3F  | 1.0 0.5 0.5
glVertex3f( 1.0, 2.0, 3.0 );  →  VERTEX3F | 1.0 2.0 3.0
glColor3f( 0.5, 1.0, 0.5 );   →  COLOR3F  | 0.5 1.0 0.5
glVertex3f( 2.0, 3.0, 1.0 );  →  VERTEX3F | 2.0 3.0 1.0
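The packing scheme can be sketched in a few lines of Python. The opcode values and buffer layout below are assumptions for illustration, not the actual WireGL wire format:

```python
import struct

# Hypothetical opcode values; the real protocol assigns one opcode
# byte to each OpenGL entry point.
OP_COLOR3F, OP_VERTEX3F = 0x10, 0x11

class PackBuffer:
    """Pack calls as parallel runs of opcodes and operand data, so each
    call adds just 1 opcode byte on top of its raw arguments."""
    def __init__(self):
        self.opcodes = bytearray()   # 1 byte per call
        self.data = bytearray()      # packed operands

    def Color3f(self, r, g, b):
        self.opcodes.append(OP_COLOR3F)
        self.data += struct.pack('<3f', r, g, b)

    def Vertex3f(self, x, y, z):
        self.opcodes.append(OP_VERTEX3F)
        self.data += struct.pack('<3f', x, y, z)

buf = PackBuffer()
buf.Color3f(1.0, 0.5, 0.5)
buf.Vertex3f(1.0, 2.0, 3.0)
buf.Color3f(0.5, 1.0, 0.5)
buf.Vertex3f(2.0, 3.0, 1.0)
# 4 calls cost 4 opcode bytes plus 48 bytes of float data
```

Separating opcodes from operands keeps the data naturally aligned and makes the per-call overhead exactly one byte.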
Efficient Remote Rendering

System    Network     Mvert/sec   Efficiency
Direct    None        21.50
GLX       100 Mbit     0.73       70%
WireGL    100 Mbit     0.90       86%
WireGL    Myrinet      9.18       88%
WireGL    None*       20.90       97%

Application draws 60,000 vertices/frame. Measurements using 800 MHz PIII + GeForce2. Efficiency assumes 12 bytes per triangle. *None: discard packets, measuring pack rate.
Sort-first Stream Specialization

Update a bounding box per vertex
Transform the bounds to screen space
Assign primitives to servers (with overlap)
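The bucketing step above can be sketched as follows. The tile layout and function names are invented for illustration; the real tilesort logic also performs the screen-space transform of the bounds:

```python
# Sketch of sort-first bucketing: a primitive is sent to every server
# whose screen tile its screen-space bounding box overlaps.
def bounds(verts):
    """Grow a bounding box over a primitive's (x, y) vertices."""
    xs = [x for x, y in verts]
    ys = [y for x, y in verts]
    return (min(xs), min(ys), max(xs), max(ys))

def overlapped_servers(bbox, tiles):
    x0, y0, x1, y1 = bbox
    return [s for s, (tx0, ty0, tx1, ty1) in tiles.items()
            if x0 < tx1 and x1 > tx0 and y0 < ty1 and y1 > ty0]

# Four 512x512 tiles of a 1024x1024 display, one server per tile.
tiles = {0: (0, 0, 512, 512),    1: (512, 0, 1024, 512),
         2: (0, 512, 512, 1024), 3: (512, 512, 1024, 1024)}

tri = [(500, 100), (530, 110), (510, 140)]
# This triangle straddles the vertical seam between tiles 0 and 1, so
# it is sent (with overlap) to both servers.
hit = overlapped_servers(bounds(tri), tiles)
```

Primitives overlapping several tiles are simply duplicated into each server's stream, which is where the overlap cost in the results below comes from.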
Graphics State

OpenGL is a big state machine. State encapsulates control for geometric operations:
• Lighting/shading parameters
• Texture maps and texture mapping parameters
• Boolean enables/disables
• Rendering modes

Example: glColor3f( 1.0, 1.0, 1.0 )
• Sets the current color to white
• Any subsequent primitives will appear white
Lazy State Update

Track the entire OpenGL state
Precede a tile's geometry with state deltas:

glTexImage2D(…)
glBlendFunc(…)
glEnable(…)
glLightfv(…)
glMaterialf(…)
glEnable(…)

Ian Buck, Greg Humphreys and Pat Hanrahan, Graphics Hardware Workshop 2000
How Does State Tracking Work?

Tracking state is a no-brainer; it's the frequent context differences that complicate things
Need to quickly find the elements that are different
Represent state as a hierarchy of dirty bits
18 top-level categories: buffer, transformation, lighting, texture, stencil, etc.
Actually, use dirty bit-vectors: each bit corresponds to a rendering server
Inside State Tracking

[Diagram: glLightf( GL_LIGHT1, GL_SPOT_CUTOFF, 45 ) sets every server's dirty bit in the Lighting category, in Light 1, and in Light 1's spot cutoff element (180 → 45). All other categories (Transformation, Pixel, Texture, Buffer, Fog, Line, Polygon, Viewport, Scissor) and the other lights stay clean]
Context Comparison

[Diagram: comparing the client's state to Server 2's state, the tracker drills down only where Server 2's dirty bit is set. The ambient color (0,0,0,1) and diffuse color (1,1,1,1) are equal, but the spot cutoff differs (45 vs 180), so a LIGHTF_OP( GL_LIGHT1, GL_SPOT_CUTOFF, 45 ) command is packed]
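A toy version of the hierarchical dirty-bit tracker makes the mechanism concrete. All names and the two-level structure here are assumptions for illustration:

```python
# Hierarchical dirty bit-vectors: one bit per rendering server at both
# the category and the element level. A state change dirties every
# server's bits; packing a delta for server s clears only bit s, so
# the other servers still see the change when their deltas are packed.
class StateTracker:
    def __init__(self, nservers):
        self.all = (1 << nservers) - 1
        self.values = {}       # (category, element) -> current value
        self.cat_dirty = {}    # category -> bit-vector of stale servers
        self.elem_dirty = {}   # (category, element) -> bit-vector

    def set(self, category, element, value):
        self.values[(category, element)] = value
        self.cat_dirty[category] = self.all
        self.elem_dirty[(category, element)] = self.all

    def delta_for(self, server):
        """State commands needed to bring one server up to date."""
        bit = 1 << server
        delta = []
        for cat, bits in self.cat_dirty.items():
            if not bits & bit:
                continue                 # whole category clean: skip it
            for key, ebits in self.elem_dirty.items():
                if key[0] == cat and ebits & bit:
                    delta.append(key + (self.values[key],))
                    self.elem_dirty[key] = ebits & ~bit
            self.cat_dirty[cat] = bits & ~bit
        return delta

# glLightf(GL_LIGHT1, GL_SPOT_CUTOFF, 45) dirties the lighting bits
# for all 8 servers; only server 2's bits clear when its delta packs.
t = StateTracker(8)
t.set('lighting', 'light1.spot_cutoff', 45)
```

The category-level bits let the comparison skip the 17 untouched top-level categories in a single test each, which is the point of the hierarchy.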
Output Scalability Results

Marching Cubes
Point-to-point "broadcast" doesn't scale at all. However, it's still the commercial solution [SGI Cluster]
Output Scalability Results

Quake III: Arena
Larger polygons overlap more tiles

WireGL on the SIGGRAPH show floor
Input Scalability

Parallel geometry extraction
Parallel data submission
Ordering?

[Diagram: several apps feeding one server and display]
Parallel OpenGL API

Express ordering constraints between multiple independent graphics contexts
Don't block the application; just encode them like any other graphics command
Ordering is resolved by the rendering server
Introduce new OpenGL commands:
• glBarrierExec
• glSemaphoreP
• glSemaphoreV

Homan Igehy, Gordon Stoll and Pat Hanrahan, SIGGRAPH 98
Serial OpenGL Example

def Display:
    glClear(…)
    DrawOpaqueGeometry()
    DrawTransparentGeometry()
    SwapBuffers()

Parallel OpenGL Example

def Init:
    glBarrierInit(barrier, 2)
    glSemaphoreInit(sema, 0)

def Display:
    if my_client_id == 1:
        glClear(…)
    glBarrierExec(barrier)
    DrawOpaqueGeometry(my_client_id)
    glBarrierExec(barrier)
    if my_client_id == 1:
        glSemaphoreP(sema)
        DrawTransparentGeometry1()
    else:
        DrawTransparentGeometry2()
        glSemaphoreV(sema)
    glBarrierExec(barrier)
    if my_client_id == 1:
        SwapBuffers()
Some of the synchronization calls above are optional in Chromium.
Inside a Rendering Server

[Animation: the server drains two client queues.
Client 1: Clear, Barrier, OpaqueGeom 1, Barrier, SemaP, TransparentGeom 1, Barrier, Swap
Client 2: Barrier, OpaqueGeom 2, Barrier, TransparentGeom 2, SemaV, Barrier
Execution blocks at each barrier until both streams reach it, and SemaP blocks until the matching SemaV; the server context-switches to the other stream whenever the current one blocks]
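The context-switching behavior can be modeled in a few lines. The queue contents follow the parallel example above; the round-robin scheduling policy is an assumption, since the server is free to pick any runnable stream:

```python
from collections import deque

# Toy serializer: run each stream until its head blocks on a barrier
# or semaphore, then context-switch to another stream.
def serialize(streams, barrier_size):
    waiting, sem, out = set(), 0, []
    while any(streams):
        progressed = False
        for cid, q in enumerate(streams):
            while q:
                op = q[0]
                if op == 'Barrier':
                    waiting.add(cid)
                    if len(waiting) < barrier_size:
                        break                    # block; switch streams
                    for c in waiting:            # everyone arrived: release
                        streams[c].popleft()
                    waiting.clear()
                elif op == 'SemaP':
                    if sem == 0:
                        break                    # block until a SemaV
                    sem -= 1
                    q.popleft()
                else:
                    if op == 'SemaV':
                        sem += 1
                    else:
                        out.append(op)           # "execute" the command
                    q.popleft()
                progressed = True
        if not progressed:
            raise RuntimeError('deadlock between streams')
    return out

client1 = deque(['Clear', 'Barrier', 'OpaqueGeom1', 'Barrier', 'SemaP',
                 'TransparentGeom1', 'Barrier', 'Swap'])
client2 = deque(['Barrier', 'OpaqueGeom2', 'Barrier', 'TransparentGeom2',
                 'SemaV', 'Barrier'])
order = serialize([client1, client2], barrier_size=2)
```

Note that the barriers only separate phases; within a phase the two clients' geometry may execute in either order, while the semaphore forces TransparentGeom2 before TransparentGeom1.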
Input Scalability Results

Multiple clients, one server
Compute-limited application
A Fully Parallel Configuration

[Diagram: many apps, each sorting to many servers, each server driving one display tile]
Fully Parallel Results

1-1 rate: 472 KTri/sec; 16-16 rate: 6.2 MTri/sec
Peak Sort-First Rendering Performance

Immediate mode
• Unstructured triangle strips
• 200 triangles/strip
Data changes every frame
Peak observed: 161,000,000 tris/second
Total scene size = 16,000,000 triangles
Total display size at 32 nodes: 2048x1024 (256x256 tiles)
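A quick sanity check of what these numbers imply for frame rate:

```python
scene_tris = 16_000_000        # total scene size, triangles
peak_rate = 161_000_000        # peak observed triangles/second
fps = peak_rate / scene_tris   # roughly 10 frames per second
```

So "interactive rates" here means the entire 16-million-triangle scene is resubmitted about ten times per second.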
WireGL Image Reassembly

[Diagram: apps sort geometry over the network to servers; the servers send imagery over the network to a Composite node, which drives the display]
WireGL Image Reassembly

[Diagram: the same pipeline, with the parallel OpenGL interface used on both the app-to-server and server-to-composite hops]
WireGL Image Reassembly

Composite == Server!

[Diagram: the composite stage is itself just another server speaking the parallel OpenGL interface]
Image Reassembly in Hardware

Lightning-2 = digital video switch from Intel Research
Route partial scanlines from n inputs to m outputs
Tile reassembly or depth compositing at full refresh rate
Lightning-2 and WireGL demonstrated at SIGGRAPH 2001

Gordon Stoll et al., SIGGRAPH 2001
Example: 16-way Tiling of One Monitor

[Diagram: the 4x4 display grid (tiles 1-16) is rendered split across two framebuffers: framebuffer 1 holds tiles 1, 3, 6, 8, 9, 11, 14, 16 and framebuffer 2 holds tiles 2, 4, 5, 7, 10, 12, 13, 15. Strip headers on each partial scanline tell Lightning-2 where to route it in the reconstructed image]
WireGL Shortcomings

Sort-first
• Can be difficult to load-balance
• Screen-space parallelism limited
• Heavily dependent on spatial locality
Resource utilization
• Geometry must move over the network every frame
• Servers' graphics hardware remains underutilized

We need something more flexible
Stream Processing

[Diagram: stream source → transform 1 → transform 2 → … → stream output]

Streams:
• Ordered sequences of records
• Potentially infinite
Transformations:
• Process only the head element
• Finite local storage
• Can be partitioned across processors to expose parallelism
Why Stream Processing?

Elegant mechanism for dealing with huge data
• Explicitly expose and exploit parallelism
• Hide latency

State of the art in many fields:
• Databases [Terry92, Babu01]
• Telephony [Cortes00]
• Online algorithms [Borodin98, O'Callaghan02]
• Sensor fusion [Madden01]
• Media processing [Halfhill00, Khailany01]
• Computer architecture [Rixner98]
• Graphics hardware [Owens00, NVIDIA, ATI]
Cluster Graphics as Stream Processing

Treat OpenGL calls as a stream of commands
Form a DAG of stream transformation nodes
• Nodes are computers in a cluster
• Edges are OpenGL API communication
Each node has a serialization stage and a transformation stage
Stream Serialization

Convert multiple streams into a single stream
Efficiently context-switch between streams
Constrain ordering using the Parallel OpenGL API

Two kinds of serializers:
• Network server
• Application: either an unmodified serial application or a custom parallel application
Stream Transformation

The serialized stream is dispatched to "Stream Processing Units" (SPUs)
Each SPU is a shared library
• Exports a (partial) OpenGL interface
Each node loads a chain of SPUs at run time
SPUs are generic and interchangeable
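The chain dispatch can be sketched in Python. The SPU names and the color-rewriting transformation are invented for illustration; real SPUs are C shared libraries resolved through function tables:

```python
# Each SPU exports a partial OpenGL interface; any call it doesn't
# implement falls through to the next SPU in the chain.
class SPU:
    def __init__(self, downstream=None):
        self.downstream = downstream

    def __getattr__(self, name):
        if self.downstream is None:
            raise AttributeError(name)
        return getattr(self.downstream, name)   # defer down the chain

class RenderSPU(SPU):
    """End of the chain: 'renders' by recording the calls it receives."""
    def __init__(self):
        super().__init__()
        self.log = []
    def Color3f(self, r, g, b): self.log.append(('Color3f', r, g, b))
    def Clear(self, mask):      self.log.append(('Clear', mask))

class GrayscaleSPU(SPU):
    """Invented example SPU: rewrites colors, passes everything else on."""
    def Color3f(self, r, g, b):
        y = round(0.299 * r + 0.587 * g + 0.114 * b, 3)
        self.downstream.Color3f(y, y, y)

render = RenderSPU()
chain = GrayscaleSPU(render)
chain.Color3f(1.0, 0.0, 0.0)   # intercepted and transformed
chain.Clear(0x4000)            # not implemented here: falls through
```

Because every SPU speaks the same interface on both sides, chains compose at run time without the application knowing what sits downstream.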
Example: WireGL Revealed

[Diagram: each app node runs a Tilesort SPU; each server runs a Readback SPU chained to a Send SPU; a final server runs a Render SPU]
SPU Inheritance

The Readback and Render SPUs are related
• Readback renders everything except SwapBuffers
Readback inherits from the Render SPU
• Overrides the parent's implementation of SwapBuffers
• All OpenGL calls are considered "virtual"
Example: Readback's SwapBuffers

void RB_SwapBuffers(void)
{
    self.ReadPixels( 0, 0, w, h, ... );
    if (self.id == 0)
        child.Clear( GL_COLOR_BUFFER_BIT );
    child.BarrierExec( READBACK_BARRIER );
    child.RasterPos2i( tileX, tileY );
    child.DrawPixels( w, h, ... );
    child.BarrierExec( READBACK_BARRIER );
    if (self.id == 0)
        child.SwapBuffers( );
}

Easily extended to include depth compositing
All other functions are inherited from the Render SPU
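The override-one-virtual-call pattern maps directly onto class inheritance. A minimal Python analogue, with the framebuffer and child interface simplified into plain lists (all names here are stand-ins):

```python
class RenderSPU:
    """Parent SPU: draws into its local framebuffer and swaps locally."""
    def __init__(self):
        self.framebuffer = []
    def DrawTriangles(self, n):
        self.framebuffer.append(n)
    def SwapBuffers(self):
        return 'local swap'

class ReadbackSPU(RenderSPU):
    """Inherits every call except SwapBuffers, which it overrides to
    read the rendered tile back and ship it downstream instead."""
    def __init__(self, child, spu_id):
        super().__init__()
        self.child = child               # downstream interface (a list here)
        self.id = spu_id
    def SwapBuffers(self):
        pixels = list(self.framebuffer)  # stands in for glReadPixels
        self.child.append(('DrawPixels', self.id, pixels))
        if self.id == 0:                 # one node triggers the real swap
            self.child.append(('SwapBuffers',))

sink = []
spu = ReadbackSPU(child=sink, spu_id=0)
spu.DrawTriangles(3)   # inherited from RenderSPU
spu.SwapBuffers()      # overridden
```

Since the whole OpenGL interface is virtual, a derived SPU only pays implementation effort for the calls it actually changes.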
Virtual Graphics

Separate interface from implementation
The underlying architecture can change without the application's knowledge

Douglas Voorhies, David Kirk and Olin Lathrop, SIGGRAPH 88
Example: Sort-Last

The application runs directly on graphics hardware
The same application can use sort-last or sort-first

[Diagram: each application node runs Readback → Send; a single server runs Render]
Example: Sort-Last Binary Swap

[Diagram: four application nodes each run Readback → BSwap → Send; the BSwap SPUs exchange and composite image halves pairwise before sending their final portions to the Render server]
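The binary-swap exchange pattern can be sketched as follows. Elementwise max stands in for the real depth composite or blend, and images are 1-D arrays for brevity:

```python
# Toy binary swap over n nodes (n a power of two): each round, node i
# swaps half of its working span with partner i XOR step and
# composites the half it keeps. After log2(n) rounds, node i owns a
# fully composited 1/n-th of the image.
def binary_swap(images):
    n, w = len(images), len(images[0])
    lo, hi = [0] * n, [w] * n           # each node's working span
    step = 1
    while step < n:
        spans = []
        for i in range(n):
            mid = (lo[i] + hi[i]) // 2
            # The lower-ranked partner keeps the low half, the
            # higher-ranked partner keeps the high half.
            spans.append((lo[i], mid) if i < (i ^ step) else (mid, hi[i]))
        for i in range(n):
            p = i ^ step
            a, b = spans[i]
            for x in range(a, b):       # composite the received half
                images[i][x] = max(images[i][x], images[p][x])
        lo = [a for a, _ in spans]
        hi = [b for _, b in spans]
        step *= 2
    return {i: (lo[i], images[i][lo[i]:hi[i]]) for i in range(n)}

# Four nodes, four pixels; each final owner holds the elementwise max.
parts = [[1, 0, 0, 2], [0, 3, 0, 0], [5, 0, 1, 0], [0, 0, 7, 1]]
owned = binary_swap(parts)
```

Every node sends w/2 + w/4 + … pixels in total, just under one full image, regardless of n, which is why binary swap scales where point-to-point compositing does not.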
Binary Swap Volume Rendering Results

One node, two nodes, four nodes, eight nodes, sixteen nodes
Example: User Interface Reintegration

[Diagram: one app runs Tilesort, speaking the Chromium protocol over UDP/Gigabit Ethernet to several servers, each running an Integrate SPU; digital video cables feed the IBM Scalable Graphics Engine, which drives an IBM T221 display (3840x2400)]

Serial applications can drive the T221 with their original user interface
Parallel applications can have a user interface
CATIA Driving IBM's T221 Display

Jet engine nacelle model courtesy Goodrich Aerostructures
X-Windows Integration SPU by Peter Kirchner and Jim Klosowski, IBM T.J. Watson
Chromium is the only practical way to drive the T221 with an existing application
Demonstrated at Supercomputing 2001
A Hidden-line Style SPU

Demo
[Diagram: three configurations for Quake III:
(1) Quake III → Render;
(2) Quake III → StateQuery → VertexArray → HiddenLine → Render;
(3) Quake III → StateQuery → Send → Server → HiddenLine → Render]
Is "HiddenLine" Really a SPU?

Technically, no!
It requires potentially unbounded resources
Alternate design:

[Diagram: Application → StateQuery → VertexArray → HiddenLine splits the stream, sending lines to one server and polygons to another; each runs Readback → Send, and a final server depth-composites the results with a Render SPU]
Current Shortcomings

General display lists on tiled displays
• Display lists that affect the graphics state
Distributed texture management
• Each node must provide its own texture
• Potential N² texture explosion
• Virtualize distributed texture access?
Ease of creating parallel applications
• Input event management
Future Directions

End-to-end visualization system for 4D data
• Data management and load balancing
• Volume compression
Remote/ubiquitous visualization
• Scalable graphics as a shared resource
• Transparent remote interaction with (parallel) apps
• Shift away from desktop-attached graphics
Taxonomy of non-invasive techniques
• Classify SPUs and algorithms
• Identify tradeoffs in design
Observations and Predictions

Manipulation of graphics streams is a powerful abstraction for cluster graphics
• Achieves both input and output scalability
Providing mechanisms instead of algorithms allows greater flexibility
• Data management algorithms can be built into a parallel application or embedded in a SPU
Flexible remote graphics will lead to a revolution in ubiquitous computing