GPUs and CPUs: The Uneasy Alliance

Panel Discussion

Panelists

• Neil Trevett, 3Dlabs
• Michael Doggett, ATI
• Adam Lake, Intel
• David Kirk, NVIDIA
• Bill Mark, University of Texas at Austin

Moderator
• Peter N. Glaskowsky, MemoryLogix

Neil Trevett
3Dlabs

Neil Trevett is Senior Vice President for Market Development at 3Dlabs, Inc. Trevett also serves as President of the Web3D Consortium and secretary of the Khronos Group developing the OpenML and OpenGL ES standards for dynamic media processing and graphics APIs for embedded appliances and applications.

© Copyright 3Dlabs 2004

GP2 Musings
Neil Trevett, Senior VP Market Development, 3Dlabs
President, Khronos Group
Los Angeles 2004


CPUs and GPUs – Dynamic Tension

• CPUs and GPUs exist because of their different design goals
– CPUs: maximize performance and minimize cost of executing SCALAR code
– GPUs: exploit parallelism to beat CPUs at executing VECTOR code

• BUT - GPUs are rapidly integrating many CPU techniques
– Learned and refined by the CPU community over decades

Demand-Paged Virtual Memory – 256GB

Virtual Shader Program Memory - 256K instructions

Efficient multi-tasking and isochronous channel

512MB memory -> 1GB memory

High-level Language Programmability

Advanced GPUs designed exclusively for PROFESSIONAL PRODUCTIVITY

If you would like to try a Wildcat Realizm board, email [email protected]

A message from your sponsor


CPUs and GPUs – Dynamic Tension

Fundamentally different designs finding increasingly common ground

• Increasing commonality creates possibilities for tighter integration
– E.g. merge virtual address spaces with cache coherency
– Would enable new CPU/GPU cooperative paradigms
– Possibility of increased coprocessor linkage
– Break the AGP/PCIe bottleneck

[Diagram: CPU and GPU, showing fundamental differences in design approach alongside increasing areas of commonality; a CPU subsystem and a GPU subsystem share a cache-coherent unified virtual memory space]


GPUs – More Than Graphics Processors?

• The volume of graphics shipments has created the GPU phenomenon
– Ingenious work ongoing to find alternative uses for these graphics machines

• Can GPUs be modified to address non-graphics needs?
– E.g. double precision, less SIMD more MIMD, more general data storage

• Primarily an economic question
– Not just technology

• Does reaching for new markets decrease your graphics market share?
– Increased costs bring no benefit for core market

[Figure: Market Design Spectrum, running from Graphics through Imaging to HPC, plotted against $]

Design shift will only occur if the “Integral of Achieved Profit” is increased
– Probably a small stretch for increased volume
– Shifting this far – decreases effectiveness in graphics market?


Programming GPUs – Industry Challenge

• GPU microarchitectures will not be exposed externally any time soon
– Too much intellectual property would be exposed
– Would create too much architectural inertia at a time of rapid innovation

• Agree that Domain Specific Libraries are an effective, pragmatic approach
– Good to start solving specific real problems now

• But we should aim higher than just a library approach?
– Feels like we need to expose the full flexibility of programmability

• Creating effective industry programming infrastructure is a challenge

[Diagram: several domain languages, each targeting several evolving GPU architectures across a firewall to GPU ISAs; labeled “Combinatorial Problem”]


Combine desirable features from the different approaches

Industry Standard Virtual Machine?

• Could a Virtual Machine standard avoid combinatorial explosion?
– Uncouples multiple languages from multiple GPUs
– Target for domain language architects AND enables innovation by GPU vendors

• Create an open and cross-platform industry standard virtual machine?
– The right virtual machine could help persuade GPUs to evolve into stream processors

• What should that virtual machine be?
– Can we work together to figure out this key question?
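The combinatorial argument can be made concrete with a simple count. The sketch below is illustrative only: the language and GPU names are placeholders chosen for the example, not a list proposed on the slides. Without a shared virtual machine, every language needs its own code generator for every GPU; with one, each language targets the virtual machine once and each vendor implements it once.

```python
# Hypothetical front ends and back ends, named only for illustration.
languages = ["GLSL", "Brook", "sh", "domain_language_x"]
gpus = ["vendor_a_gpu", "vendor_b_gpu", "vendor_c_gpu"]


def backends_without_vm(n_languages: int, n_gpus: int) -> int:
    """Every language compiles directly to every GPU: N * M code generators."""
    return n_languages * n_gpus


def backends_with_vm(n_languages: int, n_gpus: int) -> int:
    """Each language targets the VM once and each GPU implements it once: N + M."""
    return n_languages + n_gpus


direct = backends_without_vm(len(languages), len(gpus))
via_vm = backends_with_vm(len(languages), len(gpus))
print(f"direct: {direct} code generators, via VM: {via_vm} translators")
```

The gap widens as either list grows, which is the "uncoupling" the slide refers to.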

ARB Vertex and Fragment extensions?

OpenGL Shading Language?

Brook or sh?

Domain Languages

GPUs

Too graphics-oriented, too low-level to track the capabilities of evolving GPU architecture?

Too graphics-oriented? Effectively a graphics Domain Specific Library – with the flexibility of programmability?

Can be extended for more generality? What direction should the OpenGL ARB take?

The level of abstraction we need to break out of the graphics mind-set? TOO big a leap from graphics base?

Too high-level to be a useful virtual machine?

Virtual Machines


OpenGL ES 2.0

OpenGL 2.0



Battery Powered GPUs!

• The Khronos Group is now defining OpenGL ES 2.0
– The OpenGL Shading Language comes to cell phones!

• Driven hard by the cell-phone industry for compelling hand-held gaming
– Aggressive development to match the availability of GPUs in handsets

• OpenGL ES 2.0 will not just be in phones – e.g. games consoles
– Sony Playstation is a Khronos Member



Enabled software AND hardware 3D engines – including small-footprint, low-end fixed point platforms

GLSL-based shader programmability for embedded devices. Tackling issues such as remote compilation

Mid-03 Mid-04 Mid-05

Increased emphasis on hardware acceleration and enhanced 3D pipeline


Embedded Industry - GP2 Genetic Diversity

• Cell phones – 100Ms of units a year that will have GPUs
– 3D gaming now PLUS phones mutating to general-purpose personal compute devices

• Size, power and cost - low-power design now getting a lot of attention
– Interesting for building handhelds AND large arrays for HPC etc.

• Embedded industry has fast innovation, flexible infrastructure
– Tight CPU/GPU integration might happen here first – systems on a chip

• Programmable acceleration avoids multiple media acceleration blocks
– A programmable GPU can accelerate 3D, images, video, audio, speech and ….
– OpenMAX – a new Khronos standard – domain specific primitive libraries

• Uneasy alliance with DSPs too!!
– Will GPUs even assume some baseband processing?

[Diagram: an ARM CPU core and a low-power GPU core on a single chip, sharing a cache-coherent unified virtual memory space]

Domain-specific primitive libraries – can be accelerated on GPUs

Michael Doggett
ATI

Michael Doggett is an architect at ATI. He is working on upcoming graphics hardware for Microsoft and desktop PC graphics chips. Before joining ATI, Doggett was a post doc at the University of Tuebingen in Germany and completed his Ph.D. at the University of New South Wales in Sydney, Australia.

GPUs and CPUs: The Uneasy Alliance
Mike Doggett
ATI

GPUs

• Not stream processors
• Graphics black box
• Deep pipeline
– Arithmetic intensity


GPUs

• How to get new features into GPUs?
– Get game developers to use them
• Architectural Specs
– API definition
– GPUBench
• Double precision
– Performance tradeoff
– Simulated double
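"Simulated double" commonly refers to carrying extra precision as an unevaluated sum of two lower-precision floats. The sketch below is one well-known building block for that idea (Knuth's error-free two-sum), not a description of any ATI implementation. Python floats are 64-bit, so float32 arithmetic is emulated here by rounding through `struct`; the helper names `f32` and `two_sum` are chosen for this example.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a: float, b: float):
    """Knuth's error-free addition, carried out in float32 steps.

    Returns (hi, lo): hi is the rounded float32 sum and lo is the rounding
    error, so hi + lo equals a + b exactly (barring overflow).
    """
    hi = f32(a + b)
    b_virtual = f32(hi - a)
    a_virtual = f32(hi - b_virtual)
    lo = f32(f32(a - a_virtual) + f32(b - b_virtual))
    return hi, lo


a = f32(1.0)
b = f32(2.0 ** -30)       # too small to survive a plain float32 add to 1.0
plain = f32(a + b)        # the small term is rounded away
hi, lo = two_sum(a, b)    # the small term is kept in the low word
print(plain, hi, lo)
```

The "performance tradeoff" on the slide shows up directly: one emulated addition costs several native operations.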


GPU future

• Competitive market
• More of the same

Adam Lake
Intel

Adam Lake is a Sr. Software Engineer at Intel specializing in 3D graphics. Previous areas of work include stream processing, compilers for high level shading languages, and non-photorealistic rendering. He holds an M.S. degree from the University of North Carolina at Chapel Hill.

A few alternatives…

Intel IXP Network Processor Family

IXP Perf. Characteristics

IXP2800 [Intel02]
• 51 GB/s peak to RDRAM
– 3 RDRAM channels input and output, total aggregate @533 MHz
• 32 GB/s peak to SDRAM
– 4 QDR II SDRAM ports (2 read/2 write) @250 MHz
• Example Application: 10GB/s Ethernet
• 1.4 GHz clock rate

IXP2400 4,800 MIPS
IXP1200 1,200 MIPS

Notes:
• NO FPU!!
• Packet arrival rate determines # instructions executed per packet
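The last note can be turned into a back-of-the-envelope budget: instructions per second divided by packets per second. The sketch below rests on assumptions that are not on the slide: it takes the Ethernet example to mean a 10 Gb/s (10 Gigabit) link, saturates it with minimum-size 64-byte frames plus the standard 20 bytes of preamble and inter-frame gap, and pairs that with the 4,800 MIPS figure quoted for the IXP2400 purely for illustration.

```python
def packets_per_second(line_rate_bps: float, frame_bytes: int,
                       overhead_bytes: int = 20) -> float:
    """Packet arrival rate on a saturated link.

    overhead_bytes covers the Ethernet preamble (8 bytes) and the
    inter-frame gap (12 bytes) that accompany every frame on the wire.
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return line_rate_bps / bits_on_wire


def instructions_per_packet(mips: float, pps: float) -> float:
    """Average instruction budget available for each arriving packet."""
    return (mips * 1e6) / pps


# Illustrative pairing only: 10 Gb/s of minimum-size frames against 4,800 MIPS.
pps = packets_per_second(10e9, 64)
budget = instructions_per_packet(4800, pps)
print(f"{pps / 1e6:.2f} Mpps -> about {budget:.0f} instructions per packet")
```

Larger packets arrive less often, so the same processor has proportionally more instructions to spend on each one.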

Key takeaways for IXP

• Designed for network processing workloads
• Switch on event model for hardware resources
• No FPU, nor plans for FPU
• Improving software stack
– Shangri-la project

MXP5800

Specs of MXP5800

• Internal B/W
– 532 Mbytes/S/Connection
• Theoretical External B/W
– 1 GByte/S
• 130 nm
• 256 MHz
• 35 mm x 35 mm die

Key takeaways from MXP

• Not a general purpose microprocessor
• Shipping today with software tools
• One common ISA for all execution units

So what’s the point?

• Some alternatives for general purpose computing on special purpose hardware
• Larger context of stream processing architectures

Programming Models

• Getting the programming model right is hard
– Graphics architects got it right for graphics
• Made harder if you try to be completely general
– Reason: Increase generality, you lose performance
– You can quickly lose any benefit of your stream programming model
– Fully general streaming, in the limit, is multithreading

Call to Action

• For some applications in computational science and other domains, performance is the dominant factor, not cost
• However, in other domains, cost is dominant:
– Purchase Price per MIP
– Not just raw performance
• Call to action
– Consider chipset implementations:
– Analysis of GPGPU taking raw $ cost into account
– There are 3 options, not 2:
CPU vs. CPU and chipset vs. GPU
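A cost-aware comparison of the three options could be organized as below. This is only a sketch of the bookkeeping: every price and throughput figure is a made-up placeholder, so the resulting ranking carries no meaning until measured numbers for a real workload are substituted.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    price_usd: float  # purchase price (placeholder value)
    mips: float       # sustained throughput on the workload (placeholder value)

    @property
    def usd_per_mips(self) -> float:
        return self.price_usd / self.mips


# Placeholder figures only; substitute measured numbers for a real analysis.
options = [
    Option("CPU", price_usd=200.0, mips=4000.0),
    Option("CPU and chipset", price_usd=220.0, mips=8000.0),
    Option("GPU", price_usd=500.0, mips=25000.0),
]

# Rank by purchase price per MIPS rather than by raw performance.
ranked = sorted(options, key=lambda o: o.usd_per_mips)
for o in ranked:
    print(f"{o.name:16s} {o.usd_per_mips:.4f} $/MIPS")
```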

The BIG Problems

• How do we program it?
– Programming Model
• How do we feed it?
– Memory hierarchy and bandwidth
• How do we keep it cool?
– Power and Thermal requirements provide significant challenges for ALL architectures

David Kirk
NVIDIA

David Kirk has been NVIDIA's Chief Scientist since January 1997. Prior to joining NVIDIA, Kirk held positions at Crystal Dynamics and the Apollo Systems Division of Hewlett-Packard Company. Kirk holds M.S. and Ph.D. degrees in Computer Science from the California Institute of Technology.

(Year 2000) The GeForce256 Graphics Pipeline

• vertex: vertex transform & lighting
• setup / rasterizer: polygon setup & rasterization
• pixel: per-pixel interpolation
• texture: per-pixel texture filter & x8 blending
• image: Z-buffer, x8 blending & anti-alias
• memory

(Year 2004) The GeForce6 Graphics Pipeline

• vertex: programmable vertex processing (fp32)
• setup / rasterizer: polygon setup, culling, rasterization
• pixel: programmable per-pixel math (fp32)
• texture: per-pixel texture, fp16 blending
• image: Z-buf, fp16 blending, anti-alias (MRT)
• memory

(Year 2004) The GeForce6 NON-Graphics Pipeline

• data: programmable MIMD processing (fp32)
• setup / rasterizer: lists, SIMD “rasterization”
• data: programmable SIMD processing (fp32)
• data: data fetch, fp16 blending
• data: predicated write, fp16 blend, multiple output
• memory

[Diagram: “GP” processors between shared peak input bandwidth and shared peak output bandwidth, each with dedicated peak processing power, connected to memory]

Bill Mark
University of Texas at Austin

Bill Mark is an assistant professor in the Department of Computer Sciences at the University of Texas at Austin. Mark was the lead architect of NVIDIA's Cg language and development system. He holds a Ph.D. from the University of North Carolina at Chapel Hill.

GP2 Panel Presentation

William Mark, University of Texas at Austin

We’re entering an era of disruptive change

• Driven by VLSI technology
– Too many transistors: CPU performance plateau
– Heat/Power is now a first-class constraint
– Possible to fit many processors on a single chip

• Two kinds of change coming:
– Technical – single-chip parallel computation
– Industry structure – pressure for vertical re-integration

What do we mean by “CPU vs. GPU”?

• General HW vs. specialized HW
– GPUs moving towards generality, but not fully there yet

• Sequential vs. Parallel
– Latency optimized vs. Throughput optimized

• Two separate chips

• Different sets of companies (exception: Intel)

• Raw HW access vs. Managed code

Need at least two parallel programming models

• Stream model
– Naturally exposes parallelism and communication
– Easy to use, when problem maps well

• Communicating sequential processes (e.g. pthreads)
– Explicitly exposes spatial dimension of HW parallelism
– Efficiently supports data-dependent communication patterns
– Useful for creating/modifying large irregular data structures
– Harder to use – e.g. race conditions
– Hard to get performance portability
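The contrast between the two models can be sketched in a few lines. This is an illustration under assumptions, not code from the talk: Python's `threading` module stands in for pthreads, the SAXPY kernel and the word histogram are examples chosen here, and the function names are hypothetical.

```python
import threading


def saxpy_kernel(a: float, x: float, y: float) -> float:
    """Stream kernel: sees one element per input stream, no shared state."""
    return a * x + y


def run_stream(a, xs, ys):
    # Each output depends only on the matching inputs, so a runtime is
    # free to evaluate the elements in any order or in parallel.
    return [saxpy_kernel(a, x, y) for x, y in zip(xs, ys)]


def run_threads(words, n_threads=4):
    # Threads build a shared, irregular structure (a histogram keyed by
    # word); which entry is touched depends on the data itself.
    counts = {}
    lock = threading.Lock()

    def worker(chunk):
        for w in chunk:
            with lock:  # omitting the lock would permit a race condition
                counts[w] = counts.get(w, 0) + 1

    chunks = [words[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts


stream_out = run_stream(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
thread_out = run_threads(["gpu", "cpu", "gpu", "cell", "gpu", "cpu"])
print(stream_out, thread_out)
```

The stream version has nothing to synchronize, which is why it is easy to use when the problem maps well. The threaded version handles the data-dependent update but needs explicit locking to stay correct.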

HW must satisfy mass-market needs

• Games will continue to dominate
– Rendering
– Simulation? – an opportunity

• Maximize impact of research by meeting game needs
– Chicken/Egg problem: Co-evolve algorithms and architectures
– Different visibility algorithms – ray casting?
– Global illumination – shadows, ambient occlusion, reflection, …
– Parallelize model management, simulation, game behavior, …

• Solving these problems will help other applications

2-year predictions

• CPUs: multi-core trend accelerates
– Multicore used by games and HPC

• GPUs: More powerful streaming model
– Scatter, gather, conditional streams, reductions, etc.
– Start to see more success stories for GPGPU
– But limits of stream model become apparent

• “Dark Horses” attract increasing attention
– CELL and others
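For readers unfamiliar with the streaming primitives named in the GPU prediction, the following sequential sketch shows what gather, scatter, and a reduction compute. It describes only their semantics under the usual definitions; the function names are chosen for this example and say nothing about how a GPU would implement them in parallel.

```python
from functools import reduce


def gather(src, indices):
    """out[i] = src[indices[i]]: each output reads from a computed address."""
    return [src[i] for i in indices]


def scatter(values, indices, size, fill=0):
    """out[indices[i]] = values[i]: each input writes to a computed address."""
    out = [fill] * size
    for v, i in zip(values, indices):
        out[i] = v  # in this sequential sketch, later writes win on collision
    return out


def reduce_sum(stream):
    """Collapse a stream to a single value with an associative operator."""
    return reduce(lambda acc, v: acc + v, stream, 0)


data = [10, 20, 30, 40]
g = gather(data, [3, 0, 2])
s = scatter([7, 8, 9], [2, 0, 3], size=4)
r = reduce_sum(data)
print(g, s, r)
```

Scatter is the hardest of the three for a parallel machine, because colliding write addresses need a defined resolution order.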

6-year predictions

• One processing chip for PCs
– Who makes it?

• Heterogeneous architecture for this chip:
– Classical CPU
– Parallel fine-grained shared memory (pthreads)
– Parallel stream processor (Brook)

• Supports ray-casting visibility

• This architecture emerges in console space first

• This architecture meets many HPC needs

Peter N. Glaskowsky
MemoryLogix

Peter Glaskowsky is Chief System Architect at MemoryLogix, a Silicon Valley microprocessor design startup. Formerly, Glaskowsky was editor in chief of Microprocessor Report and a principal analyst with In-Stat/MDR, a chief engineer at Integrated Device Technology, and a lead engineer at SuperMac and Telebit.

Some Panel Topics

• Which problems are the natural province of the CPU?
• …of the GPU?
• Which CPU design elements will be borrowed by GPUs, and vice-versa?
• Which problems support cooperation between the CPU and GPU?
– How do we stimulate this cooperation?
– Or will it be more like competition?

