Page 1: Mass Market Applications of Massively Parallel Computing Chas. Boyd

Mass Market Applications of Massively Parallel Computing

Chas. Boyd


• Projections of future hardware

• The client computing space

• Mass-market parallel applications

• Common application characteristics

• Interesting processor features

The Physics of Silicon

• The way processors get faster has fundamentally changed

• No more free performance gains due to clock rate and Instruction-Level Parallelism

• Yet gates-per-die continues to grow

• Possibly faster now that clock rate isn’t an issue

• Estimate: doubling every 2-2.5 years

• New area means more cores and caches

• In-order core counts may grow faster than Out-of-Order core counts do
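
As a rough illustration of the doubling estimate above, growth after t years with doubling period T is 2^(t/T). The sketch below evaluates that for both ends of the 2-2.5 year range; the 10-year horizon and the function name are assumptions chosen only for illustration.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` if a quantity doubles every period."""
    return 2.0 ** (years / doubling_period_years)

# Hypothetical 10-year horizon, using both ends of the 2-2.5 year estimate.
for period in (2.0, 2.5):
    factor = growth_factor(10, period)
    print(f"doubling every {period} years: x{factor:.0f} gates per die in 10 years")
```

Under these assumptions the estimate spans roughly a 16x to 32x increase in gates per die over a decade.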

The Old Story

A Surplus of Cores

• ‘More cores than we know what to do with’

• Literally

• Servers scale with transaction counts

• Technical Computing

• Has a history of dealing with parallel workloads

• What are the opportunities for the PC client?

• Are there mass market applications that are parallelizable?

Requirements of Mass Market Space

• Fairly easy to program and maintain

• Cannot break on future hardware or operating systems

• Transparent backward and forward compatibility

• Mass market customers hate regressions!

• Consumer software must operate for decades

• Must get faster automatically

• Why we are here

AMD Term:

• Personal Stream Computing

• Actually nothing like ‘stream processing’ as used by Stanford Brook, etc.

Data-Parallel Processing

• A key technique; how do we apply it to consumer applications?

• What takes lots of data?

• Media, pixels, audio samples

• Video, imaging, audio

• Games

Video

• Decode, encode, transcode

• Motion Estimation, DCT, Quantization

• Effects

• Anything you would want to do to an image

• Scaling, sepia, DVE effects (transitions)

• Indexing

• Search/Recognition: convolutions
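
A per-pixel effect such as sepia is about the simplest data-parallel case: each output pixel depends only on its own input pixel. The sketch below is one possible form. The weights are the commonly quoted sepia coefficients, not something taken from the slides, and the function names are hypothetical.

```python
def sepia(pixel):
    """Map one (r, g, b) pixel with 0-255 channels to a sepia-toned pixel."""
    r, g, b = pixel
    # Commonly quoted sepia weights; an assumption, not from the slides.
    tr = 0.393 * r + 0.769 * g + 0.189 * b
    tg = 0.349 * r + 0.686 * g + 0.168 * b
    tb = 0.272 * r + 0.534 * g + 0.131 * b
    # Clamp to 8 bits: the red and green weight rows sum to more than 1.
    return tuple(min(255, int(round(c))) for c in (tr, tg, tb))


def sepia_image(pixels):
    """Pure map over pixels: no pixel reads or writes any other pixel."""
    return [sepia(p) for p in pixels]


print(sepia_image([(100, 50, 25), (255, 255, 255)]))
```

Because no iteration touches another pixel's data, the loop in `sepia_image` could be split across any number of cores without changing the result.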

Imaging

• Demosaicing

• Extract colors with knowledge of sensor layout

• Segmentation

• Identify areas of image to process

• Cleanup

• Color correction, noise removal, etc.

• Indexing

• Identify areas for tagging
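
To make the demosaicing step concrete, here is a deliberately minimal sketch that collapses each 2x2 cell of a Bayer mosaic into one RGB pixel. It assumes an RGGB sensor layout and accepts half-resolution output; production demosaicing interpolates to full resolution, so treat this only as an illustration of "extracting colors with knowledge of sensor layout".

```python
def demosaic_rggb_2x2(raw):
    """Collapse each 2x2 RGGB cell of a Bayer mosaic into one RGB pixel.

    `raw` is a list of rows of sensor values with even width and height.
    The output has half the resolution in each dimension.
    """
    out = []
    for y in range(0, len(raw), 2):
        row = []
        for x in range(0, len(raw[y]), 2):
            r = raw[y][x]            # top-left sample: red (assumed layout)
            g1 = raw[y][x + 1]       # top-right sample: green
            g2 = raw[y + 1][x]       # bottom-left sample: green
            b = raw[y + 1][x + 1]    # bottom-right sample: blue
            row.append((r, (g1 + g2) / 2, b))
        out.append(row)
    return out


print(demosaic_rggb_2x2([[10, 20, 30, 40],
                         [60, 80, 70, 90]]))
```

Each output pixel reads only its own 2x2 cell, so the cells can be processed independently.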

User Interaction with Media

• Client applications can/should be interactive

• Mass market wants full automation

• ‘Pro-sumer’ wants some options to participate, but with real-time feedback (20+ fps) on 16 GPixel images

• Automating media processing requires analysis

• Recognition, segmentation, image understanding

• Is this image outdoors or inside?

• Is this image right-side up?

• Does it contain faces?

• Are their eyes red?

Imaging Markets

• In some sense, the broader the market, the more sophisticated the algorithm required

• Although pro-sumers care more about performance, and they are the ones who write the reviews

FFT Performance

Game Applications of Mass Parallel

• Rendering

• Imaging

• Physics

• Inverse kinematics (IK)

• AI

Ultima Underworld 1993

Dark Messiah 2007


Game Rendering

• Well established at this point, but new techniques keep being discovered

• Rendering different terms at different spatial scales

• E.g. irradiance can be computed more sparsely than exit radiance, enabling large increases in the number of incident light sources considered

• Spherical harmonic coefficient manipulations

Game Imaging

• Post processing

• Reduction (histogram or single average value)

• Exposure estimation based on log average luminance

• Exposure correction

• Oversaturation extraction

• Large blurs (proportional to screen size)

• Depth of field

• Motion blur
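
The exposure-estimation bullet above describes a reduction: many luminance values collapse to one number. A minimal sketch follows, using the usual log-average form exp(mean(log(delta + L))). The small `delta` (which guards against log(0)) and the mid-grey `key` of 0.18 are conventional choices assumed here, not values given in the slides.

```python
import math


def log_average_luminance(luminances, delta=1e-4):
    """Log-average (geometric-mean style) luminance: exp(mean(log(delta + L)))."""
    total = sum(math.log(delta + lum) for lum in luminances)
    return math.exp(total / len(luminances))


def exposure_scale(luminances, key=0.18):
    """Scale factor mapping the log-average luminance to a chosen mid-grey key."""
    return key / log_average_luminance(luminances)


frame = [0.05, 0.2, 0.8, 3.2]   # hypothetical per-pixel luminances
print(log_average_luminance(frame), exposure_scale(frame))
```

The sum is the part that maps to a parallel reduction; the final `exp` and division are a single scalar step.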

Half-Life 2


Game Physics

• Particles: non-interacting

• Particles: interacting

• Rigid bodies

• Deformable bodies

• Etc.
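
The non-interacting particle case is the easiest of these to parallelize, because each particle reads and writes only its own state. The sketch below advances 2-D particles by one step using semi-implicit Euler integration; the integrator, the gravity constant, and the function name are assumptions made for illustration.

```python
def step_particles(positions, velocities, dt, gravity=-9.8):
    """Advance non-interacting 2-D particles by one time step.

    Uses semi-implicit Euler: velocity is updated first, then position.
    Every loop iteration touches a single particle, so iterations are
    independent and could each run on a separate thread.
    """
    new_positions, new_velocities = [], []
    for (x, y), (vx, vy) in zip(positions, velocities):
        vy = vy + gravity * dt                      # accelerate
        new_velocities.append((vx, vy))
        new_positions.append((x + vx * dt, y + vy * dt))  # then move
    return new_positions, new_velocities


print(step_particles([(0.0, 10.0)], [(1.0, 0.0)], dt=0.5, gravity=-10.0))
```

Interacting particles and rigid bodies break this independence, which is why they appear further down the list.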

Game Processor Evolution

[Diagram; recoverable labels: Vertex Shader, Pixel Shader, Texture Creation, Mesh Modeling, Physics, Content Creation Process, Game Stack, Real Time]


Common Properties of Mass Apps

• Results of client computations are displayed at interactive rates

• Fundamental requirement of client systems

• Tight coupling with graphics is optimal

• Physical proximity to renderer is beneficial

• Smaller data types are key

Support for Image Data Types

• Pixels, texels, motion vectors, etc.

• Image data more important than float32s

• Data declines in size as importance drops

• Bytes, words, fp11, fp16, single, double

• Bytes may be declining in importance

• Hardware support for formatting is useful

• The clock cycles spent on shift/or/mul, etc. to format data in software cost too much power
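
To show what software formatting of pixel data looks like, the sketch below packs and unpacks four 8-bit channels with shifts, ORs and masks, and lists the storage size of several of the types named above. The channel order (red in the low byte) and the reading of "word" as 16 bits are assumptions; fp11 is omitted because the standard `struct` module has no code for it.

```python
import struct


def pack_rgba8(r, g, b, a):
    """Pack four 8-bit channels into one 32-bit word using shifts and ORs."""
    return (a << 24) | (b << 16) | (g << 8) | r


def unpack_rgba8(word):
    """Recover the four channels with shifts and masks."""
    return (word & 0xFF, (word >> 8) & 0xFF,
            (word >> 16) & 0xFF, (word >> 24) & 0xFF)


# Bytes per value, using standard (non-native) struct sizes.
SIZES = {name: struct.calcsize("<" + code)
         for name, code in [("byte", "B"), ("word", "H"), ("fp16", "e"),
                            ("single", "f"), ("double", "d")]}

print(hex(pack_rgba8(0x11, 0x22, 0x33, 0x44)), SIZES)
```

Every pixel read or written in software pays for these extra operations, which is the cost that hardware format conversion avoids.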

I/O Considerations

• Most computations that are not 3-D rendering are i/o bound on GPUs

• Their arithmetic intensity is lower than GPUs are built for

• Convolutions

• Support for efficient data types is very important

GPU Arithmetic Intensity Projection

• 2-3 more process doublings before new memory technologies will help much

• Stacked die?, 2k wide bus?, optical?

• Estimate at least 4x increase in number of compute instructions per read operation

• Arithmetic intensities reach 64??

• This is fine for 3-D rendering

• Other data-parallel apps need more i/o

I/O Patterns

• Solutions will have a variety of mechanisms to help with worsening i/o constraints

• Data re-use (at cache size scales) is relatively rare in media applications

• Read-write use of memory is rare

• Read-write caches are less critical

• Streaming data behavior is sufficient

• Read contention and write contention are the issue, not read-after-write scenarios

Interesting Techniques

• Shared registers

• Possibly interesting to help with i/o bandwidth

• Reducing on-chip bandwidth may help power/heat

• Scatter

• Can be useful in scenarios that don’t thrash the output subsystem

• Can reduce pressure on the gather input system

Convolution

• Key element of almost all image and video processing operations

• Scaling, glows, blurs, search, segmentation

• Algorithm has very low arithmetic intensity

• 1 MAD per sample

• Also has huge re-use (order of kernel size)

• Shared registers should cut memory reads by a factor of kernel size, raising effective arithmetic intensity accordingly
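
The re-use argument above can be checked by counting input reads. The sketch below implements a 1-D filter with one multiply-add per tap, then compares reads when every output re-fetches its whole window against an idealized case where neighbouring outputs share fetched inputs. The sharing model is an assumption standing in for shared registers, not a description of any particular hardware.

```python
def convolve_valid(samples, kernel):
    """1-D filter over fully overlapping windows: one multiply-add per tap."""
    k = len(kernel)
    return [sum(kernel[j] * samples[i + j] for j in range(k))
            for i in range(len(samples) - k + 1)]


def memory_reads(num_samples, kernel_size, shared):
    """Count input reads needed to produce every output of the filter above.

    Without sharing, each output re-reads all `kernel_size` inputs.
    With idealized sharing, each input sample is fetched exactly once.
    """
    outputs = num_samples - kernel_size + 1
    return num_samples if shared else outputs * kernel_size


print(convolve_valid([1, 2, 3, 4], [1, 1]))
print(memory_reads(1024, 9, shared=False), memory_reads(1024, 9, shared=True))
```

For long inputs the ratio of the two read counts approaches the kernel size, while the number of multiply-adds is unchanged.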

Processor Core Types

• Heterogeneous Many-core

• In-Order vs. Out-of-Order

• Distinction arose from targeting 2 different workload design points

• Software can select ideal core type for each algorithm (workload design point)

• Since not all cores can be powered anyway

• Hardware can make trade-offs on:

• Power, area, performance growth rate

[Diagram; recoverable labels: Local Memory Accesses, Streaming Memory Access]






Workload Differences

General Processing

• Small batches

• Frequent branches

• Many data inter-dependencies

• Scalar ops

• Vector ops

Media Processing

• Large batches

• Few branches

• Few data inter-dependencies

• Scalar ops

• Vector ops

Lesson from GPGPU Research

• Many important tasks have data-parallel implementations

• Typically requires a new algorithm

• May be just as maintainable

• Definitely more scalable with core count

APIs Must Hide Implementations

• Implementation attributes must be hidden from apps to enable scaling over time

• Number of cores operating

• Number of registers available

• Number of i/o ports

• Sizes of caches

• Thread scheduling policies

• Otherwise, these cannot change, and performance will not grow
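
One way to picture an API that hides these attributes is a map call that never mentions worker count or scheduling. The sketch below uses Python's standard thread pool purely to show the shape of such an interface; `parallel_map` is a hypothetical name, and Python threads are not a model of GPU execution.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(kernel, items):
    """Apply `kernel` to every item; the caller never sees how many workers run."""
    # The worker count is left to the library default, so it can change on
    # future machines without any change to application code.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(kernel, items))


print(parallel_map(lambda v: v * v, range(5)))
```

Because the application states only what to compute per item, the runtime stays free to change core counts and scheduling policy underneath it.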

Order of Thread Execution

• Shared registers and scatter share a pitfall:

• It may be possible to write code that is dependent on the order of thread execution

• This violates the scaling requirement

• The order of thread execution may vary from run-to-run (each frame)

• Will certainly vary between implementations

• Across vendors and within a single vendor’s product lines

• All such code is considered incorrect
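
The pitfall can be reproduced in a few lines by simulating threads that scatter to colliding addresses in different orders. In the sketch below, overwriting gives a result that depends on which thread runs last, while integer accumulation does not. The sequential loop over `order` stands in for an arbitrary thread schedule, and all names are hypothetical.

```python
def scatter_overwrite(writes, size, order):
    """Order-dependent scatter: the last thread to write an address wins."""
    out = [0] * size
    for thread_id in order:
        address, value = writes[thread_id]
        out[address] = value
    return out


def scatter_add(writes, size, order):
    """Order-independent scatter: integer addition commutes and associates."""
    out = [0] * size
    for thread_id in order:
        address, value = writes[thread_id]
        out[address] += value
    return out


writes = [(0, 5), (0, 7), (1, 3)]        # threads 0 and 1 collide on address 0
forward, backward = [0, 1, 2], [2, 1, 0]
print(scatter_overwrite(writes, 2, forward), scatter_overwrite(writes, 2, backward))
print(scatter_add(writes, 2, forward), scatter_add(writes, 2, backward))
```

The overwrite version is the kind of code the slide calls incorrect. Note that floating-point addition is not strictly associative, so even the accumulating form can differ in the last bits across orderings.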

System Design Goals

• Enable massively parallel implementations

• Efficient scaling to 1000s of cores

• No blocking/waiting

• No constraints on order of thread execution

• No read-after-write hazards

• Enable future compatibility

• New hardware releases, new operating systems

Other Computing Paradigms

• CPU-originated:

• Lock-based, Lockless

• Message Passing

• Transactional Memory

• May not scale well to 1000s of cores

• GPU Paradigms


• May not scale well over time

High Level APIs

• Microsoft Accelerator

• Google PeakStream

• RapidMind

• Acceleware

• Stream processing

• Brook, Sequoia

Additional GPU Features

• Linear Filtering

• 1-D, 2-D, 3-D floating point array indices

• Image and video data benefit

• Triangle Interpolators

• Address calculations take many clocks

• Blenders

• Atomic reduction ops reduce ordering concerns

• 4-vector operations

• Vector data, syntactic convenience
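
Linear filtering with floating-point array indices means reading an array at a fractional position and getting a blend of the neighbouring elements. A minimal 1-D software sketch is shown below; clamping at the edges is an assumed addressing mode, and GPUs perform the equivalent in fixed-function hardware rather than in code like this.

```python
import math


def sample_linear(data, u):
    """Read `data` at fractional index `u`, blending the two nearest elements."""
    u = min(max(u, 0.0), len(data) - 1.0)   # clamp to the valid index range
    i0 = int(math.floor(u))
    i1 = min(i0 + 1, len(data) - 1)
    frac = u - i0
    return data[i0] * (1.0 - frac) + data[i1] * frac


row = [0.0, 10.0, 20.0]
print(sample_linear(row, 0.5), sample_linear(row, 1.25))
```

The 2-D and 3-D cases repeat the same blend along each additional axis.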

Processor Opportunities

• Client computing performance can improve

• Client space is a large un-tapped opportunity for parallel processing

• Hardware changes required are minimal and fairly obvious

• Fast display, efficient i/o, scalable over time
