scaling graphics performance with...
TRANSCRIPT
Scaling Graphics Performance with Multiprocessing
Kristof BeetsBusiness Development Manager – POWERVR Graphicsp g p
April 2nd 2009
Imagination – what we do
Markets:SoC Technologies &Solutions:
Customer IPMobile Phone
ENSIGMA META POWERVR
Customer IP
POWERVRGraphics
3rd Party IP
Multimedia
Hand-held Multimedia
SoC Platform
POWERVRVideo
ENSIGMAReceivers
METAProcessors
POWERVRDisplay
Home Consumer &
Audio Display
Consumer & Entertainment
Mobile ComputingAudio Computing
In-Car Electronics
PURE Digital Consumer Products:
V1.0 March 09 © 2009 Imagination Technologies Ltd.2
Electronics
Graphics AccelerationThe Scale of the Problem
The diversity of applications using embedded 3D requires an extremely broad range of performance, power and cost points.
Must track process nodesMust be implementable using ASIC techniques
GFl
ops
CellphonesAuto NavHandheld GamingUMPC/MIDConsole64x Scale Factor
Delivering those scalable solutions in a timely manner
CellphonesAuto NavHandheld GamingUMPC/MIDConsole64x Scale Factor
It is essential to maintain a single, coherent software stack.
mannerCore variants must leverage base platform to enable rapid deliveriesStrike a balance between optimisation and time to market
Multiple-core delivery Schedule
Applications Java VMAPI BindingsGUI Game
OGL OCL OVG
V1.0 March 09
1x 2x 4x 8x 32x 64x etc…
EGL
Services Layer
PowerVR Hardware
© 2009 Imagination Technologies Ltd.4
How to optimise scalabilityMulticore or optimised multi-pipelined cores?
ad t are
a
Multicore i
%ov
erhe
a
Perf
/uni
tregion
Infrastructure overhead as proportion of area varies with performance levelPerformance/unit area increases as absolute performance of multi-pipeline core increasesReturns diminish as ratio of overhead to core area increases
V1.0 March 09
Balance point between efficiency and time to market determines the start of the Multicore Region.
© 2009 Imagination Technologies Ltd.5
SGX Overview & Roadmap
SGX architecture exploits the inherent fine-grained parallelism of graphics
Task scheduling plus hardware
Screen Tiling
Task scheduling plus hardware thread management enables the problem to be distributed over multiple execution pipelines.
As implementations move to multi-core, distributing the parallelism amongst cores b th k ibecomes the key issue
TBDR offers a workable solution which can exploit parallelism without increasing latency.
V1.0 March 09 © 2009 Imagination Technologies Ltd.6
Other solutions fail to distribute geometry workload or increase latency
POWERVR SGX GP-GPU ArchitectureData Flow Overview
POWERVR SGXHost Bus
Control and Register Bus Host CPU Interface
Vertex DataMaster
(Geometry)
TilingCo-
Processor
Universal Scalable Shader Engine
ThreadScheduler
Multi-ThreadedExecution Unit
PixelCo-
Processor
Pixel DataMaster(ISP)
Pixel DataMaster(ISP)
PixelCo-
Processor
ThreadScheduler
Multi-ThreadedExecution UnitCoarse
GrainScheduler
ocesso(ISP)
TexturingCo-
TexturingCo-General
Purpose
ThreadScheduler
Multi-ThreadedExecution Unit
(CGS)
System Memory
Bus
CoProcessorProcessorPurpose
Data MasterMulti-level Cache
System Memor
V1.0 March 09 © 2009 Imagination Technologies Ltd.7
MemoryInterface System Memory Bus
POWERVR USSE Thread Scheduling Overview
Task Queues
Multi-ThreadedThreadTask => Thread
Thread Queues
Active Threads
Vertex Source
Execution Unit 1Controldistribution
Vertex Source
Pixel Source
CoarseGrain
Scheduler Multi-ThreadedExecution Unit 2
ThreadControl
Task => Threaddistribution
Active Threads
General/
(CGS)
General/Control Source
Multi-ThreadedExecution Unit N
V1.0 March 09 © 2007 Imagination Technologies Ltd.8
POWERVR SGX543 MP ArchitectureHierarchical Scheduling of Tasks
ad
ulin
g
g
Applications
Pipe
ch
edul
ing
Sche
dulin
g
Thre
aSc
hedu
Sche
dulin
g
API Driver
SPi
pe S
Thre
ad
Sche
dulin
g
MP
Cor
e S
ulin
g
e
Thre
ad
ched
ulin
g
M
Pipe
Sch
edu
ead
ulin
gSc
Low Level Services
V1.0 March 09
P
© 2009 Imagination Technologies Ltd.9
Thre
Sche
d
GP-GPU StandardisationOpenCL Goals and Features
An open standard for heterogeneous parallel programmingEnables applications to execute compute intensive tasks across one or more devicesDevices can include single/multi-core CPUs, GPUs or other general processing coresAvoids the need to use custom or task-specific APIs to leverage a device’s processing power - such as graphics APIs for computation using a GPU.
Components of OpenCLOpenCL consists of an API and C like programming languageOpenCL consists of an API and C-like programming languageUser writes processing functions (kernels) in the OpenCL C language
A well defined subset of the C99 standardAdditional keywords and basic-types as appropriate for parallel & vector processingy yp pp p p p gIncludes numerical accuracy requirements for mathematical operations
The OpenCL API includes:Device enumeration and queryingKernel management and executionManagement of buffers and images for kernel input and outputSynchronisation and event handlingInteroperability with OpenGL and other graphics APIs
Confidential – Subject to NDAV01.11 nov07
Interoperability with OpenGL and other graphics APIs
For more details: http://www.khronos.org/opencl/
© 2009 Imagination Technologies Ltd.10
Conclusions
Tile based architectures are ideally placed for multicore scaling of performanceBut it has to be the right Tile Based Architecture: Scalable in all the correct dimensions.
Universal Load-balanced to handle Vertex/Pixel and General Instructions
The base unit has to be right – inefficiencies are amplified by multicore
Mobile Multicore solutions will be here within a yearyFurther erosion of the distinction between appliances and computers
Raw graphics power will empower UI and applications designers
GP-GPU already in R&D using SGX with all major OEMs e.g. Image ProcessingGP GPU already in R&D using SGX with all major OEMs e.g. Image Processing
The performance curve shows no signs of flattening outIn fact, we are just getting started…
V1.0 March 09 © 2009 Imagination Technologies Ltd.11