Scaling Graphics Performance with Multiprocessing

Kristof Beets
Business Development Manager – POWERVR Graphics

April 2nd 2009

Imagination – what we do

[Diagram: SoC Technologies & Solutions and the Markets they serve]

SoC Technologies & Solutions: POWERVR Graphics, POWERVR Video, POWERVR Display, ENSIGMA Receivers, META Processors
SoC Platform: Customer IP, 3rd Party IP
Other diagram labels: Multimedia, Audio, Display, Computing
Markets: Mobile Phone, Hand-held Multimedia, Home Consumer & Entertainment, Mobile Computing, In-Car Electronics
PURE Digital Consumer Products

V1.0 March 09 © 2009 Imagination Technologies Ltd.

Our Licensees

SUNPLUS


Graphics Acceleration: The Scale of the Problem

The diversity of applications using embedded 3D requires an extremely broad range of performance, power and cost points.

Must track process nodes
Must be implementable using ASIC techniques

[Figure: GFlops by device class (Cellphones, Auto Nav, Handheld Gaming, UMPC/MID, Console), spanning a 64x scale factor]

Delivering those scalable solutions in a timely manner
Core variants must leverage base platform to enable rapid deliveries
Strike a balance between optimisation and time to market

It is essential to maintain a single, coherent software stack.

Multiple-core delivery schedule: 1x 2x 4x 8x 32x 64x etc…

[Diagram: software stack, top to bottom]
Applications, Java VM, API Bindings, GUI, Game
OGL, OCL, OVG
EGL
Services Layer
PowerVR Hardware

How to optimise scalability: Multicore or optimised multi-pipelined cores?

[Figure: % overhead and Perf/unit area plotted against performance level, with the Multicore region marked]

Infrastructure overhead as proportion of area varies with performance level
Performance/unit area increases as absolute performance of multi-pipeline core increases
Returns diminish as ratio of overhead to core area increases

Balance point between efficiency and time to market determines the start of the Multicore Region.
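
As a rough illustration of the trade-off above, the toy model below shows performance per unit area rising and then falling as pipelines are added to a single core. It is not taken from the slides: it assumes a fixed infrastructure area plus an interconnect term that grows with the square of the pipeline count, and every constant and function name is invented for illustration.

```python
def overhead_area(pipes, fixed_overhead=4.0, wiring_cost=0.05):
    """Assumed infrastructure area: a fixed part plus wiring that grows as pipes**2."""
    return fixed_overhead + wiring_cost * pipes ** 2


def overhead_fraction(pipes, pipe_area=1.0):
    """Share of total core area spent on infrastructure rather than pipelines."""
    overhead = overhead_area(pipes)
    return overhead / (overhead + pipes * pipe_area)


def perf_per_area(pipes, pipe_area=1.0):
    """Performance (assumed proportional to pipeline count) per unit of total area."""
    return pipes / (overhead_area(pipes) + pipes * pipe_area)


if __name__ == "__main__":
    for pipes in (1, 2, 4, 8, 16, 32):
        print(pipes, round(overhead_fraction(pipes), 3), round(perf_per_area(pipes), 3))
```

Under these assumed constants the efficiency peaks at around 8 to 9 pipelines. Beyond that point replicating a whole core becomes the more attractive option, which is one way to read the start of the Multicore Region.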


SGX Overview & Roadmap

SGX architecture exploits the inherent fine-grained parallelism of graphics

Screen Tiling

Task scheduling plus hardware thread management enables the problem to be distributed over multiple execution pipelines.

As implementations move to multi-core, distributing the parallelism amongst cores becomes the key issue

TBDR offers a workable solution which can exploit parallelism without increasing latency.


Other solutions fail to distribute geometry workload or increase latency
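
To make the screen-tiling idea concrete, here is a minimal sketch of binning triangles into screen tiles and handing the tiles to several cores. It is an illustration under stated assumptions, not the SGX implementation: the 16-pixel tile size, the bounding-box test, the round-robin assignment and the names `bin_triangles` and `assign_tiles` are all hypothetical.

```python
import math
from collections import defaultdict

TILE = 16  # assumed tile edge in pixels


def bin_triangles(triangles, width, height, tile=TILE):
    """Map each tile (tx, ty) to the indices of triangles whose bounding box touches it."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    bins = defaultdict(list)
    for index, triangle in enumerate(triangles):
        xs = [point[0] for point in triangle]
        ys = [point[1] for point in triangle]
        # Convert the bounding box to tile coordinates and clamp it to the screen.
        x0 = max(math.floor(min(xs) / tile), 0)
        x1 = min(math.floor(max(xs) / tile), tiles_x - 1)
        y0 = max(math.floor(min(ys) / tile), 0)
        y1 = min(math.floor(max(ys) / tile), tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[(tx, ty)].append(index)
    return bins


def assign_tiles(bins, cores):
    """Deal the non-empty tiles out to cores round-robin (an assumed policy)."""
    work = [[] for _ in range(cores)]
    for position, tile_key in enumerate(sorted(bins)):
        work[position % cores].append(tile_key)
    return work


if __name__ == "__main__":
    scene = [((0, 0), (10, 0), (0, 10)), ((20, 20), (40, 20), (20, 40))]
    bins = bin_triangles(scene, 64, 64)
    print(assign_tiles(bins, 2))
```

Because each tile carries its own triangle list, a core can render a tile without consulting the others, which is what makes per-tile distribution attractive.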

POWERVR SGX GP-GPU Architecture: Data Flow Overview

[Block diagram: POWERVR SGX data flow]
Host CPU Interface: Host Bus, Control and Register Bus
Data masters: Vertex Data Master (Geometry), Pixel Data Master (ISP), General Purpose Data Master
Coarse Grain Scheduler (CGS)
Universal Scalable Shader Engine: Thread Scheduler, Multi-Threaded Execution Unit (shown several times)
Co-processors: Tiling Co-Processor, Pixel Co-Processor, Texturing Co-Processor
Memory: Multi-level Cache, Memory Interface, System Memory Bus

POWERVR USSE Thread Scheduling Overview

[Diagram: USSE thread scheduling]
Task Queues: Vertex Source, Pixel Source, General/Control Source
Coarse Grain Scheduler (CGS): Task => Thread distribution
Thread Control, Thread Queues, Active Threads
Multi-Threaded Execution Unit 1, Multi-Threaded Execution Unit 2, ... Multi-Threaded Execution Unit N
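
The task-to-thread distribution in this diagram can be sketched in a few lines. This is a simplified model, not the hardware's actual policy: the slides do not say how threads are placed, so the round-robin draining of sources, the shortest-queue placement and the names `Task`, `ExecutionUnit` and `coarse_grain_schedule` are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    source: str   # "vertex", "pixel" or "general"
    threads: int  # number of threads this task expands into


@dataclass
class ExecutionUnit:
    name: str
    thread_queue: deque = field(default_factory=deque)


def coarse_grain_schedule(task_queues, units):
    """Take one task per source in turn and place each of its threads on the shortest queue."""
    pending = {source: deque(tasks) for source, tasks in task_queues.items()}
    while any(pending.values()):
        for source, queue in pending.items():
            if not queue:
                continue
            task = queue.popleft()
            for thread_index in range(task.threads):
                target = min(units, key=lambda unit: len(unit.thread_queue))
                target.thread_queue.append((source, thread_index))
    return units


if __name__ == "__main__":
    queues = {"vertex": [Task("vertex", 3)], "pixel": [Task("pixel", 4), Task("pixel", 1)]}
    for unit in coarse_grain_schedule(queues, [ExecutionUnit("EU1"), ExecutionUnit("EU2")]):
        print(unit.name, list(unit.thread_queue))
```

Mixing vertex, pixel and general threads in the same queues is the point of a unified shader engine: no execution unit sits idle because only one kind of work is available.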


POWERVR SGX543 MP Architecture: Hierarchical Scheduling of Tasks

[Diagram: Applications, API Driver and Low Level Services, with scheduling levels labelled MP Core Scheduling, Pipe Scheduling and Thread Scheduling]

GP-GPU Standardisation: OpenCL Goals and Features

An open standard for heterogeneous parallel programming
Enables applications to execute compute intensive tasks across one or more devices
Devices can include single/multi-core CPUs, GPUs or other general processing cores
Avoids the need to use custom or task-specific APIs to leverage a device’s processing power - such as graphics APIs for computation using a GPU.

Components of OpenCL
OpenCL consists of an API and C-like programming language
User writes processing functions (kernels) in the OpenCL C language
A well defined subset of the C99 standard
Additional keywords and basic-types as appropriate for parallel & vector processing
Includes numerical accuracy requirements for mathematical operations

The OpenCL API includes:
Device enumeration and querying
Kernel management and execution
Management of buffers and images for kernel input and output
Synchronisation and event handling
Interoperability with OpenGL and other graphics APIs
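
The kernel model can be illustrated without an OpenCL installation. The sketch below is plain Python, not the OpenCL API or OpenCL C: `saxpy_kernel` and `run_ndrange` are hypothetical names, and the loop visits work-items one after another where a real device would run them in parallel.

```python
def saxpy_kernel(global_id, a, x, y, out):
    """Kernel body: executed once per work-item, which touches only its own element."""
    out[global_id] = a * x[global_id] + y[global_id]


def run_ndrange(kernel, global_size, *args):
    """Emulate a one-dimensional kernel launch by calling the kernel for every global id."""
    for global_id in range(global_size):
        kernel(global_id, *args)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = [10.0, 20.0, 30.0, 40.0]
    out = [0.0] * len(x)
    run_ndrange(saxpy_kernel, len(x), 2.0, x, y, out)
    print(out)
```

Since no work-item depends on another, the runtime is free to spread the ids across however many cores or pipelines a device offers.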

Confidential – Subject to NDA, V01.11 nov07

For more details: http://www.khronos.org/opencl/


Conclusions

Tile based architectures are ideally placed for multicore scaling of performance
But it has to be the right Tile Based Architecture: Scalable in all the correct dimensions.

Universal Load-balanced to handle Vertex/Pixel and General Instructions

The base unit has to be right – inefficiencies are amplified by multicore

Mobile Multicore solutions will be here within a year
Further erosion of the distinction between appliances and computers

Raw graphics power will empower UI and applications designers

GP-GPU already in R&D using SGX with all major OEMs, e.g. Image Processing

The performance curve shows no signs of flattening out
In fact, we are just getting started…

