Overcoming the Challenges of Multimedia System Design
rtcgroup.com/arm/2007/presentations/114 -...
TRANSCRIPT
Overcoming The Challenges Of Multimedia System Design
Jem Davies
Director of Technology
ARM Media Processing Division
Confidential
Agenda
  The challenges
    Memory bandwidth
    Power consumption
    Cost
    Content/applications
  A multi-disciplinary approach
  It’s all about software, stupid
The Challenges – No Surprises Here
  Memory Bandwidth
    Video requires a lot of memory bandwidth
      HDTV resolution at 30 fps is close to 200 MB/s for frame buffers only
    3D graphics in WVGA and beyond
      Easily consumes hundreds of MB/s – some architectures even up to 6-7 GB/s for “low-end” WVGA – won’t work well in mobile!
    The performance bottleneck for user experience expectations
  Power Consumption / Energy Capacity
    Mobile applications processor power budget is no more than ~250 mW
    This is not a PC - this is not the PC market
    Users expect better battery life – not worse! No fuel cells yet
  Cost
    Customers will not pay more than $500 for very high-end mobiles
  Content
    Mobile content is currently limited – new technologies require new investments
    Tools are under-developed as yet
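The frame-buffer figure quoted for HDTV can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the slide does not state the resolution or pixel format, so 1920x1080 at 3 bytes per pixel, written once per frame, is an assumption, and the function name is hypothetical.

```python
def framebuffer_bandwidth_mb_s(width, height, bytes_per_pixel, fps, passes=1):
    """Frame-buffer traffic in MB/s (1 MB = 10**6 bytes).

    `passes` counts how many times each frame is written or read;
    1 means a single write per frame (an assumption, not from the slides).
    """
    bytes_per_frame = width * height * bytes_per_pixel
    return bytes_per_frame * fps * passes / 1e6


# Assumed formats: 1080p with 24-bit and 32-bit pixels at 30 fps.
hd_24bpp = framebuffer_bandwidth_mb_s(1920, 1080, 3, 30)
hd_32bpp = framebuffer_bandwidth_mb_s(1920, 1080, 4, 30)
print(f"1080p, 24 bpp, 30 fps: {hd_24bpp:.1f} MB/s")
print(f"1080p, 32 bpp, 30 fps: {hd_32bpp:.1f} MB/s")
```

Under these assumptions a single write of each 24-bit frame comes to about 187 MB/s, which is consistent with "close to 200 MB/s". Any read-back for display or post-processing multiplies the figure by the number of passes.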
Solving the Problems
  Solving the problems requires a multi-disciplinary, multi-faceted approach
    So that will be easy then, won’t it?
  At the core (not just CPU)
    Dynamic vs. static power
    Local memories vs. cost to save bandwidth
    Gate count vs. performance
  At the interconnect and the fabric
    Bus protocols
    System-level caches to reduce memory bandwidth
    Memory controllers
    System architecture
  Across the system
    Software stack
    Hardware <=> software interaction
  At the content and application level
    Little is done here today – what could be done to help power?
Memory Bandwidth - 1
  At the core (CPU)
    Architecture (incl. ISA)
    Micro-architecture
    Proven track record
    Cache(s)
    TCMs
  We’ve learned a lot
    About power
    About IP
    About standards
    About scale
    About value
  [Diagram label: Debug & trace interface]
Memory Bandwidth - 2
  At the core (media accelerators): video, graphics and audio
  All consume significant memory bandwidth
    Some use more than the CPU
    With different characteristics, too
  At first, audio doesn’t seem too bad, then…
    … users want MP3 for 100 hours…
    … and multi-channel audio (games mixing sources, radio etc. etc.)
    … and 3-D audio, Dolby 5.1, other processing…
  Video resolutions get bigger and bigger
    Some common industry designs do not scale well (PPA)
  Not all graphics architectures are equal
    “Do more with less”
    Mali™ graphics hardware designed for low bandwidth and low power
High System Performance
  Keeping data close
    Support for up to 50 outstanding transfers
    Caching system for all data streams
    All bus transactions are bursts and do cache line fills for future operations
    On-chip buffers for intermediate results prevent unnecessary read/modify/write cycles to memory
      Colour (blending, multisampling)
      Z / depth
      Stencil
    Pre-fetching of state data
  Mali™ hardware is a team player in an SoC environment
  [Diagram labels: compute units; on-chip buffers / caches; AXI outstanding transfers; frame buffer]
  On-chip buffers and caches ensure Mali hardware is active even if system latency is extremely high
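Why outstanding transfers matter for latency tolerance can be sketched with Little's law: the bytes in flight divided by the round-trip latency bound the sustained bandwidth. This is a rough model, not a description of the Mali bus interface. The 50 outstanding transfers come from the slide; the 64-byte burst size and the 500 ns latency are hypothetical values chosen for illustration.

```python
import math


def sustained_bandwidth_mb_s(outstanding, burst_bytes, latency_ns):
    """Upper bound on bandwidth when `outstanding` bursts are kept in flight.

    bytes / ns equals GB/s, so multiplying by 1000 gives MB/s.
    """
    return outstanding * burst_bytes * 1000.0 / latency_ns


def outstanding_needed(target_mb_s, burst_bytes, latency_ns):
    """Smallest number of in-flight bursts that can sustain `target_mb_s`."""
    return math.ceil(target_mb_s * latency_ns / (burst_bytes * 1000.0))


# Hypothetical system: 64-byte bursts, 500 ns memory round trip.
single = sustained_bandwidth_mb_s(1, 64, 500)
many = sustained_bandwidth_mb_s(50, 64, 500)
print(f"1 outstanding transfer:   {single:.0f} MB/s")
print(f"50 outstanding transfers: {many:.0f} MB/s")
print("needed for 200 MB/s:", outstanding_needed(200, 64, 500))
```

With a single transfer in flight the master stalls for the full latency on every burst. Allowing many transfers in flight lets the same latency be overlapped, which is one way a core can stay busy when system latency is high.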
Power-efficient Graphics
  Memory bandwidth is a significant use of power
    A large proportion is off-chip at 10x the power
  Mali™ architecture significantly reduces memory bandwidth
    Combines the best of immediate-mode flow and tile-based rendering
    Significant savings for both low and high complexity scenes
  [Chart: mW per frame (axis 0.0 to 100.0) for software only, immediate mode, traditional tile-based, Mali55 and Mali200; series: Advanced UI (3,000 vertices) and Gaming Hi (30,000 vertices)]
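The effect of the 10x off-chip penalty can be illustrated with a very small cost model. This is a sketch only: it reads the slide's "10x" as a per-byte cost ratio between off-chip and on-chip accesses, and the traffic split used in the example is invented, not taken from the chart.

```python
OFF_CHIP_COST_RATIO = 10.0  # assumed reading of the slide's "10x the power"


def relative_memory_energy(on_chip_bytes, off_chip_bytes, ratio=OFF_CHIP_COST_RATIO):
    """Memory energy in arbitrary units: on-chip bytes cost 1, off-chip cost `ratio`."""
    return on_chip_bytes + ratio * off_chip_bytes


# Hypothetical frame moving 100 units of data in total.
all_off_chip = relative_memory_energy(on_chip_bytes=0, off_chip_bytes=100)
mostly_on_chip = relative_memory_energy(on_chip_bytes=80, off_chip_bytes=20)
print(all_off_chip, mostly_on_chip)
```

In this toy case, keeping 80% of the traffic in on-chip buffers cuts memory energy to roughly a quarter, which is the kind of saving that motivates tile buffers and caches.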
Memory Bandwidth - 3
  At the interconnect and fabric level
    Generate burst traffic
    Numbers of outstanding transactions appropriate to data
    (Multi-level) caches required – OS support
    System-level caches reduce bandwidth to off-chip memory
    Cache coherency protocols allow inter-core communication without touching external memory
  System level
    Drivers need to be written with memory usage (power) in mind
      E.g. (software-)cache internal results
    Apps/content software have to be produced through tools that create efficient code (e.g. cache-optimized loops to save memory bandwidth)
      Compilers
      High-level content-generation tools
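What a cache-optimized loop buys in memory traffic can be shown with a toy cache simulator. The sketch below counts cache line fills for a column-wise walk over a row-major array, first naively and then in tiles. The cache model (16 lines of 8 elements, fully associative, LRU) and the 64x64 array are assumptions chosen to make the effect visible, not a model of any real part.

```python
from collections import OrderedDict


class LruCache:
    """Tiny fully associative LRU cache that only counts line fills."""

    def __init__(self, num_lines, line_elems):
        self.num_lines = num_lines
        self.line_elems = line_elems
        self.lines = OrderedDict()
        self.fills = 0  # each fill stands for one burst from external memory

    def access(self, address):
        tag = address // self.line_elems
        if tag in self.lines:
            self.lines.move_to_end(tag)  # hit: mark as most recently used
            return
        self.fills += 1
        self.lines[tag] = True
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)  # evict least recently used line


def column_order_fills(n, cache):
    """Walk a row-major n x n array column by column (cache-unfriendly)."""
    for col in range(n):
        for row in range(n):
            cache.access(row * n + col)
    return cache.fills


def tiled_fills(n, tile, cache):
    """Same accesses, but visited tile by tile so fetched lines get reused."""
    for row0 in range(0, n, tile):
        for col0 in range(0, n, tile):
            for col in range(col0, min(col0 + tile, n)):
                for row in range(row0, min(row0 + tile, n)):
                    cache.access(row * n + col)
    return cache.fills


naive = column_order_fills(64, LruCache(num_lines=16, line_elems=8))
tiled = tiled_fills(64, 8, LruCache(num_lines=16, line_elems=8))
print("line fills, column order:", naive)
print("line fills, tiled:       ", tiled)
```

Both loops touch the same 4,096 elements. The naive walk evicts every line before it is reused, so each access costs a fill, while the tiled walk fetches each of the array's 512 lines exactly once. The same reordering is what a compiler or content tool would aim for when it generates cache-friendly code.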
Example Mali™ System Architecture
  Here’s a (simplified) example of the system and data flow
  We need to minimise external memory transactions
  [Diagram labels: Cortex-A9; L1 caches; Snoop Ctrl Unit; ACP (Accelerator Coherence Port); L2 cache (PL310); AMBA® AXI™ bus (PL301); Mali™ GP2; Mali200 (x2); Mali L2; LCD controller (PL111); DRAM controller (PL341); dual ports; memory]
System Approach is Key to Performance
  [Diagram labels: AMBA® AXI™ bus fabric; ARM Cortex processor; NEON™; L2 cache controller; interrupt controller; CoreSight™ debug/trace; DMA controller; system-level cache; DDR memory controller; mobile DDR PHY; NAND flash interface; Mali200 3D graphics sub-system; video codec sub-system; pre- and post-processing on Mali200; image processing; camera interface; AudioDE™ audio codec sub-system; audio IO; HDD interface; SATA PHY; SIM interface; IO; touchscreen interface; peripheral subsystem (USB, IR, UART, GPIO, timers, SPI x2, FM and TV receiver, I2S, I2C)]
  Latency tolerance
    Traffic from other IP creates challenges for real-time graphics
    Developers need system knowledge and tools
  Software API stack
    Per-frame autonomous rendering
    Minimise HW / SW interaction
  System bus bandwidth
    Do more for less
  Mali200™ GPU
    Up to 40 GFLOPS
    On-chip buffers and caches
    Burst-optimised bus transactions
  Memory bandwidth
    Do more for less
High System Performance
  Per-frame autonomous rendering
    No overhead in HW/SW interaction
    Vertices and control data pushed to memory by API drivers
    Mali™ hardware automatically manages frame rendering without S/W interference
  Results
    CPU is not kept idle or trapped in interrupt handling routines – traditionally a performance killer for graphics
    Clean system architecture that eliminates HW/SW interface bottlenecks
  [Diagram labels: ARM processor; memory (vertex arrays, per-frame config., textures); MMU; caches / buffers; per-frame control logic; “Read data structures and produce frame buffer”]
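The difference between per-frame autonomous rendering and a CPU-driven flow can be sketched as a count of interrupts seen by the CPU. This is a toy model for illustration only: `FrameDescriptor` and `ToyGpu` are hypothetical names and do not represent the Mali driver interface.

```python
from dataclasses import dataclass, field


@dataclass
class FrameDescriptor:
    """Hypothetical per-frame job: everything the GPU needs, staged in memory."""
    vertex_arrays: list = field(default_factory=list)
    textures: list = field(default_factory=list)
    config: dict = field(default_factory=dict)


class ToyGpu:
    def __init__(self):
        self.interrupts = 0  # completion interrupts raised to the CPU

    def render_per_draw(self, draws):
        """CPU-driven model: the CPU is interrupted after every draw call."""
        for _ in draws:
            self.interrupts += 1
        return self.interrupts

    def render_frame(self, frame):
        """Autonomous model: walk the whole descriptor, interrupt once at the end."""
        for _ in frame.vertex_arrays:
            pass  # geometry is consumed from memory without CPU involvement
        self.interrupts += 1
        return self.interrupts


draws = [[0.0, 0.0, 0.0]] * 100  # 100 placeholder vertex arrays
per_draw = ToyGpu().render_per_draw(draws)
per_frame = ToyGpu().render_frame(FrameDescriptor(vertex_arrays=draws))
print("interrupts, per draw call:", per_draw)
print("interrupts, per frame:    ", per_frame)
```

In this sketch the driver's only job is to stage the descriptor in memory; the interrupt count then depends on the number of frames rather than the number of draw calls.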
How Do We Measure Performance?
  (Performance equals power)
    More efficiency = more performance = less power
  As discussed, performance is affected by multiple factors
    Need accurate measurements
    Need “what-if” capability
    Need realistic systems
      E.g. not perfect memory
  On graphics, performance is deeply related to content
    Need to agree content – benchmarks
    But need to avoid the Dhrystone effect
      Some of our recent optimisations didn’t affect SPMark
  All IP suppliers simulate their own IP
    As we have all the IP, we simulate/model entire systems
  Challenge your IP provider!
Low Power By Design
  Core design for low power
    Clock domains, clock gating
    On-chip buffers and area vs. leakage power
    Is silicon free – is it just powered-up silicon that you have to pay for?
  Interconnect design for low power
    Efficient, low bandwidth = low power
  System design
    Keeping memory bandwidth low
  Content and applications
    What can programmers do to save power?
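The trade-off between clock gating and leakage can be sketched with the usual first-order CMOS power model: dynamic power is activity x capacitance x voltage squared x frequency, and static power is leakage current x voltage. The component values below are hypothetical and only show the shape of the trade-off.

```python
def dynamic_power_w(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order switching power: alpha * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def static_power_w(leakage_current_a, voltage_v):
    """Leakage power: drawn whenever the silicon is powered, clocked or not."""
    return leakage_current_a * voltage_v


# Hypothetical block: 1 nF switched capacitance, 1.0 V, 200 MHz, 5 mA leakage.
ungated = dynamic_power_w(0.2, 1e-9, 1.0, 200e6)
gated = dynamic_power_w(0.1, 1e-9, 1.0, 200e6)  # clock gating halves activity
leakage = static_power_w(5e-3, 1.0)
print(f"dynamic, ungated: {ungated * 1e3:.1f} mW")
print(f"dynamic, gated:   {gated * 1e3:.1f} mW")
print(f"static (leakage): {leakage * 1e3:.1f} mW")
```

Clock gating reduces only the dynamic term. The leakage term scales with powered-up area, which is why extra on-chip buffers are not free even when they are idle.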
It’s All About Software, Stupid!
  [Diagram labels: content ecosystem; content creation tools; native app; 2D/VG midlet; 3D midlets; media midlet; SVGt; Flash; Java execution environment; Java VM; JSR135; JSR184; JSR226; JSR234; JSR239; JSR287; JSR297; native environment; OpenKode; audio & video codecs; MIDI; 3D audio; TrustZone® framework; OS/RTOS; HAL; ARM CPU; Mali GPU; video H/W; audio accel]
  Sand or Life?
  See our demos
The Software Challenges
  Standards are good but …
    Who controls compliance?
    What about “extensions”?
    Who do you want to do the integration?
  Does compliance alone guarantee high performance?
    Particularly when working with other components
    How do you verify/validate at this level of complexity?
  We believe you want a pre-verified, integrated solution
Power and Software Design
  Some components work together better when they are designed together
    For example: avoiding data copying saves energy
  The more parts of the puzzle you are in control of, the easier this is:
    Content, Java VM, M3G2, OpenGL ES 2.0 drivers...
  It is possible to optimise the flow and remain standards-compliant
  Vendor-specific extensions are a nightmare!
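The point about avoiding data copies can be illustrated in a few lines. The sketch below contrasts handing a region of a frame to another layer as a copy with handing it over as a view onto the same storage. The frame size and function names are hypothetical; the behaviour of `bytearray` slicing and `memoryview` is standard Python.

```python
frame = bytearray(640 * 480 * 2)  # hypothetical 16-bit VGA frame, zero-filled


def copy_region(buf, start, length):
    """Slicing a bytearray allocates a new buffer and copies `length` bytes."""
    return buf[start:start + length]


def view_region(buf, start, length):
    """Slicing a memoryview shares the underlying storage; no bytes are copied."""
    return memoryview(buf)[start:start + length]


copied = copy_region(frame, 0, 1280)
viewed = view_region(frame, 0, 1280)

frame[0] = 0xFF  # the producer updates the frame in place
print("copy sees:", copied[0])  # still the old value: the copy is detached
print("view sees:", viewed[0])  # the new value: the view shares memory
```

Every copy is extra memory traffic, and so extra energy. Passing views or handles between layers of the stack is one way to optimise the flow while keeping the same interfaces.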
Cost
  Meeting demand for user experience while keeping the cost of devices down for the mass market
    What are the cost drivers?
      IP
      Silicon
      Validation
      Lost market window
      Etc.
  Obtaining more pre-verified, pre-integrated IP from one supplier will accelerate development
    Will that reduce costs/increase profits overall?
Cost vs. User Experience
  Inevitably, the user experience has to be tailored to fit the market requirements:
    Software-rendered graphics
    CPU-rendered low-resolution video
    Mali55 OpenGL ES/OpenVG-accelerated user interface
    Mali200/Sif OpenGL ES 2.0 hardware
    1080p video, H.264, VC-1 ...
  What you want are unified stacks
  What you don’t want is to redesign everything between differing platforms
Content
  Content owners are excited by the possible numbers of mobiles
  The mobile computing revolution continues:
    What is a smartphone?
    What is a mobile computer?
    How will content differ between these types of platform?
  Challenges will include portability and security
  We need to make it easy to adjust content and make it more efficient on mobile platforms
Extended Tools Offering for Developers
  RealView® System Generator
    Complete model of a platform
    Executes ARM binaries in real time
    Provides full debug visibility
    Enables visualisation of content
    Reduces cross-platform compilation issues
    Cost-effective and safe distribution model
  Performance Analysis Tools
    Profiles graphics content
    Helps identify system bottlenecks
    Enables content to be tuned to the graphics cores
Summary
  There are no magic bullets
  Sound engineering will still have great value
  Building good systems will still have great value
  Solving “the problem” requires work in a number of disciplines
  Life for suppliers of individual components gets harder
  The world still needs great software, great hardware and great tools