
1

QoS for High-Performance and Power-Efficient HD Multimedia Systems

Rob Kaye

2

Convergence is Happening – For Real

1GHz+ processor, increasingly multi-core SMP-capable CPUs

1080p HD video & graphics

Internet connectivity – Either wired, wireless or both

3

The Need for Quality of Service

Communication explosion: more masters, more functions, more data

Multiple high-performance masters competing for limited memory bandwidth

QoS employed to manage traffic flows through interconnect and memory controller

Allocate bandwidth and manage latency appropriately

Allocate any excess capacity for greatest benefit

4

Little’s Law for Queuing Latency

NT = RT * LT

where

NT = number of requests waiting ("outstanding transactions")

RT = arrival rate (bandwidth requested)

LT = latency (delay in request being completed)

Note: To achieve max theoretical bandwidth from memory system:

Replace RT with theoretical peak memory bandwidth

NT = Bandwidth * Latency

Gives the minimum number of queued outstanding transactions needed to achieve the peak theoretical bandwidth of the memory system

[Figure: Population (0 to 0.018) versus Latency/clocks, with System Latency, Static Latency and Queuing Latency marked]

http://crd.lbl.gov/~dhbailey/dhbpapers/little.pdf
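As a rough illustration of the note above, the sketch below applies Little's Law to estimate how many transactions must be in flight to sustain a given bandwidth. The function name, the 64-byte transaction size, the 130ns latency and the 2.5GB/s target are all assumptions chosen for illustration; they are not figures from the study.

```python
import math

def min_outstanding(bandwidth_bytes_per_s, latency_s, bytes_per_transaction):
    """Little's Law: NT = RT * LT, with RT expressed in transactions/second."""
    arrival_rate = bandwidth_bytes_per_s / bytes_per_transaction  # RT
    return arrival_rate * latency_s                               # NT

# Assumed example values (hypothetical, for illustration only).
nt = min_outstanding(2.5e9, 130e-9, 64)
print(f"NT = {nt:.2f}, so at least {math.ceil(nt)} transactions in flight")
```

Because a master can only hold a whole number of requests, the fractional result is rounded up.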

6

How Much Buffering is Needed?

NT = Bandwidth * Latency

Simplistically, NT = Latency / Time per transaction

If latency is 20 cycles and each burst takes 4 cycles of active data, then to maintain 100% active data cycles there must be 20 / 4 = 5 outstanding transactions (min)
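The same arithmetic can be turned into a simple utilization estimate. This is a minimal sketch that assumes a fixed latency and a fixed burst length; the function name is hypothetical, and real SDRAM behaviour (page misses, refresh, read/write turnaround) is ignored.

```python
def data_bus_utilization(outstanding, latency_cycles, burst_cycles):
    """Fraction of cycles carrying data when `outstanding` requests overlap.

    Each request occupies the data bus for burst_cycles out of every
    latency_cycles, so N overlapping requests cover N * burst / latency
    of the time, capped at 100%.
    """
    return min(1.0, outstanding * burst_cycles / latency_cycles)

# Slide example: 20 cycles of latency, 4 cycles of data per burst.
for n in range(1, 7):
    print(n, f"{data_bus_utilization(n, 20, 4):.0%}")
```

Under these assumptions, utilization climbs linearly until the fifth outstanding transaction and gains nothing from a sixth.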

[Figure: PL340 SDR SDRAMC study. Two plots against Average DMC Queue Depth (4 to 12): Total utilization and Read first latency, with series labelled Theoretical, Adjusted and Observed. Annotations: Static latency, Processing rate]

7

CPU Latency Sensitivity: Browser

Memory latencies: baseline is 130ns, then ~50ns increments up to 330ns

Measured Cached Time

Cortex-A8 768:192:192MHz 32KB-L1 256KB-L2

33% performance loss when latency rises from 130ns to 330ns

[Figure: Memory Latency Sensitivity with varying L2 size. Normalized Execution Time (0.8 to 1.8) versus Effective Memory Latency (130ns, 180ns, 230ns, 280ns, 330ns) for L2 sizes 0KB, 256KB, 512KB and 1024KB]

Averaged over three runs with different sleep values on SystemBench 45~50B cycles/run

8

Reducing CPU Latency

Make CPU high priority

Put it in highest priority group

Cache memory reduces latency seen by the master (e.g. CPU)

Reduces memory bandwidth, which reduces latency to other masters and saves power

Diminishing returns from increasing cache size

Write data can be buffered

Coherency must be observed

The latency for write traffic seen by the system is significantly reduced

Read latency reduced by prioritizing reads

9

Dealing with Latency-Critical Masters

Real-time latency-critical masters like LCD controllers

Adding latency does not affect performance

Until latency limit is reached

Increase latency tolerance by inserting additional buffering FIFO

Priority lower than CPU

Reduces the latency to CPU

If the transaction is still waiting after a time-out period

Promote to highest priority

Only higher priority than CPU if/when necessary

[Figure: Priority diagram showing CPU, Latency-Critical (with Time-out), GPU, Mem-mem DMA etc]
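The time-out promotion described on this slide can be sketched as a small arbiter. This is only an illustration of the idea: the priority values, the 50-cycle time-out and all names are assumptions, not the behaviour of any particular memory controller.

```python
from dataclasses import dataclass

# Hypothetical priority levels: larger number wins arbitration.
PRIORITY = {"dma": 0, "gpu": 0, "lcd": 1, "cpu": 2}
PROMOTED = 3          # above the CPU, used only after a time-out
LCD_TIMEOUT = 50      # cycles; assumed value

@dataclass
class Request:
    master: str
    issued_at: int

def effective_priority(req, now):
    """Static priority, except a timed-out LCD request jumps to the top."""
    if req.master == "lcd" and now - req.issued_at >= LCD_TIMEOUT:
        return PROMOTED
    return PRIORITY[req.master]

def arbitrate(pending, now):
    """Pick the highest effective priority; oldest request breaks ties."""
    return max(pending, key=lambda r: (effective_priority(r, now), -r.issued_at))

pending = [Request("cpu", 40), Request("lcd", 0), Request("gpu", 10)]
print(arbitrate(pending, now=45).master)  # LCD has waited 45 cycles
print(arbitrate(pending, now=50).master)  # LCD has waited 50 cycles
```

The LCD request only overtakes the CPU once its wait reaches the time-out, which is the "only if/when necessary" behaviour the slide asks for.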

10

Handling Batch Processing Masters (e.g. GPUs)

These devices can soak up almost unlimited bandwidth

Memory to memory DMA another example

Can swamp system with transactions

Can typically support multiple outstanding transactions

SDRAMC with page-hit detection exacerbates issue

Make these devices lowest priority

Option to increase priority to ensure a certain minimum bandwidth is obtained

[Figure: Priority diagram showing CPU, Latency Critical (with Time-out), GPU, Mem-mem DMA etc]

11

System-Level QoS Study

[Figure: system block diagram with Video, GPU, HDLCD and 2 x CPU + L2 masters, AMBA Network Interconnect NIC-301 containing Bus switch I1 and Bus switch I2, and the DMC]

12

What Bandwidth is Needed for 1080p?

Item                                          Value
Display refresh bandwidth (1920x1080, 60Hz)   497.6MB/s ≈ 500MB/s
GPU bandwidth (estimate)                      1.5GB/s
Video decode                                  approx 500MB/s
Total (no video)                              2.0GB/s
Total (with video & GPU)                      2.5GB/s

Excludes CPU and other DMA bandwidth
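The display refresh figure can be reproduced with a few lines of arithmetic. The sketch assumes 4 bytes (32 bits) per pixel, which is not stated on the slide but is the value that gives roughly the quoted 497.6MB/s; the GPU and video figures are simply the slide's estimates.

```python
def display_bandwidth_mb_s(width, height, refresh_hz, bytes_per_pixel=4):
    """Refresh bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return width * height * refresh_hz * bytes_per_pixel / 1e6

display = display_bandwidth_mb_s(1920, 1080, 60)   # 32-bit pixels assumed
gpu = 1500.0      # MB/s, estimate from the table
video = 500.0     # MB/s, approximate figure from the table

print(f"display refresh: {display:.1f} MB/s")
print(f"display + GPU: {(display + gpu) / 1000:.1f} GB/s")
print(f"display + GPU + video: {(display + gpu + video) / 1000:.1f} GB/s")
```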

13

How Important Is Interconnect to QoS?

SDRAM QoS scheme relies on there being space for QoS masters in the SDRAM queue

High outstanding transactions & high latency cause queue to fill

Stalls interconnect

Time-out measures time in SDRAMC only

Real-time masters cannot jump the queue

QoS mechanism breaks down

Interconnect needs to 'regulate' outstanding transactions

[Figure: block diagram with four masters, Interconnect, Memory Controller, format, PHY and memory]

14

Transaction Issue Rate Regulation

Little's Law

NT = RT * LT

Queue length = Arrival rate * Latency

Regulate arrival rate to control queue length & latency

Latency = Queue length / Arrival rate

Issue rate regulation sometimes known as TSPEC

From Traffic SPECification, used in networking QoS terminology

Approximates to bandwidth regulation (burst size)

Gives a 'hard' limit to max bandwidth of a master

Like a speed limit on the master
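One common way to picture an issue rate limit is a token bucket. The sketch below is a software analogy under that assumption; it is not a description of how any ARM regulator is built, and the class name and parameters are hypothetical.

```python
class IssueRateRegulator:
    """Token-bucket style limiter: at most `rate` transactions per cycle on
    average, with bursts of up to `burst` transactions."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)

    def tick(self, wants_to_issue):
        """Advance one cycle; return True if a transaction may issue."""
        self.tokens = min(self.burst, self.tokens + self.rate)
        if wants_to_issue and self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# A master that always has a request ready, limited to 1 issue per 4 cycles.
reg = IssueRateRegulator(rate=0.25, burst=1)
issued = sum(reg.tick(True) for _ in range(1000))
print(issued)
```

However hard the master pushes, it is held to the configured rate, which is the 'speed limit' behaviour described above.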

15

Outstanding Transaction Regulation (OT)

Latency = Queue length / Arrival rate

Reducing queue length (outstanding) reduces Latency

Regulate number of outstanding transactions to control SDRAMC queue

Avoid over-regulation as that could affect SDRAM efficiency

Nicely adaptive: regulated masters get additional bandwidth when system is lightly loaded, with no hard limit

[Figure: Queue depth vs Outstanding Transactions. % of total cycles (0 to 25) versus Queue depth (0 to 18) for DMC Queue Fill and DMC Outstanding Transactions]
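Outstanding transaction regulation amounts to a counter with a ceiling. The following is a minimal sketch with hypothetical names and an illustrative limit of 3; it shows the gating logic only, not a hardware implementation.

```python
class OutstandingRegulator:
    """Blocks new requests while `limit` transactions are still in flight."""

    def __init__(self, limit):
        self.limit = limit
        self.in_flight = 0

    def try_issue(self):
        """Issue a request if a slot is free; return whether it was issued."""
        if self.in_flight < self.limit:
            self.in_flight += 1
            return True
        return False

    def complete(self):
        """Record a returning response, freeing one slot."""
        assert self.in_flight > 0, "completion without a matching issue"
        self.in_flight -= 1

reads = OutstandingRegulator(limit=3)
results = [reads.try_issue() for _ in range(4)]  # fourth request must wait
reads.complete()                                  # one response returns
results.append(reads.try_issue())                 # now a slot is free
print(results)
```

Since a new request is admitted as soon as a response returns, the master speeds up automatically when latency falls, which is why this scheme has no hard bandwidth limit.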

16

Latency Regulation

Controlling the 3rd variable in Little’s Law

Latency LT

Cannot directly control LT

Dynamically adjust priority in closed loop

Set lowest priority to meet latency requirement

Adaptive to lightly loaded systems

Masters get more bandwidth when lightly loaded

Requires co-operation from slave (memory controller) to prioritize

Default low priority

Measure latency

Compare with target

Increase priority if (latency > target) and vice versa
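The measure, compare and adjust loop can be sketched as follows. The one-step adjustment, the 0 to 15 priority range and the class name are assumptions made for the example; the slide does not specify step size or range.

```python
class LatencyRegulator:
    """Closed-loop priority control toward a target latency."""

    def __init__(self, target_latency, min_priority=0, max_priority=15):
        self.target = target_latency
        self.min_priority = min_priority
        self.max_priority = max_priority
        self.priority = min_priority      # default low priority

    def update(self, measured_latency):
        """Step priority up when too slow, down when there is slack."""
        if measured_latency > self.target:
            self.priority = min(self.max_priority, self.priority + 1)
        elif measured_latency < self.target:
            self.priority = max(self.min_priority, self.priority - 1)
        return self.priority

reg = LatencyRegulator(target_latency=100)
trace = [reg.update(lat) for lat in (80, 120, 130, 110, 90, 100)]
print(trace)
```

The priority settles at the lowest level that still meets the target, leaving higher priorities free for other masters.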

17

QoS Validation with VPE

Used VPE (Verification and Performance Exploration)

VPE executes much faster than RTL

Reduced 2M cycle test-bench simulation from 4 hours to 4 minutes

Statistically matches pattern of traffic

18

Performance Without QoS-301

Bandwidth per master is calculated for:

GPU active phase

GPU idle phase

Aggregate (total) for frame

Results for unconstrained system

GPU was active for 55% of the frame, when it achieved bandwidth of

2734MB/s

1500MB/s overall

CPU achieved bandwidth of

32MB/s when GPU was active

163 MB/s when GPU was idle (5x)

91 MB/s overall
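The overall figures are time-weighted averages of the active-phase and idle-phase bandwidths. The sketch below checks this, assuming the GPU moves no data while idle (the slide does not say so explicitly); the function name is hypothetical.

```python
def overall_bandwidth(active_fraction, bw_active, bw_idle):
    """Time-weighted average over a frame (MB/s)."""
    return active_fraction * bw_active + (1.0 - active_fraction) * bw_idle

gpu_overall = overall_bandwidth(0.55, 2734, 0)    # GPU assumed silent when idle
cpu_overall = overall_bandwidth(0.55, 32, 163)
print(f"GPU overall: {gpu_overall:.0f} MB/s")
print(f"CPU overall: {cpu_overall:.0f} MB/s")
```

Both results land close to the 1500MB/s and 91MB/s quoted above; the small GPU difference is consistent with rounding of the 55% figure.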

19

Issue Rate Regulation

Regulate GPU RT (transaction rate) to:

2.4 GB/s (vs 2.66 GB/s unregulated)

Results

GPU bandwidth

Active for 61% of frame (+11%)

2441MB/s active (-11%)

1500MB/s overall (+0%)

Maximum NT = 2.65 (measured)


CPU bandwidth

119MB/s when GPU active (+279%)

163 MB/s when GPU idle (+0%)

136 MB/s overall (+50%)

Factor improvement over unconstrained case

+50% CPU bandwidth

20

Outstanding Transactions (OT) Regulation

Regulate GPU number of transactions at input to system:

3 outstanding read transactions

1 outstanding write transaction

Results

GPU bandwidth

Active for 56% of frame (+1%)

2664MB/s active (-3%)

1500MB/s overall (+0%)


CPU bandwidth

99MB/s when GPU active (+215%)

163 MB/s when GPU idle (+0%)

127 MB/s overall (+40%)

Suffered from lack of granularity in OT level

Factor improvement over unconstrained case

CPU bandwidth increased by 40%

21

Fractional Outstanding Regulation

Regulating maximum outstanding transactions often preferable to regulating bandwidth

More adaptive to loading

Integer NT provided too coarse-grained control – Needed ~2.5 OT

Added average number of outstanding transactions to QoS-301

By varying duty cycle, e.g. NT = 2.4

Finer degree of control

Useful when many low-bandwidth masters

Each may only require NT <<1
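A fractional limit can be pictured as switching between two integer limits on a duty cycle. The sketch below is one plausible reading of that idea; the 10-cycle period and the function name are assumptions, and the slides do not describe how QoS-301 actually schedules the switching.

```python
def fractional_limit(cycle, low, high_fraction, period=10):
    """Limit is low+1 for the first part of each period, otherwise low."""
    high_cycles = round(high_fraction * period)
    return low + 1 if cycle % period < high_cycles else low

# Target NT = 2.4: allow 3 outstanding for 40% of the time, 2 otherwise.
limits = [fractional_limit(c, low=2, high_fraction=0.4) for c in range(1000)]
print(sum(limits) / len(limits))
```

Averaged over many periods, the permitted number of outstanding transactions equals the fractional target even though each instantaneous limit is an integer.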

22

Latency Reduction with OT Regulation

Unconstrained system

Large number of queuing transactions (NT) from GPU

NT = 14 (read), 8 (write)

Little or no benefit to GPU: DMC cannot supply more bandwidth in this example system

Queuing latency affects CPU bandwidth

NT = 0.74 (read), 0.21 (write)

CPU cannot issue more simultaneous requests

Regulated system

NT sufficient for GPU bandwidth

Queuing latency (LT) reduced

CPU gains BW

Fewer request buffers required

23

OT versus TSPEC Regulation

Consider what happens when system bandwidth requirement reduces

System queuing latency reduces

[Figure: two panels, Outstanding Transaction Regulation (OT) and Issue Rate Regulation (TSPEC), each plotting RT (bandwidth) and LT (queuing latency) over an axis from 40 to 100. Callouts: "Adaptive", "Bandwidth fixed by definition", "Bandwidth doesn't degrade as system workload increases"]
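The contrast between the two schemes follows directly from Little's Law rearranged as RT = NT / LT. The sketch below uses made-up numbers (latency in cycles, bandwidth in transactions per cycle) and hypothetical function names to show how each regulated master responds as queuing latency falls.

```python
def ot_bandwidth(nt_limit, latency):
    """OT regulation: RT = NT / LT, so bandwidth rises as latency falls."""
    return nt_limit / latency

def tspec_bandwidth(rate_limit, nt_available, latency):
    """Issue-rate regulation: capped at rate_limit regardless of latency."""
    return min(rate_limit, nt_available / latency)

# Assumed numbers: latency in cycles, bandwidth in transactions per cycle.
for latency in (100, 80, 60, 40):
    ot = ot_bandwidth(3, latency)
    ts = tspec_bandwidth(0.03, 8, latency)
    print(f"LT={latency:3d}  OT RT={ot:.3f}  TSPEC RT={ts:.3f}")
```

With these inputs the OT-regulated master picks up the spare capacity as latency drops, while the rate-limited master stays at its configured ceiling.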

24

How QoS-301 is Inserted into NIC-301

The QoS-301 hardware can be configured at any NIC-301 slave interface (ASIB) or internal interface block (IB) using AMBA® Designer

25

QoS Techniques and Their Applications

QoS    Min Bandwidth    Max Bandwidth    Max Latency    Adaptive?

Issue Rate Regulation    x

Latency Regulation via priority

Outstanding Transaction Regulation

These techniques can be used in isolation or together in combination

26

Future technology challenges with QoS

Cortex™-A15 and ARM's next generation CoreLink™ system IP and Mali™ graphics bring higher performance and new technology

AMBA 4 Phase 2 in 2011 brings coherency, barriers and virtualisation

ARM is developing roadmap interconnect products for release in 2011

Network interconnect for efficient connectivity with packetization, clock management and QoS extensions

High performance coherent interconnect

QoS is critical to system performance, bandwidth and latency

New technologies including virtual networks are in development

27

QoS for Cortex-A15 and Mali

Optimized non-blocking interconnect with

Cache coherency up to 8 Cortex-A15 cores

End to end QoS

Lowest latency for CPU

Highest bandwidth for GPU

New high efficiency memory controller

1/2/4 channels DDR3 or LPDDR2 up to 1066MHz

System MMU for I/O virtualization

Complements Cortex-A15 virtualization extensions

ARM is building systems with processor, graphics, interconnect and memory to test QoS for real applications

[Figure: system block diagram with two Quad Cortex-A15 clusters, GIC-400, Mali 3D Graphics, LCD, Video and I/O device behind MMU-400 blocks, AMBA 4 Cache Coherent Interconnect CCI-400, AXI Network Interconnect NIC-400 (two instances) with Slaves, and Dynamic Memory Controller DMC-400 with two DDR3/LPDDR2 PHYs]

28

Summary

Little's Law shows there are three ways to regulate latency with QoS

Outstanding transactions

Issue Rate

Latency – via dynamic priority

ARM CoreLink NIC-301 with Advanced Quality of Service QoS-301 supports all three singly or in combination

Simulation tuning enabled by fast turn-around of VPE simulations

Programmable for tuning and optimization in silicon

Latency regulation supported in conjunction with DMC-400

QoS is an important part of the CoreLink system IP mission to maximize performance and power efficiency

LET 'ER ROLL!

29

Thank You

Please visit www.arm.com for ARM related technical details

For any queries contact < [email protected] >
