LIBRA: Multi-mode On-Chip Network Arbitration for Locality-Oblivious Task Placement

Gwangsun Kim

Computer Science Department, Korea Advanced Institute of Science and Technology

2011. 12. 20.

- Master’s degree defense -

Table of Contents

• Motivation

• LIBRA

– Introduction to Probabilistic Distance-based Arbitration

– Virtual Contention-based Arbitration

– Hybrid Arbitration

• Evaluation

• Conclusions


Motivation

[Figure: number of cores per chip (log scale) versus year, 1975 to 2015, for processors ranging from the 8086 to Magny-Cours. Data collected by C. Batten, Y. Pan.]

• The on-chip network (OCN) is an important shared resource in a chip multiprocessor (CMP).

• Fair allocation of this shared resource is needed.

Motivation

• Experiment: on a 16-core CMP, run one SPEC benchmark together with 15 copies of a memory-intensive microbenchmark to create a hotspot. The location of the SPEC benchmark is varied.

• A round-robin (RR) arbiter results in significant unfairness.

• Why does fairness in the OCN matter?

– Performance is hard to predict (SLA).

– It complicates OS design.

– Parallel applications slow down.

• This work proposes LIBRA, OCN support for locality-oblivious task placement.

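[Figure: IPC of the SPEC benchmark (0 to 0.4) for each (X, Y) placement on the 4x4 mesh, with the hotspot memory controller (MC) marked. IPC varies by up to 12x depending on placement.]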

Overview of LIBRA

• Locality-Oblivious Bandwidth Regulatory Arbiter

• Libra: the zodiac constellation that symbolizes balance.

• Leverages probabilistic distance-based arbitration (MICRO’10)

• Consists of two mechanisms:

1. Virtual contention arbitration (VCA): addresses unfairness.

2. Hybrid arbitration: addresses the high latency problem.

• Combination of 1 and 2: multi-mode arbitration


Probabilistic Distance-based Arbitration (PDBA)

• Proposed to provide fairness in on-chip networks.

1. Probabilistic arbitration

2. Weight is multiplied by contention degree
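The two ingredients above can be sketched in a few lines of Python. This is only an illustration, not the arbiter from the MICRO’10 paper: the function names `pdba_pick` and `scale_weight` are hypothetical, and "contention degree" is assumed here to mean the number of requests competing for the same output port.

```python
import random


def pdba_pick(weights, rng=random):
    """Return the index of the winner, chosen with probability weight / sum(weights)."""
    threshold = rng.random() * sum(weights)
    running = 0
    for index, weight in enumerate(weights):
        running += weight
        if threshold < running:
            return index
    return len(weights) - 1  # guard against floating-point round-off


def scale_weight(weight, contention_degree):
    """Multiply a packet's weight by the contention degree seen at this router.

    Assumption: contention degree = number of requests for the output port.
    """
    return weight * contention_degree


if __name__ == "__main__":
    # A packet that meets 2-way contention at two routers: 1 -> 2 -> 4.
    weight = 1
    for degree in (2, 2):
        weight = scale_weight(weight, degree)
    print(weight)  # prints 4

    # Against a fresh packet of weight 1, it should win about 4 times out of 5.
    rng = random.Random(0)
    trials = 10000
    wins = sum(pdba_pick([weight, 1], rng) == 0 for _ in range(trials))
    print(wins / trials)
```

A packet that has travelled farther has usually met more contention, so it carries a larger weight and is more likely to win, which is how distance enters the arbitration.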

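[Figure: packets traversing Router 0, Router 1, and Router 2 from their source queues. Each packet starts with weight 1, and the weight is multiplied (x1 or x2) at each router, giving weights of 1, 2, and 4.]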

Limitation of Real Contention-based Arbitration

• Real contention: two or more requests contend for the same output.

• Real contention-based arbitration (RCA):

– Non-contention is not accounted for.

– In many cases there is no real contention, which leads to unfairness.

[Figure: example with packets of weights 1, 1, 2, 4, and 4 and two annotated probabilities, each P = 0.266.]

Unfair bandwidth allocation!

Virtual Contention-based Arbitration (VCA)

• Considers historical non-contention in future arbitration.

• Two modes:

– Real contention mode

– Virtual contention mode

• Virtual contention mode example:

[Figure: a router with two input ports. One port has last weight 4 and priority counter 0; the other has last weight 1 and priority counter 0. A virtual contention occurs, and the priority counter is increased by 4.]

Virtual Contention-based Arbitration Cont’d

• Real contention mode example:

• If the priority counters of all ports are equal, fall back to PDBA.
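A minimal sketch of real contention mode, under stated assumptions, might look as follows. The slide only says that the higher priority counter wins, that the counter is then decremented, and that PDBA is used when all counters are equal. Everything else here is an assumption made for illustration: the function name `real_contention_pick`, decrementing the winner's counter by 1, and breaking partial ties among the highest counters with a weighted random pick.

```python
import random


def real_contention_pick(requesting_ports, priority, weights, rng=random):
    """Pick a winner among contending ports and update its priority counter.

    requesting_ports: indices of the ports that have a request.
    priority: list of per-port priority counters (modified in place).
    weights: list of per-port packet weights, used only for the PDBA fallback.
    """
    top = max(priority[p] for p in requesting_ports)
    candidates = [p for p in requesting_ports if priority[p] == top]
    if len(candidates) == 1:
        winner = candidates[0]
    else:
        # Equal counters: weighted random choice, as in PDBA.
        winner = rng.choices(candidates, weights=[weights[p] for p in candidates])[0]
    if len(candidates) < len(requesting_ports):
        # The winner beat at least one port on priority.
        # Assumption: the decrement is by 1 and applies to the winner.
        priority[winner] -= 1
    return winner


if __name__ == "__main__":
    # Counters from the slide example: 4 versus 0. Packet weights are hypothetical.
    priority = [4, 0]
    winner = real_contention_pick([0, 1], priority, weights=[1, 2])
    print(winner, priority)  # prints 0 [3, 0]
```

With counters 4 and 0 the outcome is deterministic, so the random generator is never consulted in this example.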

[Figure: two input ports under real contention. The first port has last weight 4 and priority counter 4; the second has last weight 1 and priority counter 0. Since 4 > 0, the first port wins, and the priority counter is decremented.]

Hybrid Arbiter

• VCA lengthens the router critical path, which lowers the clock frequency.

• Observation: fairness matters only at high load.

– At low load, there is little contention, so RR is fine.

– At high load, there is much contention and its impact is large. VCA is needed, but packets are queued up in the buffers, which leaves more time for processing.

[Figure: at low load, RR is used and has little impact on fairness; at high load, pre-calculation is done while packets wait and VCA provides fairness.]

Hybrid Arbiter Cont’d

• If there was no chance for pre-calculation, use RR.

• Use VCA whenever possible.


LIBRA: Multi-mode Arbitration

| Contention | Hybrid: Simple | Hybrid: Complex |
|---|---|---|
| Yes | Round-robin | Virtual contention arbiter (VCA) in real contention mode |
| No | Virtual contention arbiter (VCA) in virtual contention mode | Virtual contention arbiter (VCA) in virtual contention mode |

• Operates in one of multiple modes depending on contention type and load.

– Contention type: number of requests for the output port.

– Load: whether pre-calculation has been done or not.
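The mode selection can be written as a small decision function. This is a sketch of one reading of the slide, not the hardware logic itself: `Mode` and `select_mode` are hypothetical names, "contention" is taken to mean two or more requests for the output port, and the "Simple" column is taken to be the case where no pre-calculation result is available.

```python
from enum import Enum


class Mode(Enum):
    ROUND_ROBIN = "round-robin"
    VCA_REAL = "VCA, real contention mode"
    VCA_VIRTUAL = "VCA, virtual contention mode"


def select_mode(num_requests, precalc_done):
    """Choose the arbitration mode for one output port in one cycle.

    num_requests: number of requests for the output port (contention type).
    precalc_done: whether pre-calculation has been done (load).
    """
    if num_requests == 0:
        return None  # nothing to arbitrate
    if num_requests == 1:
        # No real contention: record it as virtual contention.
        return Mode.VCA_VIRTUAL
    # Real contention: use VCA if the pre-calculated values exist, else RR.
    return Mode.VCA_REAL if precalc_done else Mode.ROUND_ROBIN


if __name__ == "__main__":
    for requests, precalc in [(1, False), (2, False), (2, True)]:
        print(requests, precalc, select_mode(requests, precalc))
```

Round-robin is only ever chosen when requests collide before pre-calculation could complete, which matches the rule "use VCA whenever possible".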


Methodology

Synthetic traffic simulation parameters

| Parameter | Value |
|---|---|
| Network size | 64 |
| Topology | 8x8 2D mesh |
| Buffers | 16 flits per VC |
| Virtual channels | 1 |
| Routing | XY routing |
| Router latency | 3 cycles |
| Packet size | Bimodal (50% 1 flit and 50% 4 flits) |

GEMS simulation parameters

| Parameter | Value |
|---|---|
| Processor | 16 out-of-order cores (2 GHz, 4-way issue, 64-entry ROB) |
| L1 cache | 32 KB, 2-way |
| L2 cache | 512 KB, 32-way, block size of 64 B |
| Memory controller | Closed-page mode, 2 controllers |
| Topology | 4x4 2D mesh |
| Buffers | 6 flits per VC |
| Virtual channels | 4 |
| Flit size | 16 bytes |

• Area and timing evaluation: Synopsys Design Compiler and IC Compiler.

• Synthetic simulation using the cycle-accurate Booksim simulator.

• SPEC CPU 2006 application and microbenchmark simulation using the cycle-accurate GEMS + Booksim simulator.

Timing and Area

• Baseline (RR): 1.4 GHz and 0.07 mm²

• LIBRA reduces latency significantly while introducing low area overhead.

[Figure: latency and area of PDBA [MICRO’10], PDBA (approximated), VCA, and LIBRA, normalized to the baseline; the axis spans 1 to 4.]

Synthetic Traffic Evaluation

• Network stability and throughput

[Figure: one panel per traffic pattern: uniform random, tornado, and bit complement.]

Support for Locality-oblivious Task Placement

• Configuration

– 14 copies of the memory-intensive microbenchmark.

– SPEC benchmark placement: closest to or farthest from the hotspot.

• LIBRA reduces the maximum slowdown by 2.7x and 1.8x compared to RR and AGE, respectively.

[Figure: slowdown of the farthest over the closest location (log scale, 0.01 to 100) for bzip2, hmmer, gcc, xalancbmk, gobmk, milc, sphinx3, povray, cactusADM, dealII, namd, and their harmonic mean (HARMEAN), under RR, PDBA, PDBA (4-bit approximation), VCA, LIBRA, and AGE.]

Analysis on Unfairness of AGE

• AGE can be unfair in closed-loop evaluation.

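$d$: buffer depth

$N_i$: number of in-flight packets from node $i$

Assumptions:

- All nodes send packets to the MC
- Ideal age-based arbitration
- Steady state

$$
\begin{aligned}
N_1 &= d + \frac{d N_1}{N_1+N_2+N_3+N_4} \\
N_2 &= d + \frac{d N_2}{N_2+N_3+N_4} + \frac{d N_2}{N_1+N_2+N_3+N_4} \\
N_3 &= d + \frac{d N_3}{N_3+N_4} + \frac{d N_3}{N_2+N_3+N_4} + \frac{d N_3}{N_1+N_2+N_3+N_4} \\
N_4 &= d + d + \frac{d N_4}{N_3+N_4} + \frac{d N_4}{N_2+N_3+N_4} + \frac{d N_4}{N_1+N_2+N_3+N_4}
\end{aligned}
$$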
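The four steady-state equations can be solved exactly with a short script. The sketch below assumes that the last term of the $N_4$ equation carries $N_4$, like the corresponding terms of the other three equations. It relies on the observation that adding the four equations makes every shared-buffer fraction sum to 1, so $N_1+N_2+N_3+N_4 = 8d$. The function name `solve_in_flight` is hypothetical.

```python
from fractions import Fraction


def solve_in_flight(d=1):
    """Solve the four steady-state equations exactly; returns [N1, N2, N3, N4]."""
    d = Fraction(d)
    remaining = 8 * d     # N1 + N2 + N3 + N4, from summing the four equations
    shared = Fraction(0)  # running sum of d / (sum of N over nodes sharing a buffer)
    counts = []
    for _ in range(3):
        # For nodes 1 to 3: N_i = d + N_i * shared, hence N_i = d / (1 - shared).
        shared += d / remaining
        n_i = d / (1 - shared)
        counts.append(n_i)
        remaining -= n_i  # drop node i from the next, smaller denominator
    counts.append(remaining)  # N4 is what is left of the total
    return counts


if __name__ == "__main__":
    counts = solve_in_flight(1)
    print([str(n) for n in counts])
    print([str(n / counts[0]) for n in counts])  # in-flight packets relative to node 1
```

For $d = 1$ this gives $N_1 = 8/7$, $N_2 = 48/35$, $N_3 = 64/35$, and $N_4 = 128/35$, that is, ratios of 1 : 1.2 : 1.6 : 3.2 relative to node 1. If throughput is taken to be proportional to $N_i$ (an assumption, not stated on the slide), the farthest node receives over three times the bandwidth of the closest one.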

[Figure: relative throughput (0 to 3.5) of nodes P1 to P4, analysis versus simulation.]

Cost Comparison of QoS Mechanisms

• Area overhead comparison:

[Figure: additional area overhead per node (um², 0 to 200,000) for PDBA [MICRO’10], GSF [ISCA’08], LOFT [MICRO’10], PVC [MICRO’09], and LIBRA.]

LIBRA achieves 38% lower area overhead than PVC!

Conclusions

• Impact of task placement on performance: up to 30x with RR.

• This work proposes LIBRA, a multi-mode arbitration scheme.

– VCA for providing global fairness.

– Hybrid arbitration for reducing latency overhead.

• LIBRA can support locality-oblivious task placement.

• Analysis of the unfairness of age-based arbitration.

• LIBRA has 38% lower area overhead compared to PVC.

Q&A

THANK YOU!


Hybrid Arbiter Cont’d

[Figure: arbiter datapath split into a pre-calculation stage (PC) and an arbitration stage (SAc), with a random number, weights $w_0$ and $w_1$, multipliers, adders, comparators, and grant outputs $g_0$ and $g_1$.]
