LIBRA: Multi-mode On-Chip Network Arbitration for Locality-Oblivious Task Placement
Gwangsun Kim
Computer Science Department
Korea Advanced Institute of Science and Technology
2011. 12. 20.
- Master’s degree defense -
Table of Contents
• Motivation
• LIBRA
– Introduction to Probabilistic Distance-based Arbitration
– Virtual Contention-based Arbitration
– Hybrid Arbitration
• Evaluation
• Conclusions
Motivation
[Data collected by C. Batten, Y. Pan]
[Figure: number of cores per chip (log scale) versus year, 1975 to 2015, with data points ranging from the 8086 to Magny-Cours and including MIT-RAW, Niagara, Cell, TILE64, Teraflops, and Larrabee.]
• The on-chip network (OCN) is an important shared resource in a CMP.
• Fair allocation of this shared resource is needed.
Motivation
• Experiment: 16-core CMP.
– Run a SPEC benchmark alongside 15 copies of a memory-intensive microbenchmark to create a hotspot.
– The location of the SPEC benchmark is varied.
• A round-robin arbiter results in significant unfairness.
• Why does fairness in the OCN matter?
– Performance is hard to predict (SLA).
– OS design becomes more complicated.
– Parallel applications slow down.
• This work proposes LIBRA, OCN support for locality-oblivious task placement.
[Figure: IPC of the SPEC benchmark (0 to 0.4) as a function of its X and Y location (0 to 3), with the hotspot MC marked. IPC varies by up to 12x.]
Overview of LIBRA
• Locality-Oblivious Bandwidth Regulatory Arbiter
• Libra: the zodiac constellation that symbolizes balance.
• Leverages probabilistic distance-based arbitration (MICRO’10)
• Consists of two mechanisms:
1. Virtual contention-based arbitration (VCA)
- Addresses unfairness
2. Hybrid arbitration
- Addresses the high-latency problem
• Combination of 1 and 2: multi-mode arbitration
Probabilistic Distance-based Arbitration (PDBA)
• Proposed to provide fairness in on-chip networks.
1. Probabilistic arbitration
2. Weight is multiplied by contention degree
[Figure: PDBA example with source queues feeding Router 0, Router 1, and Router 2. Packet weights (1, 2, 4) are multiplied by x1 or x2 at each router.]
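The two PDBA ingredients listed above can be sketched in a few lines. This is a minimal illustration, not the MICRO’10 implementation: the function names are hypothetical, and the sketch assumes that a request's grant probability is proportional to its weight.

```python
import random


def accumulate_weight(weight, contention_degree):
    """Multiply a packet's weight by the contention degree seen at a router."""
    return weight * contention_degree


def win_probabilities(weights):
    """ASSUMPTION: grant probability is proportional to weight."""
    total = sum(weights)
    return [w / total for w in weights]


def pdba_arbitrate(weights, rng=random):
    """Pick the index of the winning request with a weighted random draw."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# A packet that meets contention degree 2 at two routers carries weight 4.
far_packet = accumulate_weight(accumulate_weight(1, 2), 2)
near_packet = 1
probabilities = win_probabilities([far_packet, near_packet])
```

Under this assumption, the packet that has traveled through more contention wins more often, which is how distance is compensated.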
Limitation of Real Contention-based Arbitration
• Real contention: two or more requests contend at the same time.
• Real contention-based arbitration (RCA):
– Non-contention is not accounted for.
– In many cases there is no real contention, which leads to unfairness.
[Figure: RCA example with packet weights 2, 4, 4, 1, and 1; the two grant probabilities shown are both 0.266.]
Unfair bandwidth allocation!
Virtual Contention-based Arbitration (VCA)
• Considers historical non-contention in future arbitration.
• Two modes
– Real contention mode
– Virtual contention mode
• Virtual contention mode example:
[Figure: one port has last weight 4 and priority counter 0; the other has last weight 1 and priority counter 0. On virtual contention, the priority counter is increased by 4.]
Virtual Contention-based Arbitration Cont’d
• Real contention mode example:
• If the priorities of all ports are equal, fall back to PDBA.
[Figure: real contention between a port with last weight 4 and priority counter 4 and a port with last weight 1 and priority counter 0. Since 4 > 0, the first port wins; the priority counter is then decremented.]
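The two VCA modes can be sketched as follows. The slides do not fully specify the update rules, so this is only one plausible reading under loudly stated assumptions: `PortState`, `virtual_contention`, and `real_contention` are hypothetical names; an idle port is credited with its own last weight in virtual contention mode; the counter is decremented by 1 after a priority win; and ties among the highest-priority ports are broken by a PDBA-style weighted draw.

```python
import random
from dataclasses import dataclass


@dataclass
class PortState:
    last_weight: int   # weight of the last packet seen on this input port
    priority: int = 0  # priority counter


def virtual_contention(ports, granted, weight):
    """A lone request on port `granted` passes without real contention.

    ASSUMPTION: every idle port is credited with its own last weight.
    """
    for index, port in enumerate(ports):
        if index != granted:
            port.priority += port.last_weight
    ports[granted].last_weight = weight
    return granted


def real_contention(ports, requests, rng=random):
    """Arbitrate among `requests` (input-port index -> packet weight)."""
    top = max(ports[i].priority for i in requests)
    leaders = [i for i in requests if ports[i].priority == top]
    # Equal priorities: fall back to a PDBA-style weighted random draw.
    winner = rng.choices(leaders, weights=[requests[i] for i in leaders])[0]
    if len(leaders) < len(requests):
        # ASSUMPTION: a win by priority costs one unit of priority.
        ports[winner].priority -= 1
    for i, w in requests.items():
        ports[i].last_weight = w
    return winner


# Slide example: last weights 4 and 1, both counters start at 0.
ports = [PortState(last_weight=4), PortState(last_weight=1)]
virtual_contention(ports, granted=1, weight=1)
counter_after_virtual = ports[0].priority
winner = real_contention(ports, {0: 2, 1: 1})
```

With these assumptions the sketch reproduces the slide example: the counter of the port with last weight 4 rises to 4 during virtual contention, and that port then wins the real contention because 4 > 0.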
Hybrid Arbiter
• VCA lengthens the router critical path → lower clock frequency.
• Observation: fairness matters only at high load.
– At low load, there is little contention → RR is fine.
– At high load, there is heavy contention and its impact is large → VCA is needed. Because packets are queued up in the buffer, there is more time for processing.
[Figure: at low load, RR has little impact on fairness; at high load, VCA provides fairness (do pre-calculation).]
Hybrid Arbiter Cont’d
• If there was no chance for pre-calculation, use RR.
• Use VCA whenever possible.
LIBRA: Multi-mode Arbitration
Contention   Hybrid: Simple                    Hybrid: Complex
Yes          Round-robin                       VCA in real contention mode
No           VCA in virtual contention mode (both columns)

• Operates in one of multiple modes depending on contention type and load.
– Contention type: # of requests for the output port
– Load: whether pre-calculation is done or not
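The mode selection in the table can be written as a small decision function. This is a sketch under two assumptions: the "Simple" column corresponds to the case where pre-calculation was not done (so RR is used), and a single request for the output port means there is no real contention. `Mode` and `select_mode` are hypothetical names.

```python
from enum import Enum


class Mode(Enum):
    ROUND_ROBIN = "round-robin"
    VCA_REAL = "VCA in real contention mode"
    VCA_VIRTUAL = "VCA in virtual contention mode"


def select_mode(num_requests, precalc_done):
    """Choose the arbitration mode for one output port in one cycle."""
    if num_requests < 1:
        raise ValueError("at least one request is required")
    if num_requests == 1:
        # No real contention: only the VCA bookkeeping is updated.
        return Mode.VCA_VIRTUAL
    # Real contention: use VCA only if its inputs were pre-calculated.
    return Mode.VCA_REAL if precalc_done else Mode.ROUND_ROBIN


low_load = select_mode(num_requests=2, precalc_done=False)
high_load = select_mode(num_requests=2, precalc_done=True)
```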
Methodology
Synthetic traffic simulation parameters
Parameter           Value
Network size        64
Topology            8x8 2D mesh
Buffers             16 flits per VC
Virtual channels    1
Routing             XY routing
Router latency      3 cycles
Packet size         Bimodal (50% 1 flit and 50% 4 flits)

GEMS simulation parameters
Parameter           Value
Processor           16 out-of-order cores (2 GHz, 4-way issue, 64-entry ROB)
L1 cache            32 KB, 2-way
L2 cache            512 KB, 32-way, block size of 64 B
Memory controller   Closed-page mode, 2 controllers
Topology            4x4 2D mesh
Buffers             6 flits per VC
Virtual channels    4
Flit size           16 bytes
• Area and timing evaluation: Synopsys Design Compiler and IC Compiler.
• Synthetic simulation using the cycle-accurate Booksim simulator.
• SPEC CPU 2006 application and microbenchmark simulation using the cycle-accurate GEMS + Booksim simulator.
Timing and Area
• Baseline (RR): 1.4 GHz and 0.07 mm².
• LIBRA reduces latency significantly while introducing low area overhead.
[Figure: latency and area of PDBA [MICRO’10], PDBA (approximated), VCA, and LIBRA, normalized to the baseline (axis range 1 to 4).]
Synthetic Traffic Evaluation
• Network stability and throughput
[Figure: three panels, one per traffic pattern: uniform random, tornado, and bitcomp.]
Support for Locality-oblivious Task Placement
• Configuration
– 14 copies of a memory-intensive microbenchmark.
– SPEC benchmark placement: closest to or farthest from the hotspot.
• LIBRA reduces the maximum slowdown by 2.7x and 1.8x compared to RR and AGE, respectively.
[Figure: slowdown of the farthest over the closest location (log scale, 0.01 to 100) for bzip2, hmmer, gcc, xalancbmk, gobmk, milc, sphinx3, povray, cactusADM, dealII, namd, and their harmonic mean, comparing RR, PDBA, PDBA (4-bit approximation), VCA, LIBRA, and AGE.]
Analysis on Unfairness of AGE
• AGE can be unfair in closed-loop evaluation.
$d$: buffer depth
$N_i$: # of in-flight packets from node $P_i$

Assumptions:
- All nodes send packets to the MC
- Ideal age-based arbitration
- Steady state

\[
\begin{aligned}
N_1 &= d + \frac{d N_1}{N_1+N_2+N_3+N_4} \\
N_2 &= d + \frac{d N_2}{N_2+N_3+N_4} + \frac{d N_2}{N_1+N_2+N_3+N_4} \\
N_3 &= d + \frac{d N_3}{N_3+N_4} + \frac{d N_3}{N_2+N_3+N_4} + \frac{d N_3}{N_1+N_2+N_3+N_4} \\
N_4 &= d + d + \frac{d N_4}{N_3+N_4} + \frac{d N_4}{N_2+N_3+N_4} + \frac{d N_4}{N_1+N_2+N_3+N_4}
\end{aligned}
\]

[Figure: relative throughput (0 to 3.5) of nodes P1 to P4, analysis versus simulation.]
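The steady-state model above can be solved exactly, which makes the unfairness concrete. The sketch below is an illustration, not the thesis code. It assumes the last term of the N4 equation carries N4, following the pattern of the other three equations, and it uses the fact that summing the four equations gives N1 + N2 + N3 + N4 = 8d. The function name `solve_age_model` is hypothetical.

```python
from fractions import Fraction


def solve_age_model(d=Fraction(1)):
    """Solve for (N1, N2, N3, N4) given buffer depth d.

    Equations (S = N1+N2+N3+N4, T = N2+N3+N4, U = N3+N4):
      N1 = d + d*N1/S
      N2 = d + d*N2/T + d*N2/S
      N3 = d + d*N3/U + d*N3/T + d*N3/S
      N4 = d + d + d*N4/U + d*N4/T + d*N4/S   (ASSUMED last term uses N4)
    """
    s = 8 * d                          # sum of all four equations
    n1 = d / (1 - d / s)               # N1 appears only in the S term
    t = s - n1
    n2 = d / (1 - d / t - d / s)
    u = t - n2
    n3 = d / (1 - d / u - d / t - d / s)
    n4 = u - n3                        # whatever remains belongs to node 4
    return n1, n2, n3, n4


n1, n2, n3, n4 = solve_age_model()
relative = [n / n1 for n in (n1, n2, n3, n4)]
```

Under these assumptions the in-flight packet counts come out in the ratio 5 : 6 : 8 : 16, so node P4 holds 3.2 times as many in-flight packets as node P1, regardless of the buffer depth.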
Cost Comparison of QoS Mechanisms
• Area overhead comparison:
[Figure: additional area overhead per node (um², 0 to 200,000) for PDBA [MICRO’10], GSF [ISCA’08], LOFT [MICRO’10], PVC [MICRO’09], and LIBRA.]
LIBRA achieves 38% lower area overhead (compared to PVC)!
Conclusions
• Impact of task placement on performance: up to 30x with RR.
• This work proposes LIBRA, a multi-mode arbitration scheme.
– VCA for providing global fairness.
– Hybrid arbitration for reducing latency overhead.
• LIBRA can support locality-oblivious task placement.
• Analysis of the unfairness of age-based arbitration.
• LIBRA has 38% lower area overhead compared to PVC.
Q&A
THANK YOU!
Hybrid Arbiter Cont’d
[Figure: hybrid arbiter datapath. In the pre-calculation stage (PC), a random number and the weights w0 and w1 pass through multiplier (X) and adder (+) blocks; in the arbitration stage (SAc), comparators (<) produce the grants g0 and g1.]