The Auction: Optimizing Banks Usage in Non-Uniform Cache Architectures
Javier Lira ψ
Carlos Molina ψ,ф
Antonio González ψ,λ
λ Intel Barcelona Research Center
Intel Labs - UPC
Barcelona, Spain
ф Dept. Enginyeria Informàtica
Universitat Rovira i Virgili
Tarragona, Spain
ψ Dept. Arquitectura de Computadors
Universitat Politècnica de Catalunya
Barcelona, Spain
ICS 2010, Tsukuba (Japan) – June 2, 2010
Introduction
CMPs incorporate large and shared last-level caches.
Access latency in large caches is dominated by wire delays.
Traditional caches are no longer feasible as LLC in CMPs.
[Figure: Intel® Nehalem and IBM® Power7; annotation: 40-45%]
Non-Uniform Cache Architecture
NUCA divides a large cache into smaller and faster banks.
Cache access latency consists of the routing and bank access latencies.
Banks close to the cache controller have lower latencies than banks farther away.
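The latency decomposition above can be sketched as follows. This is only an illustration: it assumes a hypothetical network where every hop costs one router delay plus one wire delay, it counts one-way routing only, and the hop counts are made up. The default cycle values are the ones listed on the Methodology slide.

```python
def nuca_access_latency(hops, bank_latency=4, router_delay=1, wire_delay=1):
    """Cycles to reach a bank `hops` hops away and access it (assumed model)."""
    routing = hops * (router_delay + wire_delay)  # routing latency
    return routing + bank_latency                 # plus bank access latency

near = nuca_access_latency(hops=1)  # bank next to the cache controller
far = nuca_access_latency(hops=8)   # distant bank (hypothetical distance)
```

Under these assumptions the distant bank is several times slower than the near one, which is the non-uniformity that NUCA exposes.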
Motivation
Banks work independently.
The most frequently accessed data concentrates in a few banks.
In case of replacement…
A good choice in a particular bank could be completely unfair if the whole NUCA is considered.
[Figure: NUCA cache surrounded by Core 0 to Core 7; the example access is to address @]
The Auction
A collaborative replacement technique that
finds the most appropriate data to evict, not only from
a particular bank but from the whole NUCA cache.
Methodology
Simulation tools: Simics + GEMS, CACTI v6.0
Two scenarios:
• Multi-programmed: mix of SPEC CPU2006
• Parallel applications: PARSEC
Number of cores: 8 (UltraSPARC IIIi)
Frequency: 1.5 GHz
Main memory size: 4 GBytes
Memory bandwidth: 512 Bytes/cycle
Private L1 caches: 8 x 32 KBytes, 2-way
Shared L2 NUCA cache: 8 MBytes, 256 banks
NUCA bank: 32 KBytes, 8-way
L1 cache latency: 3 cycles
NUCA bank latency: 4 cycles
Router delay: 1 cycle
On-chip wire delay: 1 cycle
Main memory latency: 250 cycles (from core)
Auction time-out: 150 cycles
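As a quick consistency check of the parameters above, the shared NUCA capacity should equal the number of banks times the bank size. A minimal sketch; the helper and variable names are illustrative and not part of the simulator setup.

```python
KBYTE = 1024
MBYTE = 1024 * KBYTE

def total_capacity(num_banks, bank_bytes):
    """Aggregate capacity of a cache built from equally sized banks."""
    return num_banks * bank_bytes

nuca_bytes = total_capacity(num_banks=256, bank_bytes=32 * KBYTE)
matches_table = nuca_bytes == 8 * MBYTE  # 256 banks x 32 KBytes vs. 8 MBytes

# The auction time-out (150 cycles) is shorter than a main memory access (250 cycles).
timeout_below_memory = 150 < 250
```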
Baseline NUCA cache architecture
• CMP-DNUCA
• 8 cores
• 256 banks
• 16-way bank-set assoc. (8 local + 8 central)
• LRU in the bank
• Zero-copy in the NUCA
[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO '04.
The Auction
“The Auction is a collaborative replacement technique that finds the most appropriate data to evict, not only from a particular bank but from the whole NUCA cache.”
Auction participants:
• Owner: It owns the item, but wants to sell it. Bank where the replacement happens.
• Bidder: Potential owners of the auctioned item. The other banks from the bankset.
• Controller: Manages the current auction. New component: Auction slots.
The Auction
[Figure: NUCA cache surrounded by Core 0 to Core 7, with the controller's Auction Slots]
Step 1: Owner starts the auction
Step 2: Bids for the auctioned item
Step 3: Item is sold!
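The three steps can be sketched in a few lines. This is an illustrative model, not the hardware design: `AuctionSlot`, `run_auction` and the `wants_item` callback are hypothetical names, the time-out is not modelled, and the sketch simply sells to the first bid because the actual selection rule depends on the auction approach.

```python
from dataclasses import dataclass, field

@dataclass
class AuctionSlot:
    """One in-flight auction tracked by the controller (assumed layout)."""
    item: str    # tag of the line being auctioned
    owner: int   # bank where the replacement happened
    bids: list = field(default_factory=list)  # banks that placed a bid

def run_auction(item, owner, bankset, wants_item):
    # Step 1: the owner starts the auction and the controller fills a slot.
    slot = AuctionSlot(item=item, owner=owner)
    # Step 2: the other banks of the bankset decide whether to bid.
    for bank in bankset:
        if bank != owner and wants_item(bank):
            slot.bids.append(bank)
    # Step 3: the item is sold; None models an auction that received no bids.
    return slot.bids[0] if slot.bids else None

winner = run_auction("0x40", owner=3, bankset=range(8),
                     wants_item=lambda bank: bank % 4 == 1)
no_sale = run_auction("0x80", owner=3, bankset=range(8),
                      wants_item=lambda bank: False)
```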
First Auction Approach: Base
Fills the gaps left by invalidating replicated data.
Owner: Invites all other banks from the bankset.
Bidder: Bids if storing the item causes NO new replacement.
Controller: First bid wins, but central banks are prioritised.
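One possible reading of the base rules, sketched under stated assumptions: a bidder has a free way when an earlier invalidation left a gap, and "prioritising central banks" is interpreted as ranking central-bank bids ahead of local-bank bids before applying arrival order. The `Bid` record and both function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    bank: int
    central: bool  # True for a central bank, False for a local bank
    arrival: int   # order in which the bid reached the controller

def base_bidder_bids(free_ways):
    """Bid only if storing the item would not trigger a new replacement."""
    return free_ways > 0

def base_pick_winner(bids):
    """Earliest bid wins, with central-bank bids ranked first (assumed tie-break)."""
    if not bids:
        return None
    return min(bids, key=lambda b: (not b.central, b.arrival))

bids = [Bid(bank=2, central=False, arrival=0),
        Bid(bank=9, central=True, arrival=1)]
winner = base_pick_winner(bids)
```

Here the central bank wins even though the local bank bid first; with no central bids, the earliest local bid would win.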
Performance
[Figure: performance speed-up over the baseline for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and H-mean. Series: Baseline, Victim Cache, One-Copy, AUC-BASE. Off-scale bars are labelled 1.23, 1.24 and 1.34.]
Significant benefits with large working sets
Good performance in both scenarios
Blindly relocating data could be harmful
The Auction outperforms prior proposals
Energy consumption
[Figure: energy per instruction (normalized) for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and A-mean. Stacked components include Off-chip and Dynamic. Bars: A: Baseline, B: Victim Cache, C: One-Copy and D: AUC-BASE.]
Leakage dominates the energy consumption
Auction reduces overall energy consumed
Enhanced Auction Approaches
Almost half of auctions finished without receiving bids.
We need… a metric to measure the quality of data.
By increasing auction accuracy…
• The controller has more options to decide the best destination.
• Auctions with no bids are reduced.
Auction-based global replacement policies.
Bank Usage Imbalance
Banks will bid relying on their usage rate.
Owner: Invites all other banks from the bankset.
Bidder: Bids if less frequently “used” than the owner.
Controller: The least “used” bidder wins.
Usage metric: capacity replacements per cache-set.
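The usage-based rules can be sketched as below. The usage values stand for capacity-replacement counters of the cache-set involved; the numbers, the dictionary layout and the function names are assumptions for illustration.

```python
def imb_bidder_bids(bidder_usage, owner_usage):
    """A bank bids only if it is less "used" than the owner."""
    return bidder_usage < owner_usage

def imb_pick_winner(usage_by_bidder):
    """The least "used" bidder wins; None if nobody bid."""
    if not usage_by_bidder:
        return None
    return min(usage_by_bidder, key=usage_by_bidder.get)

owner_usage = 12
usage = {1: 20, 4: 7, 6: 3}  # bank id -> usage counter (made-up values)
bidders = {bank: u for bank, u in usage.items()
           if imb_bidder_bids(u, owner_usage)}
winner = imb_pick_winner(bidders)
```

Bank 1 stays out of the auction because it is more heavily used than the owner, and the item goes to bank 6, the least used bidder.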
Prioritising most accessed data
Keeps most accessed data in the NUCA cache.
Owner: Invites all other banks from the bankset.
Bidder: Bids if its LRU line has been accessed less than the auctioned item.
Controller: The bidder with the least accessed LRU line wins.
Metric: access counter per line.
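A sketch of the access-count rules, assuming each cache-set is kept as a list ordered from most to least recently used and each line carries its own access counter. The `Line` record, the helper names and the counter values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Line:
    tag: str
    accesses: int = 0  # per-line access counter

def acc_bid(cache_set, item):
    """Return the LRU line's access count as the bid, or None if the bank does not bid."""
    victim = cache_set[-1]  # last entry is the LRU line
    return victim.accesses if victim.accesses < item.accesses else None

item = Line("A", accesses=5)  # auctioned line
sets_by_bank = {
    0: [Line("B", 3), Line("C", 9)],  # LRU accessed 9 times: no bid
    2: [Line("D", 8), Line("E", 1)],  # LRU accessed once: bids 1
    7: [Line("F", 2), Line("G", 4)],  # LRU accessed 4 times: bids 4
}
bids = {bank: acc_bid(s, item) for bank, s in sets_by_bank.items()}
bids = {bank: value for bank, value in bids.items() if value is not None}
winner = min(bids, key=bids.get) if bids else None
```

The item replaces line E in bank 2, so the line that leaves the NUCA is the least accessed candidate rather than the owner's more valuable item.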
Auction accuracy
[Figure: percentage of auctions (0 to 100) versus number of bids received (0 to 10) for AUC-BASE, AUC-ENH1-IMB and AUC-ENH2-ACC.]
Reduction of auctions that finish with no bids
Controller decisions are more accurate
Auction network
[Figure: auction messages (y-axis 0.9 to 1.7) for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and A-Mean. Series: AUC-BASE, AUC-ENH1-IMB, AUC-ENH2-ACC.]
At the cost of increasing network traffic
Performance
[Figure: performance speed-up over the baseline for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and H-mean. Series: Baseline, AUC-BASE, AUC-ENH1-IMB, AUC-ENH2-ACC. Off-scale bars are labelled 1.23, 1.25, 1.28, 1.34, 1.28 and 1.37.]
By increasing auction accuracy, we take better replacement decisions
Network contention is a key constraint
Conclusions
The decentralized nature of NUCA makes replacement policies ineffective.
The Auction finds the most appropriate data to evict, not only from a particular bank but from the whole NUCA cache.
The Auction adapts to the program behaviour and relocates data only when it is worthwhile.
By using auction-based replacement policies, the baseline NUCA improved its performance by 8% and reduced energy consumption by 4%.
More results (1)
[Figure: misses per thousand instructions (y-axis 0 to 14) for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and A-mean. Series: Baseline, Victim Cache, One-Copy, AUC-BASE.]
More results (2)
[Figure: network traffic (normalized), split into common network traffic and auction traffic, for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and A-Mean. Bars: A: Baseline, B: One-Copy and C: AUC-BASE.]
More results (3)
[Figure: energy per instruction (normalized) for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and A-mean. Stacked components include Off-chip and Dynamic. Bars: A: Baseline, B: AUC-BASE, C: AUC-ENH1-IMB and D: AUC-ENH2-ACC.]