University of Utah 1
The Effect of Interconnect Design on the Performance of Large L2 Caches
Naveen Muralimanohar, Rajeev Balasubramonian
Motivation: Large Caches
- Future processors will have large on-chip caches
  - Intel Montecito has a 24 MB on-chip cache
- Wire delay dominates in large caches
  - Conventional design can lead to very high hit time (CACTI access time for a 24 MB cache is 90 cycles at 5 GHz, 65 nm technology)
- Careful network choices
  - Improve access time
  - Open room for several other optimizations
  - Reduce power significantly
Effect of L2 Hit Time
[Figure: bar chart of IPC improvement (y-axis, 0% to 50%) for each benchmark: ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise]
Increase in IPC due to reduction in L2 access time
8-issue, out-of-order processor (L2 hit time reduced from 30 to 15 cycles)
Avg = 17%
Cache Design
[Figure: cache organization showing the input address, decoder, wordlines, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, and output drivers, producing the "Valid output?" signal and the data output]
Existing Model - CACTI
[Figure: two panels, a cache model with 4 sub-arrays and a cache model with 16 sub-arrays, each labeled with decoder delay and wordline & bitline delay]
Shortcomings
- CACTI
  - Suboptimal for large cache sizes
  - Access delay is equal to the delay of the slowest sub-array
    - Very high hit time for large caches
  - Employs a separate bus for each cache bank in multi-banked caches
Non-Uniform Cache Access (NUCA)
- Large cache is broken into a number of small banks
- Employs an on-chip network for communication
- Access delay depends on the distance between the bank and the cache controller
[Figure: CPU & L1 connected to an array of cache banks]
Shortcomings
- NUCA
  - Banks are sized such that the link latency is one cycle (Kim et al., ASPLOS 02)
  - Increased routing complexity
  - Dissipates more power
Extension to CACTI
- On-chip network
  - Wire model uses ITRS 2005 parameters
  - Grid network
    - No. of rows = No. of columns (or ½ the no. of columns)
- Network latency vs. bank access latency tradeoff
  - Modified the exhaustive search to include the network overhead
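The search described on this slide can be sketched roughly as follows. This is a minimal illustration, not the CACTI extension itself: the function names, the fixed per-hop latency, the controller placed at one grid corner, and the bank access times passed in are all assumptions made for the example. Only the grid rule (rows equal to columns, or half the columns) comes from the slide.

```python
def grid_shape(bank_count):
    """Return (rows, cols) with rows == cols, or rows == cols // 2
    when bank_count is an odd power of two."""
    exp = bank_count.bit_length() - 1
    if bank_count < 1 or (1 << exp) != bank_count:
        raise ValueError("bank_count must be a power of two")
    rows = 1 << (exp // 2)
    cols = bank_count // rows
    return rows, cols


def average_hops(rows, cols):
    """Average hop count from a controller at grid corner (0, 0).
    Assumption: one extra hop is needed to enter the grid."""
    total = sum(r + c + 1 for r in range(rows) for c in range(cols))
    return total / (rows * cols)


def best_bank_count(bank_access_cycles, hop_cycles):
    """Exhaustive search over bank counts, minimizing
    bank access time + round-trip network delay (hypothetical model)."""
    best = None
    for banks, access in bank_access_cycles.items():
        rows, cols = grid_shape(banks)
        network = 2 * hop_cycles * average_hops(rows, cols)
        total = access + network
        if best is None or total < best[1]:
            best = (banks, total)
    return best
```

Few large banks make the bank access slow, while many small banks inflate the hop count, so the minimum sits somewhere in between.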
Effect of Network Delay (32MB cache)
[Figure: cycles at 5 GHz (y-axis, 0 to 140) versus bank count (x-axis, 2 to 4096) for three curves: bank access time, average network delay, and average cache access latency (global wires); the delay-optimal point is marked]
Outline
- Overview
- Cache Design
- Effect of Network Delay
- Wire Design Space
- Exploiting Heterogeneous Wires
- Results
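Wire Characteristics
- Wire resistance and capacitance per unit length

$$R_{wire} = \frac{\rho}{(\text{thickness} - \text{barrier})(\text{width} - 2\,\text{barrier})}$$

$$C_{wire} = \epsilon_0\left(2K\epsilon_{horiz}\frac{\text{thickness}}{\text{spacing}} + 2\epsilon_{vert}\frac{\text{width}}{\text{layerspacing}}\right) + \text{fringe}(\epsilon_{horiz}, \epsilon_{vert})$$

[Table: effect of width and spacing (rows) on resistance, capacitance, and bandwidth (columns)]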
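To make the dependence on width and spacing concrete, here is a small sketch that evaluates the per-unit-length expressions. It assumes the usual forms, resistance as resistivity divided by the conducting cross-section (thickness minus barrier, times width minus twice the barrier), and capacitance as a sidewall term plus a plate term plus fringe. The function and parameter names are illustrative, and the fringe term is passed in as a precomputed number rather than modeled.

```python
EPS0 = 8.854e-12  # vacuum permittivity in F/m


def wire_resistance(rho, width, thickness, barrier):
    """Resistance per unit length: rho over the conducting cross-section."""
    return rho / ((thickness - barrier) * (width - 2 * barrier))


def wire_capacitance(width, thickness, spacing, layer_spacing,
                     eps_horiz, eps_vert, k, fringe=0.0):
    """Capacitance per unit length.

    sidewall: coupling to neighbors on the same layer (shrinks with spacing)
    plate:    coupling to adjacent layers (grows with width)
    fringe:   supplied by the caller (assumption: treated as a constant)
    """
    sidewall = 2 * k * eps_horiz * thickness / spacing
    plate = 2 * eps_vert * width / layer_spacing
    return EPS0 * (sidewall + plate) + fringe
```

Under this model a wider wire lowers resistance and raises the plate capacitance, while a larger spacing lowers the sidewall capacitance. Both changes consume more metal area per wire, which is the bandwidth cost.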
Design Space Exploration
- Tuning wire width and spacing
  - Base case: B-wires
  - L-wires: fast but low bandwidth
[Figure: effect of wire width and spacing on delay and bandwidth]
Design Space Exploration
- Tuning repeater size and spacing
  - Traditional wires: large repeaters, optimum spacing
  - Power-optimal wires: smaller repeaters, increased spacing
[Figure: effect of repeater size and spacing on delay and power]
Design Space Exploration
| Wire type | Metal plane | Latency | Power | Area |
|---|---|---|---|---|
| B-wires (base case) | 8x | 1x | 1x | 1x |
| W-wires (base case) | 4x | 1.6x | 0.9x | 0.5x |
| PW-wires (power optimized) | 4x | 3.2x | 0.3x | 0.5x |
| L-wires (fast, low bandwidth) | 8x | 0.5x | 0.5x | 5x |
Access time for different link types
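| Bank count | Bank access time (cycles) | Avg access time, 8x-wires (cycles) | Avg access time, 4x-wires (cycles) | Avg access time, L-wires (cycles) |
|---|---|---|---|---|
| 16 | 17 | 46 | 75 | 21 |
| 32 | 9 | 40 | 71 | 15 |
| 64 | 6 | 38 | 63 | 14 |
| 128 | 5 | 44 | 68 | 17 |
| 256 | 4 | 51 | 83 | 20 |
| 512 | 3 | 82 | 113 | 27 |
| 1024 | 3 | 100 | 133 | 35 |
| 2048 | 3 | 99 | 162 | 51 |
| 4096 | 3 | 131 | 196 | 67 |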
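The delay-optimal bank count for each link type can be read off the table with a short script. The numbers below are copied from the table. The helper name is made up for the example, and treating the non-bank portion as "average access time minus bank access time" is an assumption for illustration.

```python
# Each row: bank count, bank access time, then average access time (cycles)
# for 8x-wires, 4x-wires, and L-wires, as listed in the table.
TABLE = [
    (16, 17, 46, 75, 21),
    (32, 9, 40, 71, 15),
    (64, 6, 38, 63, 14),
    (128, 5, 44, 68, 17),
    (256, 4, 51, 83, 20),
    (512, 3, 82, 113, 27),
    (1024, 3, 100, 133, 35),
    (2048, 3, 99, 162, 51),
    (4096, 3, 131, 196, 67),
]
LINK_TYPES = ("8x-wires", "4x-wires", "L-wires")


def delay_optimal(table):
    """Map each link type to (bank count, avg access time, non-bank cycles)
    at the row where the average access time is smallest."""
    result = {}
    for col, link in enumerate(LINK_TYPES, start=2):
        row = min(table, key=lambda r: r[col])
        result[link] = (row[0], row[col], row[col] - row[1])
    return result


if __name__ == "__main__":
    for link, (banks, total, network) in delay_optimal(TABLE).items():
        print(f"{link}: {banks} banks, {total} cycles ({network} outside the bank)")
```

With the table's numbers, all three link types reach their minimum at 64 banks: 38, 63, and 14 cycles respectively.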
Outline
- Overview
- Cache Design
- Effect of Network Delay
- Wire Design Space
- Exploiting Heterogeneous Wires
- Results
Cache Look-Up
- Total cache access time
  - Network delay (requires 6-8 bits to identify the cache bank)
  - Bank access
    - Decoder, wordline, and bitline delay (requires 10-15 bits of address)
    - Comparator and output driver delay (requires the remaining address bits for tag match)
- The entire access happens sequentially
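The three address fields used by these stages can be illustrated with a small helper. The field widths are assumptions chosen to fall in the ranges on the slide: 6 bank bits (64 banks), 10 index bits, and a 64-byte block offset. The order of the fields within the address is also an assumption.

```python
def split_address(addr, block_bits=6, bank_bits=6, index_bits=10):
    """Split a physical address into (tag, set index, bank id).

    Assumed layout, from least significant bit upward:
    block offset | bank id | set index | tag
    """
    addr >>= block_bits                      # drop the block offset
    bank = addr & ((1 << bank_bits) - 1)     # routes the request to a bank
    addr >>= bank_bits
    index = addr & ((1 << index_bits) - 1)   # drives decoder and wordline
    tag = addr >> index_bits                 # compared after the array read
    return tag, index, bank
```

Only the bank bits are needed to route the request and only the index bits to start the array read; the tag is not needed until the comparator stage.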
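Early Look-Up
- Send partial address on L-wires
- Initiate the bank lookup
- Wait for the complete address
- Complete the access
[Figure: access timeline in which the early lookup on L-wires (requires 10-15 bits of address) overlaps the address transfer and is followed by the tag match]
- We can hide 60-70% of the bank access delay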
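A back-of-the-envelope timing model shows where the saving comes from. This is a simplified sketch with hypothetical cycle counts. It ignores contention and the data return path, and it assumes the bank access splits into an array read that needs only the index bits and a tag match that needs the full address.

```python
def sequential_latency(net_b, array, tag_match):
    """Baseline: full address travels on B-wires, then the bank is accessed."""
    return net_b + array + tag_match


def early_lookup_latency(net_b, net_l, array, tag_match):
    """Early lookup: index bits arrive on L-wires after net_l cycles and start
    the array read; the tag match waits for both the array read and the
    full address arriving on B-wires."""
    ready = max(net_l + array, net_b)
    return ready + tag_match


if __name__ == "__main__":
    # Hypothetical numbers: 16-cycle B-wire transfer, L-wires at half that,
    # and a 6-cycle bank access split into 4 (array) + 2 (tag match).
    base = sequential_latency(16, 4, 2)
    early = early_lookup_latency(16, 8, 4, 2)
    print(base, early, f"hidden fraction of bank access: {(base - early) / 6:.0%}")
```

With these assumed numbers the 4-cycle array read is fully overlapped with the address transfer, which hides about two thirds of the 6-cycle bank access.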
Aggressive Look-Up
- Send partial address bits on L-wires
- Do early look-up and partial tag match
- Send all the matched blocks aggressively
[Figure: access timeline in which the aggressive lookup on L-wires (requires an additional 8 bits of address for partial tag match) is followed by the tag match at the cache controller; network delay is reduced]
Aggressive Look-Up
- Significant reduction in network delay (for address transfer)
- Increase in traffic due to false matches < 1%
- Marginal increase in link overhead
  - Additional 8 bits of L-wires compared to early lookup
- Adds complexity to cache controller
- Needs logic to do tag match
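The split between the bank and the cache controller can be sketched as follows. This is an illustration only: it assumes the partial tag is the low-order 8 bits of the full tag and that the bank returns each candidate block together with its stored tag, neither of which is stated on the slides. The function names are hypothetical.

```python
PARTIAL_BITS = 8  # extra address bits sent on L-wires for the partial tag


def partial_match(stored_tags, partial_tag, partial_bits=PARTIAL_BITS):
    """Bank side: return every way whose low-order tag bits match.
    Several ways may match; only one (or none) can be the real hit."""
    mask = (1 << partial_bits) - 1
    return [way for way, tag in enumerate(stored_tags)
            if tag is not None and (tag & mask) == partial_tag]


def final_match(stored_tags, candidate_ways, full_tag):
    """Controller side: full tag comparison over the blocks the bank sent."""
    for way in candidate_ways:
        if stored_tags[way] == full_tag:
            return way
    return None


if __name__ == "__main__":
    tags = [0x1A2B, 0x3C2B, 0x0011, None]     # one 4-way set, one way invalid
    ways = partial_match(tags, 0x3C2B & 0xFF)
    print(ways, final_match(tags, ways, 0x3C2B))
```

In the usage example, way 0 is a false match: it shares the low 8 tag bits, so its block is sent across the network and then discarded by the controller. This is the source of the extra traffic mentioned above.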
Outline
- Overview
- Cache Design
- Effect of Network Delay
- Wire Design Space
- Exploiting Heterogeneous Wires
- Results
Experimental Setup
- SimpleScalar with contention modeled in detail
- Single-core, 8-issue out-of-order processor
- 32 MB, 8-way set-associative, on-chip L2 cache (SNUCA organization)
- 32 KB I-cache and 32 KB D-cache with a hit latency of 3 cycles
- Main memory latency of 300 cycles
Cache Models
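| Model | Bank access (cycles) | Bank count | Network link | Description |
|---|---|---|---|---|
| 1 | 3 | 512 | B-wires | Based on prior work |
| 2 | 6 | 64 | B-wires | CACTI-L2 |
| 3 | 6 | 64 | B- and L-wires | Early lookup |
| 4 | 6 | 64 | B- and L-wires | Aggressive lookup |
| 5 | 6 | 64 | B- and L-wires | Upper bound |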
Performance Results (Global Wires)
- Model 2 (CACTI-L2)
  - Average performance improvement: 11%
  - Performance improvement for L2-latency-sensitive benchmarks: 16.3%
- Model 3 (Early Lookup)
  - Average performance improvement: 14.4%
  - Performance improvement for L2-latency-sensitive benchmarks: 21.6%
- Model 4 (Aggressive Lookup)
  - Average performance improvement: 17.6%
  - Performance improvement for L2-latency-sensitive benchmarks: 26.6%
- Model 6 (L-Network)
  - Average performance improvement: 11.4%
  - Performance improvement for L2-latency-sensitive benchmarks: 16.2%
[Figure: bar chart for Models 1 to 6 (y-axis, 0 to 1.6), with bars for all benchmarks and for latency-sensitive benchmarks]
Performance Results (4x-wires)
- Wire-delay-constrained model
  - Performance improvements are better
  - Early lookup performs 5% better
  - Aggressive model performs 28% better
[Figure: bar chart of IPC normalized to Model 1 (y-axis, 0 to 2) for different cache configurations (Models 1 to 5), with bars for all benchmarks and for latency-sensitive benchmarks]
Future Work
- Heterogeneous network in a CMP environment
- Hybrid network
  - Employs a combination of point-to-point and bus for L-messages
    - Effective use of L-wires
    - Latency/bandwidth trade-off
- Use of heterogeneous wires in a DNUCA environment
- Cache design focusing on power
  - Pre-fetching (power-optimized wires)
  - Writeback (power-optimized wires)
Conclusion
- Traditional design approaches for large caches are sub-optimal
- Network parameters play a significant role in the performance of large caches
- The modified CACTI model, which includes network overhead, performs 16.3% better than previous models
- A heterogeneous network has the potential to further improve performance
  - Early lookup: 21.6%
  - Aggressive lookup: 26.6%