Feb 14th 2005, University of Utah

Microarchitectural Wire Management for Performance and Power in Partitioned Architectures

Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
Overview/Motivation
Wire delays are costly for performance and power
Latencies of 30 cycles to reach the ends of a chip
50% of dynamic power is in interconnect switching (Magen et al., SLIP '04)
Abundant number of metal layers
Wire Characteristics
Wire resistance and capacitance per unit length:

C_wire = ε0 · (2·K_horiz·thickness/spacing + 2·K_vert·width/layerspacing) + fringe(horiz, vert)
R_wire = ρ / ((thickness − barrier) · (width − 2·barrier))

Width and spacing determine delay (delay ∝ RC) and bandwidth:
Increasing width: resistance ↓, capacitance ↑, bandwidth ↓
Increasing spacing: capacitance ↓, bandwidth ↓
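The per-unit-length R and C expressions above can be sketched numerically. A minimal Python sketch — aside from ε0, every constant (resistivity, dielectric constants, dimensions, fringe term) is an illustrative assumption, not a value from the talk:

```python
# Sketch of the per-unit-length wire model above. All constants are
# illustrative placeholders, not values from the slides.
EPS0 = 8.854e-12      # F/m, vacuum permittivity
RHO = 2.2e-8          # ohm*m, copper resistivity (assumed)

def wire_capacitance(width, spacing, thickness, layerspacing,
                     k_horiz=2.7, k_vert=2.7, fringe=40e-12):
    """C per metre: same-layer coupling + inter-layer plate + fringe."""
    return EPS0 * (2 * k_horiz * thickness / spacing
                   + 2 * k_vert * width / layerspacing) + fringe

def wire_resistance(width, thickness, barrier=5e-9):
    """R per metre: the barrier layer shrinks the conducting cross-section."""
    return RHO / ((thickness - barrier) * (width - 2 * barrier))

def rc_delay_metric(width, spacing, thickness=200e-9, layerspacing=200e-9):
    """Per-unit-length RC product, proportional to wire delay."""
    return (wire_resistance(width, thickness)
            * wire_capacitance(width, spacing, thickness, layerspacing))

# Doubling width and spacing (L-wire style) cuts the RC delay metric,
# at the cost of wire tracks (bandwidth).
assert rc_delay_metric(200e-9, 200e-9) < rc_delay_metric(100e-9, 100e-9)
```

This reproduces the trend the slide's table states: fatter, sparser wires trade bandwidth for delay.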
Design Space Exploration
Tuning wire width and spacing
[Figure: B wires at minimum width and spacing d vs. L wires at doubled width and spacing 2d; larger width lowers resistance, larger spacing lowers capacitance, and both cost bandwidth]
Transmission Lines
Allow extremely low delay
High implementation complexity and overhead!
Large width
Large spacing between wires
Design of sensing circuit
Shielding power and ground lines adjacent to each line
Implemented in test CMOS chips
Not employed in this study
Design Space Exploration
Tuning Repeater size and spacing
Traditional wires: large repeaters, delay-optimal spacing
Power-optimal wires: smaller repeaters, increased spacing
[Figure: delay vs. power trade-off as repeater size and spacing are tuned]
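The repeater trade-off can be illustrated with the classic repeated-wire delay model (per-segment 0.69·RC Elmore terms, summed over segments). A hedged sketch: the model structure is standard, but every device and wire constant below is an assumed placeholder, and the energy term counts only wire plus repeater switching capacitance:

```python
# Hedged sketch of repeater tuning: fewer/smaller repeaters save switching
# energy but lengthen the wire delay. All constants are illustrative.
def repeated_wire(length, n_segs, rep_size,
                  r_wire=1e6, c_wire=150e-12,   # per-metre wire R and C (assumed)
                  r0=10e3, c0=1e-15, vdd=1.0):  # unit repeater R and C (assumed)
    """Return (delay_s, switching_energy_J) for a repeated wire."""
    seg = length / n_segs
    r_seg, c_seg = r_wire * seg, c_wire * seg
    r_rep, c_rep = r0 / rep_size, c0 * rep_size
    # Elmore-style delay of one segment, times the number of segments
    delay = n_segs * 0.69 * (r_rep * (c_rep + c_seg)
                             + r_seg * (c_seg / 2 + c_rep))
    # Switching energy: wire capacitance plus all repeater capacitance
    energy = (c_wire * length + n_segs * c_rep) * vdd ** 2
    return delay, energy

d_fast, e_fast = repeated_wire(0.01, n_segs=25, rep_size=40)  # delay-tuned
d_low,  e_low  = repeated_wire(0.01, n_segs=10, rep_size=15)  # power-tuned
# Smaller, sparser repeaters: less energy, more delay
assert e_low < e_fast and d_low > d_fast
```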
Design Space Exploration
B wires: base case
W wires: bandwidth-optimized
P wires: power-optimized
PW wires: power- and bandwidth-optimized
L wires: fast, low bandwidth
Outline
Overview
Wire Design Space Exploration
Employing L wires for Performance
PW wires: The Power Optimizers
Results
Conclusions
Evaluation Platform
Centralized front-end: I-Cache & D-Cache, LSQ, branch predictor
Clustered back-end
[Figure: centralized front-end and L1 DCache connected to the clusters]
Cache Pipeline
Baseline (B wires):
Eff. address transfer (10c) → Mem. dep. resolution (5c) → Cache access (5c) → Data return at 20c
With L wires:
8-bit LSB transfer on L wires (5c) → Partial mem. dep. resolution (3c) → Cache access (5c) → Data return at 14c
(the full effective-address transfer, 10c on B wires, completes in parallel)
[Figure: pipeline stages between the functional unit, LSQ, and L1 DCache]
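The cycle accounting behind the two pipelines can be written out directly. A small sketch using the slide's stage latencies; treating the data-return cycle as the sum of stage latencies is a simplifying assumption:

```python
# Cycle accounting for the two cache pipelines above, using the slide's
# stage latencies. Summing stages is a simplification of the real timing.
def data_return_cycle(stages):
    return sum(stages)

# Baseline: address transfer, dep. resolution, cache access
baseline = data_return_cycle([10, 5, 5])
assert baseline == 20                 # slide: data return at 20c

# L-wire path: 8-bit LSB transfer (5c), partial dep. resolution (3c),
# cache access (5c); the full 10c address transfer overlaps on B wires.
lwire = max(data_return_cycle([5, 3, 5]), 10)
assert lwire < baseline               # slide reports data return at 14c
```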
L wires: Accelerating Cache Access
Transmit the LSBs of the effective address on L wires → faster memory disambiguation
Partial comparison of loads and stores in the LSQ
Introduces false dependences (< 9%)
LSBs index the data and tag RAM arrays, prefetching data out of the L1$
Reduces the access latency of loads
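The partial LSQ comparison is conservative: two addresses whose low bits match are treated as a possible conflict, which is what creates the false dependences quoted above. A minimal sketch, with an illustrative 8-bit LSB width:

```python
# Hedged sketch of partial memory disambiguation: compare only the low-order
# address bits (the ones that arrive early on L wires). Full addresses that
# differ but share LSBs become false dependences. The bit width is illustrative.
LSB_BITS = 8
MASK = (1 << LSB_BITS) - 1

def may_conflict(load_addr_lsb, store_addr):
    """Conservative check: conflict unless the LSBs provably differ."""
    return load_addr_lsb == (store_addr & MASK)

# Same full address: a real conflict is (correctly) flagged
assert may_conflict(0x1234 & MASK, 0x1234)
# Different full addresses with matching LSBs: a false dependence
assert may_conflict(0x1234 & MASK, 0x5634)
# LSBs differ: provably independent, the load can proceed early
assert not may_conflict(0x1234 & MASK, 0x1235)
```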
L wires: Narrow Bit-Width Operands
PowerPC: data bit-width determines FU latency
Transfer of 10-bit integers on L wires
Can introduce scheduling difficulties
A predictor table of saturating counters identifies narrow operands (accuracy of 98%)
Reduction in branch mispredict penalty
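The predictor table can be sketched as an array of 2-bit saturating counters. Everything here — table size, indexing by instruction PC, the update rule — is an assumed illustration of the idea, not the paper's exact design:

```python
# Hedged sketch of a narrow-operand predictor: a table of 2-bit saturating
# counters, indexed by (hypothetical) instruction PC, learns whether an
# instruction tends to produce values that fit in a narrow bit-width.
TABLE_SIZE = 1024
NARROW_BITS = 10                      # slide: 10-bit integers fit on L wires
counters = [0] * TABLE_SIZE           # 2-bit counters: 0..3

def index(pc):
    return (pc >> 2) % TABLE_SIZE

def predict_narrow(pc):
    return counters[index(pc)] >= 2   # predict narrow when counter is 2 or 3

def train(pc, result):
    i = index(pc)
    was_narrow = -(1 << (NARROW_BITS - 1)) <= result < (1 << (NARROW_BITS - 1))
    if was_narrow:
        counters[i] = min(3, counters[i] + 1)
    else:
        counters[i] = max(0, counters[i] - 1)

# An instruction that repeatedly produces small values becomes predicted narrow
for _ in range(3):
    train(0x4000, 42)
assert predict_narrow(0x4000)
train(0x4000, 1 << 20)                # one wide result only decrements
```

Hysteresis in the counters is what makes a single wide result not flip the prediction, which helps reach the high accuracy the slide quotes.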
Power-Efficient Wires
B wires: base case
PW wires: power- and bandwidth-optimized
Idea: steer non-critical data through the energy-efficient PW interconnect
PW wires: Power/Bandwidth Efficient
Ready register operands: transfer the data at instruction dispatch
Transfer of input operands to the remote register file
Covered by the long dispatch-to-issue latency (e.g., operand ready at cycle 90, consumer instruction dispatched at cycle 100)
Store data: could stall the commit process and delay dependent loads
[Figure: rename & dispatch stage feeding four clusters, each with an issue queue, register file, and functional units]
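The steering decision above can be phrased as a slack test: use the slow, low-power wires when the gap between a value being ready and being needed absorbs the extra latency. A sketch using the 4-cluster latencies from the evaluation (B = 2 cycles, PW = 3 cycles); the slack heuristic itself is an assumed illustration:

```python
# Hedged sketch of criticality-based wire steering: route a transfer over
# slow PW wires when its slack covers the extra latency, else over B wires.
# Latencies match the 4-cluster evaluation; the heuristic is an assumption.
B_LATENCY, PW_LATENCY = 2, 3

def choose_wire(ready_cycle, needed_cycle):
    slack = needed_cycle - ready_cycle
    return "PW" if slack >= PW_LATENCY else "B"

# Operand ready at cycle 90, consumer dispatched at cycle 100: ample slack,
# so the low-power wires cost nothing in performance.
assert choose_wire(90, 100) == "PW"
# Critical operand needed immediately: pay for the fast wires.
assert choose_wire(90, 91) == "B"
```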
Outline
Overview
Wire Design Space Exploration
Employing L wires for Performance
PW wires: The Power Optimizers
Results
Conclusions
Evaluation Methodology
Simplescalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model
Crossbar interconnects (L, B, and PW wires)
Crossbar latencies: L wires 1 cycle, B wires 2 cycles, PW wires 3 cycles
[Figure: four clusters and the L1 DCache connected by a crossbar]
Heterogeneous Interconnects: Inter-cluster Global Interconnect
72 B wires (64 data bits and 8 control bits): repeaters sized and spaced for optimum delay
18 L wires: wide wires and large spacing; occupy more area; low latency
144 PW wires: poor delay, high bandwidth, low power
Analytical Model
RC model of the wire; capacitance per unit length:

C = Ca + Ws·Cb + Cc/Ws

where Ca is the fringing capacitance, Ws·Cb the capacitance between different metal layers, and Cc/Ws the coupling capacitance between wires of the same metal layer.

Total Power = Short-Circuit Power + Switching Power + Leakage Power
Evaluation Methodology
Simplescalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model
Ring latencies: L wires 2 cycles, B wires 4 cycles, PW wires 6 cycles
[Figure: I-Cache, D-Cache, and LSQ connected to the clusters through a crossbar and ring interconnect]
IPC Improvements: L wires
L wires improve performance by 4.2% on a four-cluster system and 7.1% on a sixteen-cluster system
[Figure: IPC (0 to 2.5) for each SPEC2000 benchmark — ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise — and the arithmetic mean (AM)]
Baseline: 144 B-Wires
Low-latency optimizations: 144 B-Wires and 36 L-Wires
Four Cluster System: ED2 Improvements

Link          Rel. metal area  IPC   Rel. processor energy (10%)  Rel. ED2 (10%)  Rel. ED2 (20%)
144 PW, 36 L  1.5              0.96  97                           95.0            92.1
288 B         2.0              0.98  103                          96.6            99.2
144 B, 36 L   2.0              0.99  101                          93.3            94.5
288 PW, 36 L  2.0              0.97  99                           94.4            93.2
288 PW        1.0              0.92  97                           103.4           100.2
144 B (base)  1.0              0.95  100                          100             100

(10%) and (20%) denote two assumptions for the interconnect's share of total processor power.
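The relative ED2 columns follow directly from the relative energy and IPC columns: delay per unit of work scales as 1/IPC, so relative ED2 = relative energy × (IPC_base / IPC)². A short check against the table (the printed IPC and energy values are rounded, so some rows agree only approximately):

```python
# Recompute relative ED^2 from the table's relative energy and IPC columns.
# Delay scales as 1/IPC, so ED^2_rel = E_rel * (IPC_base / IPC)^2.
# Baseline: 144 B wires, IPC 0.95, relative energy 100.
def relative_ed2(energy_rel, ipc, ipc_base=0.95):
    return energy_rel * (ipc_base / ipc) ** 2

# Rows from the 10%-interconnect-power column that reproduce exactly:
assert round(relative_ed2(97, 0.96), 1) == 95.0    # 144 PW, 36 L
assert round(relative_ed2(97, 0.92), 1) == 103.4   # 288 PW
```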
Sixteen Cluster System: ED2 Gains

Link          IPC   Rel. processor energy (20%)  Rel. ED2 (20%)
288 B         1.18  105                          93.1
288 B, 36 L   1.22  107                          88.7
144 B, 36 L   1.19  102                          88.7
144 PW, 36 L  1.05  94                           105.3
144 B (base)  1.11  100                          100
Conclusions
Exposing the wire design space to the architecture
A case for micro-architectural wire management!
A low-latency, low-bandwidth (L-wire) network alone improves performance by up to 7%
ED2 improvements of about 11% compared to a baseline processor with a homogeneous interconnect
Entails hardware complexity
Future work
3-D wire model for the interconnects
Design of heterogeneous clusters
Interconnects for cache coherence and L2$
Questions and Comments?
Thank you!
Backup
L wires: Accelerating cache access
TLB access for page lookup: transmit a few bits of the virtual page number on L wires
Prefetch data out of the L1$ and TLB
18 L wires (6 tag bits, 8 L1 index bits, and 4 TLB index bits)

Wire type   Crossbar delay  Ring hop delay
L wires     1               2
B wires     2               4
PW wires    3               6
Model parameters
Simplescalar-3.0 with separate integer and floating-point queues
32 KB 2-way instruction cache
32 KB 4-way data cache
128-entry 8-way I- and D-TLBs
Overview/Motivation:
Three wire implementations employed in this study:
B wires: traditional, delay-optimal; high power consumption
L wires: faster than B wires; lower bandwidth
PW wires: reduced power consumption; higher bandwidth than B wires; increased delay through the wires