Feb 14th 2005, University of Utah
Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
Overview/Motivation
Wire delays are costly for performance and power
Latencies of 30 cycles to reach ends of a chip
50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
Abundant number of metal layers
Wire Characteristics
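Wire resistance and capacitance per unit length:
$C_{wire} = \epsilon_0 \left( 2K\epsilon_{horiz}\frac{thickness}{spacing} + 2\epsilon_{vert}\frac{width}{layerspacing} \right) + fringe(\epsilon_{horiz}, \epsilon_{vert})$
$R_{wire} = \frac{\rho}{(thickness - Barrier)(width - 2\,Barrier)}$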
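Increasing width and spacing lowers delay (as delay ∝ RC) and lowers bandwidth.

| Increased parameter | Resistance | Capacitance | Bandwidth |
|---|---|---|---|
| Width | decreases | increases | decreases |
| Spacing | unchanged | decreases | decreases |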
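To make the width and spacing trends concrete, here is a minimal Python sketch that evaluates the per-unit-length resistance and capacitance expressions from this slide. Every numeric value (resistivity, dielectric constants, dimensions, the constant fringe term) is a placeholder assumption chosen only to show the direction of each effect; none of them comes from the study.

```python
from dataclasses import dataclass

EPS0 = 8.854e-12  # vacuum permittivity, F/m


@dataclass
class WireGeometry:
    """Hypothetical wire parameters; every default is an illustrative assumption."""
    width: float = 200e-9          # m
    spacing: float = 200e-9        # m
    thickness: float = 400e-9      # m
    layer_spacing: float = 400e-9  # m, distance to the adjacent metal layers
    barrier: float = 10e-9         # m, barrier layer thickness
    rho: float = 2.2e-8            # ohm*m, assumed copper-like resistivity
    k_miller: float = 1.5          # assumed Miller-effect factor K
    eps_horiz: float = 2.7         # assumed relative dielectric constants
    eps_vert: float = 2.7
    fringe: float = 4e-11          # F/m, fringe term simplified to a constant


def resistance_per_m(g: WireGeometry) -> float:
    # R_wire = rho / ((thickness - barrier) * (width - 2 * barrier))
    return g.rho / ((g.thickness - g.barrier) * (g.width - 2 * g.barrier))


def capacitance_per_m(g: WireGeometry) -> float:
    # C_wire = eps0 * (2K * eps_horiz * thickness / spacing
    #                  + 2 * eps_vert * width / layer_spacing) + fringe
    to_neighbours = 2 * g.k_miller * g.eps_horiz * g.thickness / g.spacing
    to_other_layers = 2 * g.eps_vert * g.width / g.layer_spacing
    return EPS0 * (to_neighbours + to_other_layers) + g.fringe


def rc_product(g: WireGeometry) -> float:
    # Unrepeated wire delay grows with the product R * C.
    return resistance_per_m(g) * capacitance_per_m(g)


base = WireGeometry()
# Doubling width and spacing doubles the pitch, so half as many wires fit per layer.
wide = WireGeometry(width=2 * base.width, spacing=2 * base.spacing)

for name, g in (("base", base), ("wide", wide)):
    print(name, resistance_per_m(g), capacitance_per_m(g), rc_product(g))
```

With these placeholder numbers the wider, more widely spaced wire has a lower RC product, at the cost of fewer wires per unit of metal area, which is the delay versus bandwidth trade-off named on the slide.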
Design Space Exploration
Tuning wire width and spacing
[Figure: B wires at spacing d compared with L wires at spacing 2d; annotations for resistance, capacitance and bandwidth]
Transmission Lines
Allow extremely low delay
High implementation complexity and overhead!
Large width
Large spacing between wires
Design of sensing circuit
Shielding power and ground lines adjacent to each line
Implemented in test CMOS chips
Not employed in this study
Design Space Exploration
Tuning Repeater size and spacing
Traditional wires: large repeaters, optimum spacing
Power-optimal wires: smaller repeaters, increased spacing
[Figure: delay versus power]
Design Space Exploration
Base case: B wires
Bandwidth optimized: W wires
Power optimized: P wires
Power and bandwidth optimized: PW wires
Fast, low bandwidth: L wires
Outline
Overview
Wire Design Space Exploration
Employing L wires for Performance
PW wires: The Power Optimizers
Results
Conclusions
Evaluation Platform
Centralized front-end: I-cache and D-cache, LSQ, branch predictor
Clustered back-end
[Figure: L1 D-cache and clusters]
Cache Pipeline
Baseline: effective address transfer to the LSQ (10c), memory dependence resolution (5c), cache access (5c); data return at 20c
With L wires: 8-bit transfer (5c), partial memory dependence resolution (3c), cache access (5c), alongside the full effective address transfer (10c); data return at 14c
[Figure: pipeline diagrams between the functional unit, the LSQ and the L1 D-cache]
L wires: Accelerating cache access
Transmit the LSBs of the effective address through L wires
Faster memory disambiguation: partial comparison of loads and stores in the LSQ; introduces false dependences (< 9%)
Indexing data and tag RAM arrays: the LSBs can prefetch data out of L1$; reduces access latency of loads
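The partial comparison idea can be sketched in a few lines of Python. This is a simplified illustration, not the LSQ logic from the study: it assumes same-sized, aligned accesses, and the function names and the default of 8 low-order bits are hypothetical choices.

```python
def low_bits(addr: int, n_bits: int) -> int:
    """Keep only the n_bits least significant bits of an address."""
    return addr & ((1 << n_bits) - 1)


def classify_load(load_addr: int, older_store_addrs: list, n_bits: int = 8) -> str:
    """Classify a load against older stores using only the low address bits.

    If no older store matches in the low bits, the full addresses must differ,
    so the load can safely proceed early. A low-bit match may be a real
    conflict or a false dependence that the full address later clears.
    """
    partial_hit = any(
        low_bits(s, n_bits) == low_bits(load_addr, n_bits) for s in older_store_addrs
    )
    if not partial_hit:
        return "independent"
    full_hit = any(s == load_addr for s in older_store_addrs)
    return "true dependence" if full_hit else "false dependence"


print(classify_load(0x1234, [0x5678]))  # low bytes 0x34 vs 0x78
print(classify_load(0x1234, [0xAB34]))  # low bytes match, full addresses differ
print(classify_load(0x1234, [0x1234]))  # same address
```

The first case resolves early from the low bits alone; the second is the kind of false dependence the slide bounds at under 9%.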
L wires: Narrow Bit Width Operands
PowerPC: data bit-width determines FU latency
Transfer of 10-bit integers on L wires
Can introduce scheduling difficulties
A predictor table of saturating counters
Accuracy of 98%
Reduction in branch mispredict penalty
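A predictor table of saturating counters for narrow operands could look roughly like the Python sketch below. The table size, the 2-bit counter width, indexing by program counter, and the definition of "narrow" as an unsigned value that fits in 10 bits are all assumptions made for illustration; the slides do not specify them.

```python
class NarrowWidthPredictor:
    """Hypothetical table of saturating counters, indexed by instruction PC."""

    def __init__(self, entries: int = 1024, counter_bits: int = 2, narrow_bits: int = 10):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.threshold = (self.max_count + 1) // 2  # predict narrow at or above this
        self.narrow_limit = 1 << narrow_bits
        self.table = [0] * entries

    def _index(self, pc: int) -> int:
        # Drop the two low PC bits (assumes 4-byte instructions), then wrap.
        return (pc >> 2) % self.entries

    def predict_narrow(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc: int, result: int) -> None:
        i = self._index(pc)
        if 0 <= result < self.narrow_limit:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)


predictor = NarrowWidthPredictor()
pc = 0x400100
for value in (5, 17, 900):       # three narrow results train the counter up
    predictor.update(pc, value)
print(predictor.predict_narrow(pc))
```

Knowing ahead of time that a result is likely narrow lets the scheduler plan for the shorter L-wire transfer, which is the scheduling difficulty the slide refers to.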
Power Efficient Wires
Base case: B wires
Power and bandwidth optimized: PW wires
Idea: steer non-critical data through the energy-efficient PW interconnect
PW wires: Power/Bandwidth Efficient
Ready register operands: transfer of data at instruction dispatch; transfer of input operands to remote register file; covered by long dispatch-to-issue latency
Store data: could stall commit process; delays dependent loads
[Figure: rename and dispatch stage feeding four clusters, each with an IQ, a register file and an FU. Operand is ready at cycle 90; consumer instruction dispatched at cycle 100]
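One way to read the steering idea is as a simple rule: send store data and operands that are already ready when their consumer is dispatched over PW wires, and everything else over B wires. The Python sketch below encodes that reading. The rule itself and the function names are assumptions for illustration; the latencies are the crossbar values quoted in the evaluation slides (B: 2 cycles, PW: 3 cycles).

```python
CROSSBAR_LATENCY = {"B": 2, "PW": 3}  # cycles, from the evaluation slides


def choose_wire(operand_ready_cycle: int, consumer_dispatch_cycle: int,
                is_store_data: bool = False) -> str:
    """Hypothetical steering rule: non-critical transfers go on PW wires."""
    if is_store_data:
        return "PW"
    if operand_ready_cycle <= consumer_dispatch_cycle:
        return "PW"  # operand was ready before the consumer was even dispatched
    return "B"


def arrival_cycle(operand_ready_cycle: int, consumer_dispatch_cycle: int, wire: str) -> int:
    # The transfer cannot start until the value exists and the consumer's
    # destination cluster is known at dispatch.
    start = max(operand_ready_cycle, consumer_dispatch_cycle)
    return start + CROSSBAR_LATENCY[wire]


wire = choose_wire(90, 100)
print(wire, arrival_cycle(90, 100, wire))
```

For the slide's example (operand ready at cycle 90, consumer dispatched at cycle 100) the rule picks PW wires; the one extra cycle relative to B wires is the kind of delay the slide expects the dispatch-to-issue latency to cover.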
Evaluation Methodology
Simplescalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model
Crossbar interconnects (L, B and PW wires)
Crossbar latencies: B wires (2 cycles), L wires (1 cycle), PW wires (3 cycles)
[Figure: L1 D-cache connected to the clusters]
Heterogeneous Interconnects
Intercluster global interconnect
72 B wires (64 data bits and 8 control bits): repeaters sized and spaced for optimum delay
18 L wires: wide wires and large spacing; occupy more area; low latencies
144 PW wires: poor delay; high bandwidth; low power
Analytical Model
RC model of the wire
$C = C_a + W_s C_b + C_c / W_s$
Term 1 ($C_a$): fringing capacitance
Term 2 ($W_s C_b$): capacitance between different layers of wires
Term 3 ($C_c / W_s$): capacitance between wires of the same metal layer
Total power = short-circuit power + switching power + leakage power
Evaluation methodology
Simplescalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model
Ring latencies: B wires (4 cycles), PW wires (6 cycles), L wires (2 cycles)
[Figure: I-cache, D-cache and LSQ connected to the clusters through crossbars and a ring interconnect]
IPC improvements: L wires
L wires improve performance by 4.2% on a four-cluster system and 7.1% on a sixteen-cluster system
[Figure: per-benchmark bar chart, y-axis from 0 to 2.5, for ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise and AM]
Baseline: 144 B-Wires
Low-latency optimizations: 144 B-Wires and 36 L-Wires
Four Cluster System: ED2 Improvements
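| Link | Relative metal area | IPC | Relative processor energy (10%) | Relative ED2 (10%) | Relative ED2 (20%) |
|---|---|---|---|---|---|
| 144 B | 1.0 | 0.95 | 100 | 100 | 100 |
| 288 PW | 1.0 | 0.92 | 97 | 103.4 | 100.2 |
| 288 PW, 36 L | 2.0 | 0.97 | 99 | 94.4 | 93.2 |
| 144 B, 36 L | 2.0 | 0.99 | 101 | 93.3 | 94.5 |
| 288 B | 2.0 | 0.98 | 103 | 96.6 | 99.2 |
| 144 PW, 36 L | 1.5 | 0.96 | 97 | 95.0 | 92.1 |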
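The relative metal area figures for the four-cluster links are consistent with a simple per-wire area weighting, which the short Python check below makes explicit. The weights (an L wire occupying four times the area of a B wire, a PW wire half) are inferred from the reported numbers and are an assumption here, not something the slides state directly.

```python
# Assumed area per wire, in units of one B wire (inferred, not stated in the slides).
AREA_PER_WIRE = {"B": 1.0, "L": 4.0, "PW": 0.5}
BASELINE_AREA = 144 * AREA_PER_WIRE["B"]  # the 144 B-wire baseline link


def relative_metal_area(link: dict) -> float:
    """Metal area of a link relative to the 144 B-wire baseline."""
    return sum(count * AREA_PER_WIRE[kind] for kind, count in link.items()) / BASELINE_AREA


# (wire mix, relative metal area reported for the four-cluster system)
reported = {
    "144 B": ({"B": 144}, 1.0),
    "288 PW": ({"PW": 288}, 1.0),
    "288 PW, 36 L": ({"PW": 288, "L": 36}, 2.0),
    "144 B, 36 L": ({"B": 144, "L": 36}, 2.0),
    "288 B": ({"B": 288}, 2.0),
    "144 PW, 36 L": ({"PW": 144, "L": 36}, 1.5),
}

for name, (link, area) in reported.items():
    print(f"{name}: computed {relative_metal_area(link):.1f}, reported {area:.1f}")
```

Under the same weights, the 72 B wires, 18 L wires and 144 PW wires listed for the heterogeneous interconnect each occupy the same metal area.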
Sixteen Cluster system: ED2 gains
| Link | IPC | Relative processor energy (20%) | Relative ED2 (20%) |
|---|---|---|---|
| 144 B | 1.11 | 100 | 100 |
| 144 PW, 36 L | 1.05 | 94 | 105.3 |
| 144 B, 36 L | 1.19 | 102 | 88.7 |
| 288 B, 36 L | 1.22 | 107 | 88.7 |
| 288 B | 1.18 | 105 | 93.1 |
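ED2 is the product of energy and the square of delay. For a fixed instruction count, delay scales as 1/IPC, so a relative ED2 can be estimated as relative energy times (baseline IPC / IPC) squared. The Python sketch below applies that estimate to the sixteen-cluster figures reported above. It is a consistency check under that assumption, not the authors' exact procedure, and small differences are expected because the reported IPC and energy values are rounded.

```python
BASE_IPC = 1.11  # IPC of the 144 B-wire baseline in the sixteen-cluster system


def relative_ed2(rel_energy: float, ipc: float, base_ipc: float = BASE_IPC) -> float:
    """Estimate relative ED^2, assuming delay is proportional to 1 / IPC."""
    return rel_energy * (base_ipc / ipc) ** 2


# (link, IPC, relative processor energy, reported relative ED2)
rows = [
    ("144 B", 1.11, 100, 100.0),
    ("144 PW, 36 L", 1.05, 94, 105.3),
    ("144 B, 36 L", 1.19, 102, 88.7),
    ("288 B, 36 L", 1.22, 107, 88.7),
    ("288 B", 1.18, 105, 93.1),
]

for link, ipc, energy, reported in rows:
    print(f"{link}: estimated {relative_ed2(energy, ipc):.1f}, reported {reported}")
```

The estimate shows why adding L wires pays off despite higher energy: delay enters squared, so a modest IPC gain outweighs a few percent of extra processor energy.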
Conclusions
Exposing the wire design space to the architecture
A case for micro-architectural wire management!
A low latency low bandwidth network alone helps improve performance by up to 7%
ED2 improvements of about 11% compared to a baseline processor with homogeneous interconnect
Entails hardware complexity
Future work
3-D wire model for the interconnects
Design of heterogeneous clusters
Interconnects for cache coherence and L2$
Questions and Comments?
Thank you!
Backup
L wires: Accelerating cache access
TLB access for page lookup: transmit a few bits of the virtual page number on L wires
Prefetch data out of L1$ and TLB
18 L wires (6 tag bits, 8 L1 index bits and 4 TLB index bits)

| Wire type | Crossbar delay | Ring hop delay |
|---|---|---|
| PW wires | 3 | 6 |
| B wires | 2 | 4 |
| L wires | 1 | 2 |
Model parameters
Simplescalar-3.0 with separate integer and floating point queues
32 KB 2-way instruction cache
32 KB 4-way data cache
128-entry 8-way I and D TLB
Overview/Motivation:
Three wire implementations employed in this study
B wires: traditional; optimal delay; huge power consumption
L wires: faster than B wires; lower bandwidth
PW wires: reduced power consumption; higher bandwidth compared to B wires; increased delay through the wires