Feb 14th 2005, University of Utah

Microarchitectural Wire Management for Performance and Power in Partitioned Architectures

Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
Overview/Motivation
Wire delays are costly for performance and power
Latencies of 30 cycles to reach the ends of a chip
50% of dynamic power is in interconnect switching (Magen et al., SLIP '04)
Abundant number of metal layers
Wire Characteristics
Wire resistance and capacitance per unit length:

C_wire = ε0 · (2·K_horiz·thickness/spacing + 2·K_vert·width/layerspacing) + fringe(horiz, vert)
R_wire = ρ / ((thickness − barrier) · (width − 2·barrier))

Width and spacing determine delay (delay ∝ RC) and bandwidth:
Increasing width: resistance ↓, capacitance ↑, bandwidth ↓
Increasing spacing: capacitance ↓, bandwidth ↓
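The per-unit-length R and C expressions above can be sketched numerically. A minimal Python sketch — aside from ε0, every constant (resistivity, dielectric constants, dimensions, fringe term) is an illustrative assumption, not a value from the talk:

```python
# Sketch of the per-unit-length wire model above. All constants are
# illustrative placeholders, not values from the slides.
EPS0 = 8.854e-12      # F/m, vacuum permittivity
RHO = 2.2e-8          # ohm*m, copper resistivity (assumed)

def wire_capacitance(width, spacing, thickness, layerspacing,
                     k_horiz=2.7, k_vert=2.7, fringe=40e-12):
    """C per metre: same-layer coupling + inter-layer plate + fringe."""
    return EPS0 * (2 * k_horiz * thickness / spacing
                   + 2 * k_vert * width / layerspacing) + fringe

def wire_resistance(width, thickness, barrier=5e-9):
    """R per metre: the barrier layer shrinks the conducting cross-section."""
    return RHO / ((thickness - barrier) * (width - 2 * barrier))

def rc_delay_metric(width, spacing, thickness=200e-9, layerspacing=200e-9):
    """Per-unit-length RC product, proportional to wire delay."""
    return (wire_resistance(width, thickness)
            * wire_capacitance(width, spacing, thickness, layerspacing))

# Doubling width and spacing (L-wire style) cuts the RC delay metric,
# at the cost of wire tracks (bandwidth).
assert rc_delay_metric(200e-9, 200e-9) < rc_delay_metric(100e-9, 100e-9)
```

This reproduces the trend the slide's table states: fatter, sparser wires trade bandwidth for delay.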
Design Space Exploration
Tuning wire width and spacing
[Figure: B wires at minimum width and spacing d vs. L wires at doubled width and spacing 2d; larger width lowers resistance, larger spacing lowers capacitance, and both cost bandwidth]
Transmission Lines
Allow extremely low delay
High implementation complexity and overhead!
Large width
Large spacing between wires
Design of sensing circuit
Shielding power and ground lines adjacent to each line
Implemented in test CMOS chips
Not employed in this study
Design Space Exploration
Tuning Repeater size and spacing
Traditional wires: large repeaters, delay-optimal spacing
Power-optimal wires: smaller repeaters, increased spacing
[Figure: delay vs. power trade-off as repeater size and spacing are tuned]
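The repeater trade-off can be illustrated with the classic repeated-wire delay model (per-segment 0.69·RC Elmore terms, summed over segments). A hedged sketch: the model structure is standard, but every device and wire constant below is an assumed placeholder, and the energy term counts only wire plus repeater switching capacitance:

```python
# Hedged sketch of repeater tuning: fewer/smaller repeaters save switching
# energy but lengthen the wire delay. All constants are illustrative.
def repeated_wire(length, n_segs, rep_size,
                  r_wire=1e6, c_wire=150e-12,   # per-metre wire R and C (assumed)
                  r0=10e3, c0=1e-15, vdd=1.0):  # unit repeater R and C (assumed)
    """Return (delay_s, switching_energy_J) for a repeated wire."""
    seg = length / n_segs
    r_seg, c_seg = r_wire * seg, c_wire * seg
    r_rep, c_rep = r0 / rep_size, c0 * rep_size
    # Elmore-style delay of one segment, times the number of segments
    delay = n_segs * 0.69 * (r_rep * (c_rep + c_seg)
                             + r_seg * (c_seg / 2 + c_rep))
    # Switching energy: wire capacitance plus all repeater capacitance
    energy = (c_wire * length + n_segs * c_rep) * vdd ** 2
    return delay, energy

d_fast, e_fast = repeated_wire(0.01, n_segs=25, rep_size=40)  # delay-tuned
d_low,  e_low  = repeated_wire(0.01, n_segs=10, rep_size=15)  # power-tuned
# Smaller, sparser repeaters: less energy, more delay
assert e_low < e_fast and d_low > d_fast
```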
Design Space Exploration
B wires: base case
W wires: bandwidth-optimized
P wires: power-optimized
PW wires: power- and bandwidth-optimized
L wires: fast, low bandwidth
Outline
Overview
Wire Design Space Exploration
Employing L wires for Performance
PW wires: The Power Optimizers
Results
Conclusions
Evaluation Platform
Centralized front-end: I-Cache & D-Cache, LSQ, branch predictor
Clustered back-end
[Figure: centralized front-end and L1 DCache connected to the clusters]
Cache Pipeline
Baseline (B wires):
Eff. address transfer (10c) → Mem. dep. resolution (5c) → Cache access (5c) → Data return at 20c
With L wires:
8-bit LSB transfer on L wires (5c) → Partial mem. dep. resolution (3c) → Cache access (5c) → Data return at 14c
(the full effective-address transfer, 10c on B wires, completes in parallel)
[Figure: pipeline stages between the functional unit, LSQ, and L1 DCache]
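The cycle accounting behind the two pipelines can be written out directly. A small sketch using the slide's stage latencies; treating the data-return cycle as the sum of stage latencies is a simplifying assumption:

```python
# Cycle accounting for the two cache pipelines above, using the slide's
# stage latencies. Summing stages is a simplification of the real timing.
def data_return_cycle(stages):
    return sum(stages)

# Baseline: address transfer, dep. resolution, cache access
baseline = data_return_cycle([10, 5, 5])
assert baseline == 20                 # slide: data return at 20c

# L-wire path: 8-bit LSB transfer (5c), partial dep. resolution (3c),
# cache access (5c); the full 10c address transfer overlaps on B wires.
lwire = max(data_return_cycle([5, 3, 5]), 10)
assert lwire < baseline               # slide reports data return at 14c
```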
L wires: Accelerating Cache Access
Transmit the LSBs of the effective address on L wires → faster memory disambiguation
Partial comparison of loads and stores in the LSQ
Introduces false dependences (< 9%)
LSBs index the data and tag RAM arrays, prefetching data out of the L1$
Reduces the access latency of loads
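The partial LSQ comparison is conservative: two addresses whose low bits match are treated as a possible conflict, which is what creates the false dependences quoted above. A minimal sketch, with an illustrative 8-bit LSB width:

```python
# Hedged sketch of partial memory disambiguation: compare only the low-order
# address bits (the ones that arrive early on L wires). Full addresses that
# differ but share LSBs become false dependences. The bit width is illustrative.
LSB_BITS = 8
MASK = (1 << LSB_BITS) - 1

def may_conflict(load_addr_lsb, store_addr):
    """Conservative check: conflict unless the LSBs provably differ."""
    return load_addr_lsb == (store_addr & MASK)

# Same full address: a real conflict is (correctly) flagged
assert may_conflict(0x1234 & MASK, 0x1234)
# Different full addresses with matching LSBs: a false dependence
assert may_conflict(0x1234 & MASK, 0x5634)
# LSBs differ: provably independent, the load can proceed early
assert not may_conflict(0x1234 & MASK, 0x1235)
```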
L wires: Narrow Bit-Width Operands
PowerPC: data bit-width determines FU latency
Transfer of 10-bit integers on L wires
Can introduce scheduling difficulties
A predictor table of saturating counters identifies narrow operands (accuracy of 98%)
Reduction in branch mispredict penalty
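The predictor table can be sketched as an array of 2-bit saturating counters. Everything here — table size, indexing by instruction PC, the update rule — is an assumed illustration of the idea, not the paper's exact design:

```python
# Hedged sketch of a narrow-operand predictor: a table of 2-bit saturating
# counters, indexed by (hypothetical) instruction PC, learns whether an
# instruction tends to produce values that fit in a narrow bit-width.
TABLE_SIZE = 1024
NARROW_BITS = 10                      # slide: 10-bit integers fit on L wires
counters = [0] * TABLE_SIZE           # 2-bit counters: 0..3

def index(pc):
    return (pc >> 2) % TABLE_SIZE

def predict_narrow(pc):
    return counters[index(pc)] >= 2   # predict narrow when counter is 2 or 3

def train(pc, result):
    i = index(pc)
    was_narrow = -(1 << (NARROW_BITS - 1)) <= result < (1 << (NARROW_BITS - 1))
    if was_narrow:
        counters[i] = min(3, counters[i] + 1)
    else:
        counters[i] = max(0, counters[i] - 1)

# An instruction that repeatedly produces small values becomes predicted narrow
for _ in range(3):
    train(0x4000, 42)
assert predict_narrow(0x4000)
train(0x4000, 1 << 20)                # one wide result only decrements
```

Hysteresis in the counters is what makes a single wide result not flip the prediction, which helps reach the high accuracy the slide quotes.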
Power-Efficient Wires
B wires: base case
PW wires: power- and bandwidth-optimized
Idea: steer non-critical data through the energy-efficient PW interconnect
PW wires: Power/Bandwidth Efficient
Ready register operands: transfer the data at instruction dispatch
Transfer of input operands to the remote register file
Covered by the long dispatch-to-issue latency (e.g., operand ready at cycle 90, consumer instruction dispatched at cycle 100)
Store data: could stall the commit process and delay dependent loads
[Figure: rename & dispatch stage feeding four clusters, each with an issue queue, register file, and functional units]
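The steering decision above can be phrased as a slack test: use the slow, low-power wires when the gap between a value being ready and being needed absorbs the extra latency. A sketch using the 4-cluster latencies from the evaluation (B = 2 cycles, PW = 3 cycles); the slack heuristic itself is an assumed illustration:

```python
# Hedged sketch of criticality-based wire steering: route a transfer over
# slow PW wires when its slack covers the extra latency, else over B wires.
# Latencies match the 4-cluster evaluation; the heuristic is an assumption.
B_LATENCY, PW_LATENCY = 2, 3

def choose_wire(ready_cycle, needed_cycle):
    slack = needed_cycle - ready_cycle
    return "PW" if slack >= PW_LATENCY else "B"

# Operand ready at cycle 90, consumer dispatched at cycle 100: ample slack,
# so the low-power wires cost nothing in performance.
assert choose_wire(90, 100) == "PW"
# Critical operand needed immediately: pay for the fast wires.
assert choose_wire(90, 91) == "B"
```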
Outline
Overview
Wire Design Space Exploration
Employing L wires for Performance
PW wires: The Power Optimizers
Results
Conclusions
Evaluation Methodology
Simplescalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model
Crossbar interconnects (L, B, and PW wires)
Crossbar latencies: L wires 1 cycle, B wires 2 cycles, PW wires 3 cycles
[Figure: four clusters and the L1 DCache connected by a crossbar]
Heterogeneous Interconnects: Inter-cluster Global Interconnect
72 B wires (64 data bits and 8 control bits): repeaters sized and spaced for optimum delay
18 L wires: wide wires and large spacing; occupy more area; low latency
144 PW wires: poor delay, high bandwidth, low power
Analytical Model
RC model of the wire; capacitance per unit length:

C = Ca + Ws·Cb + Cc/Ws

where Ca is the fringing capacitance, Ws·Cb the capacitance between different metal layers, and Cc/Ws the coupling capacitance between wires of the same metal layer.

Total Power = Short-Circuit Power + Switching Power + Leakage Power
Evaluation Methodology
Simplescalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model
Ring latencies: L wires 2 cycles, B wires 4 cycles, PW wires 6 cycles
[Figure: I-Cache, D-Cache, and LSQ connected to the clusters through a crossbar and ring interconnect]
IPC Improvements: L wires
L wires improve performance by 4.2% on a four-cluster system and 7.1% on a sixteen-cluster system
[Figure: IPC (0 to 2.5) for each SPEC2000 benchmark — ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise — and the arithmetic mean (AM)]
Baseline: 144 B-Wires
Low-latency optimizations: 144 B-Wires and 36 L-Wires
Four Cluster System: ED2 Improvements

Link          Rel. metal area  IPC   Rel. processor energy (10%)  Rel. ED2 (10%)  Rel. ED2 (20%)
144 PW, 36 L  1.5              0.96  97                           95.0            92.1
288 B         2.0              0.98  103                          96.6            99.2
144 B, 36 L   2.0              0.99  101                          93.3            94.5
288 PW, 36 L  2.0              0.97  99                           94.4            93.2
288 PW        1.0              0.92  97                           103.4           100.2
144 B (base)  1.0              0.95  100                          100             100

(10%) and (20%) denote two assumptions for the interconnect's share of total processor power.
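The relative ED2 columns follow directly from the relative energy and IPC columns: delay per unit of work scales as 1/IPC, so relative ED2 = relative energy × (IPC_base / IPC)². A short check against the table (the printed IPC and energy values are rounded, so some rows agree only approximately):

```python
# Recompute relative ED^2 from the table's relative energy and IPC columns.
# Delay scales as 1/IPC, so ED^2_rel = E_rel * (IPC_base / IPC)^2.
# Baseline: 144 B wires, IPC 0.95, relative energy 100.
def relative_ed2(energy_rel, ipc, ipc_base=0.95):
    return energy_rel * (ipc_base / ipc) ** 2

# Rows from the 10%-interconnect-power column that reproduce exactly:
assert round(relative_ed2(97, 0.96), 1) == 95.0    # 144 PW, 36 L
assert round(relative_ed2(97, 0.92), 1) == 103.4   # 288 PW
```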
Sixteen Cluster System: ED2 Gains

Link          IPC   Rel. processor energy (20%)  Rel. ED2 (20%)
288 B         1.18  105                          93.1
288 B, 36 L   1.22  107                          88.7
144 B, 36 L   1.19  102                          88.7
144 PW, 36 L  1.05  94                           105.3
144 B (base)  1.11  100                          100
Conclusions
Exposing the wire design space to the architecture
A case for micro-architectural wire management!
A low-latency, low-bandwidth (L-wire) network alone improves performance by up to 7%
ED2 improvements of about 11% compared to a baseline processor with a homogeneous interconnect
Entails hardware complexity
Future work
3-D wire model for the interconnects
Design of heterogeneous clusters
Interconnects for cache coherence and L2$
Questions and Comments?
Thank you!
Backup
L wires: Accelerating cache access
TLB access for page lookup: transmit a few bits of the virtual page number on L wires
Prefetch data out of the L1$ and TLB
18 L wires (6 tag bits, 8 L1 index bits, and 4 TLB index bits)

Wire type   Crossbar delay  Ring hop delay
L wires     1               2
B wires     2               4
PW wires    3               6
Model parameters
Simplescalar-3.0 with separate integer and floating-point queues
32 KB 2-way instruction cache
32 KB 4-way data cache
128-entry 8-way I- and D-TLBs
Overview/Motivation:
Three wire implementations employed in this study:
B wires: traditional, delay-optimal; high power consumption
L wires: faster than B wires; lower bandwidth
PW wires: reduced power consumption; higher bandwidth than B wires; increased delay through the wires