
WiDGET: Wisconsin Decoupled Grid Execution Tiles

Yasuko Watanabe*, John D. Davis†, David A. Wood*

*University of Wisconsin †Microsoft Research

ISCA 2010, Saint-Malo, France

Power vs. Performance

A full range of operating points on a single chip

[Figure: Normalized Chip Power (y-axis) vs. Normalized Performance (x-axis); labels: Xeon-like, Atom-like, A single core]

Executive Summary

• WiDGET framework
  – Sea of resources
  – In-order Execution Units (EUs)
    • ALU & FIFO instruction buffers
    • Distributed in-order buffers → OoO execution
  – Simple instruction steering
  – Core scaling through resource allocation
    • ↓ EUs → Slower with less power
    • ↑ EUs → Turbo speed with more power

Intel CPU Power Trend

[Figure: Thermal Design Power (W), axis ticks 0.15 to 150, vs. year (1971 to 2009) for Intel CPUs: 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, Pentium MMX, Pentium II, Pentium III, Pentium 4, Pentium D, Core 2, Core i7]

Conflicting Goal: Low Power & High Performance

• Need for high single-thread performance
  – Amdahl’s Law [Hill08]
  – Service-level agreements [Reddi10]
• Challenge
  – Energy proportional computing [Barroso07]
• Prior approach
  – Dynamic voltage and frequency scaling (DVFS)

Diminishing Returns of DVFS

• Near saturation in voltage scaling
• Innovations in microarchitecture needed

[Figure: Operating Voltage (V), axis 0 to 2, vs. year (1997 to 2009); series: IBM PowerPC 405LP, Intel XScale 80200, Transmeta Crusoe TM5800, Intel Itanium Montecito, Atom Silverthorne, Vmin]

Outline

• High-level design overview
• Microarchitecture
• Single-thread evaluation
• Conclusions

High-Level Design

• Sea of resources

[Diagram: chip layout with eight cores; labels: L1I, L1D, L2, Thread context management or Instruction Engine (Front-end + Back-end), In-order Execution Unit (EU)]

WiDGET Vision

[Diagram: five example configurations, each annotated with TLP, ILP, and/or Power labels]

Just 5 examples. Much more can be done.

In-Order EU

[Diagram: EU internals shown within the chip layout; labels: Router, Operand Buffer, In-Order Instr Buffers]

• Executes 1 instruction/cycle
• EU aggregation for OoO-like performance
  – Increases both issue BW & buffering
  – Prevents stalled instructions from blocking ready instructions
  – Extracts MLP & ILP
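The aggregation argument can be illustrated with a toy model. The sketch below is not the WiDGET simulator: it assumes fixed instruction latencies, treats every FIFO as its own issue slot, and uses a hypothetical helper `run` that only lets the head of each FIFO issue. It shows how splitting independent chains across in-order buffers keeps a long-latency instruction from blocking ready work behind it.

```python
from collections import deque

def run(buffers, latency, deps):
    """Cycles needed to drain in-order FIFO buffers (toy model, assumed semantics).

    buffers: list of lists of instruction ids; each list issues strictly in order.
    latency: dict id -> execution latency in cycles.
    deps:    dict id -> list of producer ids that must finish first.
    """
    queues = [deque(b) for b in buffers]
    done_at = {}  # id -> cycle at which its result becomes available
    cycle = 0
    while any(queues):
        for q in queues:
            if not q:
                continue
            head = q[0]  # only the head of a FIFO may issue
            ready = all(p in done_at and done_at[p] <= cycle
                        for p in deps.get(head, []))
            if ready:
                done_at[head] = cycle + latency[head]
                q.popleft()
        cycle += 1
    return max(done_at.values())

# Hypothetical program: 1 is a long-latency load, 2 uses it; 3 -> 4 is independent.
LAT = {1: 10, 2: 1, 3: 1, 4: 1}
DEPS = {2: [1], 4: [3]}
one_buffer = run([[1, 2, 3, 4]], LAT, DEPS)      # 3 and 4 wait behind the stalled 2
two_buffers = run([[1, 2], [3, 4]], LAT, DEPS)   # the independent chain proceeds
```

With one buffer the independent chain waits behind the stalled consumer; with two buffers it overlaps with the load, which is the coarse-grain out-of-order effect the slide describes.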

EU Cluster

[Diagram: EU clusters placed between two cores (L1I 0, L1D 0, IE 0 and L1I 1, L1D 1, IE 1) above the L2; legend: Thread context management or Instruction Engine (Front-end + Back-end), In-order Execution Unit (EU)]

Full bypass within a cluster

1-cycle inter-cluster link

Instruction Engine (IE)

• Thread specific structures
• Front-end + back-end
• Similar to a conventional OoO pipe
• Steering logic for distributed EUs
• Achieve OoO performance with in-order EUs
• Expose independent instr chains

[Diagram: IE pipeline. Front-End: BR Pred, Fetch, Decode, Rename, Steering, RF. Back-End: ROB, Commit]

Coarse-Grain OoO Execution

[Diagram: a dependence graph of instructions 1 to 8 and its schedule under OoO Issue and under WiDGET]

Methodology

• Goal: Power proportionality
  – Wide performance & power ranges
• Full-system execution-driven simulator
  – Based on GEMS
  – Integrated Wattch and CACTI
• SPEC CPU2006 benchmark suite
• 2 comparison points
  – Neon: Aggressive proc for high ILP
  – Mite: Simple, low-power proc
• Config: 1 - 8 EUs to 1 IE
  – 1 - 4 instruction buffers / EU

[Diagram labels: L1I 0, L1D 0, IE 0]

Power Proportionality

• 21% power savings to match Neon’s performance
• 8% power savings for 26% better performance than Neon
• Power scaling of 54% to approximate Mite
• Covers both Neon and Mite on a single chip

[Figure: Normalized Chip Power vs. Normalized Performance for Neon, Mite, and WiDGET with 1 to 8 EUs; WiDGET endpoints labeled 1EU+1IB and 8EUs+4IBs]

Power Breakdown

• Less than ⅓ of Neon’s execution power
  – Due to no OoO scheduler and limited bypass
• Increase in WiDGET’s power caused by:
  – Increased EUs and instruction buffers
  – Higher utilization of other resources

[Figure: stacked Normalized Power bars for Neon, Mite, and WiDGET with 1 to 8 EUs (bars 1 to 4 per EU count); components: L3, L2, L1D, L1I, Fetch/Decode/Rename, Backend, ALU, Execution]

Conclusions

• WiDGET
  – EU provisioning for power-performance target
  – Trade off complexity for power
  – OoO approximation using in-order EUs
  – Distributed buffering to extract MLP & ILP
• Single-thread performance
  – Scale from close to Mite to better than Neon
  – Power proportional computing

How is this different from the next talk?

[Comparison diagram: Forwardflow [Gibson10] and WiDGET [Watanabe10]. Vision: Scalable Cores for both. Mechanism labels: In-Order Approximation; In-order, Steering]

Thank you! Questions?

Backup Slides

• DVFS
• Design choices of WiDGET
• Steering heuristic
• Memory disambiguation
• Comparison to related work
• Vs. Clustered architectures
• Vs. Complexity-effective superscalars
• Steering cost model
• Steering mechanism
• Machine configuration
• Area model
• Power efficiency
• Vs. Dynamic HW resizing
• Vs. Heterogeneous CMPs
• Vs. Dynamic multi-cores
• Vs. Thread-level speculation
• Vs. Braid Architecture
• Vs. ILDP

Dynamic Voltage/Freq Scaling (DVFS)

• Dynamically trade off power for performance
  – Change voltage and freq at runtime
  – Often regulated by OS
    • Slow response time
• Linear reduction of V & F
  – Cubic in dynamic power
  – Linear in performance
  – Quadratic in dynamic energy
• Effective for thermal management
• Challenges
  – Controlling DVFS
  – Diminishing returns of DVFS
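The cubic, linear, and quadratic relations follow from the usual CMOS dynamic-power model, P ∝ C·V²·f. A minimal sketch, assuming voltage and frequency are scaled by the same factor `s` and that run time is inversely proportional to frequency (the function name `dvfs_scaling` is illustrative, not from the talk):

```python
def dvfs_scaling(s):
    """Relative dynamic power, performance, and energy when V and f both scale by s."""
    power = s ** 3                 # P_dyn ~ C * V^2 * f, so s^2 * s
    performance = s                # performance tracks frequency
    energy = power / performance   # energy per task = power * time, time ~ 1/f
    return power, performance, energy

# Halving V and f: 1/8 the power, 1/2 the performance, 1/4 the energy per task.
half = dvfs_scaling(0.5)
```

The same arithmetic explains the diminishing returns: once voltage can no longer drop, only the linear frequency term remains.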

Service-Level Agreements (SLAs)

• Expectations b/w consumer and provider
  – QoS, boundaries, conditions, penalties
• EX: Web server SLA
  – Combination of latency, throughput, and QoS (min % performed successfully)
• Guaranteeing SLAs on WiDGET
  1. Set deadline & throughput goals
     • Adjust EU provisioning
  2. Set target machine specs
     • X processor-like with X GB memory
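One way to picture "adjust EU provisioning" is as a lookup over operating points. The sketch below is an assumption about how such a policy could look, not a mechanism described in the talk; `provision` and the operating points in `EXAMPLE_CONFIGS` are hypothetical and are not measurements.

```python
def provision(configs, min_performance):
    """Pick the lowest-power configuration that meets a performance goal.

    configs: list of (eus, normalized_performance, normalized_power) tuples.
    Returns the chosen tuple, or None if no configuration is fast enough.
    """
    feasible = [c for c in configs if c[1] >= min_performance]
    return min(feasible, key=lambda c: c[2]) if feasible else None

# Hypothetical operating points, NOT results from the talk.
EXAMPLE_CONFIGS = [(1, 0.4, 0.45), (2, 0.7, 0.55), (4, 1.0, 0.75), (8, 1.2, 0.95)]
choice = provision(EXAMPLE_CONFIGS, 0.9)
```

A real controller would also have to account for the cost of changing the allocation at run time, which this sketch ignores.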

Design Choice of WiDGET (1/2)

[Diagram: Nehalem-like CMP, a grid of cores sharing an L2; per-core pipeline labels: I$, BR Pred, Fetch, Decode, Rename, OoO IQ, RF, ROB, Commit, D$]

OoO issue queue + full bypass = 35% of processor power

Design Choice of WiDGET (2/2)

[Diagram: Nehalem-like CMP, a grid of cores sharing an L2; per-core pipeline labels: I$, BR Pred, Fetch, Decode, Rename, OoO IQ, RF, ROB, Commit, D$, In-Order Exec]

OoO issue queue + full bypass = 35% of processor power

Replace with simple building blocks

Decouple from the rest

Steering Heuristic

• Based on dependence-based steering [Palacharla97]
  – Expose independent instr chains
  – Consumer directly behind the producer
  – Stall steering when no empty buffer is found
• WiDGET: Power-performance goal
  – Emphasize locality & scalability

[Flowchart over Cluster 0 and Cluster 1: Outstanding Ops?
  0: Any empty buf
  1: Avail behind producer? Y: Producer buf; N: Empty buf within cluster
  2: Avail behind either of producers? Y: Producer buf; N: Empty buf in either of clusters]

• Consumer-push operand transfers
  – Send steered EU ID to the producer EU
  – Multi-cast result to all consumers
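The decision flow can be sketched in a few lines. This is one reading of the flattened flowchart, not the authors' implementation: the function `steer`, its arguments, and the tie-breaking rule (lowest buffer id first) are assumptions made for illustration.

```python
def steer(producers, empty, cluster_of, tail_free):
    """Choose an instruction buffer for one instruction, or None to stall.

    producers:  buffer ids holding the in-flight producers of this
                instruction's outstanding operands (0, 1, or 2 entries).
    empty:      set of currently empty buffer ids.
    cluster_of: dict buffer id -> cluster id.
    tail_free:  set of buffer ids whose last instruction has no consumer
                queued directly behind it yet.
    """
    if not producers:
        # No outstanding operands: any empty buffer will do.
        return min(empty) if empty else None
    # Prefer the slot directly behind a producer.
    for buf in producers:
        if buf in tail_free:
            return buf
    # Otherwise stay local: an empty buffer in one of the producers' clusters.
    clusters = {cluster_of[b] for b in producers}
    local = [b for b in sorted(empty) if cluster_of[b] in clusters]
    return local[0] if local else None  # None means steering stalls

# Hypothetical layout: buffers 0-1 in cluster 0, buffers 2-3 in cluster 1.
CLUSTER_OF = {0: 0, 1: 0, 2: 1, 3: 1}
```

For example, `steer([0], {1, 2}, CLUSTER_OF, set())` picks buffer 1, because the producer's own buffer already has a consumer behind it and buffer 1 is the only empty buffer in the producer's cluster.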

Opportunities / Challenges

• Decoupled design + modularity =
  – Reconfig by turning on/off components
  – Core scaling by EU provisioning
    • Rather than fusing cores
  – Core customization for ILP, TLP, DLP, MLP & power
• Challenges
  o Parallelism-communication trade-off
  o Varying communication demands

Memory Disambiguation on WiDGET

• Challenges arising from modularity
  – Less communication between modules
  – Less centralized structures
• Benefits of NoSQ
  – Mem dependency → register dependency
  – Reduced communication
  – No centralized structure
  – Only register dependency relation b/w EUs
    • Faster execution of loads

Memory Instructions?

• No LSQ thanks to NoSQ [Sha06]
• Instead,
  – Exploit in-window ST-LD forwarding
  – LDs: Predict if dependent ST is in-flight @ Rename
    • If so, read from ST’s source register, not from cache
    • Else, read from cache
    • @ Commit, re-execute if necessary
  – STs: Write @ Commit
  – Prediction
    • Dynamic distance in stores
    • Path-sensitive
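The load path above can be mimicked with a deliberately simplified model. The sketch below is not NoSQ itself: it assumes a dictionary-backed cache and register file, takes the rename-time prediction as an input instead of modeling the predictor, and always re-checks loads at commit. The names `execute_load` and `commit_load` are hypothetical.

```python
def execute_load(addr, predicted_store, regs, cache):
    """Value a load obtains at execute time.

    predicted_store: None, or the in-flight store (a dict with 'src_reg')
                     that the rename-time predictor expects to forward from.
    """
    if predicted_store is not None:
        # Memory dependence turned into a register dependence.
        return regs[predicted_store["src_reg"]]
    return cache.get(addr, 0)

def commit_load(addr, value, cache):
    """Verify at commit; return (final_value, re_executed)."""
    # Older stores have written the cache by now, so it holds the right value.
    correct = cache.get(addr, 0)
    return (value, False) if value == correct else (correct, True)

regs = {"r5": 42}
store = {"src_reg": "r5", "addr": 0x100}

# Correct prediction: the load bypasses the cache and needs no re-execution.
cache = {0x100: 7}
v = execute_load(0x100, store, regs, cache)
cache[store["addr"]] = regs[store["src_reg"]]   # store writes at commit
good = commit_load(0x100, v, cache)

# Missed dependence: the load reads a stale value and is re-executed at commit.
cache = {0x100: 7}
v = execute_load(0x100, None, regs, cache)
cache[store["addr"]] = regs[store["src_reg"]]
bad = commit_load(0x100, v, cache)
```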

Comparison to Related Work

Design EX             Scale Up & Down?  Symmetric?  Decoupled Exec?  In-Order?  Wire Delays?  Data Driven?  ISA Compatibility?
WiDGET                √                 √           √                √          √             √             √
Adaptive Cores        X                 -           √ / X            √ / X      -             -             √
Heterogeneous CMPs    X                 X           X                √ / X      -             -             √
Core Fusion           √                 √           X                X          √             -             √
CLP                   √                 √           √                √          √             √             X
TLS                   X                 √           X                √ / X      -             X             √
Multiscalar           X                 √           X                X          -             X             X
Complexity-Effective  X                 √           √                √          X             √             √
Salverda & Zilles     √                 √           X                √          X             √             √
ILDP & Braid          X                 √           √                √          -             √             X
Quad-Cluster          X                 √           √                X          √             √ / X         √
Access/Execute        X                 X           X                √          -             √             X

Vs. OoO Clusters

• OoO Clusters
  – Goal: Superscalar ILP without impacting cycle time
  – Decentralize deep and wide structures in superscalars
    • Cluster: Smaller OoO design
  – Steering goal
    • High performance through communication hiding & load balance
• WiDGET
  – Goal: Power & Performance
  – Designed to scale cores up & down
    • Cluster: 4 in-order EUs
  – Steering goal
    • Localization to reduce communication latency & power

EX: Steering for Locality

[Diagram: a dependence graph of instructions 1 to 6 steered across Cluster 0 and Cluster 1, with an inter-cluster delay. Clustered Architectures (e.g., Advanced RMBS, Modulo): maintain load balance. WiDGET: exploit locality.]

Vs. Complexity-Effective Superscalars

• Palacharla et al.
  – Goal: Performance
  – Consider all buffers for steering and issuing
    • More buffers → More options
• WiDGET
  – Goal: Power-performance
  – Requirements
    • Shorter wires, power gating, core scaling
  – Differences: Localization & scalability
    • Cluster-affined steering
    • Keep dependent chains nearby
    • Issuing selection only from a subset of buffers
  – New question: Which empty buffer to steer to?

Steering Cost Model [Salverda08]

• Steer to distributed IQs
• Steering policy determines issue time
  – Constrained by dependency, structural hazards, issue policy
• Ideal steering will issue an instr:
  – As soon as it becomes ready (horizon)
  – Without blocking others (frontier) (constraints of in-order)
• Steering Cost = horizon − frontier
  – Good steering: Min absolute (Steering Cost)
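The cost rule reduces to a small computation. The sketch below reads "horizon" as the cycle at which the instruction becomes ready and "frontier" as the cycle at which a given IQ could next issue it; that reading, the helper names, and the sample frontier values are assumptions for illustration.

```python
def steering_cost(horizon, frontier):
    """Steering Cost = horizon - frontier for one candidate queue."""
    return horizon - frontier

def best_queue(horizon, frontiers):
    """Index of the IQ whose frontier minimizes |horizon - frontier|."""
    return min(range(len(frontiers)),
               key=lambda i: abs(steering_cost(horizon, frontiers[i])))

# Hypothetical frontiers for four IQs and an instruction ready at cycle 3.
costs = [steering_cost(3, f) for f in [4, 3, 2, 4]]
pick = best_queue(3, [4, 3, 2, 4])
```

A negative cost means the instruction would wait behind older work in that queue; a positive cost means it would sit at the head and block younger instructions until it becomes ready.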

EX: Application of Cost Model

• Steer instr 3 to In-order IQs
• Challenges
  – Check all IQs to find an optimal steering
  – Actual exec time is unknown in advance
• Argument of Salverda & Zilles
  – Too complex to build or
  – Too many execution resources needed to match OoO

[Diagram: timeline of IQ 0 to IQ 3 with candidate placements of instr 3 at costs C = -1, C = 0, C = 1, and C = -1. Legend: Horizon; F: Frontier; Other instrs; Cost = H - F]

Impact of Comm Delays

[Diagram: two schedules of instructions 1 to 4 over IQ 0 to IQ 3 on a 5-cycle timeline.
  Left, "If 1-cycle comm latency is added…": Exec latency: 3; 5 cycles
  Right, "What should happen instead": Exec latency: 4 cycles; Trade off parallelism for comm]

Observation Under Comm Delays

• Not beneficial to spread instrs
  – Reduced pressure for more execution resources
• Not much need to consider distant IQs
  – Reduced problem space
  – Simplified steering

Instruction Steering Mechanism

• Goals
  – Expose independent instruction chains
  – Achieve OoO performance with multiple in-order EUs
  – Keep dependent instrs nearby
• 3 things to keep track of
  – Producer’s location
  – Whether producer has another consumer
  – Empty buffers

[Annotations: Last Producer Table & Full bit vector; Empty bit vector]

Steering Example

[Diagram: step-by-step steering of instructions 1 to 8 into buffers 0 to 3, showing the Last Producer Table (Register, Buffer ID, Has a consumer?) and the Empty / full bit vectors]

Instruction Buffers

• Small FIFO buffer
  – Config: 16 entries
• 1 straight instr chain per buffer
• Entry
  – Consumer EU field
    • Set if a consumer is steered to different EU
    • Read after computation
      – Multi-cast the result to consumers

[Entry layout: Instr | Op 1 | Op 2 | Consumer EU bit vector]
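The entry layout maps naturally onto a small data structure. The sketch below is an illustration only: the class names, the integer bit vector, and the use of addition as a stand-in for the ALU operation are assumptions, while the 16-entry FIFO and the consumer EU field come from the slide.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Entry:
    instr: str
    op1: int
    op2: int
    consumer_eus: int = 0  # bit i set => a consumer was steered to EU i

class InstrBuffer:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.fifo = deque()

    def push(self, entry):
        """Append an entry; False means the buffer is full."""
        if len(self.fifo) >= self.capacity:
            return False
        self.fifo.append(entry)
        return True

    def issue(self):
        """Pop the head, compute, and return (result, EU ids to multi-cast to)."""
        e = self.fifo.popleft()
        result = e.op1 + e.op2  # stand-in for the real operation
        targets = [i for i in range(e.consumer_eus.bit_length())
                   if (e.consumer_eus >> i) & 1]
        return result, targets

buf = InstrBuffer()
buf.push(Entry("add", 2, 3, consumer_eus=0b0101))
out = buf.issue()  # result goes to EUs 0 and 2
```

Reading the consumer field only after computation matches the consumer-push scheme: the producer learns where to send its result without any central lookup.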

Machine Configurations (columns: Atom [Gerosa08], Xeon [Tam06], WiDGET)

L1 I / D* (all): 32 KB, 4-way, 1 cycle
BR Predictor† (all): TAGE predictor; 16-entry RAS; 64-entry, 4-way BTB
Instr Engine
  Atom: 2-way front-end & back-end
  Xeon, WiDGET: 4-way front-end & back-end; 128-entry ROB
Exec Core
  Atom: 16-entry unified instr queue; 2 INT, 2 FP, 2 AG
  Xeon: 32-entry unified instr queue; 3 INT, 3 FP, 2 AG; 0-cycle operand bypass to anywhere in core
  WiDGET: 16-entry instr buffer per EU; 1 INT, 1 FP, 1 AG per EU; 0-cycle operand bypass within a cluster
Disambiguation† (all): No Store Queue (NoSQ) [Sha06]
L2 / L3 / DRAM* (all): 1 MB, 8-way, 12 cycles / 4 MB, 16-way, 24 cycles / ~300 cycles
Process (all): 45 nm

* Based on Xeon
† Configuration choice

Area Model (45nm)

• Assumptions
  – Single-threaded uniprocessor
  – On-chip 1MB L2
  – Atom chip ≈ WiDGET (2 EUs, 1 buffer per EU)
• WiDGET:
  – > Mite by 10%
  – < Neon by 19%

[Figure: bar chart of Area (mm²), axis 0 to 45, for Mite, WiDGET, and Neon]

Harmonic Mean IPCs

• Best-case: 26% better than Neon
• Dynamic performance range: 3.8

[Figure: Normalized IPC vs. number of Instruction Buffers (1 to 4) for Neon, Mite, and WiDGET with 1 to 8 EUs]

Harmonic Mean Power

• 8 - 58% power savings compared to Neon
• 21% power savings to match Neon’s performance
• Dynamic power range: 2.2

[Figure: Normalized Power vs. number of Instruction Buffers (1 to 4) for Neon, Mite, and WiDGET with 1 to 8 EUs]

Geometric Mean Power Efficiency (BIPS³/W)

• Best-case: 2x of Neon, 21x of Mite
• 1.5x the efficiency of Xeon for the same performance

[Figure: power efficiency chart with Neon and Mite as reference points]
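For reference, the metric itself is simple to compute. The sketch below assumes the standard definition (billions of instructions per second, cubed, divided by watts); the function name and the sample numbers are illustrative and are not data from the talk.

```python
def bips3_per_watt(instructions, seconds, watts):
    """Power efficiency in BIPS^3/W; cubing performance weights speed over power."""
    bips = instructions / seconds / 1e9
    return bips ** 3 / watts

# Illustrative numbers only: 2 billion instructions in 1 s at 4 W.
example = bips3_per_watt(2e9, 1.0, 4.0)
```

For a fixed instruction count the metric is proportional to 1 / (energy × delay²), so a design that is slower but lower power is penalized unless its power drops faster than the cube of its slowdown.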

Energy-Proportional Computing for Servers [Barroso07]

• Servers
  – 10-50% utilization most of the time
    • Yet, availability is crucial
  – Common energy-saving techs inapplicable
  – 50% of full power even during low utilization
• Solution: Energy proportionality
  – Energy consumption in proportion to work done
• Key features
  – Wide dynamic power range
  – Active low-power modes
    • Better than sleep states with wake-up penalties

PowerNap [Meisner09]

• Goals
  – Reduction of server idle power
  – Exploitation of frequent idle periods
• Mechanisms
  – System level
  – Reduce transition time into & out of nap state
  – Ease power-performance trade-offs
  – Modify hardware subsystems with high idle power
    • e.g., DRAM (self-refresh), fans (variable speed)

Thread Motion [Rangan09]

• Goals
  – Fine-grained power management for CMPs
  – Alternative to per-core DVFS
  – High system throughput within power budget
• Mechanisms
  – Migrate threads rather than adjusting voltage
  – Homogeneous cores in multiple, static voltage/freq domains
  – 2 migration policies
    • Time-driven & miss-driven

Dynamic HW Resizing

• Resize if under-utilized for power savings
  – Elimination of transistor switching
  – e.g., IQs, LSQs, ROBs, caches
• Mechanisms
  – Physical: Enable/disable segments (or associativity)
  – Logical: Limit usable space
  – Wire partitioning with tri-state buffers
• Policies
  – Performance (e.g., IPC, ILP)
  – Occupancy
  – Usefulness

Vs. Heterogeneous CMPs

• Their way
  – Equip with small and powerful cores
  – Migrate thread to a powerful core for higher ILP
• Shortcomings
  – More design and verification time
  – Bound to static design choices
  – Poor performance for non-targeted apps
  – Difficult resource scheduling
• My way
  – Get high ILP by aggregating many in-order EUs

[Diagram labels: L2, OoO Core]

Vs. Dynamic Multi-Cores

• Their way
  – Deploy small cores for TLP
  – Dynamically fuse cores for higher ILP
• Shortcomings
  – Large centralized structures [Ipek07]
  – Non-traditional ISA [Kim07]
• My way
  – Only “fuse” EUs
  – No recompilation or binary translation

[Diagram labels: L2, Fuse]

Vs. Thread-Level Speculation

• Their way
  – SW: Divides into contiguous segments
  – HW: Runs speculative threads in parallel
• Shortcomings
  – Only successful for regular program structures
  – Load imbalance
  – Squash propagation
• My way
  – No SW reliance
  – Support a wider range of programs

[Diagram labels: L2, Speculation support]

Vs. Braid Architecture [Tseng08]

• Their way
  – ISA extension
  – SW: Re-orders instrs based on dependency
  – HW: Sends a group of instrs to FIFO issue queues
• Shortcomings
  – Re-ordering limited to basic blocks
• My way
  – No SW reliance
  – Exploit dynamic dependency

Vs. Instruction Level Distributed Processing (ILDP) [Kim02]

• Their way
  – New ISA or binary translation
  – SW: Identifies instr dependency
  – HW: Sends a group of instrs to FIFO issue queues
• Shortcomings
  – Lose binary compatibility
• My way
  – No SW reliance
  – Exploit dynamic dependency

Vs. Multiscalar

• Similarity
  – ILP extraction from sequential threads
  – Parallel execution resources
• Differences
  – Divide by data dependency, not control dependency
    • Less communication b/w resources
    • No imposed resource ordering
  – Communication via multi-cast rather than ring
  – Higher resource utilization
  – No load balancing issue

Approach

• Simple building blocks for low power
  – In-order EUs
• Sea of resources
  – Power-Performance tradeoff through resource allocation
• Distributed buffering for:
  – Latency tolerance
  – Coarse-grain out-of-order (OoO)