Low-Complexity Reorder Buffer Architecture*

Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose, Department of Computer Science, State University of New York, Binghamton, NY 13902-6000, http://www.cs.binghamton.edu/~lowpower. 16th Annual ACM International Conference on Supercomputing (ICS’02), June 24th 2002. *Supported in part by DARPA through the PAC-C program and NSF.

Page 1: Low-Complexity Reorder Buffer Architecture*

ICS’02 1

Low-Complexity Reorder Buffer Architecture*

*supported in part by DARPA through the PAC-C program and NSF

Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose

Department of Computer Science

State University of New York

Binghamton, NY 13902-6000

http://www.cs.binghamton.edu/~lowpower

16th Annual ACM International Conference on Supercomputing (ICS’02), June 24th 2002

Page 2: Low-Complexity Reorder Buffer Architecture*

Outline

ROB complexities

Motivation for the low-complexity ROB

Low-complexity ROB design

Results

Concluding remarks

Page 3: Low-Complexity Reorder Buffer Architecture*

What This Work is All About

Complex, richly-ported ROBs are common in modern superscalar datapaths

The number of ports grows further when results are held within ROB slots (example: Pentium III)

ROB complexity reduction is important for reducing power and improving performance

ROB dissipates a non-trivial fraction of the total chip power

ROB accesses stretch over several cycles

Goal of this work: Reduce the complexity and power dissipation of the ROB without sacrificing performance

Page 4: Low-Complexity Reorder Buffer Architecture*

Pentium III-like Superscalar Datapath

[Figure: datapath block diagram showing Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, the IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), the LSQ, the D-cache, the ROB, the Architectural Register File (ARF), and the result/status forwarding buses.]

Page 5: Low-Complexity Reorder Buffer Architecture*

ROB Port Requirements for a W-way CPU

Decode/Dispatch: W write ports to set up entries

Dispatch/Issue: 2W read ports to read the source operands

Writeback: W write ports to write results

Commit: W read ports for instruction commitment

Page 6: Low-Complexity Reorder Buffer Architecture*

ROB Port Requirements for a W-way CPU

Decode/Dispatch: 1 W-wide write port to set up entries

Dispatch/Issue: 2W read ports to read the source operands

Writeback: W write ports to write results

Commit: 1 W-wide read port for instruction commitment
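
The port budget on the two slides above can be tallied with a short sketch. The function name and dictionary keys are hypothetical, and counting a W-wide port as a single port is an assumption made only for this illustration; the slides give the per-stage requirements, not a total.

```python
def rob_port_counts(width, wide_setup_and_commit=False, source_read_ports=None):
    """Return the number of ROB ports per pipeline stage for a width-way CPU.

    wide_setup_and_commit: model setup and commit as one W-wide port each
    (assumption: a W-wide port is counted as one port).
    source_read_ports: override for the 2W source operand read ports.
    """
    if source_read_ports is None:
        source_read_ports = 2 * width  # up to two source operands per instruction
    shared = 1 if wide_setup_and_commit else width
    return {
        "decode_dispatch_write": shared,
        "dispatch_issue_read": source_read_ports,
        "writeback_write": width,
        "commit_read": shared,
    }


for label, kwargs in [
    ("W ports per stage", {}),
    ("W-wide setup/commit ports", {"wide_setup_and_commit": True}),
    ("source read ports removed", {"source_read_ports": 0}),
]:
    counts = rob_port_counts(4, **kwargs)
    print(label, counts, "total:", sum(counts.values()))
```

In either accounting, the 2W source read ports are the largest single group, which is why the rest of the talk targets them.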

Page 7: Low-Complexity Reorder Buffer Architecture*

Where are the Source Values Coming From?

[Figure: the Pentium III-like datapath block diagram (IQ, function units FU1 to FUm, LSQ, D-cache, ROB, ARF, result/status forwarding buses), with the three sources of operand values marked 1, 2, and 3.]

Page 8: Low-Complexity Reorder Buffer Architecture*

Where are the Source Values Coming From?

[Figure: bar chart, 0% to 100%, of where source values come from. 96-entry ROB, 4-way processor, SPEC2K benchmarks.]

Forwarding: 62%

ARF: 32%

ROB: 6%

Page 9: Low-Complexity Reorder Buffer Architecture*

How Efficiently are the Ports Used?

Decode/Dispatch: W write ports to set up entries

Dispatch/Issue: 2W read ports to read the source operands (annotated in the figure with 6%, the share of source values supplied by the ROB)

Writeback: W write ports to write results

Commit: W read ports for instruction commitment

Page 10: Low-Complexity Reorder Buffer Architecture*

Approaches to Reducing ROB Complexity

Reduce the number of read ports for reading out the source operand values

More radical (and better): Completely eliminate the read ports for reading source operand values!

Page 11: Low-Complexity Reorder Buffer Architecture*

Reducing the Number of Read Ports

[Figure: bar charts of performance drop (%) with 1 read port and with 2 read ports. Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. FP benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]

Average IPC drop: 3.5% with 1 read port, 1.0% with 2 read ports

Page 12: Low-Complexity Reorder Buffer Architecture*

Problems with Retaining Fewer Source Read Ports on the ROB

Need arbitration for the small number of ports

Additional logic needed to block the instructions which could not get the port.

Need a switching network to route the operands to correct destinations

Multi-cycle access still remains in the critical path of Dispatch/Issue logic

Page 13: Low-Complexity Reorder Buffer Architecture*

Our Solution: Elimination of Read Ports

[Figure: the Pentium III-like datapath block diagram (IQ, function units FU1 to FUm, LSQ, D-cache, ROB, ARF, result/status forwarding buses), with the three sources of operand values marked 1, 2, and 3.]

Page 15: Low-Complexity Reorder Buffer Architecture*

Our Solution: Elimination of Read Ports

[Figure: the same datapath block diagram after the ROB source read ports are eliminated; only the operand sources marked 1 and 3 remain.]

Page 16: Low-Complexity Reorder Buffer Architecture*

Comparison of ROB Bitcells (0.18µ, TSMC)

Layout of a 32-ported SRAM bitcell

Layout of a 16-ported SRAM bitcell

Area Reduction – 71%

Shorter bitlines and wordlines

Page 17: Low-Complexity Reorder Buffer Architecture*

Our Solution: Elimination of Read Ports

[Figure: the Pentium III-like datapath block diagram (IQ, function units FU1 to FUm, LSQ, D-cache, ROB, ARF, result/status forwarding buses), annotated with the area reduction.]

Area reduction: 45%

Page 18: Low-Complexity Reorder Buffer Architecture*

Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation

Power is reduced because of:

shorter bitlines and wordlines

lower capacitive loading

fewer decoders

fewer drivers and sense amps

Page 19: Low-Complexity Reorder Buffer Architecture*

Completely Eliminating the Source Read Ports on the ROB

The problem: an instruction that needs a value stored in the ROB can no longer read it, so its issue stalls

Solution:

Forward the value to the waiting instruction when the value is committed: LATE FORWARDING

Page 20: Low-Complexity Reorder Buffer Architecture*

Late Forwarding: Use the Normal Forwarding Buses!

[Figure: the Pentium III-like datapath block diagram (IQ, function units FU1 to FUm, LSQ, D-cache, ROB, ARF), highlighting the result/status forwarding buses.]

Page 22: Low-Complexity Reorder Buffer Architecture*

Optimizing Late Forwarding

PROBLEM: If Late Forwarding is done for every result that is committed, additional forwarding buses are needed in order not to degrade the performance

SOLUTION: Selective Late Forwarding (SLF)

SLF requires one additional bit in the ROB

That bit is set by the dispatched instructions that require Late Forwarding

No additional forwarding buses are needed, since SLF traffic is very small
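
The SLF bit described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class RobEntry, the functions request_late_forwarding and commit, and the broadcast callback standing in for a forwarding bus are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class RobEntry:
    dest_reg: int                 # architectural destination register
    value: Optional[int] = None   # result, filled in at writeback
    late_forward: bool = False    # the additional SLF bit


def request_late_forwarding(rob: Dict[int, RobEntry], slot: int) -> None:
    """Called at dispatch by an instruction that needs the value held in `slot`."""
    rob[slot].late_forward = True


def commit(rob: Dict[int, RobEntry], slot: int, arf: Dict[int, int],
           broadcast: Callable[[int, Optional[int]], None]) -> bool:
    """Commit one entry; drive the forwarding bus only if the SLF bit is set."""
    entry = rob.pop(slot)
    arf[entry.dest_reg] = entry.value
    if entry.late_forward:
        broadcast(slot, entry.value)  # reuse of a normal forwarding bus
        return True
    return False


rob = {12: RobEntry(dest_reg=2, value=7), 13: RobEntry(dest_reg=4, value=9)}
arf: Dict[int, int] = {}
sent = []
request_late_forwarding(rob, 12)
commit(rob, 12, arf, lambda slot, value: sent.append((slot, value)))
commit(rob, 13, arf, lambda slot, value: sent.append((slot, value)))
print(sent, arf)
```

Only the entry whose bit was set uses a bus at commit time, which is the property that keeps the extra forwarding traffic small.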

Page 23: Low-Complexity Reorder Buffer Architecture*

Late Forwarding: Use the Normal Forwarding Buses!

[Figure: the Pentium III-like datapath block diagram (IQ, function units FU1 to FUm, LSQ, D-cache, ROB, ARF), highlighting the result/status forwarding buses.]

Only 3.5% of the traffic is from SELECTIVE LATE FORWARDING

Page 24: Low-Complexity Reorder Buffer Architecture*

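Performance Drop of Simplified ROB

[Figure: bar charts of performance drop (%) for three configurations: no ROB read ports with SLF, 1 read port, 2 read ports. Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. FP benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Two bars carry the labels 37% and 17%.]

Average IPC drop: 9.6% with no ROB read ports and SLF, 3.5% with 1 read port, 1.0% with 2 read ports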

Page 25: Low-Complexity Reorder Buffer Architecture*

IPC Penalty: Source Value Not Accessible within the ROB

[Figure: lifetime of a result value along a time axis. Result generation (forwarding), then value within ROB, then late forwarding/commitment, then value within ARF.]

Page 26: Low-Complexity Reorder Buffer Architecture*

Improving IPC with No Read Ports

Cache recently generated values in a set of RETENTION LATCHES (RL)

Retention Latches are SMALL and FAST

Only 8 to 16 latches needed in the set

Entire set has 1 or 2 read ports

Page 28: Low-Complexity Reorder Buffer Architecture*

Datapath with the Retention Latches

[Figure: the Pentium III-like datapath block diagram (IQ, function units FU1 to FUm, LSQ, D-cache, ROB, ARF, result/status forwarding buses), with a block labeled RETENTION LATCHES added.]

Page 29: Low-Complexity Reorder Buffer Architecture*

The Structure of the Retention Latch Set

8 or 16 latches, each with Status and Result Value fields

L-ported CAM field (key = ROB_slot_id)

Inputs: L ROB slot addresses (L = 1 or 2)

Outputs: L recently-written results (L = 1 or 2 works great)

W write ports for writing up to W results in parallel

Page 30: Low-Complexity Reorder Buffer Architecture*

Retention Latch Management Strategies

FIFO

8 entry RL: 42% hit rate

16 entry RL: 55% hit rate

LRU

8 entry RL: 56% hit rate

16 entry RL: 62% hit rate

Random Replacement

Worse performance than FIFO
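
The FIFO and LRU strategies above differ only in whether a read refreshes an entry. The sketch below is a software model under stated assumptions: the class name RetentionLatches and its methods are hypothetical, the associative CAM lookup is modeled with an ordered dictionary keyed by ROB slot id, and port limits are not modeled.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small set of latches caching recent results, keyed by ROB slot id."""

    def __init__(self, capacity=8, policy="FIFO"):
        if policy not in ("FIFO", "LRU"):
            raise ValueError("policy must be 'FIFO' or 'LRU'")
        self.capacity = capacity
        self.policy = policy
        self._latches = OrderedDict()  # oldest entry first

    def write(self, rob_slot, value):
        """Store a newly produced result, evicting the oldest entry if full."""
        self._latches.pop(rob_slot, None)       # a reused slot replaces its old value
        if len(self._latches) >= self.capacity:
            self._latches.popitem(last=False)   # evict from the old end
        self._latches[rob_slot] = value

    def read(self, rob_slot):
        """Return the cached value, or None on a miss."""
        if rob_slot not in self._latches:
            return None
        if self.policy == "LRU":
            self._latches.move_to_end(rob_slot)  # a hit makes the entry youngest
        return self._latches[rob_slot]

    def flush(self, rob_slot):
        """Drop one entry, for example when its ROB slot commits."""
        self._latches.pop(rob_slot, None)


for policy in ("FIFO", "LRU"):
    rl = RetentionLatches(capacity=2, policy=policy)
    rl.write(1, 10)
    rl.write(2, 20)
    rl.read(1)        # touches slot 1; only LRU remembers this
    rl.write(3, 30)   # forces one eviction
    print(policy, "slot 1:", rl.read(1), "slot 2:", rl.read(2))
```

With two latches, the same access sequence evicts slot 1 under FIFO and slot 2 under LRU, which is the behavior behind the different hit rates quoted above.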

Page 31: Low-Complexity Reorder Buffer Architecture*

Hit Ratios to Retention Latches

[Figure: bar charts of hit ratios (0 to 100) for four configurations: FIFO 8 2, FIFO 16 2, LRU 8 2, LRU 16 2. Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. FP benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]

Average hit ratio: FIFO 8 2, 42%; FIFO 16 2, 55%; LRU 8 2, 56%; LRU 16 2, 62%

Page 32: Low-Complexity Reorder Buffer Architecture*

Accessing Retention Latch Entries

The ROB index is used as a unique key in the Retention Latches to look up result values

Keys must stay unique even when we have:

Reuse of a ROB slot: not a problem for FIFO; for LRU, simply flush the RL entry at commit time

Branch mispredictions

Page 33: Low-Complexity Reorder Buffer Architecture*

Handling Branch Mispredictions

Selective RL Flushing: Retention latch entries that are in the mispredicted path are flushed

Uses branch tags

Complicated implementation

Complete RL Flushing: All retention latch entries are flushed

Very simple implementation

Performance drop is only 1.5% compared to selective flushing
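
The two flushing policies above can be contrasted in a short sketch. The representation is an assumption made for illustration: each retention latch entry is stored as a (value, branch_tag) pair keyed by ROB slot id, and the function names are hypothetical. The slides only state that selective flushing uses branch tags.

```python
def selective_flush(latches, squashed_branch_tags):
    """Remove only entries produced on the mispredicted path."""
    doomed = [slot for slot, (_, tag) in latches.items()
              if tag in squashed_branch_tags]
    for slot in doomed:
        del latches[slot]


def complete_flush(latches):
    """Remove every entry; no per-entry tag comparison is needed."""
    latches.clear()


# ROB slot id -> (result value, tag of the branch the producer depends on)
latches = {12: (7, 0), 13: (9, 1), 14: (5, 2)}
selective_flush(latches, squashed_branch_tags={1, 2})
print(latches)   # entries tagged 1 and 2 are gone

complete_flush(latches)
print(latches)   # nothing survives
```

Complete flushing also discards values from the correct path, which then have to come from another source; the slide reports that this costs only 1.5% in performance relative to selective flushing.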

Page 34: Low-Complexity Reorder Buffer Architecture*

Misprediction Handling: Performance

[Figure: bar chart of IPC (0 to 3.5) under selective flushing and complete flushing for bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, applu, apsi, art, equake, mesa, mgrid, swim, wupwise, plus the Int., FP, and overall averages.]

Average IPC drop: 1.5%

Page 35: Low-Complexity Reorder Buffer Architecture*

Scenario 1: Traditional Design

[Slides 35 to 42 are animation frames of one example; they are summarized here.]

Instruction: ADD R1, R2, R3

Simplified IDB entry #1: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src2 reg. = 3; the Src1 and Src2 valid bits and values start out unknown (?)

Rename Table (indexed by Arch. register; columns: location bit with ROB = 0 and ARF = 1, and ROB#/Phys.): register 2 has location bit 0 and ROB# 12; register 3 has location bit 1 and Phys. 3

Src1, value ready in the ROB: ROB entry 12 has Phys. valid = 1 and Phys. value = 7, so Src1 valid = 1 and Src1 value = 7

Src1, value not ready: ROB entry 12 has Phys. valid = 0, so Src1 valid = 0 and Src1 value stays unknown

Src2: ARF entry 3 holds Arch. value 43, so Src2 valid = 1 and Src2 value = 43

Page 43: Low-Complexity Reorder Buffer Architecture*

Scenario 2: Simplified ROB with RLs

[Slides 43 to 51 are animation frames of one example; they are summarized here.]

Instruction: ADD R1, R2, R3

Simplified IDB entry #1: ROB index = 5, Instruction = ADD, Src1 reg. = 2, Src2 reg. = 3; the Src1 and Src2 valid bits and values start out unknown (?)

Rename Table (indexed by Arch. register; columns: location bit with ROB = 0 and ARF = 1, and ROB#/Phys.): register 2 has location bit 0 and ROB# 12; register 3 has location bit 1 and Phys. 3

Src1, Retention Latch hit: the Retention Latches hold Phys. value 7 under key 12, so Src1 valid = 1 and Src1 value = 7

Src1, Retention Latch MISS: no latch matches key 12, so Src1 valid = 0; Phys. valid and Phys. value of ROB entry 12 are X (don't care), and the SLF bit of that entry changes from 0 to 1

Src2: ARF entry 3 holds Arch. value 43, so Src2 valid = 1 and Src2 value = 43
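
The dispatch-time lookup walked through in Scenario 2 can be sketched as one function. This is a simplified software model, not the authors' logic: read_source and its arguments are hypothetical names, and setting the SLF bit on every retention latch miss follows the animation frames rather than a stated rule. The register numbers and values are the ones used on the slides.

```python
ROB, ARF = 0, 1  # location bit values in the rename table (ROB = 0, ARF = 1)


def read_source(arch_reg, rename_table, arf, retention_latches, slf_bits):
    """Return (valid, value) for one source operand at dispatch.

    rename_table maps an architectural register to (location bit, index).
    retention_latches maps a ROB slot id to a cached result value.
    slf_bits records which ROB slots must late-forward at commit.
    """
    location, index = rename_table[arch_reg]
    if location == ARF:
        return True, arf[index]                 # committed value, read from the ARF
    if index in retention_latches:
        return True, retention_latches[index]   # hit: value supplied by the latches
    slf_bits[index] = True                      # miss: the ROB itself is never read
    return False, None                          # operand arrives later on a bus


rename_table = {2: (ROB, 12), 3: (ARF, 3)}
arf = {3: 43}
slf_bits = {}

print(read_source(2, rename_table, arf, {12: 7}, slf_bits))  # latch hit
print(read_source(2, rename_table, arf, {}, slf_bits))       # latch miss
print(read_source(3, rename_table, arf, {}, slf_bits))       # ARF read
print(slf_bits)
```

No branch of the function touches ROB data, which is the point of the design: the ROB needs no read ports for source operands.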

Page 52: Low-Complexity Reorder Buffer Architecture*

Experimental Setup: the AccuPower (DATE’02)

[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar), which produces performance stats plus transition counts and context information. VLSI layout data yields a SPICE deck; SPICE produces SPICE measures of energy per transition. The energy/power estimator produces power/energy stats.]

Page 53: Low-Complexity Reorder Buffer Architecture*

Configuration of the Simulated System

Machine width 4-way

Issue Queue 32 entries

Reorder Buffer 96 entries

Load/Store Queue 32 entries

Simulated the execution of SPEC2000 benchmarks

Page 54: Low-Complexity Reorder Buffer Architecture*

Assumed Timings

Timing of the baseline model (D1, D2, D3): Rename Table lookup for ROB index in D1; source operand read from the ROB in D2 and D3

Timing of the simplified ROB (D1, D2): Rename Table lookup for ROB index in D1; associative lookup of operand from retention latches, using the ROB index as a key, in D2 (smaller delay: few latches)

Page 55: Low-Complexity Reorder Buffer Architecture*

Experimental Results: Effect on Performance

[Figure: bar charts of performance drop (%) for four retention latch configurations. FP benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg.]

Avg. IPC drop (negative values are IPC gains): 8 2-ported FIFO, 0.1%; 8 2-ported LRU, -1.6%; 16 2-ported FIFO, -1.0%; 16 2-ported LRU, -2.3%

Page 56: Low-Complexity Reorder Buffer Architecture*

Experimental Results: Effect on Performance

[Figure: bar charts of performance drop (%) for four retention latch configurations. FP benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg.]

Avg. IPC drop: 8 2-ported FIFO, 3.3%; 8 2-ported LRU, 1.7%; 16 2-ported FIFO, 2.3%; 16 2-ported LRU, 1.0%

Page 57: Low-Complexity Reorder Buffer Architecture*

Experimental Results: Effect on Power

[Figure: bar charts of power savings (%) for five configurations. Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. FP benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]

Avg. savings: No ROB ports, 30%; 8 2-ported FIFO, 23.4%; 8 2-ported LRU, 22.2%; 16 2-ported FIFO, 21%; 16 2-ported LRU, 20.2%

Page 58: Low-Complexity Reorder Buffer Architecture*

Summary of Results

Significantly reduced ROB complexity and power dissipation

45% area reduction

20% to 30% power reduction across SPEC 2000 benchmarks

Actual IPC improvements:

1.6% to 2.3% gain across SPEC benchmarks

IPC gains come from 1 cycle access to RL (vs. 2 cycles that would be needed for ROB access)

Page 59: Low-Complexity Reorder Buffer Architecture*

Related Work

Value-Aging Buffer (Hu & Martonosi, PACS 2000)

Forwarding Buffer and Clustered Register Cache (Borch et al., HPCA’02)

Multiple Register Banks (Cruz et al., ISCA’00 and Balasubramonian et al., MICRO’01)

See paper for discussions

Page 60: Low-Complexity Reorder Buffer Architecture*

Conclusions

Typical source operand location statistics can be successfully exploited to reduce ROB complexity

Significant reduction in ROB area and power – no ROB ports needed for reading source operands

IPC gains are possible because of the use of a small sized, low-ported Retention Latch to supply cached operand values in a single cycle
