
ICS’02 1

Low-Complexity Reorder Buffer Architecture*

*supported in part by DARPA through the PAC-C program and NSF

Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose

Department of Computer Science

State University of New York

Binghamton, NY 13902-6000

http://www.cs.binghamton.edu/~lowpower

16th Annual ACM International Conference on Supercomputing (ICS’02), June 24th 2002


Outline

ROB complexities

Motivation for the low-complexity ROB

Low-complexity ROB design

Results

Concluding remarks


What This Work is All About

Complex, richly-ported ROBs are common in modern superscalar datapaths

The number of ports grows further when results are held within ROB slots (example: Pentium III)

ROB complexity reduction is important for reducing power and improving performance

ROB dissipates a non-trivial fraction of the total chip power

ROB accesses stretch over several cycles

Goal of this work: Reduce the complexity and power dissipation of the ROB without sacrificing performance

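Pentium III-like Superscalar Datapath

[Figure: datapath with Fetch (F1, F2) and Decode/Dispatch (D1, D2) stages, the issue queue (IQ), function units FU1 to FUm (EX), the load/store queue (LSQ) and D-cache, the ROB, and the Architectural Register File (ARF). Labeled paths: instruction dispatch, instruction issue, and the result/status forwarding buses.]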

ROB Port Requirements for a W-way CPU

Writeback: W write ports to write results

Dispatch/Issue: 2W read ports to read the source operands

Decode/Dispatch: W write ports to set up entries

Commit: W read ports for instruction commitment

ROB Port Requirements for a W-way CPU

Writeback: W write ports to write results

Decode/Dispatch: 1 W-wide write port to set up entries

Commit: 1 W-wide read port for instruction commitment
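The per-stage port counts listed for a W-way CPU can be tallied with a few lines of Python. This is only an illustrative sketch: the function name `rob_port_counts` and the dictionary keys are assumptions made for this example, and it counts the narrow-port variant (W write ports at dispatch, W read ports at commit).

```python
def rob_port_counts(w: int) -> dict:
    """Tally ROB ports for a W-way CPU from the per-stage counts on the slide."""
    ports = {
        "writeback_write": w,     # W write ports to write results
        "source_read": 2 * w,     # 2W read ports to read the source operands
        "dispatch_write": w,      # W write ports to set up entries
        "commit_read": w,         # W read ports for instruction commitment
    }
    ports["total"] = sum(ports.values())
    return ports


counts = rob_port_counts(4)
print(counts["total"], counts["source_read"])  # 20 8
```

For a 4-way machine this gives 20 ports in total, 8 of which exist only to read source operands; those are the ports the rest of the talk sets out to remove.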

Where are the Source Values Coming From?

[Figure: the superscalar datapath with three source-operand paths marked 1, 2 and 3.]

Where are the Source Values Coming From?

[Figure: chart (0% to 100%) of source operand origin per benchmark: Forwarding, ARF, ROB. 96-entry ROB, 4-way processor, SPEC2K benchmarks.]

Forwarding: 62%, ARF: 32%, ROB: 6%

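How Efficiently are the Ports Used?

Writeback: W write ports to write results

Decode/Dispatch: W write ports to set up entries

Commit: W read ports for instruction commitment

[Figure: ROB port diagram; the annotation 6% matches the share of source operands read from the ROB on the previous slide.]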


Approaches to Reducing ROB Complexity

Reduce the number of read ports for reading out the source operand values

More radical (and better): Completely eliminate the read ports for reading source operand values!

Reducing the Number of Read Ports

[Figure: bar chart of Performance Drop % per benchmark for 1 read port and 2 read ports. Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating-point benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]

Average IPC Drop: 3.5% (1 read port), 1.0% (2 read ports)


Problems with Retaining Fewer Source Read Ports on the ROB

Need arbitration for the small number of ports

Additional logic needed to block the instructions which could not get the port.

Need a switching network to route the operands to correct destinations

Multi-cycle access still remains in the critical path of Dispatch/Issue logic


Our Solution: Elimination of Read Ports

[Figure, three animation steps: the datapath with source-operand paths 1, 2 and 3; in the last step only paths 1 and 3 remain.]


Comparison of ROB Bitcells (0.18µ, TSMC)

Layout of a 32-ported SRAM bitcell

Layout of a 16-ported SRAM bitcell

Area Reduction – 71%

Shorter bit and wordlines

[Figure: the datapath, annotated with the ROB area reduction.]

Area Reduction – 45%


Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation

Power is reduced because of:

shorter bitlines and wordlines

lower capacitive loading

fewer decoders

fewer drivers and sense amps


Completely Eliminating the Source Read Ports on the ROB

The Problem: Issue of instructions that require a value stored in the ROB will stall

Solutions:

Forward the value to the waiting instruction at the time of committing the value: LATE FORWARDING


Late Forwarding: Use the Normal Forwarding Buses!

[Figure, two animation steps: the datapath with the ROB connected to the result/status forwarding buses.]


Optimizing Late Forwarding

PROBLEM: If Late Forwarding is done for every result that is committed, additional forwarding buses are needed in order not to degrade the performance

SOLUTION: Selective Late Forwarding (SLF)

SLF requires an additional bit in the ROB

That bit is set by the dispatched instructions that require Late Forwarding

No additional forwarding buses are needed, since SLF traffic is very small
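The SLF bit can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration (the `RobEntry` fields, the `dispatch_consumer` and `commit` helpers, and the 16-entry ROB); it only models the rule stated above: a dispatched instruction that needs a value held in the ROB sets the bit, and commit forwards the value only when the bit is set.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RobEntry:
    value: Optional[int] = None  # result written at writeback
    needs_slf: bool = False      # the additional SLF bit


def dispatch_consumer(rob: List[RobEntry], producer_slot: int) -> None:
    """A consumer that cannot read its operand from the ROB marks the producer."""
    rob[producer_slot].needs_slf = True


def commit(rob: List[RobEntry], slot: int, arf: Dict[int, int],
           arch_reg: int) -> Optional[Tuple[int, int]]:
    """Commit one result to the ARF; return (slot, value) if it must be late-forwarded."""
    entry = rob[slot]
    arf[arch_reg] = entry.value
    forwarded = (slot, entry.value) if entry.needs_slf else None
    rob[slot] = RobEntry()  # free the slot
    return forwarded


rob = [RobEntry() for _ in range(16)]
arf: Dict[int, int] = {}
rob[12].value = 7
rob[3].value = 9
dispatch_consumer(rob, 12)        # some instruction is waiting on slot 12
print(commit(rob, 12, arf, 2))    # (12, 7): driven onto a forwarding bus
print(commit(rob, 3, arf, 4))     # None: nobody waits, no bus traffic
```

Because only marked entries generate bus traffic at commit, the sketch reflects why no extra forwarding buses are needed when few results require late forwarding.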

[Figure: the datapath with the Selective Late Forwarding path from the ROB to the forwarding buses.]

Only 3.5% of the traffic is from SELECTIVE LATE FORWARDING

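Performance Drop of Simplified ROB

[Figure: bar chart of Performance Drop % per benchmark for three configurations: no ROB read ports with SLF, 1 read port, 2 read ports. Two bars are annotated 37% and 17%.]

Average IPC Drop: 9.6% (no ROB read ports with SLF), 3.5% (1 read port), 1.0% (2 read ports)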

IPC Penalty: Source Value Not Accessible within the ROB

[Figure: Lifetime of a Result Value along a time axis: Result Generation, Forwarding, Value within ROB, Late Forwarding/Commitment, Value within ARF.]


Improving IPC with No Read Ports

Cache recently generated values in a set of RETENTION LATCHES (RL)

Retention Latches are SMALL and FAST

Only 8 to 16 latches needed in the set

Entire set has 1 or 2 read ports

Datapath with the Retention Latches

[Figure, two animation steps: the datapath, then the same datapath with the RETENTION LATCHES block added.]


The Structure of the Retention Latch Set

L ROB slot addresses (L = 1 or 2)

L-ported CAM field (key = ROB_slot_id)

W write ports for writing up to W results in parallel

Status

L recently-written results (L=1 or 2 works great)

Result Values

8 or 16 latches
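A software analogy for the retention latch set is a small associative store keyed by ROB slot id. The sketch below is a hedged illustration, not the hardware design: the class name `RetentionLatches`, its methods, and the FIFO replacement shown here are assumptions for the example, and it does not model the port limits (L lookups and W writes per cycle).

```python
from collections import OrderedDict
from typing import Optional


class RetentionLatches:
    """Small associative store of recent results, keyed by ROB slot id (FIFO)."""

    def __init__(self, size: int = 8) -> None:
        self.size = size
        self.entries: "OrderedDict[int, int]" = OrderedDict()  # oldest first

    def write(self, rob_slot_id: int, value: int) -> None:
        """Record a newly written result, evicting the oldest entry if full."""
        if rob_slot_id in self.entries:
            del self.entries[rob_slot_id]        # keep keys unique
        elif len(self.entries) >= self.size:
            self.entries.popitem(last=False)     # FIFO eviction
        self.entries[rob_slot_id] = value

    def lookup(self, rob_slot_id: int) -> Optional[int]:
        """CAM-style search by ROB slot id; None signals a miss."""
        return self.entries.get(rob_slot_id)


rl = RetentionLatches(size=2)
rl.write(12, 7)
rl.write(13, 8)
rl.write(14, 9)          # evicts the entry for slot 12
print(rl.lookup(12))     # None
print(rl.lookup(14))     # 9
```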


Retention Latch Management Strategies

FIFO

8 entry RL: 42% hit rate


LRU



Random Replacement

Worse performance than FIFO

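Hit Ratios to Retention Latches

[Figure: bar chart of hit ratios (0 to 100) per benchmark for four configurations, each labeled by policy, number of latches, and number of read ports.]

Average Hit Ratio: 42% (FIFO, 8 latches, 2 ports), 55% (FIFO, 16 latches, 2 ports), 56% (LRU, 8 latches, 2 ports), 62% (LRU, 16 latches, 2 ports)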


Accessing Retention Latch Entries

ROB index is used as a unique key in the Retention Latches to search the result values

Need to maintain unique keys even when we have:

Reuse of a ROB slot:

Not a problem for FIFO

For LRU, simply flush the RL entry at commit time

Branch mispredictions
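The LRU case can be sketched as follows. This is an illustrative model only: the class `LruRetentionLatches` and the method `flush_on_commit` are names invented for this example. It shows why the flush matters under LRU: a frequently read entry can stay resident, so its key must be removed when the ROB slot is released at commit.

```python
from collections import OrderedDict
from typing import Optional


class LruRetentionLatches:
    """Retention latches with LRU replacement, keyed by ROB slot id."""

    def __init__(self, size: int = 8) -> None:
        self.size = size
        self.entries: "OrderedDict[int, int]" = OrderedDict()  # least recent first

    def write(self, rob_slot_id: int, value: int) -> None:
        self.entries.pop(rob_slot_id, None)
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)     # evict least recently used
        self.entries[rob_slot_id] = value

    def lookup(self, rob_slot_id: int) -> Optional[int]:
        if rob_slot_id not in self.entries:
            return None
        self.entries.move_to_end(rob_slot_id)    # a hit refreshes recency
        return self.entries[rob_slot_id]

    def flush_on_commit(self, rob_slot_id: int) -> None:
        """Drop the entry when its ROB slot commits, so the key stays unique."""
        self.entries.pop(rob_slot_id, None)


rl = LruRetentionLatches(size=2)
rl.write(12, 7)
rl.write(13, 8)
rl.lookup(12)            # slot 12 becomes most recently used
rl.write(14, 9)          # evicts slot 13, not slot 12
print(rl.lookup(13))     # None
rl.flush_on_commit(12)   # slot 12 commits and may now be reused
print(rl.lookup(12))     # None
```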


Handling Branch Mispredictions

Selective RL Flushing: Retention latch entries that are in the mispredicted path are flushed

Uses branch tags

Complicated implementation

Complete RL Flushing: All retention latch entries are flushed

Very simple implementation

Performance drop is only 1.5% compared to selective flushing
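The difference between the two flushing policies can be shown with a small Python sketch. The data layout is an assumption for illustration: the latches are a plain dictionary from ROB slot id to value, and `branch_tag_of` is a hypothetical side table recording which branch each entry depends on; the slides do not specify how branch tags are stored.

```python
from typing import Dict, Set


def complete_flush(latches: Dict[int, int]) -> None:
    """Complete RL flushing: drop every entry on a misprediction."""
    latches.clear()


def selective_flush(latches: Dict[int, int], branch_tag_of: Dict[int, int],
                    squashed_tags: Set[int]) -> None:
    """Selective RL flushing: drop only entries from the mispredicted path."""
    for slot in list(latches):
        if branch_tag_of.get(slot) in squashed_tags:
            del latches[slot]


latches = {12: 7, 13: 8, 14: 9}        # rob_slot_id -> value
branch_tag_of = {12: 0, 13: 1, 14: 1}  # rob_slot_id -> controlling branch tag
selective_flush(latches, branch_tag_of, squashed_tags={1})
print(latches)  # {12: 7}
complete_flush(latches)
print(latches)  # {}
```

Complete flushing needs no per-entry tag comparison at all, which is the simplicity the slide trades against the 1.5% performance drop.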

Misprediction Handling: Performance

[Figure: bar chart of IPC (0 to 3.5) per benchmark for selective flushing and complete flushing. Benchmarks: bzip, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, applu, apsi, art, equake, mesa, mgrid, swim, wupwise, plus Int., FP and overall averages.]

Average IPC Drop: 1.5%


Scenario 1: Traditional Design

Instruction: ADD R1, R2, R3

Simplified IDB entry #1: Instruction = ADD, ROB index = 5, Src1 arch. = 2, Src2 arch. = 3; Src1 valid/value and Src2 valid/value start out unknown (?)

[Figure, animation over several slides; the recoverable steps follow.]

Rename Table lookup (columns: Arch., ROB#/Phys., ROB=0/ARF=1): register 2 maps to 12 with location bit 0 (ROB); register 3 maps to 3 with location bit 1 (ARF)

Src1, result already in the ROB: ROB entry 12 has Phys.valid = 1 and Phys.value = 7, so Src1 valid = 1 and Src1 value = 7

Src1, result not yet in the ROB: ROB entry 12 has Phys.valid = 0, so Src1 valid = 0 and Src1 value stays unknown

Src2: ARF entry 3 holds 43, so Src2 valid = 1 and Src2 value = 43

Scenario 2: Simplified ROB with RLs

IDB entry: Instruction = ADD, ROB index = 5, Src1 arch. = 2, Src2 arch. = 3; Src1 valid/value and Src2 valid/value start out unknown (?)

[Figure, animation over several slides; the recoverable steps follow.]

Rename Table lookup (columns: Arch., ROB#/Phys., ROB=0/ARF=1): register 2 maps to 12 with location bit 0 (ROB); register 3 maps to 3 with location bit 1 (ARF)

Src1, Retention Latch hit: the latch keyed by ROB# 12 holds 7, so Src1 valid = 1 and Src1 value = 7

Src1, Retention Latch MISS: Src1 valid = 0; the Phys.valid and Phys.value fields of ROB entry 12 are don't-care (X), and the SLF bit of that entry changes from 0 to 1

Src2: ARF entry 3 holds 43, so Src2 valid = 1 and Src2 value = 43
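The Scenario 2 walkthrough (simplified ROB with retention latches) can be condensed into one lookup routine. This is a sketch under stated assumptions: the function `read_source`, the tuple layout of the rename table, and the `slf_bits` dictionary are invented for illustration, and the sketch sets the SLF bit on every retention latch miss, which is how the walkthrough draws it.

```python
from typing import Dict, Optional, Tuple


def read_source(arch_reg: int,
                rename_table: Dict[int, Tuple[int, bool]],
                arf: Dict[int, int],
                retention_latches: Dict[int, int],
                slf_bits: Dict[int, int]) -> Tuple[bool, Optional[int]]:
    """Return (valid, value) for one source operand at dispatch."""
    index, in_arf = rename_table[arch_reg]
    if in_arf:
        return True, arf[index]                  # committed value: read the ARF
    if index in retention_latches:
        return True, retention_latches[index]    # RL hit, keyed by ROB index
    slf_bits[index] = 1                          # RL miss: request late forwarding
    return False, None


# Values taken from the walkthrough: R2 -> ROB slot 12, R3 -> ARF entry 3.
rename_table = {2: (12, False), 3: (3, True)}
arf = {3: 43}
slf_bits: Dict[int, int] = {}

print(read_source(2, rename_table, arf, {12: 7}, slf_bits))  # (True, 7)
print(read_source(3, rename_table, arf, {12: 7}, slf_bits))  # (True, 43)
print(read_source(2, rename_table, arf, {}, slf_bits))       # (False, None)
print(slf_bits)                                              # {12: 1}
```

Note that the ROB itself is never read for a source value in this routine, which is the point of eliminating its source read ports.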

Experimental Setup: the AccuPower (DATE’02)

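[Figure: AccuPower tool flow. Compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar), which produces performance stats plus transition counts and context information. VLSI layout data yields a SPICE deck; SPICE produces measures of energy per transition. The energy/power estimator combines the transition counts with the SPICE measures to produce power/energy stats.]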


Configuration of the Simulated System

Machine width: 4-way

Issue Queue: 32 entries

Reorder Buffer: 96 entries

Load/Store Queue: 32 entries

Simulated the execution of SPEC2000 benchmarks


Assumed Timings

Timing of the baseline model (stages D1, D2, D3):

D1: Rename Table lookup for ROB index

D2, D3: Source operand read from the ROB

Timing of the simplified ROB (stages D1, D2):

D1: Rename Table lookup for ROB index

D2: Associative lookup of operand from retention latches using ROB index as a key (smaller delay: few latches)

Experimental Results: Effect on Performance

[Figure: bar chart of Performance Drop % per benchmark for four retention latch configurations; negative values are IPC gains.]

Avg. IPC Drop: 0.1% (8 2-ported FIFO), -1.6% (8 2-ported LRU), -1.0% (16 2-ported FIFO), -2.3% (16 2-ported LRU)

Experimental Results: Effect on Performance

[Figure: bar chart of Performance Drop % per benchmark for four retention latch configurations.]

Avg. IPC Drop: 3.3% (8 2-ported FIFO), 1.7% (8 2-ported LRU), 2.3% (16 2-ported FIFO), 1.0% (16 2-ported LRU)

Experimental Results: Effect on Power

[Figure: bar chart of Power Savings % per benchmark for five configurations.]

Avg. Savings: 30% (No ROB ports), 23.4% (8 2-ported FIFO), 22.2% (8 2-ported LRU), 21% (16 2-ported FIFO), 20.2% (16 2-ported LRU)


Summary of Results

Significantly reduced ROB complexity and power dissipation

45% area reduction

20% to 30% power reduction across SPEC 2000 benchmarks

Actual IPC improvements:

1.6% to 2.3% gain across SPEC benchmarks

IPC gains come from 1 cycle access to RL (vs. 2 cycles that would be needed for ROB access)


Related Work

Value-Aging Buffer (Hu & Martonosi, PACS 2000)

Forwarding Buffer and Clustered Register Cache (Borch et al., HPCA’02)

Multiple Register Banks (Cruz et al., ISCA’00 & Balasubramonian et al., MICRO’01)

See paper for discussions


Conclusions

Typical source operand location statistics can be successfully exploited to reduce ROB complexity

Significant reduction in ROB area and power – no ROB ports needed for reading source operands

IPC gains are possible because of the use of a small sized, low-ported Retention Latch to supply cached operand values in a single cycle
