
ISLPED’03 1

Reducing Reorder Buffer Complexity Through Selective Operand Caching

*supported in part by DARPA through the PAC-C program and NSF

Gurhan Kucuk, Dmitry Ponomarev, Oguz Ergin, Kanad Ghose

Department of Computer Science

State University of New York

Binghamton, NY 13902-6000

http://www.cs.binghamton.edu/~lowpower

International Symposium on Low Power Electronics and Design (ISLPED’03), August 26th 2003


Outline

Reorder Buffer (ROB) complexities

Motivation for the low-complexity ROB

Low-complexity ROB (ICS’02)

Improving the design using short-lived values

Results

Concluding remarks


P6 Style Superscalar Datapath

[Figure: P6-style datapath. Labeled elements: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF), result/status forwarding buses.]


ROB Port Requirements for a W-way CPU

Writeback: W write ports to write results

Dispatch/Issue: 2W read ports to read the source operands

Decode/Dispatch: W write ports to set up entries

Commit: W read ports for instruction commitment
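The port breakdown above is simple arithmetic, and a short sketch makes the share taken by the source-operand read ports explicit. The function name `rob_port_counts` and the dictionary keys are illustrative, not taken from the talk; the counts themselves follow the four items listed on this slide.

```python
def rob_port_counts(width):
    """Return the ROB port counts for a W-way machine, per the slide's breakdown."""
    ports = {
        "writeback_write": width,          # W write ports to write results
        "source_operand_read": 2 * width,  # 2W read ports for source operands
        "dispatch_setup_write": width,     # W write ports to set up entries
        "commit_read": width,              # W read ports for commitment
    }
    ports["total"] = sum(ports.values())
    return ports


counts = rob_port_counts(4)
remaining = counts["total"] - counts["source_operand_read"]
print(counts["total"], remaining)
```

Under this breakdown the source-operand read ports account for 2W of the 5W ports, which is why they are the natural target for removal.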


Where are the Source Values Coming From?

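[Figure: the same P6-style datapath with three numbered source-value paths (1, 2, 3). Labeled elements: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF), result/status forwarding buses.]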


Where are the Source Values Coming From?

[Chart: breakdown of where source operand values come from (0% to 100%). 96-entry ROB, 4-way processor, SPEC2K benchmarks.]

Forwarding: 62%

ARF: 32%

ROB: 6%


How Efficiently are the Ports Used?

Writeback: W write ports to write results

Dispatch/Issue: 2W read ports to read the source operands, yet only 6% of source values come from the ROB

Decode/Dispatch: W write ports to set up entries

Commit: W read ports for instruction commitment


Our Solution: Elimination of Read Ports

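[Figure: P6-style datapath with the three numbered source-value paths (1, 2, 3). Labeled elements: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF), result/status forwarding buses.]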


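Our Solution: Elimination of Read Ports

[Figure: the same datapath after the change. Of the numbered source-value paths, only 1 and 3 remain; path 2 is no longer shown. Labeled elements: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF), result/status forwarding buses.]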


Comparison of ROB Bitcells (0.18µ, TSMC)

Layout of a 32-ported SRAM bitcell

Layout of a 16-ported SRAM bitcell

Area Reduction – 71%

Shorter bit and wordlines


Completely Eliminating the Source Read Ports on the ROB

The Problem: Issue of instructions that require a value stored in the ROB will stall

Solutions:

Forward the value to the waiting instruction at the time of committing the value: LATE FORWARDING
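Late forwarding can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class `IQEntry`, the functions `broadcast` and `commit`, and the tuple layout of a ROB entry are all hypothetical names introduced here, and timing, bus arbitration, and bus count are ignored. It shows one idea only: a consumer that missed the original forwarding and cannot read the ROB picks the value up when the producer commits.

```python
from dataclasses import dataclass, field


@dataclass
class IQEntry:
    """An issue-queue entry waiting for source operands, identified by tag."""
    name: str
    waiting_on: set                      # tags of operands still missing
    operands: dict = field(default_factory=dict)

    def ready(self):
        return not self.waiting_on


def broadcast(issue_queue, tag, value):
    """Drive (tag, value) on a forwarding bus; matching entries capture it."""
    for entry in issue_queue:
        if tag in entry.waiting_on:
            entry.waiting_on.discard(tag)
            entry.operands[tag] = value


def commit(rob_entry, arf, issue_queue):
    """Commit one ROB entry: update the ARF and forward the value late."""
    tag, arch_reg, value = rob_entry
    arf[arch_reg] = value
    broadcast(issue_queue, tag, value)   # late forwarding on the normal buses


# A consumer dispatched after the producer's result was already forwarded.
issue_queue = [IQEntry("SUB", waiting_on={31})]
arf = {}
stalled_before = not issue_queue[0].ready()
commit((31, "R1", 7), arf, issue_queue)
print(stalled_before, issue_queue[0].ready(), issue_queue[0].operands)
```

The consumer stays stalled until the commit, which is the performance cost that the retention latches introduced later are meant to reduce.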


Late Forwarding: Use the Normal Forwarding Buses!

[Figure: datapath diagram with the result/status forwarding buses called out. Labeled elements: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF).]


Improving Performance

Cache recently generated values in a set of RETENTION LATCHES (RL)

Retention Latches are SMALL and FAST

Only 8 to 16 latches needed in the set

Entire set has 1 or 2 read ports


Datapath with the Retention Latches

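[Figure: datapath diagram with a RETENTION LATCHES block added alongside the ROB. Labeled elements: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), D-cache, LSQ, ROB, Architectural Register File (ARF), result/status forwarding buses.]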


Retention Latch Management Strategies

FIFO

8 entry RL: 42% hit rate

16 entry RL: 55% hit rate

LRU

8 entry RL: 56% hit rate

16 entry RL: 62% hit rate

Random Replacement

Worse performance than FIFO
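The difference between the FIFO and LRU policies listed above can be sketched with a tiny software model. The class name `RetentionLatches`, its methods, and the string tags are assumptions made for illustration; the real latches are a hardware structure and this model says nothing about the hit rates quoted on the slide.

```python
from collections import OrderedDict


class RetentionLatches:
    """A small tag-indexed buffer with FIFO or LRU replacement."""

    def __init__(self, size, policy="fifo"):
        if policy not in ("fifo", "lru"):
            raise ValueError("policy must be 'fifo' or 'lru'")
        self.size = size
        self.policy = policy
        self.entries = OrderedDict()     # first item is the next victim

    def write(self, tag, value):
        self.entries.pop(tag, None)      # a rewritten tag becomes the newest
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)
        self.entries[tag] = value

    def read(self, tag):
        if tag not in self.entries:
            return None                  # miss
        if self.policy == "lru":
            self.entries.move_to_end(tag)  # a read refreshes the entry
        return self.entries[tag]


def demo(policy):
    latches = RetentionLatches(size=2, policy=policy)
    latches.write("P1", 10)
    latches.write("P2", 20)
    latches.read("P1")                   # only matters under LRU
    latches.write("P3", 30)              # forces one eviction
    return latches.read("P1")


print(demo("fifo"), demo("lru"))
```

With two entries, FIFO evicts P1 because it was written first, while LRU keeps P1 because it was just read and evicts P2 instead.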


Advantages of Using Retention Latches

Reduces energy dissipation in the ROB – avoids creating a localized hot spot

Reduces associated performance losses

Reduces ROB complexity – smaller floor plan, easier validation


Improving Retention Latch Management

PROBLEM: All generated results, irrespective of whether they could be potentially read from the RLs, are written into the latches unconditionally

CONSEQUENCE: The array of RLs is not utilized efficiently and performance loss is still noticeable

SOLUTION: We identify the values that will never be read after the cycle in which they are generated and avoid writing these values into the RLs


Short-Lived Values

Our definition: a value is short-lived if the destination register is renamed by the time of the result generation

Identified one cycle before the result writeback

Original code:

    LOAD R1, R2, 100
    SUB  R5, R1, R3
    ADD  R1, R5, R4

After the RENAMER:

    LOAD P31, R2, 100
    SUB  P32, P31, R3
    ADD  P33, P32, R4


AVOID WRITING SHORT-LIVED VALUES INTO THE RETENTION LATCHES

Reasons:

Short-lived values are forwarded directly to all potential consumers in the issue queue

No instruction will ever consume a short-lived value from the retention latches

Results:

Increased RL hit ratios and better overall performance

Key Idea: Do not cache short-lived values
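The effect of not caching short-lived values can be seen in a toy trace. This is a sketch under assumptions: the function `count_hits`, the event tuples, and the tags 40 to 42 are invented for illustration, the latches are modeled as a plain FIFO, and misses are simply counted rather than handled by late forwarding.

```python
from collections import deque


def count_hits(events, size, skip_short_lived):
    """Replay write/read events against FIFO latches and count read hits."""
    latches = deque(maxlen=size)         # oldest tag falls out when full
    hits = 0
    for event in events:
        if event[0] == "write":
            _, tag, short_lived = event
            if skip_short_lived and short_lived:
                continue                 # selective caching: do not store it
            latches.append(tag)
        else:
            _, tag = event
            hits += tag in latches
    return hits


# One long-lived result followed by two short-lived ones, then a late read.
events = [
    ("write", 40, False),
    ("write", 41, True),
    ("write", 42, True),
    ("read", 40),
]
print(count_hits(events, 2, False), count_hits(events, 2, True))
```

When every result is written, the two short-lived values push the useful one out of a two-entry buffer; when they are skipped, the later read hits.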


The Good News: 80%+ of the Values are Short-Lived

[Chart: percentage of short-lived values (axis 0 to 100%). 96-entry ROB, 4-way processor.]


Identifying Short-Lived Values

Maintain the bit-vector Renamed

Set by the Renamer at the time of renaming

    Arch. Reg.   Phys. Reg.   Location (0 = ROB, 1 = ARF)
    0            0            1
    1            31           0
    2            2            1
    3            3            1
    4            4            1
    5            32           0

Original code:

    LOAD R1, R2, 100
    SUB  R5, R1, R3
    ADD  R1, R5, R4

Renamed code:

    LOAD P31, R2, 100
    SUB  P32, P31, R3
    ADD  P33, P32, R4

Renamed[31] = 1


Identifying Short-Lived Values

Maintain the bit-vector Renamed

Set by the Renamer at the time of renaming

    Arch. Reg.   Phys. Reg.   Location (0 = ROB, 1 = ARF)
    0            0            1
    1            33           0
    2            2            1
    3            3            1
    4            4            1
    5            32           0

Original code:

    LOAD R1, R2, 100
    SUB  R5, R1, R3
    ADD  R1, R5, R4

Renamed code:

    LOAD P31, R2, 100
    SUB  P32, P31, R3
    ADD  P33, P32, R4

Renamed[31] = 1


Identifying Short-Lived Values

Renamed bit is checked one cycle before writeback

Value produced by LOAD is short-lived because Renamed[31] = 1

Original code:

    LOAD R1, R2, 100
    SUB  R5, R1, R3
    ADD  R1, R5, R4

Renamed code:

    LOAD P31, R2, 100
    SUB  P32, P31, R3
    ADD  P33, P32, R4
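The three-instruction example can be replayed in a few lines to show how the Renamed bit for P31 gets set. This is a sketch, not the hardware design: the function `rename`, the tuple encoding of instructions, and the assumption that the first free slot is 31 are chosen only to match the slide's numbers, and clearing a bit when a slot is reallocated is not modeled.

```python
def rename(program, first_free=31):
    """Rename destinations to ROB slots and record which slots get superseded."""
    mapping = {}          # arch reg -> ROB slot; absent means the value is in the ARF
    renamed = set()       # slot indices whose Renamed bit is 1
    output = []
    next_slot = first_free
    for op, dest, *sources in program:
        # Sources read the current mapping before the destination is updated.
        new_sources = [f"P{mapping[s]}" if s in mapping else s for s in sources]
        if dest in mapping:
            renamed.add(mapping[dest])   # the older value of dest is now superseded
        mapping[dest] = next_slot
        output.append((op, f"P{next_slot}", *new_sources))
        next_slot += 1
    return output, renamed


program = [
    ("LOAD", "R1", "R2", "100"),
    ("SUB", "R5", "R1", "R3"),
    ("ADD", "R1", "R5", "R4"),
]
renamed_code, renamed_bits = rename(program)
for instruction in renamed_code:
    print(*instruction)
print(sorted(renamed_bits))
```

The ADD renames R1 a second time, so the slot that held the LOAD's result (31) is marked; checking that bit just before the LOAD writes back is what identifies its value as short-lived.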


Hit Ratios to Retention Latches

[Chart: hit ratios (axis 0 to 100) per benchmark for two configurations: 8 original FIFO RLs and 8 optimized FIFO RLs.]

Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg.

FP benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.

Average Hit Ratio: 46% (8 original FIFO RLs), 73% (8 optimized FIFO RLs)


Experimental Results: Effect on Performance

[Chart: IPC (axis 0 to 3) per benchmark for five configurations: Baseline, 8 original RLs, 8 optimized RLs, 4 optimized RLs, 2 optimized RLs.]

Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg.

FP benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.

Avg. IPC Drop: 1.7%, 0.5%, 1.1%


Experimental Results: Effect on ROB Power

[Chart: Energy (pJ) per benchmark for four configurations: Baseline, 8 optimized RLs, 4 optimized RLs, 2 optimized RLs.]

Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg.

FP benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.

Avg. Savings: 15.9%, 13.7%, 15.0%


Conclusions

We proposed a mechanism to further improve the performance and reduce the complexity of a processor that uses retention latches and eliminates the ROB source read ports.

The idea is to avoid caching the short-lived result values in the retention latches.

Both retention latch hit ratio and the overall performance improved.

Alternatively, fewer retention latches can be used with the same performance.


THANK YOU !

*supported in part by DARPA through the PAC-C program and NSF

LOW POWER RESEARCH GROUP Department of Computer Science

State University of New York

Binghamton, NY 13902-6000

http://www.cs.binghamton.edu/~lowpower

International Symposium on Low Power Electronics and Design (ISLPED’03), August 27th 2003