ISLPED’03 1
Reducing Reorder Buffer Complexity Through Selective Operand Caching
*supported in part by DARPA through the PAC-C program and NSF
Gurhan Kucuk, Dmitry Ponomarev, Oguz Ergin, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
International Symposium on Low Power Electronics and Design (ISLPED’03), August 26th 2003
ISLPED’03 2
Outline
Reorder Buffer (ROB) complexities
Motivation for the low-complexity ROB
Low-complexity ROB (ICS’02)
Improving the design using short-lived values
Results
Concluding remarks
ISLPED’03 3
P6 Style Superscalar Datapath
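[Figure: datapath block diagram showing the Fetch (F1, F2) and Decode/Dispatch (D1, D2) stages, instruction dispatch, the issue queue (IQ), instruction issue, the function units FU1, FU2, ..., FUm (EX), the ROB, the Architectural Register File (ARF), the LSQ, the D-cache, and the result/status forwarding buses]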
ISLPED’03 4
ROB Port Requirements for a W-way CPU
Decode/Dispatch: W write ports to set up entries
Dispatch/Issue: 2W read ports to read the source operands
Writeback: W write ports to write results
Commit: W read ports for instruction commitment
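As a rough illustration of the port budget on this slide, the sketch below tallies the ROB ports for a W-way machine using only the counts listed above. The function name `rob_port_counts` and the dictionary keys are hypothetical labels chosen for this example; they do not come from the talk.

```python
def rob_port_counts(width: int) -> dict:
    """Return the ROB port counts for a W-way machine, per the slide's breakdown."""
    return {
        "dispatch_setup_write": width,   # W write ports to set up entries
        "source_read": 2 * width,        # 2W read ports for source operands
        "writeback_write": width,        # W write ports to write results
        "commit_read": width,            # W read ports for commitment
    }


ports = rob_port_counts(4)               # the 4-way configuration used in the talk
total_ports = sum(ports.values())        # 5W ports in total
ports_without_source_reads = total_ports - ports["source_read"]
print(total_ports, ports_without_source_reads)
```

Under this tally the source-operand read ports account for 2W of the 5W ports, which is why they are the natural target for elimination.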
ISLPED’03 5
Where are the Source Values Coming From?
[Figure: the superscalar datapath (IQ, function units FU1 to FUm, ROB, ARF, LSQ, D-cache, result/status forwarding buses) with the three sources of operand values labeled 1, 2, and 3]
ISLPED’03 6
Where are the Source Values Coming From?
[Figure: chart (0% to 100%) of the share of source values obtained from Forwarding, the ARF, and the ROB]
96-entry ROB, 4-way processor, SPEC2K benchmarks
Forwarding: 62%, ARF: 32%, ROB: 6%
ISLPED’03 7
How Efficiently are the Ports Used?
Decode/Dispatch: W write ports to set up entries
Dispatch/Issue: 2W read ports to read the source operands (only 6% of the source values come from the ROB)
Writeback: W write ports to write results
Commit: W read ports for instruction commitment
ISLPED’03 8
Our Solution: Elimination of Read Ports
[Figure: the superscalar datapath (IQ, function units FU1 to FUm, ROB, ARF, LSQ, D-cache, result/status forwarding buses) with the operand sources labeled 1, 2, and 3]
ISLPED’03 10
Our Solution: Elimination of Read Ports
[Figure: the superscalar datapath (IQ, function units FU1 to FUm, ROB, ARF, LSQ, D-cache, result/status forwarding buses) in which only the operand sources labeled 1 and 3 remain]
ISLPED’03 11
Comparison of ROB Bitcells (0.18 µm, TSMC)
[Figure: layout of a 32-ported SRAM bitcell]
[Figure: layout of a 16-ported SRAM bitcell]
Area Reduction: 71%
Shorter bitlines and wordlines
ISLPED’03 12
Completely Eliminating the Source Read Ports on the ROB
The Problem: Issue of instructions that require a value stored in the ROB will stall
Solutions:
Forward the value to the waiting instruction at the time of committing the value: LATE FORWARDING
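A minimal sketch of late forwarding, assuming a simplified model in which an issue-queue entry waits on ROB slot tags and a commit re-broadcasts the committed value on a forwarding bus. The names `WaitingInstruction`, `broadcast`, and `commit`, as well as the value 1234, are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class WaitingInstruction:
    name: str
    pending_tags: set                              # ROB slots still awaited
    operands: dict = field(default_factory=dict)   # captured tag -> value

    def ready(self) -> bool:
        return not self.pending_tags


def broadcast(tag, value, issue_queue):
    """Drive (tag, value) on a forwarding bus; matching entries capture it."""
    for instr in issue_queue:
        if tag in instr.pending_tags:
            instr.pending_tags.discard(tag)
            instr.operands[tag] = value


def commit(rob_slot, rob_values, issue_queue):
    """Late forwarding: re-broadcast the value at the time it commits."""
    value = rob_values.pop(rob_slot)
    broadcast(rob_slot, value, issue_queue)
    return value


# SUB was dispatched after slot 31's result had already been written to the
# ROB, so it missed the original broadcast and has no ROB read port to use.
rob_values = {31: 1234}
sub = WaitingInstruction("SUB", pending_tags={31})
issue_queue = [sub]
stalled_before_commit = not sub.ready()
commit(31, rob_values, issue_queue)
```

In this model the waiting instruction stalls until the producer commits, which is the performance cost that the rest of the talk works to reduce.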
ISLPED’03 13
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the superscalar datapath (IQ, function units FU1 to FUm, ROB, ARF, LSQ, D-cache) and its result/status forwarding buses, which late forwarding reuses]
ISLPED’03 15
Improving Performance
Cache recently generated values in a set of RETENTION LATCHES (RL)
Retention Latches are SMALL and FAST
Only 8 to 16 latches needed in the set
Entire set has 1 or 2 read ports
ISLPED’03 17
Datapath with the Retention Latches
[Figure: the superscalar datapath (IQ, function units FU1 to FUm, ROB, ARF, LSQ, D-cache, result/status forwarding buses) with the RETENTION LATCHES added]
ISLPED’03 18
Retention Latch Management Strategies
FIFO
  8-entry RL: 42% hit rate
  16-entry RL: 55% hit rate
LRU
  8-entry RL: 56% hit rate
  16-entry RL: 62% hit rate
Random Replacement
  Worse performance than FIFO
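The difference between the FIFO and LRU strategies can be sketched with a small software model. This is an illustrative approximation, not the hardware design from the talk: the class `RetentionLatches`, its methods, and the tiny two-entry trace are all assumptions made for the example.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small set of latches keyed by ROB slot tag, oldest entry first."""

    def __init__(self, size: int, policy: str = "FIFO"):
        if policy not in ("FIFO", "LRU"):
            raise ValueError(f"unknown policy: {policy}")
        self.size = size
        self.policy = policy
        self.entries = OrderedDict()

    def write(self, tag, value):
        if tag in self.entries:
            del self.entries[tag]              # slot reused: drop stale value
        elif len(self.entries) >= self.size:
            self.entries.popitem(last=False)   # evict the entry at the front
        self.entries[tag] = value

    def read(self, tag):
        if tag not in self.entries:
            return None                        # miss
        if self.policy == "LRU":
            self.entries.move_to_end(tag)      # a read refreshes the entry
        return self.entries[tag]


def run(policy):
    rl = RetentionLatches(size=2, policy=policy)
    rl.write(10, "a")
    rl.write(11, "b")
    rl.read(10)            # under LRU this protects tag 10 from eviction
    rl.write(12, "c")      # FIFO evicts tag 10, LRU evicts tag 11
    return rl.read(10)


fifo_result = run("FIFO")
lru_result = run("LRU")
```

On this trace the FIFO set misses on the second read of tag 10 while the LRU set hits, which mirrors the direction of the hit-rate gap reported on the slide.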
ISLPED’03 19
Advantages of Using Retention Latches
Reduces energy dissipation in the ROB – avoids creating a localized hot spot
Reduces associated performance losses
Reduces ROB complexity – smaller floor plan, easier validation
ISLPED’03 20
Improving Retention Latch Management
PROBLEM: Every generated result is written into the latches unconditionally, whether or not it could ever be read from the RLs
CONSEQUENCE: The array of RLs is not utilized efficiently and performance loss is still noticeable
SOLUTION: We identify the values that will never be read after the cycle in which they are generated and avoid writing them into the RLs
ISLPED’03 21
Short-Lived Values
Our definition: a value is short-lived if its destination register has already been renamed again by the time the result is generated
Identified one cycle before the result writeback
Original code:
  LOAD R1, R2, 100
  SUB R5, R1, R3
  ADD R1, R5, R4
After the RENAMER:
  LOAD P31, R2, 100
  SUB P32, P31, R3
  ADD P33, P32, R4
ISLPED’03 22
AVOID WRITING SHORT-LIVED VALUES INTO THE RETENTION LATCHES
Reasons:
Short-lived values are forwarded directly to all potential consumers in the issue queue
No instruction will ever consume a short-lived value from the retention latches
Results:
Increased RL hit ratios and better overall performance
Key Idea: Do not cache short-lived values
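To see why skipping short-lived values can raise the hit ratio, the sketch below replays a toy event trace against a FIFO set of latches, once writing every result and once skipping the short-lived ones. The function `simulate` and the trace are invented for illustration; the resulting ratios say nothing about the measured results in the talk.

```python
from collections import deque


def simulate(events, num_latches, skip_short_lived):
    """Return the latch hit ratio for a trace of writeback and read events."""
    latches = deque(maxlen=num_latches)   # FIFO of tags; oldest drops when full
    hits = lookups = 0
    for kind, tag, short_lived in events:
        if kind == "wb":
            if skip_short_lived and short_lived:
                continue                  # consumers already got it by forwarding
            latches.append(tag)
        else:                             # "rd": operand lookup in the latches
            lookups += 1
            hits += tag in latches
    return hits / lookups


# (kind, ROB slot tag, short-lived flag); the flag is ignored for reads.
# Short-lived values (tags 2, 3, 5) are never read from the latches.
events = [
    ("wb", 1, False), ("wb", 2, True), ("wb", 3, True),
    ("rd", 1, False),
    ("wb", 4, False), ("wb", 5, True),
    ("rd", 4, False), ("rd", 1, False),
]

unconditional = simulate(events, num_latches=2, skip_short_lived=False)
selective = simulate(events, num_latches=2, skip_short_lived=True)
```

With unconditional writes, the short-lived values push the useful ones out of the two latches; with selective writes, the useful values stay resident for all three lookups.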
ISLPED’03 23
The Good News: 80%+ of the Values are Short-Lived
[Figure: percentage of result values that are short-lived (y-axis 0 to 100%); 96-entry ROB, 4-way processor]
ISLPED’03 24
Identifying Short-Lived Values
Maintain the bit-vector Renamed
Set by the Renamer at the time of renaming
Rename table:
  Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
  0           0            1
  1           31           0
  2           2            1
  3           3            1
  4           4            1
  5           32           0
Original code:
  LOAD R1, R2, 100
  SUB R5, R1, R3
  ADD R1, R5, R4
Renamed code:
  LOAD P31, R2, 100
  SUB P32, P31, R3
  ADD P33, P32, R4
Renamed[31] = 1
ISLPED’03 25
Identifying Short-Lived Values
Maintain the bit-vector Renamed
Set by the Renamer at the time of renaming
Rename table:
  Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
  0           0            1
  1           33           0
  2           2            1
  3           3            1
  4           4            1
  5           32           0
Original code:
  LOAD R1, R2, 100
  SUB R5, R1, R3
  ADD R1, R5, R4
Renamed code:
  LOAD P31, R2, 100
  SUB P32, P31, R3
  ADD P33, P32, R4
Renamed[31] = 1
ISLPED’03 26
Identifying Short-Lived Values
Renamed bit is checked one cycle before writeback
Value produced by LOAD is short-lived because Renamed[31] = 1
Original code:
  LOAD R1, R2, 100
  SUB R5, R1, R3
  ADD R1, R5, R4
Renamed code:
  LOAD P31, R2, 100
  SUB P32, P31, R3
  ADD P33, P32, R4
Renamed[31] = 1
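The bookkeeping on these slides can be approximated in software as follows. This is a sketch under simplifying assumptions: physical registers are allocated sequentially without wraparound, commit is not modeled, and the class `Renamer` with its method names is hypothetical rather than taken from the talk.

```python
class Renamer:
    """Rename table plus the Renamed bit-vector, indexed by physical register."""

    def __init__(self, num_arch_regs, first_free_slot, num_slots):
        # arch reg -> (physical id, in_arf); initially every value is in the ARF
        self.table = {r: (r, True) for r in range(num_arch_regs)}
        self.renamed = [False] * num_slots
        self.next_slot = first_free_slot

    def rename_dest(self, arch_reg):
        old_phys, old_in_arf = self.table[arch_reg]
        if not old_in_arf:
            # The previous in-flight mapping has been superseded
            self.renamed[old_phys] = True
        slot = self.next_slot
        self.next_slot += 1               # simplification: no wraparound
        self.renamed[slot] = False
        self.table[arch_reg] = (slot, False)
        return slot

    def is_short_lived(self, slot):
        """Check made one cycle before the result in `slot` is written back."""
        return self.renamed[slot]


renamer = Renamer(num_arch_regs=8, first_free_slot=31, num_slots=64)
p_load = renamer.rename_dest(1)   # LOAD R1, R2, 100
p_sub = renamer.rename_dest(5)    # SUB  R5, R1, R3
p_add = renamer.rename_dest(1)    # ADD  R1, R5, R4 renames R1 again
load_is_short_lived = renamer.is_short_lived(p_load)
```

Replaying the three-instruction example reproduces the state on the slides: R1 maps to P33, R5 maps to P32, and the bit for P31 is set, so the LOAD result would not be written into the retention latches.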
ISLPED’03 27
Hit Ratios to Retention Latches
[Figure: hit ratios (y-axis 0 to 100) for 8 original FIFO RLs and 8 optimized FIFO RLs. Integer panel: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating-point panel: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Average Hit Ratio: 46% (8 original FIFO RLs), 73% (8 optimized FIFO RLs)
ISLPED’03 28
Experimental Results: Effect on Performance
[Figure: IPC (y-axis 0 to 3) for Baseline, 8 original RLs, 8 optimized RLs, 4 optimized RLs, and 2 optimized RLs. Floating-point panel: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Integer panel: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg.]
Avg. IPC Drop: 1.7%, 0.5%, 1.1%
ISLPED’03 29
Experimental Results: Effect on ROB Power
[Figure: energy in pJ for Baseline, 8 optimized RLs, 4 optimized RLs, and 2 optimized RLs. Floating-point panel: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Integer panel: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg.]
Avg. Savings: 15.9%, 13.7%, 15.0%
ISLPED’03 30
Conclusions
We proposed a mechanism to further improve the performance and reduce the complexity of a processor that uses retention latches and eliminates the ROB source read ports
The idea is to avoid caching the short-lived result values in the retention latches
Both retention latch hit ratio and the overall performance improved
Alternatively, fewer retention latches can be used with the same performance
ISLPED’03 31
THANK YOU !
*supported in part by DARPA through the PAC-C program and NSF
LOW POWER RESEARCH GROUP
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
International Symposium on Low Power Electronics and Design (ISLPED’03), August 27th 2003