ICCD’03
1
Distributed Reorder Buffer Schemes for Low Power *
*supported in part by DARPA through the PAC-C program and NSF
Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
21st International Conference on Computer Design (ICCD’03), October 14th 2003

Outline
– Reorder Buffer (ROB) complexities
– Motivation for the low-complexity ROB
– Low-complexity ROB designs
    Fully Distributed ROB
    Retention Latches (RLs) revisited (ICS’02)
    Combined Scheme
– Results
– Concluding remarks

P6-style Superscalar Datapath
[Figure: pipeline with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the Issue Queue (IQ), instruction issue to Function Units FU1 ... FUm (EX), the Reorder Buffer (ROB), the Architectural Register File (ARF), and the result/status forwarding buses.]
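
PPC 620-style Superscalar Datapath
[Figure: pipeline with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the Issue Queue (IQ), instruction issue to Function Units FU1 ... FUm (EX), the ROB, a separate set of Rename Buffers (RB), the Architectural Register File (ARF), and the result/status forwarding buses.]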

ROB Port Requirements for a W-way CPU
– Decode/Dispatch: W write ports to set up entries
– Dispatch/Issue: 2W read ports to read the source operands
– Writeback: W write ports to write results
– Commit: W read ports for instruction commitment
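
As a rough illustration of the port counts on this slide, the sketch below totals the ports of a centralized ROB for a given machine width W. The function name and dictionary keys are hypothetical, chosen only for this example. Reading the 2W source read ports as "two source operands per instruction" is an assumption, not something the slide states.

```python
def rob_port_requirements(width: int) -> dict:
    """Port counts of a centralized ROB for a W-way CPU, as listed on the slide."""
    if width < 1:
        raise ValueError("machine width must be at least 1")
    ports = {
        "dispatch_write": width,    # W write ports to set up entries
        "source_read": 2 * width,   # 2W read ports (assumed: two operands per instruction)
        "writeback_write": width,   # W write ports to write results
        "commit_read": width,       # W read ports for instruction commitment
    }
    # The total is computed before the "total" key is added to the dictionary.
    ports["total"] = sum(ports.values())
    return ports


if __name__ == "__main__":
    for w in (2, 4, 8):
        print(w, rob_port_requirements(w))
```

The total grows linearly with W, and every port is attached to the same centralized structure.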

What This Work is All About
– ROB complexity reduction is important for reducing power and improving performance
    ROB dissipates a non-trivial fraction of the total chip power
    ROB accesses stretch over several cycles
– Goal of this work: Reduce the complexity and power dissipation of the ROB without sacrificing performance

Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell.]
– Area reduction: 71%
– Shorter bit and wordlines

Reorder Buffer Distribution
[Figure: the P6-style datapath with the ROB storage split into ROB Components (ROBCs), ROBC 1 ... ROBC m, one per function unit FU1 ... FUm. The centralized ROB holds pointers to entries within the ROBCs.]

Impact of Distributing the ROB
– Each ROBC is effectively a small Rename Buffer
    Smaller read/write access energy
    Faster access time
– Distributing physical storage in this manner allows FUs to use shorter buses to write their respective ROBCs
    Lower energy dissipation on the wires (we have NOT accounted for energy savings from using shorter wires)
– Fits in naturally with a multi-clustered datapath design

Problems with the earlier Multi-banked RF Schemes
and some good news (listed under each problem)!
– Port conflicts result in performance penalty
    Totally avoid write port conflicts
    Minimize read port conflicts at commitment
    Totally avoid source read port conflicts
– Interconnection network is more complex
    Completely remove source read ports

ROBCs Assigned to Each Function Unit
[Figure: a centralized ROB with entries 1 ... n, each holding an (FU_id, offset) pair, next to the distributed ROBCs #1 ... #m, one per function unit FU #1 ... FU #m. Each centralized entry points to an entry within a ROBC.]

Good News: Write port conflicts are avoided
[Figure: the centralized ROB of (FU_id, offset) entries and the distributed ROBCs #1 ... #m. Each function unit FU #1 ... FU #m writes its own ROBC through 1 write port.]

Round Robin Scheduling at Dispatch Time
[Figure, animated over ten slides: a centralized ROB with entries 1 ... n holding (FU_id, offset) pairs, and four Int ADD ROBCs #1 ... #4. Incoming instructions are assigned to the Int ADD ROBCs in round-robin order. For each instruction an entry is reserved in the selected ROBC and the pair is recorded in the centralized ROB:]
    ADD: ROBC #1, offset 1, recorded as (1, 1)
    SUB: ROBC #2, offset 1, recorded as (2, 1)
    AND: ROBC #3, offset 1, recorded as (3, 1)
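
The dispatch sequence in this animation can be sketched in a few lines. This is a minimal model under stated assumptions, not the authors' implementation: the class names `ROBC` and `DistributedROB` are hypothetical, offsets are handed out by a simple counter that never frees entries, and a full ROBC simply raises an error instead of modeling a dispatch stall.

```python
from dataclasses import dataclass, field


@dataclass
class ROBC:
    """One ROB component: a small buffer private to one function unit."""
    fu_id: int
    capacity: int
    used: int = 0

    def reserve(self) -> int:
        """Reserve the next entry and return its 1-based offset."""
        if self.used >= self.capacity:
            raise RuntimeError("ROBC full: dispatch would block")
        self.used += 1
        return self.used


@dataclass
class DistributedROB:
    """Centralized FIFO of (FU_id, offset) pointers plus the ROBCs of one FU type."""
    robcs: list
    central: list = field(default_factory=list)  # (fu_id, offset, opcode) in program order
    next_idx: int = 0

    def dispatch(self, opcode: str) -> tuple:
        """Reserve an entry in the next ROBC (round robin) and record the pointer."""
        robc = self.robcs[self.next_idx]
        offset = robc.reserve()
        self.next_idx = (self.next_idx + 1) % len(self.robcs)
        entry = (robc.fu_id, offset, opcode)
        self.central.append(entry)
        return entry


if __name__ == "__main__":
    rob = DistributedROB(robcs=[ROBC(fu_id=i, capacity=2) for i in range(1, 5)])
    for op in ("ADD", "SUB", "AND"):
        print(rob.dispatch(op))
```

With four Int ADD ROBCs, ADD, SUB and AND land in ROBCs #1, #2 and #3 at offset 1, matching the pairs shown in the figure.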

Good News: Avoiding Read Port Conflicts
[Figure: the centralized ROB holds ADD (1, 1), SUB (2, 1) and AND (3, 1), which sit in three different ROBCs. Each ROBC has 1 read port to commitment, so these entries do not compete for a read port.]

Round Robin Scheduling at Dispatch Time (continued)
[Figure, animated over six slides: MUL and DIV are dispatched after ADD (1, 1), SUB (2, 1) and AND (3, 1). Both are assigned to the Int MUL/DIV ROBC #5, where entries 1 and 2 are reserved:]
    MUL: ROBC #5, offset 1, recorded as (5, 1)
    DIV: ROBC #5, offset 2, recorded as (5, 2)

Read Port Conflicts at Commitment
[Figure: the centralized ROB holds ADD (1, 1), SUB (2, 1), AND (3, 1), MUL (5, 1) and DIV (5, 2). The Int MUL/DIV ROBC #5 has 1 read port to commitment.]
CONFLICT: if MUL and DIV want to commit in the same cycle
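
A minimal sketch of the conflict described here, assuming in-order commitment that stops at the first entry whose ROBC has already used its read port in the current cycle. The function name `commit_one_cycle` and its parameters are hypothetical, and every entry is assumed to have completed execution.

```python
def commit_one_cycle(central, width, read_ports_per_robc=1):
    """Return the entries that can commit this cycle, in program order.

    central: list of (fu_id, offset, opcode) tuples, oldest first.
    width:   maximum number of instructions committed per cycle.
    """
    used = {}        # fu_id -> read ports already used this cycle
    committed = []
    for fu_id, offset, opcode in central[:width]:
        if used.get(fu_id, 0) >= read_ports_per_robc:
            break    # read port conflict: in-order commitment stops here
        used[fu_id] = used.get(fu_id, 0) + 1
        committed.append((fu_id, offset, opcode))
    return committed


if __name__ == "__main__":
    # MUL and DIV both live in ROBC #5, which has a single read port.
    print(commit_one_cycle([(5, 1, "MUL"), (5, 2, "DIV")], width=4))
    # ADD, SUB and AND live in different ROBCs.
    print(commit_one_cycle([(1, 1, "ADD"), (2, 1, "SUB"), (3, 1, "AND")], width=4))
```

In this model DIV has to wait one cycle behind MUL, while ADD, SUB and AND commit together because they sit in different ROBCs.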

Distributed ROB Design 1: with source read ports
Ports on each ROBC:
– Writeback: 1 write port to write results
– Dispatch/Issue: 1 read port to read the source operands
– Commit: 1 read port for instruction commitment

Experimental Setup: the AccuPower (DATE’02)
[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar), which produces performance stats together with transition counts and context information. VLSI layout data yields a SPICE deck, and SPICE produces measures of energy per transition. An energy/power estimator combines both to produce power/energy stats.]

Configuration of the Simulated System
Machine width       4-way
Issue Queue         32 entries
Reorder Buffer      96 entries
Load/Store Queue    32 entries
Simulated the execution of SPEC2000 benchmarks

Peak/Average demands on the number of ROBC entries
ROBC type                   IntADD #1-#4   IntMUL/DIV     FPADD #1-#4    FPMUL/DIV      Load
                            peak   avg.    peak   avg.    peak   avg.    peak   avg.    peak   avg.
SPEC 2000 Integer Average   16.9   4.4     4.1    0.1     1.6    0.04    3.8    0.04    28.6   9.3
SPEC 2000 FP Average        14.2   4.9     3.2    0.8     3.8    0.6     6.7    1.1     23.5   7.5
SPEC 2000 Average           15.7   4.6     3.7    0.4     2.6    0.3     5.0    0.5     26.4   8.5
Entries assigned per ROBC   8 each         4              4 each         4              16
Total: 8 + 8 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 4 + 16 = 72 entries (the 8_4_4_4_16 configuration)

Percentage of cycles when dispatch blocks for 8_4_4_4_16
ROBC type                   IntADD #1-#4   IntMUL/DIV   FPADD #1-#4   FPMUL/DIV   Load
SPEC 2000 Integer Average   0.9            0.1          0             0           5.2
SPEC 2000 FP Average        1.5            1.0          0.1           0.8         1.9
SPEC 2000 Average           1.2            0.5          0             0.4         3.8
Entries assigned per ROBC   8 each         4            4 each        4           16
Total: 8 + 8 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 4 + 16 = 72 entries
Average IPC drop with the 8_4_4_4_16 configuration: 4.8%

Reducing performance penalty: 12_6_4_6_20 Configuration
(shown against the same dispatch-blocking percentages measured for 8_4_4_4_16)
ROBC type                   IntADD #1-#4   IntMUL/DIV   FPADD #1-#4   FPMUL/DIV   Load
Entries assigned per ROBC   12 each        6            4 each        6           20
Total: 12 + 12 + 12 + 12 + 6 + 4 + 4 + 4 + 4 + 6 + 20 = 96 entries (the 12_6_4_6_20 configuration)
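
The configuration names encode the per-ROBC sizes, and the totals can be checked with a few lines of arithmetic. In this sketch the helper name `total_entries` and the `UNIT_COUNTS` table are hypothetical. The unit counts (four IntADD, one IntMUL/DIV, four FPADD, one FPMUL/DIV, one Load) are read off the table headers above.

```python
# Number of ROBCs of each type, in the order used by the configuration name.
UNIT_COUNTS = {
    "IntADD": 4,
    "IntMUL/DIV": 1,
    "FPADD": 4,
    "FPMUL/DIV": 1,
    "Load": 1,
}


def total_entries(config: str) -> int:
    """Total ROBC entries for a name such as '8_4_4_4_16'."""
    sizes = [int(part) for part in config.split("_")]
    if len(sizes) != len(UNIT_COUNTS):
        raise ValueError("expected one size per ROBC type")
    return sum(count * size for count, size in zip(UNIT_COUNTS.values(), sizes))


if __name__ == "__main__":
    for name in ("8_4_4_4_16", "12_6_4_6_20"):
        print(name, total_entries(name))
```

The larger configuration has the same 96 entries as the centralized ROB of the simulated baseline.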

Performance Results for 12_6_4_6_20 Configuration
[Figure: IPC (axis 0 to 3) per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. FP: applu, art, mesa, mgrid, swim, wupwise, FP Avg. Series: Base, 2-cycle ROB access and full bypass; 2 read ports, 12_6_4_6_20.]
Average IPC drop with the 12_6_4_6_20 configuration: 2.4%

Eliminating All Source Read Ports
Design 1 has three ports on each ROBC: 1 write port for writeback, 1 read port for Dispatch/Issue to read the source operands, and 1 read port for commitment. Removing the source read port leaves:
– Writeback: 1 write port to write results
– Commit: 1 read port for instruction commitment

Where are the Source Values Coming From?
[Figure: the P6-style datapath (Fetch, Decode/Dispatch, IQ, FU1 ... FUm, ROB, ARF, result/status forwarding buses) with the possible sources of operand values marked 1, 2 and 3.]

Where are the Source Values Coming From?
96-entry ROB, 4-way processor, SPEC2K benchmarks
– Forwarding: 62%
– ARF: 32%
– ROB: 6%

How Efficiently are the Ports Used?
– Decode/Dispatch: W write ports to set up entries
– Dispatch/Issue: 2W read ports to read the source operands (6%)
– Writeback: W write ports to write results
– Commit: W read ports for instruction commitment

Our Solution: Elimination of Read Ports
[Figure, animated over three slides: the P6-style datapath with the operand sources marked 1, 2 and 3. In the last frame only the paths marked 1 and 3 remain.]

Distributed Reorder Buffer Scheme
[Figure: the datapath with ROBC 1 ... ROBC m attached to FU1 ... FUm. The centralized ROB holds pointers to entries within the ROBCs.]

Elimination of Source Read Ports
[Figure, shown over two slides: the distributed datapath with ROBC 1 ... ROBC m and the centralized ROB holding pointers to entries within the ROBCs.]

Completely Eliminating the Source Read Ports on the ROBCs
– The Problem: Issue of instructions that require a value stored in a ROBC will stall
– Solutions:
    Forward the value to the waiting instruction at the time of committing the value: LATE FORWARDING
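
Late forwarding can be pictured as a second broadcast of the result at commit time. The sketch below is an illustration under stated assumptions, not the authors' design: the names `IssueQueueEntry` and `commit_with_late_forwarding`, the tag strings, and the dictionary used as the ARF are all hypothetical.

```python
class IssueQueueEntry:
    """An instruction waiting in the issue queue for its source operands."""

    def __init__(self, opcode, source_tags):
        self.opcode = opcode
        # None marks an operand that has not been captured yet.
        self.operands = {tag: None for tag in source_tags}

    def capture(self, tag, value):
        """Latch a value seen on a forwarding bus if this entry waits for it."""
        if tag in self.operands and self.operands[tag] is None:
            self.operands[tag] = value

    def ready(self):
        return all(value is not None for value in self.operands.values())


def commit_with_late_forwarding(robc_entry, arf, issue_queue):
    """Commit one ROBC entry and re-broadcast its value to waiting consumers."""
    tag, dest_reg, value = robc_entry
    arf[dest_reg] = value          # normal commitment into the ARF
    for entry in issue_queue:      # late forwarding over the forwarding buses
        entry.capture(tag, value)


if __name__ == "__main__":
    # The consumer was dispatched after the producer wrote back, so it missed
    # the normal forwarding and cannot read the ROBC (no source read ports).
    iq = [IssueQueueEntry("ADD", ["t7"])]
    arf = {}
    print(iq[0].ready())
    commit_with_late_forwarding(("t7", "r3", 42), arf, iq)
    print(iq[0].ready(), arf)
```

The consumer stays stalled until the producer commits, which is the source of the IPC loss this scheme incurs.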

Late Forwarding: Use the Normal Forwarding Buses!
[Figure, shown over two slides: the distributed datapath with ROBC 1 ... ROBC m and the centralized ROB holding pointers to entries within the ROBCs. A path labeled Late Forwarding is added in the second frame.]

Performance Drop of Simplified ROBC Design
[Figure: performance drop (%) per benchmark with no ROBC source read ports and Late Forwarding. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. FP: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg. Two bars are labeled 37% and 17%.]
Average IPC drop: 9.6%
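
IPC Penalty: Source Value Not Accessible within the ROBC
[Figure: lifetime of a result value along a time axis, with the events Result Generation (Forwarding) and Late Forwarding/Commitment. Between them the value resides within a ROBC; after commitment the value resides within the ARF.]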

Improving IPC with No Read Ports
– Cache recently generated values in a set of RETENTION LATCHES (RL)
– Retention Latches are SMALL and FAST
    Only 8 to 16 latches needed in the set
    Entire set has 1 or 2 read ports
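
A small software model helps show what the retention latches do. This is a sketch under assumptions: the class name `RetentionLatches` and its methods are hypothetical, replacement is FIFO (the talk's Design 2 uses FIFO RLs), and the read-port limit is not modeled.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small FIFO cache of recently generated results, keyed by result tag."""

    def __init__(self, size=8):
        if size < 1:
            raise ValueError("need at least one latch")
        self.size = size
        self.latches = OrderedDict()   # insertion order doubles as FIFO order

    def insert(self, tag, value):
        """Store a newly generated result, evicting the oldest one if full."""
        self.latches.pop(tag, None)    # a re-inserted tag becomes the newest
        if len(self.latches) >= self.size:
            self.latches.popitem(last=False)
        self.latches[tag] = value

    def lookup(self, tag):
        """Return the cached value, or None on a miss (lookups do not reorder)."""
        return self.latches.get(tag)


if __name__ == "__main__":
    rl = RetentionLatches(size=2)
    for tag, value in (("t1", 10), ("t2", 20), ("t3", 30)):
        rl.insert(tag, value)
    print(rl.lookup("t1"), rl.lookup("t2"), rl.lookup("t3"))
```

On a miss, the consumer in this model would fall back to waiting for the value to be forwarded at commitment.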

Adding Retention Latches into the Picture
[Figure, shown over two slides: the distributed datapath with ROBC 1 ... ROBC m, the centralized ROB holding pointers to entries within the ROBCs, and the Late Forwarding path. A block labeled RETENTION LATCHES is added in the second frame.]

Distributed ROB Design 2: with Retention Latches
Ports on each ROBC, with all source read ports eliminated:
– Writeback: 1 write port to write results
– Commit: 1 read port for instruction commitment
– Eight 2-ported FIFO RLs

Performance Results for 12_6_4_6_20 Configuration
[Figure: IPC (axis 0 to 3) per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. FP: applu, art, mesa, mgrid, swim, wupwise, FP Avg. Series: Base, 2-cycle ROB access and full bypass; Design 1: 2 read ports, 12_6_4_6_20; Design 2: Eight 2-ported FIFO RLs, 12_6_4_6_20 with 1 read port.]
Average IPC drop with the 12_6_4_6_20 configuration: 1.7%

Performance Results for 12_6_4_6_20 Configuration
[Figure: IPC (axis 0 to 3) per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. FP: applu, art, mesa, mgrid, swim, wupwise, FP Avg. Series: Base, 1-cycle ROB access and full bypass; Design 1: 2 read ports, 12_6_4_6_20; Design 2: Eight 2-ported FIFO RLs, 12_6_4_6_20 with 1 read port.]
Average IPC drop with the 12_6_4_6_20 configuration: 3.8%

Power Results for 12_6_4_6_20 Configuration
[Figure: power savings (%, axis 0 to 60) per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. FP: applu, art, mesa, mgrid, swim, wupwise, FP Avg. Series: Eight 2-ported FIFO latches; Design 1: 2 read ports, 12_6_4_6_20; Design 2: Eight 2-ported FIFO RLs, 12_6_4_6_20 with 1 read port.]
Power savings: 49%, 47%, 23%

Power Results for 12_6_4_6_20 Configuration (Compared to Baseline case with 64 entry Rename Buffers)
[Figure: power savings (%, axis 0 to 60) per benchmark. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. FP: applu, art, mesa, mgrid, swim, wupwise, FP Avg. Series: Eight 2-ported FIFO latches; Design 1: 2 read ports, 12_6_4_6_20; Design 2: Eight 2-ported FIFO RLs, 12_6_4_6_20 with 1 read port.]
Power savings: 39%, 37%, 20%

Summary of Results
– Low performance degradation:
    1.7% IPC drop on the average (compared to 2-cycle ROB)
    3.8% IPC drop on the average (compared to 1-cycle ROB)
– ROB power savings:
    as high as 49% are realized (compared to P6-style datapath: 96 entry ROB)
    as high as 39% (compared to Rename Buffer design: 96 entry ROB, 64 entry RB)

Conclusions
– We introduced a conflict-free distributed Reorder Buffer design
– ROB power savings as high as 49% are realized with only a small (1.7%) performance penalty
– ROB complexity is drastically reduced by
    Distributing the ROB into multiple banks
    Reducing the port requirements to no more than 2 ports for each ROB component
~ Thank You~

Related Work
– Replicated (Kessler, IEEE Micro) and distributed (Canal et al., HPCA’00 and Farkas et al., MICRO’97) RFs in a clustered organization
– Multiple Register Banks (Cruz et al., ISCA’00 & Balasubramonian et al., MICRO’01)
– Multiple Register Banks with additional pipeline stage to avoid complex arbitration logic (Tseng et al., ISCA’03)
– Multiple Register Banks without write port conflicts (Wallace et al., PACT’96)

ROB Port Requirements for a W-way CPU
– Decode/Dispatch: 1 W-wide write port to set up entries
– Dispatch/Issue: 2W read ports to read the source operands
– Writeback: W write ports to write results
– Commit: 1 W-wide read port for instruction commitment

Fully Distributed Reorder Buffer Scheme
– Distributed ROB Components (ROBCs) are assigned to each Function Unit
    No write port conflicts at writeback stage, and minimal read port conflicts at commitment: negligible performance penalty
    Each ROBC can be tailored to the needs of its FU: no over-commitment of resources, less complexity
– The FIFO structure that maintains pointers to the ROBCs remains centralized
[Figure, shown over two slides: a centralized ROB with entries 1 ... n, each holding an (FU_id, offset) pair that points to an entry within one of the distributed ROBCs #1 ... #m.]

Results for the Scheme with Retention Latches
[Figure: power savings (%, axis 0 to 60) per benchmark for a centralized ROB with eight 2-ported FIFO Retention Latches. Integer: gap, gcc, gzip, parser, perl, twolf, vortex, vpr, Int Avg. FP: applu, art, mesa, mgrid, swim, wupwise, FP Avg.]
Power savings: 23%