TRANSCRIPT
Decoupled Pipelines: Rationale, Analysis, and Evaluation
Frederick A. Koopmans, Sanjay J. Patel
Department of Computer Engineering
University of Illinois at Urbana-Champaign
2
Outline
Introduction & Motivation
Background
DSEP Design
Average-Case Optimizations
Experimental Results
3
Motivation
Why Asynchronous?
No clock skew
No clock distribution circuitry
Lower power (potentially)
Increased modularity
But what about performance? What is the architectural benefit of removing the clock? Decoupled Pipelines!
4
Motivation
Advantages of a Decoupled Pipeline:
The pipeline achieves average-case performance
Rarely taken critical paths no longer affect performance
New potential for average-case optimizations
5
Synchronous vs. Decoupled
[Diagram: in the synchronous pipeline, Stage1, Stage2, and Stage3 pass data through synchronous latches driven by a global clock, the synchronizing mechanism. In the decoupled pipeline, each stage has its own self-timing logic (Control1, Control2, Control3); stages exchange go/ack events over an asynchronous communication protocol, and elastic buffers replace the synchronous latches.]
6
Outline
Introduction & MotivationBackgroundDSEP DesignAverage Case OptimizationsExperimental Results
7
Self-Timed Logic
Bounded Delay Model
Definition: event = signal transition
A start event is provided when inputs are available; a done event is produced when outputs are stable
The delay is fixed, based on critical-path analysis; the computational circuit is unchanged
[Diagram: a self-timing circuit wraps the unchanged computational circuit. Input produces Output through the computational circuit, while the Start and Done events bracket the fixed matched delay.]
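The bounded-delay model above can be sketched as a toy simulation. This is a minimal illustration, not the authors' implementation: the delay constant and function names are assumptions, and the key point is that the done time depends only on the fixed matched delay, never on the data.

```python
# Sketch of the bounded-delay self-timing model (hypothetical latencies).
# The computational circuit is unchanged; a matched delay asserts "done"
# a fixed time after "start", derived from critical-path analysis.

CRITICAL_PATH_DELAY = 120  # assumed worst-case delay, in time units


def self_timed_stage(compute, inputs, start_time):
    """Return (outputs, done_time); done fires after the fixed delay."""
    outputs = compute(inputs)                       # combinational logic, unchanged
    done_time = start_time + CRITICAL_PATH_DELAY    # fixed, not data-dependent
    return outputs, done_time


# Usage: a stage wrapping a 32-bit add.
out, done = self_timed_stage(lambda ab: (ab[0] + ab[1]) & 0xFFFFFFFF,
                             (7, 5), start_time=0)
```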
8
Asynchronous Logic Gates
C-gate (logical AND): waits for events to arrive on both inputs
XOR-gate (logical OR): waits for an event to arrive on either input
SEL-gate (logical DEMUX): routes an input event to one of its outputs
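The C-gate behavior above (the Muller C-element) can be modeled in a few lines. This is a toy level-based Python model, not a gate-level description; the class name and method are my own, for illustration only.

```python
# Toy model of a Muller C-element (the "C-gate" above): the output
# changes only when both inputs agree, i.e. it waits for a transition
# to have arrived on both inputs before producing its own transition.

class CGate:
    def __init__(self):
        self.a = self.b = self.out = 0

    def update(self, a=None, b=None):
        if a is not None:
            self.a = a
        if b is not None:
            self.b = b
        if self.a == self.b:     # both inputs agree -> output follows them
            self.out = self.a
        return self.out


c = CGate()
c.update(a=1)   # one event has arrived: output holds at 0
c.update(b=1)   # both events have arrived: output transitions to 1
```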
9
Asynchronous Communication Protocol
2-Step, Event-Triggered, Level-Insensitive Protocol
Transactions are encoded in go/ack events
Asynchronously passes instructions between stages
[Diagram: a sender stage and a receiver stage connected by go, ack, and data wires. The waveform shows two transactions (data_1, data_2), each framed by one transition on go and one on ack; only the transitions matter, not the wire levels.]
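The two-phase go/ack protocol above can be sketched as follows. This is an illustrative model under my own naming (Channel, send, receive); each transaction is one toggle of go and one of ack, so the wires return to matching levels between transactions.

```python
# Sketch of the 2-step, event-triggered, level-insensitive protocol:
# a transaction is one transition on "go" (data valid) answered by one
# transition on "ack" (data consumed). Levels are irrelevant; only
# transitions (events) carry meaning.

class Channel:
    def __init__(self):
        self.go = 0       # toggled by the sender
        self.ack = 0      # toggled by the receiver
        self.data = None

    def send(self, value):
        assert self.go == self.ack, "previous transaction not yet acknowledged"
        self.data = value
        self.go ^= 1      # go event: data is valid

    def receive(self):
        assert self.go != self.ack, "no pending transaction"
        value = self.data
        self.ack ^= 1     # ack event: data consumed
        return value


ch = Channel()
ch.send("add r1, r2")
assert ch.receive() == "add r1, r2"
ch.send("sub r3, r4")   # second transaction: same wires, opposite levels
```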
10
Outline
Introduction & MotivationBackgroundDSEP DesignAverage Case OptimizationsExperimental Results
11
DSEP Microarchitecture
At a high level: a 9-stage dynamic pipeline with multiple instruction issue, multiple functional units, and out-of-order execution. It looks like the Intel P6 µarch.
What's the difference?
Decoupled, Self-Timed, Elastic Pipeline
[Pipeline diagram: Fetch (from I-Cache) → Decode → Rename → Read/Reorder → Issue → Execute → Data Read → Retire → Writeback, with results forwarded to Retire and commit/flush paths feeding back to the front of the pipeline.]
12
DSEP Microarchitecture
Decoupled: each stage controls its own latency, based on its local critical path, so stage balancing is not important. Each stage can have several different latencies, with the selection based on its inputs.
Pipeline is operating at several different speeds simultaneously!
13
Pipeline Elasticity
Definition: a pipeline's ability to stretch with the latency of its instruction stream
Global elasticity is provided by the reservation stations and the reorder buffer; it is the same for synchronous and asynchronous pipelines.
[Diagram: Fetch → Execute → Retire, with buffers between the stages.]
When Execute stalls, the buffers allow Fetch and Retire to keep operating
14
Pipeline Elasticity
Local elasticity is needed for a completely decoupled pipeline. It is provided by micropipelines: variable-length queues between stages. They are an efficient implementation with little overhead, and they behave like shock absorbers.
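The micropipeline "shock absorber" behavior can be sketched as a short bounded FIFO between stages. This is a behavioral toy under assumed names; the queue depth is the knob that the AxBxC configurations later vary.

```python
from collections import deque

# A micropipeline modeled as a bounded FIFO between two stages: a fast
# producer stage can run ahead of a momentarily slow consumer until the
# queue fills, at which point the producer stalls.

class Micropipeline:
    def __init__(self, depth):
        self.depth = depth
        self.q = deque()

    def can_accept(self):
        return len(self.q) < self.depth

    def push(self, insn):
        assert self.can_accept(), "queue full: producer stage would stall"
        self.q.append(insn)

    def pop(self):
        return self.q.popleft() if self.q else None


mp = Micropipeline(depth=2)
mp.push("i1")
mp.push("i2")                 # producer has run ahead by two instructions
assert not mp.can_accept()    # a third push would stall the producer
assert mp.pop() == "i1"       # consumer drains in order
```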
15
Outline
Introduction & Motivation
Background
DSEP Design
Average-Case Optimizations
Experimental Results
16
Analysis
Synchronous processor: each stage runs at the speed of the worst-case stage running its worst-case operation. Designer: focus on critical paths and stage balancing.
DSEP: each stage runs at the speed of its own average operation. Designer: optimize for the most common operation.
This is the fundamental advantage of the decoupled pipeline.
17
Average-Case Optimizations
Designer's strategy: implement fine-grained latency tuning; avoid the latency of untaken paths.
Consider a generic example: if the short operation is much more common, the stage's effective latency approaches that of the short operation plus the select logic.
[Diagram: a generic stage in which select logic steers the inputs down either a long-operation or a short-operation path, with a MUX merging the two paths into the outputs.]
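The average-case argument above is just expected-value arithmetic. The numbers below are hypothetical, chosen only to show how a common short path pulls the effective latency toward the short operation plus the select-logic overhead.

```python
# Expected latency of a stage with a short and a long path behind
# select logic (all latencies hypothetical, in time units).

def expected_latency(p_short, t_short, t_long, t_select):
    """Select-logic overhead plus the probability-weighted path latency."""
    return t_select + p_short * t_short + (1 - p_short) * t_long


# 90% short ops at 40 units, 10% long ops at 120, select logic 5:
# 5 + 0.9*40 + 0.1*120 = 53, far below the 120-unit worst case.
lat = expected_latency(0.9, 40, 120, 5)
```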
18
Average-Case ALU
Tune the ALU latency to closely match the input operation; ALU performance becomes proportional to the average op. The computational circuit is unchanged.
[Diagram: the ALU self-timing circuit. A SEL gate routes the Start event to a matched delay for the input's operation class (Arithmetic, Logic, Shift, or Compare); an XOR gate merges the delayed events into Done. The ALU computational circuit itself is unchanged, producing the Output from the Inputs.]
19
Average-Case Decoder
Tune the decoder latency to match the input instruction. Common instructions often have simple encodings; prioritize the most frequent instructions.
[Diagram: the decoder self-timing circuit. A SEL gate routes the Start event to a matched delay for the instruction's encoding (Format 1, 2, or 3); an XOR gate merges the delayed events into Done, alongside the unchanged decoder computational circuit producing the Output from the Input.]
20
Average-Case Fetch Alignment
Optimize for aligned fetch blocks: if the fetch block is aligned on a cache line, it can skip the alignment and masking overhead.
[Diagram: optimized fetch alignment. The address and instruction block feed a Fetch Align/Mask unit; a MUX bypasses it when the fetch block is already aligned.]
Optimization is effective when software/hardware alignment optimizations are effective
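The aligned fast path can be sketched as follows. Line size, function name, and the list-based line representation are all assumptions for illustration; the point is that an aligned block bypasses the shift-and-mask step entirely.

```python
# Sketch of average-case fetch alignment (hypothetical 16-byte lines):
# a fetch block starting on a cache-line boundary is returned as-is,
# skipping the align/mask step; only unaligned fetches pay for shifting.

LINE_BYTES = 16


def fetch(addr, line):
    if addr % LINE_BYTES == 0:        # fast path: aligned, no align/mask
        return line
    offset = addr % LINE_BYTES        # slow path: shift and mask the line
    return line[offset:] + [None] * offset


line = list(range(16))
assert fetch(32, line) == line            # aligned fetch: line unchanged
assert fetch(35, line)[:13] == line[3:]   # unaligned fetch: shifted by 3
```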
21
Average-Case Cache Access
Optimize for consecutive reads to the same cache line, allowing subsequent references to skip the cache access.
[Diagram: optimized cache access. The address is checked against the previous line ("To same line?"); on a match, a MUX selects the previously latched line instead of reading the line from the cache.]
Effective for small-stride access patterns and tight loops in the I-Cache
Very little overhead for non-consecutive references
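The same-line fast path can be modeled with a latched previous line. This is an illustrative sketch: the dict stands in for the cache array, and the class and field names are my own.

```python
# Sketch of average-case cache access: consecutive reads that hit the
# same cache line reuse the previously latched line and skip the array
# access (hypothetical 16-byte lines; a dict stands in for the array).

LINE_BYTES = 16


class LatchedCache:
    def __init__(self, cache):
        self.cache = cache        # line-index -> line contents
        self.last_tag = None
        self.last_line = None
        self.array_reads = 0      # counts real (slow-path) accesses

    def read(self, addr):
        tag = addr // LINE_BYTES
        if tag != self.last_tag:              # slow path: real cache access
            self.array_reads += 1
            self.last_tag, self.last_line = tag, self.cache[tag]
        return self.last_line[addr % LINE_BYTES]


c = LatchedCache({0: list(range(16))})
c.read(0); c.read(4); c.read(8)   # small-stride accesses to one line
assert c.array_reads == 1         # only the first read touched the array
```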
22
Average-Case Comparator
Optimize for the case that a difference exists in the lower 4 bits of the inputs; the 4-bit comparison is more than 50% faster than the 32-bit comparison.
[Diagram: optimized comparator. The inputs feed both a 4-bit compare and a 32-bit compare; a MUX selects the result, taking the fast path whenever the low bits differ.]
Very effective for iterative loops Can be extended for tag comparisons
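The low-bits-first comparator can be sketched with a simple early-out. Function and label names are mine; for loop-counter-style inputs the low 4 bits almost always differ, so the fast path dominates.

```python
# Sketch of the average-case comparator: check the low 4 bits first and
# fall back to the full 32-bit compare only when they are equal.

def not_equal(a, b):
    """Return (inputs differ?, which comparator produced the answer)."""
    if (a ^ b) & 0xF:                 # fast path: low 4 bits differ
        return True, "4-bit"
    return (a ^ b) != 0, "32-bit"     # slow path: full-width compare


assert not_equal(7, 8) == (True, "4-bit")       # loop-counter-style inputs
assert not_equal(0x10, 0x20) == (True, "32-bit")
assert not_equal(5, 5) == (False, "32-bit")
```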
23
Outline
Introduction & Motivation
Background
DSEP Design
Average-Case Optimizations
Experimental Evaluation
24
Simulation Environment
VHDL simulator using the Renoir Design Suite
MIPS I instruction set
Fetch and Retire bandwidth = 1
Execute bandwidth ≤ 4
4-entry split instruction window
64-entry reorder buffer
Benchmarks:
BS: 50-element bubble sort
MM: 10x10 integer matrix multiply
25
Two Pipeline Configurations
Operation               DSEP Latencies                  Fixed Latencies
Fetch                   100                             120
Decode                  50/80/120                       120
Rename                  80/120/150                      120
Read                    120                             120
Execute                 20/40/80/100/130/150/360/600    120/360/600
Retire                  5/100/150                       120
Caches                  100                             120
Main Memory             960                             960
Micropipeline Register  5                               5
“Synchronous” Clock Period = 120 time units
26
DSEP Performance
Compared the Fixed and DSEP configurations: DSEP increased performance by 28% for BS and 21% for MM.
[Bar chart: execution time (normalized to 100) for Bubble-Sort and Matrix-Multiply, Fixed vs. DSEP.]
27
Micropipeline Performance
Goals: determine the need for local elasticity, and determine appropriate lengths for the queues.
Method: evaluate DSEP configurations of the form AxBxC, where
A = micropipelines in Decode, Rename, and Retire
B = micropipelines in Read
C = micropipelines in Execute
All configurations include a fixed-length instruction window and reorder buffer.
28
Micropipeline Performance
Measured percent speedup over 1x1x1; 2x2x1 is best for both benchmarks: a 2.4% performance improvement for BS and 1.7% for MM. Stalls in Fetch were reduced by 60% with 2x2x1.
[Bar charts: percent speedup over 1x1x1. Bubble-Sort: 2x2x2 = 2.3%, 3x3x3 = 1.4%, 2x2x1 = 2.4%. Matrix-Multiply: 2x2x2 = 1.5%, 3x3x3 = 1.1%, 2x2x1 = 1.7%.]
29
OOO Engine Utilization
Measured out-of-order engine utilization for the Instruction Window (IW) and Reorder Buffer (RB), where utilization = the average number of instructions in the buffer. IW utilization is up 75% and RB utilization up 40%.
[Bar chart: average buffer utilization (0 to 8 instructions) of the Instruction Window and Reorder Buffer for BS and MM, under the 1x1x1 and 2x2x1 configurations.]
30
Total Performance
Compared the Fixed and DSEP configurations: DSEP 2x2x1 increased performance by 29% for BS and 22% for MM.
[Bar chart: execution time (normalized to 100) for Bubble-Sort and Matrix-Multiply under Fixed, DSEP 1x1x1, and DSEP 2x2x1.]
31
Conclusions
Decoupled, self-timed operation: average-case optimizations significantly increase performance, and rarely taken critical paths no longer matter.
Elasticity: removes pipeline jitter from decoupled operation and increases utilization of existing resources; it is not as important as the average-case optimizations (at least in our experiments).
32
Questions?