L16 – Pipelining | Comp 411 – Spring 2011 | 4/20/2011
Pipelining
Between 411 problem sets, I haven’t had a minute to do laundry.
Now that’s what I call dirty laundry.
Read Chapter 4.5-4.8
Forget 411… Let’s Solve a “Relevant Problem”
Device: Washer
Function: Fill, Agitate, Spin
WasherPD = 30 mins
Device: Dryer
Function: Heat, Spin
DryerPD = 60 mins
INPUT: dirty laundry
OUTPUT: 4 more weeks
One Load at a Time
Everyone knows that the real reason that UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do.
The fact is, doing laundry one load at a time is not smart.
(Sorry Mom, but you were wrong about this one!)
Step 1:
Step 2:
Total = WasherPD + DryerPD = 90 mins
Doing N Loads of Laundry
Here’s how they do laundry at Duke, the “combinational” way.
(Actually, this is just an urban legend. No one at Duke actually does laundry. The butlers all arrive on Wednesday morning, pick up the dirty laundry, and return it all pressed and starched by dinner.)
Step 1:
Step 2:
Step 3:
Step 4:
Total = N*(WasherPD + DryerPD) = N*90 mins
…
Doing N Loads… the UNC way
UNC students “pipeline” the laundry process.
That’s why we wait!
Step 1:
Step 2:
Step 3:
Total = N * Max(WasherPD, DryerPD) = N*60 mins
…
Actually, it’s more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.
Recall Our Performance Measures
Latency: The delay from when an input is established until the output associated with that input becomes valid.
(Duke Laundry = 90 mins)  (UNC Laundry = 120 mins)
Throughput: The rate at which inputs or outputs are processed.
(Duke Laundry = 1/90 outputs/min)  (UNC Laundry = 1/60 outputs/min)
Assuming that the wash is started as soon as possible and waits (wet) in the washer until the dryer is available.
Even though we increase latency, it takes less time per load.
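These latency/throughput figures can be reproduced with a short calculation (a sketch in Python; the 30- and 60-minute device times come from the slides):

```python
# Latency and throughput for the two laundry strategies.
WASHER_PD = 30  # minutes, from the slides
DRYER_PD = 60   # minutes, from the slides

# "Duke" (combinational): each load runs start to finish by itself.
duke_latency = WASHER_PD + DRYER_PD           # 90 minutes
duke_throughput = 1 / duke_latency            # 1/90 loads per minute

# "UNC" (pipelined): a finished wash waits (wet) until the dryer frees up,
# so per-load latency stretches to two dryer periods in steady state.
unc_latency = 2 * max(WASHER_PD, DRYER_PD)    # 120 minutes
unc_throughput = 1 / max(WASHER_PD, DRYER_PD) # 1/60 loads per minute

print(duke_latency, unc_latency)       # 90 120
print(unc_throughput > duke_throughput)  # True: pipelining wins on throughput
```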
Okay, Back to Circuits…
[Figure: combinational circuit: input X feeds F and G, whose outputs feed H to produce P(X)]
For combinational logic: latency = tPD, throughput = 1/tPD.
We can’t get the answer faster, but are we making effective use of our hardware at all times?
[Timing: X changes, then F(X) and G(X) settle, then P(X) becomes valid]
F & G are “idle”, just holding their outputs stable while H performs its computation
Pipelined Circuits
Use registers to hold H’s inputs stable!
[Figure: pipelined circuit: input X feeds F (15 ns) and G (20 ns); registers capture their outputs and feed H (25 ns), producing P(X)]
Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We’ve created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.
Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0):
                  Latency         Throughput
Unpipelined       45 ns           1/45 per ns
2-stage pipeline  50 ns (worse)   1/25 per ns (better)
Pipelining uses registers to improve the throughput of combinational circuits.
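The same arithmetic applies to the circuit (a sketch in Python using the 15/20/25 ns delays and the ideal zero-delay registers from the slide):

```python
# Stage delays for the 2-stage pipeline example, in ns (from the slide).
T_F, T_G, T_H = 15, 20, 25

# Unpipelined: the critical path is max(F, G) followed by H.
unpipelined_latency = max(T_F, T_G) + T_H   # 45 ns
unpipelined_throughput = 1 / unpipelined_latency

# Pipelined (ideal registers): the clock period is set by the slowest stage,
# stage 1 = F and G in parallel, stage 2 = H.
clock = max(max(T_F, T_G), T_H)             # 25 ns
pipelined_latency = 2 * clock               # 50 ns: worse latency
pipelined_throughput = 1 / clock            # 1/25 per ns: better throughput

print(unpipelined_latency, pipelined_latency)  # 45 50
```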
Pipeline Diagrams
Clock cycle:   i    i+1     i+2      i+3      i+4
Input:         Xi   Xi+1    Xi+2     Xi+3     …
F Reg:         -    F(Xi)   F(Xi+1)  F(Xi+2)  …
G Reg:         -    G(Xi)   G(Xi+1)  G(Xi+2)  …
H Reg:         -    -       H(Xi)    H(Xi+1)  H(Xi+2)

The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle.
[Figure: the same 2-stage pipelined circuit: F (15 ns) and G (20 ns) feed registers, then H (25 ns)]
This is an example of parallelism. At any instant we are computing 2 results.
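The diagonal flow in the diagram can also be generated programmatically (a sketch in Python; modeling the parallel F/G pair as a single first stage is my simplification):

```python
def pipeline_diagram(stages, n_inputs):
    """Map each pipeline register to its contents in every clock cycle.

    Input Xc enters at cycle c; the register after stage `depth` holds the
    result for input (cycle - depth), matching the diagonal in the slide.
    """
    n_cycles = n_inputs + len(stages)
    rows = {}
    for depth, stage in enumerate(stages, start=1):
        rows[stage] = [
            f"{stage}(X{c - depth})" if 0 <= c - depth < n_inputs else "-"
            for c in range(n_cycles)
        ]
    return rows

rows = pipeline_diagram(["FG", "H"], 3)  # F and G treated as one stage
for stage, cells in rows.items():
    print(f"{stage} Reg:", "  ".join(cells))
```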
Pipelining Summary
Advantages:
– Higher throughput than combinational system
– Different parts of the logic work on different parts of the problem…
Disadvantages:
– Generally, increases latency
– Only as good as the *weakest* link (often called the pipeline’s BOTTLENECK)
Review of CPU Performance
MIPS = Millions of Instructions/Second
Freq = Clock Frequency, MHz
CPI = Clocks per Instruction
MIPS = Freq / CPI
To Increase MIPS:
1. DECREASE CPI.
- RISC simplicity reduces CPI to 1.0.
- CPI below 1.0? State-of-the-art multiple instruction issue
2. INCREASE Freq.
- Freq limited by delay along longest combinational path; hence
- PIPELINING is the key to improving performance.
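As a quick sanity check on the formula (a sketch in Python; the 500 MHz and CPI figures below are made-up examples, not from the slides):

```python
def mips(freq_mhz, cpi):
    """MIPS = Freq (MHz) / CPI: millions of instructions per second."""
    return freq_mhz / cpi

# A hypothetical 500 MHz single-issue RISC machine with CPI = 1.0:
print(mips(500, 1.0))   # 500.0 MIPS
# The same clock with a multicycle CPI of 4 retires 4x fewer instructions/sec:
print(mips(500, 4.0))   # 125.0 MIPS
```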
Where Are the Bottlenecks?
Pipelining goal: Break LONG combinational paths ⇒ put the memories and the ALU in separate stages.
[Figure: single-cycle miniMIPS datapath: PC, +4 adder, Instruction Memory, Register File, ALU, Data Memory, sign extension, and control logic driving PCSEL, WASEL, ASEL, BSEL, WDSEL, ALUFN, Wr, and WERF]
miniMIPS Timing
Different instructions use various parts of the data path.
Program execution order (over time):
add $4, $5, $6
beq $1, $2, 40
lw  $3, 30($0)
jal 20000
sw  $2, 20($4)

The above scenario is possible only if the system could vary the clock period based on the instruction being executed. This leads to complicated timing generation, and, in the end, slower systems, since it is not very compatible with pipelining!

1 instr every 14 ns (add), 14 ns (beq), 20 ns (lw), 9 ns (jal), 19 ns (sw)

Component delays: Instruction Fetch 6 ns, Instruction Decode 2 ns, Register Prop Delay 2 ns, ALU Operation 5 ns, Branch Target 4 ns, Data Access 6 ns, Register Setup 1 ns
Uniform miniMIPS Timing
With a fixed clock period, we have to allow for the worst case.
Program execution order (over time):
add $4, $5, $6
beq $1, $2, 40
lw  $3, 30($0)
jal 20000
sw  $2, 20($4)

By accounting for the “worst case” path (i.e. allowing time for each possible combination of operations) we can implement a fixed clock period. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining!

Isn’t the net effect just a slower CPU?

1 instr EVERY 20 ns

Component delays: Instruction Fetch 6 ns, Instruction Decode 2 ns, Register Prop Delay 2 ns, ALU Operation 5 ns, Branch Target 4 ns, Data Access 6 ns, Register Setup 1 ns
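A quick calculation contrasts the two clocking schemes (a sketch in Python using the per-instruction times given on these slides):

```python
# Per-instruction datapath times from the slides, in ns:
instr_time = {"add": 14, "beq": 14, "lw": 20, "jal": 9, "sw": 19}

# Variable clock: each instruction takes exactly as long as its own path.
variable_clock_total = sum(instr_time.values())    # 76 ns for 5 instructions

# Fixed clock: every instruction pays for the slowest path (lw, 20 ns).
fixed_clock = max(instr_time.values())             # 20 ns
fixed_clock_total = fixed_clock * len(instr_time)  # 100 ns

print(variable_clock_total, fixed_clock_total)     # 76 100
# The fixed clock looks slower per instruction, but it is what makes
# pipelining possible: every instruction fits in one uniform period.
```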
Goal: 5-Stage Pipeline
GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)
APPROACH: structure processor as 5-stage pipeline:
IF (Instruction Fetch stage): Maintains the PC, fetches one instruction per cycle, and passes it to the ID/RF stage.
ID/RF (Instruction Decode/Register File stage): Decodes control lines and selects source operands.
ALU (ALU stage): Performs the specified operation and passes the result to the MEM stage.
MEM (Memory stage): If it’s a lw, uses the ALU result as an address and passes the memory data (or the ALU result if not lw) to the WB stage.
WB (Write-Back stage): Writes the result back into the register file.
5-Stage miniMIPS

[Figure: 5-stage miniMIPS datapath: pipeline registers (PCREG/IRREG, PCALU/IRALU, PCMEM/IRMEM/YMEM/WDMEM, PCWB/IRWB/YWB) divide the datapath into Instruction Fetch, Register File, ALU, Memory, and Write-Back stages; branch comparison (=, BZ) happens in the Register File stage]
The address is available right after the instruction enters the Memory stage, but the data is needed just before the rising clock edge at the end of the Write-Back stage: almost 2 clock cycles later.
• Omits some details
• NO bypass or interlock logic
Pipelining
Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?
[Figure: three loads (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)) each passing through Instruction fetch, Reg, ALU, Data access, Reg. Unpipelined: one instruction starts every 8 ns. Pipelined: one instruction starts every 2 ns.]
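The “do we achieve the ideal speedup?” question has a concrete answer for these numbers (a sketch in Python; 8 ns per sequential instruction, a 2 ns pipeline clock, and 5 stages, per the figure):

```python
def pipelined_time(n, stages, clock):
    """Time to run n instructions on an ideal pipeline: fill + steady state."""
    return (stages + n - 1) * clock

n = 1_000_000
sequential = n * 8                               # ns: one instruction per 8 ns
pipelined = pipelined_time(n, stages=5, clock=2) # ns
print(sequential / pipelined)
# The speedup approaches 4x, not the ideal 5x: the 2 ns clock is set by the
# slowest stage, not by a perfectly balanced 8 ns / 5 = 1.6 ns per stage.
```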
Pipelining
What makes it easy?
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores
What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction
Individual instructions still take the same number of cycles, but we’ve improved the throughput by increasing the number of simultaneously executing instructions.
Structural Hazards
[Figure: four overlapped instructions, each passing through Inst Fetch, Reg Read, ALU, Data Access, Reg Write; with only one memory, a later instruction’s Inst Fetch would collide with an earlier instruction’s Data Access]
Data Hazards
Problem with starting the next instruction before the first is finished: dependencies that “go backward in time” are data hazards.
[Figure: pipeline diagram over clock cycles CC 1–CC 9 (IM, Reg, DM, Reg per instruction) for:
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
Value of register $2: 10 10 10 10 10/-20 -20 -20 -20 -20; the and and or read $2 before sub has written it back]
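These backward-going dependencies can be found with a simple scan (a sketch in Python; the rule that a read within the 2 following instructions is hazardous is my assumption, based on the usual 5-stage convention that the register file writes in the first half of a cycle and reads in the second):

```python
def raw_hazards(program, distance=2):
    """Find read-after-write hazards: a source register read within
    `distance` instructions of the instruction that writes it."""
    hazards = []
    for i, (dest, _srcs) in enumerate(program):
        for j in range(i + 1, min(i + 1 + distance, len(program))):
            if dest in program[j][1]:
                hazards.append((i, j, dest))
    return hazards

# (dest, (sources)) for the slide's sequence; sw writes no register:
prog = [
    ("$2",  ("$1", "$3")),    # sub $2, $1, $3
    ("$12", ("$2", "$5")),    # and $12, $2, $5
    ("$13", ("$6", "$2")),    # or  $13, $6, $2
    ("$14", ("$2", "$2")),    # add $14, $2, $2
    (None,  ("$2", "$15")),   # sw  $15, 100($2)
]
print(raw_hazards(prog))  # [(0, 1, '$2'), (0, 2, '$2')]
```

By this model the and and or hazard on $2, while add (3 slots later) reads the freshly written value within the same cycle, matching the diagram.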
Software Solution
Have the compiler guarantee no hazards.
Where do we insert the “nops”?
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
Problem: this really slows us down!
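A compiler could place the nops mechanically (a sketch in Python; it assumes a read is safe once 3 slots separate it from the write, so 2 nops fix the sub/and pair above; the distance rule is my assumption about this 5-stage pipeline):

```python
def insert_nops(program, distance=2):
    """Pad with nops so every read is more than `distance` slots after
    the instruction that wrote the register it reads."""
    out = []         # list of (dest, srcs); nops carry dest == "nop"
    last_write = {}  # register -> index in `out` of its most recent writer
    for dest, srcs in program:
        # nops needed until all source registers are safely written back:
        need = max(
            (last_write[s] + distance + 1 - len(out) for s in srcs if s in last_write),
            default=0,
        )
        out.extend([("nop", ())] * max(need, 0))
        if dest:
            last_write[dest] = len(out)
        out.append((dest, srcs))
    return out

prog = [
    ("$2",  ("$1", "$3")),   # sub $2, $1, $3
    ("$12", ("$2", "$5")),   # and $12, $2, $5
    ("$13", ("$6", "$2")),   # or  $13, $6, $2
]
padded = insert_nops(prog)
print(sum(1 for d, _ in padded if d == "nop"))  # 2: both go before the and
```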
Forwarding
Use temporary results, don’t wait for them to be written:
- register file forwarding to handle read/write to the same register
- ALU forwarding
[Figure: the same five-instruction sequence, CC 1–CC 9, with forwarding paths from the pipeline registers to the ALU inputs.
Value of register $2: 10 10 10 10 10/-20 -20 -20 -20 -20
Value of EX/MEM:      X  X  X  -20  X  X  X  X  X
Value of MEM/WB:      X  X  X  X  -20  X  X  X  X]
Can't always forward
Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register.
Thus, we need a hazard detection unit to “stall” the instruction.
[Figure: pipeline diagram, CC 1–CC 9, for:
lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
The and needs $2 one cycle before the lw’s data is available, so forwarding alone cannot fix it]
Stalling
We can stall the pipeline by keeping an instruction in the same stage
[Figure: the same sequence with a one-cycle “bubble” inserted: the and is held in its stage for an extra cycle, stretching the diagram to CC 10]
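The stall rule on this slide can be captured in a few lines (a sketch in Python; the one-bubble-per-load-use rule assumes full forwarding everywhere else, as in these slides):

```python
def count_load_use_bubbles(program):
    """Count one-cycle bubbles: instruction i+1 reads the dest of a lw at i."""
    bubbles = 0
    for i in range(len(program) - 1):
        op, dest, _srcs = program[i]
        if op == "lw" and dest in program[i + 1][2]:
            bubbles += 1
    return bubbles

# (op, dest, (sources)) for the slide's sequence:
prog = [
    ("lw",  "$2", ("$1",)),        # lw  $2, 20($1)
    ("and", "$4", ("$2", "$5")),   # and $4, $2, $5  <- needs $2 too early
    ("or",  "$8", ("$2", "$6")),   # or  $8, $2, $6  (forwarded, no stall)
    ("add", "$9", ("$4", "$2")),   # add $9, $4, $2
]
print(count_load_use_bubbles(prog))  # 1
```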
Branch Hazards
When we decide to branch, other instructions are in the pipeline!
We are predicting “branch not taken”; we need to add hardware for flushing instructions if we are wrong.
[Figure: pipeline diagram, CC 1–CC 9, for:
40 beq $1, $3, 7
44 and $12, $2, $5
48 or  $13, $6, $2
52 add $14, $2, $2
72 lw  $4, 50($7)
The three instructions after the beq are already in the pipeline when the branch resolves]
5-Stage miniMIPS

[Figure: the 5-stage miniMIPS datapath again, now with early branch resolution (=, BZ) in the Register File stage, a NOP injection path into IRALU, and clock enables on the IF/RF pipeline registers]
We wanted a simple, clean pipeline, but…
• added CLK EN to freeze the IF/RF stages so we can wait for lw to reach the WB stage
• broke the sequential semantics of the ISA by adding a branch delay slot and early branch resolution logic
• added A/B bypass muxes to get data before it’s written to the regfile
Pipeline Summary (I)
• Started with an unpipelined implementation
– direct execute, 1 cycle/instruction
– it had a long cycle time: mem + regs + alu + mem + wb
• We ended up with a 5-stage pipelined implementation
– increased throughput (3x???)
– delayed branch decision (1 cycle): choose to execute the instruction after the branch
– delayed register writeback (3 cycles): add bypass paths (6 x 2 = 12) to forward the correct value
– memory data available only in the WB stage: introduce NOPs at IRALU to stall the IF and RF stages until the lw result is ready
Pipeline Summary (II)
Fallacy #1: Pipelining is easy.
Smart people get it wrong all of the time!
Fallacy #2: Pipelining is independent of ISA.
Many ISA decisions impact how easy/costly it is to implement pipelining (e.g. branch semantics, addressing modes).
Fallacy #3: Increasing pipeline stages improves performance.
Diminishing returns. Increasing complexity.