pipelining
![Page 1: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/1.jpg)
![Page 2: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/2.jpg)
Automobile Manufacturing
1. Build frame. 60 min.
2. Add engine. 50 min.
3. Build body. 80 min.
4. Paint. 40 min.
5. Finish. 45 min.
Total: 275 min.
Latency: Time from start to finish for one car. 275 minutes per car. (smaller is better)
Throughput: Number of finished cars per time unit. 1 car/275 min = 0.218 cars/hour. (larger is better)
Issues: How can we make the process better by adding more workers?
6.1
![Page 3: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/3.jpg)
An Assembly line
6.1
[Figure: cars 1-4 moving through the five stages (60, 50, 80, 40, 45 min) along a time axis, with every stage paced at 80 min.]
First two stages can’t produce faster than one car/80 min or a backlog will occur at the third stage.
Last two stages only receive one car/80 min to work on.
Latency: 400 min/car
Throughput: 4 cars/640 min (1 car/160 min). Will approach 1 car/80 min as time goes on.
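The numbers on this slide can be checked with a short Python sketch. The function names are illustrative, and the sketch assumes, as the slide does, that every stage of the line is paced by the slowest one (80 min).

```python
# Stage times in minutes, taken from the slide.
STAGE_MINUTES = [60, 50, 80, 40, 45]

def unpipelined(stages, cars):
    """One car is built start to finish before the next one begins."""
    latency = sum(stages)
    return latency, latency * cars

def assembly_line(stages, cars):
    """Every stage is paced by the slowest stage (assumption from the slide)."""
    beat = max(stages)
    latency = beat * len(stages)
    total = latency + (cars - 1) * beat   # one more car every `beat` minutes
    return latency, total

for cars in (1, 4, 100):
    _, t_single = unpipelined(STAGE_MINUTES, cars)
    latency, t_line = assembly_line(STAGE_MINUTES, cars)
    print(f"{cars:>3} cars: one at a time {t_single} min, "
          f"assembly line {t_line} min "
          f"({t_line / cars:.1f} min/car, latency {latency} min)")
```

For 4 cars the line needs 640 min (160 min/car); for 100 cars the average drops to about 83 min/car, approaching the 80 min limit.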
![Page 4: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/4.jpg)
Applying Assembly Lines to CPUs
• The single-cycle design did everything “at once”
• Can we break the single-cycle design up into stages?
6.1
• Issues:
• Car assembly works well. Will it be as easy to apply the same technique to a CPU?
![Page 5: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/5.jpg)
Breaking up the Single-Cycle Datapath
6.2
[Figure: single-cycle datapath (PC, instruction memory, registers, sign extend, ALU, data memory) divided into five stages.]
Stages from multi-cycle design:
• Instr. Fetch, PC=PC+4
• Instr. Decode, Register Fetch
• Execute, Address Calc.
• Memory
• Reg. Write-back
![Page 6: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/6.jpg)
The Key - Pipeline Registers
6.2
[Figure: the same datapath with clocked pipeline registers between the five stages (Instr. Fetch; Instr. Decode, Register Fetch; Execute, Address Calc.; Memory; Reg. Write-back). Labels: clock, PC+4.]
![Page 7: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/7.jpg)
Example: R-type Instruction
6.2
[Figure: an R-type instruction moving through the pipelined datapath, with PC+4 carried in the pipeline registers.]
Writes the correct data to the wrong register: by the time the result reaches write-back, the write register number at the register file comes from a later instruction that is now being decoded.
In general, arrows that go backwards across pipeline stages may be bad news...
![Page 8: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/8.jpg)
Correcting the Write Register Problem
6.2
[Figure: pipelined datapath in which the write register number, selected from Rt:[20-16] or Rd:[15-11], is passed along through the pipeline registers and returned to the register file from the write-back stage.]
![Page 9: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/9.jpg)
Assembly-line Control Signals
In an assembly line, the manufacturing instructions can be attached to the car. The instructions then move along with the car.
[Figure: five cars on the line, each tagged with the instructions for its remaining stages (F = frame, E = engine, B = body, P = paint, F = finish):]
• F: Standard, E: 135 HP, B: 2-door, P: Green, F: Leather
• E: 190 HP, B: 4-door, P: Blue, F: Cotton
• B: 2-door, P: Lavender, F: Leather
• P: Green, F: Vinyl
• F: Leather
By separating the control signals by stages, only the signals needed for the current stage must be decoded.
All signals for later stages must be passed along.
6.1
![Page 10: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/10.jpg)
The Pipelined Control Logic
6.3
[Figure: pipelined datapath with a Control unit that decodes Op:[31-26]. The control signals are stored in the pipeline registers in groups (EX, M, W), and each group is dropped once its stage has used it. Signals shown: RegDest, ALUOp, ALUSrc, ALU control, Branch, PCSrc, MemRead, MemWrite, MemToReg, RegWrite.]
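A minimal Python sketch of control signals travelling with the instruction, in the spirit of the tags attached to each car. The grouping of signals into EX, M, and WB and the `decode` table are assumptions based on the usual textbook split; the figure on the slide may differ in detail, and the function names are illustrative.

```python
# Assumed grouping of control signals by the stage that consumes them.
EX_SIGNALS = ("RegDest", "ALUOp", "ALUSrc")
M_SIGNALS = ("Branch", "MemRead", "MemWrite")
WB_SIGNALS = ("RegWrite", "MemToReg")

def decode(kind):
    """Hypothetical decoder: all control signals for one instruction kind."""
    table = {
        "R-type": dict(RegDest=1, ALUOp=0b10, ALUSrc=0, Branch=0,
                       MemRead=0, MemWrite=0, RegWrite=1, MemToReg=0),
        "lw": dict(RegDest=0, ALUOp=0b00, ALUSrc=1, Branch=0,
                   MemRead=1, MemWrite=0, RegWrite=1, MemToReg=1),
    }
    return table[kind]

def pass_along(signals):
    """Yield (stage, used_here, still_carried) as the instruction advances."""
    remaining = dict(signals)
    for stage, names in (("EX", EX_SIGNALS), ("M", M_SIGNALS), ("WB", WB_SIGNALS)):
        used = {name: remaining.pop(name) for name in names}
        yield stage, used, dict(remaining)

for stage, used, carried in pass_along(decode("lw")):
    print(stage, "uses", used, "| passes on", sorted(carried))
```

Each stage removes only its own group, so the bundle stored in the pipeline registers shrinks as the instruction moves toward write-back.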
![Page 11: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/11.jpg)
How’d we do?
• Compared to Single-cycle
• 5 stages --> Potentially 5x speedup
  • Not likely
  • Stages won’t all be equally long
  • Pipeline registers will cause some delays
• Latency --> Greater than in single-cycle design
• More complexity, but nicely divided up
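The effect of unequal stages and pipeline-register delay can be estimated with a small Python sketch. The stage delays and the register overhead below are hypothetical numbers chosen only for illustration, and the model ignores hazards.

```python
def pipeline_speedup(stage_ns, register_ns, instructions):
    """Speedup of a pipelined design over a single-cycle design (no hazards)."""
    single_cycle = sum(stage_ns)               # one long cycle covers every stage
    pipe_cycle = max(stage_ns) + register_ns   # slowest stage plus register delay
    t_single = instructions * single_cycle
    t_pipe = (len(stage_ns) + instructions - 1) * pipe_cycle
    return t_single / t_pipe

# Hypothetical delays in ns for IF, RF, EX, M, WB and for one pipeline register.
stages = [40, 20, 40, 40, 20]
for n in (1, 10, 1_000_000):
    print(n, round(pipeline_speedup(stages, 5, n), 2))
```

With these numbers a single instruction is slower on the pipeline (greater latency), and the long-run speedup levels off near 160/45, about 3.6x rather than 5x.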
![Page 12: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/12.jpg)
Example 1
• Consider executing the following code
add $3, $4, $5
and $6, $7, $8
sub $9, $10, $11
on
i) A single-cycle machine with a cycle time of 200 ns
ii) A 5-stage pipeline machine with a cycle time of 50 ns
Which one runs faster?
What if there were 100 instructions instead of 3?
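One way to work Example 1 is the following Python sketch. It assumes the pipeline never stalls (the three instructions above use different registers, and the 100-instruction case is assumed to be hazard-free as well); the function names are illustrative.

```python
def single_cycle_ns(n, cycle_ns=200):
    """Each instruction takes exactly one long cycle."""
    return n * cycle_ns

def pipelined_ns(n, stages=5, cycle_ns=50):
    """The first instruction needs `stages` cycles; each later one
    finishes one cycle after its predecessor (no stalls assumed)."""
    return (stages + n - 1) * cycle_ns

for n in (3, 100):
    print(f"{n} instructions: single-cycle {single_cycle_ns(n)} ns, "
          f"pipelined {pipelined_ns(n)} ns")
```

Under these assumptions the pipeline takes 350 ns against 600 ns for 3 instructions, and 5200 ns against 20000 ns for 100 instructions.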
![Page 13: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/13.jpg)
![Page 14: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/14.jpg)
Analyzing Pipelines
6.4
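ADD $10, $14, $0
SUB $12, $13, $2
AND $1, $6, $11
SW $3, 200($9)
OR $9, $13, $7
Pipeline diagram (one column per clock cycle; -- marks a cycle before the instruction is fetched):
ADD: IF RF EX M WB
SUB: -- IF RF EX M WB
AND: -- -- IF RF EX M WB
SW: -- -- -- IF RF EX M WB
OR: -- -- -- -- IF RF EX M WB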
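A diagram like the one on this slide can be generated with a few lines of Python. This is a sketch for an ideal pipeline with no stalls; `diagram` and `total_cycles` are illustrative names.

```python
STAGES = ["IF", "RF", "EX", "M", "WB"]

def diagram(names):
    """One row per instruction; '--' marks cycles before it is fetched."""
    rows = []
    for i, name in enumerate(names):
        cells = ["--"] * i + STAGES
        rows.append(f"{name:<4}" + " ".join(f"{c:<2}" for c in cells))
    return rows

def total_cycles(n, stages=len(STAGES)):
    """Cycles needed for n instructions when nothing stalls."""
    return stages + n - 1

for row in diagram(["ADD", "SUB", "AND", "SW", "OR"]):
    print(row)
print("cycles:", total_cycles(5))
```

Five independent instructions finish in 9 cycles instead of the 25 that five unpipelined five-step instructions would need.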
![Page 15: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/15.jpg)
Data Hazards
6.4
ADD $13, $14, $0 (writes register $13)
SUB $12, $13, $2 (reads wrong $13)
AND $1, $6, $13 (reads wrong $13)
SW $3, 200($13) (reads ? $13: its RF falls in the same cycle as the ADD’s WB)
OR $9, $13, $7 (reads correct $13)
Pipeline diagram (one column per clock cycle; -- marks a cycle before the instruction is fetched):
ADD: IF RF EX M WB
SUB: -- IF RF EX M WB
AND: -- -- IF RF EX M WB
SW: -- -- -- IF RF EX M WB
OR: -- -- -- -- IF RF EX M WB
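The wrong / ? / correct labels follow from comparing the cycle in which each instruction reads its registers (RF) with the cycle in which the ADD writes $13 (WB). A Python sketch, assuming an ideal pipeline where instruction i is fetched in cycle i + 1; the tuple encoding of the program is an illustration, not part of the slides.

```python
# (name, destination register or None, source registers) for the code above.
PROGRAM = [
    ("ADD", 13, (14, 0)),
    ("SUB", 12, (13, 2)),
    ("AND", 1, (6, 13)),
    ("SW", None, (3, 13)),
    ("OR", 9, (13, 7)),
]

def classify_reads(program, reg):
    """Label every later read of `reg` as wrong, ?, or correct."""
    writer = next(i for i, (_, dest, _) in enumerate(program) if dest == reg)
    wb_cycle = writer + 5              # fetched in cycle writer + 1, WB 4 later
    result = {}
    for i, (name, _, srcs) in enumerate(program):
        if i <= writer or reg not in srcs:
            continue
        rf_cycle = i + 2               # fetched in cycle i + 1, RF one later
        if rf_cycle < wb_cycle:
            result[name] = "wrong"     # reads before the new value is written
        elif rf_cycle == wb_cycle:
            result[name] = "?"         # same cycle: depends on the register file
        else:
            result[name] = "correct"
    return result

print(classify_reads(PROGRAM, 13))
```

The SW lands on "?" because whether a register can be written and read in the same cycle is a property of the register file design.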
![Page 16: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/16.jpg)
Preventing Data Hazards
6.4
ADD $13, $14, $0
NOP
NOP
NOP
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7
Insert NOPs into the instruction stream to allow WB to happen before RF.
Assume we can’t write a register and read the new value in the same cycle.
Pipeline diagram through cycle 9 (the three NOPs are fetched in cycles 2-4; -- marks a cycle before the instruction is fetched):
ADD: IF RF EX M WB
SUB: -- -- -- -- IF RF EX M WB
AND: -- -- -- -- -- IF RF EX M
SW: -- -- -- -- -- -- IF RF EX
OR: -- -- -- -- -- -- -- IF RF
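A sketch of how a compiler or assembler pass might insert the NOPs. It assumes, as stated on the slide, that a register cannot be written and read in the same cycle, so a reader must sit at least 4 instruction slots after its writer; `insert_nops` and the tuple encoding are illustrative.

```python
PROGRAM = [
    ("ADD", 13, (14, 0)),
    ("SUB", 12, (13, 2)),
    ("AND", 1, (6, 13)),
    ("SW", None, (3, 13)),
    ("OR", 9, (13, 7)),
]

def insert_nops(program, gap=4):
    """Pad with NOPs so every reader is at least `gap` slots after its writer."""
    out = []
    last_write = {}                    # register -> position of latest writer
    for name, dest, srcs in program:
        need = 0
        for reg in srcs:
            if reg in last_write:
                need = max(need, last_write[reg] + gap - len(out))
        out.extend([("NOP", None, ())] * need)
        out.append((name, dest, srcs))
        if dest not in (None, 0):      # $0 is hardwired to zero in MIPS
            last_write[dest] = len(out) - 1
    return out

print([name for name, _, _ in insert_nops(PROGRAM)])
```

For the code on the slide this yields ADD, three NOPs, then SUB, AND, SW, OR; the later readers of $13 are already far enough behind the ADD.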
![Page 17: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/17.jpg)
Detecting Hazards
6.5
ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7
Check each instruction as it is being decoded (RF-ID stage). If it reads a register that will be written by any instruction ahead of it (in the EX, M, or WB stages), there is a hazard.
• Compare write reg # in EX with read reg # in RF
• Compare write reg # in M with read reg # in RF
• Compare write reg # in WB with read reg # in RF
[Figure: pipeline diagram for the five instructions, annotated with the ADD’s write of $13 and the later instructions’ reads of $13 (Read A / Read B).]
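The three comparisons can be sketched as one Python function. The argument layout is an assumption made for illustration: the read registers of the instruction in RF, and the write register numbers of the instructions currently in EX, M, and WB.

```python
def has_hazard(read_regs, write_regs_ahead):
    """read_regs: registers read by the instruction in the RF stage.
    write_regs_ahead: write register numbers of the instructions now in
    EX, M and WB (None where the instruction writes no register)."""
    pending = {r for r in write_regs_ahead if r is not None and r != 0}
    return any(r in pending for r in read_regs)

# SUB $12, $13, $2 in RF while ADD $13, ... is in EX: hazard.
print(has_hazard((13, 2), (13, None, None)))
# OR $9, $13, $7 in RF with SW in EX, AND ($1) in M, SUB ($12) in WB: no hazard.
print(has_hazard((13, 7), (None, 1, 12)))
```

Register $0 is excluded because it always reads as zero in MIPS, so a write to it never creates a dependence.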
![Page 18: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/18.jpg)
Stalling with Bubbles
6.5
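ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7
Stalling:
• Kill the current execution by “neutralizing” all the control signals so that it won’t write any registers.
• Don’t write PC+4 into PC --> Stay at the current instruction and try again.
Pipeline diagram through cycle 9 (the first three fetches of SUB are killed; -- marks a cycle before the instruction is fetched):
ADD: IF RF EX M WB
SUB: -- IF IF IF IF RF EX M WB
AND: -- -- -- -- -- IF RF EX M
SW: -- -- -- -- -- -- IF RF EX
OR: -- -- -- -- -- -- -- IF RF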
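The two stalling actions can be sketched as a single clock-edge update in Python. The control-signal names and the per-cycle hazard flags are illustrative assumptions; real hardware derives the hazard flag from register-number comparators.

```python
def next_state(pc, controls, hazard):
    """Return (next_pc, controls_sent_down_the_pipe) for one clock edge."""
    if hazard:
        bubble = {name: 0 for name in controls}   # writes no register or memory
        return pc, bubble                          # PC not updated: try again
    return pc + 4, controls

# ADD at address 0, SUB at address 4; the hazard is assumed to last 3 cycles.
controls = {"RegWrite": 1, "MemWrite": 0}
hazard_per_cycle = [False, True, True, True, False]
pc, fetched = 0, []
for hazard in hazard_per_cycle:
    fetched.append(pc)
    pc, _ = next_state(pc, controls, hazard)
print(fetched)
```

The fetch trace is 0, 4, 4, 4, 4: the SUB is fetched in cycles 2 through 5 and only the last attempt proceeds.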
![Page 19: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/19.jpg)
Register Forwarding
6.6
ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $2
Register $13’s value is computed in the EX stage of the ADD even though it isn’t written in the register until the WB stage.
--> The pipeline register following the EX stage holds the value of $13 that’s needed in the SUB instruction’s EX stage.
Pipeline diagram (one column per clock cycle; -- marks a cycle before the instruction is fetched):
ADD: IF RF EX M WB
SUB: -- IF RF EX M WB
AND: -- -- IF RF EX M WB
SW: -- -- -- IF RF EX M WB
OR: -- -- -- -- IF RF EX M WB
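A sketch of the selection a forwarding unit makes for one EX-stage operand, following the common textbook formulation rather than anything shown on the slide. The labels `EX/MEM` and `MEM/WB` name the pipeline registers after the EX and M stages. Values older than those two are assumed to be in the register file already, which for an instruction three slots behind its producer requires either a register file that can be written and read in the same cycle or one more forwarding path.

```python
def forward_source(read_reg, ex_mem, mem_wb):
    """Pick where an EX-stage operand comes from.
    ex_mem, mem_wb: (reg_write, dest_reg) of the instructions one and two
    slots ahead, i.e. what the pipeline registers after EX and M hold."""
    for label, (reg_write, dest) in (("EX/MEM", ex_mem), ("MEM/WB", mem_wb)):
        if reg_write and dest != 0 and dest == read_reg:
            return label               # the newest matching value wins
    return "register file"

# SUB in EX reads $13 while the ADD ($13) is one slot ahead.
print(forward_source(13, ex_mem=(True, 13), mem_wb=(False, 0)))
# AND in EX reads $13: SUB ($12) is one slot ahead, ADD ($13) two slots ahead.
print(forward_source(13, ex_mem=(True, 12), mem_wb=(True, 13)))
```

Checking `EX/MEM` first matters when both stages hold a value for the same register: the more recent result is the one the program expects.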
![Page 20: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/20.jpg)
Unforwardable Loads
6.6
LW $2, 30($2)
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $1
[Figure: pipeline diagram for LW, AND, SW, and OR, with the AND shown both in its original slot and in a later slot.]
Loads don’t produce the value to write back until the Memory stage. This is one stage too late for the next instruction. --> We can’t prevent stalls if the instruction following a Load uses the result of the Load.
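The load-use case that forwarding cannot cover can be detected with a check like the following Python sketch; the tuple encoding and the function name are illustrative, and the check treats every source register of the following instruction the same way.

```python
def load_use_stall(prev, curr):
    """prev, curr: (opcode, dest_reg, source_regs) of adjacent instructions.
    True when `curr` needs the value that `prev` is still loading."""
    opcode, dest, _ = prev
    return opcode == "LW" and dest != 0 and dest in curr[2]

lw = ("LW", 2, (2,))
and_ = ("AND", 1, (2, 13))
sw = ("SW", None, (3, 2))
print(load_use_stall(lw, and_))   # AND uses $2 right after the load
print(load_use_stall(and_, sw))   # AND is not a load
```

Only the instruction immediately after the load is affected; one slot later the loaded value can be forwarded from the pipeline register after the Memory stage.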
![Page 21: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/21.jpg)
Example 2
• Consider executing the following code on a 5-stage pipeline datapath
add $3, $4, $5
lw $7, 100($3)
sub $8, $7, $9
1. Identify any potential data dependencies
2. How many cycles will it take to execute this code assuming no register forwarding?
3. How many cycles will it take to execute this code assuming register forwarding is available?
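One way to check answers to questions 2 and 3 is a small Python scheduler. It is a sketch under stated assumptions: without forwarding, a reader's RF stage must come after the writer's WB stage (a register cannot be written and read in the same cycle, as assumed earlier in these slides); with forwarding, an ALU result is usable by the next instruction and a loaded value one cycle later. The names and the tuple encoding are illustrative.

```python
LOADS = {"lw"}

def count_cycles(program, forwarding, no_forward_gap=4):
    """program: list of (opcode, dest_reg, source_regs). Returns total cycles."""
    fetch = []                        # cycle of each instruction's final fetch
    for i, (_, _, srcs) in enumerate(program):
        start = fetch[-1] + 1 if fetch else 1
        for j in range(i):
            opcode, dest, _ = program[j]
            if dest in (None, 0) or dest not in srcs:
                continue
            if not forwarding:
                gap = no_forward_gap  # reader's RF must follow writer's WB
            elif opcode in LOADS:
                gap = 2               # value exists only after the M stage
            else:
                gap = 1               # ALU result forwarded straight into EX
            start = max(start, fetch[j] + gap)
        fetch.append(start)
    return fetch[-1] + 4              # last instruction still needs RF, EX, M, WB

EXAMPLE_2 = [("add", 3, (4, 5)), ("lw", 7, (3,)), ("sub", 8, (7, 9))]
print(count_cycles(EXAMPLE_2, forwarding=False))
print(count_cycles(EXAMPLE_2, forwarding=True))
```

Under these assumptions the code takes 13 cycles without forwarding and 8 with it. If the register file could be written and read in the same cycle, `no_forward_gap=3` would apply and the first figure would be 11.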
![Page 22: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/22.jpg)
Branch Hazards
6.7
BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)
Pipeline diagram (one column per clock cycle; -- marks a cycle before the instruction is fetched):
BEQ: IF RF EX M WB
AND: -- IF RF EX M WB
SW: -- -- IF RF EX M WB
OR: -- -- -- IF RF EX M WB
LW: -- -- -- -- IF RF EX M WB
Don’t know result of branch until the end of the M stage.
If the branch is taken, we’ve blown it by executing the intervening instructions.
![Page 23: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/23.jpg)
Solution 1: Stall
6.5
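BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)
Pipeline diagram through cycle 9, branch not taken (the first three fetches of AND are killed; -- marks a cycle before the instruction is fetched):
BEQ: IF RF EX M WB
AND: -- IF IF IF IF RF EX M WB
SW: -- -- -- -- -- IF RF EX M
OR: -- -- -- -- -- -- IF RF EX
ADD: -- -- -- -- -- -- -- IF RF
Stalling always solves the problem. If we didn’t have so many branches in programs, it would not be a problem.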
![Page 24: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/24.jpg)
Solution 2: Assume not Taken
6.7
BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)
Pipeline diagram when the branch is taken (-- marks a cycle before the instruction is fetched):
BEQ: IF RF EX M WB
AND: -- IF RF EX M WB (must be undone if branch is taken!)
SW: -- -- IF RF EX M WB (must be undone if branch is taken!)
OR: -- -- -- IF RF EX M WB (must be undone if branch is taken!)
LW: -- -- -- -- IF RF EX M WB
If we guess right, we win --> No stall at all
If we guessed wrong,
1. We have to undo all that we did (fortunately, no writebacks have occurred yet).
2. We still take all the time of a stall
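The cost of always stalling versus assuming not taken can be compared with a short Python sketch. The 3-cycle penalty follows from resolving the branch at the end of the M stage; the branch frequency and taken rate below are hypothetical numbers, and the model ignores data hazards.

```python
def lost_cycles(taken, penalty=3):
    """Cycles lost on one branch: (always stall, assume not taken)."""
    always_stall = penalty
    assume_not_taken = penalty if taken else 0
    return always_stall, assume_not_taken

def average_cpi(branch_fraction, taken_fraction, penalty=3):
    """Average cycles per instruction for the two strategies."""
    stall = 1 + branch_fraction * penalty
    predict = 1 + branch_fraction * taken_fraction * penalty
    return stall, predict

print(lost_cycles(taken=False), lost_cycles(taken=True))
# Hypothetical mix: 20% of instructions are branches, 60% of them taken.
print(average_cpi(0.2, 0.6))
```

With that mix the CPI is about 1.6 when every branch stalls and about 1.36 when the pipeline guesses not taken, because only the taken branches pay the penalty.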
![Page 25: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/25.jpg)
6.7
Solution 3: Better Prediction
• Predict that the branch goes the same way as the last time
• Works great for loops
• Works great for “special-case” code
• Need to keep track of the information for each branch, though...
• One or two bits will do
• Keep a small table of recently used branches and which way they went
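A minimal Python sketch of the one-bit and two-bit schemes for a single branch. The outcome pattern is hypothetical (a loop branch taken 9 times and then not taken, with the loop entered 5 times); a real predictor would keep one such state per table entry, indexed by branch address.

```python
def one_bit(outcomes, start=False):
    """Predict the same direction as last time; return mispredictions."""
    last, misses = start, 0
    for taken in outcomes:
        misses += (last != taken)
        last = taken
    return misses

def two_bit(outcomes, start=0):
    """Saturating counter 0..3; predict taken when the counter is 2 or 3."""
    counter, misses = start, 0
    for taken in outcomes:
        misses += ((counter >= 2) != taken)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses

pattern = ([True] * 9 + [False]) * 5
print(one_bit(pattern), two_bit(pattern))
```

The one-bit scheme misses twice per loop (on exit and again on re-entry), giving 10 mispredictions; the two-bit counter misses twice while warming up and then once per loop exit, giving 7.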
![Page 26: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/26.jpg)
Solution 4: Delayed Branches
6.7
Original code:
XOR $1, $3, $3
ADD $2, $3, $4
SUB $4, $3, $1
OR $3, $2, $0
BEQ $10, $11, SKIP
LW $4, 60($2)
SKIP: AND $1, $2, $3
If we had some warning, we could compute the branch ahead of time...
With a delayed branch:
XOR $1, $3, $3
Branch-After-Three-EQ $10, $11, SKIP
ADD $2, $3, $4
SUB $4, $3, $1
OR $3, $2, $0
LW $4, 60($2)
SKIP: AND $1, $2, $3
3 delay slots (ADD, SUB, OR): These instructions are always executed. Branch can’t depend on them...
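Whether a branch can be moved ahead of the instructions that will fill its delay slots comes down to a register check, sketched here in Python. The function name and the tuple encoding are illustrative, and the check covers only the "branch can't depend on them" condition.

```python
def can_hoist_branch(branch_srcs, slot_instructions):
    """True if no instruction placed in a delay slot writes a register
    that the branch compares."""
    written = {dest for _, dest, _ in slot_instructions if dest not in (None, 0)}
    return written.isdisjoint(branch_srcs)

# (name, destination register, source registers) for the three delay slots.
slots = [("ADD", 2, (3, 4)), ("SUB", 4, (3, 1)), ("OR", 3, (2, 0))]
print(can_hoist_branch((10, 11), slots))   # BEQ $10, $11, SKIP from the slide
print(can_hoist_branch((3, 11), slots))    # hypothetical branch on $3
```

The slide's branch compares $10 and $11, which none of ADD, SUB, or OR writes, so it can be hoisted; a branch that compared $3 could not, because the OR writes $3.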
![Page 27: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/27.jpg)
3-slot Delayed Branch
6.7
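Branch-After-Three-EQ $10, $11, SKIP
ADD $2, $3, $4
SUB $4, $3, $1
OR $3, $2, $0
LW $4, 60($2)
SKIP: AND $1, $2, $3
Pipeline diagram (B3E = Branch-After-Three-EQ; -- marks a cycle before the instruction is fetched):
B3E: IF RF EX M WB
ADD: -- IF RF EX M WB
SUB: -- -- IF RF EX M WB
OR: -- -- -- IF RF EX M WB
LW or AND: -- -- -- -- IF RF EX M WB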
![Page 28: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/28.jpg)
Branch summary
• Two decent solutions:
• Branch prediction
  • Requires more hardware
  • Used in modern microprocessors
• Delayed branch
  • Requires special software manipulation
  • Often doesn’t deliver its promise
  • Used often in CPUs 4-10 years ago
![Page 29: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/29.jpg)
Example 3
• Consider executing the following code
LOOP: add $3, $4, $5
and $6, $7, $8
bne $12, $8, LOOP
on
i) A single-cycle machine with a cycle time of 200 ns
ii) A 5-stage pipeline machine with a cycle time of 50 ns
A. Assume the loop executes 10 times
B. Assume the loop executes 100 times
C. Assume the loop executes 1000 times
Which one runs faster?
![Page 30: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/30.jpg)
Example 4
• Consider executing the following code on a 5-stage pipeline datapath
addi $3, $0, 10
LOOPSTART: lw $5, ARRAY($3)
addi $5, $5, 1
sw $5, ARRAY
addi $3, $3, -1
bne $3, $0, LOOPSTART
add $3, $5, $6
sub $7, $8, $9
addi $4, $6, 3
1. Identify potential data dependencies
2. How many cycles will it take to execute this code?
A. With nops/stalls
B. With branch prediction assuming branch not taken
C. With branch prediction based on one previous result
![Page 31: Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062517/56813098550346895d9676ff/html5/thumbnails/31.jpg)
Example 5
• For the following code:
addi $t0, $0, 10
addi $t1, $0, 0
LOOP: lw $t2, ARRAY1($t1)
lw $t3, ARRAY2($t1)
add $t4, $t2, $t3
sw $t4, ARRAY3($t1)
addi $t1, $t1, 4
addi $t0, $t0, -1
bne $t0, $0, LOOP
addi $v0, $0, 10
syscall
A. Calculate the execution time in a single-cycle datapath with a cycle time of 20 ns
B. Identify data dependencies (pipeline hazards)
C. Calculate the execution time in a 5-stage pipeline datapath with no branch prediction or register forwarding
D. Calculate the execution time in a 5-stage pipeline datapath with register forwarding and no branch prediction
E. Calculate the execution time in a 5-stage pipeline datapath with register forwarding and assuming branch not taken
F. Calculate the execution time in a 5-stage pipeline datapath with register forwarding and branch prediction based on one previous result
G. Calculate the CPI and MIPS for C, D, E and F