![Page 1: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/1.jpg)
EEL5708
Lotzi Bölöni
EEL 5708 High Performance Computer Architecture
Pipelining
![Page 2: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/2.jpg)
Acknowledgements
• All the lecture slides were adapted from the slides of David Patterson (1998, 2001) and David E. Culler (2001), Copyright 1998-2002, University of California Berkeley
![Page 3: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/3.jpg)
Pipelining
![Page 4: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/4.jpg)
Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Figure: timeline from 6 PM to midnight; task order A, B, C, D; each load passes through stages of 30, 40, and 20 minutes, one load after another]
![Page 5: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/5.jpg)
Pipelined Laundry: Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
[Figure: timeline from 6 PM; task order A, B, C, D; overlapped loads, with intervals of 30, 40, 40, 40, 40, and 20 minutes]
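The 6-hour and 3.5-hour figures can be checked with a small simulation. This is a minimal sketch, assuming the three stage durations read off the timeline (30, 40, 20 minutes) and assuming a finished load can wait between stages without blocking the stage behind it; the function names are illustrative only.

```python
# Stage durations in minutes, taken from the laundry timeline (30, 40, 20).
STAGES = [30, 40, 20]

def sequential_minutes(loads, stages=STAGES):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)

def pipelined_minutes(loads, stages=STAGES):
    """Each stage handles one load at a time; a load enters a stage as soon
    as both the load and the stage are free (assumes loads may wait between stages)."""
    stage_free = [0] * len(stages)  # time at which each stage next becomes free
    finish = 0
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(t, stage_free[s])
            t = start + duration
            stage_free[s] = t
        finish = t
    return finish

print(sequential_minutes(4) / 60)  # 6.0 hours
print(pipelined_minutes(4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: the 40-minute stage is busy back to back once the first load reaches it.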
![Page 6: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/6.jpg)
Pipelining Lessons
• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = Number pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” pipeline and time to “drain” it reduce speedup
[Figure: pipelined laundry timeline from 6 PM to 9; task order A, B, C, D; intervals of 30, 40, 40, 40, 40, and 20 minutes]
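A rough way to see these lessons in numbers is to compute the speedup as the number of loads grows. The sketch below assumes the laundry stage times (30, 40, 20 minutes) and identical loads; `speedup` is a hypothetical helper, not something defined in the slides.

```python
STAGES = [30, 40, 20]  # minutes per stage, from the laundry timeline

def speedup(loads, stages=STAGES):
    """Sequential time divided by pipelined time for identical loads."""
    sequential = loads * sum(stages)
    # The first load fills the pipeline; afterwards one load completes
    # every max(stages) minutes, because the slowest stage sets the rate.
    pipelined = sum(stages) + (loads - 1) * max(stages)
    return sequential / pipelined

print(round(speedup(4), 2))                      # 1.71 (fill and drain dominate)
print(round(speedup(1_000_000), 2))              # 2.25 (limited by the 40-minute stage)
print(round(speedup(1_000_000, [30, 30, 30]), 2))  # 3.0 (balanced: number of stages)
```

With only four loads the fill and drain time keeps the speedup well below the limit; with unbalanced stages the limit is 90/40 = 2.25 rather than 3.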
![Page 7: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/7.jpg)
Fast, Pipelined Instruction Interpretation
[Figure: instruction interpretation pipeline. Stages: Next Instruction (NI), Instruction Fetch (IF), Decode & Operand Fetch (D), Execute (E), Store Results (W). State between stages: Instruction Address, Instruction Register, Operand Registers, Result Registers; results go to registers or memory. Five instructions are shown overlapped along the time axis.]
![Page 8: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/8.jpg)
Instruction Pipelining
• Execute billions of instructions, so throughput is what matters
– except when?
• What is desirable in instruction sets for pipelining?
– Variable length instructions vs. all instructions same length?
– Memory operands part of any operation vs. memory operands only in loads or stores?
– Register operand many places in instruction format vs. registers located in same place?
![Page 9: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/9.jpg)
Example: MIPS (Note register location)
Register-Register: Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0] (the figure also marks a boundary between bits 6 and 5)
Register-Immediate: Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
Branch: Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
Jump / Call: Op [31:26] | target [25:0]
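Because the register fields sit at the same bit positions in every format, hardware can read the register file before it knows which format it is looking at. A minimal sketch of that idea follows, assuming the bit ranges listed for the four formats; `bits` and `decode` are illustrative helpers and the sample word is made up.

```python
def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode(word):
    """Extract every field at its fixed position, before the opcode is examined."""
    return {
        "op":        bits(word, 31, 26),
        "rs1":       bits(word, 25, 21),  # same place in R-R, R-I, and branch formats
        "rs2":       bits(word, 20, 16),  # holds Rd in the register-immediate format
        "rd":        bits(word, 15, 11),  # register-register format only
        "immediate": bits(word, 15, 0),   # R-I and branch formats
        "target":    bits(word, 25, 0),   # jump / call format only
    }

# Hypothetical register-register word: op=0, rs1=2, rs2=3, rd=1
word = (0 << 26) | (2 << 21) | (3 << 16) | (1 << 11)
fields = decode(word)
print(fields["rs1"], fields["rs2"], fields["rd"])  # 2 3 1
```

All fields are extracted unconditionally; the opcode only decides later which of them are meaningful.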
![Page 10: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/10.jpg)
5 Steps of MIPS Datapath
[Figure: MIPS datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled elements: Next PC, Address, instruction Memory, Adder (+4) producing Next SEQ PC, Inst, Reg File (RS1, RS2, RD), Sign Extend (Immediate), MUXes, ALU, Zero? test, Data Memory, WB Data.]
![Page 11: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/11.jpg)
5 Steps of MIPS Datapath
[Figure: the same datapath divided by pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. Next SEQ PC and RD are carried forward through the pipeline registers; other labeled elements: Next PC, Address, instruction Memory, Adder (+4), Reg File (RS1, RS2), Sign Extend (Imm), MUXes, ALU, Zero? test, Data Memory, WB Data.]
• Data stationary control
– local decode for each instruction phase / pipeline stage
![Page 12: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/12.jpg)
Visualizing Pipelining
Figure 3.3, Page 133, CA:AQA 2e
[Figure: four instructions in program order (vertical axis: Instr. Order), each passing through Ifetch, Reg, ALU, DMem, Reg in successive clock cycles (horizontal axis: Time, labeled Cycle 1 to Cycle 7)]
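The staircase shape of this kind of diagram can be regenerated in a few lines. This is a sketch only, assuming an ideal pipeline with no stalls and the five stage labels used in the figure; `schedule` and `render` are hypothetical helper names.

```python
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]

def schedule(num_instructions, stages=STAGES):
    """rows[i][c] is the stage instruction i occupies in cycle c+1, or '' if none."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in cycle i + s + 1
        rows.append(row)
    return rows

def render(rows):
    """Format the schedule as a text grid, one instruction per line."""
    header = "".join(f"{'C' + str(c + 1):<8}" for c in range(len(rows[0])))
    lines = [" " * 8 + header]
    for i, row in enumerate(rows):
        lines.append(f"{'I' + str(i + 1):<8}" + "".join(f"{cell:<8}" for cell in row))
    return "\n".join(lines)

print(render(schedule(4)))
```

Under these assumptions, n instructions on a k-stage pipeline need n + k - 1 cycles in total, so four instructions finish in cycle 8.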
![Page 13: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/13.jpg)
It's Not That Easy for Computers
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
– Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
– Control hazards: Pipelining of branches & other instructions that change the PC
– Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline
![Page 14: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/14.jpg)
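Speed Up Equation for Pipelining
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}$$
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
For Ideal CPI = 1:
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$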
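To get a feel for how stall cycles erode the ideal speedup, the equation can be evaluated for a few stall rates. The sketch below is one possible encoding; the parameter names and the sample depth of 5 are assumptions for illustration.

```python
def pipeline_speedup(depth, stall_cpi, clock_unpipelined=1.0,
                     clock_pipelined=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine, per the speed up equation."""
    cpi_term = (ideal_cpi * depth) / (ideal_cpi + stall_cpi)
    clock_term = clock_unpipelined / clock_pipelined
    return cpi_term * clock_term

# Hypothetical 5-stage pipeline with equal clock cycles on both machines.
for stall_cpi in (0.0, 0.25, 1.0):
    print(stall_cpi, pipeline_speedup(5, stall_cpi))
# 0.0 -> 5.0, 0.25 -> 4.0, 1.0 -> 2.5
```

With no stalls the speedup equals the pipeline depth; one stall cycle per instruction halves it.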
![Page 15: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/15.jpg)
Structural Hazard Example: Dual-port vs. Single-port
• Machine A: Dual ported memory
• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
$$\text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline Depth}$$
$$\text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline Depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline Depth}$$
$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.75 \times \text{Pipeline Depth}} = 1.33$$
• Machine A is 1.33 times faster
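The arithmetic in this example is easy to reproduce. In the sketch below the pipeline depth of 5 is an arbitrary placeholder (it cancels in the ratio), and each load is assumed to cost Machine B one stall cycle, as in the 0.4 x 1 term.

```python
def speedup(depth, stall_cpi, clock_ratio):
    """clock_ratio = unpipelined cycle time / pipelined cycle time; ideal CPI = 1."""
    return depth / (1 + stall_cpi) * clock_ratio

DEPTH = 5            # placeholder value; it cancels in the comparison
LOAD_FRACTION = 0.4  # loads are 40% of instructions executed

speedup_a = speedup(DEPTH, stall_cpi=0.0, clock_ratio=1.0)
speedup_b = speedup(DEPTH, stall_cpi=LOAD_FRACTION * 1, clock_ratio=1.05)

print(round(speedup_b / DEPTH, 2))      # 0.75
print(round(speedup_a / speedup_b, 2))  # 1.33
```

The 5% faster clock does not make up for stalling on 40% of the instructions.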
![Page 16: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/16.jpg)
Three Generic Data Hazards
InstrI followed by InstrJ
• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
![Page 17: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/17.jpg)
Three Generic Data Hazards
InstrI followed by InstrJ
• Write After Read (WAR): InstrJ tries to write operand before InstrI reads it
– Gets wrong operand
• Can’t happen in our 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
![Page 18: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/18.jpg)
Three Generic Data Hazards
InstrI followed by InstrJ
• Write After Write (WAW): InstrJ tries to write operand before InstrI writes it
– Leaves wrong result (InstrI not InstrJ)
• Can’t happen in our 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
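The three definitions can be expressed as set intersections over the registers each instruction reads and writes. This is a sketch under the assumption that each instruction is described by a (writes, reads) pair of register-name sets; it reports potential hazards only, since whether one actually occurs depends on the pipeline, as noted for WAR and WAW in the 5 stage case.

```python
def data_hazards(instr_i, instr_j):
    """List potential hazards between instr_i and a later instr_j.
    Each instruction is (set of registers written, set of registers read)."""
    writes_i, reads_i = instr_i
    writes_j, reads_j = instr_j
    found = []
    if writes_i & reads_j:
        found.append("RAW")  # J reads something I writes
    if reads_i & writes_j:
        found.append("WAR")  # J writes something I reads
    if writes_i & writes_j:
        found.append("WAW")  # both write the same register
    return found

# Hypothetical pair: ADD R1,R2,R3 followed by SUB R4,R1,R5
add = ({"R1"}, {"R2", "R3"})
sub = ({"R4"}, {"R1", "R5"})
print(data_hazards(add, sub))  # ['RAW']
```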
![Page 19: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/19.jpg)
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.
Slow code:
    LW Rb,b
    LW Rc,c
    ADD Ra,Rb,Rc
    SW a,Ra
    LW Re,e
    LW Rf,f
    SUB Rd,Re,Rf
    SW d,Rd
Fast code:
    LW Rb,b
    LW Rc,c
    LW Re,e
    ADD Ra,Rb,Rc
    LW Rf,f
    SW a,Ra
    SUB Rd,Re,Rf
    SW d,Rd
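One way to compare the two sequences is to count load-use stalls. The sketch below assumes a pipeline in which an instruction that uses a loaded register immediately after the load stalls for one cycle, and that all other results are forwarded in time; the slides do not state the stall length, so treat that as an assumption. The tuple encoding (opcode, destination, sources) is illustrative.

```python
def load_use_stalls(program):
    """Count instructions that read a register loaded by the immediately preceding LW."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "LW" and prev_dest in curr_sources:
            stalls += 1
    return stalls

slow = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
fast = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]

print(load_use_stalls(slow))  # 2
print(load_use_stalls(fast))  # 0
```

In the slow code, ADD follows the load of Rc and SUB follows the load of Rf; the fast code places an independent instruction after each of those loads.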
![Page 20: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/20.jpg)
Control Hazard on Branches: Three Stage Stall
![Page 21: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/21.jpg)
Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
• Two part solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• Branch tests if register = 0 or ≠ 0
• Solution:
– Move Zero test to ID/RF stage
– Adder to calculate new PC in ID/RF stage
– 1 clock cycle penalty for branch versus 3
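The CPI figure follows from adding the average stall cycles per instruction to the base CPI. A short sketch, with a hypothetical helper name, comparing the 3-cycle and 1-cycle penalties:

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Base CPI plus the average branch stall cycles per instruction."""
    return base_cpi + branch_fraction * stall_cycles

print(round(cpi_with_branch_stalls(1, 0.30, 3), 2))  # 1.9
print(round(cpi_with_branch_stalls(1, 0.30, 1), 2))  # 1.3
```

Moving the zero test and target adder into ID/RF cuts the branch overhead from 0.9 to 0.3 cycles per instruction at this branch frequency.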
![Page 22: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/22.jpg)
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% branches taken on average
– But haven’t calculated branch target address
» still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome
![Page 23: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/23.jpg)
Four Branch Hazard Alternatives
#4: Delayed Branch
– Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n
    branch target if taken
  (sequential successors 1 to n form a branch delay of length n)
– 1 slot delay allows proper decision and branch target address in 5 stage pipeline
![Page 24: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/24.jpg)
Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Cancelling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: with 7-8 stage pipelines and multiple instructions issued per clock (superscalar), a single delay slot no longer covers the branch delay
![Page 25: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/25.jpg)
Evaluating Branch Alternatives

| Scheduling scheme | Branch penalty | CPI | Speedup v. unpipelined | Speedup v. stall |
|---|---|---|---|---|
| Stall pipeline | 3 | 1.42 | 3.5 | 1.0 |
| Predict taken | 1 | 1.14 | 4.4 | 1.26 |
| Predict not taken | 1 | 1.09 | 4.5 | 1.29 |
| Delayed branch | 0.5 | 1.07 | 4.6 | 1.31 |

Conditional & Unconditional = 14%, 65% change PC
$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
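The CPI column can be recomputed from the branch statistics. This sketch assumes a pipeline depth of 5 (inferred from the speedup column, not stated on the slide) and assumes that predict-not-taken pays its penalty only on the 65% of branches that change the PC. The slide's speedup figures appear to be rounded differently, so only the CPI values are shown as expected output.

```python
BRANCH_FREQ = 0.14     # conditional and unconditional branches
TAKEN_FRACTION = 0.65  # fraction of branches that change the PC
DEPTH = 5              # assumed pipeline depth

# Average branch stall cycles per instruction for each scheme.
stall_cpi = {
    "Stall pipeline":    BRANCH_FREQ * 3,
    "Predict taken":     BRANCH_FREQ * 1,
    "Predict not taken": BRANCH_FREQ * TAKEN_FRACTION * 1,
    "Delayed branch":    BRANCH_FREQ * 0.5,
}

cpi = {scheme: 1 + stalls for scheme, stalls in stall_cpi.items()}
speedup = {scheme: DEPTH / value for scheme, value in cpi.items()}

for scheme in cpi:
    print(f"{scheme:<18} CPI={cpi[scheme]:.2f} speedup={speedup[scheme]:.2f}")
# CPI values: 1.42, 1.14, 1.09, 1.07
```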
![Page 26: EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f065503460f94c1bd3b/html5/thumbnails/26.jpg)
Pipelining Summary
• Just overlap tasks; easy if tasks are independent
• Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:
$$\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle Unpipelined}}{\text{Clock Cycle Pipelined}}$$
• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW, WAR, WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction