Computer Structure 2013 – Pipeline1
Computer Structure
MIPS Pipeline
Lihu Rappoport and Adi Yoaz
Some of the slides were taken from:
(1) Avi Mendelson (2) Randi Katz (3) Patterson
Computer Structure 2013 – Pipeline2
DataAccess
DataAccess
DataAccess
DataAccess
DataAccess
Pipelining Instructions
Ideal speedup is number of stages in the pipeline. Do we achieve this?
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
. ..
InstFetch
Reg ALU Reg
InstFetch
Reg ALU Reg
InstFetch
InstFetch
Reg ALU Reg
InstFetch
Reg ALU Reg
InstFetch
Reg ALU Reg
2 ns
2 ns
2 ns 2 ns 2 ns 2 ns 2 ns
8 ns
8 ns
8 ns
Time
Programexecutionorder
lw R1, 100(R0)
lw R2, 200(R0)
lw R3, 300(R0)
Time
Programexecutionorder
lw R1, 100(R0)
lw R2, 200(R0)
lw R3, 300(R0)
Computer Structure 2013 – Pipeline3
Pipelined Car Assembly
chassis engine finish
1 hour 2 hours 1 hour
Car 1Car 2Car 3
Computer Structure 2013 – Pipeline4
Pipelining Pipelining does not reduce the latency of single task,
it increases the throughput of entire workload Potential speedup = Number of pipe stages
Pipeline rate is limited by the slowest pipeline stageÞ Partition the pipe to many pipe stagesÞ Make the longest pipe stage to be as short as possible Þ Balance the work in the pipe stages
Pipeline adds overhead (e.g., latches) Time to “fill” pipeline and time to “drain” it reduces speedup Stall for dependencies
Þ Too many pipe-stages start to loose performance
IPC of an ideal pipelined machine is 1 Every clock one instruction finishes
Computer Structure 2013 – Pipeline5
Instruction Decode /
register fetch
Instruction fetch
Execute /address
calculation
Memoryaccess
Writeback
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipelined CPU
Computer Structure 2013 – Pipeline6
Pipelined CPU with Control
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX
EX/MEM
MEM/WBIn
stru
ctio
n
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
Co
ntr
ol
WBMEM
WBMEM
WB
EXE
Instruction Decode /
register fetch
Instruction fetch
Execute /address
calculation
Memoryaccess
Writeback
Computer Structure 2013 – Pipeline7
Structural Hazard Attempt to use the same resource two different ways at the same time Register File:
Accessed in 2 stages: Read during stage 2 (ID) Write during stage 5 (WB)
Solution: 2 read ports, 1 write port Memory
Accessed in 2 stages: Instruction Fetch during stage 1 (IF) Data read/write during stage 4 (MEM)
Solution: separate instruction cache and data cache
Each functional unit can only be used once per instruction Each functional unit must be used at the same stage for all
instructions
IM Reg
IM Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
Reg
DM
Computer Structure 2013 – Pipeline8
Pipeline Example: cycle 1
0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction lwPC
4
4
Computer Structure 2013 – Pipeline9
Pipeline Example: cycle 2
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
lw
PC
8
8 4
sub[R1]
9
10
0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
Computer Structure 2013 – Pipeline10
Pipeline Example: cycle 3
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
lw
PC
12
12 8
and[R2]
[R1]+9
sub
4
[R3]
1011
0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
Computer Structure 2013 – Pipeline11
Pipeline Example: cycle 4
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
lw
PC
16
12 8
or[R4]
Data frommemoryaddress[R1]+9
sub
4
[R5]
1112
and
16
10
[R2]-[R3]
0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
Computer Structure 2013 – Pipeline12
Problem with starting next instruction before first is finished dependencies that “go backward in time” are data hazards
Dependencies: RAW Hazard
IM Reg
IM Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
Reg
DM
sub R2, R1, R3
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Programexecutionorder
1 2 3 4 5 6Time (clock cycles)
7 8 910 10 /-20Value of R2 10 10 10 -20 -20 -20 -20
Computer Structure 2013 – Pipeline13
RAW Hazard: HW Solution 1 - Add Stalls
IM bubble bubble bubble bubble
IM bubble bubble bubble bubble
IM bubble bubble bubble bubble
IM Reg
1 2 3 4 5 6Time (clock cycles)
7 8 910 10/-20Value of R2
DM Reg
IM DM RegReg
IM Reg
IM Reg DM Reg
IM DM Reg
Reg
Reg
DM
sub R2, R1, R3
stall
stall
stall
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Programexecutionorder
10 10 10 -20 -20 -20 -20
• Have the hardware detect hazard and add stalls if needed
• Problem: this also slows us down!
Computer Structure 2013 – Pipeline14
Use temporary results, don’t wait for them to be written to the register file register file forwarding to handle read/write to same register ALU forwarding
RAW Hazard: HW Solution 2 - Forwarding
IM Reg
IM Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
Reg
X X X –20 X X X X X
X X X X –20 X X X X
DM
sub R2, R1, R3
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Programexecutionorder
1 2 3 4 5 6Time (clock cycles)
7 8 910 10/–20Value of R2 10 10 10 -20 -20 -20 -20
Value EX/MEM
Value MEM/WB
Computer Structure 2013 – Pipeline15
Forwarding Hardware
Co
ntr
ol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0mux
B
1
2
0mux
A
1
2
DataMemoryA
LU
Register File
Ins
tru
cti
on
Me
mo
ry
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.Rs
IF/ID.Rt
IF/ID.RtIF/ID.Rd
Rs
Rt
Rt
Rd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
EX
/ME
M.R
eg
Wri
te
ME
M/W
B.R
eg
Wri
te
Computer Structure 2013 – Pipeline16
Forwarding Control
EX Hazard: if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1))
ALUSelA = 1
if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) ALUSelB = 1
MEM Hazard: if (MEM/WB.RegWrite and ((not EX/MEM.RegWrite) or (EX/MEM.WriteReg ID/EX.ReadReg1)) and (MEM/WB.WriteReg = ID/EX.ReadReg1)) ALUSelA = 2
if (MEM/WB.RegWrite and ((not EX/MEM.RegWrite) or (EX/MEM.WriteReg ID/EX.ReadReg2)) and (MEM/WB.WriteReg = ID/EX.ReadReg2)) ALUSelB = 2
Computer Structure 2013 – Pipeline17
Forwarding Hardware Example: Bypassing From EX to Src1 and From WB to Src2
lw R11,9(R1) sub R10,R2, R3and R12,R10,R11
Co
ntr
ol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0mux
1
2
0mux
1
2
DataMemoryA
LU
Register File
Ins
tru
cti
on
Me
mo
ry
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.Rs
IF/ID.Rt
IF/ID.RtIF/ID.Rd
Rs
Rt
Rt
Rd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
lw[R10]
Data frommemoryaddress[R1]+9
sub
[R11]
1012
and
11
[R2]-[R3]
1110
Computer Structure 2013 – Pipeline18
Forwarding Hardware Example 2: Bypassing From WB to Src2
sub R10,R2, R3xxxand R12,R10,R11
Co
ntr
ol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0mux
1
2
0mux
1
2
DataMemoryA
LU
Register File
Ins
tru
cti
on
Me
mo
ry
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.Rs
IF/ID.Rt
IF/ID.RtIF/ID.Rd
Rs
Rt
Rt
Rd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
sub[R11]
xxx
[R10]
12
and
10
[R2]-[R3]
1110
Computer Structure 2013 – Pipeline19
Register File Split
Register file is written during first half of the cycle Register file is read during second half of the cycleÞ Register file is written before it is read returns the correct
data
sub R2, R1, R3
xxx
xxx
and R12,R2,R11
IM Reg
IM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
DMReg
Reg
Computer Structure 2013 – Pipeline20
Load word can still causes a hazard: an instruction tries to read a register following a load
instruction that writes to the same register
Þ A hazard detection unit is needed to “stall” the load instruction
Can't always forward
Reg
IM
Reg
Reg
IM
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
DM
1 2 3 4 5 6Time (clock cycles)
7 8 9Programexecutionorder
lw R2, 30(R1)
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Computer Structure 2013 – Pipeline21
De-assert the enable to the IF/ID latch, and to the PC The dependent instruction (and) stays another cycle in IF/ID
Issue a NOP into the ID/EXE latch (instead of the stalled inst.) Allow the stalling instruction (lw) to move on
Reg
IM
Reg
Reg
IM DM
IM Reg DM RegIM
IM DM Reg
IM DM Reg
DM Reg
RegReg
Reg
bubble
1 2 3 4 5 6Time (clock cycles)
7 8 9Programexecutionorder
lw R2, 30(R1)
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Stallingif (ID/EX.RegWrite and (ID/EX.opcode = lw) and
( (ID/EX.WriteReg = IF/ID.ReadReg1) or (ID/EX.WriteReg = IF/ID.ReadReg2) )
then stall
Computer Structure 2013 – Pipeline22
Forwarding + Hazard Detection Unit
Co
ntr
ol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0mux
1
2
0mux
1
2
DataMemoryA
LU
Register File
Ins
tru
cti
on
Me
mo
ry
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.Rs
IF/ID.Rt
IF/ID.RtIF/ID.Rd
Rs
Rt
Rt
Rd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
HazardDetection
Unit
0
ID/EX.MemRead
PC
Wri
te
0mux
1
IF/ID
Wri
te
ID/EX.Rt
Computer Structure 2013 – Pipeline23
Example: code for (assume all variables are in memory):a = b + c;
d = e – f;Slow code
LW Rb,bLW Rc,c
Stall ADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,f
Stall SUB Rd,Re,RfSW d,Rd
Instruction order can be changed as long as the correctness is kept
Software Scheduling to Avoid Load Hazards
Fast code
LW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd
Computer Structure 2013 – Pipeline24
Control Hazards
Computer Structure 2013 – Pipeline25
Executing a BEQ Instruction (i)BEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
0 or 4 beq R4, R5, 27 8 and12 sw16 sub
Calculate branch condition
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
R4
R5 -
27
812
12
beqand
Calculate branch target
Computer Structure 2013 – Pipeline26
Executing a BEQ Instruction (ii)BEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
0 or 4 beq R4, R5, 27 8 and12 sw16 sub
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
InstructionR4-R5=0-
8+SignExt(27)*4
1216
16
beqandsw
Computer Structure 2013 – Pipeline27
Executing a BEQ Instruction (iii)BEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
0 or 4 beq R4, R5, 27 8 and12 sw16 sub
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
8+SignExt(27)*4
16
20 or 116
beqandsw
20
sub
Computer Structure 2013 – Pipeline28
Control Hazard on Branches
And
Beq
sub
sw
The 3 instructions following the branch get into the pipe even if the branch is taken
Inst from target
IM Reg DM Reg
PC
IM Reg DM Reg
PC
IM Reg DM Reg
PC
IM Reg DM Reg
PC
IM Reg DM Reg
PC
Computer Structure 2013 – Pipeline29
Control Hazard: Stall
Stall pipe when branch is encountered until resolved Stall impact: assumptions
CPI = 1 20% of instructions are branches Stall 3 cycles on every branch
CPI new = 1 + 0.2 × 3 = 1.6
(CPI new = CPI Ideal + avg. stall cycles / instr.)
We loose 60% of the performance
Computer Structure 2013 – Pipeline30
Control Hazard: Predict Not Taken
Execute instructions from the fall-through (not-taken) path As if there is no branch If the branch is not-taken (~50%), no penalty is paid
If branch actually taken Flush the fall-through path instructions before they
change the machine state (memory / registers) Fetch the instructions from the correct (taken) path
Assuming ~50% branches not taken on average
CPI new = 1 + (0.2 × 0.5) × 3 = 1.3
Computer Structure 2013 – Pipeline31
Dynamic Branch Prediction
Look up
PC of inst in fetch
?=Branch predicted taken or not taken
No:Inst is not pred to be branch
Yes:Inst is pred to be branch
Branch PC Target PC History
Predicted Target
Add a Branch Target Buffer (BTB) the predicts (at fetch) Instruction is a branch Branch taken / not-taken Taken branch target
Computer Structure 2013 – Pipeline32
BTB Allocation
Allocate instructions identified as branches (after decode) Both conditional and unconditional branches are allocated
Not taken branches need not be allocated BTB miss implicitly predicts not-taken
Prediction BTB lookup is done parallel to IC lookup BTB provides
Indication that the instruction is a branch (BTB hits) Branch predicted target Branch predicted direction Branch predicted type (e.g., conditional, unconditional)
Update (when branch outcome is known) Branch target Branch history (taken / not-taken)
Computer Structure 2013 – Pipeline33
BTB (cont.)
Wrong prediction Predict not-taken, actual taken Predict taken, actual not-taken, or actual taken but wrong
target In case of wrong prediction – flush the pipeline
Reset latches (same as making all instructions to be NOPs) Select the PC source to be from the correct path
Need get the fall-through with the branch Start fetching instruction from correct path
Assuming P% correct prediction rate
CPI new = 1 + (0.2 × (1-P)) × 3 For example, if P=0.7
CPI new = 1 + (0.2 × 0.3) × 3 = 1.18
Computer Structure 2013 – Pipeline34
Adding a BTB to the Pipeline
Shift left 2
ALUSrc
6
ALUresult
Zero
+
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EXEX/MEM MEM
/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4+
IF/ID
PC
0
1
mux
0
1
mux
0mux
1
0
mux
Inst.Memory
Address
Instruction
BTB
1
2
pred target
pred dir
PC+4 (Not-taken target)taken target
3
Mis-predict
DetectionUnit
Flush
predicted targetPC+4 (Not-taken target)
predicted direction
−4
address
target
direction
alloc/u
pd
t
Computer Structure 2013 – Pipeline35
Using The BTBPC moves to next instruction
Inst Mem gets PCand fetches new inst
BTB gets PCand looks it up
IF/ID latch loadedwith new inst
BTB Hit ?
Br taken ?
PC PC + 4PC perd addr
IF
IDIF/ID latch loaded
with pred instIF/ID latch loaded
with seq. instBranch ?
yes no
noyes
noyesEXE
Computer Structure 2013 – Pipeline36
Using The BTB (cont.)ID
EXE
MEM
WB
Branch ?
Calculate brcond & trgt
Flush pipe &update PC
Corect pred ?
yes no
IF/ID latch loadedwith correct inst
continue
Update BTB
yes no
continue
Computer Structure 2013 – Pipeline37
Backup
Computer Structure 2013 – Pipeline38
· R-type (register insts)
· I-type (Load, Store, Branch, inst’s w/imm data)· J-type (Jump)
op: operation of the instructionrs, rt, rd: the source and destination register specifiersshamt: shift amountfunct: selects the variant of the operation in the “op” fieldaddress / immediate: address offset or immediate valuetarget address: target address of the jump instruction
op target address
02631
6 bits 26 bits
op rs rt rd shamt funct
061116212631
6 bits 6 bits5 bits5 bits5 bits5 bits
op rs rt immediate016212631
6 bits 16 bits5 bits5 bits
MIPS Instruction Formats
Computer Structure 2013 – Pipeline39
Each memory location is 8 bit = 1 byte wide has an address
We assume 32 byte address An address space of 232 bytes
Memory stores both instructions and data Each instruction is 32 bit wide
stored in 4 consecutive bytes in memory
Various data types have different width
The Memory Space1 byte
00000000000000010000000200000003000000040000000500000006
FFFFFFFAFFFFFFFBFFFFFFFCFFFFFFFDFFFFFFFEFFFFFFFF
Computer Structure 2013 – Pipeline40
Register FileRegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
RegisterFile
32
32
32
5
5
5
The Register File holds 32 registers Each register is 32 bit wide The RF supports parallel
reading any two registers and writing any register
Inputs Read reg 1/2: #register whose value
will be output on Read data 1/2 RegWrite: write enable
Write reg (relevant when RegWrite=1) #register to which the value in Write data is written to
Write data (relevant when RegWrite=1) data written to Write reg
Outputs Read data 1/2: data read from Read reg 1/2
Computer Structure 2013 – Pipeline41
Memory Components Inputs
Address: address of the memory location we wish to access
Read: read data from location Write: write data into location Write data (relevant when Write=1)
data to be written into specified location Outputs
Read data (relevant when Read=1) data read from specified location
Write
Address
WriteData
ReadDataMemory
Read
32
32
32
Cache Memory components are slow relative to the CPU
A cache is a fast memory which contains only small part of the memory Instruction cache stores parts of the memory space which hold code Data Cache stores parts of the memory space which hold data
Computer Structure 2013 – Pipeline42
The Program Counter (PC)
Holds the address (in memory) of the next instruction to be executed
After each instruction, advanced to point to the next instruction If the current instruction is not a taken branch,
the next instruction resides right after the current instruction
PC PC + 4 If the current instruction is a taken branch,
the next instruction resides at the branch target
PC target (absolute jump)
PC PC + 4 + offset×4 (relative jump)
Computer Structure 2013 – Pipeline43
Fetch Fetch instruction pointed by PC from I-Cache
Decode Decode instruction (generate control signals) Fetch operands from register file
Execute For a memory access: calculate effective address For an ALU operation: execute operation in ALU For a branch: calculate condition and target
Memory Access For load: read data from memory For store: write data into memory
Write Back Write result back to register file update program counter
Instruction Execution Stages
Instruction
Fetch
Instruction
Decode
Execute
Memory
Result
Store
Computer Structure 2013 – Pipeline44
The MIPS CPU
Instruction Decode /
register fetch
Instruction fetch
Execute /address
calculation
Memoryaccess
Writeback
ALUSrcALU
ALU res
Zero
AddShift left 2
0
1
mux
ALUControl
ALUOp
Signextend
16 32
Instruction
4
Add
PC
InstructionCache
Address
Instruction
0
1
mux
MemRead
MemWrite
Address
WriteData
ReadData
DataCache
Branch
PCSrc
MemtoReg
1
0
mux
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[20-16]
[25-21]
0
1
mux [15-11]
[15-0]
[5-0]
Co
ntr
ol
Computer Structure 2013 – Pipeline45
Executing an Add Instruction
ALUSrc=0ALU
ALU res
Zero
AddShift left 2
0
1
mux
ALUControl
ALUOp
Signextend
16 32
Instruction
4
Add
PC
InstructionMemory
Address
Instruction
0
1
mux
MemRead=0
MemWrite=0
Address
WriteData
ReadData
DataMemory
Branch=0
PCSrc=0
MemtoReg=0
1
0
mux
RegDst=1
RegWrite=1
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e [20-16]
[25-21]
0
1
mux [15-11]
[15-0]
[5-0]
R3
R5
35
R3 + R5
+
[PC]+4
2
op rs rt rd shamt funct 061116212631
3 5 2 0 0 = AddALU
Add R2, R3, R5 ; R2 R3+R5
ADD
Computer Structure 2013 – Pipeline46
Executing a Load Instruction
op rs rt immediate 016212631
2 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
ALUSrc=1ALU
ALU res
Zero
AddShift left 2
0
1
mux
ALUControl
ALUOp
Signextend
16 32
Instruction
4
Add
PC
InstructionMemory
Address
Instruction
0
1
mux
MemRead=1
MemWrite=0
Address
WriteData
ReadData
DataMemory
Branch=0
PCSrc=0
MemtoReg=11
0
mux
RegDst=0
RegWrite=1
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e [20-16]
[25-21]
0
1
mux [15-11]
[15-0]
[5-0]
R2
30
2
R2+30+
[PC]+4
1
Mem[R2+30]
LW
Computer Structure 2013 – Pipeline47
Executing a Store Instruction
op rs rt immediate 016212631
2 1 30SW
SW R1, (30)R2 ; Mem[R2+30] R1
ALUSrc = 1ALU
ALU res
Zero
AddShift left 2
0
1
mux
ALUControl
ALUOp
Signextend
16 32
Instruction
4
Add
PC
InstructionMemory
Address
Instruction
0
1
mux
MemRead
MemWrite = 1
Address
WriteData
ReadData
DataMemory
Branch=0
PCSrc = 0
MemtoReg =1
0
mux
RegDst =
RegWrite = 0
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e [20-16]
[25-21]
0
1
mux [15-11]
[15-0]
[5-0]
R2
30
2
R2+30+
[PC]+4
1SW
R1
Computer Structure 2013 – Pipeline48
Executing a BEQ InstructionBEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
op rs rt immediate 016212631
4 5 27BEQ
ALUSrc=0ALU
ALU res
Zero
AddShift left 2
0
1
mux
ALUControl
ALUOp
Signextend
16 32
Instruction
4
Add
PC
InstructionMemory
Address
Instruction
0
1
mux
MemRead=0
MemWrite=0
Address
WriteData
ReadData
DataMemory
Branch=1
PCSrc= (R4 – R5 == 0)
MemtoReg= 1
0
mux
RegDst=
RegWrite=0
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e [20-16]
[25-21]
0
1
mux [15-11]
[15-0]
[5-0]
R4
R5
45
R4-R5-
PC+4
27
ADD
PC+4orPC+4+SignExt(27)*4
Computer Structure 2013 – Pipeline49
Control Signals
add sub ori lw sw beq jump
RegDst
ALUSrc
MemtoReg
RegWrite
MemWrite
Branch
Jump
ALUctr<2:0>
1
0
0
1
0
0
0
Add
1
0
0
1
0
0
0
Subtract
0
1
0
1
0
0
0
Or
0
1
1
1
0
0
0
Add
x
1
x
0
1
0
0
Add
x
0
x
0
0
1
0
Subtract
x
x
x
0
0
x
1
xxx
func
op 00 0000 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010
10 0000 10 0010 Don’t Care
Computer Structure 2013 – Pipeline50
Pipelined CPU: Load (cycle 1 – Fetch)
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction lw
op rs rt immediate 016212631
2 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
PC+4
Computer Structure 2013 – Pipeline51
Pipelined CPU: Load (cycle 2 – Dec)
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
op rs rt immediate 016212631
2 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
PC+4
R2
30
Computer Structure 2013 – Pipeline52
Pipelined CPU: Load (cycle 3 – Exe)
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
op rs rt immediate 016212631
2 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
R2+30
Computer Structure 2013 – Pipeline53
Pipelined CPU: Load (cycle 4 – Mem)
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
op rs rt immediate 016212631
2 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
D
Computer Structure 2013 – Pipeline54
Pipelined CPU: Load (cycle 5 – WB)
ALUSrc
6
ALUresult
Zero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1
Readreg 2
Writereg
Writedata
Readdata 1
Readdata 2
Re
gis
ter
Fil
e
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
op rs rt immediate 016212631
2 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
D
Computer Structure 2013 – Pipeline55
PC
Instructionmemory
Inst
ruct
ion
Add
Instruction[20– 16]
Me
mto
Re
g
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
Writedata
Readdata
Mux
1
ALUcontrol
Shiftleft 2
Re
gWrit
e
MemRead
Control
ALU
Instruction[15– 11]
6
EX
M
WB
M
WB
WBIF/ID
PCSrc
ID/EX
EX/MEM
MEM/WB
Mux
0
1
Me
mW
rite
AddressData
memory
Address
Datapath with Control
Computer Structure 2013 – Pipeline56
Multi-Cycle Control
Pass control signals along just like the data
Execution/Address Calculation stage control lines
Memory access stage control lines
Write-back stage control
lines
InstructionReg Dst
ALU Op1
ALU Op0
ALU Src Branch
Mem Read
Mem Write
Reg write
Mem to Reg
R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X
Control
EX
M
WB
M
WB
WB
IF/ID ID/EX EX/MEM MEM/WB
Instruction
Computer Structure 2013 – Pipeline57
Five Execution Steps
Instruction Fetch Use PC to get instruction and put it in the Instruction Register. Increment the PC by 4 and put the result back in the PC.
IR = Memory[PC];PC = PC + 4;
Instruction Decode and Register Fetch Read registers rs and rt Compute the branch address
A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2);
We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
Computer Structure 2013 – Pipeline58
Five Execution Steps (cont.)
• Execution
ALU is performing one of three functions, based on instruction type: Memory Reference: effective address calculation.
ALUOut = A + sign-extend(IR[15-0]); R-type:
ALUOut = A op B; Branch:
if (A==B) PC = ALUOut;
• Memory Access or R-type instruction completion
• Write-back step
Computer Structure 2013 – Pipeline59
The Store Instruction
sw rt, rs, imm16
mem[PC] Fetch the instruction from memory
Addr <- R[rs] + SignExt(imm16)
Calculate the memory addressMem[Addr] <- R[rt] Store the register into memoryPC <- PC + 4 Calculate the next instruction’s
address
op rs rt immediate
016212631
6 bits 16 bits5 bits5 bits
Mem[Rs + SignExt[imm16]] <- Rt Example: sw rt, rs, imm16
Computer Structure 2013 – Pipeline60
RAW Hazard: SW Solution
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)
CC 7 CC 8 CC 910 10/–20Value of R2
DM Reg
IM DM RegReg
IM Reg
IM Reg DM Reg
IM DM Reg
Reg
Reg
DM
sub R2, R1, R3
NOP
NOP
NOP
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Programexecutionorder
10 10 10 -20 -20 -20 -20
IM Reg
IM Reg DM Reg
IM DM Reg
Reg
Reg
DM
• Have compiler avoid hazards by adding NOP instructions
• Problem: this really slows us down!
Computer Structure 2013 – Pipeline61
Delayed Branch Define branch to take place AFTER n following instruction
HW executes n instructions following the branch regardless of branch is taken or not
SW puts in the n slots following the branch instructions that need to be executed regardless of branch resolution Instructions that are before the branch instruction, or Instructions from the converged path after the branch
If cannot find independent instructions, put NOP
Original Coder3 = 23R4 = R3+R5If (r1==r2) goto xR1 = R4 + R5X: R7 = R1
New CodeIf (r1==r2) goto x r3 = 23R4 = R3 +R5NOPR1 = R4 + R5X: R7 = R1
Computer Structure 2013 – Pipeline62
Delayed Branch Performance
Filling 1 delay slot is easy, 2 is hard, 3 is harder Assuming we can effectively fill d% of the delayed slots
CPInew = 1 + 0.2 × (3 × (1-d))
For example, for d=0.5, we get CPInew = 1.3 Mixing architecture with micro-arch
New generations requires more delay slots Cause computability issues between generations