Multi-cycle Implementations
Arvind
Computer Science & Artificial Intelligence Lab., Massachusetts Institute of Technology
January 13, 2012
http://csg.csail.mit.edu/SNU
Harvard-Style Datapath for MIPS
[Figure: Harvard-style datapath. Blocks: PC, Inst. Memory, GPRs, Imm Ext, ALU, ALU Control, Data Memory, and two adders (PC + 0x4 and branch target). Control signals: OpCode, zero?, RegWrite, RegDst, ExtSel, BSrc, OpSel, MemWrite, WBSrc, PCSrc (br, rind, jabs, pc+4).]
What problem arises if instructions and data reside in the same memory?
Structural hazard: at least the instruction fetch and a Load (or Store) cannot be executed in the same cycle.
Princeton Microarchitecture: Datapath & Control
[Figure: Princeton datapath with a single memory. Blocks: PC, IR, GPRs, Imm Ext, ALU, ALU Control, Memory, and two adders. Control signals: OpCode, zero?, RegWrite, RegDst, ExtSel, BSrc, OpSel, MemWrite, WBSrc, PCSrc, plus the new signals PCen, IRen and AddrSrc. Fetch-phase annotation: IRen is on, the other enables are off, and AddrSrc = PC.]
Two-State Controller: Princeton Architecture
fetch phase: AddrSrc=PC, IRen=on, PCen=off, Wen=off
execute phase: AddrSrc=ALU, IRen=off, PCen=on, Wen=on
A flip-flop can be used to remember the phase
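The two-state controller above can be sketched as a tiny behavioural model. This is only an illustration in Python, not the hardware from the slides: the names `Controls`, `controls`, `next_phase` and `run` are hypothetical, and the single `w_en` field stands in for the slide's Wen.

```python
from dataclasses import dataclass

FETCH, EXECUTE = 0, 1  # the two values of the 1-bit phase flip-flop


@dataclass(frozen=True)
class Controls:
    addr_src: str  # which source drives the memory address: "PC" or "ALU"
    ir_en: bool    # IRen: load the fetched instruction into IR
    pc_en: bool    # PCen: update the PC
    w_en: bool     # Wen: allow register-file / memory writes


def controls(phase: int) -> Controls:
    """Control settings for a phase, copied from the slide's state diagram."""
    if phase == FETCH:
        return Controls(addr_src="PC", ir_en=True, pc_en=False, w_en=False)
    return Controls(addr_src="ALU", ir_en=False, pc_en=True, w_en=True)


def next_phase(phase: int) -> int:
    """The flip-flop toggles every cycle: fetch -> execute -> fetch ..."""
    return phase ^ 1


def run(cycles: int, phase: int = FETCH) -> list:
    """Return the control settings seen over a number of clock cycles."""
    trace = []
    for _ in range(cycles):
        trace.append(controls(phase))
        phase = next_phase(phase)
    return trace
```

Calling `run(4)` yields fetch, execute, fetch, execute settings in turn, which is why every instruction takes two cycles under this controller.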
Hardwired Controller: Princeton Architecture
[Figure: the old combinational logic (Harvard) takes the op code from IR and zero? and produces ExtSel, BSrc, OpSel, WBSrc, RegDest, PCsrc1, PCsrc2, RegWrite and MemWrite. New combinational logic, driven by a 1-bit toggle flip-flop S (I-fetch / Execute), produces PCen, IRen, AddrSrc and Wen.]
Two-Cycle SMIPS
[Figure: two-cycle SMIPS. PC (+4), Inst Memory, the ir register, Decode, Register File, Execute, Data Memory, and the stage register.]
Two-Cycle SMIPS

module mkProc(Proc);
  Reg#(Addr)        pc    <- mkRegU;
  RFile             rf    <- mkRFile;
  Memory            mem   <- mkMemory;
  PipeReg#(FBundle) ir    <- mkPipeReg;
  Reg#(Bit#(1))     stage <- mkReg(0);

  // Bundle at the head of ir. The local is named irPc so that it does not
  // shadow the pc register that the fetch stage reads.
  let pcir = ir.first();
  let irPc = pcir.pc;
  let inst = pcir.inst;

  rule doProc;
    if(stage==0 && ir.notFull) begin
      // fetch
      let instResp <- mem(MemReq{op:Ld, addr:pc, data:?});
      ir.enq(FBundle{pc:pc, inst:instResp});
      stage <= 1;
    end

    if(stage==1 && ir.notEmpty) begin
      // decode
      let decInst = decode(inst);
      Data rVal1 = rf.rd1(decInst.rSrc1);
      Data rVal2 = rf.rd2(decInst.rSrc2);

      // execute
      let execInst = exec(decInst, irPc, rVal1, rVal2);
      if(execInst.instType==Ld || execInst.instType==St)
        execInst.data <- mem(MemReq{op:execInst.instType,
                                    addr:execInst.addr, data:execInst.data});
      pc <= execInst.brTaken ? execInst.addr : irPc+4;

      // writeback
      if(execInst.instType==Alu || execInst.instType==Ld)
        rf.wr(execInst.rDst, execInst.data);

      ir.deq;
      stage <= 0;
    end
  endrule
endmodule
Processor Performance
Microarchitecture           CPI   cycle time
Microcoded                  >1    short
Single-cycle unpipelined    1     long
Pipelined                   1     short
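$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$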
– Instructions per program depends on source code, compiler technology and ISA
– Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
– Time per cycle depends upon the microarchitecture and the base technology
Single-Cycle Hardwired Control: Harvard architecture
We will assume clock period is sufficiently long for all of the following steps to be “completed”:
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time
tC > tIFetch + tRFetch + tALU + tDMem + tRWB
At the rising edge of the following clock, the PC, the register file and the memory are updated
Clock Period
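In the two-state Princeton architecture
tC-Princeton > max {tM, tRF + tALU + tM + tWB}
Since the second term already includes tM, this reduces to
tC-Princeton > tRF + tALU + tM + tWB
while in the hardwired Harvard architecture
tC-Harvard > tM + tRF + tALU + tM + tWB
Which will execute instructions faster?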
Clock Rate vs CPI
CPI = Clock cycles Per Instruction
Suppose tM >> tRF + tALU + tWB
tC-Princeton = 0.5 * tC-Harvard
CPI-Princeton = 2
CPI-Harvard = 1
No difference in performance!
Is it possible to design a controller for the Princeton architecture with CPI < 2?
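The comparison can be checked numerically with a minimal sketch. The delay values below are made up for illustration and are not from the lecture; they are chosen only so that tM dominates the other terms. The function names are hypothetical.

```python
def clock_periods(t_m, t_rf, t_alu, t_wb):
    """Lower bounds on the clock period, taken from the slide's inequalities.

    Returns (tC_princeton, tC_harvard) in the same time unit as the inputs.
    """
    # Two-state Princeton: the clock must cover the longer of the two phases.
    t_princeton = max(t_m, t_rf + t_alu + t_m + t_wb)
    # Single-cycle Harvard: instruction fetch and data access share one cycle.
    t_harvard = t_m + t_rf + t_alu + t_m + t_wb
    return t_princeton, t_harvard


def time_per_instruction(cpi, t_c):
    """Average time per instruction = CPI * clock period."""
    return cpi * t_c


# Assumed delays (arbitrary units) with t_m much larger than the rest.
t_p, t_h = clock_periods(t_m=100.0, t_rf=1.0, t_alu=2.0, t_wb=1.0)
princeton = time_per_instruction(2, t_p)  # CPI = 2
harvard = time_per_instruction(1, t_h)    # CPI = 1
ratio = princeton / harvard               # close to 1 when t_m dominates
```

With these assumed numbers the Princeton clock is roughly half the Harvard clock, and the factor of two in CPI cancels it almost exactly, which is the point of the slide.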
Princeton microarchitecture (redrawn)
[Figure: the datapath redrawn as a fetch phase (PC, 0x4 adder, Memory, IR) and an execute phase (GPRs, Imm Ext, ALU, Memory). The two Memory blocks are the same memory (mux not shown).]
Only one of the phases is active in any cycle, so a lot of the datapath is not in use at any given time.
Princeton Microarchitecture Overlapped execution
[Figure: the redrawn Princeton datapath, with a fetch phase (PC, 0x4 adder, Memory, IR) and an execute phase (GPRs, Imm Ext, ALU, Memory).]
Can we overlap instruction fetch and execute?
Yes, unless IR contains a Load or Store
Which action should be prioritized?
Execute
What do we do with Fetch?
Stall it
How?
Stalling the instruction fetch: Princeton Microarchitecture
When a stall condition is indicated:
– don’t fetch a new instruction and don’t change the PC
– insert a nop in the IR
– set the Memory Address mux to ALU (not shown)
[Figure: the fetch-phase / execute-phase datapath with a stall? signal that holds the PC and selects a nop as the input to IR.]
What if IR contains a jump or branch instruction?
Need to stall on branches: Princeton Microarchitecture
When IR contains a jump or a taken branch:
– there is no “structural conflict” for the memory, but the PC does not hold the correct value, so the memory cannot be used and the Address Mux setting is irrelevant
– insert a nop in the IR
– insert the nextPC (branch-target) address in the PC
[Figure: the fetch-phase / execute-phase datapath with a Jump? signal that selects a nop as the input to IR and the branch target as the input to PC.]
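The stall and branch rules from the two preceding slides can be written as one next-state function for the fetch side. This is a behavioural sketch in Python, not the lecture's hardware: instructions are reduced to hypothetical op strings ("alu", "ld", "st", "j", "br", "nop"), memory is a dictionary from address to op, and the demo loop assumes every "br" is taken and jumps to a fixed target.

```python
NOP = "nop"


def next_fetch_state(pc, ir_op, branch_taken, target, mem):
    """Return (next_pc, next_ir, addr_src) for one clock cycle.

    ir_op is the instruction currently executing (held in IR);
    pc is the address of the next instruction to fetch.
    """
    if ir_op in ("ld", "st"):
        # Structural conflict: execute needs the memory, so stall the fetch.
        # Keep the PC, put a nop in IR, drive the address mux from the ALU.
        return pc, NOP, "ALU"
    if ir_op == "j" or (ir_op == "br" and branch_taken):
        # The PC is wrong, so the fetch is useless: put a nop in IR and
        # load the target into PC. The address mux setting is irrelevant.
        return target, NOP, None
    # No stall: fetch from PC and advance.
    return pc + 4, mem[pc], "PC"


def run(mem, cycles, pc=0, ir=NOP, taken_target=0):
    """Step the fetch logic, treating every 'br' as taken (an assumption)."""
    trace = []
    for _ in range(cycles):
        pc, ir, src = next_fetch_state(pc, ir, ir == "br", taken_target, mem)
        trace.append((pc, ir, src))
    return trace


program = {0: "alu", 4: "ld", 8: "br", 12: "alu"}
trace = run(program, 5)
```

In the resulting trace the cycle after the load is fetched holds the PC at 8 with a nop in IR, and the cycle after the branch is fetched redirects the PC to the target with another nop, so each of those instructions costs one extra cycle.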
Pipelined Princeton Microarchitecture
[Figure: pipelined Princeton datapath. The Princeton datapath and control signals as before (PC, IR, GPRs, Imm Ext, ALU, ALU Control, Memory, PCen, RegWrite, MemWrite, RegDst, ExtSel, BSrc, OpSel, WBSrc, PCSrc), with the added signals MAddrSrc, IRSrc (selects a nop), PCSrc2 and stall?.]
Pipelined Princeton Architecture
Clock: tC-Princeton > tRF + tALU + tM
CPI: (1 - f) + 2f cycles per instruction, where f is the fraction of instructions that cause a stall
What is a likely value of f?
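The CPI expression simplifies to 1 + f, so it can be evaluated directly once f is known. The sketch below is illustrative only: the op strings and the sample instruction list are made up, and the resulting f is not offered as an answer to the question above. It counts loads, stores, jumps and taken branches as stalling, following the stall conditions given earlier.

```python
# Ops assumed to cost one extra cycle, per the stall rules in the slides.
STALLING_OPS = {"ld", "st", "j", "br_taken"}


def pipelined_cpi(f):
    """CPI = (1 - f) + 2f, where f is the fraction of stalling instructions."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    return (1 - f) + 2 * f


def stall_fraction(ops):
    """Fraction of executed instructions that cause a stall."""
    if not ops:
        raise ValueError("need at least one instruction")
    return sum(op in STALLING_OPS for op in ops) / len(ops)


# A made-up dynamic instruction stream: 2 of the 8 instructions stall.
sample = ["alu", "alu", "ld", "alu", "br_not_taken", "alu", "br_taken", "alu"]
f = stall_fraction(sample)
cpi = pipelined_cpi(f)
```

For this sample stream f is 0.25, giving a CPI of 1.25; a real value of f has to come from measuring the instruction mix of actual programs.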