
Multi-cycle Implementations

Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology

January 13, 2012
http://csg.csail.mit.edu/SNU
L6-1

Harvard-Style Datapath for MIPS

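[Figure: Harvard-style datapath. Blocks: PC, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2, we), Imm Ext, ALU with ALU Control, Data Memory (addr, wdata, rdata, we), and two adders (one adding 0x4 to the PC). Control signals: OpCode, zero?, RegWrite, MemWrite, WBSrc, RegDst, BSrc, ExtSel, OpSel, and PCSrc (br, rind, jabs, pc+4).]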

What problem arises if instructions and data reside in the same memory?

Structural hazard

At least the instruction fetch and a Load (or Store) cannot be executed in the same cycle.

Princeton Microarchitecture: Datapath & Control

[Figure: Princeton datapath and control. The Harvard-style datapath is redrawn around one memory (addr, wdata, rdata, we), with an IR that holds the fetched instruction. New control signals: PCen, IRen, AddrSrc. Fetch-phase annotations on the control signals: on, off, off, off, and AddrSrc = PC.]

Two-State Controller: Princeton Architecture

fetch phase: AddrSrc=PC, IRen=on, PCen=off, Wen=off

execute phase: AddrSrc=ALU, IRen=off, PCen=on, Wen=on

A flip-flop can be used to remember the phase.
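The two-state controller above can be sketched as a small simulation. This is only an illustrative model, written in Python so it can be run directly: the `Controls` record, the string encoding of AddrSrc, and the function names are assumptions made for this sketch, while the signal values per phase are taken from the two states listed above.

```python
from dataclasses import dataclass

FETCH, EXECUTE = 0, 1  # value held by the 1-bit phase flip-flop


@dataclass(frozen=True)
class Controls:
    addr_src: str  # "PC" or "ALU": which value drives the memory address
    ir_en: bool    # IRen: IR write enable
    pc_en: bool    # PCen: PC write enable
    w_en: bool     # Wen: gates the architectural write enables


def controls(phase: int) -> Controls:
    """Control signals for the given phase, as listed for the two states."""
    if phase == FETCH:
        return Controls(addr_src="PC", ir_en=True, pc_en=False, w_en=False)
    return Controls(addr_src="ALU", ir_en=False, pc_en=True, w_en=True)


def next_phase(phase: int) -> int:
    """The flip-flop toggles every cycle: fetch, execute, fetch, ..."""
    return phase ^ 1


def run(cycles: int, phase: int = FETCH) -> list:
    """Return the control signals asserted in each of `cycles` cycles."""
    trace = []
    for _ in range(cycles):
        trace.append(controls(phase))
        phase = next_phase(phase)
    return trace
```

Calling `run(4)` yields the fetch settings in cycles 0 and 2 and the execute settings in cycles 1 and 3, so every instruction occupies two cycles.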

Hardwired Controller: Princeton Architecture

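[Figure: hardwired controller. Blocks: the old combinational logic (Harvard), with inputs op code (from IR) and zero?; a new combinational logic block; and a 1-bit toggle FF holding the state S (I-fetch / Execute). Signals shown: ExtSel, BSrc, OpSel, WBSrc, RegDest, PCsrc1, PCsrc2, MemWrite, RegWrite, ..., Wen, PCen, IRen, AddrSrc.]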

Two-Cycle SMIPS

[Figure: two-cycle SMIPS datapath. Blocks: PC (+4), Inst Memory, ir, Decode, Register File, Execute, Data Memory, and the stage register.]


Two-Cycle SMIPS

module mkProc(Proc);
   Reg#(Addr)        pc    <- mkRegU;
   RFile             rf    <- mkRFile;
   Memory            mem   <- mkMemory;
   PipeReg#(FBundle) ir    <- mkPipeReg;
   Reg#(Bit#(1))     stage <- mkReg(0);   // 0 = fetch, 1 = execute

   // Contents of ir; the saved pc is named irPc so that it
   // does not clash with the pc register declared above
   let pcir = ir.first();
   let irPc = pcir.pc;
   let inst = pcir.inst;

   rule doProc;
      if (stage == 0 && ir.notFull) begin
         // fetch
         let instResp <- mem(MemReq{op: Ld, addr: pc, data: ?});
         ir.enq(FBundle{pc: pc, inst: instResp});
         stage <= 1;
      end

      if (stage == 1 && ir.notEmpty) begin
         // decode
         let decInst = decode(inst);
         Data rVal1 = rf.rd1(decInst.rSrc1);
         Data rVal2 = rf.rd2(decInst.rSrc2);

         // execute
         let execInst = exec(decInst, irPc, rVal1, rVal2);
         if (execInst.instType == Ld || execInst.instType == St)
            execInst.data <- mem(MemReq{op:   execInst.instType,
                                        addr: execInst.addr,
                                        data: execInst.data});
         // pc still holds the address of this instruction here
         pc <= execInst.brTaken ? execInst.addr : pc + 4;

         // writeback
         if (execInst.instType == Alu || execInst.instType == Ld)
            rf.wr(execInst.rDst, execInst.data);

         ir.deq;
         stage <= 0;
      end
   endrule
endmodule


Processor Performance

Microarchitecture           CPI   Cycle time
Microcoded                  >1    short
Single-cycle unpipelined    1     long
Pipelined                   1     short

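$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$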

– Instructions per program depends on source code, compiler technology and ISA

– Cycles per instruction (CPI) depends upon the ISA and the microarchitecture

– Time per cycle depends upon the microarchitecture and the base technology

Single-Cycle Hardwired Control: Harvard architecture

We will assume clock period is sufficiently long for all of the following steps to be “completed”:

1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time

$$t_C > t_{IFetch} + t_{RFetch} + t_{ALU} + t_{DMem} + t_{RWB}$$

At the rising edge of the following clock, the PC, the register file and the memory are updated

Clock Period

$$t_{C\text{-Princeton}} > \max\{t_M,\; t_{RF} + t_{ALU} + t_M + t_{WB}\}$$

$$t_{C\text{-Princeton}} > t_{RF} + t_{ALU} + t_M + t_{WB}$$

while in the hardwired Harvard architecture

$$t_{C\text{-Harvard}} > t_M + t_{RF} + t_{ALU} + t_M + t_{WB}$$

Which will execute instructions faster?
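One way to explore the question is to plug delays into the two bounds and compare the time per instruction, CPI times clock period. The sketch below is illustrative only: the function names and the delay values are hypothetical, the clock period is taken at its lower bound, and the CPI values (1 for Harvard, 2 for the two-state Princeton controller) are assumed from the controllers described in these slides.

```python
def min_clock_harvard(t_m, t_rf, t_alu, t_wb):
    """Lower bound on the Harvard clock period: fetch, register read,
    ALU, data access and write-back setup all fit in one cycle."""
    return t_m + t_rf + t_alu + t_m + t_wb


def min_clock_princeton(t_m, t_rf, t_alu, t_wb):
    """Lower bound on the Princeton clock period: the longer of the
    fetch phase and the execute phase."""
    fetch = t_m
    execute = t_rf + t_alu + t_m + t_wb
    return max(fetch, execute)


def time_per_instruction(cpi, t_c):
    """Average time per instruction = CPI x clock period."""
    return cpi * t_c


# Hypothetical delays in arbitrary time units (not from the slides)
delays = dict(t_m=10, t_rf=1, t_alu=2, t_wb=1)

harvard = time_per_instruction(1, min_clock_harvard(**delays))
princeton = time_per_instruction(2, min_clock_princeton(**delays))
```

With these made-up delays the Harvard design needs 24 units per instruction and the two-state Princeton design 28; the gap equals `t_rf + t_alu + t_wb`, so it shrinks relative to the total as the memory delay dominates.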

Clock Rate vs CPI

CPI = Clock cycles Per Instruction

Suppose $t_M \gg t_{RF} + t_{ALU} + t_{WB}$. Then

$$t_{C\text{-Princeton}} = 0.5 \times t_{C\text{-Harvard}}$$

$$\text{CPI}_{\text{Princeton}} = 2$$

$$\text{CPI}_{\text{Harvard}} = 1$$

No difference in performance!

Is it possible to design a controller for the Princeton architecture with CPI < 2?
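The "no difference" conclusion can be spelled out as a short derivation. Here $T$ denotes the average time per instruction, a symbol introduced only for this sketch, and each clock period is taken at its lower bound with the terms other than $t_M$ neglected.

$$
\begin{aligned}
t_{C\text{-Harvard}} &\approx t_M + t_M = 2\,t_M \\
t_{C\text{-Princeton}} &\approx t_M \\
T_{\text{Harvard}} &= \text{CPI}_{\text{Harvard}} \times t_{C\text{-Harvard}} \approx 1 \times 2\,t_M = 2\,t_M \\
T_{\text{Princeton}} &= \text{CPI}_{\text{Princeton}} \times t_{C\text{-Princeton}} \approx 2 \times t_M = 2\,t_M
\end{aligned}
$$

Both designs spend about two memory access times per instruction, so halving the clock period is exactly cancelled by doubling the CPI.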

Princeton microarchitecture (redrawn)

[Figure: the Princeton datapath redrawn as a fetch phase (PC, 0x4 adder, Memory, IR) followed by an execute phase (GPRs, Imm Ext, ALU, Memory). The two Memory boxes are the same memory (mux not shown).]

Only one of the phases is active in any cycle, so a lot of the datapath is not in use at any given time.

Princeton Microarchitecture: Overlapped execution

[Figure: the redrawn Princeton datapath, with the fetch phase (PC, 0x4 adder, Memory, IR) and the execute phase (GPRs, Imm Ext, ALU, Memory).]

Can we overlap instruction fetch and execute?
Yes, unless IR contains a Load or Store.

Which action should be prioritized?
Execute.

What do we do with Fetch?
Stall it.

How?

Stalling the instruction fetch: Princeton Microarchitecture

When the stall condition is indicated:
– don’t fetch a new instruction and don’t change the PC
– insert a nop in the IR
– set the Memory Address mux to ALU (not shown)


[Figure: the overlapped Princeton datapath with a stall? block. When stall? is asserted, a nop is selected as the input of the IR.]
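The stall rules for a Load or Store in the IR can be written as a small decision function. This is a sketch under assumptions: the signal names PCen, IRSrc and MAddrSrc are borrowed from the pipelined datapath figure later in the deck, while the string encodings and the `fetch_controls` function are made up for illustration.

```python
MEMORY_OPS = {"load", "store"}  # instruction kinds that need the memory port


def fetch_controls(ir_kind: str) -> dict:
    """Fetch-side control signals for one cycle, given the kind of
    instruction currently in the IR."""
    stall = ir_kind in MEMORY_OPS
    return {
        "stall": stall,
        "PCen": not stall,                      # hold the PC while stalled
        "IRSrc": "nop" if stall else "memory",  # insert a nop instead of fetching
        "MAddrSrc": "ALU" if stall else "PC",   # the data access owns the memory
    }
```

For example, `fetch_controls("load")` holds the PC, routes the ALU result to the memory address and places a nop in the IR, while `fetch_controls("alu")` lets the next fetch proceed.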

What if IR contains a jump or branch instruction?

Need to stall on branches: Princeton Microarchitecture

When IR contains a jump or a taken branch:
– there is no “structural conflict” for the memory
– but we do not have the correct PC value in the PC
– memory cannot be used, so the Address Mux setting is irrelevant
– insert a nop in the IR
– insert the nextPC (branch-target) address in the PC

[Figure: the overlapped Princeton datapath with a Jump? block. When Jump? is asserted, a nop is selected as the input of the IR.]
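Both stall cases, a Load or Store in the IR and a jump or taken branch in the IR, can be combined into one per-cycle update of the PC and IR. The model below is a minimal sketch, not the controller from the slides: the `Inst` record, the `step` function and the dictionary used as instruction memory are all assumptions, and data-side effects of instructions are omitted.

```python
from typing import NamedTuple, Optional


class Inst(NamedTuple):
    kind: str                     # "alu", "load", "store", "branch", "jump", "nop"
    taken: bool = False           # only meaningful for "branch"
    target: Optional[int] = None  # only for "jump" and taken "branch"


NOP = Inst("nop")


def step(pc: int, ir: Inst, memory: dict) -> tuple:
    """Return (next_pc, next_ir) after one clock cycle."""
    if ir.kind in ("load", "store"):
        # Structural conflict: the data access uses the memory this cycle,
        # so keep the PC and insert a nop in the IR.
        return pc, NOP
    if ir.kind == "jump" or (ir.kind == "branch" and ir.taken):
        # The PC does not hold the right fetch address yet:
        # load the target into the PC and insert a nop in the IR.
        return ir.target, NOP
    # Normal case: fetch the instruction at pc and advance the PC.
    return pc + 4, memory[pc]
```

A short usage example with a hypothetical five-instruction memory image:

```python
memory = {0: Inst("alu"), 4: Inst("jump", target=16),
          8: Inst("alu"), 16: Inst("load"), 20: Inst("alu")}
pc, ir = 0, NOP
trace = []
for _ in range(6):
    pc, ir = step(pc, ir, memory)
    trace.append((pc, ir.kind))
```

The instruction at address 8 is never fetched, and each of the jump and the load costs one extra cycle in which the IR holds a nop.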

Pipelined Princeton Microarchitecture

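[Figure: pipelined Princeton datapath and control. Same blocks as the Princeton datapath (PC, Memory, IR, GPRs, Imm Ext, ALU with ALU Control) and the same control signals (RegDst, PCSrc, RegWrite, BSrc, zero?, WBSrc, ExtSel, OpCode, OpSel, MemWrite, PCen), plus MAddrSrc, IRSrc with a nop input to the IR, PCSrc2, and a stall? block that produces the stall signal.]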

Pipelined Princeton Architecture

Clock: $t_{C\text{-Princeton}} > t_{RF} + t_{ALU} + t_M$

CPI: $(1 - f) + 2f$ cycles per instruction, where $f$ is the fraction of instructions that cause a stall

What is a likely value of f?
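The CPI expression can be evaluated for any instruction mix once the stalling instructions are counted. The sketch below assumes, following the earlier slides, that loads, stores, jumps and taken branches each cause one stall cycle; the instruction mix is hypothetical and does not answer what value of f is likely in real programs.

```python
from fractions import Fraction

# Instruction kinds assumed to cause a one-cycle stall
STALLING = {"load", "store", "jump", "taken_branch"}


def stall_fraction(kinds) -> Fraction:
    """f: the fraction of instructions in `kinds` that cause a stall."""
    return Fraction(sum(k in STALLING for k in kinds), len(kinds))


def cpi(f: Fraction) -> Fraction:
    """CPI = (1 - f) * 1 cycle + f * 2 cycles, which simplifies to 1 + f."""
    return (1 - f) + 2 * f


# Hypothetical mix of ten instructions (not measured data)
mix = ["alu", "load", "alu", "alu", "taken_branch",
       "alu", "store", "alu", "alu", "alu"]
f = stall_fraction(mix)
```

For this made-up mix, three of the ten instructions stall, so `f` is 3/10 and `cpi(f)` is 13/10. Exact fractions are used to avoid floating-point rounding in the comparison.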
