Multi-cycle Implementations
Arvind
Computer Science & Artificial Intelligence Lab., Massachusetts Institute of Technology
January 13, 2012, L6-1
http://csg.csail.mit.edu/SNU



Page 2

Harvard-Style Datapath for MIPS

[Figure: single-cycle datapath with separate instruction and data memories. Blocks: PC, Inst. Memory (addr, inst), GPRs (rs1, rs2, ws, wd, rd1, rd2, we), Imm Ext, ALU (output z, zero?), ALU Control, Data Memory (addr, wdata, rdata, we), two adders, constants 0x4 and 31. Control signals: OpCode, RegWrite, RegDst, ExtSel, BSrc, OpSel, MemWrite, WBSrc, and PCSrc (br, rind, jabs, pc+4).]

Page 3

What problem arises if instructions and data reside in the same memory?

A structural hazard: at the very least, an instruction fetch and a Load (or Store) cannot be executed in the same cycle, because both need the single memory.

Page 4

Princeton Microarchitecture: Datapath & Control

[Figure: datapath with one memory shared by instructions and data. Blocks: PC (enable PCen), IR (enable IRen), memory (addr, wdata, rdata, we; still labeled Data Memory in the figure), GPRs (rs1, rs2, ws, wd, rd1, rd2, we), Imm Ext, ALU (output z, zero?), ALU Control, two adders, constants 0x4 and 31. Control signals: OpCode, RegWrite, RegDst, ExtSel, BSrc, OpSel, MemWrite, WBSrc, PCSrc, AddrSrc. Fetch-phase annotations: AddrSrc = PC; enable settings marked on, off, off, off.]

Page 5

Two-State Controller: Princeton Architecture

fetch phase: AddrSrc=PC, IRen=on, PCen=off, Wen=off

execute phase: AddrSrc=ALU, IRen=off, PCen=on, Wen=on

A flip-flop can be used to remember the phase.
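The two-state controller can be modeled in a few lines of software. The following is only a behavioral sketch in Python, not the hardware design from the lecture: the names `ControlSignals`, `control`, `next_phase` and `trace` are made up for illustration, and the signal values are the ones listed for the fetch and execute phases. It assumes the phase flip-flop toggles on every clock cycle.

```python
from dataclasses import dataclass

FETCH, EXECUTE = 0, 1  # the two values held by the 1-bit phase flip-flop


@dataclass(frozen=True)
class ControlSignals:
    addr_src: str  # "PC" or "ALU": which value drives the memory address
    ir_en: bool    # load the fetched word into IR
    pc_en: bool    # allow the PC to be updated
    w_en: bool     # allow register-file / memory writes


def control(phase: int) -> ControlSignals:
    """Signal settings for one phase, as listed for the two-state controller."""
    if phase == FETCH:
        return ControlSignals(addr_src="PC", ir_en=True, pc_en=False, w_en=False)
    return ControlSignals(addr_src="ALU", ir_en=False, pc_en=True, w_en=True)


def next_phase(phase: int) -> int:
    """Assumed behavior: the flip-flop simply toggles every clock cycle."""
    return phase ^ 1


def trace(cycles: int, phase: int = FETCH) -> list:
    """Control signals seen over a number of consecutive cycles."""
    out = []
    for _ in range(cycles):
        out.append(control(phase))
        phase = next_phase(phase)
    return out
```

Calling `trace(4)` yields the settings for fetch, execute, fetch, execute, which is why every instruction costs two cycles in this design. Note that `w_en` only permits writes; whether a particular instruction actually writes would still depend on its decoded RegWrite and MemWrite signals.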

Page 6

Hardwired Controller: Princeton Architecture

[Figure: the old combinational logic (Harvard) takes the op code (from IR) and zero? as inputs and produces ExtSel, BSrc, OpSel, WBSrc, RegDst, PCsrc1, PCsrc2, ..., together with MemWrite and RegWrite. New combinational logic, combined with S, a 1-bit toggle flip-flop that records I-fetch / Execute, produces PCen, IRen, AddrSrc and Wen.]

Page 7

Two-Cycle SMIPS

[Figure: PC (with +4), Inst Memory, Decode, Register File, Execute, Data Memory, plus the ir and stage registers used in the code on the following pages.]

Page 8

Two-Cycle SMIPS

module mkProc(Proc);
  Reg#(Addr)        pc    <- mkRegU;
  RFile             rf    <- mkRFile;
  Memory            mem   <- mkMemory;
  PipeReg#(FBundle) ir    <- mkPipeReg;
  Reg#(Bit#(1))     stage <- mkReg(0);

  let pcir = ir.first();
  let pc   = pcir.pc;
  let inst = pcir.inst;

  rule doProc;
    if(stage==0 && ir.notFull) begin
      //fetch
      let instResp <- mem(MemReq{op:Ld, addr:pc, data:?});
      ir.enq(FBundle{pc:pc, inst:instResp});
      stage <= 1;
    end

Page 9

Two-Cycle SMIPS cont-1

    if(stage==1 && ir.notEmpty) begin
      //decode
      let decInst = decode(inst);
      Data rVal1 = rf.rd1(decInst.rSrc1);
      Data rVal2 = rf.rd2(decInst.rSrc2);

      //execute
      let execInst = exec(decInst, pc, rVal1, rVal2);
      if(execInst.instType==Ld || execInst.instType==St)
        execInst.data <- mem(MemReq{op:execInst.instType,
                                    addr:execInst.addr,
                                    data:execInst.data});
      pc <= execInst.brTaken ? execInst.addr : pc+4;

Page 10

Two-Cycle SMIPS cont-2

      //writeback
      if(execInst.instType==Alu || execInst.instType==Ld)
        rf.wr(execInst.rDst, execInst.data);

      ir.deq;
      stage <= 0;
    end
  endrule
endmodule

Page 11

Processor Performance

$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$

– Instructions per program depends on source code, compiler technology and ISA
– Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
– Time per cycle depends upon the microarchitecture and the base technology

Microarchitecture          CPI   cycle time
Microcoded                 >1    short
Single-cycle unpipelined   1     long
Pipelined                  1     short
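To make the performance equation concrete, here is a small Python sketch that multiplies the three factors. The function name `execution_time_ns` and every number below are invented for illustration only; they are not measurements from the lecture. They are chosen merely to match the qualitative CPI and cycle-time entries for the three microarchitectures.

```python
def execution_time_ns(instructions: int, cpi: float, cycle_time_ns: float) -> float:
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cpi * cycle_time_ns


# Hypothetical program of one million instructions (illustrative values only).
N = 1_000_000

microcoded = execution_time_ns(N, cpi=4.0, cycle_time_ns=3.0)     # CPI > 1, short cycle
single_cycle = execution_time_ns(N, cpi=1.0, cycle_time_ns=10.0)  # CPI = 1, long cycle
pipelined = execution_time_ns(N, cpi=1.0, cycle_time_ns=3.0)      # CPI = 1, short cycle
```

With these made-up values the pipelined design comes out fastest because it combines the low CPI of the single-cycle design with the short cycle of the microcoded one. Neither factor alone decides performance.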

Page 12

Single-Cycle Hardwired Control: Harvard architecture

We will assume clock period is sufficiently long for all of the following steps to be “completed”:

1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time

$t_C > t_{IFetch} + t_{RFetch} + t_{ALU} + t_{DMem} + t_{RWB}$

At the rising edge of the following clock, the PC, the register file and the memory are updated.

Page 13

Clock Period

In the two-state Princeton architecture the clock period must cover the longer of the two phases:

$t_{C\text{-Princeton}} > \max\{t_M,\; t_{RF} + t_{ALU} + t_M + t_{WB}\}$

The execute-phase term already contains $t_M$, so it is always the larger of the two:

$t_{C\text{-Princeton}} > t_{RF} + t_{ALU} + t_M + t_{WB}$

while in the hardwired Harvard architecture

$t_{C\text{-Harvard}} > t_M + t_{RF} + t_{ALU} + t_M + t_{WB}$

Which will execute instructions faster?

Page 14

Clock Rate vs CPI

Suppose $t_M \gg t_{RF} + t_{ALU} + t_{WB}$. Then

$t_{C\text{-Princeton}} \approx 0.5 \cdot t_{C\text{-Harvard}}$

$CPI_{Princeton} = 2$

$CPI_{Harvard} = 1$

No difference in performance!

Is it possible to design a controller for the Princeton architecture with CPI < 2?

CPI = Clock cycles Per Instruction
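The comparison can be checked numerically. The sketch below, in Python, plugs hypothetical delays into the clock-period bounds for the two designs; the function names and the delay values (in arbitrary time units, with memory dominating) are assumptions made up for this example. It treats the lower bound on the clock period as the actual period.

```python
def t_c_harvard(t_m: float, t_rf: float, t_alu: float, t_wb: float) -> float:
    """Single-cycle Harvard: instruction fetch and data access in the same cycle."""
    return t_m + t_rf + t_alu + t_m + t_wb


def t_c_princeton(t_m: float, t_rf: float, t_alu: float, t_wb: float) -> float:
    """Two-state Princeton: the clock must cover the longer of fetch and execute."""
    return max(t_m, t_rf + t_alu + t_m + t_wb)


def time_per_instruction(cpi: float, t_c: float) -> float:
    return cpi * t_c


# Hypothetical delays with t_m much larger than everything else.
delays = dict(t_m=100.0, t_rf=1.0, t_alu=2.0, t_wb=1.0)

harvard = time_per_instruction(1, t_c_harvard(**delays))
princeton = time_per_instruction(2, t_c_princeton(**delays))
ratio = princeton / harvard
```

With these values the two-state Princeton machine spends 208 time units per instruction against 204 for Harvard, so the two are nearly equal. The halved clock period is cancelled by the doubled CPI, and the small remainder comes from paying the non-memory delays within a cycle that is counted twice.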

Page 15

Princeton microarchitecture (redrawn)

[Figure: the datapath redrawn as two halves. Fetch phase: PC, adder with 0x4, Memory (addr, wdata, rdata, we), IR. Execute phase: GPRs (rs1, rs2, ws, wd, rd1, rd2, we), Imm Ext, ALU, Memory (addr, wdata, rdata, we). The two Memory boxes are the same memory (mux not shown).]

Only one of the phases is active in any cycle, so a lot of the datapath is not in use at any given time.

Page 16

Princeton Microarchitecture: Overlapped execution

[Figure: the fetch-phase / execute-phase drawing of the redrawn Princeton microarchitecture.]

Can we overlap instruction fetch and execute?
Yes, unless IR contains a Load or Store.

Which action should be prioritized?
Execute.

What do we do with Fetch?
Stall it.

How?

Page 17

Stalling the instruction fetch: Princeton Microarchitecture

When the stall condition is indicated:
– don’t fetch a new instruction and don’t change the PC
– insert a nop in the IR
– set the Memory Address mux to ALU (not shown)

[Figure: fetch-phase / execute-phase drawing with a stall? signal and a nop input that can be selected into IR instead of the fetched instruction.]

What if IR contains a jump or branch instruction?

Page 18

Need to stall on branches: Princeton Microarchitecture

When IR contains a jump or a taken branch:
– there is no “structural conflict” for the memory
– but we do not have the correct PC value in the PC
– memory cannot be used, so the Address Mux setting is irrelevant
– insert a nop in the IR
– insert the nextPC (branch-target) address in the PC

[Figure: the same drawing as for stalling, with a Jump? signal and a nop input that can be selected into IR.]
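The stall rules for Loads/Stores and for jumps and taken branches can be summarized as one decision function. The Python sketch below is a behavioral model only; `Instr`, `front_end_control`, `total_cycles` and the string values are hypothetical names chosen for this example and do not correspond to signals in the lecture's datapath. It assumes every inserted nop costs exactly one extra cycle and ignores the cycle needed to fill the pipeline at startup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    kind: str            # "alu", "load", "store", "jump" or "branch"
    taken: bool = False  # only meaningful when kind == "branch"


def front_end_control(ir: Instr) -> dict:
    """Decide what the fetch side does while `ir` is in the execute phase."""
    if ir.kind in ("load", "store"):
        # Structural conflict: execute gets the memory, fetch is stalled.
        return {"ir_next": "nop", "pc_next": "hold", "maddr_src": "ALU"}
    if ir.kind == "jump" or (ir.kind == "branch" and ir.taken):
        # No memory conflict, but the PC does not hold the right address yet,
        # so nothing useful can be fetched; the address mux is a don't-care.
        return {"ir_next": "nop", "pc_next": "target", "maddr_src": None}
    # Normal case: fetch the next instruction in parallel with execute.
    return {"ir_next": "fetched", "pc_next": "pc+4", "maddr_src": "PC"}


def total_cycles(program) -> int:
    """One execute cycle per instruction, plus one cycle per inserted nop."""
    return sum(2 if front_end_control(i)["ir_next"] == "nop" else 1 for i in program)


program = [
    Instr("alu"),
    Instr("load"),
    Instr("branch", taken=True),
    Instr("branch", taken=False),
    Instr("store"),
    Instr("jump"),
]
cycles = total_cycles(program)
```

In this six-instruction example four instructions insert a nop, so the model counts ten cycles in total. An untaken branch does not stall because PC+4 is already the correct next address.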

Page 19

Pipelined Princeton Microarchitecture

[Figure: the Princeton datapath and control with pipelining additions. Added labels: MAddrSrc, IRSrc (with a nop input to IR), PCSrc2, stall?, stall. Other labels: PC (PCen), IR, GPRs (rs1, rs2, ws, wd, rd1, rd2, we), Imm Ext, ALU (output z, zero?), ALU Control, memory (addr, wdata, rdata, we; labeled Data Memory in the figure), two adders, constants 0x4 and 31, OpCode, RegWrite, RegDst, ExtSel, BSrc, OpSel, MemWrite, WBSrc, PCSrc.]

Page 20

Pipelined Princeton Architecture

Clock: $t_{C\text{-Princeton}} > t_{RF} + t_{ALU} + t_M$

CPI: $(1 - f) + 2f$ cycles per instruction, where $f$ is the fraction of instructions that cause a stall

What is a likely value of f?
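The CPI expression simplifies to 1 + f, so it ranges from 1 (no stalls) to 2 (every instruction stalls). The Python sketch below evaluates it for an instruction mix. The mix is purely hypothetical and is not an answer to the question of what f is in practice; the function names are also made up. It assumes that the stalling instructions are exactly Loads, Stores, jumps and taken branches.

```python
def pipelined_cpi(f: float) -> float:
    """CPI = (1 - f) + 2f, where f is the fraction of instructions that stall."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    return (1 - f) + 2 * f


def stall_fraction(loads: float, stores: float, jumps: float, taken_branches: float) -> float:
    """Assumed stall causes: memory accesses plus jumps and taken branches."""
    return loads + stores + jumps + taken_branches


# Hypothetical instruction mix (fractions of all executed instructions).
mix = dict(loads=0.20, stores=0.10, jumps=0.02, taken_branches=0.08)

f = stall_fraction(**mix)
cpi = pipelined_cpi(f)
```

For this illustrative mix f is about 0.4 and the CPI about 1.4, which sits between the single-cycle Harvard CPI of 1 and the two-state Princeton CPI of 2.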