ece 550d - all facultypeople.ee.duke.edu/~jab/ece550/slides/08-datapath.pdf · ece 550d...

ECE 550DFundamentals of Computer Systems and Engineering

Fall 2017

Datapaths

Prof. John BoardDuke University

Slides are derived from work byProfs. Tyler Bletch and Andrew Hilton (Duke) and Amir Roth (Penn)

Last time

• What did we do last time?• MIPS Assembly

• Practice translating C to assembly together• Using functions

• Calling conventions• jal (call)• jr (return)

Now confluence of MIPS + digital logic

• Start of semester: Digital Logic• Building blocks of digital design

• Most recently: MIPS assembly, ISA• Lowest level software

• Now: where they meet…• Datapaths: hardware implementation of processors• By the way: homework 4 = build a datapath

• With some components from the TAs…

Necessary ingredient: the ALU

• ALU: Arithmetic/Logic Unit

• Performs any supported math or logic operation on two inputs

• Which operation is chosen by a third input

Add/Subtract With Overflow Detection

Full AdderFull AdderFull AdderFull Adder

S0S1Sn- 2Sn- 1

Overflow

a0a1 b0b1an- 2bn- 2an- 1bn- 1

Add/Sub

Add/sub

Add/sub F

A F Q0 0 a + b1 0 a - b- 1 NOT b- 2 a OR b- 3 a AND b

ALU Slice

The ALU

ALU SliceALU SliceALU SliceALU SliceALU control

a0b0a1b1an-2bn-2an-1bn-1

Q0Q1Qn-2Qn-1

Overflow Is non-zero?

Datapath for MIPS ISA

• Consider only the following instructionsadd $1,$2,$3addi $1,2,$3lw $1,4($3)sw $1,4($3)beq $1,$2,PC_relative_targetj absolute_target

• Why only these?• Most other instructions are the same from datapath viewpoint• The one’s that aren’t are left for you to figure out

InstructionFetch

InstructionDecode

OperandFetch

Execute

ResultStore

NextInstruction

Remember The von Neumann Model?

• Instruction Fetch: Read instruction bits from memory

• Decode:Figure out what those bits mean

• Operand Fetch:Read registers (+ mem to get sources)

• Execute:Do the actual operation (e.g., add the #s)

• Result Store:Write result to register or memory

• Next Instruction:Figure out mem addr of next insn, repeat

Start With Fetch

• Same for all instructions (don’t know insn yet)• PC and instruction memory• A +4 incrementer computes default next instruction PC

• Details of Insn Mem: later…• For now: just assume a bunch of DFFs

InsnMem

First Instruction: add

• Add register file and ALU

InsnMem

RegisterFile

R-type

s1 s2 d

Decoding: Very easy in MIPS

Op(6) Rs(5) Rt(5) Rd(5) Sh(5) Func(6)

First Instruction: add

• Add register file and ALU

InsnMem

RegisterFile

R-type

s1 s2 d

Decoding: Very easy in MIPS

Op(6) Rs(5) Rt(5) Rd(5) Sh(5) Func(6)

AND, OR, other r-type identical, just change func code! Same datapath

Second Instruction: addi

• Destination register can now be either Rd or Rt• Add sign extension unit and mux into second ALU input

InsnMem

RegisterFile

Op(6) Rs(5) Rt(5)I-type Immed(16)

s1 s2 d

Third Instruction: lw

• Add data memory, address is ALU output• Add register write data mux to select memory output or ALU output

InsnMem

RegisterFile

s1 s2 d

DataMemd

Fourth Instruction: sw

• Add path from second input register to data memory data input

InsnMem

RegisterFile

s1 s2 d

DataMem

Fifth Instruction: beq

• Add left shift unit and adder to compute PC-relative branch target• Add PC input mux to select PC+4 or branch target

• Note: shift by fixed amount very simple

InsnMem

RegisterFile

s1 s2 d

DataMem

Sixth Instruction: j

• Add shifter to compute left shift of 26-bit immediate• Add additional PC input mux for jump target

InsnMem

RegisterFile

Op(6)J-type Immed(26)

s1 s2 d

DataMem

More Instructions…

• Figure out datapath modifications for • jal (J-type)• jr (R-type)

InsnMem

RegisterFile

s1 s2 d

DataMem

• For jal, need to get PC+4 to RF write mux• (and constant 31 to destination register ID – probably another

InsnMem

RegisterFile

s1 s2 d

DataMem

• For JR need to get RF read value to next PC mux• (and constant 31 to source register ID, again probably another

InsnMem

RegisterFile

s1 s2 d

DataMem

Good practice: Try other insns

• Pick other MIPS instructions, contemplate how to add them

InsnMem

RegisterFile

s1 s2 d

DataMem

“Continuous Read” Datapath Timing

• Works because writes (PC, RegFile, DMem) are independent• And because no read logically follows any write (until next complete

instruction)• ONE LONG CLOCK CYCLE!

InsnMem

RegisterFile

s1 s2 d

DataMem

Read IMem Read Registers Read DMEM Write DMEMWrite Registers

Write PC

What Is Control?

• 8 signals control flow of data through this datapath• (well, ALUop is more than one bit..)• MUX selectors, or register/memory write enable signals• A real datapath might have 300-500 control signals

InsnMem

RegisterFile

s1 s2 d

DataMem

ALUinB

Example: Control for add

• Control for an instruction: • Values of all control signals to correctly execute it

InsnMem

RegisterFile

s1 s2 d

DataMem

DMwe=0ALUop=0

ALUinB=0Rdst=1

Example: Control for sw

• Difference between sw and add is 5 signals• 3 if you don’t count the X (don’t care) signals

InsnMem

RegisterFile

s1 s2 d

DataMem

ALUinB=1

DMwe=1

ALUop=0

Rdst=X

Example: Control for beq

• Difference between sw and beq is only 4 signals

InsnMem

RegisterFile

s1 s2 d

DataMem

ALUinB=0

DMwe=0

ALUop=1

Rdst=X

Let’s figure out LW

• How would these control signals be set for LW?

InsnMem

RegisterFile

s1 s2 d

DataMem

ALUinB

Example: Control for LW

InsnMem

RegisterFile

s1 s2 d

DataMem

DMwe=0ALUop=0

ALUinB=1Rdst=0

How Is Control Implemented?

InsnMem

RegisterFile

s1 s2 d

DataMem

ALUinB

Control?

Implementing Control

• Each insn has a unique set of control signals• Most are function of opcode• Some may be encoded in the instruction itself

• E.g., the ALUop signal is some portion of the MIPS Func field+ Simplifies controller implementation

• Requires careful ISA design

Control Implementation: ROM

• ROM (read only memory): think rows of bits• Bits in data words are control signals• Lines indexed by opcode• Example: ROM control for 6-insn MIPS datapath• X is “don’t care” (electrically must be 0 or 1 but doesn’t matter)

BR JP ALUinB ALUop DMwe Rwe Rdst Rwd

add 0 0 0 0 0 1 0 0addi 0 0 1 0 0 1 1 0

lw 0 0 1 0 0 1 1 1sw 0 0 1 0 1 0 X X

beq 1 0 0 1 0 0 X Xj 0 1 0 0 0 0 X X

opcode

Control Implementation: Random Logic

• Real machines have 100+ insns 300+ control signals• 30,000+ control bits (~4KB)• Not huge, but hard to make faster than datapath (important!)

• Alternative: random logic (random = ‘non-repeating’)• More or less what I did for protocomputer• Exploits the observation: many signals have few 1s or few 0s• Example: random logic control for 6-insn MIPS datapath

ALUinB

addaddilwswbeqj

BR JP DMwe Rwd Rdst ALUopRwe

Yes, “random logic” is a very dumb and

misleading name for this concept. Sorry.

Datapath and Control Timing

• Will usually need an IR (instruction register) buffering current instruction, as in protocomputer, but here can get by with Imem output

InsnMem

RegisterFile

s1 s2 d

DataMem

Control ROM/random logic

Read IMem Read Registers(Read Control ROM)

Read DMEM Write DMEMWrite Registers

Write PC

Single-Cycle Datapath Performance

• This machine will work, and it will be simple, but it will be slow

• Goes against make common case fast (MCCF) principle+ Low Cycles Per Instruction (CPI): 1– Long clock period: to accommodate slowest insn

InsnMem

RegisterFile

s1 s2 d

DataMem

Interlude: Performance

• Previous slide alludes to something new: Performance• Don’t just want it to work…• But want it to go fast!

• Three components to performance:Number of instructions

x Cycles per instruction (CPI)x Clock Period (1 / Clock frequency)

Instructions Cycles Seconds Seconds—————— x ————— x ————— = ——————Program Instruction Cycle Program

Interlude: Performance

• Three components to performance:Number of instructions <- Compiler’s Job

x Cycles per instruction (CPI)x Clock Period (1 / Clock frequency)

Instructions Cycles Seconds Seconds—————— x ————— x ————— = ——————Program Instruction Cycle Program

• Insns/Program: determined by compiler + ISA• Generally assume fixed program when doing micro-architecture

Micro-architectural factors

• Micro-architecture: • The details of how the ISA is implemented• Affects CPI and Clock frequency

• Often will look at fixed program, and consider MIPS• Million Instructions Per Second• MIPS = IPC * Frequency (in MHz)• IPC = Instruction Per Cycle (1 / CPI)• Gives “Bigger is better” number

Instructions Cycles Instructions————— x ————— = ——————Cycle Second Second(IPC) (Frequency) (Throughput)

The use of “MIPS” to mean “Millions of

Instructions Per Second” has nothing to do with the CPU architecture also called “MIPS”,

which actually stands for “Microprocessor without

Interlocked Pipeline Stages”.

This fact that a major CPU architecture shares a

name with an important metric for performance is incredibly confusing and dumb, and I apologize.

I blame the cocaine-fueled CPU architects of the

1980s.

“Best” IPC

• For now, best we can do: IPC = 1 (CPI = 1)• Do 1 instruction every cycle

• Later:• Real processors can do multiple instructions at once!• Potentially: IPC > 1! (CPI < 1!)• Best possible IPC depends on design

Performance vs ….

• 1990s: Performance at all cost• Actually more “clock frequency” at all cost…

• Now: Care about other things• Energy (electric bill, battery life)• Power (cooling, also affects energy)• Area (chip cost)• Reliability (tolerance of transient faults: e.g., charge particle strikes) • …

• Important metric these days “Performance / Watt”• Throughput divided by power consumption• Why? What device in particular?

Performance Modeling and Analysis

• Speaking of performance• Making a processor takes time (years) and money (millions)• Want to know it will perform well before you finish

• If its wrong, doing it all over is painful…• Performance can be simulated in software

• Estimate what IPC will be• Guide design• Patterson and Hennessy’s other more advanced textbook:

“Computer Architecture: A Quantitative Approach"

Single-Cycle Datapath Performance

• Goes against make common case fast (MCCF) principle+ Low Cycles Per Instruction (CPI): 1– Long clock period: to accommodate slowest insn

InsnMem

RegisterFile

s1 s2 d

DataMem

Alternative: Multi-Cycle Datapath

• Multi-cycle datapath: attacks high clock period• Cut datapath into multiple stages (5 here), isolate using FFs• FSM control “walks” insns thru stages (by staging control signals)• Not every instruction needs every stage+ Instructions can bypass stages and exit early

InsnMem

RegisterFile

s1 s2 dDataMem

Multi-cycle Datapath FSM

• First state: Get a New Instruction• Output signals to fetch (e.g., read enable IMEM)• Next State: Always Decode

NextInsn

DecodeInsn

• Second State: Decode• Output signals to decode instruction (RdEn RegFile)• Go to Next Insn if NOP• Otherwise Execute

NextInsn

DecodeInsn

ExecuteInsn

• Execute State• Execute Insn (varies by insn type)• Next State: Also depends on insn type

• Branches: Next Insn

NextInsn

DecodeInsn

ExecuteInsnBranch

• ALU op: write register - we call this Writeback (to register file)

NextInsn

DecodeInsn

ExecuteInsnBranch

Writeback

• Load: Read Memory

NextInsn

DecodeInsn

ExecuteInsnBranch

Writeback

Read DMEM

• Store: Write Memory

NextInsn

DecodeInsn

ExecuteInsnBranch

Writeback

Read DMEM

LoadWriteDMEM

• Read DMEM State• Control signals enable DMEM Read• Next state is writeback (what we read from memory needs to go to

a register)

NextInsn

DecodeInsn

ExecuteInsnBranch

Writeback

Read DMEM

LoadWriteDMEM

• Writeback state• Control signals enable regfile write• Next state: Next Insn

NextInsn

DecodeInsn

ExecuteInsnBranch

Writeback

Read DMEM

LoadWriteDMEM

• Write DMEM state• Control signals enable memory write• Next state: Next Insn

NextInsn

DecodeInsn

ExecuteInsnBranch

Writeback

Read DMEM

LoadWriteDMEM

Multi-Cycle Datapath Example: Add

InsnMem

RegisterFile

s1 s2 dDataMem

• Example: Add• Cycle 1: Read IMEM

InsnMem

RegisterFile

s1 s2 dDataMem

• Example: Add• Cycle 1: Read IMEM• Cycle 2: Decode + Read RF

InsnMem

RegisterFile

s1 s2 dDataMem

• Example: Add• Cycle 1: Read IMEM• Cycle 2: Decode + Read RF• Cycle 3: ALU

• Example: Add• Cycle 1: Read IMEM• Cycle 2: Decode + Read RF• Cycle 3: ALU• Cycle 4: Writeback + Increment PC

InsnMem

RegisterFile

s1 s2 dDataMem

Multi-Cycle Datapath Performance

• Opposite performance split of single-cycle datapath+ Short clock period– High CPI

InsnMem

RegisterFile

s1 s2 dDataMem

Multi-cycle Data-path CPI

• CPI depends on instructions• Branches / Jumps: 3 cycles• ALU: 4 cycles• Stores: 4 cycles• Loads: 5 cycles

• Overall CPI is weighted average

• Example: • 20% loads, 15% stores, 20% branches, 45% ALU

CPI= 0.20 * 5 +

CPI= 0.20 * 5 + 0.15 * 4 +

CPI= 0.20 * 5 + 0.15 * 4 + 0.20 * 3 + 0.45 * 4 = 4.0

Multi-cycle Datapath Performance

• Single-cycle• Clock period = 50ns, CPI = 1• Performace = 50 ns/insn

• Multi-cycle• Clock period = 10ns • CPI = (0.2*3+0.2*5+0.6*4) = 4• Performance = 40 ns/insn

• But wait…

Multi-Cycle Datapath Performance

• Did not just cut up existing logic into 5 pieces• Also added logic (flip flops)

• So clock period not 1/5 of single cycle, but slightly longer

InsnMem

RegisterFile

s1 s2 dDataMem

Multi-cycle Datapath Performance

• Single-cycle• Clock period = 50ns, CPI = 1• Performace = 50 ns/insn

• Multi-cycle• Clock period = 12ns• CPI = (0.2*3+0.2*5+0.6*4) = 4• Performance = 48 ns/insn

• Better, but not as exciting…• Can we do better still? • Have our cake (low CPI) and eat it too (high clock frequency)?

Clock Period and CPI

• Single-cycle datapath+ Low CPI: 1– Long clock period: to accommodate slowest insn

• Multi-cycle datapath+ Short clock period– High CPI

• Can we have both low CPI and short clock period?– No good way to make a single insn go faster+ Insn latency doesn’t matter anyway … insn throughput matters• Key: exploit inter-insn parallelism

insn0.fetch, dec, execinsn1.fetch, dec, exec

insn0.decinsn0.fetchinsn1.decinsn1.fetch

insn0.execinsn1.exec

Pipelining

• Pipelining: important performance technique• Improves insn throughput rather than insn latency• Exploits parallelism at insn-stage level to do so• Begin with multi-cycle design

• When insn advances from stage 1 to 2, next insn enters stage 1

• Individual insns take same number of stages+ But insns enter and leave at a much faster rate• Physically breaks “atomic” VN loop ... but must maintain illusion

• Revisit at end of semester (hopefully)

insn0.decinsn0.fetchinsn1.decinsn1.fetch

insn0.execinsn1.exec

insn0.decinsn0.fetchinsn1.decinsn1.fetchinsn0.exec

insn1.exec

Summary

• Datapaths:• Single Cycle

• What do we need?• Control• How control is implemented

• Multi-cycle• Faster clock (yay!)• Worse CPI (boooo)

• Performance:• IPC• Performance / Watt• CPU Performance Equation

• Pipelining• Teaser for later!

ece 550d - all facultypeople.ee.duke.edu/~jab/ece550/slides/08-datapath.pdf · ece 550d...

Documents

ece 550d - faculty

cach su dung may anh canon 550d

ece 550d fundamentals of computer systems and engineering...

eos 550d eng leaflet-6 - canon.com.hk · title: eos 550d...

ece technical elective and amp info session...

ece 550d - duke electrical and computer...

building a datapath datapath 1 - computer science at...

ece inplant training ece/ece/ece inplant training chennai

ece 550d fundamentals of computer systems and engineering...

rdx 550d-00 表紙 · canon eos rebel t1i / eos 500d canon...

ece 550d - duke...

y canon strap clamp 100d and 550d€¦ · 550d canon550d...

canon eos 550d brochure

ece 550d fundamentals of computer systems and engineering...

eos 550d manuel - fr

building a datapath datapath 1 - department of computer...

ece regulations ece r 32

cankaya university - ece department - ece 373...

instruction manual eos 550d

ece 550d - all...