mi ps details

Upload: milind-shah

Post on 04-Apr-2018


TRANSCRIPT

  • 7/30/2019 Mi Ps Details

    1/124

    Embedded Processor Architecture

    TU/e 5kk73

    Henk Corporaal

    Bart Mesman

    RISC Instruction Set

    Implementation Alternatives

    == using MIPS as example ==


    H.Corporaal EmbProcArch 5kk73 2

    Topics

    MIPS ISA: Instruction Set Architecture

    MIPS single-cycle implementation

    MIPS multi-cycle implementation

    MIPS pipelined implementation

    Pipeline hazards

    Recap of RISC principles

    Other architectures

    Based on the book: ch2-4 (4th ed)

    Many slides; I'll go quickly and skip some

    Main Types of Instructions

    Arithmetic: Integer, Floating Point

    Memory access instructions: Load & Store

    Control flow: Jump, Conditional Branch, Call & Return

    MIPS arithmetic

    Most instructions have 3 operands

    Operand order is fixed (destination first)

    Example:

    C code: A = B + C

    MIPS code: add $s0, $s1, $s2

    ($s0, $s1 and $s2 are associated with variables by the compiler)

    MIPS arithmetic

    C code:
        A = B + C + D;
        E = F - A;

    MIPS code:
        add $t0, $s1, $s2
        add $s0, $t0, $s3
        sub $s4, $s5, $s0

    Operands must be registers; only 32 registers are provided

    Design Principle: smaller is faster. Why?

    Registers vs. Memory

    Arithmetic instruction operands must be registers; only 32 registers are provided

    Compiler associates variables with registers

    What about programs with lots of variables?

    [Figure: CPU containing the register file, connected to Memory and IO]

    Register allocation

    Compiler tries to keep as many variables in registers as possible

    Some variables cannot be allocated to registers:
        large arrays (too few registers)
        aliased variables (variables accessible through pointers in C)
        dynamically allocated variables (heap, stack)

    Compiler may run out of registers => spilling

    Memory Organization

    Viewed as a large, single-dimension array, with an address

    A memory address is an index into the array

    "Byte addressing" means that successive addresses are one byte apart

    [Figure: memory drawn as a column of byte addresses 0, 1, 2, 3, 4, 5, 6, ..., each holding 8 bits of data]

    Memory layout: Alignment

    Words are aligned

    What are the 2 least significant bits of a word address?

    [Figure: 32-bit words (bit positions 31, 23, 15, 7, 0) at byte addresses 0, 4, 8, 12, 16, 20, 24; one word is marked "this word is aligned; the others are not!"]
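The alignment question can be checked mechanically. The following is a minimal sketch, assuming 4-byte words as on the slide; the function names are illustrative and not part of the course material.

```python
WORD_SIZE = 4  # bytes per MIPS word (32-bit words, as on the slide)


def is_word_aligned(address):
    """Return True when the two least significant address bits are 00."""
    return (address & 0b11) == 0


def word_to_byte_address(word_index):
    """Byte address of the word with the given index (0, 4, 8, ...)."""
    return word_index * WORD_SIZE


if __name__ == "__main__":
    for addr in (0, 4, 6, 8, 13):
        # Print the address, its low-order bits, and whether it is aligned.
        print(addr, format(addr, "05b"), is_word_aligned(addr))
```

Masking with 0b11 is equivalent to testing divisibility by 4, which is why aligned word addresses always end in 00.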

    Instructions: load and store

    Example:

    C code: A[8] = h + A[8];

    MIPS code:
        lw  $t0, 32($s3)
        add $t0, $s2, $t0
        sw  $t0, 32($s3)

    Store word operation has no destination (reg) operand

    Remember: arithmetic operands are registers, not memory!

    Let's translate some C-code

    Can we figure out the code?

    swap(int v[], int k)
    {
        int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }

    swap:
        muli $2, $5, 4
        add  $2, $4, $2
        lw   $15, 0($2)
        lw   $16, 4($2)
        sw   $16, 0($2)
        sw   $15, 4($2)
        jr   $31

    Explanation:
        index k: $5
        base address of v: $4
        address of v[k] is $4 + 4 × $5
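To see why the address arithmetic works, the swap routine can be traced on a toy memory. This is a rough sketch, assuming a byte-addressed memory modelled as a Python dict from word address to value; the name swap_words and the sample addresses are made up for illustration.

```python
def swap_words(memory, base, k):
    """Mirror the MIPS swap routine on a dict-based, byte-addressed memory."""
    addr = base + 4 * k          # muli $2, $5, 4 ; add $2, $4, $2
    first = memory[addr]         # lw $15, 0($2)
    second = memory[addr + 4]    # lw $16, 4($2)
    memory[addr] = second        # sw $16, 0($2)
    memory[addr + 4] = first     # sw $15, 4($2)


if __name__ == "__main__":
    # Hypothetical array v = {1, 2, 3} stored from byte address 100.
    mem = {100: 1, 104: 2, 108: 3}
    swap_words(mem, base=100, k=1)
    print(mem)
```

Each element is one word, so consecutive elements are 4 bytes apart; that is where the offsets 0 and 4 in the load and store instructions come from.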

    Machine Language

    Consider the load-word and store-word instructions. What would the regularity principle have us do?

    New principle: Good design demands a compromise

    Introduce a new type of instruction format: I-type, for data transfer instructions

    (the other format was R-type, for register operands)

    Example: lw $t0, 32($s2)

        op    rs    rt    16-bit number
        35    18    8     32
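The I-type layout can be made concrete by packing the four fields into a 32-bit word. A minimal sketch, assuming the register numbers from the MIPS register conventions ($s2 is register 18, $t0 is register 8); encode_i_type is an illustrative helper, not an existing tool.

```python
def encode_i_type(op, rs, rt, imm):
    """Pack an I-type instruction: op(6) | rs(5) | rt(5) | immediate(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


if __name__ == "__main__":
    # lw $t0, 32($s2): opcode 35, base register 18, target register 8, offset 32
    word = encode_i_type(35, 18, 8, 32)
    print(format(word, "032b"))
    print(hex(word))
```

Masking the immediate with 0xFFFF keeps negative offsets within the 16-bit field.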

    Control

    Decision-making instructions alter the control flow, i.e., change the "next" instruction to be executed

    MIPS conditional branch instructions:
        bne $t0, $t1, Label
        beq $t0, $t1, Label

    Example: if (i==j) h = i + j;

               bne $s0, $s1, Label
               add $s3, $s0, $s1
        Label: ....

    Control

    MIPS unconditional branch instructions:
        j label

    Example:

        if (i!=j)               beq $s4, $s5, Lab1
            h=i+j;              add $s3, $s4, $s5
        else                    j   Lab2
            h=i-j;        Lab1: sub $s3, $s4, $s5
                          Lab2: ...

    Can you build a simple for loop?

    So far:

        Instruction          Meaning
        add $s1,$s2,$s3      $s1 = $s2 + $s3
        sub $s1,$s2,$s3      $s1 = $s2 - $s3
        lw  $s1,100($s2)     $s1 = Memory[$s2+100]
        sw  $s1,100($s2)     Memory[$s2+100] = $s1
        bne $s4,$s5,L        Next instr. is at Label if $s4 != $s5
        beq $s4,$s5,L        Next instr. is at Label if $s4 = $s5
        j   Label            Next instr. is at Label

    Formats:

        R:  op | rs | rt | rd | shamt | funct
        I:  op | rs | rt | 16-bit address
        J:  op | 26-bit address

    Control Flow

    We have: beq, bne; what about Branch-if-less-than?

    New instruction:
        slt $t0, $s1, $s2

    meaning:
        if $s1 < $s2 then $t0 = 1
        else $t0 = 0

    Can use this instruction to build "blt $s1, $s2, Label"; can now build general control structures

    Note that the assembler needs a register to do this; use conventions for registers
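How an assembler might build blt out of slt and bne can be sketched as a text expansion. This is only an illustration: expand_blt is a made-up helper, and the use of $at (register 1, conventionally reserved for the assembler) as the scratch register is an assumption about the register convention the slide alludes to.

```python
def expand_blt(rs, rt, label, temp="$at"):
    """Expand the pseudoinstruction 'blt rs, rt, label' into two real instructions."""
    return [
        f"slt {temp}, {rs}, {rt}",        # temp = 1 if rs < rt, else 0
        f"bne {temp}, $zero, {label}",    # branch when temp is non-zero
    ]


if __name__ == "__main__":
    for line in expand_blt("$s1", "$s2", "Label"):
        print(line)
```

The expansion costs two real instructions, which matters when counting instructions for performance.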

    MIPS compiler/assembler Conventions

        Name       Register number   Usage
        $zero      0                 the constant value 0
        $v0-$v1    2-3               values for results and expression evaluation
        $a0-$a3    4-7               arguments
        $t0-$t7    8-15              temporaries
        $s0-$s7    16-23             saved (by callee)
        $t8-$t9    24-25             more temporaries
        $gp        28                global pointer
        $sp        29                stack pointer
        $fp        30                frame pointer
        $ra        31                return address

    Constants

    Small constants are used quite frequently (50% of operands), e.g.,
        A = A + 5;
        B = B + 1;
        C = C - 18;

    Solutions? Why not?
        put 'typical constants' in memory and load them
        create hard-wired registers (like $zero) for constants like one

    MIPS Instructions:
        addi $29, $29, 4
        slti $8, $18, 10
        andi $29, $29, 6
        ori  $29, $29, 4

    How about larger constants?

    We'd like to be able to load a 32-bit constant into a register

    Must use two instructions; new "load upper immediate" instruction:
        lui $t0, 1010101010101010

        $t0 = 1010101010101010 0000000000000000   (lower half filled with zeros)

    Then must get the lower-order bits right, i.e.,
        ori $t0, $t0, 1010101010101010

              1010101010101010 0000000000000000
        ori   0000000000000000 1010101010101010
        =     1010101010101010 1010101010101010
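The two-instruction sequence amounts to splitting the constant into halves and recombining them. A small sketch under that reading; load_32bit_constant is a hypothetical name, and the function models only the arithmetic, not a real assembler.

```python
def load_32bit_constant(value):
    """Return (upper, lower, register) for the lui/ori sequence loading value."""
    upper = (value >> 16) & 0xFFFF       # immediate for lui
    lower = value & 0xFFFF               # immediate for ori
    reg = (upper << 16) & 0xFFFFFFFF     # lui: upper half set, lower half zero
    reg = reg | lower                    # ori: fill in the lower half
    return upper, lower, reg


if __name__ == "__main__":
    upper, lower, reg = load_32bit_constant(0xAAAAAAAA)
    print(format(upper, "016b"), format(lower, "016b"), format(reg, "032b"))
```

ori is a safe choice for the second step because its immediate is zero-extended, so the upper half written by lui is left untouched.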

    Assembly Language vs. Machine Language

    Assembly provides convenient symbolic representation
        much easier than writing down numbers
        e.g., destination first

    Machine language is the underlying reality
        e.g., destination is no longer first

    Assembly can provide 'pseudoinstructions'
        e.g., move $t0, $t1 exists only in Assembly
        would be implemented using add $t0,$t1,$zero

    When considering performance you should count real instructions

    Addresses in Branches and Jumps

    Instructions:
        bne $t4,$t5,Label    Next instruction is at Label if $t4 != $t5
        beq $t4,$t5,Label    Next instruction is at Label if $t4 = $t5
        j Label              Next instruction is at Label

    Formats:
        I:  op | rs | rt | 16-bit address
        J:  op | 26-bit address

    Addresses are not 32 bits

    How do we handle this with load and store instructions?

    What's the next address?

    Instructions:
        bne $t4,$t5,Label    Next instruction is at Label if $t4 != $t5
        beq $t4,$t5,Label    Next instruction is at Label if $t4 = $t5

    Formats:
        I:  op | rs | rt | 16-bit address

    Could specify a register (like lw and sw) and add it to the address
        use the Instruction Address Register (PC = program counter)
        most branches are local (principle of locality)

    Jump instructions just use the high-order bits of the PC
        address boundaries of 256 MB
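The PC-relative and pseudodirect address calculations can be written out directly. A minimal sketch, assuming 32-bit addresses and word-aligned instructions; the helper names are illustrative.

```python
def sign_extend_16(imm):
    """Interpret a 16-bit field as a signed two's-complement number."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """PC-relative: target = PC + 4 + (sign-extended offset in words * 4)."""
    return (pc + 4 + (sign_extend_16(imm16) << 2)) & 0xFFFFFFFF


def jump_target(pc, addr26):
    """Pseudodirect: upper 4 bits of PC + 4, then the 26-bit field times 4."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x3FFFFFF) << 2)


if __name__ == "__main__":
    print(hex(branch_target(0x1000, 25)))   # offset of 25 words = 100 bytes
    print(jump_target(0x1000, 2500))        # word address 2500 = byte address 10000
```

The 26-bit field shifted left by 2 spans 2^28 bytes, which is the 256 MB boundary mentioned on the slide.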

    To summarize: MIPS assembly language

        Category             Instruction               Example               Meaning                                   Comments
        Arithmetic           add                       add $s1, $s2, $s3     $s1 = $s2 + $s3                           Three operands; data in registers
                             subtract                  sub $s1, $s2, $s3     $s1 = $s2 - $s3                           Three operands; data in registers
                             add immediate             addi $s1, $s2, 100    $s1 = $s2 + 100                           Used to add constants
        Data transfer        load word                 lw $s1, 100($s2)      $s1 = Memory[$s2 + 100]                   Word from memory to register
                             store word                sw $s1, 100($s2)      Memory[$s2 + 100] = $s1                   Word from register to memory
                             load byte                 lb $s1, 100($s2)      $s1 = Memory[$s2 + 100]                   Byte from memory to register
                             store byte                sb $s1, 100($s2)      Memory[$s2 + 100] = $s1                   Byte from register to memory
                             load upper immediate      lui $s1, 100          $s1 = 100 * 2^16                          Loads constant in upper 16 bits
        Conditional branch   branch on equal           beq $s1, $s2, 25      if ($s1 == $s2) go to PC + 4 + 100        Equal test; PC-relative branch
                             branch on not equal       bne $s1, $s2, 25      if ($s1 != $s2) go to PC + 4 + 100        Not equal test; PC-relative
                             set on less than          slt $s1, $s2, $s3     if ($s2 < $s3) $s1 = 1; else $s1 = 0      Compare less than; for beq, bne
                             set less than immediate   slti $s1, $s2, 100    if ($s2 < 100) $s1 = 1; else $s1 = 0      Compare less than constant
        Unconditional jump   jump                      j 2500                go to 10000                               Jump to target address
                             jump register             jr $ra                go to $ra                                 For switch, procedure return
                             jump and link             jal 2500              $ra = PC + 4; go to 10000                 For procedure call

    MIPS (3+2) addressing modes overview

    1. Immediate addressing: op | rs | rt | Immediate

    2. Register addressing: op | rs | rt | rd | ... | funct; the operand is in a Register

    3. Base addressing: op | rs | rt | Address; Register + Address selects a Byte, Halfword or Word in Memory

    4. PC-relative addressing: op | rs | rt | Address; PC + Address selects a Word in Memory

    5. Pseudodirect addressing: op | Address; Address combined with the PC selects a Word in Memory

    MIPS Datapath

    Building a datapath to support a subset of the MIPS-I instruction set

    A single-cycle processor datapath
        all instruction actions in one (long) cycle

    A multi-cycle processor datapath
        each instruction takes multiple (shorter) cycles

    For details see book (ch 5)

    Datapath and Control

    Datapath: Registers & Memories, Multiplexors, Buses, ALUs

    Control: FSM or Micro-programming

    The Processor: Datapath & Control

    Simplified MIPS implementation to contain only:
        memory-reference instructions: lw, sw
        arithmetic-logical instructions: add, sub, and, or, slt
        control flow instructions: beq, j

    Generic Implementation:
        use the program counter (PC) to supply instruction address
        get the instruction from memory
        read registers
        use the instruction to decide exactly what to do

    All instructions use the ALU after reading the registers. Why?
        memory-reference? arithmetic? control flow?

    More Implementation Details

    Abstract / Simplified View:

    [Figure: the PC supplies the Address to the Instruction memory; the Instruction supplies Register# inputs to the Registers; register Data feeds the ALU; the ALU supplies the Address to the Data memory, whose Data returns to the Registers]

    Two types of functional units:
        elements that operate on data values (combinational)
        elements that contain state (sequential)

    State Elements

    Unclocked vs. Clocked

    Clocks used in synchronous logic
        when should an element that contains state be updated?

    [Figure: clock waveform showing cycle time, rising edge and falling edge]

    An unclocked state element

    The set-reset (SR) latch
        output depends on present inputs and also on past inputs

    [Figure: SR latch with inputs R and S and outputs Q and _Q]

    Truth table:
        R  S  Q
        0  0  Q
        0  1  1   (state change)
        1  0  0   (state change)
        1  1  ?
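The latch's dependence on past inputs can be reproduced by iterating two cross-coupled NOR gates until they settle. This is a behavioural sketch that assumes the usual NOR-based SR latch; gate delays and the R = S = 1 case (the "?" row) are deliberately not modelled in any detail.

```python
def nor(a, b):
    """Two-input NOR gate on 0/1 values."""
    return 0 if (a or b) else 1


def sr_latch(r, s, q):
    """Settle a cross-coupled NOR latch and return the new value of Q."""
    q_bar = 1 - q
    for _ in range(3):          # a few passes let the feedback loop settle
        q = nor(r, q_bar)
        q_bar = nor(s, q)
    return q


if __name__ == "__main__":
    q = 0
    for r, s in [(0, 1), (0, 0), (1, 0), (0, 0)]:
        q = sr_latch(r, s, q)
        print(f"R={r} S={s} -> Q={q}")
```

With R = S = 0 the function returns whatever Q was before, which is exactly the stored state.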

    Latches and Flip-flops

    Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)

    Change of state (value) is based on the clock
        Latches: whenever the inputs change, and the clock is asserted
        Flip-flop: state changes only on a clock edge (edge-triggered methodology)

    A clocking methodology defines when signals can be read and written; we wouldn't want to read a signal at the same time it was being written

    D-latch

    Two inputs:
        the data value to be stored (D)
        the clock signal (C) indicating when to read & store D

    Two outputs:
        the value of the internal state (Q) and its complement (_Q)

    [Figure: D-latch circuit with inputs C and D and outputs Q and _Q; timing diagram of D, C and Q]

    D flip-flop

    Output changes only on the clock edge

    [Figure: two D latches in series driven by the clock C, with outputs Q and _Q; timing diagram of D, C and Q]
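The difference between a level-sensitive latch and an edge-triggered flip-flop shows up clearly in a small behavioural model. The sketch below assumes a master-slave arrangement in which the master latch is open while C = 1 and the slave while C = 0, so the output changes on the falling edge; that edge choice and the class names are assumptions for illustration.

```python
class DLatch:
    """Level-sensitive latch: Q follows D while the clock input C is 1."""

    def __init__(self):
        self.q = 0

    def update(self, c, d):
        if c:
            self.q = d
        return self.q


class DFlipFlop:
    """Two latches in series; Q only changes when C goes from 1 to 0."""

    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, c, d):
        self.master.update(c, d)                  # master open while C = 1
        self.slave.update(1 - c, self.master.q)   # slave open while C = 0
        return self.slave.q


if __name__ == "__main__":
    ff = DFlipFlop()
    for c, d in [(1, 1), (0, 0), (0, 1), (1, 0), (0, 1)]:
        print(f"C={c} D={d} -> Q={ff.update(c, d)}")
```

While C is low the master ignores D, so changes on D cannot reach Q until the next full clock pulse.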

    Our Implementation

    An edge-triggered methodology

    Typical execution:
        read contents of some state elements,
        send values through some combinational logic,
        write results to one or more state elements

    [Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]

    Register File

    3-ported: one write, two read ports

    [Figure: register file with inputs Read reg. #1, Read reg. #2, Write reg. #, Write data and Write; outputs Read data 1 and Read data 2]

    Register file: read ports

    Register file built using D flip-flops

    Implementation of the read ports

    [Figure: Register 0, Register 1, ..., Register n-1, Register n feed two multiplexors, selected by Read register number 1 and Read register number 2, producing Read data 1 and Read data 2]

    Register file: write port

    Note: we still use the real clock to determine when to write

    [Figure: an n-to-1 decoder on the Register number (outputs 0, 1, ..., n-1, n), gated with Write, drives the C input of Register 0 ... Register n; Register data drives every D input]
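A behavioural model makes the port structure concrete: reads are combinational, while a write only takes effect on a clock edge when Write is asserted. A minimal sketch; the class and method names are invented for illustration, and 32 registers of 32 bits are assumed.

```python
class RegisterFile:
    """Two read ports, one write port."""

    def __init__(self, n_regs=32):
        self.regs = [0] * n_regs

    def read(self, read_reg1, read_reg2):
        # Combinational read: outputs simply follow the two register numbers.
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_edge(self, write, write_reg, write_data):
        # The decoder selects one register; it is updated only if Write is 1.
        if write:
            self.regs[write_reg] = write_data & 0xFFFFFFFF


if __name__ == "__main__":
    rf = RegisterFile()
    rf.clock_edge(write=1, write_reg=8, write_data=5)
    print(rf.read(8, 0))
```

In hardware the two read ports are the two multiplexors and the write port is the decoder, as in the figures above.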

    Building the Datapath

    Use multiplexors to stitch them together

    [Figure: single-cycle datapath with PC, Instruction memory (Read address, Instruction), Registers (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2), Sign extend (16 -> 32), Shift left 2, two adders (PC + 4 and branch target), ALU (3-bit ALU operation, Zero, ALU result), Data memory (Address, Write data, Read data) and three multiplexors; control signals RegWrite, ALUSrc, MemRead, MemWrite, MemtoReg, PCSrc]

    Our Simple Control Structure

    All of the logic is combinational

    We wait for everything to settle down, and the right thing to be done
        ALU might not produce the right answer right away
        we use write signals along with the clock to determine when to write

    Cycle time determined by length of the longest path

    [Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]

    We are ignoring some details like setup and hold times!

    Control

    Selecting the operations to perform (ALU, read/write, etc.)

    Controlling the flow of data (multiplexor inputs)

    Information comes from the 32 bits of the instruction

    Example: add $8, $17, $18

    Instruction Format:
        op       rs      rt      rd      shamt   funct
        000000   10001   10010   01000   00000   100000

    ALU's operation based on instruction type and function code
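Since the control information comes from the 32 instruction bits, the first job of the hardware is field extraction. A small sketch of that step for the R-type format; decode_r_type is an illustrative name.

```python
def decode_r_type(word):
    """Split a 32-bit R-type instruction into its six fields."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


if __name__ == "__main__":
    # add $8, $17, $18 written as the bit fields from the slide
    word = int("000000" "10001" "10010" "01000" "00000" "100000", 2)
    print(hex(word), decode_r_type(word))
```

In the datapath these shifts and masks are just wires: each field is a fixed slice of the instruction register.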

    Control: 2 level implementation

    [Figure: the 6-bit Opcode (instruction register bits 31-26) feeds Control 2, which produces the 2-bit ALUop; ALUop and the 6-bit Funct. field (bits 5-0) feed Control 1, which produces the 3-bit ALUcontrol for the ALU]

    ALUop:
        00: lw, sw
        01: beq
        10: add, sub, and, or, slt

    ALUcontrol:
        000: and
        001: or
        010: add
        110: sub
        111: set on less than

    Datapath with Control

    [Figure: single-cycle datapath with control. Instruction[31-26] feeds the Control unit, which generates RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite. Instruction[25-21] and Instruction[20-16] select the read registers; a multiplexor chooses Instruction[20-16] or Instruction[15-11] as the write register; Instruction[15-0] is sign-extended from 16 to 32 bits and shifted left 2 for the branch adder; Instruction[5-0] feeds ALU control; the ALU Zero output and Branch select the next PC]

    ALU Control1

    What should the ALU do with this instruction?

    example: lw $1, 100($2)

        op    rs    rt    16-bit offset
        35    2     1     100

    ALU control input:
        000 AND
        001 OR
        010 add
        110 subtract
        111 set-on-less-than

    Why is the code for subtract 110 and not 011?

    ALU Control1

    Must describe hardware to compute 3-bit ALU control input
        given instruction type (ALU operation class, computed from instruction type):
            00 = lw, sw
            01 = beq
            10 = arithmetic
        function code for arithmetic

    Describe it using a truth table (can turn into gates):

        inputs                                       outputs
        ALUOp1 ALUOp0   F5 F4 F3 F2 F1 F0            Operation
        0      0        X  X  X  X  X  X             010
        X      1        X  X  X  X  X  X             110
        1      X        X  X  0  0  0  0             010
        1      X        X  X  0  0  1  0             110
        1      X        X  X  0  1  0  0             000
        1      X        X  X  0  1  0  1             001
        1      X        X  X  1  0  1  0             111
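The truth table translates almost line for line into a lookup function. A minimal sketch; alu_control is a made-up name, and only the function codes listed in the table are handled.

```python
# Low four funct bits (F3..F0) -> 3-bit ALU operation, rows with ALUOp1 = 1
FUNCT_TO_OPERATION = {
    0b0000: 0b010,  # add
    0b0010: 0b110,  # subtract
    0b0100: 0b000,  # AND
    0b0101: 0b001,  # OR
    0b1010: 0b111,  # set on less than
}


def alu_control(alu_op, funct):
    """Compute the 3-bit ALU control input from ALUOp (2 bits) and funct (6 bits)."""
    if alu_op == 0b00:            # lw, sw: add for the address
        return 0b010
    if alu_op & 0b01:             # beq: subtract for the comparison
        return 0b110
    return FUNCT_TO_OPERATION[funct & 0b1111]   # F5 and F4 are don't-cares


if __name__ == "__main__":
    print(format(alu_control(0b10, 0b100000), "03b"))  # funct of add
    print(format(alu_control(0b10, 0b101010), "03b"))  # funct of slt
```

The don't-care entries are what keep the resulting gate network small: F5 and F4 never need to be inspected.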

    ALU Control1

    Simple combinational logic (truth tables)

    [Figure: ALU control block; inputs ALUOp (ALUOp1, ALUOp0) and F (5-0), of which F3, F2, F1 and F0 are used; outputs Operation2, Operation1, Operation0]

    Deriving Control2 signals

        Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
        R-format     1       0       0         1         0        0         0       1       0
        lw           0       1       1         1         1        0         0       0       0
        sw           X       1       X         0         0        1         0       0       0
        beq          X       0       X         0         0        0         1       0       1

    9 control (output) signals

    Determine these control signals directly from the opcodes (6-bit input):
        R-format: 0
        lw: 35
        sw: 43
        beq: 4
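The main control unit is a pure function of the opcode, so it can be modelled as a lookup table. A sketch under the assumption that don't-care entries are represented by None; the names CONTROL_TABLE and main_control are illustrative.

```python
X = None  # don't care

# opcode -> control signals, copied from the table (ALUOp given as two bits)
CONTROL_TABLE = {
    0:  dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1, MemRead=0,
             MemWrite=0, Branch=0, ALUOp=0b10),   # R-format
    35: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1, MemRead=1,
             MemWrite=0, Branch=0, ALUOp=0b00),   # lw
    43: dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0, MemRead=0,
             MemWrite=1, Branch=0, ALUOp=0b00),   # sw
    4:  dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0, MemRead=0,
             MemWrite=0, Branch=1, ALUOp=0b01),   # beq
}


def main_control(opcode):
    """Return the control signals for a 6-bit opcode."""
    return CONTROL_TABLE[opcode]


if __name__ == "__main__":
    print(main_control(35))
```

A PLA realises the same table in hardware: one product term per instruction class and one OR column per control signal.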

    Control 2

    PLA example implementation

    [Figure: PLA with inputs Op5, Op4, Op3, Op2, Op1, Op0; product terms R-format, lw, sw, beq; outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]

    Single Cycle Implementation

    Calculate cycle time assuming negligible delays except:
        memory (2ns), ALU and adders (2ns), register file access (1ns)

    [Figure: single-cycle datapath with PC, Instruction memory, Registers, Sign extend (16 -> 32), Shift left 2, two adders, ALU with ALU control, Data memory and four multiplexors; control signals RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, ALUOp and PCSrc; instruction fields Instruction[31-0], [25-21], [20-16], [15-11], [15-0] and [5-0]]

    Single Cycle Implementation

    Memory (2ns), ALU & adders (2ns), reg. file access (1ns)

    Fixed length clock: longest instruction is the lw, which requires 8 ns

    Variable clock length (not realistic, just as exercise):
        R-instr: 6 ns
        Load: 8 ns
        Store: 7 ns
        Branch: 5 ns
        Jump: 2 ns

    Average depends on instruction mix
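The per-instruction times follow from adding up the units each instruction class passes through. The sketch below reproduces that arithmetic; the instruction mix in the usage example is a hypothetical one chosen only to show how the average would be computed.

```python
MEMORY_NS = 2    # instruction or data memory access
ALU_NS = 2       # ALU and adders
REGFILE_NS = 1   # register file read or write

LATENCY_NS = {
    # fetch + register read + ALU + register write
    "R-type": MEMORY_NS + REGFILE_NS + ALU_NS + REGFILE_NS,
    # fetch + register read + ALU + data memory + register write
    "load": MEMORY_NS + REGFILE_NS + ALU_NS + MEMORY_NS + REGFILE_NS,
    # fetch + register read + ALU + data memory
    "store": MEMORY_NS + REGFILE_NS + ALU_NS + MEMORY_NS,
    # fetch + register read + ALU
    "branch": MEMORY_NS + REGFILE_NS + ALU_NS,
    # fetch only
    "jump": MEMORY_NS,
}


def fixed_clock_ns():
    """A fixed clock must accommodate the slowest instruction."""
    return max(LATENCY_NS.values())


def average_variable_clock_ns(mix):
    """Weighted average latency for a mix given as {class: fraction}."""
    return sum(LATENCY_NS[name] * fraction for name, fraction in mix.items())


if __name__ == "__main__":
    # Hypothetical mix, for illustration only.
    mix = {"R-type": 0.44, "load": 0.24, "store": 0.12, "branch": 0.18, "jump": 0.02}
    print(fixed_clock_ns(), average_variable_clock_ns(mix))
```

Comparing the two numbers shows how much time a fixed-length clock wastes on the faster instructions.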

    Where we are headed

    Single Cycle Problems:
        what if we had a more complicated instruction like floating point?
        wasteful of area: NO sharing of hardware resources

    One Solution:
        use a smaller cycle time
        have different instructions take different numbers of cycles
        a multicycle datapath:

    [Figure: PC -> Memory (Address; Instruction or data; Data) -> Instruction register (IR) and Memory data register (MDR) -> Registers (Register#, Data) -> A, B -> ALU -> ALUOut]

    Multicycle Approach

    We will be reusing functional units
        ALU used to compute address and to increment PC
        Memory used for instruction and data

    Add registers after every major functional unit

    Our control signals will not be determined solely by the instruction
        e.g., what should the ALU do for a subtract instruction?

    We'll use a finite state machine (FSM) or microcode for control

    Review: finite state machines

    Finite state machines:
        a set of states and
        next state function (determined by current state and the input)
        output function (determined by current state and possibly input)

    [Figure: Inputs and Current state feed the Next-state function; Next state is stored on the Clock; the Output function produces the Outputs]

    We'll use a Moore machine (output based only on current state)

    Multicycle Approach

    Break up the instructions into steps; each step takes a cycle
        balance the amount of work to be done
        restrict each cycle to use only one major functional unit

    At the end of a cycle
        store values for use in later cycles (easiest thing to do)
        introduce additional internal registers

    Notice: we distinguish
        processor state: programmer-visible registers
        internal state: programmer-invisible registers (like IR, MDR, A, B, and ALUOut)

    Multicycle Approach

    [Figure: multicycle datapath with PC, a single Memory (Address, Write data, MemData), Instruction register (Instruction[25-21], Instruction[20-16], Instruction[15-11], Instruction[15-0]), Memory data register, Registers (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2), A, B, Sign extend (16 -> 32), Shift left 2, ALU (Zero, ALU result), ALUOut, and multiplexors, including a 4-input multiplexor with the constant 4 as one input]

    Multicycle Approach

    Note that previous picture does not include:
        branch support
        jump support
        control lines and logic

    Tclock > max (ALU delay, Memory access, Regfile access)

    See book for complete picture

    Five Execution Steps

        Instruction Fetch
        Instruction Decode and Register Fetch
        Execution, Memory Address Computation, or Branch Completion
        Memory Access or R-type instruction completion
        Write-back step

    INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

    Step 1: Instruction Fetch

    Use PC to get instruction and put it in the Instruction Register

    Increment the PC by 4 and put the result back in the PC

    Can be described succinctly using RTL "Register-Transfer Language":
        IR = Memory[PC];
        PC = PC + 4;

    Can we figure out the values of the control signals?

    What is the advantage of updating the PC now?

    Step 2: Instruction Decode and Register Fetch

    Read registers rs and rt in case we need them

    Compute the branch address in case the instruction is a branch

    Previous two actions are done optimistically!!

    RTL:
        A = Reg[IR[25-21]];
        B = Reg[IR[20-16]];
        ALUOut = PC + (sign-extend(IR[15-0]) << 2);

    Step 3: Execution, Memory Address Computation, or Branch Completion

    ALU is performing one of four functions, based on instruction type

    Memory Reference:
        ALUOut = A + sign-extend(IR[15-0]);

    R-type:
        ALUOut = A op B;

    Branch:
        if (A==B) PC = ALUOut;

    Jump:
        PC = PC[31-28] || (IR[25-0] << 2);

    Step 4 (R-type or Memory-access)

    Loads and stores access memory
        MDR = Memory[ALUOut];
    or
        Memory[ALUOut] = B;

    R-type instructions finish
        Reg[IR[15-11]] = ALUOut;

    The write actually takes place at the end of the cycle on the edge

    Write-back step

    Memory read completion step
        Reg[IR[20-16]] = MDR;

    What about all the other instructions?

    Summary execution steps

    Table columns: Step name | Action for R-type instructions | Action for memory-reference instructions | Action for branches | Action for jumps

    Instruction fetch:
        IR = Memory[PC]
        PC = PC + 4

    Instruction decode/register fetch:
        A = Reg[IR[25-21]]
        B = Reg[IR[20-16]]
        ALUOut = PC + (sign-extend(IR[15-0]) << 2)

    Simple Questions

    How many cycles will it take to execute this code?

            lw  $t2, 0($t3)
            lw  $t3, 4($t3)
            beq $t2, $t3, L1    # assume not taken
            add $t5, $t2, $t3
            sw  $t5, 8($t3)
        L1: ...

    What is going on during the 8th cycle of execution?

    In what cycle does the actual addition of $t2 and $t3 take place?
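Questions like these can be answered by bookkeeping over the step counts of the multicycle design. The sketch below assumes the counts implied by the five execution steps (loads use all five, stores and R-type instructions finish in step 4, branches and jumps in step 3); the helper names are illustrative.

```python
# Number of steps (cycles) each instruction needs in the multicycle design
CYCLES = {"lw": 5, "sw": 4, "add": 4, "sub": 4, "beq": 3, "j": 3}


def total_cycles(program):
    """Sum the cycle counts of a straight-line instruction sequence."""
    return sum(CYCLES[op] for op in program)


def locate_cycle(program, cycle):
    """Return (instruction index, step number) active in a 1-based cycle."""
    start = 0
    for index, op in enumerate(program):
        if cycle <= start + CYCLES[op]:
            return index, cycle - start
        start += CYCLES[op]
    raise ValueError("cycle is past the end of the program")


if __name__ == "__main__":
    program = ["lw", "lw", "beq", "add", "sw"]   # the code sequence above
    print(total_cycles(program))
    print(locate_cycle(program, 8))
```

For the addition, look for the cycle in which the add instruction (index 3) is in step 3, since that is the step in which the ALU executes R-type operations.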

    Implementing the Control

    Value of control signals is dependent upon:
        what instruction is being executed
        which step is being performed

    Use the information we have accumulated to specify a finite state machine (FSM)
        specify the finite state machine graphically, or
        use microprogramming

    Implementation can be derived from specification

    Graphical Specification of FSM

    How many state bits will we need?

    [Figure: state diagram of the multicycle control FSM, entered at Start; the states and their control outputs are listed below]

        State 0, Instruction fetch: MemRead, ALUSrcA=0, IorD=0, IRWrite, ALUSrcB=01, ALUOp=00, PCWrite, PCSource=00
        State 1, Instruction decode/register fetch: ALUSrcA=0, ALUSrcB=11, ALUOp=00
        State 2, Memory address computation: ALUSrcA=1, ALUSrcB=10, ALUOp=00
        State 3, Memory access (Op='LW'): MemRead, IorD=1
        State 4, Write-back step: RegDst=0, RegWrite, MemtoReg=1
        State 5, Memory access (store): MemWrite, IorD=1
        State 6, Execution: ALUSrcA=1, ALUSrcB=00, ALUOp=10
        State 7, R-type completion: RegDst=1, RegWrite, MemtoReg=0
        State 8, Branch completion: ALUSrcA=1, ALUSrcB=00, ALUOp=01, PCWriteCond, PCSource=01
        State 9, Jump completion (Op='J'): PCWrite, PCSource=10

    Finite State Machine for Control

    Implementation:

    [Figure: Control logic block. Inputs: the opcode field of the Instruction register (Op5-Op0) and the State register (S3-S0). Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and the next state NS3-NS0]

    PLA Implementation

    If I picked a horizontal or vertical line could you explain it?

    What type of FSM is used? Mealy or Moore?

    [Figure: PLA; inputs are the opcode (Op5-Op0) and the current state (S3-S0); outputs are the datapath control signals (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst) and the next state (NS3-NS0)]

    (see book)

    Pipelined implementation

        Pipelining
        Pipelined datapath
        Pipelined control
        Hazards: Structural, Data, Control
        Exceptions
        Scheduling

    For details see the book (chapter 6)

    Pipelining

    Improve performance by increasing instruction throughput

    [Figure, top: non-pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) in program execution order. Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg and takes 8 ns; a new instruction starts every 8 ns (time axis 2 to 18 ns)]

    [Figure, bottom: pipelined execution of the same three instructions; a new instruction starts every 2 ns (time axis 2 to 14 ns)]


    Ideal speedup = number of stages

    Do we achieve this?
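    A small calculation helps to answer the question. The sketch below assumes the numbers from the three-load example (8 ns per instruction without pipelining, five stages of 2 ns each with pipelining); the function names are illustrative.

```python
def single_cycle_time(n, t_instr=8):
    """Total time in ns for n instructions when each takes one 8 ns cycle."""
    return n * t_instr


def pipelined_time(n, stages=5, t_stage=2):
    """Total time in ns: fill the pipeline once, then finish one instruction per stage time."""
    return (stages + n - 1) * t_stage


def speedup(n):
    return single_cycle_time(n) / pipelined_time(n)
```

    For the three loads this gives 24 ns versus 14 ns. For long instruction sequences the speedup approaches 8 / 2 = 4 rather than 5, because the stages are not perfectly balanced: the clock must fit the slowest stage.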

    Pipelining

    What makes it easy?
        all instructions are the same length
        just a few instruction formats
        memory operands appear only in loads and stores

    What makes it hard?
        structural hazards: suppose we had only one memory
        control hazards: need to worry about branch instructions
        data hazards: an instruction depends on a previous instruction

    We'll build a simple pipeline and look at these issues

    We'll talk about modern processors and what really makes it hard:
        exception handling
        trying to improve performance with out-of-order execution, etc.

    Basic idea: start from single cycle impl.

    What do we need to add to actually split the datapath into stages?

    [Figure: the single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, adders and muxes) divided into five stages. IF: instruction fetch; ID: instruction decode / register file read; EX: execute / address calculation; MEM: memory access; WB: write back.]

    Pipelined Datapath

    Can you find a problem even if there are no dependencies?
    What instructions can we execute to manifest the problem?

    [Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB inserted between the five stages.]

    Corrected Datapath

    [Figure: corrected pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB.]

    Graphically Representing Pipelines

    Can help with answering questions like:
        how many cycles does it take to execute this code?
        what is the ALU doing during cycle 4?
        use this representation to help understand datapaths

    [Figure: pipeline diagram for lw $10, 20($1) followed by sub $11, $2, $3 over clock cycles CC1 to CC6. Each instruction passes through IM, Reg, ALU, DM, Reg; the sub starts one cycle after the lw. Axes: time (in clock cycles) versus program execution order (in instructions).]
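    The two questions can also be answered mechanically. The following sketch builds such a diagram for an ideal pipeline without stalls; the stage names are those used in the figure, while the function names are hypothetical.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]


def pipeline_diagram(instructions):
    """Return a list of (instruction, {cycle: stage}) rows; a new instruction starts every cycle."""
    rows = []
    for start, text in enumerate(instructions):
        row = {start + k + 1: stage for k, stage in enumerate(STAGES)}
        rows.append((text, row))
    return rows


def busy_in_cycle(rows, stage, cycle):
    """Instructions that occupy `stage` during clock cycle `cycle`."""
    return [text for text, row in rows if row.get(cycle) == stage]


def total_cycles(rows):
    """Clock cycle in which the last instruction leaves the pipeline."""
    return max(max(row) for _, row in rows)


rows = pipeline_diagram(["lw $10, 20($1)", "sub $11, $2, $3"])
```

    With these two instructions the code takes six cycles, and in cycle 4 the ALU is working on the sub.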

    Pipeline Control

    [Figure: pipelined datapath (pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) annotated with its control signals: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite and MemtoReg. Instruction[20-16] and Instruction[15-11] are the two candidate write-register fields; Instruction[15-0] is sign-extended from 16 to 32 bits; 6 bits of the instruction go to the ALU control.]

    Pipeline control

    We have 5 stages. What needs to be controlled in each stage?
        Instruction Fetch and PC Increment
        Instruction Decode / Register Fetch
        Execution
        Memory Stage
        Write Back

    How would control be handled in an automobile plant?
        a fancy control center telling everyone what to do?
        should we use a finite state machine?

    Pipeline Control

    Pass control signals along just like the data (compare single cycle control!):

                     EX                            M                         WB
    Instruction  RegDst ALUOp1 ALUOp0 ALUSrc   Branch MemRead MemWrite   RegWrite MemtoReg
    R-format     1      1      0      0        0      0       0          1        0
    lw           0      0      0      1        0      1       0          1        1
    sw           X      0      0      1        0      0       1          0        X
    beq          X      0      1      0        1      0       0          0        X

    EX = execution/address calculation stage control lines
    M  = memory access stage control lines
    WB = write-back stage control lines

    [Figure: the Control unit derives the EX, M and WB fields from the instruction in IF/ID. ID/EX holds EX, M and WB; EX/MEM holds M and WB; MEM/WB holds only WB.]
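    As a sketch of "passing control along", the table can be written as a Python dictionary whose fields are peeled off stage by stage. The bit strings follow the table (X = don't care); the data layout and the function name are assumptions made for illustration.

```python
# Control word per instruction class, split into the three stage groups.
# EX: RegDst ALUOp1 ALUOp0 ALUSrc | M: Branch MemRead MemWrite | WB: RegWrite MemtoReg
CONTROL = {
    "R-format": ("1100", "000", "10"),
    "lw":       ("0001", "010", "11"),
    "sw":       ("X001", "001", "0X"),
    "beq":      ("X010", "100", "0X"),
}


def control_in_pipeline_registers(instr):
    """Control fields still carried by each pipeline register for `instr`."""
    ex, m, wb = CONTROL[instr]
    return {
        "ID/EX": (ex, m, wb),   # everything generated during decode
        "EX/MEM": (m, wb),      # EX field has been consumed
        "MEM/WB": (wb,),        # only the write-back field is left
    }
```

    Each stage consumes its own field and forwards the rest, so no central controller has to track which instruction is in which stage.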

    Datapath with Control

    [Figure: complete pipelined datapath with control. The Control unit in the ID stage generates the EX, M and WB fields, which travel through ID/EX, EX/MEM and MEM/WB alongside the data. Signals shown: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg.]

    Hazards: problems due to pipelining

    Hazard types:
        Structural
            same resource is needed multiple times in the same cycle
        Data
            data dependencies limit pipelining
        Control
            next executed instruction may not be the next specified instruction

    Structural hazards

    Examples:
        Two accesses to a single-ported memory
        Two operations need the same function unit at the same time
        Two operations need the same function unit in successive cycles, but the unit is not pipelined

    Solutions:
        stalling
        add more hardware


    Data hazards

    Data dependencies:
        RaW (read-after-write)
        WaW (write-after-write)
        WaR (write-after-read)

    Hardware solution:
        Forwarding / Bypassing
        Detection logic
        Stalling

    Software solution: Scheduling

    Data dependences

    Three types: RaW, WaR and WaW

    add r1, r2, 5     ; r1 := r2+5
    sub r4, r1, r3    ; RaW of r1

    add r1, r2, 5
    sub r2, r4, 1     ; WaR of r2

    add r1, r2, 5
    sub r1, r1, 1     ; WaW of r1

    st r1, 5(r2)      ; M[r2+5] := r1
    ld r5, 0(r4)      ; RaW if 5+r2 = 0+r4

    WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom!
    Problems for your compiler and Pentium!
        use register renaming to solve this!
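    The three register dependence types can be checked with a few comparisons. The sketch below is an illustration only: instructions are described by a hypothetical (destination, sources) pair, and memory dependences such as the st/ld case are not covered.

```python
def dependences(first, second):
    """Dependence types from `first` to a later instruction `second`.

    Each instruction is (dest, sources); dest is None if no register is written.
    """
    d1, s1 = first
    d2, s2 = second
    kinds = []
    if d1 is not None and d1 in s2:
        kinds.append("RaW")   # second reads what first writes
    if d2 is not None and d2 in s1:
        kinds.append("WaR")   # second overwrites what first reads
    if d1 is not None and d1 == d2:
        kinds.append("WaW")   # both write the same register
    return kinds
```

    Note that the third pair on the slide (add r1, r2, 5 followed by sub r1, r1, 1) is reported as both RaW and WaW, since the sub also reads r1.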

    RaW on MIPS pipeline

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

    [Figure: pipeline diagram of this sequence over CC 1 to CC 9 (stages IM, Reg, ALU, DM, Reg). Value of register $2 per clock cycle: 10 10 10 10 10/20 20 20 20 20, i.e. the result of the sub is written in CC 5. Axes: time (in clock cycles) versus program execution order (in instructions).]

    Forwarding

    Use temporary results, don't wait for them to be written
        register file forwarding to handle read/write to same register
        ALU forwarding

    [Figure: pipeline diagram over CC 1 to CC 9 for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) with forwarding paths drawn in. Value of register $2 per cycle: 10 10 10 10 10/20 20 20 20 20. Value of EX/MEM: 20 in CC 4, X otherwise. Value of MEM/WB: 20 in CC 5, X otherwise. Annotation on the or instruction: "What if this $2 was $13?"]

    Forwarding hardware

    ALU forwarding circuitry principle:

    [Figure: ALU with two operand inputs coming from the register file and a result going to the register file, plus bypass paths from the result back to the operand inputs through multiplexers.]

    Note: there are two options
        buf - ALU - bypass - mux - buf
        buf - bypass - mux - ALU - buf

    Forwarding

    [Figure: pipelined datapath with a forwarding unit. The register fields IF/ID.RegisterRs, IF/ID.RegisterRt and IF/ID.RegisterRd are carried into ID/EX as Rs, Rt and Rd. The forwarding unit compares Rs and Rt with EX/MEM.RegisterRd and MEM/WB.RegisterRd and drives the ForwardA and ForwardB muxes in front of the ALU.]

    Forwarding check

    Check for matching register-ids:
        For each source-id of operation in the EX-stage check if there is a matching pending dest-id

    Example:

    if (EX/MEM.RegWrite)
       and (EX/MEM.RegisterRd ≠ 0)
       and (EX/MEM.RegisterRd = ID/EX.RegisterRs)
    then ForwardA = 10

    Q. How many comparators do we need?
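    The check can be written out for both ALU operands. The sketch below follows the condition on the slide for the EX/MEM case (result 10); the MEM/WB case and its encoding 01, as well as 00 for "no forwarding", are assumed from the textbook's forwarding unit. Function and parameter names are illustrative.

```python
def forward(src, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Select the source of one ALU operand whose register number is `src`."""
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
        return "10"   # take the result waiting in EX/MEM (most recent)
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
        return "01"   # take the result waiting in MEM/WB
    return "00"       # take the value read from the register file


def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Return (ForwardA, ForwardB) for the instruction in the EX stage."""
    pending = (ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd)
    return forward(id_ex_rs, *pending), forward(id_ex_rt, *pending)
```

    In this formulation each of the two source registers is compared against two pending destination registers, which is one way to count the comparators asked for above.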

    Can't always forward

    Load word can still cause a hazard:
        an instruction tries to read register r following a load to the same r

    Need a hazard detection unit to stall the instruction that follows the load

    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7

    [Figure: pipeline diagram of this sequence over CC 1 to CC 9. The and needs $2 at the start of its ALU stage, but the lw delivers the value only at the end of its DM stage, so forwarding alone cannot help. Axes: time (in clock cycles) versus program execution order (in instructions).]

    Stalling

    We can stall the pipeline by keeping an instruction in the same stage

    [Figure: pipeline diagram over CC 1 to CC 10 for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7 with one bubble inserted after the lw. The and repeats its Reg stage and the or repeats its IM stage.]

    In CC4 the ALU is not used; Reg and IM are redone
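    The stall condition for this load-use case can be sketched as follows. The inputs are the signals named in the hazard detection figure (ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt); the function name and the dictionary of actions are assumptions made for illustration.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True if the instruction in ID reads the register a load in EX will write."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


def stall_actions(stall):
    """What the hazard detection unit does when it stalls (sketch)."""
    return {
        "PCWrite": not stall,        # keep the PC, so IM is redone
        "IF/IDWrite": not stall,     # keep IF/ID, so Reg is redone
        "zero_control": stall,       # send a bubble (all control = 0) into ID/EX
    }
```

    For lw $2, 20($1) in EX and and $4, $2, $5 in ID the condition holds, which produces the single bubble shown above.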

    Hazard Detection Unit

    [Figure: pipelined datapath with forwarding unit and hazard detection unit. The hazard detection unit reads ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt, and controls PCWrite, IF/IDWrite and a mux that replaces the control signals entering ID/EX by 0. The forwarding unit reads Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd.]

    Software only solution?

    Have compiler guarantee that no hazards occur

    Example: where do we insert the NOPs ?

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $13, 100($2)

    With NOPs inserted:

    sub $2, $1, $3
    nop
    nop
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    nop
    sw  $13, 100($2)

    Problem: this really slows us down!
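    The NOP placement can be reproduced with a simple rule. The sketch assumes a pipeline without forwarding in which the register file is written in the first half of a cycle and read in the second half, so a consumer must be at least three instruction slots behind its producer. The tuple format (text, destination, sources) and the function name are hypothetical.

```python
MIN_DISTANCE = 3  # assumed producer-to-consumer distance without forwarding


def insert_nops(program):
    """program: list of (text, dest, sources). Returns the instruction texts with nops added."""
    out = []
    last_write = {}  # register -> position in `out` of its most recent writer
    for text, dest, sources in program:
        ready = max((last_write[r] + MIN_DISTANCE for r in sources if r in last_write),
                    default=0)
        while len(out) < ready:      # pad until all operands are available
            out.append("nop")
        out.append(text)
        if dest is not None:
            last_write[dest] = len(out) - 1
    return out


program = [
    ("sub $2, $1, $3", "$2", ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2", "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2"]),
    ("sw $13, 100($2)", None, ["$13", "$2"]),
]
```

    Applied to the example this yields two nops after the sub and one before the sw, three extra cycles for five useful instructions.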

    Control hazards

    Control operations may change the sequential flow of instructions
        branch
        jump
        call (jump and link)
        return (exception/interrupt and rti / return from interrupt)

    Control hazard: Branch


    Branch actions:

    Compute new address

    Determine condition

    Perform the actual branch (if taken): PC := new address

    Branch example

    40 beq $1, $3, 7
    44 and $12, $2, $5
    48 or  $13, $6, $2
    52 add $14, $2, $2
    72 lw  $4, 50($7)

    [Figure: pipeline diagram of this sequence over CC 1 to CC 9. The instructions at addresses 44, 48 and 52 enter the pipeline behind the beq at 40 before the branch target at 72 is fetched. Axes: time (in clock cycles) versus program execution order (in instructions).]

    Branching

    Squash pipeline:
        When we decide to branch, other instructions are in the pipeline!
        We are predicting branch not taken
            need to add hardware for flushing instructions if we are wrong

    Branch with predict not taken

    [Figure: clock-cycle diagram for "Branch L" with predict not taken. Five rows of IF ID EX MEM WB, each starting one cycle later: the branch, the instructions fetched on the predicted not-taken path, and the instruction at label L.]

    Branch speedup

    Earlier address computation
    Earlier condition calculation

    Put both in the ID pipeline stage
        adder
        comparator

    [Figure: clock-cycle diagram for "Branch L" with predict not taken and the branch resolved in ID. Three rows of IF ID EX MEM WB: the branch, one instruction on the predicted not-taken path, and the instruction at label L.]
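    The effect of resolving the branch earlier can be estimated with a one-line model. This is a sketch under stated assumptions: prediction is "not taken", only taken branches pay a penalty, the penalty is 3 cycles when the branch is resolved late and 1 cycle when it is resolved in ID, and the branch statistics in the usage lines are made-up numbers, not measurements.

```python
def cpi_with_branches(branch_fraction, taken_fraction, penalty_cycles, base_cpi=1.0):
    """Average CPI when every taken branch flushes `penalty_cycles` instructions."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


# Hypothetical workload: 20% branches, half of them taken.
late = cpi_with_branches(0.20, 0.5, penalty_cycles=3)
early = cpi_with_branches(0.20, 0.5, penalty_cycles=1)
```

    Under these assumptions moving the adder and comparator into ID lowers the CPI from about 1.3 to about 1.1.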

    Improved branching / flushing IF/ID

    [Figure: pipelined datapath with the branch hardware moved into the ID stage: sign extend, shift left 2 and an adder compute the branch target, and a comparator (=) on the two register outputs determines the condition. An IF.Flush control line clears the IF/ID register. The hazard detection unit and forwarding unit are also shown.]

    Exception support

    Types of exceptions:
        Overflow
        I/O device request
        Operating system call
        Undefined instruction
        Hardware malfunction
        Page fault

    Precise exception:
        finish previous instructions (which are still in the pipeline)
        flush excepting and following instructions, redo them after handling the exception(s)

    Exceptions

    Changes needed for handling overflow exception of an operation in EX stage (see book for details):
        Extend PC input mux with extra entry with fixed address
        Add EPC register recording the ID/EX stage PC
            this is the address of the next instruction!
        Cause register recording exception type
        E.g., in case of overflow exception insert 3 bubbles; flush the following stages:
            IF/ID stage
            ID/EX stage
            EX/MEM stage

    Scheduling, why?

    Let's look at the execution time:

    $T_{execution} = N_{cycles} \times T_{cycle} = N_{instructions} \times CPI \times T_{cycle}$

    Scheduling may reduce $T_{execution}$
        Reduce CPI (cycles per instruction)
            early scheduling of long latency operations
            avoid pipeline stalls due to structural, data and control hazards
            allow $N_{issue} > 1$ and therefore CPI < 1
        Reduce $N_{instructions}$
            compact many operations into each instruction (VLIW)
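    A minimal numeric sketch of the execution time formula, assuming a single-issue pipeline in which every stall adds one cycle and ignoring the few cycles needed to fill the pipeline. The function name and the example numbers are hypothetical.

```python
def execution_time(n_instructions, stall_cycles, t_cycle):
    """Return (T_execution, CPI) for a single-issue pipeline with the given stalls."""
    n_cycles = n_instructions + stall_cycles
    cpi = n_cycles / n_instructions
    return n_cycles * t_cycle, cpi


# Hypothetical program: 1000 instructions, 2 ns cycle time.
before = execution_time(1000, stall_cycles=300, t_cycle=2)   # unscheduled
after = execution_time(1000, stall_cycles=100, t_cycle=2)    # after scheduling
```

    Removing stalls lowers the CPI and, with the instruction count and cycle time unchanged, the execution time by the same factor.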

    Scheduling data hazards: example 1

    Try and avoid RaW stalls (in this case load interlocks)!

    E.g., reorder these instructions:

    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)

    into:

    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t0, 4($t1)
    sw $t2, 0($t1)

    Scheduling data hazards: example 2

    Avoiding RaW stalls: reordering instructions for the following program (by you or the compiler)

    Code:
        a = b + c
        d = e - f

    Unscheduled code:
        Lw  R1,b
        Lw  R2,c
        Add R3,R1,R2    ; interlock
        Sw  a,R3
        Lw  R1,e
        Lw  R2,f
        Sub R4,R1,R2    ; interlock
        Sw  d,R4

    Scheduled code:
        Lw  R1,b
        Lw  R2,c
        Lw  R5,e        ; extra reg. needed!
        Add R3,R1,R2
        Lw  R2,f
        Sw  a,R3
        Sub R4,R5,R2
        Sw  d,R4
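    The interlocks in both versions can be counted automatically. The sketch assumes that a load followed immediately by an instruction reading the loaded register costs one stall cycle; the tuple format (op, destination, sources) and the function name are illustrative.

```python
def load_use_stalls(program):
    """Count load interlocks: a load directly followed by a reader of its result."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


unscheduled = [
    ("lw", "R1", []), ("lw", "R2", []), ("add", "R3", ["R1", "R2"]), ("sw", None, ["R3"]),
    ("lw", "R1", []), ("lw", "R2", []), ("sub", "R4", ["R1", "R2"]), ("sw", None, ["R4"]),
]
scheduled = [
    ("lw", "R1", []), ("lw", "R2", []), ("lw", "R5", []), ("add", "R3", ["R1", "R2"]),
    ("lw", "R2", []), ("sw", None, ["R3"]), ("sub", "R4", ["R5", "R2"]), ("sw", None, ["R4"]),
]
```

    The unscheduled code has two interlocks and the scheduled code has none, at the price of one extra register (R5).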


    Scheduling control hazards

    What can we do about control hazards and CPI penalty?
        Keep penalty $P_{branch}$ low:
            Early computation of new PC
            Early determination of condition
            Visible branch delay slots filled by compiler (MIPS)
        Branch prediction
        Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro'95]
        Remove branches: if-conversion
            Conditional instructions: CMOVE, cond skip next
            Guarding all instructions: TriMedia

    Branch delay slot

    Add a branch delay slot:
        the next instruction after a branch is always executed
        rely on compiler to fill the slot with something useful

    Is this a good idea?
        let's look how it works

    Branch delay slot scheduling

        op 1
        beq r1,r2, L
        .............
        op 2              ; 'fall-through'
    L:  op 3              ; branch target
        .............
        .............

    Q. What to put in the delay slot?

    Summary

    Modern processors are (deeply) pipelined, to reduce $T_{cycle}$ and aim at CPI = 1
    Hazards increase CPI
    Several software and hardware measures are taken to avoid or reduce hazards

    Not discussed, but important developments:
        Multi-issue further reduces CPI
        Branch prediction to avoid high branch penalties
        Dynamic scheduling

    In all cases: a scheduling compiler needed

    Recap of MIPS


    RISC architecture

    Register space

    Addressing

    Instruction format

    Pipelining

    Why RISC? Keep it simple

    RISC characteristics:
        Reduced number of instructions
        Limited addressing modes
            load-store architecture
            enables pipelining
        Large register set
            uniform (no distinction between e.g. address and data registers)
        Limited number of instruction sizes (preferably one)
            know directly where the following instruction starts
        Limited number of instruction formats
        Memory alignment restrictions
        ......

    Based on quantitative analysis
        "the famous MIPS one percent rule": don't even think about it when it's not used more than one percent

    Register space

    Name       Register number   Usage
    $zero      0                 the constant value 0
    $v0-$v1    2-3               values for results and expression evaluation
    $a0-$a3    4-7               arguments
    $t0-$t7    8-15              temporaries
    $s0-$s7    16-23             saved (by callee)
    $t8-$t9    24-25             more temporaries
    $gp        28                global pointer
    $sp        29                stack pointer
    $fp        30                frame pointer
    $ra        31                return address

    32 integer (and 32 floating point) registers, each 32 bits wide

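    Addressing

    [Figure: the five MIPS addressing modes, each shown with its instruction fields and the location of the operand.]
        1. Immediate addressing: op rs rt Immediate (the operand is the immediate field)
        2. Register addressing: op rs rt rd ... funct (the operand is a register)
        3. Base addressing: op rs rt Address (register + Address selects a byte, halfword or word in memory)
        4. PC-relative addressing: op rs rt Address (PC + Address selects a word in memory)
        5. Pseudodirect addressing: op Address (Address combined with the PC selects a word in memory)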

    Instruction format

    Example instructions

    Instruction          Meaning
    add $s1,$s2,$s3      $s1 = $s2 + $s3
    addi $s2,$s3,4       $s2 = $s3 + 4
    lw $s1,100($s2)      $s1 = Memory[$s2+100]
    bne $s4,$s5,L        if $s4 ≠ $s5 goto L
    j Label              goto Label

    Formats:

    R    op   rs   rt   rd   shamt   funct
    I    op   rs   rt   16 bit address
    J    op   26 bit address
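    The three formats map directly onto shifts and masks. The sketch below packs the fields into a 32-bit word using the standard MIPS field widths (op 6, rs 5, rt 5, rd 5, shamt 5, funct 6 bits). The helper names are hypothetical; the register numbers follow the register table ($s1 = 17, $s2 = 18, $s3 = 19), and the opcode and funct values used in the usage lines are the standard MIPS encodings for add (funct 0x20) and lw (opcode 35).

```python
def encode_r(rs, rt, rd, shamt, funct, op=0):
    """R format: op | rs | rt | rd | shamt | funct."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm):
    """I format: op | rs | rt | 16-bit immediate or address."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(op, target):
    """J format: op | 26-bit address."""
    return (op << 26) | (target & 0x3FFFFFF)


add_word = encode_r(rs=18, rt=19, rd=17, shamt=0, funct=0x20)   # add $s1,$s2,$s3
lw_word = encode_i(op=35, rs=18, rt=17, imm=100)                # lw $s1,100($s2)
```

    Because all instructions are one word and the register fields sit in fixed positions, the decoder can read rs and rt before it knows which format it is looking at.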

    Pipelining

    All integer instructions fit into the following pipeline

    time ->
    IF  ID  EX  MEM WB
        IF  ID  EX  MEM WB
            IF  ID  EX  MEM WB
                IF  ID  EX  MEM WB
                    IF  ID  EX  MEM WB

    Other architecture styles

    Accumulator architecture
        one operand (in register or memory), accumulator almost always implicitly used
    Stack
        zero operand: all operands implicit (on TOS)
    Register (load store)
        three operands, all in registers
        loads and stores are the only instructions accessing memory (i.e. with a memory (indirect) addressing mode)
    Register-Memory
        two operands, one in memory
    Memory-Memory
        three operands, may be all in memory

    (there are more varieties / combinations)

    Accumulator architecture

    [Figure: accumulator architecture: accumulator, ALU and memory connected through latches, with registers supplying the memory address.]

    Example code: a = b+c;

    load b;     // accumulator is implicit operand
    add c;
    store a;

    Stack architecture

    Example code: a = b+c;

    push b;
    push c;
    add;
    pop a;

    stack contents after each operation (top of stack first):

    push b    push c    add     pop a
    b         c         b+c
              b

    [Figure: stack architecture: ALU and memory connected through latches, with a top of stack register and a stack pointer (stack pt).]
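    A tiny interpreter makes the implicit operands of the stack style explicit. This is a sketch only: the instruction strings, the dictionary used as memory and the function name are assumptions made for illustration.

```python
def run_stack(program, memory):
    """Execute a zero-operand stack program; `memory` maps variable names to values."""
    stack = []
    for instr in program:
        op, *arg = instr.split()
        if op == "push":
            stack.append(memory[arg[0]])       # memory -> top of stack
        elif op == "add":
            right = stack.pop()                # both operands are implicit
            left = stack.pop()
            stack.append(left + right)
        elif op == "pop":
            memory[arg[0]] = stack.pop()       # top of stack -> memory
        else:
            raise ValueError(f"unknown instruction: {instr}")
    return memory


memory = run_stack(["push b", "push c", "add", "pop a"], {"b": 2, "c": 3})
```

    The add names no operands at all; the price is that every value has to be pushed and popped explicitly.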

    Other architecture styles

    Let's look at the code for C = A + B

    Stack           Accumulator     Register-Memory   Memory-Memory   Register
    Architecture    Architecture                                      (load-store)
    Push A          Load A          Load r1,A         Add C,B,A       Load r1,A
    Push B          Add B           Add r1,B                          Load r2,B
    Add             Store C         Store C,r1                        Add r3,r1,r2
    Pop C                                                             Store C,r3