7/30/2019 Mi Ps Details

1/124

Embedded Processor Architecture

TU/e 5kk73

Henk Corporaal

Bart Mesman

RISC Instruction Set Implementation Alternatives

== using MIPS as example ==


2/124

H.Corporaal EmbProcArch 5kk73 2

Topics

MIPS ISA: Instruction Set Architecture

MIPS single cycle implementation

MIPS multi-cycle implementation

MIPS pipelined implementation

Pipeline hazards

Recap of RISC principles

Other architectures

Based on the book: ch2-4 (4th ed)

Many slides; I'll go quick and skip some


3/124


Main Types of Instructions

Arithmetic

  Integer

  Floating Point

Memory access instructions

  Load & Store

Control flow

  Jump

  Conditional Branch

  Call & Return


4/124


MIPS arithmetic

Most instructions have 3 operands

Operand order is fixed (destination first)

Example:

C code: A = B + C

MIPS code: add $s0, $s1, $s2

($s0, $s1 and $s2 are associated with variables by the compiler)


5/124


MIPS arithmetic

C code: A = B + C + D;
        E = F - A;

MIPS code: add $t0, $s1, $s2
           add $s0, $t0, $s3
           sub $s4, $s5, $s0

Operands must be registers, only 32 registers provided

Design Principle: smaller is faster. Why?


6/124


Registers vs. Memory

Arithmetic instruction operands must be registers, only 32 registers provided

Compiler associates variables with registers

What about programs with lots of variables ?

[Figure: CPU containing the register file, connected to Memory and IO]


7/124


Register allocation

Compiler tries to keep as many variables in registers as possible

Some variables cannot be allocated to registers:

large arrays (too few registers)

aliased variables (variables accessible through pointers in C)

dynamically allocated variables

heap

stack

Compiler may run out of registers => spilling


8/124


Memory Organization

Viewed as a large, single-dimension array, with an address

A memory address is an index into the array

"Byte addressing" means that successive addresses are one byte apart

[Figure: memory drawn as a column of byte addresses 0, 1, 2, 3, 4, 5, 6, ..., each holding 8 bits of data]


9/124


10/124


Memory layout: Alignment

Words are aligned

What are the least 2 significant bits of a word address?

[Figure: memory drawn as 32-bit words (bit positions 31, 23, 15, 7, 0) at addresses 0, 4, 8, 12, 16, 20, 24; one word is marked "this word is aligned; the others are not!"]
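The alignment question can be checked with a minimal Python sketch, assuming 4-byte words as on this slide. The name `is_word_aligned` is illustrative and not from the slides.

```python
WORD_SIZE = 4  # bytes per 32-bit MIPS word


def is_word_aligned(address: int) -> bool:
    """A word address is aligned when its two least significant bits are 00."""
    return address & (WORD_SIZE - 1) == 0


# Byte addresses 0..12: only the multiples of 4 are valid word addresses.
aligned = [a for a in range(0, 13) if is_word_aligned(a)]
```

Masking with `WORD_SIZE - 1` keeps only the two low bits, which is the same test the hardware can do without a divider.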


11/124


Instructions: load and store

Example:

C code: A[8] = h + A[8];

MIPS code: lw $t0, 32($s3)
           add $t0, $s2, $t0
           sw $t0, 32($s3)

Store word operation has no destination (reg) operand

Remember arithmetic operands are registers, not memory!


12/124


Let's translate some C-code

Can we figure out the code?

swap(int v[], int k)
{ int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

swap: muli $2 , $5, 4
      add  $2 , $4, $2
      lw   $15, 0($2)
      lw   $16, 4($2)
      sw   $16, 0($2)
      sw   $15, 4($2)
      jr   $31

Explanation:

index k : $5

base address of v: $4

address of v[k] is $4 + 4*$5
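To make the address arithmetic concrete, here is a small Python model of the swap routine. It is only a sketch: memory is assumed to be a dict from byte address to 32-bit word, and the array contents and base address 100 are made-up values.

```python
def swap(memory: dict, v_base: int, k: int) -> None:
    """Mirror the MIPS swap; memory maps byte addresses to 32-bit words."""
    addr = v_base + 4 * k        # muli + add: byte address of v[k]
    r15 = memory[addr]           # lw $15, 0($2)
    r16 = memory[addr + 4]       # lw $16, 4($2)
    memory[addr] = r16           # sw $16, 0($2)
    memory[addr + 4] = r15       # sw $15, 4($2)


# Hypothetical array v = {10, 20, 30} stored from byte address 100.
memory = {100: 10, 104: 20, 108: 30}
swap(memory, v_base=100, k=1)
```

The factor 4 appears because memory is byte addressed while each array element is one 4-byte word.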


13/124


14/124


Consider the load-word and store-word instructions

What would the regularity principle have us do?

New principle: Good design demands a compromise

Introduce a new type of instruction format

I-type for data transfer instructions

other format was R-type for register

Example: lw $t0, 32($s2)

35 18 8 32

op rs rt 16 bit number
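The I-type layout can be sketched in Python as follows. The helper name `encode_i_type` is an assumption for illustration; the register numbers follow the MIPS register conventions ($t0 = 8, $s2 = 18).

```python
def encode_i_type(op: int, rs: int, rt: int, imm: int) -> int:
    """Pack an I-type instruction: op(6) | rs(5) | rt(5) | immediate(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# lw $t0, 32($s2): op = 35, rs = $s2 = 18, rt = $t0 = 8, offset = 32
word = encode_i_type(35, 18, 8, 32)
bits = f"{word:032b}"   # the 32-bit machine word as a bit string
```

Note that the base register goes into rs and the loaded register into rt, so the destination is no longer the first field.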

Machine Language


15/124


16/124


Decision making instructions alter the control flow,

i.e., change the "next" instruction to be executed

MIPS conditional branch instructions:

bne $t0, $t1, Label

beq $t0, $t1, Label

Example: if (i==j) h = i + j;

bne $s0, $s1, Label

add $s3, $s0, $s1

Label: ....

Control


17/124


MIPS unconditional branch instructions:

j label

Example:

if (i!=j)              beq $s4, $s5, Lab1
    h=i+j;             add $s3, $s4, $s5
else                   j Lab2
    h=i-j;       Lab1: sub $s3, $s4, $s5
                 Lab2: ...

Can you build a simple for loop?

Control


18/124


So far:

Instruction         Meaning

add $s1,$s2,$s3     $s1 = $s2 + $s3
sub $s1,$s2,$s3     $s1 = $s2 - $s3
lw $s1,100($s2)     $s1 = Memory[$s2+100]
sw $s1,100($s2)     Memory[$s2+100] = $s1
bne $s4,$s5,L       Next instr. is at Label if $s4 != $s5
beq $s4,$s5,L       Next instr. is at Label if $s4 = $s5
j Label             Next instr. is at Label

Formats:

R   op  rs  rt  rd  shamt  funct
I   op  rs  rt  16 bit address
J   op  26 bit address


19/124


We have: beq, bne, what about Branch-if-less-than?

New instruction:

slt $t0, $s1, $s2

meaning: if $s1 < $s2 then $t0 = 1 else $t0 = 0

Can use this instruction to build "blt $s1, $s2, Label"

can now build general control structures

Note that the assembler needs a register to do this, use conventions for registers

Control Flow


20/124


MIPS compiler/assembler Conventions

Name      Register number   Usage
$zero     0                 the constant value 0
$v0-$v1   2-3               values for results and expression evaluation
$a0-$a3   4-7               arguments
$t0-$t7   8-15              temporaries
$s0-$s7   16-23             saved (by callee)
$t8-$t9   24-25             more temporaries
$gp       28                global pointer
$sp       29                stack pointer
$fp       30                frame pointer
$ra       31                return address


21/124


Small constants are used quite frequently (50% of operands)

e.g., A = A + 5;
      B = B + 1;
      C = C - 18;

Solutions? Why not?

put 'typical constants' in memory and load them

create hard-wired registers (like $zero) for constants like one

MIPS Instructions:

addi $29, $29, 4
slti $8, $18, 10
andi $29, $29, 6
ori $29, $29, 4

Constants


22/124


How about larger constants?

We'd like to be able to load a 32 bit constant into a register

Must use two instructions; new "load upper immediate" instruction

lui $t0, 1010101010101010

1010101010101010 0000000000000000    (lower half filled with zeros)

Then must get the lower order bits right, i.e.,

ori $t0, $t0, 1010101010101010

    1010101010101010 0000000000000000
ori 0000000000000000 1010101010101010
    ---------------------------------
    1010101010101010 1010101010101010
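The two-instruction sequence can be modelled in a few lines of Python. This is a sketch of the register effect only; the function names simply reuse the mnemonics and the 32-bit mask is an assumption of the model.

```python
MASK32 = 0xFFFFFFFF


def lui(imm16: int) -> int:
    """Load upper immediate: imm16 goes to bits 31..16, low half is zero."""
    return (imm16 << 16) & MASK32


def ori(reg: int, imm16: int) -> int:
    """OR immediate: the 16-bit immediate is zero-extended before the OR."""
    return (reg | (imm16 & 0xFFFF)) & MASK32


half = 0b1010101010101010
t0 = lui(half)        # upper half set, lower half zeros
t0 = ori(t0, half)    # lower half filled in
```

Because ori zero-extends its immediate, it cannot disturb the upper half that lui has just written.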


23/124


Assembly provides convenient symbolic representation

much easier than writing down numbers

e.g., destination first

Machine language is the underlying reality

e.g., destination is no longer first

Assembly can provide 'pseudoinstructions'

e.g., move $t0, $t1 exists only in Assembly

would be implemented using add $t0,$t1,$zero

When considering performance you should count real instructions

Assembly Language vs. Machine Language


24/124


Instructions:

bne $t4,$t5,Label   Next instruction is at Label if $t4 != $t5
beq $t4,$t5,Label   Next instruction is at Label if $t4 = $t5
j Label             Next instruction is at Label

Formats:

I   op  rs  rt  16 bit address
J   op  26 bit address

Addresses are not 32 bits

How do we handle this with load and store instructions?

Addresses in Branches and Jumps


25/124


Instructions:

bne $t4,$t5,Label   Next instruction is at Label if $t4 != $t5
beq $t4,$t5,Label   Next instruction is at Label if $t4 = $t5

Formats:

I   op  rs  rt  16 bit address

Could specify a register (like lw and sw) and add it to address

use Instruction Address Register (PC = program counter)

most branches are local (principle of locality)

Jump instructions just use high order bits of PC

address boundaries of 256 MB

What's the next address?
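A hedged Python sketch of how the two address fields turn into a 32-bit target. It assumes the usual MIPS rules (offsets count words and are relative to PC + 4); the function names are illustrative.

```python
def sign_extend16(value: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset16: int) -> int:
    """PC-relative: word offset relative to the instruction after the branch."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def jump_target(pc: int, addr26: int) -> int:
    """Upper 4 bits come from PC + 4, the rest is the 26-bit field times 4."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x3FFFFFF) << 2)
```

For example, a branch at address 40 with offset 7 lands at 72, and `j 2500` from a low address goes to byte address 10000. The 26-bit field times 4 spans 2^28 bytes, which is where the 256 MB boundary comes from.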


26/124


To summarize: MIPS assembly language

Category            Instruction              Example               Meaning                                 Comments
Arithmetic          add                      add $s1, $s2, $s3     $s1 = $s2 + $s3                         Three operands; data in registers
                    subtract                 sub $s1, $s2, $s3     $s1 = $s2 - $s3                         Three operands; data in registers
                    add immediate            addi $s1, $s2, 100    $s1 = $s2 + 100                         Used to add constants
Data transfer       load word                lw $s1, 100($s2)      $s1 = Memory[$s2 + 100]                 Word from memory to register
                    store word               sw $s1, 100($s2)      Memory[$s2 + 100] = $s1                 Word from register to memory
                    load byte                lb $s1, 100($s2)      $s1 = Memory[$s2 + 100]                 Byte from memory to register
                    store byte               sb $s1, 100($s2)      Memory[$s2 + 100] = $s1                 Byte from register to memory
                    load upper immediate     lui $s1, 100          $s1 = 100 * 2^16                        Loads constant in upper 16 bits
Conditional branch  branch on equal          beq $s1, $s2, 25      if ($s1 == $s2) go to PC + 4 + 100      Equal test; PC-relative branch
                    branch on not equal      bne $s1, $s2, 25      if ($s1 != $s2) go to PC + 4 + 100      Not equal test; PC-relative
                    set on less than         slt $s1, $s2, $s3     if ($s2 < $s3) $s1 = 1; else $s1 = 0    Compare less than; for beq, bne
                    set less than immediate  slti $s1, $s2, 100    if ($s2 < 100) $s1 = 1; else $s1 = 0    Compare less than constant
Unconditional jump  jump                     j 2500                go to 10000                             Jump to target address
                    jump register            jr $ra                go to $ra                               For switch, procedure return
                    jump and link            jal 2500              $ra = PC + 4; go to 10000               For procedure call

MIPS (3+2) addressing modes overview


27/124


[Figure: the five MIPS addressing modes and the instruction fields they use]

1. Immediate addressing (op, rs, rt, Immediate)

2. Register addressing (op, rs, rt, rd, ..., funct; operand in a register)

3. Base addressing (op, rs, rt, Address; register + Address selects a byte, halfword or word in memory)

4. PC-relative addressing (op, rs, rt, Address; PC + Address selects a word in memory)

5. Pseudodirect addressing (op, Address; Address combined with the PC selects a word in memory)


28/124


MIPS Datapath

Building a datapath

support a subset of the MIPS-I instruction-set

A single cycle processor datapath

all instruction actions in one (long) cycle

A multi-cycle processor datapath

each instruction takes multiple (shorter) cycles

For details see book (ch 5)


29/124


Datapath and Control

Datapath: Registers & Memories, Multiplexors, Buses, ALUs

Control: FSM or Micro-programming


30/124


Simplified MIPS implementation to contain only:

memory-reference instructions: lw, sw

arithmetic-logical instructions: add, sub, and, or, slt

control flow instructions: beq, j

Generic Implementation:

use the program counter (PC) to supply instruction address

get the instruction from memory

read registers

use the instruction to decide exactly what to do

All instructions use the ALU after reading the registers

Why? memory-reference? arithmetic? control flow?

The Processor: Datapath & Control


31/124


More Implementation Details

Abstract / Simplified View:

[Figure: PC -> Instruction memory (Address, Instruction) -> Registers (Register #, Data) -> ALU -> Data memory (Address, Data)]

Two types of functional units:

elements that operate on data values (combinational)

elements that contain state (sequential)


32/124


Unclocked vs. Clocked

Clocks used in synchronous logic

when should an element that contains state be updated?

[Figure: clock waveform with cycle time, rising edge and falling edge marked]

State Elements


33/124


The set-reset (SR) latch

output depends on present inputs and also on past inputs

An unclocked state element

[Figure: SR latch with inputs R and S and outputs Q and _Q]

Truth table:

R S Q

0 0 Q

0 1 1

1 0 0

1 1 ?

state change


34/124


Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)

Change of state (value) is based on the clock

Latches: whenever the inputs change, and the clock is asserted

Flip-flop: state changes only on a clock edge (edge-triggered methodology)

A clocking methodology defines when signals can be read and written

wouldn't want to read a signal at the same time it was being written

Latches and Flip-flops


35/124


Two inputs:

the data value to be stored (D)

the clock signal (C) indicating when to read & store D

Two outputs:

the value of the internal state (Q) and its complement

D-latch

[Figure: D-latch circuit with inputs D and C and outputs Q and _Q, with a timing diagram of D, C and Q]


36/124


D flip-flop

Output changes only on the clock edge

[Figure: D flip-flop built from two D latches in series, each with inputs D and C and outputs Q and _Q, with a timing diagram of D, C and Q]


37/124


Our Implementation

An edge triggered methodology

Typical execution:

read contents of some state elements,

send values through some combinational logic,

write results to one or more state elements

[Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]


38/124


3-ported: one write, two read ports

Register File

[Figure: register file block with inputs Read reg. #1, Read reg. #2, Write reg. #, Write data and Write, and outputs Read data 1 and Read data 2]


39/124


Register file: read ports

Register file built using D flip-flops

Implementation of the read ports

[Figure: Register 0, Register 1, ..., Register n-1, Register n feed two multiplexors; Read register number 1 selects Read data 1 and Read register number 2 selects Read data 2]


40/124


Register file: write port

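Note: we still use the real clock to determine when to write

[Figure: the Register number drives an n-to-1 decoder with outputs 0, 1, ..., n-1, n; each decoder output is combined with the Write signal to drive the C input of Register 0 .. Register n, and Register data feeds every D input]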


41/124


Building the Datapath

Use multiplexors to stitch them together

[Figure: single-cycle datapath with PC, Instruction memory, Registers, Sign extend (16 -> 32), Shift left 2, two adders (PC + 4 and branch target), ALU with Zero output, Data memory and multiplexors; control signals RegWrite, ALUSrc, ALU operation (3 bits), MemRead, MemWrite, MemtoReg and PCSrc]


42/124


All of the logic is combinational

We wait for everything to settle down, and the right thing to be done

ALU might not produce right answer right away

we use write signals along with clock to determine when to write

Cycle time determined by length of the longest path

Our Simple Control Structure

We are ignoring some details like setup and hold times!

[Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]


43/124


Control

Selecting the operations to perform (ALU, read/write, etc.)

Controlling the flow of data (multiplexor inputs)

Information comes from the 32 bits of the instruction

Example:

add $8, $17, $18

Instruction Format:

000000 10001 10010 01000 00000 100000

op     rs    rt    rd    shamt funct

ALU's operation based on instruction type and function code
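As a sketch of "information comes from the 32 bits of the instruction", the following Python helper splits an R-type word into its fields. The name `decode_r_type` is illustrative; the bit pattern is the one shown for add $8, $17, $18.

```python
def decode_r_type(word: int) -> dict:
    """Split a 32-bit R-type instruction into its six fields."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


# Bit pattern of add $8, $17, $18 (adjacent string literals are concatenated)
word = int("000000" "10001" "10010" "01000" "00000" "100000", 2)
fields = decode_r_type(word)
```

The decoded fields give rs = 17, rt = 18 and rd = 8, which shows that the destination is the third register field in machine language even though it is written first in assembly.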


44/124


Control: 2 level implementation

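[Figure: the instruction register supplies the 6-bit Opcode (bits 31-26) to Control 2, which produces the 2-bit ALUop; ALUop and the 6-bit Funct. field (bits 5-0) feed Control 1 (ALU control), which produces the 3-bit control input of the ALU]

ALUop:
00: lw, sw
01: beq
10: add, sub, and, or, slt

ALU control:
000: and
001: or
010: add
110: sub
111: set on less than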

Datapath with Control


45/124



[Figure: single-cycle datapath with control. The Control unit decodes Instruction[31-26] into RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite. Instruction[25-21] and Instruction[20-16] select the read registers, Instruction[20-16] or Instruction[15-11] selects the write register, Instruction[15-0] is sign-extended from 16 to 32 bits, and Instruction[5-0] feeds the ALU control]


46/124


What should the ALU do with this instruction

example: lw $1, 100($2)

35 2 1 100

op rs rt 16 bit offset

ALU control input

000 AND
001 OR
010 add
110 subtract
111 set-on-less-than

Why is the code for subtract 110 and not 011?

ALU Control1


47/124


Must describe hardware to compute 3-bit ALU control input

given instruction type (ALU Operation class, computed from instruction type)
00 = lw, sw
01 = beq
10 = arithmetic

function code for arithmetic

Describe it using a truth table (can turn into gates):

ALU Control1

ALUOp1 ALUOp0 | F5 F4 F3 F2 F1 F0 | Operation
0      0      | X  X  X  X  X  X  | 010
X      1      | X  X  X  X  X  X  | 110
1      X      | X  X  0  0  0  0  | 010
1      X      | X  X  0  0  1  0  | 110
1      X      | X  X  0  1  0  0  | 000
1      X      | X  X  0  1  0  1  | 001
1      X      | X  X  1  0  1  0  | 111

(ALUOp and Funct field are inputs; Operation is the output)
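The truth table can be written as a small Python function. This is a behavioural sketch only, not the gate-level implementation; the name `alu_control` is illustrative.

```python
def alu_control(alu_op: int, funct: int) -> int:
    """Return the 3-bit ALU control input for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:        # lw, sw: add to form the address
        return 0b010
    if alu_op & 0b01:         # beq (ALUOp = X1): subtract to compare
        return 0b110
    # ALUOp = 1X: arithmetic, decided by funct bits F3..F0 (F5, F4 ignored)
    table = {
        0b0000: 0b010,  # add
        0b0010: 0b110,  # subtract
        0b0100: 0b000,  # AND
        0b0101: 0b001,  # OR
        0b1010: 0b111,  # set on less than
    }
    return table[funct & 0b1111]
```

For instance `alu_control(0b10, 0b100000)` (the funct code of add) returns `0b010`, the same ALU operation that lw and sw request through ALUOp = 00.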


48/124


ALU Control1

Simple combinational logic (truth tables)

[Figure: ALU control block drawn as gates; inputs ALUOp1, ALUOp0 and funct bits F3, F2, F1, F0 of F(5-0); outputs Operation2, Operation1, Operation0]


49/124


Deriving Control2 signals

Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1

9 control (output) signals

Determine these control signals directly from the opcodes (6-bit input):

R-format: 0

lw: 35

sw: 43

beq: 4
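The control table can be expressed as a lookup keyed by opcode. The sketch below is a Python model under the assumption that don't-care entries (X) are represented as `None`; the names `SIGNALS`, `CONTROL` and `main_control` are illustrative.

```python
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "ALUOp1", "ALUOp0")

# One row per opcode; None marks a don't-care (X) entry.
CONTROL = {
    0:  (1, 0, 0, 1, 0, 0, 0, 1, 0),          # R-format
    35: (0, 1, 1, 1, 1, 0, 0, 0, 0),          # lw
    43: (None, 1, None, 0, 0, 1, 0, 0, 0),    # sw
    4:  (None, 0, None, 0, 0, 0, 1, 0, 1),    # beq
}


def main_control(opcode: int) -> dict:
    """Look up the nine control signals for a 6-bit opcode."""
    return dict(zip(SIGNALS, CONTROL[opcode]))
```

A hardware PLA computes the same mapping with one product term per instruction class; the dictionary is just the software equivalent.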


50/124


Control 2

PLA example implementation

[Figure: PLA with inputs Op5..Op0, one product term each for R-format, lw, sw and beq, and outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]


51/124


Single Cycle Implementation

Calculate cycle time assuming negligible delays except:

memory (2ns), ALU and adders (2ns), register file access (1ns)

[Figure: complete single-cycle datapath with control signals RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, ALUOp and PCSrc]


52/124


Single Cycle Implementation

Memory (2ns), ALU & adders (2ns), reg. file access (1ns)

Fixed length clock: longest instruction is the lw which requires 8 ns

Variable clock length (not realistic, just as exercise):

R-instr: 6 ns

Load: 8 ns

Store: 7 ns

Branch: 5 ns

Jump: 2 ns

Average depends on instruction mix
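The per-instruction times can be reproduced by summing the delays along each instruction's path. The path breakdown below is an assumption chosen to be consistent with the totals on the slide, and the instruction mix passed to `average_time` is hypothetical.

```python
MEM, ALU, REG = 2, 2, 1  # delays in ns, as given on the slide

# Units on each instruction class's critical path (assumed breakdown).
PATHS = {
    "R-instr": [MEM, REG, ALU, REG],       # fetch, read regs, ALU, write reg
    "Load":    [MEM, REG, ALU, MEM, REG],  # ... plus data memory access
    "Store":   [MEM, REG, ALU, MEM],
    "Branch":  [MEM, REG, ALU],
    "Jump":    [MEM],
}

latency = {name: sum(path) for name, path in PATHS.items()}
cycle_time = max(latency.values())  # a fixed clock must fit the slowest class


def average_time(mix: dict) -> float:
    """Average instruction time with a variable clock; mix maps class -> fraction."""
    return sum(fraction * latency[name] for name, fraction in mix.items())
```

With a fixed clock every instruction pays `cycle_time`, so the gap between `cycle_time` and `average_time(mix)` is the time wasted on the shorter instructions.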

Where we are headed


53/124


Where we are headed

Single Cycle Problems:

what if we had a more complicated instruction like floating point?

wasteful of area: NO Sharing of Hardware resources

One Solution:

use a smaller cycle time

have different instructions take different numbers of cycles

a multicycle datapath:

[Figure: multicycle datapath; PC -> Memory (Address; Instruction or data) -> Instruction register (IR) and Memory data register (MDR) -> Registers -> A and B -> ALU -> ALUOut]


54/124


We will be reusing functional units

ALU used to compute address and to increment PC

Memory used for instruction and data

Add registers after every major functional unit

Our control signals will not be determined solely by instruction

e.g., what should the ALU do for a subtract instruction?

We'll use a finite state machine (FSM) or microcode for control

Multicycle Approach


55/124


Finite state machines:

a set of states and

next state function (determined by current state and the input)

output function (determined by current state and possibly input)

We'll use a Moore machine (output based only on current state)

Review: finite state machines

[Figure: Inputs and Current state feed the Next-state function; its Next state output is stored on the Clock; the Output function produces the Outputs]


56/124


Break up the instructions into steps, each step takes a cycle

balance the amount of work to be done

restrict each cycle to use only one major functional unit

At the end of a cycle

store values for use in later cycles (easiest thing to do)

introduce additional internal registers

Notice: we distinguish

processor state: programmer visible registers

internal state: programmer invisible registers (like IR, MDR, A, B, and ALUout)

Multicycle Approach


57/124


Multicycle Approach

[Figure: multicycle datapath with PC, a single Memory (Address, MemData, Write data), Instruction register (Instruction[25-21], [20-16], [15-11], [15-0]), Memory data register, Registers, Sign extend (16 -> 32), Shift left 2, registers A, B and ALUOut, the ALU with Zero output, and multiplexors selecting the ALU inputs (including the constant 4)]

Multicycle Approach


58/124


Multicycle Approach

Note that previous picture does not include:

branch support

jump support

Control lines and logic

Tclock > max (ALU delay, Memory access, Regfile access)

See book for complete picture

Five Execution Steps


59/124


Instruction Fetch

Instruction Decode and Register Fetch

Execution, Memory Address Computation, or Branch Completion

Memory Access or R-type instruction completion

Write-back step

Five Execution Steps

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

Step 1: Instruction Fetch


60/124


Use PC to get instruction and put it in the Instruction Register

Increment the PC by 4 and put the result back in the PC

Can be described succinctly using RTL "Register-Transfer Language"

IR = Memory[PC];

PC = PC + 4;

Can we figure out the values of the control signals?

What is the advantage of updating the PC now?

Step 1: Instruction Fetch

Step 2: Instruction Decode and


61/124


Read registers rs and rt in case we need them

Compute the branch address in case the instruction is a branch

Previous two actions are done optimistically!!

RTL:

A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extend(IR[15-0]) << 2);


62/124


ALU is performing one of four functions, based on instruction type

Memory Reference:

ALUOut = A + sign-extend(IR[15-0]);

R-type:

ALUOut = A op B;

Branch:

if (A==B) PC = ALUOut;

Jump:

PC = PC[31-28] || (IR[25-0] << 2);


63/124


Loads and stores access memory

MDR = Memory[ALUOut];

or

Memory[ALUOut] = B;

R-type instructions finish

Reg[IR[15-11]] = ALUOut;

The write actually takes place at the end of the cycle on the edge

Step 4 (R-type or Memory-access)

Write back step


64/124


Memory read completion step

Reg[IR[20-16]]= MDR;

What about all the other instructions?

Write-back step

Summary execution steps


65/124


Step name | Action for R-type instructions | Action for memory-reference instructions | Action for branches | Action for jumps

Instruction fetch (all instruction types):
IR = Memory[PC]
PC = PC + 4

Instruction decode/register fetch (all instruction types):
A = Reg[IR[25-21]]
B = Reg[IR[20-16]]
ALUOut = PC + (sign-extend(IR[15-0]) << 2)


66/124


How many cycles will it take to execute this code?

lw $t2, 0($t3)

lw $t3, 4($t3)

beq $t2, $t3, L1 #assume not taken

add $t5, $t2, $t3

sw $t5, 8($t3)

L1: ...

What is going on during the 8th cycle of execution?

In what cycle does the actual addition of $t2 and $t3 take place?
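One way to approach these questions is to lay the instructions out one after another. The Python sketch below assumes the usual multicycle step counts (lw 5, sw 4, R-type 4, beq 3), which follow from the execution steps but are not listed on this slide.

```python
# Cycles per instruction class in the multicycle design (assumed, 3 to 5 steps).
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}

program = ["lw", "lw", "beq", "add", "sw"]  # beq assumed not taken

start = 1
schedule = []
for instr in program:
    end = start + CYCLES[instr] - 1
    schedule.append((instr, start, end))  # first and last cycle, inclusive
    start = end + 1

total_cycles = schedule[-1][2]
```

Under these assumptions the second lw occupies cycles 6 to 10, so cycle 8 is its third step, and the add occupies cycles 14 to 17 with its ALU step in the third of those cycles.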

Simple Questions

Implementing the Control


67/124


Value of control signals is dependent upon:

what instruction is being executed

which step is being performed

Use the information we have accumulated to specify a finite state machine (FSM)

specify the finite state machine graphically, or

use microprogramming

Implementation can be derived from specification

Implementing the Control

68/124

Graphical Specification of FSM

How many state bits will we need?

[Figure: state diagram of the control FSM; states and their outputs are listed below]

State  Name                               Outputs
0      Instruction fetch (Start)          MemRead, ALUSrcA=0, IorD=0, IRWrite, ALUSrcB=01, ALUOp=00, PCWrite, PCSource=00
1      Instruction decode/register fetch  ALUSrcA=0, ALUSrcB=11, ALUOp=00
2      Memory address computation         ALUSrcA=1, ALUSrcB=10, ALUOp=00
3      Memory access (Op='LW')            MemRead, IorD=1
4      Write-back step                    RegDst=0, RegWrite, MemtoReg=1
5      Memory access (store)              MemWrite, IorD=1
6      Execution                          ALUSrcA=1, ALUSrcB=00, ALUOp=10
7      R-type completion                  RegDst=1, RegWrite, MemtoReg=0
8      Branch completion                  ALUSrcA=1, ALUSrcB=00, ALUOp=01, PCWriteCond, PCSource=01
9      Jump completion (Op='J')           PCWrite, PCSource=10

Finite State Machine for Control


69/124


Implementation:

Finite State Machine for Control

[Figure: control logic block. Inputs: opcode field Op5..Op0 from the instruction register and current state S3..S0 from the state register. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and next state NS3..NS0]

70/124

PLA Implementation

If I picked a horizontal or vertical line could you explain it?

What type of FSM is used? Mealy or Moore?

[Figure: PLA with inputs Op5..Op0 (opcode) and S3..S0 (current state), and outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst (datapath control) and NS3..NS0 (next state)]

(see book)

Pipelined implementation


71/124


Pipelined implementation

Pipelining

Pipelined datapath

Pipelined control

Hazards:

Structural

Data

Control

Exceptions

Scheduling

For details see the book (chapter 6)

Pipelining


72/124


Pipelining

Improve performance by increasing instruction throughput

[Figure, top: non-pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) in program execution order. Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg and takes 8 ns, so a new instruction starts every 8 ns]

[Figure, bottom: pipelined execution of the same three loads. Each stage takes 2 ns and a new instruction starts every 2 ns]

Pipelining


73/124


Pipelining

Ideal speedup = number of stages

Do we achieve this?
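A quick calculation helps with this question. The Python sketch below uses the numbers from the lw example (8 ns per unpipelined instruction, five stages of 2 ns each); the function names are illustrative.

```python
def unpipelined_time(n: int, instr_time_ns: int = 8) -> int:
    """Each instruction runs to completion before the next one starts."""
    return n * instr_time_ns


def pipelined_time(n: int, stages: int = 5, stage_time_ns: int = 2) -> int:
    """The first instruction fills the pipeline, then one finishes per stage time."""
    return (stages + n - 1) * stage_time_ns


def speedup(n: int) -> float:
    """Ratio of unpipelined to pipelined execution time for n instructions."""
    return unpipelined_time(n) / pipelined_time(n)
```

For the three loads this gives 24 ns against 14 ns. For long instruction streams the ratio approaches 8 / 2 = 4 rather than 5, because every stage must be as long as the slowest one (2 ns), so five stages cost 10 ns per instruction instead of 8 ns.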

Pipelining


74/124


Pipelining

What makes it easy

all instructions are the same length

just a few instruction formats

memory operands appear only in loads and stores

What makes it hard?

structural hazards: suppose we had only one memory

control hazards: need to worry about branch instructions

data hazards: an instruction depends on a previous instruction

We'll build a simple pipeline and look at these issues

We'll talk about modern processors and what really makes it hard:

exception handling

trying to improve performance with out-of-order execution, etc.

Basic idea: start from single cycle impl.


75/124


What do we need to add to actually split the datapath into stages?

[Figure: single-cycle datapath divided into five stages: IF: Instruction fetch, ID: Instruction decode/register file read, EX: Execute/address calculation, MEM: Memory access, WB: Write back]

Pipelined Datapath


76/124


Can you find a problem even if there are no dependencies?

What instructions can we execute to manifest the problem?

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the five stages]

Corrected Datapath


77/124


Corrected Datapath

[Figure: corrected pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]

Graphically Representing Pipelines


78/124


Graphically Representing Pipelines

Can help with answering questions like:

how many cycles does it take to execute this code?

what is the ALU doing during cycle 4?

use this representation to help understand datapaths

[Figure: pipeline diagram over CC1..CC6 (time in clock cycles) for lw $10, 20($1) followed by sub $11, $2, $3 (program execution order); each instruction passes through IM, Reg, ALU, DM, Reg]

Pipeline Control


79/124


Pipeline Control

[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) annotated with the control signals PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite and MemtoReg]

Pipeline control


80/124


We have 5 stages. What needs to be controlled in each stage?

Instruction Fetch and PC Increment

Instruction Decode / Register Fetch

Execution

Memory Stage

Write Back

How would control be handled in an automobile plant?

a fancy control center telling everyone what to do?

should we use a finite state machine?

Pipeline control

Pipeline Control


81/124


Pass control signals along just like the data:

             Execution/Address Calculation     Memory access stage        Write-back stage
             stage control lines               control lines              control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0         0       0        0         1         0
lw           0       0       0       1         0       1        0         1         1
sw           X       0       0       1         0       0        1         0         X
beq          X       0       1       0         1       0        0         0         X

(compare single cycle control!)

[Figure: the Control unit decodes the Instruction held in IF/ID; the EX, M and WB control groups enter ID/EX, M and WB continue into EX/MEM, and WB continues into MEM/WB]



82/124



[Figure: pipelined datapath with the Control unit in the decode stage; the EX, M and WB control fields are carried through the ID/EX, EX/MEM and MEM/WB pipeline registers and drive RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, RegWrite, MemtoReg and PCSrc]


83/124


Hazards: problems due to pipelining


84/124


Hazards: problems due to pipelining

Hazard types:

Structural

same resource is needed multiple times in the same cycle

Data

data dependencies limit pipelining

Control

next executed instruction may not be the next specified instruction

Structural hazards


85/124


Structural hazards

Examples:

Two accesses to a single ported memory

Two operations need the same function unit at the same time

Two operations need the same function unit in successive cycles, but the unit is not pipelined

Solutions:

stalling

add more hardware


86/124

Data hazards


87/124


Data hazards

Data dependencies:

RaW (read-after-write)

WaW (write-after-write)

WaR (write-after-read)

Hardware solution:

Forwarding / Bypassing

Detection logic

Stalling

Software solution: Scheduling

Data dependences


88/124


Three types: RaW, WaR and WaW

add r1, r2, 5 ; r1 := r2+5
sub r4, r1, r3 ; RaW of r1

add r1, r2, 5
sub r2, r4, 1 ; WaR of r2

add r1, r2, 5
sub r1, r1, 1 ; WaW of r1

st r1, 5(r2) ; M[r2+5] := r1
ld r5, 0(r4) ; RaW if 5+r2 = 0+r4

WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom!

Problems for your compiler and Pentium!

use register renaming to solve this!
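The three register dependence types can be detected mechanically. The Python sketch below assumes each instruction is described by a dict with a destination register `dst` and a tuple of source registers `src`; this representation is made up for illustration, and memory dependences such as the st/ld pair are not covered.

```python
def dependences(first: dict, second: dict) -> set:
    """Classify register dependences of `second` on an earlier `first`."""
    found = set()
    if first["dst"] in second["src"]:
        found.add("RaW")   # second reads what first writes
    if second["dst"] in first["src"]:
        found.add("WaR")   # second overwrites what first reads
    if first["dst"] == second["dst"]:
        found.add("WaW")   # both write the same register
    return found


add = {"dst": "r1", "src": ("r2",)}   # add r1, r2, 5
```

Applied to the third pair, `sub r1, r1, 1` comes out as both WaW and RaW, since it reads r1 before overwriting it.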

RaW on MIPS pipeline


89/124


RaW on MIPS pipeline

[Figure: pipeline diagram over CC 1..CC 9 (time in clock cycles) for the sequence below (program execution order); each instruction passes through IM, Reg, ALU, DM, Reg]

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Value of register $2 in CC 1..CC 9: 10 10 10 10 10/20 20 20 20 20

Forwarding


90/124


Use temporary results, don't wait for them to be written

register file forwarding to handle read/write to same register

ALU forwarding

What if this $2 was $13?

[Figure: pipeline diagram over CC 1..CC 9 (time in clock cycles) for the sequence below (program execution order), with forwarding paths between stages]

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Clock cycle:           CC 1  CC 2  CC 3  CC 4  CC 5   CC 6  CC 7  CC 8  CC 9
Value of register $2:  10    10    10    10    10/20  20    20    20    20
Value of EX/MEM:       X     X     X     20    X      X     X     X     X
Value of MEM/WB:       X     X     X     X     20     X     X     X     X

Forwarding hardware


91/124


Forwarding hardware

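ALU forwarding circuitry principle:

[Figure: ALU with two operands coming from the register file and its result going to the register file, plus bypass multiplexors feeding results back to the ALU inputs]

Note: there are two options

buf - ALU - bypass mux - buf

buf - bypass mux - ALU - buf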

Forwarding ID/EX


92/124


[Figure: pipelined datapath with a Forwarding unit. Its inputs are the Rs and Rt register numbers of the instruction in the EX stage, EX/MEM.RegisterRd and MEM/WB.RegisterRd; its outputs ForwardA and ForwardB control the multiplexors at the two ALU inputs]

Forwarding check


93/124


Check for matching register-ids:

For each source-id of operation in the EX-stage check if there is a matching pending dest-id

Example:

if (EX/MEM.RegWrite) and
   (EX/MEM.RegisterRd != 0) and
   (EX/MEM.RegisterRd = ID/EX.RegisterRs)
then ForwardA = 10

Q. How many comparators do we need?
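The check above covers forwarding from EX/MEM. A fuller sketch in Python might look as follows; the second test, against MEM/WB with select value 01, is not on this slide and follows the usual textbook scheme, so treat it as an assumption. The dict keys mirror the pipeline register field names.

```python
def forward_select(src: int, ex_mem: dict, mem_wb: dict) -> int:
    """Return the 2-bit forwarding mux select for one ALU source register.

    ex_mem and mem_wb hold "RegWrite" (bool) and "RegisterRd" (int).
    0b10: take the ALU result from EX/MEM; 0b01: take the value from MEM/WB;
    0b00: use the value read from the register file.
    """
    if (ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0
            and ex_mem["RegisterRd"] == src):
        return 0b10                      # the most recent result wins
    if (mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0
            and mem_wb["RegisterRd"] == src):
        return 0b01
    return 0b00
```

ForwardA would be `forward_select(ID/EX.RegisterRs, ...)` and ForwardB the same call with ID/EX.RegisterRt. The test against register 0 prevents forwarding a "result" for $zero, which must always read as 0.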

Can't always forward


94/124


Load word can still cause a hazard:

an instruction tries to read register r following a load to the same r

Need a hazard detection unit to stall the load instruction

[Figure: pipeline diagram over CC 1..CC 9 (time in clock cycles) for the sequence below (program execution order); each instruction passes through IM, Reg, ALU, DM, Reg]

lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

Stalling


95/124


We can stall the pipeline by keeping an instruction in the same stage

[Figure: pipeline diagram over CC1..CC10 (time in clock cycles) for the sequence below (program execution order), with a bubble inserted after the lw]

lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

In CC4 the ALU is not used; Reg and IM are redone

Hazard Detection Unit


96/124


[Figure: pipelined datapath with forwarding unit and hazard detection unit. The hazard detection unit receives ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt; it drives PCWrite, IF/IDWrite and a mux that selects 0 for the control signals entering ID/EX.]
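
A minimal sketch of what the hazard detection unit decides, in Python. The signal names follow the figure labels; the stall condition itself is not spelled out on the slide and is assumed here from the usual textbook load-use check. The function names and the "zero_control" key are made up for this sketch.

```python
def load_use_hazard(id_ex_mem_read: bool, id_ex_rt: int,
                    if_id_rs: int, if_id_rt: int) -> bool:
    """True if the instruction in ID reads the register a load in EX will write."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


def stall_controls(hazard: bool) -> dict:
    """Control outputs: on a hazard, freeze PC and IF/ID and insert a bubble."""
    return {
        "PCWrite": not hazard,      # keep fetching only if there is no hazard
        "IF/IDWrite": not hazard,   # hold the instruction in ID during a stall
        "zero_control": hazard,     # mux selects 0: a bubble enters ID/EX
    }


# lw $2,20($1) is in EX (MemRead, Rt = 2); and $4,$2,$5 is in ID (Rs = 2, Rt = 5).
hazard = load_use_hazard(True, 2, 2, 5)
controls = stall_controls(hazard)
```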

Software only solution?


97/124


Have compiler guarantee that no hazards occur

Example: where do we insert the NOPs ?

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $13, 100($2)

Problem: this really slows us down!

sub $2, $1, $3
nop
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
nop
sw $13, 100($2)
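
The NOP placement can be reproduced with a small Python sketch of what such a compiler pass might do. It assumes no forwarding hardware and a register file that is written in the first half of a cycle and read in the second half, so a consumer must sit at least three slots after its producer. The function name insert_nops and the tuple layout are made up for this sketch.

```python
def insert_nops(program, min_distance=3):
    """Insert nops so every reader is at least `min_distance` slots after the writer.

    program: list of (text, dest, sources); dest is None for instructions
    that write no register (e.g. sw).
    """
    out = []          # scheduled instruction texts, including nops
    last_write = {}   # register number -> index in `out` of its latest producer
    for text, dest, sources in program:
        # Earliest slot at which all source registers can be read safely.
        earliest = max((last_write[r] + min_distance
                        for r in sources if r in last_write), default=0)
        while len(out) < earliest:
            out.append("nop")
        if dest is not None:
            last_write[dest] = len(out)
        out.append(text)
    return out


program = [
    ("sub $2, $1, $3", 2, [1, 3]),
    ("and $12, $2, $5", 12, [2, 5]),
    ("or $13, $6, $2", 13, [6, 2]),
    ("add $14, $2, $2", 14, [2, 2]),
    ("sw $13, 100($2)", None, [13, 2]),
]
scheduled = insert_nops(program)
```

For this input the sketch yields three nops for five useful instructions, which illustrates why the software-only approach is slow.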

Control hazards


98/124


Control operations may change the sequential flow of instructions

branch

jump

call (jump and link)

return (exception/interrupt and rti / return from interrupt)

Control hazard: Branch


99/124


Branch actions:

Compute new address

Determine condition

Perform the actual branch (if taken): PC := new address

Branch example


100/124


[Figure: pipeline diagram over clock cycles CC 1 to CC 9 for the sequence
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
The leading numbers are instruction addresses; 72 is the branch target.]
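
The target address 72 in this example follows from the MIPS branch rule: the 16-bit offset counts words relative to the instruction after the branch. A short Python sketch (the function names are made up here):

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of `value` as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset_field: int) -> int:
    """New PC for a taken branch: (PC + 4) + sign-extended offset * 4."""
    return pc + 4 + (sign_extend16(offset_field) << 2)


target = branch_target(40, 7)  # beq at address 40 with offset 7
```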

Branching


101/124


Squash pipeline:

When we decide to branch, other instructions are already in the pipeline!

We are predicting branch not taken: we need to add hardware for flushing instructions if we are wrong

Branch with predict not taken


102/124


[Figure: pipeline diagram over clock cycles with five instructions, each passing through IF ID EX MEM WB: "Branch L", the instructions fetched after it under predict not taken, and the instruction at label L.]

Branch speedup


103/124


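Earlier address computation
Earlier condition calculation
Put both in the ID pipeline stage
  adder
  comparator

[Figure: pipeline diagram over clock cycles with three instructions, each passing through IF ID EX MEM WB: "Branch L", the instruction fetched after it under predict not taken, and the instruction at label L.]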

Improved branching / flushing IF/ID


104/124


[Figure: pipelined datapath with hazard detection unit and forwarding unit, in which the branch target adder (with sign extend and shift left 2) and the register comparator (=) are placed in the ID stage; the IF.Flush signal clears the IF/ID register.]

Exception support


105/124


Types of exceptions:
  Overflow
  I/O device request
  Operating system call
  Undefined instruction
  Hardware malfunction
  Page fault

Precise exception:
  finish previous instructions (which are still in the pipeline)
  flush excepting and following instructions, redo them after handling the exception(s)

Exceptions


106/124


Changes needed for handling overflow exception of an operation in EX stage (see book for details):
  Extend PC input mux with extra entry with fixed address
  Add EPC register recording the ID/EX stage PC
    this is the address of the next instruction!
  Cause register recording exception type

E.g., in case of overflow exception insert 3 bubbles; flush the following stages:
  IF/ID stage
  ID/EX stage
  EX/MEM stage

Scheduling, why?


107/124


Let's look at the execution time:

T_execution = N_cycles x T_cycle = N_instructions x CPI x T_cycle

Scheduling may reduce T_execution
Reduce CPI (cycles per instruction)
  early scheduling of long latency operations
  avoid pipeline stalls due to structural, data and control hazards
  allow N_issue > 1 and therefore CPI < 1
Reduce N_instructions
  compact many operations into each instruction (VLIW)
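
As a worked instance of the execution time formula, here is a small Python sketch. All numbers are hypothetical and chosen only for illustration: one million instructions, a 1 ns cycle time, and a CPI that scheduling lowers from 1.5 to 1.0.

```python
def execution_time(n_instructions: int, cpi: float, t_cycle: float) -> float:
    """T_execution = N_instructions x CPI x T_cycle (seconds if t_cycle is in seconds)."""
    return n_instructions * cpi * t_cycle


# Hypothetical values, not taken from the slides.
unscheduled = execution_time(1_000_000, 1.5, 1e-9)
scheduled = execution_time(1_000_000, 1.0, 1e-9)
speedup = unscheduled / scheduled
```

With N_instructions and T_cycle unchanged, the speedup equals the ratio of the two CPI values.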

Scheduling data hazards: example 1


108/124


Try and avoid RaW stalls (in this case load interlocks)!

E.g., reorder these instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

into:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
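
The effect of the reordering can be checked with a small Python sketch that counts load interlocks. It assumes a pipeline with forwarding in which only the instruction directly after a load stalls (one cycle) when it reads the loaded register, and it only understands lw and sw in the form used here. The function names are made up for this sketch.

```python
import re


def parse(line):
    """Parse 'lw $t0, 0($t1)' or 'sw $t2, 0($t1)' into (op, reg, base)."""
    m = re.fullmatch(r"(lw|sw)\s+(\$\w+),\s*-?\d+\((\$\w+)\)", line.strip())
    if m is None:
        raise ValueError(f"unsupported instruction: {line!r}")
    return m.groups()


def count_load_interlocks(lines):
    """Count instructions that read a register loaded by the instruction just before."""
    stalls = 0
    prev_load_dest = None
    for line in lines:
        op, reg, base = parse(line)
        # lw reads only the base register; sw reads the base and the stored register.
        reads = {base} | ({reg} if op == "sw" else set())
        if prev_load_dest in reads:
            stalls += 1
        prev_load_dest = reg if op == "lw" else None
    return stalls


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
stalls_original = count_load_interlocks(original)
stalls_reordered = count_load_interlocks(reordered)
```

Swapping the two stores is safe here because they write to different addresses.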

Scheduling data hazards: example 2


109/124


Avoiding RaW stalls: reordering instructions for the following program (by you or the compiler)

Code:
a = b + c
d = e - f

Unscheduled code:
Lw R1,b
Lw R2,c
Add R3,R1,R2      interlock
Sw a,R3
Lw R1,e
Lw R2,f
Sub R4,R1,R2      interlock
Sw d,R4

Scheduled code:
Lw R1,b
Lw R2,c
Lw R5,e           extra reg. needed!
Add R3,R1,R2
Lw R2,f
Sw a,R3
Sub R4,R5,R2
Sw d,R4


110/124

Scheduling control hazards


111/124



What can we do about control hazards and CPI penalty?

Keep penalty P_branch low:
  Early computation of new PC
  Early determination of condition
  Visible branch delay slots filled by compiler (MIPS)

Branch prediction

Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro95]

Remove branches: if-conversion
  Conditional instructions: CMOVE, cond skip next
  Guarding all instructions: TriMedia

Branch delay slot


112/124


Add a branch delay slot:
  the next instruction after a branch is always executed
  rely on compiler to fill the slot with something useful

Is this a good idea? Let's look at how it works

Branch delay slot scheduling


113/124


op 1
beq r1,r2, L
.............
op 2              'fall-through'
.............
L: op 3           branch target
.............

Q. What to put in the delay slot?

Summary


114/124


Modern processors are (deeply) pipelined, to reduce T_cycle and aim at CPI = 1

Hazards increase CPI

Several software and hardware measures are taken to avoid or reduce hazards

Not discussed, but important developments:
  Multi-issue further reduces CPI
  Branch prediction to avoid high branch penalties
  Dynamic scheduling

In all cases: a scheduling compiler needed

Recap of MIPS


115/124


RISC architecture

Register space

Addressing

Instruction format

Pipelining

Why RISC? Keep it simple


116/124


RISC characteristics:
  Reduced number of instructions
  Limited addressing modes
    load-store architecture
    enables pipelining
  Large register set
    uniform (no distinction between e.g. address and data registers)
  Limited number of instruction sizes (preferably one)
    know directly where the following instruction starts
  Limited number of instruction formats
  Memory alignment restrictions
  ......

Based on quantitative analysis
  "the famous MIPS one percent rule": don't even think about it when it's not used more than one percent

Register space


117/124


Name       Register number   Usage
$zero      0                 the constant value 0
$v0-$v1    2-3               values for results and expression evaluation
$a0-$a3    4-7               arguments
$t0-$t7    8-15              temporaries
$s0-$s7    16-23             saved (by callee)
$t8-$t9    24-25             more temporaries
$gp        28                global pointer
$sp        29                stack pointer
$fp        30                frame pointer
$ra        31                return address

32 integer (and 32 floating point) registers of 32 bits each

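Addressing


118/124


1. Immediate addressing: op rs rt Immediate; the operand is the Immediate field itself
2. Register addressing: op rs rt rd ... funct; the operand is in a Register
3. Base addressing: op rs rt Address; Register + Address selects a Byte, Halfword or Word in Memory
4. PC-relative addressing: op rs rt Address; PC + Address selects a Word in Memory
5. Pseudodirect addressing: op Address; Address combined with the upper bits of the PC selects a Word in Memory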

Instruction format


119/124


Example instructions

Instruction          Meaning
add $s1,$s2,$s3      $s1 = $s2 + $s3
addi $s2,$s3,4       $s2 = $s3 + 4
lw $s1,100($s2)      $s1 = Memory[$s2+100]
bne $s4,$s5,L        if $s4 ≠ $s5 goto L
j Label              goto Label

Formats:
R    op rs rt rd shamt funct
I    op rs rt 16 bit address
J    op 26 bit address
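
To make the three formats concrete, here is a Python sketch that packs the fields into 32-bit words. It assumes the standard MIPS field widths (op 6 bits, rs/rt/rd/shamt 5 bits each, funct 6 bits) and the standard encodings op = 0, funct = 0x20 for add and op = 0x23 for lw; the function names are made up for this sketch.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """R format: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm):
    """I format: op(6) rs(5) rt(5) 16-bit address/immediate."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(op, addr):
    """J format: op(6) 26-bit address."""
    return (op << 26) | (addr & 0x3FFFFFF)


S1, S2, S3 = 17, 18, 19  # register numbers of $s1, $s2, $s3

add_word = encode_r(0, S2, S3, S1, 0, 0x20)   # add $s1,$s2,$s3
lw_word = encode_i(0x23, S2, S1, 100)         # lw $s1,100($s2)
```

Note that the destination comes first in assembly syntax but sits in the rd field (R format) or the rt field (I format) of the encoded word.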

Pipelining


120/124


All integer instructions fit into the following pipeline

time -->
IF  ID  EX  MEM WB
    IF  ID  EX  MEM WB
        IF  ID  EX  MEM WB
            IF  ID  EX  MEM WB
                IF  ID  EX  MEM WB

Other architecture styles


121/124


Accumulator architecture
  one operand (in register or memory), accumulator almost always implicitly used

Stack
  zero operand: all operands implicit (on TOS)

Register (load store)
  three operands, all in registers
  loads and stores are the only instructions accessing memory (i.e. with a memory (indirect) addressing mode)

Register-Memory
  two operands, one in memory

Memory-Memory
  three operands, may be all in memory

(there are more varieties / combinations)

Accumulator architecture


122/124


[Figure: accumulator architecture datapath with Accumulator, ALU, Memory, address registers and latches.]

Example code: a = b+c;

load b; // accumulator is implicit operand

add c;

store a;

Stack architecture


123/124


Example code: a = b+c;

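push b;
push c;
add;
pop a;

stack contents (top of stack first):
after push b:   b
after push c:   c, b
after add:      b+c
after pop a:    (empty)

[Figure: stack architecture datapath with ALU, Memory, top of stack, stack pt and latches.]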
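
The four-instruction stack program can be traced with a minimal Python sketch of a zero-operand machine. The interpreter, its name run_stack_machine and the values b = 2, c = 3 are assumptions made for illustration only.

```python
def run_stack_machine(program, memory):
    """Execute push/add/pop instructions; operands of add are implicit (on TOS)."""
    stack = []
    for instr in program:
        op, *args = instr.split()
        if op == "push":
            stack.append(memory[args[0]])    # copy a memory variable onto the stack
        elif op == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)       # result replaces the two operands
        elif op == "pop":
            memory[args[0]] = stack.pop()    # store top of stack into memory
        else:
            raise ValueError(f"unknown instruction: {instr!r}")
    return memory


memory = {"a": 0, "b": 2, "c": 3}  # hypothetical values
run_stack_machine(["push b", "push c", "add", "pop a"], memory)
```

Only push and pop name a memory location; add needs no operand field at all, which is what "zero operand" means.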

Other architecture styles


124/124

Let's look at the code for C = A + B

Stack          Accumulator    Register-Memory   Memory-Memory   Register
Architecture   Architecture                                     (load-store)
Push A         Load A         Load r1,A         Add C,B,A       Load r1,A
Push B         Add B          Add r1,B                          Load r2,B
Add            Store C        Store C,r1                        Add r3,r1,r2
Pop C                                                           Store C,r3
