mi ps details
TRANSCRIPT
-
7/30/2019 Mi Ps Details
1/124
Embedded Processor Architecture
TU/e 5kk73
Henk Corporaal
Bart Mesman
RISC Instruction Set
Implementation Alternatives
== using MIPS as example ==
-
H.Corporaal EmbProcArch 5kk73 2
Topics
MIPS ISA: Instruction Set Architecture
MIPS single cycle implementation
MIPS multi-cycle implementation
MIPS pipelined implementation
Pipeline hazards
Recap of RISC principles
Other architectures
Based on the book: ch2-4 (4th ed)
Many slides; I'll go quickly and skip some
-
Main Types of Instructions
Arithmetic
  Integer
  Floating Point
Memory access instructions
  Load & Store
Control flow
  Jump
  Conditional Branch
  Call & Return
-
MIPS arithmetic
Most instructions have 3 operands
Operand order is fixed (destination first)
Example:
  C code: A = B + C
  MIPS code: add $s0, $s1, $s2
  ($s0, $s1 and $s2 are associated with variables by compiler)
-
MIPS arithmetic
C code: A = B + C + D;
        E = F - A;
MIPS code: add $t0, $s1, $s2
           add $s0, $t0, $s3
           sub $s4, $s5, $s0
Operands must be registers, only 32 registers provided
Design Principle: smaller is faster. Why?
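A minimal Python sketch of the three-instruction sequence above, to make the register-only, destination-first convention concrete. The `run` helper and the starting register values are hypothetical, and 32-bit wrap-around is ignored.

```python
# Hypothetical mini-interpreter for register-only add/sub (illustration only).
def run(program, regs):
    """Execute (op, dest, src1, src2) tuples on a dict of register values."""
    for op, rd, rs, rt in program:
        if op == "add":
            regs[rd] = regs[rs] + regs[rt]
        elif op == "sub":
            regs[rd] = regs[rs] - regs[rt]
        else:
            raise ValueError(f"unsupported op: {op}")
    return regs

# Assumed mapping: A=$s0, B=$s1, C=$s2, D=$s3, E=$s4, F=$s5
regs = {"$s0": 0, "$s1": 2, "$s2": 3, "$s3": 4, "$s4": 0, "$s5": 20, "$t0": 0}
program = [
    ("add", "$t0", "$s1", "$s2"),  # $t0 = B + C
    ("add", "$s0", "$t0", "$s3"),  # A = $t0 + D
    ("sub", "$s4", "$s5", "$s0"),  # E = F - A
]
run(program, regs)
```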
-
Registers vs. Memory
Arithmetic instruction operands must be registers, only 32 registers provided
Compiler associates variables with registers
What about programs with lots of variables?
[Figure: CPU containing the register file, connected to Memory and IO]
-
Register allocation
Compiler tries to keep as many variables in registers as possible
Some variables cannot be allocated
  large arrays (too few registers)
  aliased variables (variables accessible through pointers in C)
  dynamically allocated variables
    heap
    stack
Compiler may run out of registers => spilling
-
Memory Organization
Viewed as a large, single-dimension array, with an address
A memory address is an index into the array
"Byte addressing" means that successive addresses are one byte apart
[Figure: memory drawn as a column of cells at addresses 0, 1, 2, ..., 6, each holding 8 bits of data]
-
Memory layout: Alignment
Words are aligned
What are the 2 least significant bits of a word address?
[Figure: 32-bit words (bits 31 ... 0) at byte addresses 0, 4, 8, ..., 24; one word is marked "this word is aligned; the others are not!"]
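The alignment question can be checked with a small sketch. It assumes 4-byte words, as in the figure; the function name is hypothetical.

```python
def is_word_aligned(address: int) -> bool:
    """A byte address is word aligned when its two low-order bits are 00."""
    return address & 0b11 == 0

# Byte addresses 0, 4, 8, ... are aligned; anything in between is not.
aligned = [a for a in range(0, 13) if is_word_aligned(a)]
```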
-
Instructions: load and store
Example:
C code: A[8] = h + A[8];
MIPS code: lw $t0, 32($s3)
           add $t0, $s2, $t0
           sw $t0, 32($s3)
Store word operation has no destination (reg) operand
Remember arithmetic operands are registers, not memory!
-
Let's translate some C-code
Can we figure out the code?
swap(int v[], int k)
{ int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}
swap:
  muli $2, $5, 4
  add $2, $4, $2
  lw $15, 0($2)
  lw $16, 4($2)
  sw $16, 0($2)
  sw $15, 4($2)
  jr $31
Explanation:
  index k: $5
  base address of v: $4
  address of v[k] is $4 + 4*$5
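The address arithmetic in the swap routine can be traced with a sketch that models memory as a dict keyed by byte address. The `swap` helper, the base address 100 and the stored values are all assumptions for illustration.

```python
def swap(memory, base, k):
    """Mirror the MIPS sequence: compute &v[k], then exchange v[k] and v[k+1]."""
    addr = base + 4 * k           # muli $2, $5, 4 ; add $2, $4, $2
    r15 = memory[addr]            # lw $15, 0($2)
    r16 = memory[addr + 4]        # lw $16, 4($2)
    memory[addr] = r16            # sw $16, 0($2)
    memory[addr + 4] = r15        # sw $15, 4($2)

memory = {100: 7, 104: 8, 108: 9}  # v[0], v[1], v[2] at assumed base address 100
swap(memory, 100, 1)
```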
-
Consider the load-word and store-word instructions
  What would the regularity principle have us do?
  New principle: Good design demands a compromise
Introduce a new type of instruction format
  I-type for data transfer instructions
  other format was R-type for register
Example: lw $t0, 32($s2)
  op   rs   rt   16 bit number
  35   18   8    32
Machine Language
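A sketch of how the I-type fields pack into one 32-bit word, assuming the usual field widths (6-bit op, 5-bit rs, 5-bit rt, 16-bit immediate). Under the register conventions $s2 is register 18 and $t0 is register 8; the helper name is hypothetical.

```python
def encode_i_type(op, rs, rt, imm):
    """Pack op(6) | rs(5) | rt(5) | immediate(16) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# lw $t0, 32($s2): op=35, rs=18 ($s2), rt=8 ($t0), offset=32
word = encode_i_type(35, 18, 8, 32)
```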
-
Decision making instructions alter the control flow,
i.e., change the "next" instruction to be executed
MIPS conditional branch instructions:
bne $t0, $t1, Label
beq $t0, $t1, Label
Example: if (i==j) h = i + j;
bne $s0, $s1, Label
add $s3, $s0, $s1
Label: ....
Control
-
MIPS unconditional branch instructions:
  j label
Example:
  if (i!=j)            beq $s4, $s5, Lab1
    h=i+j;             add $s3, $s4, $s5
  else                 j Lab2
    h=i-j;       Lab1: sub $s3, $s4, $s5
                 Lab2: ...
Can you build a simple for loop?
Control
-
So far:
Instruction          Meaning
add $s1,$s2,$s3      $s1 = $s2 + $s3
sub $s1,$s2,$s3      $s1 = $s2 - $s3
lw $s1,100($s2)      $s1 = Memory[$s2+100]
sw $s1,100($s2)      Memory[$s2+100] = $s1
bne $s4,$s5,L        Next instr. is at Label if $s4 != $s5
beq $s4,$s5,L        Next instr. is at Label if $s4 = $s5
j Label              Next instr. is at Label
Formats:
  R   op  rs  rt  rd  shamt  funct
  I   op  rs  rt  16 bit address
  J   op  26 bit address
-
We have: beq, bne, what about Branch-if-less-than?
New instruction:
  slt $t0, $s1, $s2
meaning:
  if $s1 < $s2 then
    $t0 = 1
  else
    $t0 = 0
Can use this instruction to build "blt $s1, $s2, Label"
  can now build general control structures
Note that the assembler needs a register to do this, use conventions for registers
Control Flow
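A sketch of how "blt" can be built from slt plus bne. It assumes the temporary is the assembler-reserved register $at; both function names are hypothetical.

```python
def slt(a, b):
    """Set-on-less-than: 1 if a < b, else 0."""
    return 1 if a < b else 0

def blt_taken(s1, s2):
    """Model 'blt $s1, $s2, Label' as slt followed by bne against $zero."""
    at = slt(s1, s2)   # slt $at, $s1, $s2  ($at: assumed assembler temporary)
    return at != 0     # bne $at, $zero, Label
```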
-
MIPS compiler/assembler Conventions
Name      Register number   Usage
$zero     0                 the constant value 0
$v0-$v1   2-3               values for results and expression evaluation
$a0-$a3   4-7               arguments
$t0-$t7   8-15              temporaries
$s0-$s7   16-23             saved (by callee)
$t8-$t9   24-25             more temporaries
$gp       28                global pointer
$sp       29                stack pointer
$fp       30                frame pointer
$ra       31                return address
-
Small constants are used quite frequently (50% of operands)
e.g., A = A + 5;
      B = B + 1;
      C = C - 18;
Solutions? Why not?
  put 'typical constants' in memory and load them
  create hard-wired registers (like $zero) for constants like one
MIPS Instructions:
  addi $29, $29, 4
  slti $8, $18, 10
  andi $29, $29, 6
  ori $29, $29, 4
Constants
-
How about larger constants?
We'd like to be able to load a 32 bit constant into a register
Must use two instructions; new "load upper immediate" instruction
  lui $t0, 1010101010101010
  $t0 = 1010101010101010 0000000000000000   (lower half filled with zeros)
Then must get the lower order bits right, i.e.,
  ori $t0, $t0, 1010101010101010
         1010101010101010 0000000000000000
  ori    0000000000000000 1010101010101010
  =      1010101010101010 1010101010101010
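The lui/ori pair can be checked with plain integer arithmetic. This is a sketch; the helper names simply mirror the instruction mnemonics.

```python
def lui(imm16):
    """Load upper immediate: place imm16 in bits 31-16, zero the lower half."""
    return (imm16 & 0xFFFF) << 16

def ori(reg, imm16):
    """OR immediate: the 16-bit constant is zero-extended before the OR."""
    return reg | (imm16 & 0xFFFF)

t0 = lui(0b1010101010101010)
t0 = ori(t0, 0b1010101010101010)
```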
-
Assembly provides convenient symbolic representation
  much easier than writing down numbers
  e.g., destination first
Machine language is the underlying reality
  e.g., destination is no longer first
Assembly can provide 'pseudoinstructions'
  e.g., move $t0, $t1 exists only in Assembly
  would be implemented using add $t0,$t1,$zero
When considering performance you should count real instructions
Assembly Language vs. Machine Language
-
Instructions:
  bne $t4,$t5,Label   Next instruction is at Label if $t4 != $t5
  beq $t4,$t5,Label   Next instruction is at Label if $t4 = $t5
  j Label             Next instruction is at Label
Formats:
  I   op  rs  rt  16 bit address
  J   op  26 bit address
Addresses are not 32 bits
  How do we handle this with load and store instructions?
Addresses in Branches and Jumps
-
Instructions:
  bne $t4,$t5,Label   Next instruction is at Label if $t4 != $t5
  beq $t4,$t5,Label   Next instruction is at Label if $t4 = $t5
Formats:
  I   op  rs  rt  16 bit address
Could specify a register (like lw and sw) and add it to address
  use Instruction Address Register (PC = program counter)
  most branches are local (principle of locality)
Jump instructions just use high order bits of PC
  address boundaries of 256 MB
What's the next address?
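A sketch of the two target-address calculations, assuming standard MIPS behaviour: the 16-bit branch field is a signed word offset relative to PC + 4, and the 26-bit jump field replaces the low 28 bits of PC + 4. The function names are hypothetical.

```python
def branch_target(pc, offset16):
    """PC-relative: PC + 4 + sign_extend(offset) * 4."""
    if offset16 & 0x8000:            # sign-extend the 16-bit field
        offset16 -= 0x10000
    return (pc + 4 + (offset16 << 2)) & 0xFFFFFFFF

def jump_target(pc, addr26):
    """Pseudodirect: upper 4 bits of PC + 4, then the 26-bit field times 4."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x3FFFFFF) << 2)
```

Because only the upper 4 bits come from the PC, a jump stays inside a 2^28 byte = 256 MB region.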
-
To summarize: MIPS assembly language
Category              Instruction               Example               Meaning                                  Comments
Arithmetic            add                       add $s1, $s2, $s3     $s1 = $s2 + $s3                          Three operands; data in registers
                      subtract                  sub $s1, $s2, $s3     $s1 = $s2 - $s3                          Three operands; data in registers
                      add immediate             addi $s1, $s2, 100    $s1 = $s2 + 100                          Used to add constants
Data transfer         load word                 lw $s1, 100($s2)      $s1 = Memory[$s2 + 100]                  Word from memory to register
                      store word                sw $s1, 100($s2)      Memory[$s2 + 100] = $s1                  Word from register to memory
                      load byte                 lb $s1, 100($s2)      $s1 = Memory[$s2 + 100]                  Byte from memory to register
                      store byte                sb $s1, 100($s2)      Memory[$s2 + 100] = $s1                  Byte from register to memory
                      load upper immediate      lui $s1, 100          $s1 = 100 * 2^16                         Loads constant in upper 16 bits
Conditional branch    branch on equal           beq $s1, $s2, 25      if ($s1 == $s2) go to PC + 4 + 100       Equal test; PC-relative branch
                      branch on not equal       bne $s1, $s2, 25      if ($s1 != $s2) go to PC + 4 + 100       Not equal test; PC-relative
                      set on less than          slt $s1, $s2, $s3     if ($s2 < $s3) $s1 = 1; else $s1 = 0     Compare less than; for beq, bne
                      set less than immediate   slti $s1, $s2, 100    if ($s2 < 100) $s1 = 1; else $s1 = 0     Compare less than constant
Unconditional jump    jump                      j 2500                go to 10000                              Jump to target address
                      jump register             jr $ra                go to $ra                                For switch, procedure return
                      jump and link             jal 2500              $ra = PC + 4; go to 10000                For procedure call
-
1. Immediate addressing
2. Register addressing
3. Base addressing
4. PC-relative addressing
5. Pseudodirect addressing
[Figure: for each mode, the instruction fields (op, rs, rt, rd, funct, Immediate, Address) and how the operand is found: in the instruction itself, in a Register, in Memory (Byte, Halfword, Word) at Register + Address, or at a Word in Memory addressed via the PC]
MIPS (3+2) addressing modes overview
-
MIPS Datapath
Building a datapath to support a subset of the MIPS-I instruction set
A single cycle processor datapath
  all instruction actions in one (long) cycle
A multi-cycle processor datapath
  each instruction takes multiple (shorter) cycles
For details see book (ch 5)
-
Datapath and Control
Datapath:
  Registers & Memories
  Multiplexors
  Buses
  ALUs
Control:
  FSM or Micro-programming
-
Simplified MIPS implementation to contain only:
  memory-reference instructions: lw, sw
  arithmetic-logical instructions: add, sub, and, or, slt
  control flow instructions: beq, j
Generic Implementation:
  use the program counter (PC) to supply instruction address
  get the instruction from memory
  read registers
  use the instruction to decide exactly what to do
All instructions use the ALU after reading the registers
  Why? memory-reference? arithmetic? control flow?
The Processor: Datapath & Control
-
More Implementation Details
Abstract / Simplified View:
[Figure: the PC supplies the Address to the Instruction memory; the Instruction supplies three Register# inputs to the Registers; register Data feeds the ALU; the ALU output is the Address for the Data memory, whose Data returns to the Registers]
Two types of functional units:
  elements that operate on data values (combinational)
  elements that contain state (sequential)
-
Unclocked vs. Clocked
Clocks used in synchronous logic
  when should an element that contains state be updated?
[Figure: clock waveform marking the cycle time, the rising edge and the falling edge]
State Elements
-
The set-reset (SR) latch
  output depends on present inputs and also on past inputs
An unclocked state element
[Figure: SR latch with inputs R and S and outputs Q and _Q; annotation "state change"]
Truth table:
  R  S  Q
  0  0  Q
  0  1  1
  1  0  0
  1  1  ?
-
Output is equal to the stored value inside the element
  (don't need to ask for permission to look at the value)
Change of state (value) is based on the clock
  Latches: whenever the inputs change, and the clock is asserted
  Flip-flop: state changes only on a clock edge (edge-triggered methodology)
A clocking methodology defines when signals can be read and written
  wouldn't want to read a signal at the same time it was being written
Latches and Flip-flops
-
D-latch
Two inputs:
  the data value to be stored (D)
  the clock signal (C) indicating when to read & store D
Two outputs:
  the value of the internal state (Q) and its complement (_Q)
[Figure: D-latch circuit with inputs C and D and outputs Q and _Q, plus a timing diagram of D, C and Q]
-
D flip-flop
Output changes only on the clock edge
[Figure: D flip-flop built from two D latches in series, driven by clock C, with outputs Q and _Q, plus a timing diagram of D, C and Q]
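A behavioural sketch of the latch versus flip-flop distinction. It assumes a master-slave construction in which the second latch sees the inverted clock, so the output changes on the falling edge. The class names and that construction detail are assumptions for illustration.

```python
class DLatch:
    """Level-sensitive: the stored value follows D while the clock is 1."""
    def __init__(self):
        self.q = 0

    def update(self, d, c):
        if c:
            self.q = d
        return self.q

class DFlipFlop:
    """Master-slave pair: master open while C=1, slave open while C=0."""
    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, d, c):
        m = self.master.update(d, c)
        return self.slave.update(m, 1 - c)   # slave uses the inverted clock

ff = DFlipFlop()
trace = [ff.update(d, c) for d, c in [(1, 1), (1, 0), (0, 0), (0, 1), (0, 0)]]
```

In the trace, changing D while the clock is low has no effect on Q; the new value only appears after the next high-to-low transition.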
-
Our Implementation
An edge triggered methodology
Typical execution:
  read contents of some state elements,
  send values through some combinational logic,
  write results to one or more state elements
[Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]
-
3-ported: one write, two read ports
Register File
[Figure: register file with inputs Read reg. #1, Read reg. #2, Write reg. #, Write data and the Write control signal, and outputs Read data 1 and Read data 2]
-
Register file: read ports
[Figure: Register 0, Register 1, ..., Register n-1, Register n feed two multiplexors; Read register number 1 selects Read data 1 and Read register number 2 selects Read data 2]
Implementation of the read ports
Register file built using D flip-flops
-
Register file: write port
Note: we still use the real clock to determine when to write
[Figure: the Register number drives an n-to-1 decoder with outputs 0, 1, ..., n-1, n; each output, combined with Write, drives the C input of Register 0 ... Register n; Register data feeds every D input]
-
Building the Datapath
Use multiplexors to stitch them together
[Figure: single-cycle datapath with PC, Instruction memory, Registers, Sign extend (16 to 32 bits), Shift left 2, ALU, two adders (PC + 4 and branch target), Data memory and multiplexors; control signals RegWrite, ALUSrc, ALU operation (3 bits), MemRead, MemWrite, MemtoReg and PCSrc]
-
Our Simple Control Structure
All of the logic is combinational
We wait for everything to settle down, and the right thing to be done
  ALU might not produce right answer right away
  we use write signals along with clock to determine when to write
Cycle time determined by length of the longest path
[Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]
We are ignoring some details like setup and hold times!
-
Control
Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexor inputs)
Information comes from the 32 bits of the instruction
Example: add $8, $17, $18
Instruction Format:
  op      rs     rt     rd     shamt  funct
  000000  10001  10010  01000  00000  100000
ALU's operation based on instruction type and function code
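The bit pattern of the add example can be reproduced with a short sketch. The helper name is hypothetical; the field widths are the R-format widths 6/5/5/5/5/6.

```python
def encode_r_type(rs, rt, rd, shamt, funct, op=0):
    """Pack op(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $8, $17, $18: rd=8, rs=17, rt=18, funct=32 (binary 100000)
word = encode_r_type(rs=17, rt=18, rd=8, shamt=0, funct=32)
bits = format(word, "032b")
```

Note that the destination register comes first in assembly but sits in the fourth field of the machine word.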
-
Control: 2 level implementation
Control 2: Opcode (instruction register bits 31-26, 6 bits) -> ALUop (2 bits)
  00: lw, sw
  01: beq
  10: add, sub, and, or, slt
Control 1 (ALU control): ALUop (2 bits) and Funct. (instruction register bits 5-0, 6 bits) -> ALU operation (3 bits)
  000: and
  001: or
  010: add
  110: sub
  111: set on less than
-
Datapath with Control
[Figure: single-cycle datapath with the Control unit driven by Instruction[31-26] and producing RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; Instruction[25-21] and Instruction[20-16] select the read registers, Instruction[20-16] or Instruction[15-11] the write register, Instruction[15-0] is sign extended from 16 to 32 bits, and Instruction[5-0] feeds the ALU control]
-
What should the ALU do with this instruction
example: lw $1, 100($2)
  op   rs   rt   16 bit offset
  35   2    1    100
ALU control input
  000 AND
  001 OR
  010 add
  110 subtract
  111 set-on-less-than
Why is the code for subtract 110 and not 011?
ALU Control1
-
ALU Control1
Must describe hardware to compute 3-bit ALU control input
  given instruction type (ALU Operation class, computed from instruction type)
    00 = lw, sw
    01 = beq
    10 = arithmetic
  function code for arithmetic
Describe it using a truth table (can turn into gates):
  inputs                                    outputs
  ALUOp             Funct field             Operation
  ALUOp1  ALUOp0    F5 F4 F3 F2 F1 F0
  0       0         X  X  X  X  X  X        010
  X       1         X  X  X  X  X  X        110
  1       X         X  X  0  0  0  0        010
  1       X         X  X  0  0  1  0        110
  1       X         X  X  0  1  0  0        000
  1       X         X  X  0  1  0  1        001
  1       X         X  X  1  0  1  0        111
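The truth table can be written as a small lookup, shown here as a sketch. The function name is hypothetical; the don't-care entries (X) are handled by testing only the bits the table specifies.

```python
def alu_control(alu_op, funct):
    """Return the 3-bit ALU control input for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:          # lw, sw: add to form the address
        return 0b010
    if alu_op & 0b01:           # ALUOp0 = 1 (beq): subtract to compare
        return 0b110
    by_funct = {                # ALUOp1 = 1: decode F3..F0 of the funct field
        0b0000: 0b010,          # add
        0b0010: 0b110,          # subtract
        0b0100: 0b000,          # AND
        0b0101: 0b001,          # OR
        0b1010: 0b111,          # set-on-less-than
    }
    return by_funct[funct & 0b1111]
```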
-
ALU Control1
Simple combinational logic (truth tables)
[Figure: ALU control block; gates combine ALUOp1, ALUOp0 and funct bits F3-F0 (of F(5-0)) into the outputs Operation2, Operation1 and Operation0]
-
Deriving Control2 signals
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1
9 control (output) signals
Determine these control signals directly from the opcodes (input: 6 bits):
  R-format: 0
  lw: 35
  sw: 43
  beq: 4
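A sketch of the main control as a table keyed by opcode. Don't-care entries (X) are stored as None, and ALUOp1/ALUOp0 are merged into one 2-bit value; the dict layout and function name are assumptions.

```python
# Control signals per opcode, taken from the table; None marks a don't-care (X).
CONTROL = {
    0:  dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
             MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),        # R-format
    35: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
             MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),        # lw
    43: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
             MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),        # sw
    4:  dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
             MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),        # beq
}

def decode(instruction):
    """Look up the control signals from bits 31-26 of a 32-bit instruction."""
    return CONTROL[(instruction >> 26) & 0x3F]
```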
-
Control 2
PLA example implementation
[Figure: PLA with inputs Op5-Op0; product terms decode R-format, lw, sw and beq; outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1 and ALUOp0]
-
Single Cycle Implementation
Calculate cycle time assuming negligible delays except:
  memory (2ns), ALU and adders (2ns), register file access (1ns)
[Figure: complete single-cycle datapath with control signals RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg, RegWrite and PCSrc]
-
Single Cycle Implementation
Memory (2ns), ALU & adders (2ns), reg. file access (1ns)
Fixed length clock: longest instruction is the lw which requires 8 ns
Variable clock length (not realistic, just as exercise):
R-instr: 6 ns
Load: 8 ns
Store: 7 ns
Branch: 5 ns
Jump: 2 ns
Average depends on instruction mix
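The per-instruction times follow from adding up the units each instruction class uses. The sketch below does that and then averages over an instruction mix; the mix is purely an assumed example, not a figure from these slides.

```python
DELAY_NS = {"mem": 2, "alu": 2, "reg": 1}

# Units on the critical path of each instruction class, in order of use.
PATH = {
    "R":      ["mem", "reg", "alu", "reg"],          # fetch, read, ALU, write
    "load":   ["mem", "reg", "alu", "mem", "reg"],   # ... data access, write
    "store":  ["mem", "reg", "alu", "mem"],
    "branch": ["mem", "reg", "alu"],
    "jump":   ["mem"],
}

def latency_ns(kind):
    return sum(DELAY_NS[unit] for unit in PATH[kind])

def average_ns(mix):
    """Weighted average for a variable-length clock; mix maps class -> fraction."""
    return sum(fraction * latency_ns(kind) for kind, fraction in mix.items())

# Assumed instruction mix, for illustration only.
mix = {"R": 0.44, "load": 0.24, "store": 0.12, "branch": 0.18, "jump": 0.02}
fixed_clock_ns = max(latency_ns(kind) for kind in PATH)
```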
-
Where we are headed
Single Cycle Problems:
  what if we had a more complicated instruction like floating point?
  wasteful of area: NO sharing of hardware resources
One Solution:
  use a smaller cycle time
  have different instructions take different numbers of cycles
  a multicycle datapath:
[Figure: multicycle datapath with PC, one Memory for instructions and data, Instruction register (IR), Memory data register (MDR), Registers, the registers A and B, the ALU and ALUOut]
-
We will be reusing functional units
  ALU used to compute address and to increment PC
  Memory used for instruction and data
Add registers after every major functional unit
Our control signals will not be determined solely by instruction
  e.g., what should the ALU do for a subtract instruction?
We'll use a finite state machine (FSM) or microcode for control
Multicycle Approach
-
Review: finite state machines
Finite state machines:
  a set of states and
  next state function (determined by current state and the input)
  output function (determined by current state and possibly input)
We'll use a Moore machine (output based only on current state)
[Figure: Inputs and Current state feed the Next-state function, which produces the Next state (stored on the Clock); the Output function produces the Outputs]
-
Break up the instructions into steps, each step takes a cycle
  balance the amount of work to be done
  restrict each cycle to use only one major functional unit
At the end of a cycle
  store values for use in later cycles (easiest thing to do)
  introduce additional internal registers
Notice: we distinguish
  processor state: programmer visible registers
  internal state: programmer invisible registers (like IR, MDR, A, B, and ALUOut)
Multicycle Approach
-
Multicycle Approach
[Figure: multicycle datapath with PC, Memory (Address, MemData, Write data), Instruction register (fields Instruction[25-21], Instruction[20-16], Instruction[15-11], Instruction[15-0]), Memory data register, Registers, A, B, Sign extend (16 to 32 bits), Shift left 2, ALU with Zero output, ALUOut and multiplexors, including a 4-input multiplexor with the constant 4 on one ALU input]
-
Multicycle Approach
Note that previous picture does not include:
  branch support
  jump support
  Control lines and logic
Tclock > max (ALU delay, Memory access, Regfile access)
See book for complete picture
-
Five Execution Steps
1. Instruction Fetch
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch Completion
4. Memory Access or R-type instruction completion
5. Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
-
Use PC to get instruction and put it in the Instruction Register
Increment the PC by 4 and put the result back in the PC
Can be described succinctly using RTL "Register-Transfer Language"
  IR = Memory[PC];
  PC = PC + 4;
Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
Step 1: Instruction Fetch
-
Step 2: Instruction Decode and Register Fetch
Read registers rs and rt in case we need them
Compute the branch address in case the instruction is a branch
Previous two actions are done optimistically!!
RTL:
  A = Reg[IR[25-21]];
  B = Reg[IR[20-16]];
  ALUOut = PC + (sign-extend(IR[15-0]) << 2);
-
ALU is performing one of four functions, based on instruction type
Memory Reference:
  ALUOut = A + sign-extend(IR[15-0]);
R-type:
  ALUOut = A op B;
Branch:
  if (A==B) PC = ALUOut;
Jump:
  PC = PC[31-28] || (IR[25-0] << 2);
-
Loads and stores access memory
  MDR = Memory[ALUOut];
  or
  Memory[ALUOut] = B;
R-type instructions finish
  Reg[IR[15-11]] = ALUOut;
The write actually takes place at the end of the cycle on the edge
Step 4 (R-type or Memory-access)
-
Memory read completion step
Reg[IR[20-16]]= MDR;
What about all the other instructions?
Write-back step
-
Summary execution steps
Columns: Step name; Action for R-type instructions; Action for memory-reference instructions; Action for branches; Action for jumps
Instruction fetch:
  IR = Memory[PC]
  PC = PC + 4
Instruction decode/register fetch:
  A = Reg[IR[25-21]]
  B = Reg[IR[20-16]]
  ALUOut = PC + (sign-extend(IR[15-0]) << 2)
-
How many cycles will it take to execute this code?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, L1 #assume not taken
add $t5, $t2, $t3
sw $t5, 8($t3)
L1: ...
What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 take place?
Simple Questions
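One way to work through these questions is to tabulate cycles per instruction and walk the program. The sketch assumes the step counts implied by the five execution steps (lw 5, sw 4, R-type 4, beq 3) and a branch that is not taken; the names are hypothetical.

```python
# Cycles per instruction in the multicycle design (derived from the five steps).
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}
program = ["lw", "lw", "beq", "add", "sw"]

def total_cycles(instrs):
    return sum(CYCLES[name] for name in instrs)

def locate(cycle, instrs):
    """Return (instruction index, step number) active in a 1-based cycle."""
    start = 1
    for index, name in enumerate(instrs):
        length = CYCLES[name]
        if cycle < start + length:
            return index, cycle - start + 1
        start += length
    raise ValueError("cycle is past the end of the program")
```

For example, `locate(8, program)` reports which instruction and which of its steps occupies the 8th cycle.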
-
Value of control signals is dependent upon:
  what instruction is being executed
  which step is being performed
Use the information we have accumulated to specify a finite state machine (FSM)
  specify the finite state machine graphically, or
  use microprogramming
Implementation can be derived from specification
Implementing the Control
-
Graphical Specification of FSM
How many state bits will we need?
[Figure: state diagram beginning at Start; transitions out of state 1 depend on the opcode, e.g. (Op='LW'), (Op='J')]
State  Name                                 Control signals
0      Instruction fetch                    MemRead, ALUSrcA=0, IorD=0, IRWrite, ALUSrcB=01, ALUOp=00, PCWrite, PCSource=00
1      Instruction decode/register fetch    ALUSrcA=0, ALUSrcB=11, ALUOp=00
2      Memory address computation           ALUSrcA=1, ALUSrcB=10, ALUOp=00
3      Memory access (load)                 MemRead, IorD=1
4      Write-back step                      RegDst=0, RegWrite, MemtoReg=1
5      Memory access (store)                MemWrite, IorD=1
6      Execution                            ALUSrcA=1, ALUSrcB=00, ALUOp=10
7      R-type completion                    RegDst=1, RegWrite, MemtoReg=0
8      Branch completion                    ALUSrcA=1, ALUSrcB=00, ALUOp=01, PCWriteCond, PCSource=01
9      Jump completion                      PCWrite, PCSource=10
-
Finite State Machine for Control
Implementation:
[Figure: control logic (PLA) whose inputs are the instruction register opcode field Op5-Op0 and the state register bits S3-S0; its outputs are PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst and the next-state bits NS3-NS0]
-
Implementation
If I picked a horizontal or vertical line could you explain it?
What type of FSM is used? Mealy or Moore?
(see book)
[Figure: PLA with inputs opcode (Op5-Op0) and current state (S3-S0); outputs are the datapath control signals (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst) and the next state (NS3-NS0)]
-
Pipelined implementation
Pipelining
Pipelined datapath
Pipelined control
Hazards:
  Structural
  Data
  Control
Exceptions
Scheduling
For details see the book (chapter 6)
-
Pipelining
Improve performance by increasing instruction throughput
[Figure: program execution order (in instructions) against time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Without pipelining each instruction (Instruction fetch, Reg, ALU, Data access, Reg) takes 8 ns and the next starts only when it finishes. With pipelining a new instruction starts every 2 ns]
-
Pipelining
Ideal speedup = number of stages
Do we achieve this?
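A sketch of the timing behind this question, using the numbers from the three-lw example (8 ns per unpipelined instruction, 5 stages of 2 ns each). The function names are hypothetical.

```python
def unpipelined_ns(n, instr_ns=8):
    """Each instruction runs to completion before the next one starts."""
    return n * instr_ns

def pipelined_ns(n, stages=5, stage_ns=2):
    """The first instruction fills the pipe; each later one adds one stage time."""
    return (stages + n - 1) * stage_ns

def speedup(n):
    return unpipelined_ns(n) / pipelined_ns(n)
```

With these numbers the speedup for long programs approaches 8 / 2 = 4 rather than the 5 stages, because the 1 ns register stages are stretched to the 2 ns stage time.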
-
Pipelining
What makes it easy
  all instructions are the same length
  just a few instruction formats
  memory operands appear only in loads and stores
What makes it hard?
  structural hazards: suppose we had only one memory
  control hazards: need to worry about branch instructions
  data hazards: an instruction depends on a previous instruction
We'll build a simple pipeline and look at these issues
We'll talk about modern processors and what really makes it hard:
  exception handling
  trying to improve performance with out-of-order execution, etc.
-
Basic idea: start from single cycle impl.
What do we need to add to actually split the datapath into stages?
[Figure: single-cycle datapath divided into five stages: IF: Instruction fetch; ID: Instruction decode/register file read; EX: Execute/address calculation; MEM: Memory access; WB: Write back]
-
Pipelined Datapath
Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB inserted between the stages]
-
Corrected Datapath
[Figure: corrected pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]
-
Graphically Representing Pipelines
Can help with answering questions like:
  how many cycles does it take to execute this code?
  what is the ALU doing during cycle 4?
  use this representation to help understand datapaths
[Figure: program execution order (in instructions) against time (in clock cycles CC1-CC6); lw $10, 20($1) followed by sub $11, $2, $3, each passing through IM, Reg, ALU, DM, Reg]
-
Pipeline Control
[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) annotated with the control signals RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg, RegWrite and PCSrc; Instruction[20-16] and Instruction[15-11] select the write register, Instruction[15-0] is sign extended from 16 to 32 bits and its low 6 bits feed the ALU control]
-
We have 5 stages. What needs to be controlled in each stage?
  Instruction Fetch and PC Increment
  Instruction Decode / Register Fetch
  Execution
  Memory Stage
  Write Back
How would control be handled in an automobile plant?
  a fancy control center telling everyone what to do?
  should we use a finite state machine?
Pipeline control
-
Pipeline Control
Pass control signals along just like the data:
             Execution/Address Calculation      Memory access               Write-back
             stage control lines                stage control lines         stage control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc     Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format     1       1       0       0          0       0        0          1         0
lw           0       0       0       1          0       1        0          1         1
sw           X       0       0       1          0       0        1          0         X
beq          X       0       1       0          1       0        0          0         X
[Figure: Control generates the EX, M and WB signal groups from the Instruction in IF/ID; ID/EX holds WB, M and EX; EX/MEM holds WB and M; MEM/WB holds WB]
(compare single cycle control!)
-
Datapath with Control
[Figure: pipelined datapath with the Control unit; the WB, M and EX control groups travel through ID/EX, EX/MEM and MEM/WB alongside the data and drive RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg, RegWrite and PCSrc in their stages]
-
Hazards: problems due to pipelining
-
Hazards: problems due to pipelining
Hazard types:
  Structural
    same resource is needed multiple times in the same cycle
  Data
    data dependencies limit pipelining
  Control
    next executed instruction may not be the next specified instruction
-
Structural hazards
Examples:
  Two accesses to a single ported memory
  Two operations need the same function unit at the same time
  Two operations need the same function unit in successive cycles, but the unit is not pipelined
Solutions:
  stalling
  add more hardware
-
Data hazards
Data dependencies:
  RaW (read-after-write)
  WaW (write-after-write)
  WaR (write-after-read)
Hardware solution:
  Forwarding / Bypassing
  Detection logic
  Stalling
Software solution:
  Scheduling
Data dependences
Three types: RaW, WaR and WaW

  add r1, r2, 5    ; r1 := r2+5
  sub r4, r1, r3   ; RaW of r1

  add r1, r2, 5
  sub r2, r4, 1    ; WaR of r2

  add r1, r2, 5
  sub r1, r1, 1    ; WaW of r1

  st  r1, 5(r2)    ; M[r2+5] := r1
  ld  r5, 0(r4)    ; RaW if 5+r2 = 0+r4

WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom!
Problems for your compiler and Pentium!
  use register renaming to solve this!
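The three register dependence types can be checked mechanically. The sketch below is only an illustration: the `Instr` record and the `dependences` function are hypothetical names, and memory dependences (the st/ld case) are deliberately left out because they need address comparison.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read


def dependences(first: Instr, second: Instr):
    """Classify the register dependences of `second` on the earlier `first`."""
    kinds = []
    if first.dest is not None and first.dest in second.srcs:
        kinds.append("RaW")   # second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        kinds.append("WaR")   # second overwrites what first reads
    if first.dest is not None and first.dest == second.dest:
        kinds.append("WaW")   # both write the same register
    return kinds


add = Instr("add r1, r2, 5", "r1", ("r2",))
print(dependences(add, Instr("sub r4, r1, r3", "r4", ("r1", "r3"))))
print(dependences(add, Instr("sub r2, r4, 1", "r2", ("r4",))))
print(dependences(add, Instr("sub r1, r1, 1", "r1", ("r1",))))
```

Note that the third pair carries a RaW dependence in addition to the WaW one, since the `sub` also reads r1.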
RaW on MIPS pipeline
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9 on the horizontal axis, program execution order on the vertical axis; each instruction passes through IM, Reg, DM, Reg.]

  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

Value of register $2: 10 in CC 1 to CC 4, 10/20 in CC 5, 20 in CC 6 to CC 9
Forwarding
Use temporary results, don't wait for them to be written
  register file forwarding to handle read/write to same register
  ALU forwarding
[Figure: pipeline diagram for the same sequence, CC 1 to CC 9. Annotation in the figure: "What if this $2 was $13?"]

  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

                        CC1  CC2  CC3  CC4  CC5    CC6  CC7  CC8  CC9
Value of register $2:   10   10   10   10   10/20  20   20   20   20
Value of EX/MEM:        X    X    X    20   X      X    X    X    X
Value of MEM/WB:        X    X    X    X    20     X    X    X    X
Forwarding hardware
ALU forwarding circuitry principle:
[Figure: ALU with two inputs "from register file" and an output "to register file".]
Note: there are two options
  buf - ALU - bypass - mux - buf
  buf - bypass - mux - ALU - buf
Forwarding
[Figure: pipelined datapath (PC, instruction memory, registers, ALU, data memory; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) extended with a forwarding unit. Its inputs are the Rs and Rt fields (IF/ID.RegisterRs, IF/ID.RegisterRt, carried through ID/EX), EX/MEM.RegisterRd and MEM/WB.RegisterRd; its outputs ForwardA and ForwardB control the muxes in front of the ALU.]
Forwarding check
Check for matching register-ids:
For each source-id of the operation in the EX-stage, check if there is a matching pending dest-id

Example:
  if (EX/MEM.RegWrite) and
     (EX/MEM.RegisterRd ≠ 0) and
     (EX/MEM.RegisterRd = ID/EX.RegisterRs)
  then ForwardA = 10

Q. How many comparators do we need?
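The check can be written out as a small function. This is a sketch under assumptions: the slide only gives the EX/MEM case for ForwardA (value 10); the MEM/WB case with value 01 and the default 00 follow the usual textbook formulation and are not stated on the slide. The function name `forward_select` is hypothetical.

```python
def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Return the 2-bit mux select for one ALU source operand.

    0b10: take the value from EX/MEM (most recent result)   [from the slide]
    0b01: take the value from MEM/WB                         [assumed encoding]
    0b00: take the value read from the register file         [assumed encoding]
    """
    # The EX/MEM check comes first so the youngest pending write wins.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return 0b10
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return 0b01
    return 0b00


# "sub $2,$1,$3" is in MEM (EX/MEM.RegisterRd = 2), "and $12,$2,$5" is in EX.
forward_a = forward_select(2, True, 2, False, 0)   # Rs = $2
forward_b = forward_select(5, True, 2, False, 0)   # Rt = $5
print(format(forward_a, "02b"), format(forward_b, "02b"))
```

The same function is evaluated once per source operand (Rs for ForwardA, Rt for ForwardB).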
Can't always forward
Load word can still cause a hazard: an instruction tries to read register r directly following a load to the same register r
Need a hazard detection unit to stall the instruction that uses the loaded value

[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, program execution order on the vertical axis; each instruction passes through IM, Reg, DM, Reg.]

  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
Stalling
We can stall the pipeline by keeping an instruction in the same stage
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 10, for the sequence below, with a bubble inserted behind the lw.]

  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7

In CC4 the ALU is not used; Reg and IM are redone
Hazard Detection Unit
[Figure: pipelined datapath with forwarding unit and hazard detection unit. The hazard detection unit receives ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt; it drives PCWrite, IF/IDWrite and a mux that can replace the control signals entering ID/EX by 0. The forwarding unit receives Rs, Rt, Rd, EX/MEM.RegisterRd and MEM/WB.RegisterRd.]
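The signal names in the hazard detection figure suggest the load-use condition sketched below. It assumes the usual textbook rule (stall when the instruction in EX is a load whose destination Rt matches a source of the instruction in ID); the function names and the returned dictionary layout are illustrative only.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID must wait one cycle for a load in EX."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


def hazard_controls(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Derive the three control outputs of the (assumed) hazard detection unit."""
    stall = load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,       # hold the PC so the fetch is redone
        "IF/IDWrite": not stall,    # hold IF/ID so the decode is redone
        "zero_controls": stall,     # insert a bubble into ID/EX
    }


# "lw $2,20($1)" in EX (Rt = 2), "and $4,$2,$5" in ID (Rs = 2, Rt = 5).
print(hazard_controls(True, 2, 2, 5))
```

Holding PC and IF/ID while zeroing the control signals is what produces the bubble shown in the stalling diagram.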
Software only solution?
Have compiler guarantee that no hazards occur
Example: where do we insert the NOPs ?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $13, 100($2)
Problem: this really slows us down!
With NOPs inserted:
  sub $2, $1, $3
  nop
  nop
  and $12, $2, $5
  or $13, $6, $2
  add $14, $2, $2
  nop
  sw $13, 100($2)
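A compiler pass that inserts the NOPs could look like the sketch below. It assumes a pipeline without ALU forwarding in which a consumer must sit at least three instruction slots behind its producer (the register file being written and read in the same cycle); that distance and the name `insert_nops` are assumptions for illustration, not taken from the slides.

```python
NOP = ("nop", None, ())


def insert_nops(program, min_distance=3):
    """Insert nops so every consumer is at least `min_distance` slots behind its producer.

    Each instruction is a tuple (text, destination register or None, source registers).
    """
    out = []
    for text, dest, srcs in program:
        needed = 0
        # Look back only as far as a hazard is still possible.
        for back in range(1, min_distance):
            if back > len(out):
                break
            prev_dest = out[-back][1]
            if prev_dest is not None and prev_dest in srcs:
                needed = max(needed, min_distance - back)
        out.extend([NOP] * needed)
        out.append((text, dest, srcs))
    return out


program = [
    ("sub $2, $1, $3", "$2", ("$1", "$3")),
    ("and $12, $2, $5", "$12", ("$2", "$5")),
    ("or $13, $6, $2", "$13", ("$6", "$2")),
    ("add $14, $2, $2", "$14", ("$2",)),
    ("sw $13, 100($2)", None, ("$13", "$2")),
]
for text, _, _ in insert_nops(program):
    print(text)
```

Under these assumptions the five instructions grow to eight, which is why the slide calls the software-only approach slow.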
Control hazards
Control operations may change the sequential flow of instructions
  branch
  jump
  call (jump and link)
  return (exception/interrupt and rti / return from interrupt)
Control hazard: Branch
Branch actions:
Compute new address
Determine condition
Perform the actual branch (if taken): PC := new address
Branch example
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, program execution order on the vertical axis; each instruction passes through IM, Reg, DM, Reg. The number in front of each instruction is its address.]

  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or  $13, $6, $2
  52 add $14, $2, $2
  72 lw  $4, 50($7)
Branching
Squash pipeline:
  When we decide to branch, other instructions are in the pipeline!
  We are predicting "branch not taken"
    need to add hardware for flushing instructions if we are wrong
Branch with predict not taken
[Figure: pipeline diagram with clock cycles on the horizontal axis. Rows: "Branch L", the instructions fetched under "predict not taken", and the branch target "L:"; each row passes through IF ID EX MEM WB.]
Branch speedup
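Earlier address computation
Earlier condition calculation
Put both in the ID pipeline stage
  adder
  comparator

[Figure: pipeline diagram with clock cycles on the horizontal axis. Rows: "Branch L", one instruction fetched under "predict not taken", and the branch target "L:"; each row passes through IF ID EX MEM WB.]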
Improved branching / flushing IF/ID
[Figure: pipelined datapath with the branch decision moved to the ID stage: sign extend, shift left 2 and an adder for the target address, plus a comparator (=) on the register outputs. The IF.Flush signal clears the IF/ID register. Hazard detection unit, forwarding unit and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB are as before.]
Exception support
Types of exceptions:
  Overflow
  I/O device request
  Operating system call
  Undefined instruction
  Hardware malfunction
  Page fault
Precise exception:
  finish previous instructions (which are still in the pipeline)
  flush excepting and following instructions, redo them after handling the exception(s)
Exceptions
Changes needed for handling overflow exception of an operation in EX stage (see book for details):
  Extend PC input mux with extra entry with fixed address
  Add EPC register recording the ID/EX stage PC
    this is the address of the next instruction!
  Cause register recording exception type
  E.g., in case of overflow exception insert 3 bubbles; flush the following stages:
    IF/ID stage
    ID/EX stage
    EX/MEM stage
Scheduling, why?
Let's look at the execution time:
  T_execution = N_cycles x T_cycle = N_instructions x CPI x T_cycle
Scheduling may reduce T_execution
  Reduce CPI (cycles per instruction)
    early scheduling of long latency operations
    avoid pipeline stalls due to structural, data and control hazards
    allow N_issue > 1 and therefore CPI < 1
  Reduce N_instructions
    compact many operations into each instruction (VLIW)
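To see how stalls enter the execution-time formula, the sketch below adds average stall cycles per instruction to a base CPI. All numbers (instruction mix, penalties, cycle time) are made-up assumptions chosen only to make the arithmetic concrete, and the function names are hypothetical.

```python
def cpi_with_stalls(base_cpi, branch_fraction, mispredict_fraction, branch_penalty,
                    load_use_fraction=0.0, load_use_penalty=1):
    """Average cycles per instruction including branch and load-use stall cycles."""
    branch_stalls = branch_fraction * mispredict_fraction * branch_penalty
    load_stalls = load_use_fraction * load_use_penalty
    return base_cpi + branch_stalls + load_stalls


def execution_time(n_instructions, cpi, t_cycle):
    """T_execution = N_instructions x CPI x T_cycle."""
    return n_instructions * cpi * t_cycle


# Assumed workload: 20% branches, half of them mispredicted, 3-cycle penalty.
cpi = cpi_with_stalls(1.0, 0.20, 0.5, 3)
print(cpi, execution_time(1_000_000, cpi, 1e-9))
```

Reducing the branch penalty from 3 cycles to 1 in this made-up mix lowers the stall contribution from 0.3 to 0.1 cycles per instruction.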
Scheduling data hazards: example 1
Try and avoid RaW stalls (in this case load interlocks)!
E.g., reorder these instructions:
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
Reordered:
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t0, 4($t1)
  sw $t2, 0($t1)
Scheduling data hazards: example 2
Avoiding RaW stalls: reordering instructions for the following program (by you or the compiler)

Code:
  a = b + c
  d = e - f

Unscheduled code:
  Lw  R1,b
  Lw  R2,c
  Add R3,R1,R2    interlock
  Sw  a,R3
  Lw  R1,e
  Lw  R2,f
  Sub R4,R1,R2    interlock
  Sw  d,R4

Scheduled code:
  Lw  R1,b
  Lw  R2,c
  Lw  R5,e        extra reg. needed!
  Add R3,R1,R2
  Lw  R2,f
  Sw  a,R3
  Sub R4,R5,R2
  Sw  d,R4
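The effect of the reordering can be counted with a short script. The sketch assumes a pipeline with forwarding in which the only remaining stall is the one-cycle load interlock (a load directly followed by a user of the loaded register); the tuple encoding of instructions and the name `count_load_interlocks` are illustrative.

```python
def count_load_interlocks(program):
    """Count instructions that directly follow a load and read its destination.

    Each instruction is a tuple (opcode, destination register or None, source registers).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


unscheduled = [
    ("lw", "R1", ()), ("lw", "R2", ()), ("add", "R3", ("R1", "R2")), ("sw", None, ("R3",)),
    ("lw", "R1", ()), ("lw", "R2", ()), ("sub", "R4", ("R1", "R2")), ("sw", None, ("R4",)),
]
scheduled = [
    ("lw", "R1", ()), ("lw", "R2", ()), ("lw", "R5", ()), ("add", "R3", ("R1", "R2")),
    ("lw", "R2", ()), ("sw", None, ("R3",)), ("sub", "R4", ("R5", "R2")), ("sw", None, ("R4",)),
]
print(count_load_interlocks(unscheduled), count_load_interlocks(scheduled))
```

Both versions contain eight instructions; the scheduled one trades the two interlock cycles for one extra register (R5).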
Scheduling control hazards
What can we do about control hazards and CPI penalty?
  Keep penalty P_branch low:
    Early computation of new PC
    Early determination of condition
    Visible branch delay slots filled by compiler (MIPS)
  Branch prediction
  Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro'95]
  Remove branches: if-conversion
    Conditional instructions: CMOVE, cond skip next
    Guarding all instructions: TriMedia
Branch delay slot
Add a branch delay slot: the next instruction after a branch is always executed
rely on compiler to fill the slot with something useful
Is this a good idea? let's look how it works
Branch delay slot scheduling
      op 1
      beq r1,r2, L
      .............
      op 2            'fall-through'
  L:  op 3            branch target
      .............
      .............
Q. What to put in the delay slot?
Summary
Modern processors are (deeply) pipelined, to reduce T_cycle and aim at CPI = 1
Hazards increase CPI
Several software and hardware measures are taken to avoid or reduce hazards
Not discussed, but important developments:
  Multi-issue further reduces CPI
  Branch prediction to avoid high branch penalties
  Dynamic scheduling
In all cases: a scheduling compiler is needed
Recap of MIPS
RISC architecture
Register space
Addressing
Instruction format
Pipelining
Why RISC? Keep it simple
RISC characteristics:
  Reduced number of instructions
  Limited addressing modes
    load-store architecture
    enables pipelining
  Large register set
    uniform (no distinction between e.g. address and data registers)
  Limited number of instruction sizes (preferably one)
    know directly where the following instruction starts
  Limited number of instruction formats
  Memory alignment restrictions
  ......
Based on quantitative analysis
  "the famous MIPS one percent rule": don't even think about it when it's not used more than one percent
Register space
Name       Register number   Usage
$zero      0                 the constant value 0
$v0-$v1    2-3               values for results and expression evaluation
$a0-$a3    4-7               arguments
$t0-$t7    8-15              temporaries
$s0-$s7    16-23             saved (by callee)
$t8-$t9    24-25             more temporaries
$gp        28                global pointer
$sp        29                stack pointer
$fp        30                frame pointer
$ra        31                return address

32 integer (and 32 floating point) registers of 32 bits each
Addressing
1. Immediate addressing
     op | rs | rt | Immediate
2. Register addressing
     op | rs | rt | rd | ... | funct      operand is a register
3. Base addressing
     op | rs | rt | Address               operand in memory at register + Address (byte, halfword or word)
4. PC-relative addressing
     op | rs | rt | Address               word in memory at PC + Address
5. Pseudodirect addressing
     op | Address                         word in memory at PC combined with Address
Instruction format
Example instructions
Instruction         Meaning
add $s1,$s2,$s3     $s1 = $s2 + $s3
addi $s2,$s3,4      $s2 = $s3 + 4
lw $s1,100($s2)     $s1 = Memory[$s2+100]
bne $s4,$s5,L       if $s4 ≠ $s5 goto L
j Label             goto Label

R   op | rs | rt | rd | shamt | funct
I   op | rs | rt | 16 bit address
J   op | 26 bit address
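The three formats can be made concrete by packing the fields into a 32-bit word. The sketch uses the standard MIPS field widths (6/5/5/5/5/6 bits for R, 6/5/5/16 for I, 6/26 for J) and the standard opcode and funct values for the example instructions; the helper names `encode_r`, `encode_i` and `encode_j` are hypothetical.

```python
# Register numbers for the names used in the examples ($s0-$s7 are 16-23).
REG = {"$zero": 0}
REG.update({f"$s{i}": 16 + i for i in range(8)})


def encode_r(rs, rt, rd, shamt, funct, op=0):
    """R format: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm):
    """I format: op(6) rs(5) rt(5) immediate/address(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(op, word_address):
    """J format: op(6) address(26); the field holds a word address."""
    return (op << 26) | (word_address & 0x3FFFFFF)


# add $s1,$s2,$s3: op 0, funct 0x20, destination in the rd field
print(hex(encode_r(REG["$s2"], REG["$s3"], REG["$s1"], 0, 0x20)))
# addi $s2,$s3,4: op 8, destination in the rt field
print(hex(encode_i(0x08, REG["$s3"], REG["$s2"], 4)))
# lw $s1,100($s2): op 0x23, base register in rs, destination in rt
print(hex(encode_i(0x23, REG["$s2"], REG["$s1"], 100)))
```

Note how the destination register moves between the rd field (R format) and the rt field (I format), while rs stays in the same position.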
Pipelining
All integer instructions fit into the following pipeline

time -->
IF  ID  EX  MEM WB
    IF  ID  EX  MEM WB
        IF  ID  EX  MEM WB
            IF  ID  EX  MEM WB
                IF  ID  EX  MEM WB
Other architecture styles
Accumulator architecture
  one operand (in register or memory), accumulator almost always implicitly used
Stack
  zero operand: all operands implicit (on TOS)
Register (load store)
  three operands, all in registers
  loads and stores are the only instructions accessing memory (i.e. with a memory (indirect) addressing mode)
Register-Memory
  two operands, one in memory
Memory-Memory
  three operands, may be all in memory
(there are more varieties / combinations)
Accumulator architecture
[Figure: accumulator architecture datapath; labels: Accumulator, ALU, Memory, registers, address, latch.]
Example code: a = b+c;
load b; // accumulator is implicit operand
add c;
store a;
Stack architecture
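Example code: a = b+c;
  push b;
  push c;
  add;
  pop a;

Stack contents after each instruction (top of stack first):
  push b:   b
  push c:   c, b
  add:      b+c
  pop a:    (empty)

[Figure: stack architecture datapath; labels: ALU, Memory, top of stack, stack pt, latch.]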
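The zero-operand style can be tried out with a tiny interpreter. This is a toy sketch: `run_stack_program` is a hypothetical name, memory is modelled as a dictionary from variable names to values, and only the three operations used in the example are supported.

```python
def run_stack_program(program, memory):
    """Execute push/add/pop instructions; all ALU operands are implicit on the stack."""
    stack = []
    for line in program:
        op, *args = line.split()
        if op == "push":
            stack.append(memory[args[0]])      # memory -> top of stack
        elif op == "add":
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs + rhs)            # replace the two top entries by their sum
        elif op == "pop":
            memory[args[0]] = stack.pop()      # top of stack -> memory
        else:
            raise ValueError(f"unsupported instruction: {line}")
    return memory


memory = run_stack_program(["push b", "push c", "add", "pop a"], {"b": 2, "c": 3})
print(memory["a"])
```

Only `push` and `pop` name a memory location; `add` needs no operand fields at all, which is what keeps stack instructions short.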
Other architecture styles
Let's look at the code for C = A + B

Stack           Accumulator     Register-Memory   Memory-Memory   Register
Architecture    Architecture                                      (load-store)
Push A          Load A          Load r1,A         Add C,B,A       Load r1,A
Push B          Add B           Add r1,B                          Load r2,B
Add             Store C         Store C,r1                        Add r3,r1,r2
Pop C                                                             Store C,r3