cpe555a: real-time embedded systems

Fall 2015, arz

CPE555A: Real-Time Embedded Systems

Lecture 2
Ali Zaringhalam

Stevens Institute of Technology


Outline

• RISC ISA
• Single-cycle CPU
• Multi-cycle CPU
• Pipelining
• Pipeline hazards


Von Neumann Machine

[Figure: Von Neumann machine block diagram with main memory, input/output, and the CPU (control unit and ALU/datapath)]


Fetch/Execute Cycle

[Figure: fetch/execute cycle flowchart. Boxes: Start, Fetch Next Instruction, Execute Instruction, Handle Interrupts (If Any). Edge labels: Interrupts Disabled, Interrupts Enabled.]

The address of the current instruction is held in the Program Counter (PC) register.

After the instruction is fetched, the PC is automatically incremented to point to the next instruction.


Need for Instructions

We need a way to tell the processor what steps to take to execute our program. In the Von Neumann model this includes:

• fetching data from memory
• performing arithmetic and logical operations on the data
• storing the results of computation in memory
• performing input/output

In addition, the processor must support certain high-level programming constructs. These include:

• modifying the sequential flow of control for if-then-else and case statements
• subroutine calls to support structured programming


Examples of Instructions

Here are some example instructions:

• Loading data from a memory location into a register

load Register1, Memory_Address

• Storing data from a register to a memory location

store Register1, Memory_Address

• Adding the contents of two source registers and storing the result in a third destination register

add Register3, Register1, Register2

• Comparing two registers and fetching the next instruction based on the result of the comparison (either the next sequential instruction or the one at a branch address)

bne Register1, Register2, Instruction_Address


RISC Instruction Set Architecture

MIPS is one member of the broader class of Reduced Instruction Set Computer (RISC) Instruction Set Architectures (ISAs).

Here are some examples of RISC processors:

• PowerPC
• SPARC
• MIPS
• ARM (heavily used in embedded systems today)

The ISAs implemented in these machines are not quite the same but share a large set of common characteristics (to be discussed shortly).


Summary: MIPS Instruction Formats

R-Type Format
Field         opcode  rs  rt  rd  shamt  func
Width (bits)  6       5   5   5   5      6

I-Type Format
Field         opcode  rs  rt  immediate
Width (bits)  6       5   5   16

J-Type Format
Field         opcode  immediate
Width (bits)  6       26

The MIPS ISA was designed to allow efficient pipelining of instructions in hardware.
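The field widths of the three formats can be turned into a small bit-packing sketch. The helper names (encode_r, encode_i, encode_j, decode_r) are hypothetical and chosen only for illustration; the values used in the usage lines (opcode 0 and funct 8 for jr) are the ones this lecture gives for the register jump.

```python
def encode_r(opcode, rs, rt, rd, shamt, func):
    """Pack an R-Type word: opcode(6) rs(5) rt(5) rd(5) shamt(5) func(6)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | func


def encode_i(opcode, rs, rt, immediate):
    """Pack an I-Type word: opcode(6) rs(5) rt(5) immediate(16)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


def encode_j(opcode, immediate):
    """Pack a J-Type word: opcode(6) immediate(26)."""
    return (opcode << 26) | (immediate & 0x3FFFFFF)


def decode_r(word):
    """Split a 32-bit word back into its R-Type fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "func": word & 0x3F,
    }


# jr reg31: opcode 0, funct 8, rs names the register holding the jump address
jr_word = encode_r(opcode=0, rs=31, rt=0, rd=0, shamt=0, func=8)
print(hex(jr_word))
print(decode_r(jr_word))
```

Because every field sits at a fixed bit position, decoding needs only shifts and masks, which is part of what makes the formats easy to handle in hardware.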


What’s in an ISA?

Above all, an ISA is a set of specifications. An ISA gives you a set of requirements on what to build (i.e., support) in a processor. These include:

• the set of instructions that the processor must support
• the number of programmable registers
• the instruction format, including size and encoding
• the interface between the processor and the operating system for exception handling
• what features are required and what features are optional (for example, integer arithmetic is required but floating-point arithmetic is typically optional)
• in short: whatever is required to ensure binary compatibility between two machines implementing the same ISA


What Isn’t in an ISA?

An ISA doesn’t tell you how to build a processor. For example: should it be pipelined? How many instructions should be issued per cycle?

This separation of specification and implementation permits:

• processor vendors to implement the ISA in different ways based on technology/performance/cost requirements
• compiler developers to develop compilers that translate to an ISA independent of the processor’s specific implementation (this is not entirely true when it comes to performance optimization)
• an ISA to live longer than a specific implementation (a particular processor becomes obsolete long before an ISA is abandoned in favor of a new one)


Characteristics of RISC Processors

• Large number of General Purpose Registers
• Strictly load/store
• Fixed-size instructions
• Variable-format instructions
• Limited number of addressing modes
• Small instruction set (MIPS32 has 168 instructions vs. ~700 in VAX)

[Figure: MIPS instruction formats. R-Type: opcode (6), rs (5), rt (5), rd (5), shamt (5), func (6). I-Type: opcode (6), rs (5), rt (5), immediate (16). J-Type: opcode (6), immediate (26).]


RISC Alternative: CISC

CISC: Complex Instruction Set Computer

• variable-length, variable-format instructions
• complex instructions
• memory-register instructions
• complex addressing modes
• Example: Intel’s IA-32


What’s a General-Purpose Register?

A general-purpose register (GPR) is a programmable register which can be used for any purpose the programmer deems necessary. By programmer here we generally mean developers of compilers and low-level code such as drivers. They can use GPRs in any way that suits their purpose to optimize performance. The “general” in GPR should be contrasted with special-purpose registers. These include the accumulator register and registers reserved for holding the base address and offset index of arrays. Special-purpose registers were common in early processors because registers were expensive and compilers weren’t smart enough to exploit them. Processors such as those from Intel which are based on the legacy 8086 ISA continue to support special-purpose registers.


Storage-Device Hierarchy

[Figure: storage-device hierarchy ordered by increasing access time. Access times shown: 0.25-0.5 ns, 0.5-20 ns, 80-250 ns. For reference, a 4 GHz CPU has a cycle time T = 0.25 ns.]


Why a Large Number of GPRs?

• Registers are cheaper to make now
• Registers offer compiler writers flexibility
  – compiler developers prefer unreserved registers
• Registers are faster to access than main memory or cache
• Registers can store variables for as long as necessary. This reduces the need to access memory for data
• We can address registers with fewer bits compared to addressing main memory. This reduces code size (i.e., improves code density)
  – in MIPS we need 5 bits to address 32 registers
  – in a 32-bit machine we need 32 bits to address a memory location


MIPS Register Organization

• 32 GPRs (or integer registers), each 32 bits wide
• reg31 is used to store the return address during procedure calls. At other times it can be used for any purpose.
• Why not consider more than 32 registers?
  – addressing registers in instructions requires address bits: we need n bits to address 2^n registers (5 bits to address 32 registers); there is a tradeoff between the number of GPRs and instruction size
  – more registers means more hardware (e.g., gates, wires); more hardware translates into a longer datapath and therefore a longer clock cycle (lower clock rate)


Data Transfer Instructions

Even with 32 GPRs, it is impossible to store complex data structures and arrays used in a typical program. To store these we need a much larger storage capacity. This capacity is provided by main memory. This requires a means to transfer data between main memory and GPRs.

The load instruction transfers data from memory to a register. In the following instruction, data is transferred from memory location MEM[reg3 + 32] to the reg1 register.

lw reg1, 32(reg3)

Similarly, the store instruction transfers data from a register to main memory.

sw reg1, 32(reg3)

[Figure: I-Type format: opcode (6), rs (5), rt (5), immediate (16)]


Example

How would the following C assignment statement be compiled?

g = h + A[8];

The compiler would typically assign g and h to some registers, say reg1 and reg2 respectively. It would then store the base address of the array in a third register, say reg3, and add to it the byte offset of array element 8. A word is 4 bytes, so the offset is 8 x 4 = 32. The compilation results in the instructions:

lw reg4, 32(reg3)       # reg4 = A[8]
add reg1, reg2, reg4    # g = h + A[8]

The first instruction transfers data stored in memory to the temporary register reg4, and the second instruction adds this to the contents of register reg2, then stores the sum in reg1.

Note that you must load data from memory into register reg4 before any arithmetic operation. Hence the name “load-store”, which means that you cannot use a memory operand in ALU instructions.
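The effect of the two compiled instructions can be traced with a toy model. This is only a sketch under stated assumptions: the register file is a Python dictionary, memory is modeled as a mapping from byte address to word value, and the base address 1000 and the function name run_example are hypothetical.

```python
WORD = 4  # bytes per word, as stated in the example


def run_example(h, array_a, base_address=1000):
    """Trace 'lw reg4, 32(reg3)' followed by 'add reg1, reg2, reg4'."""
    regs = {"reg1": 0, "reg2": h, "reg3": base_address, "reg4": 0}
    # Toy memory: byte address of each array element -> word stored there
    memory = {base_address + WORD * i: value for i, value in enumerate(array_a)}

    # lw reg4, 32(reg3): effective address = base register + displacement
    regs["reg4"] = memory[regs["reg3"] + 32]
    # add reg1, reg2, reg4: both operands are now registers
    regs["reg1"] = regs["reg2"] + regs["reg4"]
    return regs["reg1"]


a = list(range(100, 110))  # A[8] == 108
g = run_example(h=5, array_a=a)
print(g)
```

The displacement 32 selects the ninth word of the array because 32 / 4 = 8.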


Memory Addressing

What are the addressable objects in memory?

• in most processors today instructions can address and operate on individual bytes
• but other multi-byte scalars such as word and half-word are also available for access

And the issues for multi-byte scalars are...

• how to organize the bytes of a multi-byte scalar in memory: little-endian vs. big-endian conventions. In other words, does the given memory address refer to the Least Significant Byte (LSB) or the Most Significant Byte (MSB)?
• how to access multi-byte scalars: alignment restrictions


Little & Big Endian

Big endian: word address is the address of the most significant byte

Little endian: word address is the address of the least significant byte

Byte addresses, listed from MSB to LSB (B is some base address):

                 MSB                 LSB
Big endian       B+0   B+1   B+2   B+3
Little endian    B+3   B+2   B+1   B+0


Example: 0x12345678

Memory address   Big Endian   Little Endian
200              12           78
201              34           56
202              56           34
203              78           12

B=200 is the base address in this example. The big-endian layout is similar to writing English: the most significant byte comes first.
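The two layouts of 0x12345678 can be reproduced with Python's built-in integer-to-bytes conversion. This is a minimal sketch; the helper name layout and the variable names are chosen here for illustration, and the base address 200 is the one used in the example.

```python
value = 0x12345678
base = 200  # base address B from the example

big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")


def layout(raw, base_address):
    """Map each memory address to the byte stored there, as two hex digits."""
    return {base_address + i: format(byte, "02x") for i, byte in enumerate(raw)}


big_layout = layout(big, base)
little_layout = layout(little, base)
print(big_layout)
print(little_layout)

# Reading little-endian bytes as if they were big-endian scrambles the value
misread = int.from_bytes(little, byteorder="big")
print(hex(misread))
```

The last lines illustrate why endian-ness matters when data moves between machines that use different conventions.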


Example Machines

• Little endian: 80x86, VAX, Alpha
• Big endian: SPARC, 680x0, IBM 370/390, most RISC
• Bi-endian processors can be configured to operate in either big- or little-endian mode (e.g., MIPS64)

When to worry about endian-ness?

• byte/bit manipulation within a multi-byte scalar (e.g., accessing the 3rd most significant byte of a 4-byte word)
• data communication between machines of different endian-ness


MIPS is Strictly Load/Store?

lw reg5, 8(reg3)        # reg5 = Mem[reg3 + 8]
add reg1, reg2, reg5    # reg1 = reg2 + reg5

In a strictly load/store ISA, the only instructions that can access memory are load and store. The operands of other instructions must first be loaded into registers using the load instruction. The result of the operation can be stored in memory only by means of the store instruction. In MIPS (and more generally in RISC), instructions of the type:

add reg1, reg2, 8(reg3)

where one or more of the operands reside in memory are not allowed. Instructions of this type are common in VAX and x86 ISAs.


Addressing Modes

Instructions need to know where/what their operands are. So the question is how the operands are supplied to the instruction. The MIPS ISA supports three methods for this purpose:

• immediate mode addressing: the operand is encoded directly in the instruction as a constant
• the address of the operand is encoded in the instruction:
  – register mode addressing: the operand is in a register and the address of the register is encoded in the instruction
  – displacement mode addressing: the operand is stored in memory and the address of the memory location is encoded in the instruction (as a base register plus a constant displacement)


Addressing Mode Examples

• Immediate mode
  – 16-bit field for the constant

addi reg4, reg4, 7      # Regs[reg4] = Regs[reg4] + 7

• Register mode

add reg4, reg3, reg2    # Regs[reg4] = Regs[reg3] + Regs[reg2]

• Displacement mode
  – 16 bits for the displacement

lw reg4, 100(reg1)      # Regs[reg4] = Mem[Regs[reg1] + 100]

• Special cases of displacement mode
  – indirect mode: displacement value = 0

lw reg4, 0(reg1)        # Regs[reg4] = Mem[Regs[reg1]]

  – absolute addressing: reg0 as base register (always stores 0)

lw reg4, 8700(reg0)     # Regs[reg4] = Mem[8700]

The MIPS ISA supports 3 addressing modes explicitly, but effectively we have 5 addressing modes at our disposal.


I For Immediate Addressing Mode

• A large percentage of arithmetic operations have one constant operand (e.g., X = X + 4)
• Keeping constants in memory and loading them from there is inefficient (consider storing all integer constants in memory!)
• ALU instructions with immediate addressing mode are designed to address this need
  – use the I-Type instruction format
  – encode the constant in the instruction’s 16-bit immediate field
  – constants in the range -2^15 to (2^15 - 1) can be encoded
  – example:

addi R4, R8, 79

[Figure: I-Type format: opcode (6), rs (5), rt (5), immediate (16)]
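The range of a 16-bit signed immediate, and the sign extension applied when the field is used, can be checked with a few lines. This is a sketch assuming two's-complement encoding of the immediate field; the function names fits_immediate and sign_extend_16 are hypothetical.

```python
IMM_BITS = 16
IMM_MIN = -(1 << (IMM_BITS - 1))     # -2^15
IMM_MAX = (1 << (IMM_BITS - 1)) - 1  # 2^15 - 1


def fits_immediate(constant):
    """True if the constant can be encoded in the 16-bit immediate field."""
    return IMM_MIN <= constant <= IMM_MAX


def sign_extend_16(field):
    """Interpret a raw 16-bit field as a signed (two's-complement) value."""
    field &= 0xFFFF
    return field - 0x10000 if field & 0x8000 else field


print(IMM_MIN, IMM_MAX)
print(fits_immediate(79), fits_immediate(40000))
print(sign_extend_16(0xFFFF))
```

A constant that does not fit, such as 40000, cannot be supplied by a single immediate-mode instruction.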


Conditional Branch Instructions

The processor fetches and executes instructions sequentially from memory. But programming languages have constructs such as if-then-else or switch statements. In these conditional branch statements the outcome of a test determines whether to execute the next sequential instruction or fetch the next instruction from a branch address. As examples, MIPS supports the following conditional branch instructions:

1. Branch if equal: beq reg1, reg2, Label. In this instruction reg1 and reg2 are compared and if they are equal, the next instruction will be fetched and executed from the address Label.

2. Branch if not equal: bne reg1, reg2, Label. In this instruction reg1 and reg2 are compared and if they are not equal, the next instruction will be fetched and executed from the address Label.


Unconditional Branch Instructions

How would one compile the following fragment:

if(i==j) f=g+h; else f=g-h;

Assume: f = reg1, g = reg2, h = reg3, i = reg4, j = reg5

Here is how:

bne reg4, reg5, Else          # if i != j, go to Else
add reg1, reg2, reg3          # f = g + h
j Exit                        # skip the else part
Else: sub reg1, reg2, reg3    # f = g - h
Exit:

The first thing to note is that we are testing for inequality even though the C statement is written in terms of an equality test. This is because this form is generally more efficient if one assumes that the branch is not taken more often than not. The second thing to note is the appearance of the unconditional jump instruction j, which simply tells the processor to fetch the next instruction from the address Exit.


Encoding Unconditional Branch Instructions

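• Unconditional PC-region jump, encoded as a J-Type instruction
  – opcode for J: 2
  – 26-bit PC-region offset with respect to PC+4: the field is shifted left by 2 bits and combined with the upper 4 bits of PC+4 to form the target address

[Figure: J-Type format: opcode = 2 (6 bits), PC-region offset (26 bits)]

• Unconditional register jump, encoded as an R-Type instruction
  – opcode: 0
  – funct for JR: 8
  – rs contains the branch address

[Figure: R-Type format: opcode = 0 (6), rs (5), rt (5), rd (5), shamt (5), func (6)]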
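The two ways of forming a jump target can be sketched as follows. The sketch assumes the standard MIPS behavior for PC-region jumps, in which the 26-bit field is shifted left by two bits and placed inside the 256 MB region selected by the upper four bits of PC+4; the function names j_target and jr_target and the sample addresses are hypothetical.

```python
REGION_BYTES = 1 << 28  # a 26-bit word field shifted left by 2 spans 2^28 bytes


def j_target(pc, field26):
    """Target of a J-Type jump located at byte address pc."""
    region = (pc + 4) & 0xF0000000           # upper 4 bits of PC+4
    return region | ((field26 & 0x3FFFFFF) << 2)


def jr_target(regs, rs):
    """Target of a register jump: simply the content of register rs."""
    return regs[rs]


print(hex(j_target(0x00400000, 0x100)))
print(hex(j_target(0x10000000, 0x3FFFFFF)))
print(REGION_BYTES // (1024 * 1024), "MB")
print(hex(jr_target({31: 0x7FFF0000}, 31)))
```

A register jump has no region restriction because the register supplies all 32 address bits.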


Procedure Call: Invocation

The MIPS ISA (and therefore a MIPS processor) supports procedure calls through the instructions jal and jalr. These instructions are analogous to the j and jr instructions for unconditional branches in that they transfer control to the first instruction in the procedure. However, in a procedure call you must also ensure that at the end of the procedure you return to the caller (specifically, to the instruction in the caller following the procedure invocation). The jump and link instruction:

jal ProcedureAddress

jumps to the instruction at ProcedureAddress and stores the return address (i.e., PC+4) in register reg31 in an atomic operation.


Procedure Call: Return

After the procedure finishes execution it must return to the instruction at the return address. The address of this instruction is stored in $ra (the return address register, $ra = reg31). Thus to return to this address the procedure executes:

jr $ra

so that no new instruction for the return is required. Similar to the jump instruction j, the jal instruction is encoded as a J-Type instruction, with opcode = 3, as shown below. And similar to j, the jal instruction also uses a 26-bit offset field for PC-region addressing. The procedure address is therefore constrained to a 256 MB region (a 26-bit word offset spans 2^28 bytes).

[Figure: J-Type format for jal: opcode = 3 (6 bits), immediate (26 bits)]


Yet Another Instruction for Procedure Calls: jalr

• jalr: jump & link register instruction
  – encoded in R-Type format
  – jumps to the address in rs
  – opcode = 0; funct = 9
• The jalr instruction stores the return address in $ra = reg31
  – to return at the end of the procedure: jr $ra
• Used when the procedure’s address is not known at compile time or lies outside the 256 MB region of PC+4

[Figure: R-Type format: opcode = 0 (6), rs (5), rt (5), rd (5), shamt (5), func = 9 (6)]


Von Neumann Machine

[Figure: Von Neumann machine block diagram with main memory, input/output, and the CPU (control unit and ALU/datapath)]


Single-Cycle CPU

[Figure: single-cycle CPU block diagram showing the datapath and the control unit]


Single-Cycle CPU

• One clock cycle for each instruction
• No datapath resource can be used more than once per clock cycle
• Results in resource duplication for elements that must be used more than once. Examples:
  – Separate memory units for instructions and data
  – Two ALUs for conditional branches
    – One to compute the branch condition
    – One to compute the branch address


Shortcomings of Single-Cycle CPU

• Duplication of datapath elements
  – Separate instruction & data memory
  – Multiple ALUs
• The clock cycle must have the same length for all instructions
• The cycle is determined by the longest path, which is the load instruction:
  – memory (fetch from instruction memory)
  – register (read base address)
  – ALU (compute memory address)
  – data memory (read from data memory)
  – register (write into destination register)
• Several instructions require a shorter cycle

lw reg5, 8(reg3)        # reg5 = Mem[reg3 + 8]
add reg1, reg2, reg5    # reg1 = reg2 + reg5


Multi-Cycle CPU

[Figure: datapath divided into five stages: IF, ID, EX, MEM, WB]

• Break up the datapath into smaller functional segments
• Each instruction uses only the functions it needs
• Run with a shorter clock cycle (faster clock)


Additional Datapath Elements

• Internal registers (invisible to the programmer)
  – To store intermediate results from one clock cycle to the next during execution of each instruction
  – Similar to a scratchpad or a temporary variable
• Instruction Register (IR)
• A and B registers to store register operands read from the register file
• ALUout to store the result of an ALU operation
• Load Memory Data Register (LMDR)


Internal & External Registers

• Internal registers store data from one clock cycle to the next within a single instruction cycle. At the end of a clock cycle, data needed in subsequent clock cycles must be stored in an internal register
• Data needed by subsequent instructions of the program is stored in the external registers or memory
• At the end of a clock cycle, data is stored in one, the other or both register classes


Instruction Steps - 1

In the simplest implementation, instructions take at most 5 clock cycles:

• instruction fetch (IF)
• instruction decode/register fetch cycle (ID)
• execution/effective address cycle (EX)
• memory access/branch completion cycle (MEM)
• write-back cycle (WB)

Which instructions require no less than 5 cycles to complete?


1. Instruction Fetch Cycle (IF)

IR = Mem[PC]
PC = PC + 4

• The PC register content is applied to the instruction memory address bus
• The instruction is fetched and saved in the IR register to be used in the ID stage
• The PC is incremented by 4 to compute the address of the next sequential instruction


2. Instruction Decode/Register Fetch Cycle (ID)

A = Regs[rs]
B = Regs[rt]
Imm = (immediate, sign-extended)

• Functions in the ID stage
  – Decode the instruction
  – Access the register file to read the registers and store them in A and B (at the next clock edge) for use in the next cycles

Fixed-Field Decoding

• Decoding is done in parallel with reading the register file because these fields occur in fixed locations for all instructions
• Reading registers that will not subsequently be used is harmless

[Figure: I-Type: opcode (6), rs (5), rt (5), immediate (16). R-Type: opcode (6), rs (5), rt (5), rd (5), func (11).]


3. Execution/Effective Address Cycle (EX)

Memory Reference Instruction

ALUout = A + Imm

The ALU adds the operands prepared in the last clock cycle (the base register value in A and the sign-extended immediate). The result is the effective address of the operand for a load/store.

Register-Register ALU Instruction

ALUout = A func B

The ALU performs the operation selected by func on the operands prepared in A and B in the last cycle.

[Figure: I-Type: opcode (6), rs (5), rt (5), immediate (16). R-Type: opcode (6), rs (5), rt (5), rd (5), func (11).]


4. Memory Access or ALU Instruction Completion (MEM)

Memory Reference

LMDR = Mem[ALUout]    (load)
or
Mem[ALUout] = B       (store)

• Use the memory address computed by the ALU and stored in ALUout in the previous clock cycle
• Access memory and perform a read or write depending on load or store

ALU Instruction (R- or I-Type)

Regs[IR[11..15]] = ALUout    # I-Type ALU
Regs[IR[16..20]] = ALUout    # R-Type ALU

• Store the result from ALUout in the destination register

[Figure: I-Type: opcode (6), rs (5), rt (5), immediate (16). R-Type: opcode (6), rs (5), rt (5), rd (5), func (11).]


5. Write-Back Cycle (WB)

Load Instruction

Regs[IR[11..15]] = LMDR

Load the data into the destination register rd. The data was fetched in an earlier clock cycle and stored in LMDR.

[Figure: I-Type format: opcode (6), rs (5), rd (5), immediate (16); for a load, the destination register rd occupies the field otherwise labeled rt]

What to do in this stage for conditional branch instructions?
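The five steps can be followed for a single load instruction with a short trace. This is a sketch under simplifying assumptions, not a model of real hardware: the instruction is represented as an already-decoded dictionary with hypothetical keys rs, rt and immediate, the immediate is assumed to be sign-extended already, and memories are plain dictionaries indexed by byte address.

```python
def run_lw(pc, regs, data_memory, instr_memory):
    """Walk one lw through IF, ID, EX, MEM and WB using the internal registers."""
    # 1. IF: fetch the instruction and increment the PC
    ir = instr_memory[pc]
    pc = pc + 4
    # 2. ID: read both register fields; B is not used by a load (harmless)
    a = regs[ir["rs"]]
    b = regs[ir["rt"]]
    imm = ir["immediate"]
    # 3. EX: compute the effective address
    alu_out = a + imm
    # 4. MEM: read data memory into the load memory data register
    lmdr = data_memory[alu_out]
    # 5. WB: write the loaded value into the destination register
    regs[ir["rt"]] = lmdr
    return pc


regs = {1: 0, 3: 1000}                                    # reg1, reg3
data_memory = {1032: 42}
instr_memory = {0: {"rs": 3, "rt": 1, "immediate": 32}}   # lw reg1, 32(reg3)
next_pc = run_lw(0, regs, data_memory, instr_memory)
print(next_pc, regs[1])
```

Each local variable (ir, a, b, alu_out, lmdr) plays the role of one internal register that carries a value from one clock cycle to the next.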


Example

Assume the following instruction mix and clock cycles:

Instruction class   Frequency   Clock cycles
load                23%         5
store               13%         4
branches            19%         3
jumps               2%          3
ALU                 43%         4

What is the average CPI?

CPI = 0.23 x 5 + 0.13 x 4 + 0.19 x 3 + 0.02 x 3 + 0.43 x 4 = 4.02

Performance improvement over the single-cycle CPU, assuming the single-cycle clock period is as long as 5 multi-cycle clock periods: 5.0/4.02 = 1.24
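The weighted average can be recomputed directly from the instruction mix. This is a minimal sketch; the dictionary layout and variable names are chosen here for illustration, and the speedup assumes, as in the example, that one single-cycle clock period equals five multi-cycle clock periods.

```python
# instruction class -> (fraction of instructions, clock cycles)
mix = {
    "load": (0.23, 5),
    "store": (0.13, 4),
    "branch": (0.19, 3),
    "jump": (0.02, 3),
    "alu": (0.43, 4),
}

average_cpi = sum(fraction * cycles for fraction, cycles in mix.values())
speedup = 5.0 / average_cpi

print(round(average_cpi, 2))
print(round(speedup, 2))
```

Changing the fractions shows how strongly the result depends on the frequent instruction classes with 4 or 5 cycles.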


Multi-Cycle CPU Summary

Multi-cycle CPU improves performance but not by much

Performance is limited by the high frequency of instructions with high CPI (load, store, ALU)

Significant performance gain can be made through pipelining

Pipelining model uses the same stages as the multi-cycle CPU



What is Pipelining?

Pipelining is an implementation technique in which the execution of sequential instructions is overlapped in time

It improves instruction execution throughput, but not the execution time of individual instructions

Hazard: refers to situations when the next instruction in the pipeline cannot be executed in the following clock cycle


Laundry Example

Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold

• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
• “Execution time”: 90 minutes


Sequential Laundry

• Sequential laundry takes 6 hours for 4 loads
• Minimum = 30 + 40 + 20 minutes = 1.5 hours per load

[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30 + 40 + 20 minutes, one after the other]


Pipelined Laundry: Start work ASAP

• Pipelined laundry takes 3.5 hours for 4 loads
  – 30 + 4 x 40 + 20 = 210 min = 3.5 hrs
• Min time = 1.5 hrs
• Must waste 30 min for each load because the stages are not balanced
  – 1.5 + 4 x (1/2) = 3.5 hrs

[Figure: timeline starting at 6 PM; loads A, B, C, D overlapped in task order; segment lengths 30, 40, 40, 40, 40, 20 minutes]

• Each stage must be 40 minutes long: the washer and folder stages must be extended to 40 minutes
  – 40 x 3 = 120 min (execution time)
  – minimum = 40 + 30 + 20 = 90 min
  – balance = 120 - 90 = 30 minutes
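The sequential and pipelined totals can be reproduced with a small scheduling sketch. It assumes that each load enters a stage as soon as both the load and the machine are free, and that a finished load can wait between machines; the function name finish_time is hypothetical.

```python
def finish_time(stage_minutes, loads):
    """Minute at which the last load leaves the last stage."""
    stage_free = [0] * len(stage_minutes)  # when each machine is next available
    done = 0
    for _ in range(loads):
        ready = 0  # every load is available from the start
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free[s])  # wait for the load and the machine
            ready = start + minutes
            stage_free[s] = ready
        done = ready
    return done


stages = [30, 40, 20]              # washer, dryer, folder
sequential = sum(stages) * 4       # one load at a time
pipelined = finish_time(stages, 4)

print(sequential, pipelined)
print(round(sequential / pipelined, 2))
```

The 40-minute dryer sets the pace: after the first load, one load finishes drying every 40 minutes regardless of the faster washer and folder.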


Pipelining Lessons

• Pipelining doesn’t help the latency of a single task (in fact it increases it in our example); it helps the throughput of the entire workload
• The pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
• Stall for dependencies
  – Dependency example: Cathy’s socks end up in Brian’s dryer load by mistake

[Figure: pipelined laundry timeline starting at 6 PM; loads A, B, C, D in task order; segment lengths 30, 40, 40, 40, 40, 20 minutes]


Summary: MIPS Instruction Formats

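[Figure: MIPS instruction formats, repeated from the earlier summary. R-Type: opcode (6), rs (5), rt (5), rd (5), shamt (5), func (6). I-Type: opcode (6), rs (5), rt (5), immediate (16). J-Type: opcode (6), immediate (26).]

This ISA was designed to allow efficient pipelining of instructions in hardware.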


MIPS 5-Stage Integer Pipeline

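[Figure: MIPS 5-stage integer pipeline datapath. Stages: Instruction Fetch; Instr. Decode/Reg. Fetch; Execute/Addr. Calc; Memory Access; Write Back. Elements shown: PC, adder (+4), instruction memory, NPC, IR, register file, sign extension (labels 16 and 64), A, B, Imm., multiplexers, Zero? test, Cond., ALU, ALU output, data memory, LMD.]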


Pipelining Example

Instruction class   IF   ID   EX   MEM   WB   Total
ALU type            2    1    2    0     1    6 ns
Load                2    1    2    2     1    8 ns
Store               2    1    2    2     0    7 ns
Branch              2    1    2    0     0    5 ns

IF = instruction memory, ID = register read, EX = ALU operation, MEM = data memory, WB = register write. All times are in ns.

[Figure: four successive load instructions, each passing through IF, ID, EX, MEM, WB; time axis from 0 to 16 ns]


Pipelining Example - Continued

[Figure: four successive load instructions, each passing through IF, ID, EX, MEM, WB; time axis from 0 to 16 ns]

Performance Improvement

• Pipelined case: a continuous stream of instructions can be fed in every 2 ns. This cycle time corresponds to the stage which takes the longest.
• Non-pipelined case (single-cycle): a continuous stream of instructions can be fed in every 8 ns. This corresponds to the instruction which takes the longest (load).
• Latency degradation: each instruction takes 5 x 2 ns = 10 ns > 8 ns, as compared to the single-cycle CPU
• The improvement is in throughput. Instructions take at least the same amount of time (in this case instruction latency is longer because the stages are unbalanced)


Execution Time & Throughput

Number of instructions   Total execution time (ns)   Average execution time (ns)
1                        10                          10.0
2                        12                          6.0
3                        14                          4.7
4                        16                          4.0
5                        18                          3.6
6                        20                          3.3
7                        22                          3.1
8                        24                          3.0
9                        26                          2.9
10                       28                          2.8
...                      ...                         ...
infinity                 infinity                    2.0

• At the beginning it takes a few instructions to fill the pipeline and reach steady state
• At the end it takes a few instructions to drain the pipeline
• In steady state an instruction completes every 2 ns (typical program)
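The rows of the table follow from one formula: the first instruction needs all five 2 ns stages, and every further instruction adds one more 2 ns cycle. The sketch below recomputes the table under that assumption (no stalls); the function names are hypothetical.

```python
STAGES = 5
CYCLE_NS = 2


def total_time_ns(n):
    """Time for n instructions: fill the pipeline once, then one cycle each."""
    return (STAGES + n - 1) * CYCLE_NS


def average_time_ns(n):
    """Average time per instruction for a run of n instructions."""
    return total_time_ns(n) / n


table = [(n, total_time_ns(n), round(average_time_ns(n), 1)) for n in range(1, 11)]
for row in table:
    print(row)

# For a very long run the fill time is amortized and the average nears 2 ns
print(average_time_ns(1_000_000))
```

The average approaches the 2 ns cycle time because the fixed 8 ns of fill time is spread over more and more instructions.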


MIPS Pipeline

[Figure: three instructions overlapped across clock cycles CC1 to CC7, each using IM, Reg, ALU, DM, Reg in turn]

• Instruction memory (IM) and data memory (DM) are shown as separate units
• All operations in a pipeline stage must complete in one clock cycle
• Values passed from one stage to another must be stored in internal registers (pipeline registers)
• Pipeline registers are labeled with the names of the stages they connect: IF/ID, ID/EX, EX/MEM, MEM/WB
• Intermediate registers introduce delay in the datapath


Data Hazard

ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

[Figure: the five instructions above overlapped in the pipeline (IF, Reg, ALU, Mem, Reg), shown for clock cycles CC1 to CC5, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]


RAW Data Hazard

ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

• RAW stands for read after write: an instruction reads a register before an earlier instruction has written it
• R1 is not written back to the register file until the WB cycle (CC5) of the ADD instruction
• R1 is needed in the ID cycle of the succeeding instructions
  – CC3 for SUB
  – CC4 for AND
  – CC5 for OR
  – CC6 for XOR
• Unless the hazard is handled, the instructions that read R1 before it is written operate on the wrong operand value
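The cycle numbers above can be derived mechanically. The sketch below assumes an ideal five-stage pipeline with no stalls and no forwarding, in which the instruction at position i (counting from 0) is fetched in CC(i+1), reads its registers in CC(i+2) and writes back in CC(i+5). The tuple layout of program and the function name raw_hazards are hypothetical.

```python
program = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
    ("XOR", "R10", ("R1", "R11")),
]


def raw_hazards(program, producer_index=0, split_phase=False):
    """Names of instructions that read the producer's result too early."""
    dest = program[producer_index][1]
    wb_cycle = producer_index + 5          # producer writes back in this cycle
    hazards = []
    for i in range(producer_index + 1, len(program)):
        name, _, sources = program[i]
        id_cycle = i + 2                   # consumer reads registers here
        # With split-phase access, a read in the write cycle itself is safe
        too_early = id_cycle < wb_cycle if split_phase else id_cycle <= wb_cycle
        if dest in sources and too_early:
            hazards.append(name)
    return hazards


print(raw_hazards(program))
print(raw_hazards(program, split_phase=True))
```

The split_phase flag models writing the register file in the first half of a cycle and reading it in the second half, which removes OR from the list.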


Split-Phase Register Read/Write

• XOR operates correctly because its ID cycle is in CC6
• OR can be made to operate correctly by:
  – Writing the register file in the first half of the clock cycle
  – Reading the register file in the second half of the clock cycle

ADD R1, R2, R3
OR R8, R1, R9

[Figure: ADD and OR in the pipeline over CC1 to CC7; the register file is written in the 1st half of a cycle and read in the 2nd half]


Forwarding (aka Bypassing)

ADD R1, R2, R3
SUB R4, R1, R5

[Figure: ADD and SUB in the pipeline over CC1 to CC5; the ALUout value of ADD is fed back to the ALU input for SUB]

• The SUB instruction does not actually need the result until its EX cycle (CC4), and the ADD instruction has already computed it in the previous cycle (CC3)
• Forward the result of the ALU operation from the previous cycle
• The ALU result is written into ALUout in the EX/MEM pipeline register
• If the forwarding logic detects that one of the register operands has been “touched” by the previous ALU operation, the control logic selects the ALU input from EX/MEM instead of ID/EX
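The selection made by the forwarding logic amounts to a multiplexer choice per ALU input. The following sketch shows that choice for the ADD/SUB pair; the function name alu_operand, its parameters and the register values are hypothetical, and only forwarding from EX/MEM is modeled.

```python
def alu_operand(source_reg, id_ex_value, ex_mem_dest, ex_mem_aluout):
    """Pick the ALU input for one source register of the current instruction."""
    if ex_mem_dest is not None and ex_mem_dest == source_reg:
        return ex_mem_aluout  # forward the previous instruction's ALU result
    return id_ex_value        # normal path: value read from the register file


# ADD R1, R2, R3 with R2 = 10 and R3 = 20 leaves ALUout = 30 in EX/MEM
ex_mem_dest, ex_mem_aluout = "R1", 10 + 20

# SUB R4, R1, R5 read a stale R1 (0) and R5 = 4 during its ID cycle
a = alu_operand("R1", 0, ex_mem_dest, ex_mem_aluout)
b = alu_operand("R5", 4, ex_mem_dest, ex_mem_aluout)
print(a - b)
```

Without the first branch of alu_operand, SUB would compute 0 - 4 from the stale register value.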


LW R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9

[Figure: the four instructions above in the pipeline (IM, Reg, ALU, DM, Reg) over CC1 to CC7, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; SUB and the instructions after it are delayed by one cycle]

Stall in the Pipeline

• No instruction begins in CC3
• No instruction completes in CC6
• Forwarding is now through MEM/WB
• OR no longer requires forwarding


Tabular View of Pipelining

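Without the stall:

Instruction     CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8
LW R1,0(R2)     IF     ID     EX     MEM    WB
SUB R4,R1,R5           IF     ID     EX     MEM    WB
AND R6,R1,R7                  IF     ID     EX     MEM    WB
OR R8,R1,R9                          IF     ID     EX     MEM    WB

With the stall:

Instruction     CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8
LW R1,0(R2)     IF     ID     EX     MEM    WB
SUB R4,R1,R5           IF     Stall  ID     EX     MEM    WB
AND R6,R1,R7                         IF     ID     EX     MEM    WB
OR R8,R1,R9                                 IF     ID     EX     MEM

(The WB cycle of OR falls in CC9, outside the table.)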
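The stalled schedule can be generated rather than drawn by hand. The sketch below assumes that a stall is inserted between IF and ID, as in the table, and that it also delays the fetch of every later instruction; the function name pipeline_rows and the (text, stall cycles) input format are hypothetical.

```python
STAGES_AFTER_FETCH = ["ID", "EX", "MEM", "WB"]


def pipeline_rows(program):
    """Return (text, row) pairs; row[k] is the stage occupied in cycle CC(k+1)."""
    rows = []
    fetch_cycle = 0
    for text, stalls in program:
        row = [""] * fetch_cycle + ["IF"] + ["Stall"] * stalls + STAGES_AFTER_FETCH
        rows.append((text, row))
        fetch_cycle += 1 + stalls  # a stalled instruction delays the next fetch
    return rows


program = [
    ("LW R1,0(R2)", 0),
    ("SUB R4,R1,R5", 1),  # one stall cycle: waits for the loaded value
    ("AND R6,R1,R7", 0),
    ("OR R8,R1,R9", 0),
]
rows = pipeline_rows(program)
for text, row in rows:
    print(text.ljust(14), " ".join(cell.ljust(5) for cell in row))
```

In the generated rows no instruction has IF in the third column and none has WB in the sixth, matching the observations that no instruction begins in CC3 and none completes in CC6.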
