cpe555a: real-time embedded systems
Fall 2015, arz
CPE555A – Real-Time Embedded Systems, Stevens Institute of Technology
CPE555A: Real-Time Embedded Systems
Lecture 2, Ali Zaringhalam
Stevens Institute of Technology
Outline
• RISC ISA
• Single-cycle CPU
• Multi-cycle CPU
• Pipelining
• Pipeline hazards
Von Neumann Machine
[Figure: block diagram of a Von Neumann machine. The CPU (Control Unit and ALU/Datapath) is connected to Main Memory and Input/Output.]
Fetch/Execute Cycle
[Figure: flowchart of the fetch/execute cycle. Boxes: Start, Fetch Next Instruction, Execute Instruction, Handle Interrupts (If Any). Path labels: Interrupts Disabled, Interrupts Enabled.]
The Program Counter (PC) register holds the address of the current instruction.
After the instruction is fetched, the PC is automatically incremented to point to the next sequential instruction.
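The fetch-then-increment behavior of the PC can be sketched in a few lines of Python. This is only a toy model: the instruction tuples and the `set`/`addc` operations are made up for illustration (they are not MIPS encodings), interrupts are left out, and each instruction is assumed to occupy 4 bytes.

```python
# Toy fetch/execute loop. The instruction tuples are hypothetical, not MIPS encodings.
def run(program, max_steps=100):
    regs = {}
    pc = 0                                  # Program Counter: byte address of the current instruction
    trace = []                              # addresses fetched, in order
    for _ in range(max_steps):
        index = pc // 4                     # assumption: every instruction is 4 bytes long
        if index >= len(program):
            break
        op, dest, value = program[index]    # fetch
        trace.append(pc)
        pc += 4                             # PC now points to the next sequential instruction
        if op == "set":                     # execute
            regs[dest] = value
        elif op == "addc":
            regs[dest] = regs.get(dest, 0) + value
    return regs, trace

regs, trace = run([("set", "r1", 5), ("addc", "r1", 3), ("set", "r2", 7)])
print(regs, trace)
```

A branch instruction would simply overwrite `pc` during the execute step instead of leaving the incremented value in place.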
Need for Instructions
• We need a way to tell the processor what steps to take to execute our program. In the Von Neumann model this includes:
  – fetching data from memory
  – performing arithmetic and logical operations on the data
  – storing the results of computation in memory
  – performing input/output
• In addition, the processor must support certain high-level programming constructs. These include:
  – modifying the sequential flow of control for if-then-else and case statements
  – subroutine calls to support structured programming
Examples of Instructions
Here are some example instructions:
• Loading data from a memory location into a register
    load Register1, Memory_Address
• Storing data from a register to a memory location
    store Register1, Memory_Address
• Adding the contents of two source registers and storing the result in a third destination register
    add Register3, Register1, Register2
• Comparing two registers and fetching the next instruction based on the result of the comparison (either sequential or from a branch)
    bne Register1, Register2, Instruction_Address
RISC Instruction Set Architecture
MIPS is a flavor of the more generic class of Reduced Instruction Set Computer (RISC) Instruction Set Architecture (ISA)
Here are some examples of RISC processors:
• PowerPC
• SPARC
• MIPS
• ARM (heavily used in embedded systems today)
The ISAs implemented in these machines are not quite the same but share a large set of common characteristics (to be discussed shortly)
Summary: MIPS Instruction Formats
R-Type Format
Field:  opcode  rs  rt  rd  shamt  func
Bits:   6       5   5   5   5      6

I-Type Format
Field:  opcode  rs  rt  immediate
Bits:   6       5   5   16

J-Type Format
Field:  opcode  immediate
Bits:   6       26
This ISA was designed to allow efficient pipelining of instructions in HW
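As a rough illustration of how the field widths above determine the 32-bit instruction word, the sketch below packs each format with shifts and masks. The helper names are hypothetical. The opcode and func values used in the examples (0x20 for add, 0x23 for lw) are the standard MIPS32 values and do not appear on the slides.

```python
def encode_r(opcode, rs, rt, rd, shamt, func):
    # R-Type: opcode(6) rs(5) rt(5) rd(5) shamt(5) func(6)
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | func

def encode_i(opcode, rs, rt, immediate):
    # I-Type: opcode(6) rs(5) rt(5) immediate(16); the mask keeps a negative immediate to 16 bits
    return (opcode << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)

def encode_j(opcode, immediate):
    # J-Type: opcode(6) immediate(26)
    return (opcode << 26) | (immediate & 0x3FFFFFF)

add_word = encode_r(0, 2, 3, 1, 0, 0x20)   # add reg1, reg2, reg3
lw_word = encode_i(0x23, 3, 1, 32)         # lw reg1, 32(reg3)
j_word = encode_j(2, 0x100)                # j with a 26-bit field of 0x100
print(hex(add_word), hex(lw_word), hex(j_word))
```

Because every format is 32 bits wide and the opcode, rs, and rt fields sit at fixed positions, the hardware can start reading registers before it has finished decoding.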
What’s in an ISA?
• Above all, an ISA is a set of specifications.
• An ISA gives you a set of requirements on what to build (i.e., support) in a processor. These include:
  – the set of instructions that the processor must support
  – the number of programmable registers
  – instruction format, including size and encoding
  – the interface between the processor and the operating system for exception handling
  – which features are required and which are optional (for example, integer arithmetic is required but floating-point arithmetic is typically optional)
  – in short: whatever is required to ensure binary compatibility between two machines implementing the same ISA
What Isn’t in an ISA?
• An ISA doesn’t tell you how to build a processor. For example: should it be pipelined? How many instructions should be issued per cycle?
• This separation of specification and implementation permits:
  – processor vendors to implement the ISA in different ways based on technology/performance/cost requirements
  – compiler developers to develop compilers that translate to an ISA independently of the processor’s specific implementation (this is not entirely true when it comes to performance optimization)
  – an ISA to live longer than a specific implementation (a particular processor becomes obsolete long before an ISA is abandoned in favor of a new one)
Characteristics of RISC Processors
• Large number of General Purpose Registers
• Strictly load/store
• Fixed-size instructions
• Variable-format instructions
• Limited number of addressing modes
• Small instruction set (MIPS32 has 168 instructions vs. ~700 in VAX)
[Figure: the MIPS R-Type, I-Type, and J-Type instruction formats, repeated from the summary slide.]
RISC Alternative: CISC
CISC: Complex Instruction Set Computer
• variable-length, variable-format instructions
• complex instructions
• memory-register instructions
• complex addressing modes
• Example: Intel’s IA32
[Figure with labels: CISC, RISC.]
What’s a General-Purpose Register?
A general-purpose register (GPR) is a programmable register which can be used for any purpose the programmer deems necessary. By programmer here we generally mean developers of compilers and low-level code such as drivers. They can use GPRs in any way that suits their purpose to optimize performance. The “general” in GPR should be contrasted with special-purpose registers. These include the accumulator register and registers reserved for holding the base address and offset index of arrays. Special-purpose registers were common in early processors because registers were expensive and compilers weren’t smart enough to exploit them. Processors such as those from Intel which are based on the legacy 8086 ISA continue to support special-purpose registers.
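Storage-Device Hierarchy
[Figure: storage-device hierarchy, ordered by increasing access time. Access times shown for successive levels: 0.25-0.5 ns, 0.5-20 ns, 80-250 ns. For reference, a 4 GHz CPU has a cycle time T = 0.25 ns.]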
Why a Large Number of GPRs?
• Registers are cheaper to make now
• Registers offer compiler writers flexibility
  – compiler developers prefer unreserved registers
• Registers are faster to access than main memory or cache
• Registers can store variables for as long as necessary. This reduces the need to access memory for data
• We can address registers with fewer bits compared to addressing main memory. This reduces code size (i.e., improves code density)
  – in MIPS we need 5 bits to address 32 registers
  – in a 32-bit machine we need 32 bits to address a memory location
MIPS Register Organization
• 32 GPRs (or integer registers), each 32 bits wide
  – reg31 is used to store the return address during procedure calls. At other times it can be used for any purpose.
• Why not consider more than 32 registers?
  – addressing registers in instructions requires address bits: we need n bits to address 2^n registers (5 bits to address 32 registers); there is a tradeoff between the number of GPRs and instruction size
  – more registers means more hardware (e.g., gates, wires); more hardware translates into a longer datapath and a longer clock cycle (lower clock rate)
Data Transfer Instructions
Even with 32 GPRs, it is impossible to keep the complex data structures and arrays used in a typical program in registers. To store these we need a much larger storage capacity, which is provided by main memory. This requires a means to transfer data between main memory and GPRs.
The load instruction transfers data from memory to a register. In the following instruction, data is transferred from memory location MEM[reg3 + 32] to the reg1 register.
    lw reg1, 32(reg3)
Similarly, the store instruction transfers data from a register to main memory.
    sw reg1, 32(reg3)
I-Type Format
Field:  opcode  rs  rt  immediate
Bits:   6       5   5   16
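The effect of lw and sw can be mimicked with a small Python model in which the effective address is the base register plus the immediate offset. The dictionary-based register file and memory, and the sample values in them, are assumptions made purely for illustration.

```python
regs = {"reg1": 0, "reg3": 1000}   # hypothetical register contents
mem = {1032: 42}                   # toy memory: byte address -> word value

def lw(rt, offset, rs):
    # load word: Regs[rt] = Mem[Regs[rs] + offset]
    regs[rt] = mem[regs[rs] + offset]

def sw(rt, offset, rs):
    # store word: Mem[Regs[rs] + offset] = Regs[rt]
    mem[regs[rs] + offset] = regs[rt]

lw("reg1", 32, "reg3")   # lw reg1, 32(reg3)
regs["reg1"] += 1        # some register-only computation
sw("reg1", 36, "reg3")   # sw reg1, 36(reg3)
print(regs, mem)
```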
Example
How would the following C assignment statement be compiled?
    g = h + A[8];
The compiler would typically assign g and h to some registers, say reg1 and reg2 respectively. It would then store the base address of the array in a third register, say reg3, and add it to the offset of array element 8. A word is 4 bytes, so the offset is 8x4=32. The compilation results in the instructions:
    lw  reg4, 32(reg3)     # reg4 = A[8]
    add reg1, reg2, reg4   # g = h + A[8]
The first instruction transfers data stored in memory to the temporary register reg4, and the second instruction adds this to the contents of register reg2, then stores the result in reg1.
Note that you must load data from memory into register reg4 before any arithmetic operation. Hence the name “load-store”, which means that you cannot use a memory operand in ALU instructions.
Memory Addressing
• What are the addressable objects in memory?
  – in most processors today, instructions can address and operate on individual bytes
  – but other multi-byte scalars such as word and half-word are also available for access
• And the issues for multi-byte scalars are...
  – how to organize the bytes of a multi-byte scalar in memory: little endian vs. big endian conventions. In other words, does the given memory address refer to the Least Significant Byte (LSB) or the Most Significant Byte (MSB)?
  – how to access multi-byte scalars: alignment restrictions
Little & Big Endian
Big endian: word address is the address of the most significant byte
Little endian: word address is the address of the least significant byte
Byte significance:       MSB  ...  ...  LSB
Big endian address:      B+0  B+1  B+2  B+3
Little endian address:   B+3  B+2  B+1  B+0
B is some base address.
Example: 0x12345678
Memory address   Big Endian   Little Endian
200              12           78
201              34           56
202              56           34
203              78           12
Big endian byte order is similar to writing English.
B=200 is the base address in this example.
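The two layouts of 0x12345678 can be reproduced with Python's built-in `int.to_bytes`. The variable names are arbitrary, and the base address 200 is taken from the example above.

```python
value = 0x12345678
base = 200   # base address B from the example

big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

# Map each memory address to the byte stored there, as two hex digits.
big_layout = {base + i: f"{b:02X}" for i, b in enumerate(big)}
little_layout = {base + i: f"{b:02X}" for i, b in enumerate(little)}
print(big_layout)
print(little_layout)
```

Reading the bytes back with the wrong byte order, for example `int.from_bytes(little, "big")`, yields 0x78563412. This is the kind of error that appears when machines of different endian-ness exchange raw data.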
Example Machines
• Little endian: 80x86, VAX, Alpha
• Big endian: SPARC, 680x0, IBM370/390, most RISC
• Bi-endian processors can be configured to operate in either big- or little-endian mode (e.g., MIPS64)
• When to worry about endian-ness?
  – byte/bit manipulation within a multibyte scalar (e.g., accessing the 3rd most significant byte of a 4-byte word)
  – data communication between machines of different endian-ness
MIPS is Strictly Load/Store?
    lw  reg5, 32(reg3)     # reg5 = A[8]
    add reg1, reg2, reg5   # g = h + A[8]
In a strictly load/store ISA, the only instructions that can access memory are load and store. The operands of other instructions must first be loaded into registers using the load instruction. The result of the operation can be stored in memory only by means of the store instruction. In MIPS (and more generally in RISC), instructions of the type:
    add reg1, reg2, 8(reg3)
where one or more of the operands reside in memory are not allowed. Instructions of this type are common in VAX and x86 ISAs.
Addressing Modes
• Instructions need to know where/what their operands are. So the question is how the operands are supplied to the instruction. The MIPS ISA supports three methods for this purpose:
  – immediate mode addressing: the operand is encoded directly in the instruction as a constant
  – register mode addressing: the operand is in a register and the address of the register is encoded in the instruction
  – displacement mode addressing: the operand is stored in memory and the address of the memory location is encoded in the instruction as a base register plus a displacement
Addressing Mode Examples
• Immediate mode
    addi reg4, reg4, 7       # Regs[reg4]=Regs[reg4]+7
  – 16-bit field for the constant
• Register mode
    add reg4, reg3, reg2     # Regs[reg4]=Regs[reg3]+Regs[reg2]
• Displacement mode
    lw reg4, 100(reg1)       # Regs[reg4]=Mem[Regs[reg1]+100]
  – 16 bits for the displacement
• Special cases of displacement mode
  – indirect mode: displacement value=0
      lw reg4, 0(reg1)       # Regs[reg4]=Mem[Regs[reg1]]
  – absolute addressing: reg0 as base register (always stores 0)
      lw reg4, 8700(reg0)    # Regs[reg4]=Mem[8700]
The MIPS ISA supports 3 addressing modes explicitly, but effectively we have 5 addressing modes at our disposal.
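The five effective modes can be compared side by side in a small Python model. The register and memory contents are invented for the example. The point is that indirect and absolute addressing fall out of the same displacement calculation.

```python
regs = [0] * 32          # regs[0] models reg0, which always holds 0
regs[1], regs[2], regs[3], regs[4] = 500, 10, 20, 1   # hypothetical register contents
mem = {600: 111, 500: 222, 8700: 333}                 # hypothetical memory contents

def displacement(base_reg, disp):
    # operand = Mem[Regs[base_reg] + disp]
    return mem[regs[base_reg] + disp]

immediate_result = regs[4] + 7            # immediate: the constant 7 comes from the instruction
register_result = regs[3] + regs[2]       # register: both operands are in registers
disp_result = displacement(1, 100)        # displacement: Mem[Regs[reg1] + 100]
indirect_result = displacement(1, 0)      # indirect: displacement of 0
absolute_result = displacement(0, 8700)   # absolute: reg0 as the base register
print(immediate_result, register_result, disp_result, indirect_result, absolute_result)
```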
I For Immediate Addressing Mode
• A large percentage of arithmetic operations have one constant operand (e.g., X=X+4)
• Keeping constants in memory and loading them from there is inefficient (consider storing all integer constants in memory!)
• ALU instructions with immediate addressing mode are designed to address this need
  – use the I-type instruction format
  – encode the constant in the instruction’s 16-bit immediate field
  – constants in the range -2^15 to (2^15 - 1) can be encoded
  – example:
      addi R4, R8, 79
I-Type Format
Field:  opcode  rs  rt  immediate
Bits:   6       5   5   16
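The range limit on the 16-bit immediate, and the sign extension applied when the field is used, can be checked with two short helper functions. Both function names are made up for this sketch.

```python
def fits_imm16(value):
    # True if value can be encoded in a signed 16-bit immediate field
    return -(1 << 15) <= value <= (1 << 15) - 1

def sign_extend16(field):
    # Interpret the low 16 bits of field as a two's-complement number
    field &= 0xFFFF
    return field - 0x10000 if field & 0x8000 else field

print(fits_imm16(79), fits_imm16(32768))           # 79 fits; 32768 is one past the maximum
print(sign_extend16(0xFFFF), sign_extend16(79))    # 0xFFFF is -1 when sign extended
```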
Conditional Branch Instructions
The processor fetches and executes instructions sequentially from memory. But programming languages have constructs such as if-then-else or switch statements. In these conditional branch statements, the outcome of a test determines whether to execute the next sequential instruction or fetch the next instruction from a branch address. As examples, MIPS supports the conditional branch instructions:
1. Branch if equal: beq reg1, reg2, Label. In this instruction reg1 and reg2 are compared, and if they are equal the next instruction will be fetched and executed from the address Label.
2. Branch if not equal: bne reg1, reg2, Label. In this instruction reg1 and reg2 are compared, and if they are not equal the next instruction will be fetched and executed from the address Label.
Unconditional Branch Instructions
How would one compile the following fragment:
    if(i==j) f=g+h; else f=g-h;
Assume: f in reg1, g in reg2, h in reg3, i in reg4, j in reg5.
Here is how:
          bne reg4, reg5, Else
          add reg1, reg2, reg3
          j   Exit
    Else: sub reg1, reg2, reg3
    Exit:
The first thing to note is that we are testing for inequality even though the C statement is written in terms of an equality test. This is because this form is generally more efficient if one assumes that the branch is not taken more often than not. The second thing to note is the appearance of the unconditional jump instruction j, which simply tells the processor to fetch the next instruction from the address Exit.
Encoding Unconditional Branch Instructions
• Unconditional PC-region jump, encoded as a J-Type instruction
  – opcode J: 2
  – the 26-bit field selects a word address within the 256 MB region of PC+4: the field is shifted left by 2 bits and combined with the upper 4 bits of PC+4
J-Type Format
Field:  opcode=2  PC-region offset
Bits:   6         26

• Unconditional register jump, encoded as an R-Type instruction
  – opcode: 0
  – funct JR: 8
  – rs contains the branch address
R-Type Format
Field:  opcode=0  rs  rt  rd  shamt  func
Bits:   6         5   5   5   5      6
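The target address of a J-Type jump can be worked out as follows. In standard MIPS the 26-bit field is shifted left by 2 bits (instructions are word aligned) and combined with the upper 4 bits of PC+4, which is why the target stays inside a 256 MB region. The function names and the sample addresses are illustrative assumptions.

```python
def j_target(pc, field26):
    # Upper 4 bits come from PC+4; the lower 28 bits are the 26-bit field shifted left by 2.
    region = (pc + 4) & 0xF0000000
    return region | ((field26 & 0x3FFFFFF) << 2)

def jr_target(regs, rs):
    # Register jump: the full 32-bit target is read from register rs.
    return regs[rs]

print(hex(j_target(0x00400000, 0x0100040)))   # a jump within the lowest 256 MB region
print(hex(j_target(0x10000000, 0x3FFFFFF)))   # the largest field value stays in the same region
```

A register jump has no such limit, because all 32 bits of the target come from the register.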
Procedure Call: Invocation
The MIPS ISA (and therefore a MIPS processor) supports procedure calls through the instructions jal and jalr. These instructions are analogous to the j and jr instructions for unconditional branch in that they transfer control to the first instruction in the procedure. However, in a procedure call you must also ensure that at the end of the procedure you return to the caller (specifically, to the instruction in the caller following the procedure invocation). The jump and link instruction:
    jal ProcedureAddress
jumps to the instruction at ProcedureAddress and stores the return address (i.e., PC+4) in register reg31 in an atomic operation.
Procedure Call: Return
After the procedure finishes execution it must return to the instruction at the return address. The address of this instruction is stored in $ra (the return address register, $ra=reg31). Thus to return to this address the procedure executes:
    jr $ra
so no new instruction is required for the return. Similar to the jump instruction j, the jal instruction is encoded as a J-type instruction, with opcode=3, as shown below. And similar to j, the jal instruction also uses a 26-bit offset field for PC-region addressing. The procedure address is therefore constrained to lie within a 256 MB boundary.
J-Type Format
Field:  opcode=3  immediate
Bits:   6         26
Yet Another Instruction for Procedure Calls: jalr
• jalr: jump & link register instruction
  – encoded in R-Type format
  – jumps to the address in rs
  – opcode = 0; funct = 9
• The jalr instruction stores the return address in $ra=reg31
  – to return at the end of the procedure: jr $ra
• Used when the procedure’s address is not known at compile time or is outside the 256 MB region of PC+4
R-Type Format
Field:  opcode=0  rs  rt  rd  shamt  func=9
Bits:   6         5   5   5   5      6
Von Neumann Machine
[Figure: block diagram of a Von Neumann machine. The CPU (Control Unit and ALU/Datapath) is connected to Main Memory and Input/Output.]
Single-Cycle CPU
Datapath.
Control unit.
Single-Cycle CPU
• One clock cycle for each instruction
• No datapath resource can be used more than once per clock cycle
  – results in resource duplication for elements that must be used more than once
  – examples:
      – separate memory units for instructions and data
      – two ALUs for conditional branches: one to compute the branch condition, one to compute the branch address
Shortcomings of Single-Cycle CPU
• Duplication of datapath elements
  – separate instruction & data memory
  – multiple ALUs
• Clock cycle must have the same length for all instructions
• Cycle determined by the longest path: the load instruction
  – memory (fetch from instruction memory)
  – register (read base address)
  – ALU (compute memory address)
  – data memory (read from data memory)
  – register (write into destination register)
• Several instructions require a shorter cycle
    lw  reg5, 32(reg3)     # reg5 = A[8]
    add reg1, reg2, reg5   # g = h + A[8]
Multi-Cycle CPU
• Break up the datapath into smaller functional segments: IF, ID, EX, MEM, WB
• Each instruction uses only the functions it needs
• Run with a faster clock (shorter clock cycle)
Additional Datapath Elements
• Internal registers (invisible to the programmer)
  – store intermediate results from one clock cycle to the next during execution of each instruction
  – similar to a scratchpad or a temporary variable
• Instruction Register (IR)
• A and B registers to store register operands read from the register file
• ALUout to store the result of an ALU operation
• Load Memory Data Register (LMDR)
Internal & External Registers
• Internal registers store data from one clock cycle to the next within a single instruction cycle. At the end of a clock cycle, data needed in subsequent clock cycles must be stored in an internal register
• Data needed by subsequent instructions in the program is stored in the external registers or memory
• At the end of a clock cycle, data is stored in one, the other, or both register classes
Instruction Steps - 1
• In the simplest implementation, instructions take at most 5 clock cycles
  – instruction fetch (IF)
  – instruction decode/register fetch cycle (ID)
  – execution/effective address cycle (EX)
  – memory access/branch completion cycle (MEM)
  – write-back cycle (WB)
• Which instructions require all 5 cycles to complete?
1. Instruction Fetch Cycle (IF)
    IR <- Mem[PC]
    PC <- PC+4
• PC register content is applied to the instruction memory address bus
• Instruction is fetched and saved in the IR register to be used in the ID stage
• PC is incremented by 4 to compute the address of the next sequential instruction
2. Instruction Decode/Register Fetch Cycle (ID)
    A = Regs[rs]
    B = Regs[rt]
    Imm = (immediate, sign extended)
• Functions in the ID stage
  – Decode the instruction
  – Access the register file to read the registers and store them in A & B (at the next clock edge) for use in the next cycles
Fixed-Field Decoding
• Decoding is done in parallel with reading the register file because these fields occur in fixed locations for all instructions
• Reading registers that will not subsequently be used is harmless
I-Type
Field:  opcode  rs  rt  immediate
Bits:   6       5   5   16
R-Type
Field:  opcode  rs  rt  rd  func
Bits:   6       5   5   5   11
3. Execution/Effective Address Cycle (EX)
Memory Reference Instruction
    ALUout = A + Imm
• The ALU adds the operands prepared in the last clock cycle (A and Imm). The result is the effective address of an operand for load/store.
Register-Register ALU Instruction
    ALUout = A func B
• The ALU performs the operation on the operands prepared in A and B in the last cycle.
I-Type
Field:  opcode  rs  rt  immediate
Bits:   6       5   5   16
R-Type
Field:  opcode  rs  rt  rd  func
Bits:   6       5   5   5   11
4. Memory Access or ALU Instruction Completion (MEM)
Memory Reference
    LMDR = Mem[ALUout]    (load)
  or
    Mem[ALUout] = B       (store)
• Use the memory address computed by the ALU and stored in ALUout in the previous clock cycle
• Access memory and perform a read or a write depending on load or store
I-Type
Field:  opcode  rs  rt  immediate
Bits:   6       5   5   16
ALU Instruction (R- or I-Type)
    Regs[IR11..15] = ALUout   # I-Type ALU
    Regs[IR16..20] = ALUout   # R-Type ALU
• Store the result from ALUout in the destination register.
R-Type
Field:  opcode  rs  rt  rd  func
Bits:   6       5   5   5   11
5. Write-Back Cycle (WB)
Load Instruction
    Regs[IR11..15] = LMDR
• Load the data into the destination register (IR11..15, the rt field of the I-Type format). The data was fetched in an earlier clock cycle and stored in LMDR.
I-Type
Field:  opcode  rs  rt (destination)  immediate
Bits:   6       5   5                 16
What to do in this stage for conditional branch instructions?
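The five steps for a load can be traced in a few lines of Python, with one variable per internal register. This is a simplified sketch: instruction decoding is skipped (the instruction is stored as already-decoded fields), and the memory and register contents are invented for the example.

```python
def run_load(pc, instr_mem, data_mem, regs):
    """Trace one load through IF, ID, EX, MEM, WB using the internal registers."""
    ir = instr_mem[pc]            # IF:  IR <- Mem[PC]
    pc += 4                       #      PC <- PC + 4
    a = regs[ir["rs"]]            # ID:  A <- Regs[rs]
    b = regs[ir["rt"]]            #      B <- Regs[rt] (read, but not used by a load)
    imm = ir["imm"]               #      Imm <- sign-extended immediate
    alu_out = a + imm             # EX:  ALUout <- A + Imm
    lmdr = data_mem[alu_out]      # MEM: LMDR <- Mem[ALUout]
    regs[ir["rt"]] = lmdr         # WB:  Regs[rt] <- LMDR
    return {"PC": pc, "A": a, "B": b, "Imm": imm, "ALUout": alu_out, "LMDR": lmdr}

regs = [0] * 32
regs[3] = 1000
instr_mem = {0: {"rs": 3, "rt": 1, "imm": 32}}   # stands for lw reg1, 32(reg3)
data_mem = {1032: 42}
state = run_load(0, instr_mem, data_mem, regs)
print(state, regs[1])
```

A store would stop after the MEM step (writing B to memory), and a register-register ALU instruction would write ALUout to the register file in its fourth step, which is why those instructions need fewer cycles.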
Example
Assume the following instruction mix and clock cycles:
Instruction class   Frequency   Clock cycles
load                23%         5
store               13%         4
branches            19%         3
jumps               2%          3
ALU                 43%         4
What is the average CPI?
CPI = .23 x 5 + .13 x 4 + .19 x 3 + .02 x 3 + .43 x 4 = 4.02
Performance improvement over single-cycle CPU: 5.0/4.02=1.24 (taking the single-cycle CPU’s clock cycle to be as long as 5 multi-cycle clock cycles)
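The same weighted average can be computed directly from the instruction mix. The dictionary layout is just one convenient way to hold the numbers from the example.

```python
# (frequency, clock cycles) for each instruction class, from the example above
mix = {
    "load": (0.23, 5),
    "store": (0.13, 4),
    "branch": (0.19, 3),
    "jump": (0.02, 3),
    "alu": (0.43, 4),
}

cpi = sum(freq * cycles for freq, cycles in mix.values())   # weighted average cycles per instruction
speedup = 5.0 / cpi   # single-cycle CPU modeled as 5 multi-cycle clock periods per instruction
print(round(cpi, 2), round(speedup, 2))
```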
Multi-Cycle CPU Summary
Multi-cycle CPU improves performance but not by much
Performance is limited by the high frequency of instructions with high CPI (load, store, ALU)
Significant performance gain can be made through pipelining
Pipelining model uses the same stages as the multi-cycle CPU
What is Pipelining?
Pipelining is an implementation technique where the execution of sequential instructions is overlapped in time
It improves instruction execution throughput, but not execution time of individual instructions
Hazard: refers to situations when the next instruction in the pipeline cannot be executed in the following clock cycle
Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
• “Execution time”: 90 minutes
Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads.
• Minimum=30+40+20 minutes=1.5 hours
[Figure: timeline from 6 PM to midnight with loads A, B, C, D in task order. Each load takes 30, 40, and 20 minutes in turn, and each load starts only after the previous one has finished.]
Pipelined Laundry: Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
  – 30+4x40+20=210 min=3.5 hrs
• Min time = 1.5 hrs
• Must waste 30 min for each load because stages are not balanced
  – 1.5 + 4 x (1/2) = 3.5 hrs
• Each stage must be 40 minutes long
  – 40x3=120 min (execution time)
  – minimum=40+30+20=90 min
  – balance=120-90=30 minutes
[Figure: timeline starting at 6 PM with loads A, B, C, D in task order; segment lengths 30, 40, 40, 40, 40, 20 minutes. Callout: the shorter stages (30 and 20 minutes) must be extended to 40 minutes.]
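The 6-hour and 3.5-hour totals can be reproduced with a short simulation. It is a sketch under one stated assumption: a finished load may wait between machines until the next machine is free. The function name is made up for the example.

```python
def pipelined_finish(stage_minutes, loads):
    """Return the time at which the last load leaves the last stage."""
    free_at = [0] * len(stage_minutes)   # time at which each machine is next free
    done = 0
    for _ in range(loads):
        t = 0                            # every load is ready at time 0
        for s, minutes in enumerate(stage_minutes):
            start = max(t, free_at[s])   # wait for the machine if it is still busy
            t = start + minutes
            free_at[s] = t
        done = t
    return done

stages = [30, 40, 20]                    # washer, dryer, folder
loads = 4
sequential = loads * sum(stages)         # one load at a time
pipelined = pipelined_finish(stages, loads)
print(sequential, pipelined)             # minutes
```

The 40-minute dryer is the bottleneck: once the pipeline is full, one load finishes every 40 minutes, regardless of how fast the washer and folder are.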
Pipelining Lessons
• Pipelining doesn’t help latency of a single task (in fact it increases it in our example); it helps throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = Number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
• Stall for dependencies
  – Dependency example: Cathy’s socks end up in Brian’s dryer load by mistake
[Figure: pipelined laundry timeline from 6 PM to 9 PM with loads A, B, C, D in task order; segment lengths 30, 40, 40, 40, 40, 20 minutes.]
Summary: MIPS Instruction Formats
R-Type Format
Field:  opcode  rs  rt  rd  shamt  func
Bits:   6       5   5   5   5      6

I-Type Format
Field:  opcode  rs  rt  immediate
Bits:   6       5   5   16

J-Type Format
Field:  opcode  immediate
Bits:   6       26
This ISA was designed to allow efficient pipelining of instructions in HW
MIPS 5-Stage Integer Pipeline
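[Figure: datapath of the MIPS 5-stage integer pipeline. Stages: Instruction Fetch; Instr. Decode/Reg. Fetch; Execute/Addr. Calc; Memory Access; Write Back. Labeled elements: PC, adder (+4), Inst. Mem., NPC, IR, Regs, Sign Ext. (labels 16 and 64), Imm., A, B, MUXes, Zero?, Cond., ALU, ALU Output, Data Mem., LMD.]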
Pipelining Example
Stage times in ns:
Instruction Class   Instruction memory (IF)   Register read (ID)   ALU operations (EX)   Data memory (MEM)   Register write (WB)   Total
ALU type            2                         1                    2                     0                   1                     6 ns
Load                2                         1                    2                     2                   1                     8 ns
Store               2                         1                    2                     2                   0                     7 ns
Branch              2                         1                    2                     0                   0                     5 ns
[Figure: four successive load instructions, each passing through IF, ID, EX, MEM, WB, overlapped in the pipeline; time axis from 0 to 16 ns.]
Pipelining Example - Continued
[Figure: four successive load instructions, each passing through IF, ID, EX, MEM, WB, overlapped in the pipeline; time axis from 0 to 16 ns.]
Performance Improvement
• Pipelined case: a continuous stream of instructions can be fed in every 2 ns. This cycle time corresponds to the stage which takes the longest.
• Non-pipelined case (single-cycle): a continuous stream of instructions can be fed in every 8 ns. This corresponds to the instruction which takes the longest (load).
• Instruction latency degrades compared to the single-cycle CPU: 5*2 ns = 10 ns > 8 ns
• The improvement is in throughput. Instructions take at least the same amount of time (in this case instruction latency is longer because the stages are unbalanced)
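The 2 ns, 8 ns, and 10 ns figures follow from the per-stage times in the pipelining example table. The sketch below recomputes them; the dictionary simply restates that table in the order IF, ID, EX, MEM, WB.

```python
# Stage times in ns, in the order IF, ID, EX, MEM, WB (from the pipelining example table)
stage_ns = {
    "ALU":    [2, 1, 2, 0, 1],
    "Load":   [2, 1, 2, 2, 1],
    "Store":  [2, 1, 2, 2, 0],
    "Branch": [2, 1, 2, 0, 0],
}

# Single-cycle CPU: the clock must fit the slowest whole instruction.
single_cycle_ns = max(sum(times) for times in stage_ns.values())
# Pipelined CPU: the clock must fit the slowest single stage.
pipeline_cycle_ns = max(max(times) for times in stage_ns.values())
# Every pipelined instruction passes through all 5 stages at the pipeline clock.
pipelined_latency_ns = 5 * pipeline_cycle_ns
print(single_cycle_ns, pipeline_cycle_ns, pipelined_latency_ns)
```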
Execution Time & Throughput
Number of Instructions   Total Execution Time (ns)   Average Execution Time (ns)
1                        10                          10.0
2                        12                          6.0
3                        14                          4.7
4                        16                          4.0
5                        18                          3.6
6                        20                          3.3
7                        22                          3.1
8                        24                          3.0
9                        26                          2.9
10                       28                          2.8
...                      ...                         ...
INFINITY                 INFINITY                    2.0
• At the beginning it takes a few instructions to fill the pipeline and reach steady state
• At the end it takes a few instructions to drain the pipeline
• In steady state an instruction completes every 2 ns (typical program)
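The table follows from a simple formula: with 5 stages and a 2 ns cycle, the first instruction finishes after 5 cycles and each later one finishes one cycle after its predecessor. A short sketch, with a made-up function name, that regenerates the rows:

```python
def total_time_ns(n, stages=5, cycle_ns=2):
    # The first instruction needs `stages` cycles; each additional one adds a single cycle.
    return (stages + n - 1) * cycle_ns

rows = [(n, total_time_ns(n), round(total_time_ns(n) / n, 1)) for n in range(1, 11)]
for n, total, average in rows:
    print(n, total, average)
```

As n grows, the fixed fill cost of 8 ns is spread over more instructions, so the average approaches the 2 ns cycle time.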
MIPS Pipeline
[Figure: three instructions flowing through the pipeline over clock cycles CC1 to CC7, each passing through IM, Reg, ALU, DM, Reg. Pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB separate the stages.]
• Instruction memory (IM) and data memory (DM) are shown as separate units
• All operations in a pipeline stage must complete in one clock cycle
• Values passed from one stage to another must be stored in internal registers (the pipeline registers)
• Pipeline registers are labeled with the names of the stages they connect: IF/ID, ID/EX, EX/MEM, MEM/WB
• Intermediate registers introduce delay in the datapath
Data Hazard
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
[Figure: the five instructions overlapped in the pipeline (IF, Reg, ALU, Mem, Reg), with clock cycles CC1 to CC5 shown and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]
RAW Data Hazard
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
•R1 is not written back to the register file until the WB cycle (CC5) of the ADD instruction
•R1 is needed in the ID cycle of the succeeding instructions
–CC3 for SUB
–CC4 for AND
–CC5 for OR
–CC6 for XOR
•Unless the hazard is handled, instructions that read R1 before it has been written back operate on the wrong operand value
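The cycle numbers above can be checked mechanically. The sketch below is an illustration, not part of the original slides: it assumes instruction i (0-based) is fetched in cycle i + 1, reads its operands in ID one cycle later, and writes back in the fifth stage, with no forwarding and no split-phase register file. The tuple layout and function name are hypothetical.

```python
# Each instruction is (opcode, destination, source1, source2).
PROGRAM = [
    ("ADD", "R1", "R2", "R3"),
    ("SUB", "R4", "R1", "R5"),
    ("AND", "R6", "R1", "R7"),
    ("OR", "R8", "R1", "R9"),
    ("XOR", "R10", "R1", "R11"),
]


def raw_hazards(program, id_stage=2, wb_stage=5):
    """List (producer, consumer, register, id_cycle, wb_cycle) for each
    read that happens no later than the producer's write-back cycle.

    Simplification: a later instruction overwriting the same register
    is not tracked; the example program does not need it.
    """
    hazards = []
    for i, (producer, dest, *_) in enumerate(program):
        wb_cycle = i + wb_stage  # cycle in which dest is written
        for j in range(i + 1, len(program)):
            consumer, _, *sources = program[j]
            id_cycle = j + id_stage  # cycle in which sources are read
            if dest in sources and id_cycle <= wb_cycle:
                hazards.append((producer, consumer, dest, id_cycle, wb_cycle))
    return hazards


for producer, consumer, reg, id_cycle, wb_cycle in raw_hazards(PROGRAM):
    print(f"{consumer} reads {reg} in CC{id_cycle}, "
          f"but {producer} writes it in CC{wb_cycle}")
```

Under these assumptions SUB, AND and OR are flagged, while XOR (ID in CC6) is not.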
Split-Phase Register Read/Write
•XOR operates correctly because its ID cycle is in CC6
•OR can be made to operate correctly by:
–Writing the register file in the first half of the clock cycle
–Reading the register file in the second half of the clock cycle
ADD R1, R2, R3
OR R8, R1, R9
[Figure: ADD and OR in the pipeline over clock cycles CC1 to CC7. In CC5, ADD writes R1 in the first half of the cycle and OR reads R1 in the second half. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB.]
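The effect of ordering the write before the read within one cycle can be illustrated with a toy model. This is a sketch only; the `RegisterFile` class, its `clock` method and the `split_phase` flag are hypothetical names, and real hardware does this with clock edges rather than sequential code.

```python
class RegisterFile:
    """Toy register file: one optional write and one optional read per cycle."""

    def __init__(self, num_regs=32):
        self.regs = {f"R{i}": 0 for i in range(num_regs)}

    def clock(self, write=None, read=None, split_phase=True):
        """Simulate one clock cycle.

        write: (register, value) or None; read: register name or None.
        With split_phase=True the write lands in the first half of the
        cycle and the read in the second half sees the new value.
        Otherwise the read returns the value from before the write.
        """
        stale = self.regs[read] if read is not None else None
        if write is not None:
            name, value = write
            self.regs[name] = value
        if read is None:
            return None
        return self.regs[read] if split_phase else stale


# CC5 of the example: ADD writes R1 while OR reads R1.
rf = RegisterFile()
rf.regs["R1"] = 7  # stale value left by earlier code
print(rf.clock(write=("R1", 42), read="R1"))  # split-phase read

plain = RegisterFile()
plain.regs["R1"] = 7
print(plain.clock(write=("R1", 42), read="R1", split_phase=False))
```

The first call returns the freshly written value and the second returns the stale one, which is the difference between OR working correctly and OR hitting the hazard.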
Forwarding (aka Bypassing)
ADD R1, R2, R3
SUB R4, R1, R5
[Figure: ADD and SUB in the pipeline over clock cycles CC1 to CC5, with ALUout in the EX/MEM register fed back to the ALU input. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB.]
•SUB does not need the result until CC4, and ADD has already computed it in the previous cycle (CC3)
•Forward the result of the ALU operation from the previous cycle
•The ALU result is written to ALUout in the EX/MEM pipeline register
•If the forwarding logic detects that one of the register operands has been “touched” by the previous ALU operation, the control logic selects the ALU input from EX/MEM instead of ID/EX
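The selection made by the forwarding logic can be sketched as a small function. This is an illustrative model, not the lecture's implementation: the `IdEx` and `ExMem` records and their field names are hypothetical, only the EX/MEM path described on this slide is covered, and R0 is assumed to be hardwired to zero as in MIPS.

```python
from collections import namedtuple

# Hypothetical, simplified pipeline-register contents.
IdEx = namedtuple("IdEx", "rs rt rs_value rt_value")   # operands read in ID
ExMem = namedtuple("ExMem", "reg_write rd alu_out")    # previous ALU result


def select_alu_inputs(id_ex, ex_mem):
    """Return the two ALU operands, taking ALUout from EX/MEM whenever the
    previous instruction writes a register that this instruction reads."""
    a, b = id_ex.rs_value, id_ex.rt_value
    # Assumption: R0 always reads as zero, so it is never forwarded.
    if ex_mem.reg_write and ex_mem.rd != "R0":
        if ex_mem.rd == id_ex.rs:
            a = ex_mem.alu_out
        if ex_mem.rd == id_ex.rt:
            b = ex_mem.alu_out
    return a, b


# ADD R1, R2, R3 with R2 = 10 and R3 = 20 leaves 30 in ALUout.
ex_mem = ExMem(reg_write=True, rd="R1", alu_out=30)
# SUB R4, R1, R5 read a stale R1 (0) and R5 = 4 during its ID cycle.
id_ex = IdEx(rs="R1", rt="R5", rs_value=0, rt_value=4)

a, b = select_alu_inputs(id_ex, ex_mem)
print(a - b)  # SUB now computes with the forwarded R1
```

A fuller forwarding unit would also compare against MEM/WB for results that are two instructions old.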
Stall in the Pipeline
LW R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
[Figure: the four instructions in the pipeline (IM, Reg, ALU, DM, Reg) over clock cycles CC1 to CC7, with a one-cycle stall behind LW. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB.]
•No instruction begins in CC3
•No instruction completes in CC6
•Forwarding is now through MEM/WB
•OR no longer requires forwarding
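The stall is needed because a loaded value is not available until the end of the load's MEM cycle, too late for the ALU stage of the very next instruction. A minimal sketch of the check follows; the tuple layout, `BUBBLE` marker and function names are hypothetical, and the load's base register is treated as its only source.

```python
# Instructions as (opcode, destination, *sources).
PROGRAM = [
    ("LW", "R1", "R2"),
    ("SUB", "R4", "R1", "R5"),
    ("AND", "R6", "R1", "R7"),
    ("OR", "R8", "R1", "R9"),
]

BUBBLE = ("NOP",)  # placeholder for the empty pipeline slot


def is_load_use(prev, curr):
    """True if curr reads the register loaded by the instruction before it."""
    prev_op, prev_dest, *_ = prev
    _, _, *curr_sources = curr
    return prev_op == "LW" and prev_dest in curr_sources


def insert_stalls(program):
    """Return the instruction stream with one bubble after each load
    whose result is used by the immediately following instruction."""
    scheduled = []
    prev = None
    for instr in program:
        if prev is not None and is_load_use(prev, instr):
            scheduled.append(BUBBLE)
        scheduled.append(instr)
        prev = instr
    return scheduled


for instr in insert_stalls(PROGRAM):
    print(" ".join(instr))
```

Only SUB triggers a bubble here; AND and OR are far enough behind LW that forwarding or the register file covers them.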
Tabular View of Pipelining
Without stall:
Instruction     CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
LW R1,0(R2)     IF   ID   EX   MEM  WB
SUB R4,R1,R5         IF   ID   EX   MEM  WB
AND R6,R1,R7              IF   ID   EX   MEM  WB
OR R8,R1,R9                    IF   ID   EX   MEM  WB

With stall:
Instruction     CC1  CC2  CC3    CC4  CC5  CC6  CC7  CC8
LW R1,0(R2)     IF   ID   EX     MEM  WB
SUB R4,R1,R5         IF   Stall  ID   EX   MEM  WB
AND R6,R1,R7                     IF   ID   EX   MEM  WB
OR R8,R1,R9                           IF   ID   EX   MEM
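A table like this can be generated from the instruction list and the set of stalled instructions. The sketch below is illustrative: the `schedule` function and its parameters are hypothetical, and it places the stall between IF and ID and delays later fetches by one cycle, which is how the slide draws it.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def schedule(program, stalled=(), cycles=8):
    """Return (name, cells) rows; cells[c] is the stage occupied in cycle c+1.

    stalled holds the indices of instructions that wait one cycle after IF.
    Stages that fall beyond the last displayed cycle are dropped.
    """
    rows = []
    next_if = 1  # cycle in which the next instruction is fetched
    for i, name in enumerate(program):
        sequence = ["IF"] + (["Stall"] if i in stalled else []) + STAGES[1:]
        cells = [""] * cycles
        for offset, stage in enumerate(sequence):
            cycle = next_if + offset
            if cycle <= cycles:
                cells[cycle - 1] = stage
        rows.append((name, cells))
        # A stalled instruction also holds back the fetch behind it.
        next_if += 2 if i in stalled else 1
    return rows


program = ["LW R1,0(R2)", "SUB R4,R1,R5", "AND R6,R1,R7", "OR R8,R1,R9"]
for name, cells in schedule(program, stalled={1}):
    print(f"{name:<14}" + "".join(f"{cell:<6}" for cell in cells).rstrip())
```

Calling `schedule(program)` with no stalled indices gives the ideal table, in which each instruction starts one cycle after its predecessor.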