basic pipelining concepts - chalmers · basic pipelining concepts appendix a (recommended reading,...
Basic Pipelining Concepts
Appendix A (recommended reading, not everything will be covered today)
– Basic pipelining
– Pipeline hazards
• Data hazards
• Control hazards
• Structural hazards
– Multicycle operations
2009
Instruction Execution
For each instruction:
1. Instruction fetch (IF)
2. Instruction decode, operand fetch (ID)
3. Execute computations (EX)
4. Memory access (MEM)
5. Write back results to registers (WB)
Number and types of steps can vary
between different ISAs and implementations
MIPS Single Cycle Implementation
[Figure, built up over six slides: single-cycle MIPS datapath. Instruction fetch: PC, instruction memory, ADD (PC + 4). Instruction decode/register fetch: register file (inputs rs, rt, rd; outputs data_rs, data_rt) and sign extension of the 16-bit address/immediate field to 32 bits. Execute/address calc.: ALU (result, status), shift left 2 and ADD. Memory access: data memory. Write back.]
Instruction formats (32 bit):
op rs rt rd shamt funct
op rs rt address/immediate
Problems
• All instructions take the time required by the longest instruction
• Alternative solutions:
1. Multicycle processors
2. Pipelining
• Both solutions require the implementation to change!
We will have a closer look at pipelining!
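A minimal sketch of the first problem, assuming hypothetical stage latencies (the picosecond values and the per-instruction stage lists below are illustrative and not taken from the slides). It shows why a single-cycle clock period is dictated by the slowest instruction:

```python
# Hypothetical stage latencies in picoseconds (assumed, not from the slides).
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually needs (assumed, typical for MIPS).
USES = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "add": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def instruction_time(name):
    """Time the instruction would need if it only paid for the stages it uses."""
    return sum(STAGE_PS[stage] for stage in USES[name])


def single_cycle_period():
    """A single-cycle design must fit the slowest instruction in one clock."""
    return max(instruction_time(name) for name in USES)


for name in USES:
    print(f"{name:<4} needs {instruction_time(name)} ps, "
          f"but is charged {single_cycle_period()} ps")
```

Under these assumed numbers every instruction pays for the load, even a branch that needs far less time.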
The Assembly Line Concept
• A pipelined processor is based on the assembly line concept
• One station for each stage in the instruction execution
• At any moment there is one instruction at each station
• One new instruction every cycle => CPI=1
Each instruction takes multiple cycles to complete, but the
throughput is high!
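The throughput claim can be checked with a small sketch, assuming an ideal five-stage pipeline with no stalls (the function names are illustrative). The first instruction needs one cycle per stage, and each following instruction completes one cycle later:

```python
def pipelined_cycles(n_instructions, n_stages=5):
    """Cycles to run n instructions on an ideal pipeline with no stalls."""
    if n_instructions == 0:
        return 0
    # Fill the pipeline once, then retire one instruction per cycle.
    return n_stages + (n_instructions - 1)


def cpi(n_instructions, n_stages=5):
    """Average cycles per instruction, including the pipeline fill time."""
    return pipelined_cycles(n_instructions, n_stages) / n_instructions


for n in (1, 4, 1000):
    print(n, pipelined_cycles(n), round(cpi(n), 3))
```

The latency of one instruction stays at five cycles, while the CPI approaches 1 as the instruction count grows.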
Pipeline for MIPS
[Figure: the MIPS datapath divided into five stages (instruction fetch, instruction decode/register fetch, execute/address calc., memory access, write back), with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the stages.]
Pipelining Example
add $5, $2, $3
lw $4, 100($5)
sw $4, 400($7)
beq $8, $9, 800
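A minimal sketch of how these four instructions overlap, assuming no stalls and assuming the program order add, lw, sw, beq that the hazard slides later in the lecture imply. The helper name `pipeline_chart` is illustrative:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(program):
    """Return one row per instruction: the stage it occupies in each cycle,
    or None when it is not in the pipeline. Assumes no stalls."""
    n_cycles = len(program) + len(STAGES) - 1
    chart = []
    for index in range(len(program)):
        row = [None] * n_cycles
        for offset, stage in enumerate(STAGES):
            # Instruction `index` is fetched in cycle `index` (0-based).
            row[index + offset] = stage
        chart.append(row)
    return chart


# Assumed program order (add first), see the note above.
program = [
    "add $5, $2, $3",
    "lw $4, 100($5)",
    "sw $4, 400($7)",
    "beq $8, $9, 800",
]
chart = pipeline_chart(program)
for text, row in zip(program, chart):
    print(f"{text:<18}", " ".join(f"{(stage or '.'):<3}" for stage in row))
```

Each column of the printed chart is one clock cycle; reading down a column shows which instruction sits at which station.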
[Figures: cycle-by-cycle snapshots of the four instructions moving through the pipelined datapath (pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB). Recoverable annotations: add reads $2 and $3 and computes $2+$3; lw reads $5 and the offset 100, computes $5+100 and loads M[$5+100]; sw reads $7, the offset 400 and $4, and computes $7+400; beq reads $8 and $9, computes $8-$9 and the target beq+4+3200 (the offset 800 shifted left by 2). The zero flag Z controls the PC mux, and the instructions at beq+4, beq+8 and beq+12 are already in the pipeline when the branch destination is loaded into PC.]
Pipeline Hazards
• Neighboring instructions are rarely independent
• In a pipeline, this can cause conflicts called hazards
• Three main types of hazards
– Data hazards
– Control hazards
– Structural hazards
Hazard Resolution
Pipeline hazards can be resolved in many different ways
• Stall: Stop parts of the pipeline until the conflicting instructions are sufficiently separated
• Make results available earlier
– Move calculations to earlier pipeline stages
– Make results available before they have been stored
– Guess results before they have been computed(!)
• Reorder instructions
Data Hazards
• Three types
– Read-After-Write (RAW).
– Write-After-Read (WAR). Does not occur in simple pipelines.
– Write-After-Write (WAW). Does not occur in simple pipelines.
• A RAW hazard occurs when
– An instruction needs the result of an earlier instruction that is stored in a register (or memory location),
– and the earlier instruction has not yet written the result
• Usually handled by (some combination of)
– data forwarding (bypassing)
– stalling
– instruction reordering
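The RAW condition above can be sketched as a simple check over register names. This is only an illustration: the `Instr` tuple, the function name and the `window` parameter (how many earlier instructions are still in flight) are assumptions, and the program uses the add, lw, sw, beq order implied by the hazard examples:

```python
from collections import namedtuple

# dest: register written (None if the instruction writes no register);
# srcs: registers read.
Instr = namedtuple("Instr", "text dest srcs")


def raw_hazards(program, window=2):
    """Find (producer, consumer, register) triples where the consumer reads a
    register written by one of the `window` preceding instructions."""
    found = []
    for j, consumer in enumerate(program):
        for i in range(max(0, j - window), j):
            producer = program[i]
            if producer.dest is not None and producer.dest in consumer.srcs:
                found.append((producer.text, consumer.text, producer.dest))
    return found


program = [
    Instr("add $5, $2, $3", "$5", ("$2", "$3")),
    Instr("lw $4, 100($5)", "$4", ("$5",)),
    Instr("sw $4, 400($7)", None, ("$4", "$7")),
    Instr("beq $8, $9, 800", None, ("$8", "$9")),
]

for producer, consumer, reg in raw_hazards(program):
    print(f"RAW on {reg}: '{consumer}' needs the result of '{producer}'")
```

The sketch only detects the dependences; whether each one is resolved by forwarding, stalling or reordering depends on the pipeline implementation.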
RAW Hazard Example
[Figure: pipelined datapath snapshot with add $5, $2, $3, lw $4, 100($5) and sw $4, 400($7) in flight; lw reads $5 before add has written its result.]
Solution: Data forwarding
[Figure: pipelined datapath snapshot in which the add result $2+$3 is forwarded from a pipeline register to the ALU, so lw does not have to wait for the register write.]
Another RAW Hazard Example
[Figure: pipelined datapath snapshot; sw $4, 400($7) reads $4 while lw $4, 100($5) has not yet loaded the value from memory.]
Solution: Stalling
[Figure: pipelined datapath snapshot with sw $4, 400($7) held in ID, beq $8, $9, 800 held in IF, and two nops inserted behind lw.]
PC and IF/ID not updated until lw reaches WB
"Bubbles" (nop = no operation) loaded into ID/EX until lw reaches WB
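A rough sketch of the cost of this stall, assuming the rule shown on the slide (the dependent instruction waits in ID until the load reaches WB, which costs two bubbles). The tuple layout, the function names and the `bubbles_per_hazard` parameter are illustrative assumptions; a design that forwards the loaded value from the MEM/WB register typically needs only one bubble. Control hazards from beq are ignored here:

```python
def load_use_pairs(program):
    """Return indices i where program[i] is a load and program[i + 1] reads
    the loaded register. Each instruction is (opcode, dest, sources)."""
    pairs = []
    for i in range(len(program) - 1):
        opcode, dest, _ = program[i]
        _, _, next_sources = program[i + 1]
        if opcode == "lw" and dest in next_sources:
            pairs.append(i)
    return pairs


def total_cycles(program, bubbles_per_hazard=2, n_stages=5):
    """Ideal pipeline time plus the bubbles inserted for load-use hazards."""
    stalls = bubbles_per_hazard * len(load_use_pairs(program))
    return n_stages + len(program) - 1 + stalls


# Assumed program order (add first), as implied by the hazard examples.
program = [
    ("add", "$5", ("$2", "$3")),
    ("lw", "$4", ("$5",)),
    ("sw", None, ("$4", "$7")),
    ("beq", None, ("$8", "$9")),
]

print(load_use_pairs(program))
print(total_cycles(program))
print(total_cycles(program, bubbles_per_hazard=1))
```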
Speedup Equation for Pipelining
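\[ \mathrm{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{pipeline stall cycles per instr.} \]
\[ \text{Speedup} = \frac{\mathrm{CPI}_{\text{unpipelined}}}{\text{Ideal CPI} + \text{stall cycles per instr.}} \times \frac{T_{c,\text{unpipelined}}}{T_{c,\text{pipelined}}} \]
Ideal CPI for a pipeline is normally 1. And, if
\[ \mathrm{CPI}_{\text{unpipelined}} \times \frac{T_{c,\text{unpipelined}}}{T_{c,\text{pipelined}}} = \text{number of pipeline stages}, \]
this gives
\[ \text{Speedup} = \frac{\text{number of pipeline stages}}{1 + \text{stall cycles per instr.}} \]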
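A small worked example of the general speedup equation. The clock periods (1000 ps unpipelined, 200 ps pipelined) and the stall rate of 0.25 cycles per instruction are hypothetical values chosen for illustration, not measurements from the slides:

```python
def speedup(cpi_unpipelined, tc_unpipelined, tc_pipelined,
            stalls_per_instr, ideal_cpi=1.0):
    """Speedup = CPI_unpipelined / (ideal CPI + stalls) * Tc_unpip / Tc_pip."""
    cpi_pipelined = ideal_cpi + stalls_per_instr
    return (cpi_unpipelined / cpi_pipelined) * (tc_unpipelined / tc_pipelined)


# Hypothetical numbers: a single-cycle design (CPI = 1) with a 1000 ps clock,
# split into five stages running at 200 ps.
print(speedup(1.0, 1000.0, 200.0, stalls_per_instr=0.0))
print(speedup(1.0, 1000.0, 200.0, stalls_per_instr=0.25))
```

With no stalls the speedup equals the number of stages; every stall cycle per instruction enlarges the denominator and pulls the speedup below that bound.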
Control Hazards
• Occur when the program counter (PC) is changed by
– a branch or jump instruction
– an exception (interrupt, trap, etc.)
• Usually handled by (some combination of)
– branch prediction
– earlier target address calculation
– stalling
– delayed branch
– instruction reordering (static or dynamic scheduling)
Control Hazard Example
[Figure: pipelined datapath snapshot in which beq $8, $9, 800 has computed its branch target while the instructions at beq+4, beq+8 and beq+12 have already been fetched.]
If the branch is taken, the instructions at (beq+4) through (beq+12) should not have been fetched
Solution: Stalling
• Stop fetching instructions after a branch instruction until the address of the next instruction has been determined
• This is very inefficient because branches tend to be very frequent
Solution: Earlier Branch Calculation
[Figure: pipelined datapath with the branch-target calculation (ADD, shift left 2) moved to an earlier stage.]
Both the address and the condition need to be calculated earlier (in ID)
Risk that the clock cycle must be lengthened, leading to an overall performance loss.
Solution: Branch-Delay Hiding Techniques
• Stall until branch condition and target are known
– Can cause significant penalties
• Predict branch not taken (a fairly rare case)
– Execute successor instructions in sequence
– "Squash" instructions in the pipeline if the branch is actually taken
– Works well if state is updated late in the pipeline
– 30%-38% of conditional branches are not taken on average
• Predict branch taken (a fairly common case)
– 62%-70% of conditional branches are taken on average
– Makes sense for more complex pipeline organizations
• Delayed branch (schedule a useful instr. in the delay slot)
– Define the branch to take place after a following instruction
Static Scheduling and Delayed Branch
– Scheduling an instruction from before the branch is always safe
– Scheduling from the target or from the not-taken path is not always safe; it must be guaranteed that the speculative instr. does no harm.
Conditional Delay Slot Execution
Cancelling (nullifying) branch instruction: cancel the instruction in the delay slot if the branch outcome does not match the prediction
Measurements on SPEC:
80% of the branch-delay slots can be filled with useful instructions
70% will be filled at run-time; 10% of the useful instructions will be cancelled because of mispredictions
Evaluating Branch Hazard Avoidance Techniques
\[ \text{Pipeline speedup} = \frac{\text{number of pipeline stages}}{1 + \text{Branch frequency} \times \text{Branch penalty}} \]

Scheduling scheme    Branch penalty (integer programs)    CPI     Speedup vs. unpipelined
Stall pipeline       1                                    1.17    4.3
Predict taken        1                                    1.17    4.3
Predict not taken    0.69                                 1.12    4.5
Delayed branch       0.21                                 1.04    4.8
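The table values can be reproduced from the speedup formula under two assumptions that the slide does not state explicitly: a branch frequency of 17% (inferred from CPI = 1.17 at a penalty of 1 cycle) and a five-stage pipeline. A minimal sketch:

```python
BRANCH_FREQUENCY = 0.17  # assumed; inferred from CPI = 1.17 at penalty 1
N_STAGES = 5             # assumed five-stage pipeline

# Branch penalties in cycles, taken from the table.
PENALTY = {
    "Stall pipeline": 1.0,
    "Predict taken": 1.0,
    "Predict not taken": 0.69,
    "Delayed branch": 0.21,
}


def cpi(penalty):
    """CPI = 1 + branch frequency * branch penalty."""
    return 1.0 + BRANCH_FREQUENCY * penalty


def speedup(penalty):
    """Pipeline speedup = number of stages / CPI."""
    return N_STAGES / cpi(penalty)


for scheme, penalty in PENALTY.items():
    print(f"{scheme:<18} CPI {cpi(penalty):.2f}  speedup {speedup(penalty):.1f}")
```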
Exceptions
• An exception (interrupt, trap, …) always causes a jump in execution
• Exceptions are extra difficult to handle as they cannot be predicted and may occur at different stages for different instructions
• Certain precise exceptions require that instructions in the beginning of the pipeline are thrown away and later restarted when the exception handler is finished
Respecting the Execution order
Exceptions may be generated in a different order than the instruction execution order
Example sequence:
lw (e.g., page fault in MEM)
add (e.g., page fault in IF)
The add instruction causes a fault before the load
Pipeline stage    Problem causing exception
IF                Page fault on instruction fetch; misaligned memory access; memory protection violation
ID                Undefined or illegal opcode
EX                Arithmetic exception
MEM               Page fault on data access; misaligned memory access; memory protection violation
WB                None
Structural Hazards
• Occur when two instructions require the same resource at the same time
• Blocked resources can be
– Pipeline stages, or functional units in pipeline stages
– Memory
– Register file
• Usually handled by (a combination of)
– stalling
– instruction reordering (e.g. dynamic scheduling, which will be covered later in the course)
Structural Hazard Example
[Figure: pipelined datapath snapshot with lw $4, 100($5), sw $4, 400($7) and beq $8, $9, 800 in flight and a nop bubble in the pipeline.]
lw has to wait on a cache miss in MEM => stall for previous stages
Multicycle Operations in the Pipeline
– Integer unit: Handles integer instructions, branches, and loads/stores
– Other units: May take several cycles each. Some units are pipelined (mult, add), others are not (div)
Parallel Execution of Instructions
MULTD F2,F4,F6     IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
ADDD F8,F10,F12    IF ID A1 A2 A3 A4 MEM WB
SUBI R2,R3,#8      IF ID EX MEM WB
LD F14,0(R2)       IF ID EX MEM WB
Structural and RAW hazards:
Structural hazards: Stall in ID stage when
• the functional unit is occupied
• many instructions can reach the WB stage at the same time
RAW hazards: Normal bypassing from MEM and WB stages
Stall in ID stage if any of the source operands is a destination operand of an instruction in any of the FP functional units
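A minimal sketch of when each of the four instructions above reaches WB, assuming one instruction is fetched per cycle starting in cycle 1 and nothing stalls. The execute-stage counts come from the stage names in the diagram (M1..M7, A1..A4, EX); the function names are illustrative:

```python
# (instruction, number of execute-stage cycles), read off the diagram.
PROGRAM = [
    ("MULTD F2,F4,F6", 7),
    ("ADDD F8,F10,F12", 4),
    ("SUBI R2,R3,#8", 1),
    ("LD F14,0(R2)", 1),
]


def wb_cycles(program):
    """Cycle in which each instruction is in WB, assuming one fetch per
    cycle starting at cycle 1 and no stalls."""
    result = {}
    for issue_index, (text, ex_cycles) in enumerate(program):
        fetch_cycle = issue_index + 1
        # IF, then ID, then the execute stages, then MEM, then WB.
        result[text] = fetch_cycle + 1 + ex_cycles + 1 + 1
    return result


def wb_conflicts(cycles):
    """Cycles in which more than one instruction wants the WB stage."""
    by_cycle = {}
    for text, cycle in cycles.items():
        by_cycle.setdefault(cycle, []).append(text)
    return {c: texts for c, texts in by_cycle.items() if len(texts) > 1}


cycles = wb_cycles(PROGRAM)
for text in sorted(cycles, key=cycles.get):
    print(f"cycle {cycles[text]:>2}: {text}")
print("WB conflicts:", wb_conflicts(cycles))
```

The instructions complete in a different order than they were fetched, which is what makes the structural (shared WB stage) and RAW checks in ID necessary.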
WAR and WAW Hazards for Multicycle Operations
– WAR hazards are a non-issue because operands are read in program order
– WAW hazards may occur
Example of a WAW hazard:
DIVF F0,F2,F4     FP divide, 24 cycles
SUBF F0,F8,F10    FP sub, 3 cycles
SUBF finishes before DIVF: out-of-order completion
WAW hazards are avoided by:
– stalling the SUBF until DIVF reaches the MEM stage, or
– disabling the write to register F0 for the DIVF instruction
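The DIVF/SUBF example can be checked with a deliberately simplified timing model: an instruction issued at time t with latency L writes its result at time t + L. This model, the function names and the `issue_gap` parameter are assumptions for illustration and do not reproduce the exact stall rule on the slide:

```python
def completes_out_of_order(first_latency, second_latency, issue_gap=1):
    """True if the second instruction would write its result before the
    first one, given that it issues `issue_gap` cycles later."""
    return issue_gap + second_latency < first_latency


def stall_cycles_to_avoid_waw(first_latency, second_latency, issue_gap=1):
    """Extra cycles to hold the second instruction so that it writes
    strictly after the first one (simplified model)."""
    return max(0, first_latency - (issue_gap + second_latency) + 1)


# DIVF takes 24 cycles, SUBF takes 3 cycles and issues one cycle later.
print(completes_out_of_order(24, 3))
print(stall_cycles_to_avoid_waw(24, 3))
```

The alternative on the slide, disabling the DIVF write to F0, avoids these stall cycles because the later SUBF result is the one that must remain in the register anyway.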
Summary
Pipelining:
– Speeds up throughput, not latency
– Speedup ≤ #stages
– Hazards are fundamental limits:
• Structural: need more HW
• Data (RAW,WAR,WAW): need forwarding and compiler scheduling
• Control: delayed branch, branch prediction
Complications:
– Precise exceptions: maintain execution order
– ISA must be designed to match pipelining requirements
– Multi-cycle operations may result in out-of-order completion
– Out-of-order completion introduces WAW hazards and problems with precise interrupts