Appendix A

Pipelining: Basic and Intermediate Concepts

CSCE 614 Fall 2009 2

Pipelining

• An implementation technique whereby multiple instructions are overlapped in execution.

• Each step in the pipeline (called a pipe stage) completes a part of an instruction.

• Because all stages proceed at the same time, the length of a processor (clock) cycle is determined by the time required for the slowest pipe stage.


Pipelining

• Designer’s goal: Balancing the length of each pipeline stage.

• If the stages are perfectly balanced, the time per instruction on the pipelined processor is

$$\text{Time per instruction}_{\text{pipelined}} = \frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$

• Speedup from pipelining = number of pipe stages


RISC Instruction Set (MIPS64)

• 64-bit version of the MIPS instruction set.

• 32 registers

• 3 classes of instructions

– ALU instructions: DADD, DSUB, …

– Load and store instructions: LD, SD, …

– Branches and jumps


Implementation of a RISC (Unpipelined, Multicycle)

• Implementation of an integer subset of a RISC architecture that takes at most 5 clock cycles.

– Instruction Fetch (IF)

– Instruction Decode/Register Fetch (ID)

– Execution/Effective Address Calculation (EX)

– Memory Access (MEM)

– Write-Back (WB)


Instruction Format (32-bit Version)

• All MIPS instructions are 32 bits long.

R-format (add, sub, …): OP | rs | rt | rd | sa | funct

I-format (lw, sw, …): OP | rs | rt | immediate

J-format (j): OP | jump target
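The three formats can be illustrated by pulling the fields out of a 32-bit instruction word. The following is a minimal sketch, not taken from the slides: the function name `decode_fields` is hypothetical, and the field widths (6-bit OP, 5-bit register and sa fields, 6-bit funct, 16-bit immediate, 26-bit jump target) are the standard 32-bit MIPS widths, which the slide itself does not list.

```python
def decode_fields(word):
    """Split a 32-bit MIPS instruction word into the fields of all three formats.

    Which fields are meaningful depends on the format selected by the opcode.
    """
    return {
        "op": (word >> 26) & 0x3F,       # all formats
        "rs": (word >> 21) & 0x1F,       # R- and I-format
        "rt": (word >> 16) & 0x1F,       # R- and I-format
        "rd": (word >> 11) & 0x1F,       # R-format
        "sa": (word >> 6) & 0x1F,        # R-format shift amount
        "funct": word & 0x3F,            # R-format
        "immediate": word & 0xFFFF,      # I-format
        "target": word & 0x3FFFFFF,      # J-format
    }


# Encoding of "add $t0, $s1, $s2" (R-format): op=0, rs=17, rt=18, rd=8, funct=0x20
r_fields = decode_fields(0x02324020)
# Encoding of "lw $t0, 32($s3)" (I-format): op=0x23, rs=19, rt=8, immediate=32
i_fields = decode_fields(0x8E680020)
print(r_fields["rs"], r_fields["rt"], r_fields["rd"], hex(r_fields["funct"]))
print(hex(i_fields["op"]), i_fields["rs"], i_fields["rt"], i_fields["immediate"])
```

Because every format keeps OP, rs, and rt in the same bit positions, the register fields can be read before the instruction is fully decoded.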


Instruction Fetch Cycle (IF)

• Send the program counter (PC) to memory.

• Fetch the current instruction from memory.

• Update the PC to the next sequential PC by adding 4 to the PC.


Instruction Decode/Register Fetch Cycle (ID)

• Decode the instruction and read the registers from the register file.

• Do the equality test on the registers for a possible branch.

• Sign-extend the offset field of the instruction in case it is needed.

• Compute the possible branch target address by adding the sign-extended offset to the incremented PC.
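The ID-stage branch work listed above can be sketched in a few lines. This is an illustrative model only: the function names are hypothetical, and the left shift by 2 (treating the offset as a count of 4-byte instructions) is the usual MIPS convention rather than something stated on the slide.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def id_stage_branch(incremented_pc, offset_field, reg_a, reg_b):
    """Return (branch_target, taken) as computed during ID.

    incremented_pc is PC + 4 from the IF cycle; reg_a and reg_b are the two
    register values read from the register file.
    """
    offset = sign_extend16(offset_field)
    # Assumption: the offset counts instructions, so it is scaled by 4.
    target = incremented_pc + (offset << 2)
    taken = reg_a == reg_b  # equality test for a beq-style branch
    return target, taken


target, taken = id_stage_branch(0x1004, 10, 7, 7)
print(hex(target), taken)
```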


Execution/Effective Address Calculation (EX)

• The ALU operates on the operands prepared in the prior cycle.

– Memory reference instructions: The ALU adds the base register and the offset to form the effective address.

– Register-Register: The ALU performs the operation specified by the ALU opcode on the values from the register file.

– Register-Immediate: The ALU performs the operation specified by the opcode on the first value from the register file and the sign-extended immediate.

Memory Access (MEM)

• If the instruction is a load, memory does a read using the effective address computed in the previous cycle.

• If it is a store, then the memory writes the data from the second register read from the register file using the effective address.

Write-Back Cycle (WB)

• Register-Register ALU instruction or Load instruction: Write the result into the register file.

• In this implementation, branch instructions require 2 cycles, store instructions require 4 cycles, and all other instructions require 5 cycles.

• Assuming a branch frequency of 12% and a store frequency of 10%, what is the overall CPI?
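One way to work the question is as a weighted average of the cycle counts given above. The sketch below assumes that every instruction that is neither a branch nor a store falls into the 5-cycle class; the variable names are illustrative.

```python
# Instruction mix from the slide; the remainder takes 5 cycles.
branch_freq = 0.12
store_freq = 0.10
other_freq = 1.0 - branch_freq - store_freq

# Cycle counts for this multicycle implementation.
BRANCH_CYCLES, STORE_CYCLES, OTHER_CYCLES = 2, 4, 5

cpi = (branch_freq * BRANCH_CYCLES
       + store_freq * STORE_CYCLES
       + other_freq * OTHER_CYCLES)
print(f"Overall CPI = {cpi:.2f}")  # 0.24 + 0.40 + 3.90 = 4.54
```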

Classic 5 Stage Pipeline for a RISC Processor

Performance Issues in Pipelining

• Pipelining increases the CPU instruction throughput.

• Throughput: the number of instructions completed per unit of time.

• Pipelining does not decrease the execution time of an individual instruction.

– It increases the execution time due to overhead (clock skew and pipeline register delay) in the control of the pipeline.

Example (p. A-10)

• Consider the unpipelined processor. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
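The example can be worked by comparing average time per instruction on the two machines. This sketch assumes the pipelined machine completes one instruction per clock (ideal CPI of 1) and that its clock is the original 1 ns cycle plus the 0.2 ns overhead; the variable names are illustrative.

```python
clock_ns = 1.0
overhead_ns = 0.2

# (relative frequency, cycles on the unpipelined machine)
mix = {
    "alu": (0.40, 4),
    "branch": (0.20, 4),
    "memory": (0.40, 5),
}

# Unpipelined: average instruction time = clock cycle x average CPI.
avg_unpipelined_ns = clock_ns * sum(freq * cycles for freq, cycles in mix.values())

# Pipelined: one instruction per clock, and the clock is stretched by the overhead.
avg_pipelined_ns = clock_ns + overhead_ns

speedup = avg_unpipelined_ns / avg_pipelined_ns
print(f"Unpipelined: {avg_unpipelined_ns:.2f} ns per instruction")  # 4.40
print(f"Pipelined:   {avg_pipelined_ns:.2f} ns per instruction")    # 1.20
print(f"Speedup:     {speedup:.2f}")                                # 3.67
```

The overhead keeps the speedup below the 4 to 5 times that the cycle counts alone might suggest.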

Classic 5 Stage Pipeline for a RISC Processor

Classic 5-Stage Pipeline

• What happens in the pipeline?

– One resource cannot be used for two different operations on the same clock cycle.

=> Separate instruction and data memories.

– The register file is used in two stages: ID (two reads) and WB (one write).

=> Register write in the first half of the clock cycle and register read in the second half.

Pipeline Hazards

• Situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle.

• Hazards reduce the performance from the ideal speedup gained by pipelining.

– Structural Hazards

– Data Hazards

– Control Hazards

• Hazards can make it necessary to stall the pipeline.

Pipeline Hazards

• When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled.

• No new instructions are fetched during the stall.

Structural Hazards

• Hardware cannot support the combination of instructions that we want to execute in the same clock cycle.

– Suppose we have a single memory instead of two memories. An instruction fetch and a data access would then compete for it in the same clock cycle.

Control Hazards

• This arises from the need to make a decision based on the results of one instruction while others are executing.

– Branch instruction

– Pipeline stall (or bubble)

• How can we overcome this problem?

Branch Hazards

• To minimize the branch penalty, put in enough hardware so that we can test registers, calculate the branch target address, and update the PC during the second stage.

Example

• Estimate the impact on the CPI of stalling on branches. Assume all other instructions have a CPI of 1.
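The estimate follows from adding the stall cycles caused by branches to the base CPI of 1. The sketch below is hedged in two ways: the slide gives no branch frequency for this example, so the 12% used here is only borrowed from the earlier CPI exercise as an illustration, and the one-cycle stall assumes the branch is resolved in the second stage as described under Branch Hazards. The function name is hypothetical.

```python
def cpi_with_branch_stalls(branch_freq, stall_cycles=1, base_cpi=1.0):
    """CPI when every branch stalls the pipeline for stall_cycles cycles."""
    return base_cpi + branch_freq * stall_cycles


# Illustrative value only: 12% branches, one stall cycle per branch.
example_cpi = cpi_with_branch_stalls(0.12)
print(f"CPI with branch stalls = {example_cpi:.2f}")  # 1.12
```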

[Figure: pipeline timing diagram. Program execution order (in instructions) versus time (axis marked 2 to 16 ns) for beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg; intervals of 2 ns and 4 ns are marked between instruction starts.]

Branch Prediction

• Computers do indeed use prediction to handle branches.

– Simplest: Always predict that branches will fail (are not taken).

– If you’re right, the pipeline proceeds at full speed.

– Dynamic hardware predictors make their guesses depending on the behavior of each branch.

– Popular: Keeping a history for each branch as taken or untaken, and then using the past to predict the future. => about 90% accuracy
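The history-based scheme in the last bullet can be sketched as a table that remembers each branch's most recent outcome. This is a deliberately simple one-entry-per-branch model, not the design of any particular processor; the class name, the default prediction of untaken, and the sample trace are all assumptions made for illustration.

```python
class LastOutcomePredictor:
    """Predict that each branch repeats whatever it did the last time."""

    def __init__(self):
        self.history = {}  # branch address -> last outcome (True = taken)

    def predict(self, pc):
        # With no history yet, fall back to the simplest policy: untaken.
        return self.history.get(pc, False)

    def update(self, pc, taken):
        self.history[pc] = taken


def accuracy(predictor, trace):
    """Fraction of correct predictions over a list of (pc, taken) pairs."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# Hypothetical trace: a loop-closing branch taken 9 times, then falling through.
loop_trace = [(0x40, True)] * 9 + [(0x40, False)]
print(accuracy(LastOutcomePredictor(), loop_trace))  # 0.8
```

On this short trace the predictor misses only the first and last iterations; longer-running loops push the accuracy higher.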

Branch Prediction

[Figure: two pipeline timing diagrams, each with a time axis marked 2 to 14 ns and stages Instruction fetch, Reg, ALU, Data access, Reg. First diagram: beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0), with 2 ns between instruction starts. Second diagram: beq $1, $2, 40; add $4, $5, $6; or $7, $8, $9, with 2 ns and 4 ns intervals and a row of bubbles.]

When the guess is wrong, the pipeline must make sure that the instructions following the wrongly guessed branch have no effect and must restart the pipeline from the proper branch address.

Delayed Branch

• Delayed decision

• Used in MIPS

• The delayed branch always executes the next sequential instruction, with the branch taking place after that one instruction delay.

[Figure: pipeline timing diagram with a time axis marked 2 to 14 ns for beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0), with 2 ns between instruction starts and one instruction labeled as the delayed branch slot.]

• MIPS software will place an instruction immediately after the delayed branch instruction that is not affected by the branch, and a taken branch changes the address of the instruction that follows this safe instruction.

• Compilers typically fill about 50% of the branch delay slots with useful instructions.

Data Hazards

• An instruction depends on the results of a previous instruction still in the pipeline.

• e.g.

add $s0, $t0, $t1
sub $t2, $s0, $t3

The add instruction doesn’t write its result until the 5th stage, so without forwarding the sub must wait for it. => 3 bubbles
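The dependence in this example can be detected mechanically by comparing the destination register of the earlier instruction with the source registers of the later one. The sketch below is a simplified illustration: the helper names are hypothetical, and the parser only handles three-register instructions written in the form used on the slide.

```python
def parse(instr):
    """Return (opcode, destination, sources) for 'op $d, $s1, $s2'."""
    opcode, operands = instr.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return opcode, regs[0], regs[1:]


def raw_hazard(producer, consumer):
    """True if consumer reads the register that producer writes."""
    _, dest, _ = parse(producer)
    _, _, sources = parse(consumer)
    return dest in sources


print(raw_hazard("add $s0, $t0, $t1", "sub $t2, $s0, $t3"))  # True
print(raw_hazard("add $s0, $t0, $t1", "sub $t2, $t4, $t3"))  # False
```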

Solution

• Forwarding (or bypassing): getting the missing item early from the internal resources.

• e.g. as soon as the ALU creates the sum for the add, we can supply it as the input for the subtract.

[Figure: pipeline diagram (IF, ID, EX, MEM, WB) with a time axis marked 2 to 10 ns for add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, illustrating the forwarding example.]

Load-Use Data Hazard

[Figure: pipeline diagram (IF, ID, EX, MEM, WB) with a time axis marked 2 to 14 ns for lw $s0, 20($t1) followed by sub $t2, $s0, $t3, with a row of bubbles between the two instructions.]

• Even with forwarding, we still have to stall one stage for a load-use data hazard.

• Delayed loads: to follow a load with an instruction independent of that load.
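The two bullets above amount to a simple rule: with forwarding, the only back-to-back case that still stalls is a load followed immediately by an instruction that uses the loaded register. A minimal sketch of that rule, assuming full forwarding and that the first register named in each instruction is its destination (so it does not cover stores or branches); the function names are hypothetical.

```python
import re


def registers(instr):
    """All register names in an instruction, in the order written."""
    return re.findall(r"\$\w+", instr)


def stalls_with_forwarding(first, second):
    """Stall cycles needed when second issues right after first."""
    dest = registers(first)[0]
    sources = registers(second)[1:]
    is_load = first.split()[0] == "lw"
    return 1 if is_load and dest in sources else 0


print(stalls_with_forwarding("lw $s0, 20($t1)", "sub $t2, $s0, $t3"))    # 1
print(stalls_with_forwarding("add $s0, $t0, $t1", "sub $t2, $s0, $t3"))  # 0
print(stalls_with_forwarding("lw $s0, 20($t1)", "sub $t2, $t4, $t3"))    # 0
```

The last call corresponds to a delayed load whose slot has been filled with an independent instruction, so no stall is needed.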

Implementation of the MIPS Datapath

Events on Every Pipe Stage of the MIPS Pipeline

• See Figure A.19 on page A-32.

Revised Datapath

Revised Pipeline Structure

• See Figure A.25 on page A-39.

Extending the MIPS to Handle Multicycle Operations

Floating-Point Operations

• The floating-point pipeline will allow for a longer latency for operations.

– The EX cycle may be repeated as many times as needed to complete the operation.

• The number of repetitions can vary for different operations.

– There may be multiple floating-point functional units.

Assumptions

• Main integer unit: handles loads and stores, integer ALU operations, and branches.

• FP and integer multiplier.

• FP adder: handles FP add, subtract, and conversion.

• FP and integer divider.

• The EX stages of these functional units are not pipelined.

MIPS with 3 FP Functional Units

• Because EX is not pipelined, no other instruction using that functional unit may issue until the previous instruction leaves EX.

– Instruction issue (p. A-33): the process of letting an instruction move from the ID stage into the EX stage of the pipeline.

• If an instruction cannot proceed to the EX stage, the entire pipeline behind that instruction will be stalled.

• Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result.

• Initiation interval: the number of cycles that must elapse between issuing two operations of a given type.

Example (Figure A.30)

Functional Unit                   Latency   Initiation Interval
Integer ALU                       0         1
Data memory (integer/FP loads)    1         1
FP add                            3         1
FP multiply (integer multiply)    6         1
FP divide (integer divide)        24        25
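The two columns of the table answer different scheduling questions, which the following sketch tries to make concrete. It applies the definitions of latency and initiation interval given earlier to the values in the table; the dictionary keys and function names are hypothetical, and "issue cycle" here simply means the cycle in which an instruction enters EX.

```python
# (latency, initiation interval) per functional unit, from the table above
UNITS = {
    "integer_alu": (0, 1),
    "data_memory": (1, 1),
    "fp_add": (3, 1),
    "fp_multiply": (6, 1),
    "fp_divide": (24, 25),
}


def earliest_dependent_issue(unit, issue_cycle):
    """Earliest cycle at which an instruction using the result can issue.

    Latency counts the intervening cycles between producer and consumer.
    """
    latency, _ = UNITS[unit]
    return issue_cycle + latency + 1


def earliest_same_unit_issue(unit, issue_cycle):
    """Earliest cycle at which another operation of the same type can issue."""
    _, interval = UNITS[unit]
    return issue_cycle + interval


print(earliest_dependent_issue("integer_alu", 1))  # 2: no intervening cycle
print(earliest_dependent_issue("fp_add", 1))       # 5: three intervening cycles
print(earliest_same_unit_issue("fp_multiply", 1))  # 2: a new multiply every cycle
print(earliest_same_unit_issue("fp_divide", 1))    # 26: next divide waits 25 cycles
```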

• Since most operations consume their operands at the beginning of the EX stage, the latency is usually the number of stages after EX in which an instruction produces a result.

– 0 for integer ALU operations.

– 1 for loads.

• Pipeline latency is essentially equal to 1 cycle less than the depth of the execution pipeline, which is the number of stages from the EX stage to the stage that produces the result.

• To achieve a higher clock rate, fewer logic levels are put in each pipe stage.

=> The number of pipe stages required for more complex operations is larger.

• The penalty for the faster clock rate is longer latency for operations.

Supporting Multiple FP Operations
