Appendix A Pipelining: Basic and Intermediate Concepts

Upload: barnard-rice

Post on 21-Dec-2015


Page 1: Appendix A Pipelining: Basic and Intermediate Concepts

Appendix A

Pipelining: Basic and Intermediate Concepts


CSCE 614 Fall 2009 2

Pipelining

• An implementation technique whereby multiple instructions are overlapped in execution.

• Each step in the pipeline (called a pipe stage) completes a part of an instruction.

• Because all stages proceed at the same time, the length of a processor (clock) cycle is determined by the time required for the slowest pipe stage.


Pipelining

• Designer’s goal: Balancing the length of each pipeline stage.

• If the stages are perfectly balanced, the time per instruction on the pipelined processor is

Time per instruction on pipelined processor = (Time per instruction on unpipelined machine) / (Number of pipe stages)

• Under these ideal conditions, speedup from pipelining = number of pipe stages.


RISC Instruction Set (MIPS64)

• 64-bit version of the MIPS instruction set.

• 32 registers

• 3 classes of instructions

– ALU instructions: DADD, DSUB, …

– Load and store instructions: LD, SD, …

– Branches and jumps


Implementation of a RISC (Unpipelined, Multicycle)

• Implementation of an integer subset of a RISC architecture that takes at most 5 clock cycles.

– Instruction Fetch (IF)

– Instruction Decode/Register Fetch (ID)

– Execution/Effective Address Calculation (EX)

– Memory Access (MEM)

– Write-Back (WB)


Instruction Format (32-bit Version)

• All MIPS instructions are 32 bits long.

R-format (add, sub, …): OP | rs | rt | rd | sa | funct

I-format (lw, sw, …): OP | rs | rt | immediate

J-format (j): OP | jump target
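The three formats can be illustrated by pulling the fields out of a 32-bit instruction word. This is a minimal sketch, not part of the slides: the function name `decode_fields` is hypothetical, and the field widths (6-bit OP, 5-bit rs/rt/rd/sa, 6-bit funct, 16-bit immediate, 26-bit jump target) are the standard MIPS32 widths, which the slide does not state.

```python
def decode_fields(word):
    """Split a 32-bit MIPS instruction word into the fields of all three formats.

    Which fields are meaningful depends on the format selected by OP;
    this helper (hypothetical, for illustration) simply extracts them all.
    """
    word &= 0xFFFFFFFF  # keep only the low 32 bits
    return {
        "op": (word >> 26) & 0x3F,       # bits 31..26, present in every format
        "rs": (word >> 21) & 0x1F,       # bits 25..21, R- and I-format
        "rt": (word >> 16) & 0x1F,       # bits 20..16, R- and I-format
        "rd": (word >> 11) & 0x1F,       # bits 15..11, R-format only
        "sa": (word >> 6) & 0x1F,        # bits 10..6, R-format only
        "funct": word & 0x3F,            # bits 5..0, R-format only
        "immediate": word & 0xFFFF,      # bits 15..0, I-format only
        "target": word & 0x3FFFFFF,      # bits 25..0, J-format only
    }


# 0x01098020 is the encoding of: add $s0, $t0, $t1
fields = decode_fields(0x01098020)
print(fields["op"], fields["rs"], fields["rt"], fields["rd"], fields["funct"])
```

For the sample word, OP is 0 (R-format), rs is register 8 ($t0), rt is register 9 ($t1), rd is register 16 ($s0), and funct is 0x20 (add).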


Instruction Fetch Cycle (IF)

• Send the program counter (PC) to memory.

• Fetch the current instruction from memory.

• Update the PC to the next sequential PC by adding 4 to the PC.


Instruction Decode/Register Fetch Cycle (ID)

• Decode the instruction and read the registers from the register file.

• Do the equality test on the registers for a possible branch.

• Sign-extend the offset field of the instruction in case it is needed.

• Compute the possible branch target address by adding the sign-extended offset to the incremented PC.
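The sign extension and branch-target computation in ID can be sketched as follows. The helper names are hypothetical. The shift of the offset by two bits reflects the MIPS convention that branch offsets count 4-byte instructions; the slide does not mention it, so treat it as an assumption of this sketch.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset_field):
    """Possible branch target: incremented PC plus the sign-extended offset.

    Assumption: the 16-bit offset counts instructions, so it is shifted
    left by 2 to convert it to a byte offset (MIPS convention).
    """
    incremented_pc = pc + 4  # the PC update done in IF
    return incremented_pc + (sign_extend16(offset_field) << 2)


print(hex(branch_target(0x1000, 0x0003)))  # forward branch
print(hex(branch_target(0x1000, 0xFFFF)))  # backward branch, offset field is -1
```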


Execution/Effective Address Calculation (EX)

• The ALU operates on the operands prepared in the prior cycle.

– Memory reference instructions: The ALU adds the base register and the offset to form the effective address.

– Register-Register: The ALU performs the operation specified by the ALU opcode on the values from the register file.

– Register-Immediate: The ALU performs the operation specified by the opcode on the first value from the register file and the sign-extended immediate.


Memory Access (MEM)

• If the instruction is a load, memory does a read using the effective address computed in the previous cycle.

• If it is a store, then the memory writes the data from the second register read from the register file using the effective address.


Write-Back cycle (WB)

• Register-Register ALU instruction or Load instruction: Write the result into the register file.


• In this implementation, branch instructions require 2 cycles, store instructions require 4 cycles, and all other instructions require 5 cycles.

• Assuming a branch frequency of 12% and a store frequency of 10%, what is the overall CPI?
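One way to work the question is to weight each instruction class by its frequency. The sketch below uses only the numbers given above (branches 12% at 2 cycles, stores 10% at 4 cycles, and the remaining 78% at 5 cycles); the function name `average_cpi` is hypothetical.

```python
def average_cpi(mix):
    """Frequency-weighted CPI.

    mix maps an instruction class to (frequency in percent, cycles).
    Percentages are kept as integers so the weighted sum is exact.
    """
    total = sum(freq for freq, _ in mix.values())
    if total != 100:
        raise ValueError("frequencies must add up to 100 percent")
    return sum(freq * cycles for freq, cycles in mix.values()) / 100


mix = {
    "branch": (12, 2),
    "store": (10, 4),
    "other": (78, 5),  # everything that is neither a branch nor a store
}
cpi = average_cpi(mix)
print(cpi)  # 4.54
```

That is, CPI = 0.12 × 2 + 0.10 × 4 + 0.78 × 5 = 4.54.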


Classic 5 Stage Pipeline for a RISC Processor


Performance Issues in Pipelining

• Pipelining increases the CPU instruction throughput.

• Throughput: the number of instructions completed per unit of time.

• Pipelining does not decrease the execution time of an individual instruction.

– It increases the execution time due to overhead (clock skew and pipeline register delay) in the control of the pipeline.


Example (p. A-10)

• Consider the unpipelined processor. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
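The example can be worked numerically as below. The sketch assumes that the pipelined processor ideally completes one instruction per clock, so its time per instruction is the 1 ns clock plus the 0.2 ns overhead. The function name `pipeline_speedup` is hypothetical.

```python
def pipeline_speedup(clock_ns, mix, overhead_ns):
    """Speedup of the pipelined over the unpipelined processor.

    mix is a list of (frequency in percent, cycles) pairs.
    Assumption: the pipeline completes one instruction per clock cycle.
    """
    average_cycles = sum(freq * cycles for freq, cycles in mix) / 100
    unpipelined_ns = clock_ns * average_cycles  # average time per instruction
    pipelined_ns = clock_ns + overhead_ns       # one stretched clock per instruction
    return unpipelined_ns / pipelined_ns


# ALU operations 40% at 4 cycles, branches 20% at 4 cycles, memory operations 40% at 5 cycles
mix = [(40, 4), (20, 4), (40, 5)]
speedup = pipeline_speedup(1.0, mix, 0.2)
print(round(speedup, 2))  # 3.67
```

The unpipelined average is 1 ns × (0.4 × 4 + 0.2 × 4 + 0.4 × 5) = 4.4 ns per instruction, so the speedup is 4.4 / 1.2, roughly 3.7 times.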


Classic 5 Stage Pipeline for a RISC Processor


Classic 5-Stage Pipeline

• What happens in the pipeline?

– One resource cannot be used for two different operations on the same clock cycle.

=> Separate instruction and data memories.

– The register file is used in two stages: ID (two reads) and WB (one write).

=> Register write in the first half of the clock cycle and register read in the second half.


Pipeline Hazards


Pipeline Hazards

• Situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle.

• Hazards reduce the performance from the ideal speedup gained by pipelining.

– Structural Hazards

– Data Hazards

– Control Hazards

• Hazards can make it necessary to stall the pipeline.


Pipeline Hazards

• When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled.

• No new instructions are fetched during the stall.


Structural Hazards

• Hardware cannot support the combination of instructions that we want to execute in the same clock cycle.

– Suppose we have a single memory instead of two memories.


Control Hazards

• This arises from the need to make a decision based on the results of one instruction while others are executing.

– Branch instruction

– Pipeline stall (or bubble)

• How can we overcome this problem?


Branch Hazards

• To minimize the branch penalty, put in enough hardware so that we can test registers, calculate the branch target address, and update the PC during the second stage.


Example

• Estimate the impact on the CPI of stalling on branches. Assume all other instructions have a CPI of 1.

[Figure: pipeline timing diagram for beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0), in program execution order. Each instruction passes through instruction fetch, Reg, ALU, data access, Reg. Time axis from 2 to 16; interval labels 2 ns and 4 ns.]
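The estimate in this example can be written as a one-line formula: every branch adds its stall cycles to a base CPI of 1. The slide gives neither the branch frequency nor the number of stall cycles, so the values below (15% branches, one stall cycle each) are placeholders chosen only for illustration, and the function name is hypothetical.

```python
def cpi_with_branch_stalls(branch_frequency, stall_cycles, base_cpi=1.0):
    """CPI when every branch stalls the pipeline for stall_cycles cycles.

    All other instructions are assumed to have the base CPI, as in the example.
    """
    return base_cpi + branch_frequency * stall_cycles


# Placeholder inputs, NOT taken from the slides: 15% branches, 1 stall cycle each.
example_cpi = cpi_with_branch_stalls(0.15, 1)
print(example_cpi)
```

With those placeholder inputs the CPI rises from 1 to about 1.15, a slowdown of about 15% relative to the ideal pipeline.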


Branch Prediction

• Computers do indeed use prediction to handle branches.

– Simplest: always predict that branches will fail (that is, will not be taken).

– If you’re right, the pipeline proceeds at full speed.

– Dynamic hardware predictors make their guesses depending on the behavior of each branch.

– Popular: keeping a history for each branch as taken or untaken, and then using the past to predict the future. => about 90% accuracy
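The history-based scheme can be sketched with the simplest possible history: one remembered outcome per branch, predicting that the branch will do what it did last time. This is an illustrative toy, not the predictor of any particular processor; the function name, the branch address 0x400, and the trace are all made up for the example.

```python
def predict_with_last_outcome(trace, initial_taken=False):
    """Fraction of correct predictions for a last-outcome (1-bit) predictor.

    trace is a list of (branch_address, taken) pairs in execution order.
    A branch seen for the first time is predicted as initial_taken.
    """
    history = {}  # branch address -> outcome of its most recent execution
    correct = 0
    for address, taken in trace:
        prediction = history.get(address, initial_taken)
        if prediction == taken:
            correct += 1
        history[address] = taken  # remember the actual outcome
    return correct / len(trace) if trace else 0.0


# Hypothetical loop-closing branch: taken 9 times, then falls through once.
loop_branch = [(0x400, True)] * 9 + [(0x400, False)]
accuracy = predict_with_last_outcome(loop_branch * 2)
print(accuracy)  # 0.8
```

On this toy trace the predictor misses twice per loop execution (on entry and on exit), giving 16 correct predictions out of 20. The figure says nothing about real programs, for which the slide quotes about 90%.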


Branch Prediction

[Figure: two pipeline timing diagrams in program execution order. Each instruction passes through instruction fetch, Reg, ALU, data access, Reg; time axis from 2 to 14. Top: beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0), with interval labels 2 ns. Bottom: beq $1, $2, 40; add $4, $5, $6; or $7, $8, $9, with a row of bubbles and interval labels 2 ns and 4 ns.]

When the guess is wrong, the pipeline must make sure that the instructions following the wrongly guessed branch have no effect, and it must restart the pipeline from the proper branch address.


Delayed Branch

• Delayed decision

• Used in MIPS

• The delayed branch always executes the next sequential instruction, with the branch taking place after that one instruction delay.


[Figure: pipeline timing diagram for beq $1, $2, 40; add $4, $5, $6; lw $3, 300($0), in program execution order, with the label "(Delayed branch slot)". Each instruction passes through instruction fetch, Reg, ALU, data access, Reg. Time axis from 2 to 14; interval labels 2 ns.]


• MIPS software will place an instruction immediately after the delayed branch instruction that is not affected by the branch, and a taken branch changes the address of the instruction that follows this safe instruction.

• Compilers typically fill about 50% of the branch delay slots with useful instructions.


Data Hazards

• An instruction depends on the results of a previous instruction still in the pipeline.

• e.g. add $s0, $t0, $t1

sub $t2, $s0, $t3

The add instruction doesn’t write the result until the 5th stage. => 3 bubbles
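The dependence in this example is a read-after-write on $s0, and detecting it amounts to comparing the destination of the earlier instruction with the sources of the later one. A minimal sketch, assuming a made-up `Instr` tuple and helper name:

```python
from collections import namedtuple

# Hypothetical instruction record: mnemonic, destination register, source registers.
Instr = namedtuple("Instr", ["op", "dest", "sources"])


def raw_hazard_registers(producer, consumer):
    """Registers written by producer that consumer reads (read-after-write)."""
    if producer.dest is None:  # e.g. a store or branch writes no register
        return set()
    return {producer.dest} & set(consumer.sources)


add = Instr("add", "$s0", ("$t0", "$t1"))
sub = Instr("sub", "$t2", ("$s0", "$t3"))
hazard = raw_hazard_registers(add, sub)
print(hazard)  # {'$s0'}
```

A non-empty result means the later instruction must either wait for the earlier one to write its result or obtain the value some other way.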


Solution

• forwarding (or bypassing): getting the missing item early from the internal resources.

• e.g. as soon as the ALU creates the sum for the add, we can supply it as the input for the subtract.


[Figure: pipeline timing diagram for add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, in program execution order. Both instructions pass through IF, ID, EX, MEM, WB. Time axis from 2 to 10.]


Load-Use Data Hazard

[Figure: pipeline timing diagram for lw $s0, 20($t1) followed by sub $t2, $s0, $t3, in program execution order. Both instructions pass through IF, ID, EX, MEM, WB, with a row of bubbles between them. Time axis from 2 to 14.]


• Even with forwarding, we still have to stall one stage for a load-use data hazard.

• Delayed loads: to follow a load with an instruction independent of that load.
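The one-cycle load-use stall, and the way an independent instruction hides it, can be sketched by counting loads whose result is read by the very next instruction. The `Instr` tuple, the helper name, and the filler instruction `or $t4, $t5, $t6` are assumptions made for illustration; forwarding is assumed for every other dependence.

```python
from collections import namedtuple

# Hypothetical instruction record: mnemonic, destination register, source registers.
Instr = namedtuple("Instr", ["op", "dest", "sources"])


def load_use_stalls(program):
    """Count one stall for each load whose result the next instruction reads.

    Assumes forwarding handles all other dependences without stalling.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        if previous.op == "lw" and previous.dest in current.sources:
            stalls += 1
    return stalls


lw = Instr("lw", "$s0", ("$t1",))
sub = Instr("sub", "$t2", ("$s0", "$t3"))
filler = Instr("or", "$t4", ("$t5", "$t6"))  # independent of the load

print(load_use_stalls([lw, sub]))          # 1
print(load_use_stalls([lw, filler, sub]))  # 0
```

Placing the independent instruction between the load and its use removes the stall, which is the idea behind delayed loads.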


Implementation of the MIPS Datapath


Events on Every Pipe Stage of the MIPS Pipeline

• See Figure A.19 on page A-32.


Revised Datapath


Revised Pipeline Structure

• See Figure A.25 on page A-39.


Extending the MIPS to Handle Multicycle Operations


Floating-Point Operations

• The floating-point pipeline will allow for a longer latency for operations.

– The EX cycle may be repeated as many times as needed to complete the operation. The number of repetitions can vary for different operations.

– There may be multiple floating-point functional units.


Assumptions

• Main integer unit: handles loads and stores, integer ALU operations, and branches.

• FP and integer multiplier.

• FP adder: handles FP add, subtract, and conversion.

• FP and integer divider.

• The EX stages of these functional units are not pipelined.


MIPS with 3 FP Functional Units


• Because EX is not pipelined, no other instruction using that functional unit may issue until the previous instruction leaves EX.

– Instruction issue (p. A-33): the process of letting an instruction move from the ID stage into the EX stage of the pipeline.

• If an instruction cannot proceed to the EX stage, the entire pipeline behind that instruction will be stalled.


• Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result.

• Initiation interval: the number of cycles that must elapse between issuing two operations of a given type.


Example (Figure A.30)

Functional unit                   Latency   Initiation interval
Integer ALU                       0         1
Data memory (integer/FP loads)    1         1
FP add                            3         1
FP multiply (integer multiply)    6         1
FP divide (integer divide)        24        25
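The two columns translate directly into stall counts. The sketch below is one reading of the definitions, not a model of the full pipeline: a dependent instruction that needs its operand at the start of EX stalls for the latency minus the number of independent instructions placed in between, and a second operation on the same unit stalls until the initiation interval has elapsed. The dictionary keys and function names are hypothetical.

```python
# (latency, initiation interval) per functional unit, from the table above.
FIGURE_A30 = {
    "integer_alu": (0, 1),
    "data_memory": (1, 1),
    "fp_add": (3, 1),
    "fp_multiply": (6, 1),
    "fp_divide": (24, 25),
}


def data_stalls(producer_unit, independent_between=0):
    """Stall cycles for a consumer that needs the result at the start of EX."""
    latency, _ = FIGURE_A30[producer_unit]
    return max(0, latency - independent_between)


def structural_stalls(unit, independent_between=0):
    """Stall cycles before a second operation can issue to the same unit.

    Without stalls the second operation would issue 1 + independent_between
    cycles after the first; it must wait until the initiation interval has elapsed.
    """
    _, interval = FIGURE_A30[unit]
    return max(0, interval - 1 - independent_between)


print(data_stalls("fp_multiply"))      # 6
print(data_stalls("fp_add", 1))        # 2
print(structural_stalls("fp_divide"))  # 24
```

Units with an initiation interval of 1 never cause a structural stall here, while back-to-back divides wait 24 cycles under this reading.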


• Since most operations consume their operands at the beginning of the EX stage, the latency is usually the number of stages after EX in which an instruction produces a result.

– 0 for integer ALU operations.

– 1 for loads.

• Pipeline latency is essentially equal to 1 cycle less than the depth of the execution pipeline, which is the number of stages from the EX stage to the stage that produces the result.


• To achieve a higher clock rate, fewer logic levels are put in each pipe stage.

=> The number of pipe stages required for more complex operations is larger.

• The penalty for the faster clock rate is longer latency for operations.


Supporting Multiple FP Operations

unpipelined