chapter 14 instruction level parallelism and superscalar processors

Chapter 14Instruction Level Parallelismand Superscalar Processors

What is Superscalar?

• Common instructions (arithmetic, load/store, conditional branch) can be executed independently.

• Equally applicable to RISC & CISC, but more straightforward in RISC machines.

• The order of execution is usually determined by the compiler.

A Superscalar machine executes multiple independent instructions in parallel.

Example Superscalar Organization

Superpipelined Machines

Superpiplined machines overlap pipe stages

- rely on stages being able to begin operations

before the last is complete.

Superscalar v Superpipeline

Limitations of Superscalar

Dependent upon:• Instruction level parallelism• Compiler based optimization• Hardware support

• Limited by—True data dependency—Procedural dependency—Resource conflicts—Output dependency—Antidependency (one instruction can overwrite

a value that an “earlier instruction” has not yet read)

True Data Dependency

ADD r1, r2 (r1 := r1+r2) MOVE r3, r1 (r3 := r1)

• Can fetch and decode second instruction in parallel with first

• Can NOT execute second instruction until first is finished

Compare with the following?

LOAD r1, X (r1 := X)

MOVE r3, r1 (r3 := r1)What additional problem do we have here?

Procedural Dependency

• Can not execute instructions after a branch in parallel with instructions before a branch, because?

• Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed

Resource Conflict

• Two or more instructions requiring access to the same resource at the same time—e.g. two arithmetic instructions

• Solution - Can possibly duplicate resources—e.g. have two arithmetic units

Antidependancy

ADD R4, R3, 1

ADD R3, R5, 1

• Cannot complete the second instruction before the first has read R3

Effect of Dependencies

Instruction-level Parallelism

• Consider:LOAD R1, R2

ADD R3, 1ADD R4, R2

These can be handles in parallel

• Consider:ADD R3, 1ADD R4, R3STO (R4), R0

These cannot

Instruction Issue Policies

• Order in which instructions are fetched

• Order in which instructions are executed

• Order in which instructions change registers and memory

In-Order Issue In-Order Completion

Issue instructions in the order they occur:

• Not very efficient

• May fetch >1 instruction

• Instructions must stall if necessary

In-Order Issue In-Order Completion (Diagram)

Assume:

• I1 requires 2 cycles to execute

• I3 & I4 conflict for the same functional unit

• I5 depends upon value produced by I4

• I5 & I6 conflict for a functional unit

In-Order Issue Out-of-Order Completion

How does this effect interrupts?

Out-of-Order IssueOut-of-Order Completion

• Decouple decode pipeline from execution pipeline

• Can continue to fetch and decode until this pipeline is full

• When a functional unit becomes available an instruction can be executed

• Since instructions have been decoded, processor can look ahead

Out-of-Order Issue Out-of-Order Completion (Diagram)

Note: I5 depends upon I4, but I6 does not

Register Renaming

Output and antidependencies occur becauseregister contents may not reflect the correctordering from the program

• Could result in a pipeline stall

• One solution: Registers allocated dynamically

Register Renaming example

• R3b:=R3a + R5a (I1)• R4b:=R3b + 1 (I2)• R3c:=R5a + 1 (I3)• R7b:=R3c + R4b (I4)• Without subscript refers to logical register

in instruction• With subscript is hardware register

allocated• Note R3a R3b R3c

Machine Parallelism Support

• Duplication of Resources

• Out of order issue

• Renaming

• Windowing

Speedups of Machine Organizations Without Procedural Dependencies

Study Conclusions

• Not worth duplication functional units without register renaming

• Need instruction window large enough (more than 8, probably not more than 32)

Branch Prediction in Superscalar Machines

• Delayed branch not used much. Why?

• Branch prediction used - Branch history may still be useful

Superscalar Execution

Committing or Retiring Instructions

• Sometimes results must be held in temporary storage until it is certain they can be placed in “permanent” storage.

This temporary storage needs to be regularly cleaned up.

Superscalar Hardware Support

• Simultaneously fetch multiple instructions• Logic to determine true dependencies

involving register values and Mechanisms to communicate these values

• Mechanisms to initiate multiple instructions in parallel

• Resources for parallel execution of multiple instructions

• Mechanisms for committing process state in correct order

chapter 14 instruction level parallelism and superscalar processors

Documents

r3 slide

instruction instructions

order issue

memory slide

necessary slide

order of execution

functional unit slide

fetched order