lecture 3: dynamic ilp

Lecture 3: Dynamic ILP

CS 505: Computer Architecture

Spring 2005

Thu D. Nguyen

Computer Science, Rutgers CS 505: Computer Structures2

Recall from Pipelining Lecture

Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls

Ideal pipeline CPI: measure of the maximum performance attainable by the implementation

Structural hazards: HW cannot support this combination of instructions

Data hazards: Instruction depends on result of prior instruction still in the pipeline

Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)


Data Dependence and Hazards

Dependences are a property of programs

Presence of dependence indicates potential for a hazard, but actual hazard and length of any stall is a property of the pipeline

Importance of the data dependencies1) indicates the possibility of a hazard

2) determines order in which results must be calculated

3) sets an upper bound on how much parallelism can possibly be exploited

Today looking at HW schemes to avoid hazard


Dealing with Hazards

In-order issue, in-order executionStall

Forwarding

Avoid WAR and WAW by only write in last stage

In-order issue, out-of-order executionWhy?

Dynamic scheduling with Scoreboarding


Ideas to Reduce Stalls

Technique Reduces Dynamic scheduling Data hazard stalls Dynamic branch prediction

Control stalls

I ssuing multiple instructions per cycle

I deal CPI

Speculation Data and control stalls Dynamic memory disambiguation

Data hazard stalls involving memory

Loop unrolling Control hazard stalls Basic compiler pipeline scheduling

Data hazard stalls

Compiler dependence analysis

I deal CPI and data hazard stalls

Sof tware pipelining and trace scheduling

I deal CPI and data hazard stalls

Compiler speculation I deal CPI , data and control stalls

Chapter 3

Chapter 4


Instruction-Level Parallelism (ILP)

Basic Block (BB) ILP is quite smallBB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit

average dynamic branch frequency 15% to 25% => 4 to 7 instructions execute between a pair of branches

Plus instructions in BB likely to depend on each other

To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks

Simplest: loop-level parallelism to exploit parallelism among iterations of a loop

Vector is one way

If not vector, then either dynamic via branch prediction or static via loop unrolling by compiler

We will cover branch prediction but not until next time


ILP and Data Hazards

Program order: order instructions would execute in if executed sequentially 1 at a time as determined by original source program

HW/SW goal: exploit parallelism while preserving appearance of program order

modify order in manner than cannot be observed by program

must not affect the outcome of the program

Ex: Instructions involved in a name dependence can execute simultaneously if name used in instructions is changed so instructions do not conflict

Register renaming resolves name dependence for regs

Either by compiler or by HW

add r1, r2, r3

sub r2, r4,r5

and r3, r2, 1


Control Dependencies

Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order

if p1 {

S1;

};

if p2 {

S2;

}

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.


Control Dependence Ignored

Control dependence need not always be preservedWilling to execute instructions that should not have been executed, thereby violating the control dependences, if can do so without affecting correctness of the program

Instead, 2 properties critical to program correctness are exception behavior and data flow


Exception Behavior

Preserving exception behavior => any changes in instruction execution order must not change how exceptions are raised in program (=> no new exceptions)

Example:DADDU R2,R3,R4BEQZ R2,L1LW R1,0(R2)

L1:

Problem with moving LW before BEQZ?


Data Flow

Data flow: actual flow of data values among instructions that produce results and those that consume them

branches make flow dynamic, determine which instruction is supplier of data

Example:

DADDUR1,R2,R3BEQZ R4,LDSUBUR1,R5,R6

L: …OR R7,R1,R8

OR depends on DADDU or DSUBU? Must preserve data flow on execution


Advantages of Dynamic Scheduling

Handles cases when dependences unknown at compile time

(e.g., because they may involve a memory reference)

It simplifies the compiler

Allows code that compiled for one pipeline to run efficiently on a different pipeline

Hardware speculation, a technique with significant performance advantages, that builds on dynamic scheduling


HW Schemes: Instruction Parallelism

Key idea: Allow instructions behind stall to proceedDIVD F0,F2,F4ADDD F10,F0,F8SUBD F12,F8,F14

Enables out-of-order execution and allows out-of-order completion

Will distinguish when an instruction begins execution and when it completes execution; between 2 times, the instruction is in execution

In a dynamically scheduled pipeline, all instructions pass through issue stage in order (in-order issue)


Dynamic Scheduling Step 1

Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue

Split the ID pipe stage of simple 5-stage pipeline into 2 stages:

Issue—Decode instructions, check for structural hazards

Read operands—Wait until no data hazards, then read operands


A Dynamic Algorithm: Tomasulo’s Algorithm

For IBM 360/91 (before caches!)

Goal: High Performance without special compilers

Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations

This led Tomasulo to try to figure out how to get more effective registers — renaming in hardware!

Why Study 1966 Computer?

The descendants of this have flourished!Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …


Tomasulo Algorithm

Control & buffers distributed with Function Units (FU)

FU buffers called “reservation stations”; have pending operands

Registers in instructions replaced by values or pointers to reservation stations (RS);

form of register renaming ;

avoids WAR, WAW hazards

More reservation stations than registers, so can do optimizations compilers can’t

Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs

Load and Stores treated as FUs with RSs as well


Tomasulo Organization

FP addersFP adders

Add1Add2Add3

FP multipliersFP multipliers

Mult1Mult2

From Mem FP Registers

Reservation Stations

Common Data Bus (CDB)

To Mem

FP OpQueue

Load Buffers

Store Buffers

Load1Load2Load3Load4Load5Load6


Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)

Vj, Vk: Value of Source operands Store buffers has V field, result to be stored

Qj, Qk: Reservation stations producing source registers (value to be written)

Note: Qj,Qk=0 => ready Store buffers only have Qi for RS producing result

Busy: Indicates reservation station or FU is busy

Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.


Three Stages of Tomasulo Algorithm

1.Issue—get instruction from FP Op Queue If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).

2.Execute—operate on operands (EX) When both operands ready then execute; if not ready, watch Common Data Bus for result

3.Write result—finish execution (WB) Write on Common Data Bus to all awaiting units; mark reservation station available

Common data bus: data + source (“come from” bus)64 bits of data + 4 bits of Functional Unit source address

Write if matches expected Functional Unit (produces result)

Does the broadcast

Example speed: 3 clocks for Fl .pt. +,-; 10 for * ; 40 clks for /


Tomasulo ExampleInstruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 Load1 NoLD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F300 FU

Clock cycle counter

FU countdown

Instruction stream

3 Load/Buffers

3 FP Adder R.S.2 FP Mult R.S.


Tomasulo Example Cycle 1

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2



Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F301 FU Load1


Tomasulo Example Cycle 2Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2



Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F302 FU Load2 Load1

Note: Can have multiple loads outstanding



Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2


Add1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F303 FU Mult1 Load2 Load1



Note: registers names are removed (“renamed”) in Reservation Stations; MULT issued

Load1 completing; what is waiting for Load1?



Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2


Add1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F304 FU Mult1 Load2 M(A1) Add1

• Load2 completing; what is waiting for Load2?



Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2


2 Add1 Yes SUBD M(A1) M(A2)Add2 NoAdd3 No

10 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F305 FU Mult1 M(A2) M(A1) Add1 Mult2

• Timer starts down for Add1, Mult1



Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6


1 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No


Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F306 FU Mult1 M(A2) Add2 Add1 Mult2

• Issue ADDD here despite name dependency on F6?



Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6


0 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No


Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F307 FU Mult1 M(A2) Add2 Add1 Mult2

• Add1 (SUBD) completing; what is waiting for it?



Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6


Add1 No2 Add2 Yes ADDD (M-M) M(A2)

Add3 No7 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F308 FU Mult1 M(A2) Add2 (M-M) Mult2



Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6








Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10






• Add2 (ADDD) completing; what is waiting for it?



Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 No


Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3011 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Write result of ADDD here?• All quick instructions complete in this cycle!



Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11





• Mult1 (MULTD) completing; what is waiting for it?



Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 NoMult1 No

40 Mult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3016 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

• Just waiting for Mult2 (DIVD) to complete


Faster than light computation(skip a couple of cycles)



Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11







Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56ADDD F6 F8 F2 6 10 11





• Mult2 (DIVD) is completing; what is waiting for it?


Tomasulo Example Cycle 57Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56 57ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3056 FU M*F4 M(A2) (M-M+M)(M-M) Result

• Once again: In-order issue, out-of-order execution and out-of-order completion.


Tomasulo Drawbacks

Complexity

Many associative stores (CDB) at high speed

Performance limited by Common Data BusEach CDB must go to multiple functional units high capacitance, high wiring density

Number of functional units that can complete per cycle limited to one!

Multiple CDBs more FU logic for parallel assoc stores

Non-precise interrupts!We will address this later


Tomasulo Loop Example

Loop: LD F0 0 R1

MULTD F4 F0 F2

SD F4 0 R1

SUBI R1 R1 #8

BNEZ R1 Loop

This time assume Multiply takes 4 clocks

Assume 1st load takes 8 clocks (L1 cache miss), 2nd load takes 1 clock (hit)

To be clear, will show clocks for SUBI, BNEZ

Show 2 iterations


Loop ExampleInstruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 Load1 No1 MULTD F4 F0 F2 Load2 No1 SD F4 0 R1 Load3 No2 LD F0 0 R1 Store1 No2 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F300 80 Fu

Added Store Buffers

Value of Register used for address, iteration control

Instruction Loop

Iter-ationCount


Loop Example Cycle 1Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 80

Load2 NoLoad3 NoStore1 NoStore2 NoStore3 No


Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop


Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F301 80 Fu Load1



ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No

Load3 NoStore1 NoStore2 NoStore3 No


Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop


Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F302 80 Fu Load1 Mult1



ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

Store1 Yes 80 Mult1Store2 NoStore3 No





Implicit renaming sets up data flow graph









Dispatching SUBI Instruction (not in FP queue)









And, BNEZ instruction (not in FP queue)



ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult1

Store2 NoStore3 No





Notice that F0 never sees Load from location 80



ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 No

Store3 No


Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop




Loop Example Cycle 7

Register file completely detached from computation

First and Second iteration completely overlapped



ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No







ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No





Load1 completing: who is waiting? Note: Dispatching SUBI



ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 10 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No


Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop



Load2 completing: who is waiting? Note: Dispatching BNEZ



ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No



3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #84 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop



Next load in sequence









Why not issue third multiply?









Why not issue third store?



ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No






Mult1 completing. Who is waiting?



ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No


Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8

0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop



Mult2 completing. Who is waiting?



ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 No



4 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop





ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1







ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1







ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 19 Store3 Yes 64 Mult1







ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 Yes 561 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 No2 SD F4 0 R1 8 19 20 Store3 Yes 64 Mult1





• Once again: In-order issue, out-of-order execution and out-of-order completion.


Why can Tomasulo overlap iterations of loops?

Register renamingMultiple iterations use different physical destinations for registers (dynamic loop unrolling).

Reservation stations Permit instruction issue to advance past integer control flow operations

Also buffer old values of registers - totally avoiding the WAR stall that we saw in the scoreboard.

Other perspective: Tomasulo building data flow dependency graph on the fly.


Tomasulo’s scheme offers 2 major advantages

(1) the distribution of the hazard detection logicdistributed reservation stations and the CDB

If multiple instructions waiting on single result, & each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB

If a centralized register file were used, the units would have to read their results from the registers when register buses are available.

(2) the elimination of stalls for WAW and WAR hazards


What about Precise Exception?

Coming your way soon …


Relationship Between Precise Exceptions and Specultation:

Speculation: guess and check

Important for branch prediction:Need to “take our best shot” at predicting branch direction.

If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:

This is exactly same as precise exceptions!

Technique for both precise interrupts/exceptions and speculation: in-order completion or commit


HW support for precise interruptsNeed HW buffer for results of uncommitted instructions: reorder buffer

3 fields: instr, destination, value

Use reorder buffer number instead of reservation station when execution completes

Supplies operands between execution complete & commit

(Reorder buffer can be operand source => more registers like RS)

Instructions commit

Once instruction commits, result is put into register

As a result, easy to undo speculated instructions on mispredicted branches or exceptions

ReorderBuffer

FPOp

Queue

FP Adder FP Adder

Res Stations Res Stations

FP Regs


Four Steps of Speculative Tomasulo Algorithm

1. Issue—get instruction from FP Op Queue If reservation station and reorder buffer slot free, issue instr & send

operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)

2. Execution—operate on operands (EX) When both operands ready then execute; if not ready, watch CDB

for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)

3. Write result—finish execution (WB) Write on Common Data Bus to all awaiting FUs

& reorder buffer; mark reservation station available.

4. Commit—update register with reorder result When instr. at head of reorder buffer & result present, update

register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)


Tomasulo With Reorder buffer:

ToMemory

FP addersFP adders FP multipliersFP multipliers


FP OpQueue

ROB7

ROB6

ROB5

ROB4

ROB3

ROB2

ROB1F0F0 LD F0,10(R2)LD F0,10(R2) NN

Done?

DestDest

Oldest

Newest

from Memory

1 10+R21 10+R2Dest

Reorder Buffer

Registers


2 ADDD R(F4),ROB12 ADDD R(F4),ROB1


ToMemory



FP OpQueue

ROB7

ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F10F10

F0F0ADDD F10,F4,F0ADDD F10,F4,F0

LD F0,10(R2)LD F0,10(R2)NN

NN

Done?

DestDest

Oldest

Newest

from Memory

1 10+R21 10+R2Dest

Reorder Buffer

Registers


3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB1


ToMemory



FP OpQueue

ROB7

ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F2F2

F10F10

F0F0

DIVD F2,F10,F6DIVD F2,F10,F6

ADDD F10,F4,F0ADDD F10,F4,F0

LD F0,10(R2)LD F0,10(R2)

NN

NN

NN

Done?

DestDest

Oldest

Newest

from Memory

1 10+R21 10+R2Dest

Reorder Buffer

Registers


3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD ROB5, R(F6)6 ADDD ROB5, R(F6)


ToMemory



FP OpQueue

ROB7

ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F0F0 ADDD F0,F4,F6ADDD F0,F4,F6 NN

F4F4 LD F4,0(R3)LD F4,0(R3) NN

---- BNE F2,<…>BNE F2,<…> NN

F2F2

F10F10

F0F0



LD F0,10(R2)LD F0,10(R2)

NN

NN

NN

Done?

DestDest

Oldest

Newest

from Memory

1 10+R21 10+R2Dest

Reorder Buffer

Registers

5 0+R35 0+R3


3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD ROB5, R(F6)6 ADDD ROB5, R(F6)


ToMemory



FP OpQueue

ROB7

ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

----

F0F0ROB5ROB5

ST 0(R3),F4ST 0(R3),F4

ADDD F0,F4,F6ADDD F0,F4,F6NN

NN

F4F4 LD F4,0(R3)LD F4,0(R3) NN

---- BNE F2,<…>BNE F2,<…> NN

F2F2

F10F10

F0F0



LD F0,10(R2)LD F0,10(R2)

NN

NN

NN

Done?

DestDest

Oldest

Newest

from Memory

Dest

Reorder Buffer

Registers

1 10+R21 10+R26 0+R36 0+R3


3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)


ToMemory



FP OpQueue

ROB7

ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

----

F0F0M[10]M[10]

ST 0(R3),F4ST 0(R3),F4

ADDD F0,F4,F6ADDD F0,F4,F6YY

NN

F4F4 M[10]M[10] LD F4,0(R3)LD F4,0(R3) YY

---- BNE F2,<…>BNE F2,<…> NN

F2F2

F10F10

F0F0



LD F0,10(R2)LD F0,10(R2)

NN

NN

NN

Done?

DestDest

Oldest

Newest

from Memory

1 10+R21 10+R2Dest

Reorder Buffer

Registers

2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD M[10],R(F6)6 ADDD M[10],R(F6)




ToMemory



FP OpQueue

ROB7

ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

----

F0F0M[10]M[10]

<val2><val2>ST 0(R3),F4ST 0(R3),F4


ExEx

F4F4 M[10]M[10] LD F4,0(R3)LD F4,0(R3) YY

---- BNE F2,<…>BNE F2,<…> NN

F2F2

F10F10

F0F0



LD F0,10(R2)LD F0,10(R2)

NN

NN

NN

Done?

DestDest

Oldest

Newest

from Memory

1 10+R21 10+R2Dest

Reorder Buffer

Registers


----

F0F0M[10]M[10]

<val2><val2>ST 0(R3),F4ST 0(R3),F4


ExEx

F4F4 M[10]M[10] LD F4,0(R3)LD F4,0(R3) YY

---- BNE F2,<…>BNE F2,<…> NN



ToMemory



FP OpQueue

ROB7

ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F2F2

F10F10

F0F0



LD F0,10(R2)LD F0,10(R2)

NN

NN

NN

Done?

DestDest

Oldest

Newest

from Memory

1 10+R21 10+R2Dest

Reorder Buffer

Registers

What about memoryhazards???


Memory Disambiguation:Sorting out RAW Hazards in memoryQuestion: Given a load that follows a store in program order, are the two related?

Eg: st 0(R2),R5 ld R6,0(R3)

Can we go ahead and start the load early? Store address could be delayed for a long time by some calculation that leads to R2 (divide?).

We might want to issue/begin execution of both operations in same cycle.

Today: Answer is that we are not allowed to start load until we know that address 0(R2) 0(R3)

Next Week: We might guess at whether or not they are dependent (called “dependence speculation”) and use reorder buffer to fixup if we are wrong.


Hardware Support for Memory Disambiguation

Need buffer to keep track of all outstanding stores to memory, in program order.

Keep track of address (when becomes available) and value (when becomes available)

FIFO ordering: will retire stores from this buffer in program order

When issuing a load, record current head of store queue (know which stores are ahead of you).

When have address for load, check store queue:If any store prior to load is waiting for its address, stall load.

If load address matches earlier store address (associative lookup), then we have a memory-induced RAW hazard:

store value available return value

store value not available return ROB number of source

Otherwise, send out request to memory

Actual stores commit in order, so no worry about WAR/WAW hazards through memory.


F4F4 LD F4, 10(R3)LD F4, 10(R3) NN

Memory Disambiguation:

ToMemory



FP OpQueue

ROB7

ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

----

F0F0

----

<val 1><val 1>

ST 10(R3), F5 ST 10(R3), F5

LD F0,32(R2)LD F0,32(R2)

ST 0(R3), F4ST 0(R3), F4

NN

NN

YY

Done?

DestDest

Oldest

Newest

from Memory

2 32+R22 32+R2

4 ROB34 ROB3

Dest

Reorder Buffer

Registers


Relationship between precise interrupts and speculation:

Speculation is a form of guessingBranch prediction, data prediction

If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly

This is exactly same as precise exceptions!

Branch prediction is a very importantNeed to “take our best shot” at predicting branch direction.

If we issue multiple instructions per cycle, lose lots of potential instructions otherwise:

Consider 4 instructions per cycle

If take single cycle to decide on branch, waste from 4 - 7 instruction slots!

Technique for both precise interrupts/exceptions and speculation: in-order completion or commit

This is why reorder buffers in all new processors


Branch Prediction

Next time …


Multiple Issue Superscalar

Until now, trying to reduce CPI to 1 (or ideal CPI)

Can we make CPI < 1?

Issue multiple instructions every clock cycleStatic scheduling vs. dynamic scheduling

VLIW will be covered at end of term if have time

Dynamic schedulingCheck for dependences at issue time

Issue logic can become bottleneckRestrict combinations of instruction types that can be

issued together

Pipeline issue stage further (2 steps of ½ cycle each)

Widen issue logic


Summary

Reservations stations: implicit register renaming to larger set of registers + buffering source operands

Prevents registers as bottleneck

Avoids WAR, WAW hazards of Scoreboard

Allows loop unrolling in HW

Not limited to basic blocks (integer units gets ahead, beyond branches)

Today, helps cache misses as wellDon’t stall for L1 Data cache miss (insufficient ILP for L2 miss?)

Lasting ContributionsDynamic scheduling

Register renaming

Load/store disambiguation

360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

lecture 3: dynamic ilp

Documents

order instructions

data dependence

data hazardsprogram

actual hazard

data dependenciesindicates

hazarddetermines order

order executionwhy

fetching of instructions