introduction - cse.msu.edu


Page 1: Introduction - cse.msu.edu

The University of Adelaide, School of Computer Science 8 October 2012

Chapter 2 — Instructions: Language of the Computer 1

1 Copyright © 2012, Elsevier Inc. All rights reserved.

Chapter 3

Instruction-Level Parallelism and Its Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition

Introduction

- Pipelining became a universal technique in 1985
  - Overlaps the execution of instructions
  - Exploits "instruction-level parallelism" (ILP)
- Beyond this, there are two main approaches:
  - Hardware-based dynamic approaches
    - Used in server and desktop processors
    - Not used as extensively in PMP processors
  - Compiler-based static approaches
    - Not as successful outside of scientific applications

Page 2: Introduction - cse.msu.edu

Instruction-Level Parallelism

- When exploiting instruction-level parallelism, the goal is to minimize CPI
- Pipeline CPI = ideal pipeline CPI + structural stalls + data hazard stalls + control stalls
- Parallelism within a basic block is limited
  - Typical basic block size = 3-6 instructions
  - Must optimize across branches
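The CPI decomposition above is purely additive, so it can be computed directly. A minimal sketch in Python; the stall rates below are illustrative numbers, not measurements from the slides:

```python
# Pipeline CPI = ideal pipeline CPI + average stall cycles per instruction
# contributed by structural, data-hazard, and control stalls.
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Each argument after ideal_cpi is average stall cycles per instruction."""
    return ideal_cpi + structural + data_hazard + control

# Illustrative: an ideal CPI of 1.0 plus modest stall contributions.
print(pipeline_cpi(1.0, 0.05, 0.30, 0.20))
```

Reducing any of the three stall terms (the subject of the techniques that follow) moves the pipeline back toward its ideal CPI.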

Data Dependence

- Loop-level parallelism
  - Unroll the loop statically or dynamically
  - Use SIMD (vector processors and GPUs)
- Challenge: data dependence
  - Instruction j is data dependent on instruction i if:
    - Instruction i produces a result that may be used by instruction j, or
    - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
  - Dependent instructions cannot be executed simultaneously

Page 3: Introduction - cse.msu.edu

Data Dependence

- Dependences are a property of programs
- The pipeline organization determines whether a dependence is detected and whether it causes a stall
- A data dependence conveys:
  - The possibility of a hazard
  - The order in which results must be calculated
  - An upper bound on exploitable instruction-level parallelism
- Dependences that flow through memory locations are difficult to detect

Name Dependence

- Two instructions use the same name, but there is no flow of information
  - Not a true data dependence, but a problem when reordering instructions
- Antidependence: instruction j writes a register or memory location that instruction i reads
  - The initial ordering (i before j) must be preserved
- Output dependence: instruction i and instruction j write the same register or memory location
  - Ordering must be preserved
- To resolve, use renaming techniques

Page 4: Introduction - cse.msu.edu

Other Factors

- Data hazards
  - Read after write (RAW)
  - Write after write (WAW)
  - Write after read (WAR)
- Control dependence
  - Ordering of instruction i with respect to a branch instruction
  - An instruction control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
  - An instruction not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch

Examples

- Example 1: the OR instruction is dependent on DADDU and DSUBU

        DADDU R1,R2,R3
        BEQZ  R4,L
        DSUBU R1,R1,R6
  L:    ...
        OR    R7,R1,R8

- Example 2: assume R4 isn't used after skip; it is then possible to move DSUBU before the branch

        DADDU R1,R2,R3
        BEQZ  R12,skip
        DSUBU R4,R5,R6
        DADDU R5,R4,R9
  skip: OR    R7,R8,R9

Page 5: Introduction - cse.msu.edu

Compiler Techniques for Exposing ILP

- Pipeline scheduling
  - Separate a dependent instruction from the source instruction by the pipeline latency of the source instruction
- Example:

      for (i=999; i>=0; i=i-1)
          x[i] = x[i] + s;

Pipeline Stalls

Loop: L.D     F0,0(R1)
      stall
      ADD.D   F4,F0,F2
      stall
      stall
      S.D     F4,0(R1)
      DADDUI  R1,R1,#-8
      stall               ;(assume integer load latency is 1)
      BNE     R1,R2,Loop

Page 6: Introduction - cse.msu.edu

Pipeline Scheduling

Scheduled code:

Loop: L.D     F0,0(R1)
      DADDUI  R1,R1,#-8
      ADD.D   F4,F0,F2
      stall
      stall
      S.D     F4,8(R1)
      BNE     R1,R2,Loop

Loop Unrolling

- Unroll by a factor of 4 (assume the number of elements is divisible by 4)
- Eliminate unnecessary instructions

Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)    ;drop DADDUI & BNE
      L.D     F6,-8(R1)
      ADD.D   F8,F6,F2
      S.D     F8,-8(R1)   ;drop DADDUI & BNE
      L.D     F10,-16(R1)
      ADD.D   F12,F10,F2
      S.D     F12,-16(R1) ;drop DADDUI & BNE
      L.D     F14,-24(R1)
      ADD.D   F16,F14,F2
      S.D     F16,-24(R1)
      DADDUI  R1,R1,#-32
      BNE     R1,R2,Loop

- Note: the number of live registers grows versus the original loop
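The structure of the unrolled loop can be mirrored in a minimal Python sketch; the function name is illustrative, and like the slide it assumes the element count is divisible by 4:

```python
# Unroll x[i] = x[i] + s by 4: one loop-overhead update per four elements,
# mirroring the single DADDUI/BNE pair per four L.D/ADD.D/S.D groups above.
def add_scalar_unrolled(x, s):
    assert len(x) % 4 == 0      # same assumption as the slide
    for i in range(0, len(x), 4):
        x[i]     += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
    return x

print(add_scalar_unrolled([1, 2, 3, 4], 10))  # [11, 12, 13, 14]
```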

Page 7: Introduction - cse.msu.edu

Loop Unrolling/Pipeline Scheduling

- Pipeline-schedule the unrolled loop:

Loop: L.D     F0,0(R1)
      L.D     F6,-8(R1)
      L.D     F10,-16(R1)
      L.D     F14,-24(R1)
      ADD.D   F4,F0,F2
      ADD.D   F8,F6,F2
      ADD.D   F12,F10,F2
      ADD.D   F16,F14,F2
      S.D     F4,0(R1)
      S.D     F8,-8(R1)
      DADDUI  R1,R1,#-32
      S.D     F12,16(R1)
      S.D     F16,8(R1)
      BNE     R1,R2,Loop

Strip Mining

- Unknown number of loop iterations?
  - Number of iterations = n
  - Goal: make k copies of the loop body
  - Generate a pair of loops:
    - The first executes n mod k times
    - The second executes n / k times
  - "Strip mining"
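The two-loop structure of strip mining can be sketched as follows, reusing the x[i] += s body from the earlier example (names are illustrative):

```python
# Strip mining for an unknown trip count n: a cleanup loop runs n mod k
# times, then the unrolled loop runs n // k times over k-element strips.
def add_scalar_strip_mined(x, s, k=4):
    n = len(x)
    for i in range(n % k):            # first loop: n mod k iterations
        x[i] += s
    for i in range(n % k, n, k):      # second loop: n // k unrolled iterations
        for j in range(k):
            x[i + j] += s
    return x

print(add_scalar_strip_mined([1] * 6, 1))  # [2, 2, 2, 2, 2, 2]
```

With n = 6 and k = 4, the cleanup loop handles 2 elements and the unrolled loop handles one 4-element strip.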

Page 8: Introduction - cse.msu.edu

Three Limits to Loop Unrolling

1. Decrease in the amount of overhead amortized with each extra unrolling
   - Amdahl's Law
2. Growth in code size
   - For larger loops, the concern is that it increases the instruction cache miss rate
3. Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling
   - If it is not possible to allocate all live values to registers, we may lose some or all of the advantage

- Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction
- Conclusion: a little loop unrolling is beneficial; trying to do a lot is hard.

Static Branch Prediction

- To reorder code around branches, the branch must be predicted statically, at compile time
- The simplest scheme is to predict every branch as taken
  - Average misprediction rate = the untaken branch frequency = 34% for SPEC
- A more accurate scheme predicts branches using profile information collected from earlier runs, modifying the prediction based on the last run:

[Figure: misprediction rate of profile-based static prediction on SPEC. Integer benchmarks: compress 12%, eqntott 22%, espresso 18%, gcc 11%, li 12%. Floating-point benchmarks: doduc 4%, ear 6%, hydro2d 9%, mdljdp 10%, su2cor 15%]

Page 9: Introduction - cse.msu.edu

Dynamic Branch Prediction

- Why does prediction work?
  - The underlying algorithm has regularities
  - The data being operated on has regularities
  - The instruction sequence has redundancies that are artifacts of the way humans and compilers think about problems
- Is dynamic branch prediction better than static branch prediction?
  - It seems to be
  - There are a small number of important branches in programs that have dynamic behavior

Dynamic Branch Prediction

- Performance = f(accuracy, cost of misprediction)
- Branch History Table (BHT): the lower bits of the PC address index a table of 1-bit values
  - Each bit says whether or not the branch was taken last time
  - No address check
- Problem: in a loop, a 1-bit BHT causes two mispredictions per loop execution (the average loop runs 9 iterations before exit):
  - At the end of the loop, when it exits instead of looping as before
  - On the first iteration the next time through the code, when it predicts exit instead of looping
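The two-mispredictions-per-loop behavior is easy to check with a toy 1-bit predictor for a single branch (an illustrative model, not any real BHT layout):

```python
# Toy 1-bit predictor for one branch: remember only the last outcome.
def one_bit_mispredictions(outcomes, last_taken=True):
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken       # 1 bit: always retrain to the last outcome
    return misses

loop = [True] * 9 + [False]      # a loop branch: taken 9 times, then exit
# 3 total: exit miss on pass 1, then exit + re-entry misses on pass 2.
print(one_bit_mispredictions(loop * 2))  # 3
```

In steady state each pass through the loop costs two mispredictions, exactly as the slide describes.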

Page 10: Introduction - cse.msu.edu

Dynamic Branch Prediction

- Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions
  - Red: stop, not taken
  - Green: go, taken
  - Adds hysteresis to the decision-making process

[Figure: 2-bit predictor state machine with two "Predict Taken" states and two "Predict Not Taken" states; a taken (T) outcome moves toward "Predict Taken", a not-taken (NT) outcome moves toward "Predict Not Taken", so the prediction flips only after two wrong outcomes in a row]
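Repeating the loop experiment with a 2-bit saturating counter shows the hysteresis: only the loop exit is mispredicted on each pass (again a toy model):

```python
# 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken;
# the prediction flips only after two consecutive mispredictions.
def two_bit_mispredictions(outcomes, state=3):   # start in "strongly taken"
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

loop = [True] * 9 + [False]
print(two_bit_mispredictions(loop * 2))  # 2: one miss per pass, at the exit only
```

Compared with the 1-bit predictor's two misses per pass, the 2-bit counter halves the loop-branch misprediction rate.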

BHT Accuracy

- A misprediction occurs because either:
  - The guess for that branch was wrong, or
  - The branch history of a different branch was read when indexing the table
- 4096-entry table:

[Figure: misprediction rates of a 4096-entry 2-bit BHT on SPEC89. Integer benchmarks: eqntott 18%, espresso 5%, gcc 12%, li 10%, spice 9%. Floating-point benchmarks: doduc 5%, spice 9%, fpppp 9%, matrix300 0%, nasa7 1%]

Page 11: Introduction - cse.msu.edu

Correlated Branch Prediction

- Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
- In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters
  - Thus, the old 2-bit BHT is a (0,2) predictor
- Global branch history: an m-bit shift register keeping the taken/not-taken status of the last m branches
- Each table entry has 2^m n-bit predictors
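A sketch of an (m,n) = (2,2) predictor as described above; the table size, the PC-modulo indexing, and the class name are illustrative simplifications, not a real hardware layout:

```python
# (2,2) predictor: a 2-bit global history selects one of four 2-bit
# saturating counters per branch entry; only the selected counter updates.
class CorrelatingPredictor:
    def __init__(self, entries=1024, m=2):
        self.history = 0                     # m-bit global shift register
        self.mask = (1 << m) - 1
        self.table = [[2] * (1 << m) for _ in range(entries)]  # weakly taken

    def predict(self, pc):
        return self.table[pc % len(self.table)][self.history] >= 2

    def update(self, pc, taken):
        ctrs = self.table[pc % len(self.table)]
        h = self.history
        ctrs[h] = min(ctrs[h] + 1, 3) if taken else max(ctrs[h] - 1, 0)
        self.history = ((h << 1) | taken) & self.mask  # shift in the outcome

p = CorrelatingPredictor()
p.update(0x40, True)
print(p.predict(0x40))  # True
```

Because each history value selects its own counter, the same branch can be predicted differently depending on what the preceding branches did.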

Correlating Branches

- (2,2) predictor: the behavior of the two most recent branches selects among four predictions for the next branch, updating just that prediction

[Figure: the branch address indexes a row of four 2-bit per-branch predictors; the 2-bit global branch history selects which of the four supplies the prediction]

Page 12: Introduction - cse.msu.edu

Accuracy of Different Schemes

[Figure: frequency of mispredictions on SPEC89 (nasa7, matrix300, doduc, spice, fpppp, gcc, espresso, eqntott, li, tomcatv) for three schemes: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) predictor. Rates range from 0-1% on the FP codes up to about 11% on the hardest integer codes; the (2,2) predictor, despite using fewer bits, generally matches or beats even the unlimited 2-bit BHT]

Tournament Predictors

- Multilevel branch predictor
- Uses an n-bit saturating counter to choose between predictors
- The usual choice is between global and local predictors

Page 13: Introduction - cse.msu.edu

Tournament Predictors

A tournament predictor using, say, 4K 2-bit counters indexed by the local branch address chooses between:

- Global predictor
  - 4K entries, indexed by the history of the last 12 branches (2^12 = 4K)
  - Each entry is a standard 2-bit predictor
- Local predictor
  - Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
  - The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters

Comparing Predictors (Fig. 2.8)

- The advantage of a tournament predictor is its ability to select the right predictor for a particular branch
  - Particularly crucial for integer benchmarks
  - A typical tournament predictor selects the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks

Page 14: Introduction - cse.msu.edu

Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)

[Figure: branch mispredictions per 1000 instructions on the Pentium 4. The SPECint2000 benchmarks (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) cluster around 7-13, while the SPECfp2000 benchmarks (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa) are far lower, several near 0]

- ≈6% misprediction rate per branch for SPECint (19% of INT instructions are branches)
- ≈2% misprediction rate per branch for SPECfp (5% of FP instructions are branches)

28

•  Branch target calculation is costly and stalls the instruction fetch.

•  BTB stores PCs the same way as caches •  The PC of a branch is sent to the BTB •  When a match is found

the corresponding Predicted PC is returned •  Even if the branch was predicted taken,

instruction fetch continues at the returned Predicted PC, i.e. no stall

Branch Target Buffers (BTB)

Page 15: Introduction - cse.msu.edu

Branch Target Buffers

- Organization?
- Proceed normally?
- PC of the instruction to fetch?
- What if the prediction is wrong?

Dynamic Branch Prediction Summary

- Prediction is becoming an important part of execution
- Branch History Table: 2 bits for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
  - Either different branches (GA)
  - Or different executions of the same branch (PA)
- Tournament predictors take this insight to the next level by using multiple predictors
  - Usually one based on global information and one based on local information, combined with a selector
  - In 2006, tournament predictors using ≈30K bits were in processors like the Power5 and Pentium 4
- Branch Target Buffer: includes the branch address & prediction

Page 16: Introduction - cse.msu.edu

Branch Prediction

- Basic 2-bit predictor:
  - For each branch, predict taken or not taken
  - If the prediction is wrong two consecutive times, change the prediction
- Correlating predictor:
  - Multiple 2-bit predictors for each branch
  - One for each possible combination of outcomes of the preceding n branches
- Local predictor:
  - Multiple 2-bit predictors for each branch
  - One for each possible combination of outcomes of the last n occurrences of this branch
- Tournament predictor:
  - Combines a correlating predictor with a local predictor

Branch Prediction Performance

[Figure: branch predictor performance comparison]

Page 17: Introduction - cse.msu.edu

Dynamic Scheduling

- Rearrange the order of instructions to reduce stalls while maintaining data flow
- Advantages:
  - The compiler doesn't need knowledge of the microarchitecture
  - Handles cases where dependences are unknown at compile time
- Disadvantages:
  - Substantial increase in hardware complexity
  - Complicates exceptions

Dynamic Scheduling

- Dynamic scheduling implies:
  - Out-of-order execution
  - Out-of-order completion
  - This creates the possibility of WAR and WAW hazards
- Tomasulo's approach:
  - Tracks when operands are available
  - Introduces register renaming in hardware
  - Minimizes WAW and WAR hazards

Page 18: Introduction - cse.msu.edu

Register Renaming

- Example:

      DIV.D  F0,F2,F4
      ADD.D  F6,F0,F8     ; reads F8 -> antidependence with SUB.D
      S.D    F6,0(R1)     ; reads F6 -> antidependence with MUL.D
      SUB.D  F8,F10,F14
      MUL.D  F6,F10,F8

- Plus a name (output) dependence with F6: ADD.D and MUL.D both write it

Register Renaming

- Example, renamed (S and T are fresh temporary registers):

      DIV.D  F0,F2,F4
      ADD.D  S,F0,F8
      S.D    S,0(R1)
      SUB.D  T,F10,F14
      MUL.D  F6,F10,T

- Now only RAW hazards remain, and these can be strictly ordered
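Renaming can be sketched as a single pass that maps each architectural destination to a fresh tag; T1, T2, ... play the role of S and T in the example, and the store is omitted for brevity (all names here are illustrative):

```python
# Rename each destination register to a fresh tag, reading sources through
# the current mapping. WAR and WAW name dependences disappear; RAW remains.
def rename(instructions):
    mapping, counter, out = {}, 0, []
    for op, dst, srcs in instructions:
        srcs = [mapping.get(s, s) for s in srcs]  # current names (keeps RAW)
        counter += 1
        mapping[dst] = f"T{counter}"              # fresh name for each write
        out.append((op, mapping[dst], srcs))
    return out

code = [("DIV.D", "F0", ["F2", "F4"]),
        ("ADD.D", "F6", ["F0", "F8"]),
        ("SUB.D", "F8", ["F10", "F14"]),
        ("MUL.D", "F6", ["F10", "F8"])]
for ins in rename(code):
    print(ins)
```

In the output, ADD.D reads the old F8 (still named F8) while MUL.D reads SUB.D's fresh tag, and the two writes to F6 get distinct tags, which is exactly why only the RAW chains survive.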

Page 19: Introduction - cse.msu.edu

Register Renaming

- Register renaming is provided by reservation stations (RS), each of which contains:
  - The instruction
  - Buffered operand values (when available)
  - The reservation station number of the instruction providing each operand value
- An RS fetches and buffers an operand as soon as it becomes available (not necessarily via the register file)
- Pending instructions designate the RS to which they will send their output
  - Result values are broadcast on a result bus, called the common data bus (CDB)
- Only the last output updates the register file
- As instructions are issued, the register specifiers are renamed to reservation stations
- There may be more reservation stations than registers

Tomasulo's Algorithm

- Load and store buffers
  - Contain data and addresses; act like reservation stations
- Top-level design: [figure on slide]

Page 20: Introduction - cse.msu.edu

Tomasulo's Algorithm

- Three steps:
  1. Issue
     - Get the next instruction from the FIFO queue
     - If a reservation station is available, issue the instruction to it, with operand values if available; operands not yet computed are recorded as the stations that will produce them
     - If no reservation station is available, stall the instruction (structural hazard)
  2. Execute
     - When an operand becomes available, store it in any reservation stations waiting for it
     - When all operands are ready, execute the instruction
     - Loads and stores are maintained in program order through the effective address
     - No instruction is allowed to initiate execution until all branches that precede it in program order have completed
  3. Write result
     - Write the result on the CDB into the reservation stations and store buffers
     - (Stores must wait until both address and value are received)
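The issue step can be sketched as follows; the dictionaries standing in for the register file, the register-status (producer) table, and a reservation station are illustrative simplifications of the hardware:

```python
# Issue: capture each operand as a value if it is ready, otherwise as the
# tag (name) of the reservation station that will produce it.
def issue(inst, regs, producers, station):
    op, dst, srcs = inst
    station["op"] = op
    station["V"], station["Q"] = [], []
    for s in srcs:
        if s in producers:                 # still being computed: record tag
            station["V"].append(None)
            station["Q"].append(producers[s])
        else:                              # ready: copy value from registers
            station["V"].append(regs[s])
            station["Q"].append(None)
    producers[dst] = station["name"]       # later readers wait on this RS

regs = {"F2": 2.0, "F4": 4.0}
producers = {}
rs = {"name": "Mult1"}
issue(("MUL.D", "F0", ["F2", "F4"]), regs, producers, rs)
print(rs["Q"], producers)  # [None, None] {'F0': 'Mult1'}
```

A subsequent instruction reading F0 would record the tag "Mult1" in its Q field and wait for that result on the CDB, which is the renaming at the heart of the algorithm.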

Advantages of Dynamic Scheduling

- Dynamic scheduling: the hardware rearranges instruction execution to reduce stalls while maintaining data flow and exception behavior
- It handles cases where dependences are unknown at compile time
  - It allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve
- It allows code compiled for one pipeline to run efficiently on a different pipeline
- It simplifies the compiler
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (next lecture)

Page 21: Introduction - cse.msu.edu

HW Schemes: Instruction Parallelism

- Key idea: allow instructions behind a stall to proceed

      DIVD  F0,F2,F4
      ADDD  F10,F0,F8
      SUBD  F12,F8,F14

- Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
- In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
- We will distinguish when an instruction begins execution and when it completes execution; between those two times, the instruction is in execution
- Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder

Dynamic Scheduling Step 1

- The simple pipeline had one stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
- Split the ID stage of the simple 5-stage pipeline into 2 stages:
  - Issue: decode instructions, check for structural hazards
  - Read operands: wait until no data hazards, then read operands

Page 22: Introduction - cse.msu.edu

A Dynamic Algorithm: Tomasulo's

- Designed for the IBM 360/91 (before caches!)
  - ⇒ Long memory latency
- Goal: high performance without special compilers
- The small number of floating-point registers (4 in the 360) prevented interesting compiler scheduling of operations
  - This led Tomasulo to ask how to get more registers: renaming in hardware!
- Why study a 1966 computer?
  - Its descendants have flourished!
    - Alpha 21264, Pentium 4, AMD Opteron, Power 5, ...

Tomasulo Algorithm

- Control & buffers are distributed with the functional units (FUs)
  - FU buffers are called "reservation stations"; they hold pending operands
- Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming
  - Renaming avoids WAR and WAW hazards
  - There are more reservation stations than registers, so the hardware can do optimizations compilers can't
- Results go to FUs from RSs, not through the registers, over the Common Data Bus, which broadcasts results to all FUs
  - RAW hazards are avoided by executing an instruction only when its operands are available
- Loads and stores are treated as FUs with RSs as well
- Integer instructions can go past branches (predict taken), allowing FP ops beyond the basic block in the FP queue

Page 23: Introduction - cse.msu.edu

Tomasulo Organization

[Figure: instructions flow from the FP op queue into reservation stations, Add1-Add3 in front of the FP adders and Mult1-Mult2 in front of the FP multipliers. Load buffers (Load1-Load6) bring operands from memory; store buffers send results to memory. The FP registers, reservation stations, and buffers are all connected by the Common Data Bus (CDB), which broadcasts every result]

Reservation Station Components

- Op: the operation to perform in the unit (e.g., + or -)
- Vj, Vk: the values of the source operands
  - Store buffers have a V field: the result to be stored
- Qj, Qk: the reservation stations producing the source operands (the values to be written)
  - Note: Qj,Qk = 0 => ready
  - Store buffers have only Qi, for the RS producing the result
- Busy: indicates that the reservation station or FU is busy
- Register result status: indicates which functional unit will write each register, if any; blank when no pending instruction will write that register
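The fields above map naturally onto a small data structure. A sketch using Python's dataclasses; representing "ready" as None rather than 0 is an implementation choice here, not the slide's encoding:

```python
# Reservation-station fields from the slide: Op, Vj/Vk (operand values),
# Qj/Qk (tags of the producing stations; None means the operand is ready).
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    op: Optional[str] = None      # operation to perform (e.g. "+", "-")
    vj: Optional[float] = None    # value of the first source operand
    vk: Optional[float] = None    # value of the second source operand
    qj: Optional[str] = None      # station producing Vj (None -> ready)
    qk: Optional[str] = None      # station producing Vk (None -> ready)
    busy: bool = False

    def ready(self) -> bool:
        """Both operands captured: the instruction may begin execution."""
        return self.busy and self.qj is None and self.qk is None

rs = ReservationStation(op="+", vj=1.5, qk="Load2", busy=True)
print(rs.ready())  # False: still waiting on Load2 to broadcast Vk
```

When Load2 broadcasts on the CDB, the station would copy the value into vk, clear qk, and become ready.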

Page 24: Introduction - cse.msu.edu

Three Stages of Tomasulo Algorithm

1. Issue: get an instruction from the FP op queue
   - If a reservation station is free (no structural hazard), control issues the instruction & sends operands (renames registers)
2. Execute: operate on operands (EX)
   - When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
3. Write result: finish execution (WB)
   - Write on the Common Data Bus to all awaiting units; mark the reservation station available
- A normal data bus carries data + destination (a "go to" bus)
- The Common Data Bus carries data + source (a "come from" bus)
  - 64 bits of data + 4 bits of functional-unit source address
  - A unit writes the value in if the source matches the functional unit it expects (the one producing its result)
  - The bus does the broadcast
- Example speeds: 3 clocks for FP +,-; 10 for *; 40 clocks for /

Tomasulo Example

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)
  LD    F2,45+(R3)
  MULTD F0,F2,F4
  SUBD  F8,F6,F2
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2

Load buffers (3): Load1 free; Load2 free; Load3 free
Reservation stations (3 FP adder RSs, 2 FP multiplier RSs):
  Add1 free; Add2 free; Add3 free; Mult1 free; Mult2 free
Register result status (clock 0): F0 F2 F4 F6 F8 F10 F12 ... F30, all FU fields empty

(The instruction stream enters in order; the clock-cycle counter advances each cycle, and each busy FU counts down its remaining execution time.)

Page 25: Introduction - cse.msu.edu

Tomasulo Example Cycle 1

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1
  LD    F2,45+(R3)
  MULTD F0,F2,F4
  SUBD  F8,F6,F2
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2

Load buffers: Load1 Busy Addr=34+R2; Load2 free; Load3 free
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 all free
Register result status (clock 1): F6=Load1

Tomasulo Example Cycle 2

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1
  LD    F2,45+(R3)              2
  MULTD F0,F2,F4
  SUBD  F8,F6,F2
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2

Load buffers: Load1 Busy Addr=34+R2; Load2 Busy Addr=45+R3; Load3 free
Reservation stations: all free
Register result status (clock 2): F2=Load2  F6=Load1

- Note: multiple loads can be outstanding at once

Page 26: Introduction - cse.msu.edu

Tomasulo Example Cycle 3

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3
  LD    F2,45+(R3)              2
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2

Load buffers: Load1 Busy Addr=34+R2; Load2 Busy Addr=45+R3; Load3 free
Reservation stations:
  Mult1 Busy MULTD  Vk=R(F4)  Qj=Load2
Register result status (clock 3): F0=Mult1  F2=Load2  F6=Load1

- Note: register names are removed ("renamed") in the reservation stations; MULTD issued
- Load1 is completing; what is waiting for Load1?

Tomasulo Example Cycle 4

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2

Load buffers: Load1 free; Load2 Busy Addr=45+R3; Load3 free
Reservation stations:
  Add1  Busy SUBD   Vj=M(A1)  Qk=Load2
  Mult1 Busy MULTD  Vk=R(F4)  Qj=Load2
Register result status (clock 4): F0=Mult1  F2=Load2  F6=M(A1)  F8=Add1

- Load2 is completing; what is waiting for Load2?

Page 27: Introduction - cse.msu.edu

Tomasulo Example Cycle 5

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2

Load buffers: all free
Reservation stations:
  Add1  Busy SUBD   Vj=M(A1)  Vk=M(A2)             (Time 2)
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 10)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 5): F0=Mult1  F2=M(A2)  F6=M(A1)  F8=Add1  F10=Mult2

- The countdown timers start for Add1 and Mult1

Tomasulo Example Cycle 6

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2                6

Reservation stations:
  Add1  Busy SUBD   Vj=M(A1)  Vk=M(A2)             (Time 1)
  Add2  Busy ADDD   Vk=M(A2)  Qj=Add1
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 9)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 6): F0=Mult1  F2=M(A2)  F6=Add2  F8=Add1  F10=Mult2

- Issue ADDD here despite the name dependence on F6?

Page 28: Introduction - cse.msu.edu

Tomasulo Example Cycle 7

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4       7
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2                6

Reservation stations:
  Add1  Busy SUBD   Vj=M(A1)  Vk=M(A2)             (Time 0)
  Add2  Busy ADDD   Vk=M(A2)  Qj=Add1
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 8)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 7): F0=Mult1  F2=M(A2)  F6=Add2  F8=Add1  F10=Mult2

- Add1 (SUBD) is completing; what is waiting for it?

Tomasulo Example Cycle 8

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4       7          8
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2                6

Reservation stations:
  Add2  Busy ADDD   Vj=(M-M)  Vk=M(A2)             (Time 2)
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 7)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 8): F0=Mult1  F2=M(A2)  F6=Add2  F8=(M-M)  F10=Mult2

Page 29: Introduction - cse.msu.edu

Tomasulo Example Cycle 9

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4       7          8
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2                6

Reservation stations:
  Add2  Busy ADDD   Vj=(M-M)  Vk=M(A2)             (Time 1)
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 6)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 9): F0=Mult1  F2=M(A2)  F6=Add2  F8=(M-M)  F10=Mult2

Tomasulo Example Cycle 10

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4       7          8
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2                6      10

Reservation stations:
  Add2  Busy ADDD   Vj=(M-M)  Vk=M(A2)             (Time 0)
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 5)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 10): F0=Mult1  F2=M(A2)  F6=Add2  F8=(M-M)  F10=Mult2

- Add2 (ADDD) is completing; what is waiting for it?

Page 30: Introduction - cse.msu.edu

The University of Adelaide, School of Computer Science 8 October 2012

Chapter 2 — Instructions: Language of the Computer 30

59 59

Tomasulo Example Cycle 11

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  4   Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
  -   Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status (clock 11):
FU:  F0=Mult1  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2

•  Write result of ADDD here?
•  All quick instructions complete in this cycle!


Tomasulo Example Cycle 12

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  3   Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
  -   Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status (clock 12):
FU:  F0=Mult1  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2


Tomasulo Example Cycle 13

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  2   Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
  -   Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status (clock 13):
FU:  F0=Mult1  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2


Tomasulo Example Cycle 14

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  1   Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
  -   Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status (clock 14):
FU:  F0=Mult1  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2


Tomasulo Example Cycle 15

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3      15
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  0   Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
  -   Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status (clock 15):
FU:  F0=Mult1  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2

•  Mult1 (MULTD) completing; what is waiting for it?


Tomasulo Example Cycle 16

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3      15      16
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  -   Mult1  No
 40   Mult2  Yes   DIVD   M*F4   M(A1)  -      -

Register result status (clock 16):
FU:  F0=M*F4  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2

•  Just waiting for Mult2 (DIVD) to complete


Faster-than-light computation (skipping ahead to cycle 55)


Tomasulo Example Cycle 55

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3      15      16
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  -   Mult1  No
  1   Mult2  Yes   DIVD   M*F4   M(A1)  -      -

Register result status (clock 55):
FU:  F0=M*F4  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2


Tomasulo Example Cycle 56

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3      15      16
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5      56
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  -   Mult1  No
  0   Mult2  Yes   DIVD   M*F4   M(A1)  -      -

Register result status (clock 56):
FU:  F0=M*F4  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2

•  Mult2 (DIVD) is completing; what is waiting for it?


Tomasulo Example Cycle 57

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3      15      16
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5      56      57
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  -   Mult1  No
  -   Mult2  Yes   DIVD   M*F4   M(A1)  -      -

Register result status (clock 57):
FU:  F0=M*F4  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Result (DIVD broadcasts on the CDB this cycle)

•  Once again: In-order issue, out-of-order execution and out-of-order completion.
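The cycle-by-cycle walk above hinges on the Common Data Bus broadcast: when a functional unit finishes, every reservation station waiting on that unit's tag captures the value in the same cycle. A minimal Python sketch of that mechanism (hypothetical class and field names, not from the slides):

```python
# Sketch only: reservation stations hold either operand values (Vj/Vk)
# or the tags of the producing stations (Qj/Qk), as in the tables above.
class Station:
    def __init__(self, name):
        self.name = name
        self.vj = self.vk = None   # operand values, once available
        self.qj = self.qk = None   # producing-station tags while waiting

    def ready(self):
        # an instruction can execute once neither operand is pending
        return self.qj is None and self.qk is None

def cdb_broadcast(stations, tag, value):
    """One CDB write: every station snooping the bus captures `value`
    if it was waiting on `tag` - all waiters wake simultaneously."""
    for s in stations:
        if s.qj == tag:
            s.vj, s.qj = value, None
        if s.qk == tag:
            s.vk, s.qk = value, None

# two instructions both waiting on Mult1's result (like SUBD/DIVD waiting on loads)
add1 = Station("Add1"); add1.qj = "Mult1"; add1.vk = 3.0
mult2 = Station("Mult2"); mult2.qj = "Mult1"; mult2.vk = 2.0
cdb_broadcast([add1, mult2], "Mult1", 10.0)
```

After the single broadcast, both waiters are ready, which is the "released simultaneously by broadcast on CDB" advantage listed two slides below.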


Why can Tomasulo overlap iterations of loops?
• Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
• Reservation stations
  - Permit instruction issue to advance past integer control-flow operations
  - Also buffer old values of registers, totally avoiding the WAR stall
• Other perspective: Tomasulo builds the data-flow dependency graph on the fly
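The "dynamic loop unrolling" point can be sketched in a few lines: each write to an architectural register grabs a fresh physical destination, so two iterations' writes to the same register never collide. A sketch with a hypothetical trace format, not a full renamer:

```python
# Sketch: rename a linear trace of (op, dest, sources). Every destination
# gets a fresh physical register, so WAW/WAR conflicts between loop
# iterations disappear; only true (RAW) dependences remain.
def rename(trace, n_phys=32):
    free = list(range(n_phys))        # pool of physical registers
    mapping = {}                      # architectural name -> physical number
    out = []
    for op, dst, srcs in trace:
        psrcs = [mapping.get(s, s) for s in srcs]  # read current mappings
        pdst = free.pop(0)            # fresh destination on every write
        mapping[dst] = pdst
        out.append((op, pdst, psrcs))
    return out

# two loop iterations, each writing F0 and F4
trace = [("LD",    "F0", ["R1"]), ("MULTD", "F4", ["F0", "F2"]),   # iter 1
         ("LD",    "F0", ["R1"]), ("MULTD", "F4", ["F0", "F2"])]   # iter 2
renamed = rename(trace)
```

Iteration 2's F0 and F4 land in different physical registers than iteration 1's, so both iterations can be in flight at once.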


Tomasulo's scheme offers two major advantages
1. Distribution of the hazard detection logic
   - Distributed reservation stations and the CDB
   - If multiple instructions are waiting on a single result, and each instruction has its other operands, then the instructions can be released simultaneously by a broadcast on the CDB
   - If a centralized register file were used, the units would have to read their results from the registers when register buses are available
2. Elimination of stalls for WAW and WAR hazards


Tomasulo Drawbacks
• Complexity
  - Complexity delayed the 360/91; MIPS 10000, Alpha 21264, and IBM PPC 620 appeared in CA:AQA 2/e, but not (yet) in silicon!
  - Many associative stores (CDB) at high speed
• Performance limited by the Common Data Bus
  - Each CDB must go to multiple functional units, so high capacitance and high wiring density
  - Number of functional units that can complete per cycle limited to one!
  - Multiple CDBs mean more FU logic for parallel associative stores
• Non-precise interrupts!
  - We will address this later


Example (Branch Prediction)
[Figure not reproduced in extraction]
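The example figure on this slide is not reproduced. As a stand-in, the classic two-bit saturating-counter predictor, the baseline dynamic branch-prediction scheme in this chapter, can be sketched (table size and indexing are illustrative assumptions):

```python
# Sketch: a table of 2-bit saturating counters indexed by low PC bits.
# States 0-1 predict not-taken, 2-3 predict taken; a single anomalous
# outcome does not flip a strongly-biased prediction.
class TwoBitPredictor:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)      # start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2   # True means predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = TwoBitPredictor()
p.update(0x40, True)
p.update(0x40, True)      # branch at 0x40 now strongly taken
```

One not-taken outcome after this leaves the prediction at taken; it takes two consecutive misses to flip it, which is the hysteresis that helps loop branches.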


Hardware-Based Speculation
• Execute instructions along predicted execution paths, but only commit the results if the prediction was correct
• Instruction commit: allowing an instruction to update the register file when the instruction is no longer speculative
• Need an additional piece of hardware to prevent any irrevocable action until an instruction commits
  - i.e., updating state or taking an exception


Reorder Buffer
• Reorder buffer: holds the result of an instruction between completion and commit
• Four fields:
  - Instruction type: branch/store/register
  - Destination field: register number
  - Value field: output value
  - Ready field: completed execution?
• Modify reservation stations: operand source is now the reorder buffer instead of the functional unit


Reorder Buffer
• Register values and memory values are not written until an instruction commits
• On misprediction:
  - Speculated entries in the ROB are cleared
• Exceptions:
  - Not recognized until the instruction is ready to commit
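The two reorder-buffer slides above can be condensed into a sketch, assuming exactly the four fields listed (type, destination, value, ready); the names are illustrative, not from any real design:

```python
# Sketch of a reorder buffer: results drain to architectural state
# strictly in program order, and a misprediction discards everything.
from collections import deque

class ROB:
    def __init__(self):
        self.buf = deque()            # entries in program order

    def allocate(self, itype, dest):
        entry = {"type": itype, "dest": dest, "value": None, "ready": False}
        self.buf.append(entry)
        return entry                  # functional unit fills value/ready later

    def commit(self, regfile):
        """Retire ready entries from the head only; an unready head
        blocks everything behind it (in-order commit)."""
        while self.buf and self.buf[0]["ready"]:
            e = self.buf.popleft()
            if e["type"] == "register":
                regfile[e["dest"]] = e["value"]   # state updated only here

    def flush(self):
        """Branch misprediction: clear all speculated entries."""
        self.buf.clear()
```

Note that a completed-but-younger instruction cannot commit past an older unready one, which is exactly what makes exceptions and mispredictions recoverable.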


Multiple Issue and Static Scheduling
• To achieve CPI < 1, need to complete multiple instructions per clock
• Solutions:
  - Statically scheduled superscalar processors
  - VLIW (very long instruction word) processors
  - Dynamically scheduled superscalar processors


Multiple Issue
[Figure not reproduced in extraction]


VLIW Processors
• Package multiple operations into one instruction
• Example VLIW processor:
  - One integer instruction (or branch)
  - Two independent floating-point operations
  - Two independent memory references
• Must be enough parallelism in code to fill the available slots
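The five-slot bundle shape described above (one integer/branch, two FP, two memory) can be sketched as a greedy slot filler; slot names and op format are illustrative:

```python
# Sketch: pack independent operations into one VLIW instruction word.
# Slots that cannot be filled become explicit no-ops, which illustrates
# the code-size cost of VLIW when parallelism is scarce.
SLOTS = ["int", "fp", "fp", "mem", "mem"]

def bundle(ops):
    """ops: list of (kind, text) pairs assumed independent of each other.
    Returns one instruction word with one entry per slot."""
    word = [None] * len(SLOTS)
    for kind, text in ops:
        for i, slot in enumerate(SLOTS):
            if slot == kind and word[i] is None:
                word[i] = text        # claim the first matching free slot
                break
    return [w if w is not None else "nop" for w in word]

word = bundle([("mem", "LD F0,0(R1)"),
               ("fp",  "ADDD F4,F0,F2"),
               ("int", "DADDIU R1,R1,#8")])
```

With only three independent operations available, two of the five slots are wasted on nops; finding enough parallelism statically is the compiler's whole burden here.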


VLIW Processors
• Disadvantages:
  - Statically finding parallelism
  - Code size
  - No hazard detection hardware
  - Binary code compatibility


Dynamic Scheduling, Multiple Issue, and Speculation
• Modern microarchitectures: dynamic scheduling + multiple issue + speculation
• Two approaches:
  - Assign reservation stations and update the pipeline control table in half clock cycles; only supports 2 instructions/clock
  - Design logic to handle any possible dependencies between the instructions
  - Hybrid approaches
• Issue logic can become a bottleneck


Overview of Design
[Figure not reproduced in extraction]


Multiple Issue
• Limit the number of instructions of a given class that can be issued in a "bundle"
  - e.g., one FP, one integer, one load, one store
• Examine all the dependencies among the instructions in the bundle
• If dependencies exist within the bundle, encode them in the reservation stations
• Also need multiple completion/commit


Example

Loop: LD     R2,0(R1)     ;R2 = array element
      DADDIU R2,R2,#1     ;increment R2
      SD     R2,0(R1)     ;store result
      DADDIU R1,R1,#8     ;increment pointer
      BNE    R2,R3,Loop   ;branch if not last element
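Read literally, the loop increments each element, stores it back, and keeps iterating while the incremented value differs from R3. A Python rendering of that behavior, under the assumption that R3 holds a sentinel value reached by the last element (the comment "branch if not last element" supports this reading, but it is an interpretation):

```python
def run_loop(a, r3):
    """Python rendering of the MIPS loop above. Assumes the sentinel in
    r3 is eventually matched; otherwise the original code would run past
    the array, and this sketch would raise IndexError."""
    i = 0
    while True:
        r2 = a[i] + 1       # LD + DADDIU R2,R2,#1
        a[i] = r2           # SD R2,0(R1)
        i += 1              # DADDIU R1,R1,#8 (pointer bump, one element)
        if r2 == r3:        # BNE falls through when R2 == R3
            break
    return a
```

For example, starting from [1, 2, 3] with r3 = 4, every element is incremented and the loop exits when the last incremented value hits the sentinel.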


Example (No Speculation)
[Figure not reproduced in extraction]


Example
[Figure not reproduced in extraction]


Branch-Target Buffer
• Need high instruction bandwidth!
• Branch-target buffers
  - Next-PC prediction buffer, indexed by the current PC
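A branch-target buffer can be sketched as a small map from fetch PC to predicted next PC; a real BTB is a tagged, set-associative memory consulted during the fetch cycle itself, so this dict-based API is only illustrative:

```python
# Sketch: BTB predicts the next fetch address in the same cycle as the
# fetch, before the instruction is even decoded as a branch.
class BTB:
    def __init__(self):
        self.entries = {}                  # fetch PC -> predicted target

    def predict(self, pc):
        # hit: redirect fetch to the stored target; miss: fall through
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        if taken:
            self.entries[pc] = target      # remember taken branches only
        else:
            self.entries.pop(pc, None)     # stop predicting a not-taken branch
```

Only taken branches are worth storing: a fall-through prediction costs nothing, so entries for not-taken branches are evicted.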


Branch Folding
• Optimization:
  - Larger branch-target buffer
  - Add the target instruction into the buffer to deal with the longer decoding time required by a larger buffer
  - "Branch folding"


Return Address Predictor
• Most unconditional branches come from function returns
• The same procedure can be called from multiple sites
  - Causes the buffer to potentially forget the return address from previous calls
• Create a return-address buffer organized as a stack
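The return-address stack from the slide can be sketched directly: calls push the return PC and return prediction pops it, so nested or repeated call sites do not evict each other the way a single BTB entry for the return instruction would (depth and names are illustrative):

```python
# Sketch: predict return targets with a small hardware stack.
class ReturnAddressStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        # pop the most recent call site; None models a stack underflow
        return self.stack.pop() if self.stack else None
```

Nested calls illustrate why a stack is the right structure: the predictions come back in last-in, first-out order, matching call/return nesting.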


Integrated Instruction Fetch Unit
• Design a monolithic unit that performs:
  - Branch prediction
  - Instruction prefetch (fetch ahead)
  - Instruction memory access and buffering (deal with crossing cache lines)


Register Renaming
• Register renaming vs. reorder buffers
  - Instead of virtual registers from reservation stations and the reorder buffer, create a single register pool containing visible registers and virtual registers
  - Use a hardware-based map to rename registers during issue
  - WAW and WAR hazards are avoided
  - Speculation recovery occurs by copying during commit
  - Still need a ROB-like queue to update the table in order
• Simplifies commit:
  - Record that the mapping between architectural register and physical register is no longer speculative
  - Free up the physical register used to hold the older value
  - In other words: swap physical registers on commit
• Physical register de-allocation is more difficult
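The commit rule above, freeing the physical register that held the *older* value rather than the newly written one, is the subtle part. A sketch with a map table and free list (register counts and names are assumptions for illustration):

```python
# Sketch of merged-register-file renaming: a map table from architectural
# to physical registers plus a free list. Renaming a destination returns
# both the new physical register and the OLD one, which must stay alive
# until the instruction commits (recovery may need the old value).
class Renamer:
    def __init__(self, n_arch=32, n_phys=64):
        self.map = {f"r{i}": i for i in range(n_arch)}   # identity at reset
        self.free = list(range(n_arch, n_phys))          # spare physicals

    def rename_dest(self, arch):
        old = self.map[arch]          # previous mapping, kept until commit
        new = self.free.pop(0)        # fresh physical destination
        self.map[arch] = new
        return new, old

    def commit(self, old_phys):
        # the older value can no longer be needed: recycle its register
        self.free.append(old_phys)
```

On a misprediction before commit, the map entry would be restored to `old` instead, which is why the old physical register cannot be freed at rename time.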


Integrated Issue and Renaming
• Combining instruction issue with register renaming:
  - Issue logic pre-reserves enough physical registers for the bundle (fixed number?)
  - Issue logic finds dependencies within the bundle, maps registers as necessary
  - Issue logic finds dependencies between the current bundle and already in-flight bundles, maps registers as necessary


How Much?
• How much to speculate
  - Mis-speculation degrades performance and power relative to no speculation
  - May cause additional misses (cache, TLB)
  - Prevent speculative code from causing costlier misses (e.g., in L2)
• Speculating through multiple branches
  - Complicates speculation recovery
  - No processor can resolve multiple branches per cycle


Energy Efficiency
• Speculation and energy efficiency
  - Note: speculation is only energy efficient when it significantly improves performance
• Value prediction
  - Uses: loads that load from a constant pool; instructions that produce a value from a small set of values
  - Has not been incorporated into modern processors
  - A similar idea, address aliasing prediction, is used on some processors