introduction - cse.msu.edu


Page 1: Introduction - cse.msu.edu

The University of Adelaide, School of Computer Science 8 October 2012

Chapter 2 — Instructions: Language of the Computer 1

1 Copyright © 2012, Elsevier Inc. All rights reserved.

Chapter 3

Instruction-Level Parallelism and Its Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition

Introduction

- Pipelining became a universal technique in 1985
  - Overlaps the execution of instructions
  - Exploits "instruction-level parallelism" (ILP)
- Beyond this, there are two main approaches:
  - Hardware-based dynamic approaches
    - Used in server and desktop processors
    - Not used as extensively in PMP processors
  - Compiler-based static approaches
    - Not as successful outside of scientific applications

Page 2: Introduction - cse.msu.edu

Instruction-Level Parallelism

- When exploiting instruction-level parallelism, the goal is to minimize CPI
- Pipeline CPI = ideal pipeline CPI + structural stalls + data hazard stalls + control stalls
- Parallelism within a basic block is limited
  - Typical basic block size = 3-6 instructions
  - Must optimize across branches
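The CPI decomposition above is purely additive, so it can be computed directly. A minimal sketch in Python; the stall rates below are illustrative numbers, not measurements from the slides:

```python
# Pipeline CPI = ideal pipeline CPI + average stall cycles per instruction
# contributed by structural, data-hazard, and control stalls.
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Each argument after ideal_cpi is average stall cycles per instruction."""
    return ideal_cpi + structural + data_hazard + control

# Illustrative: an ideal CPI of 1.0 plus modest stall contributions.
print(pipeline_cpi(1.0, 0.05, 0.30, 0.20))
```

Reducing any of the three stall terms (the subject of the techniques that follow) moves the pipeline back toward its ideal CPI.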

Data Dependence

- Loop-level parallelism
  - Unroll the loop statically or dynamically
  - Use SIMD (vector processors and GPUs)
- Challenge: data dependence
  - Instruction j is data dependent on instruction i if:
    - Instruction i produces a result that may be used by instruction j, or
    - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
  - Dependent instructions cannot be executed simultaneously

Page 3: Introduction - cse.msu.edu

Data Dependence

- Dependences are a property of programs
- The pipeline organization determines whether a dependence is detected and whether it causes a stall
- A data dependence conveys:
  - The possibility of a hazard
  - The order in which results must be calculated
  - An upper bound on exploitable instruction-level parallelism
- Dependences that flow through memory locations are difficult to detect

Name Dependence

- Two instructions use the same name, but there is no flow of information
  - Not a true data dependence, but a problem when reordering instructions
- Antidependence: instruction j writes a register or memory location that instruction i reads
  - The initial ordering (i before j) must be preserved
- Output dependence: instruction i and instruction j write the same register or memory location
  - Ordering must be preserved
- To resolve, use renaming techniques

Page 4: Introduction - cse.msu.edu

Other Factors

- Data hazards
  - Read after write (RAW)
  - Write after write (WAW)
  - Write after read (WAR)
- Control dependence
  - Ordering of instruction i with respect to a branch instruction
  - An instruction control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
  - An instruction not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch

Examples

- Example 1: the OR instruction is dependent on DADDU and DSUBU

        DADDU R1,R2,R3
        BEQZ  R4,L
        DSUBU R1,R1,R6
  L:    ...
        OR    R7,R1,R8

- Example 2: assume R4 isn't used after skip; it is then possible to move DSUBU before the branch

        DADDU R1,R2,R3
        BEQZ  R12,skip
        DSUBU R4,R5,R6
        DADDU R5,R4,R9
  skip: OR    R7,R8,R9

Page 5: Introduction - cse.msu.edu

Compiler Techniques for Exposing ILP

- Pipeline scheduling
  - Separate a dependent instruction from the source instruction by the pipeline latency of the source instruction
- Example:

      for (i=999; i>=0; i=i-1)
          x[i] = x[i] + s;

Pipeline Stalls

Loop: L.D     F0,0(R1)
      stall
      ADD.D   F4,F0,F2
      stall
      stall
      S.D     F4,0(R1)
      DADDUI  R1,R1,#-8
      stall               ;(assume integer load latency is 1)
      BNE     R1,R2,Loop

Page 6: Introduction - cse.msu.edu

Pipeline Scheduling

Scheduled code:

Loop: L.D     F0,0(R1)
      DADDUI  R1,R1,#-8
      ADD.D   F4,F0,F2
      stall
      stall
      S.D     F4,8(R1)
      BNE     R1,R2,Loop

Loop Unrolling

- Unroll by a factor of 4 (assume the number of elements is divisible by 4)
- Eliminate unnecessary instructions

Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)    ;drop DADDUI & BNE
      L.D     F6,-8(R1)
      ADD.D   F8,F6,F2
      S.D     F8,-8(R1)   ;drop DADDUI & BNE
      L.D     F10,-16(R1)
      ADD.D   F12,F10,F2
      S.D     F12,-16(R1) ;drop DADDUI & BNE
      L.D     F14,-24(R1)
      ADD.D   F16,F14,F2
      S.D     F16,-24(R1)
      DADDUI  R1,R1,#-32
      BNE     R1,R2,Loop

- Note: the number of live registers grows versus the original loop
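The structure of the unrolled loop can be mirrored in a minimal Python sketch; the function name is illustrative, and like the slide it assumes the element count is divisible by 4:

```python
# Unroll x[i] = x[i] + s by 4: one loop-overhead update per four elements,
# mirroring the single DADDUI/BNE pair per four L.D/ADD.D/S.D groups above.
def add_scalar_unrolled(x, s):
    assert len(x) % 4 == 0      # same assumption as the slide
    for i in range(0, len(x), 4):
        x[i]     += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
    return x

print(add_scalar_unrolled([1, 2, 3, 4], 10))  # [11, 12, 13, 14]
```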

Page 7: Introduction - cse.msu.edu

Loop Unrolling/Pipeline Scheduling

- Pipeline-schedule the unrolled loop:

Loop: L.D     F0,0(R1)
      L.D     F6,-8(R1)
      L.D     F10,-16(R1)
      L.D     F14,-24(R1)
      ADD.D   F4,F0,F2
      ADD.D   F8,F6,F2
      ADD.D   F12,F10,F2
      ADD.D   F16,F14,F2
      S.D     F4,0(R1)
      S.D     F8,-8(R1)
      DADDUI  R1,R1,#-32
      S.D     F12,16(R1)
      S.D     F16,8(R1)
      BNE     R1,R2,Loop

Strip Mining

- Unknown number of loop iterations?
  - Number of iterations = n
  - Goal: make k copies of the loop body
  - Generate a pair of loops:
    - The first executes n mod k times
    - The second executes n / k times
  - "Strip mining"
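The two-loop structure of strip mining can be sketched as follows, reusing the x[i] += s body from the earlier example (names are illustrative):

```python
# Strip mining for an unknown trip count n: a cleanup loop runs n mod k
# times, then the unrolled loop runs n // k times over k-element strips.
def add_scalar_strip_mined(x, s, k=4):
    n = len(x)
    for i in range(n % k):            # first loop: n mod k iterations
        x[i] += s
    for i in range(n % k, n, k):      # second loop: n // k unrolled iterations
        for j in range(k):
            x[i + j] += s
    return x

print(add_scalar_strip_mined([1] * 6, 1))  # [2, 2, 2, 2, 2, 2]
```

With n = 6 and k = 4, the cleanup loop handles 2 elements and the unrolled loop handles one 4-element strip.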

Page 8: Introduction - cse.msu.edu

Three Limits to Loop Unrolling

1. Decrease in the amount of overhead amortized with each extra unrolling
   - Amdahl's Law
2. Growth in code size
   - For larger loops, the concern is that it increases the instruction cache miss rate
3. Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling
   - If it is not possible to allocate all live values to registers, we may lose some or all of the advantage

- Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction
- Conclusion: a little loop unrolling is beneficial; trying to do a lot is hard.

Static Branch Prediction

- To reorder code around branches, the branch must be predicted statically, at compile time
- The simplest scheme is to predict every branch as taken
  - Average misprediction rate = the untaken branch frequency = 34% for SPEC
- A more accurate scheme predicts branches using profile information collected from earlier runs, modifying the prediction based on the last run:

[Figure: misprediction rate of profile-based static prediction on SPEC. Integer benchmarks: compress 12%, eqntott 22%, espresso 18%, gcc 11%, li 12%. Floating-point benchmarks: doduc 4%, ear 6%, hydro2d 9%, mdljdp 10%, su2cor 15%]

Page 9: Introduction - cse.msu.edu

Dynamic Branch Prediction

- Why does prediction work?
  - The underlying algorithm has regularities
  - The data being operated on has regularities
  - The instruction sequence has redundancies that are artifacts of the way humans and compilers think about problems
- Is dynamic branch prediction better than static branch prediction?
  - It seems to be
  - There are a small number of important branches in programs that have dynamic behavior

Dynamic Branch Prediction

- Performance = f(accuracy, cost of misprediction)
- Branch History Table (BHT): the lower bits of the PC address index a table of 1-bit values
  - Each bit says whether or not the branch was taken last time
  - No address check
- Problem: in a loop, a 1-bit BHT causes two mispredictions per loop execution (the average loop runs 9 iterations before exit):
  - At the end of the loop, when it exits instead of looping as before
  - On the first iteration the next time through the code, when it predicts exit instead of looping
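The two-mispredictions-per-loop behavior is easy to check with a toy 1-bit predictor for a single branch (an illustrative model, not any real BHT layout):

```python
# Toy 1-bit predictor for one branch: remember only the last outcome.
def one_bit_mispredictions(outcomes, last_taken=True):
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken       # 1 bit: always retrain to the last outcome
    return misses

loop = [True] * 9 + [False]      # a loop branch: taken 9 times, then exit
# 3 total: exit miss on pass 1, then exit + re-entry misses on pass 2.
print(one_bit_mispredictions(loop * 2))  # 3
```

In steady state each pass through the loop costs two mispredictions, exactly as the slide describes.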

Page 10: Introduction - cse.msu.edu

Dynamic Branch Prediction

- Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions
  - Red: stop, not taken
  - Green: go, taken
  - Adds hysteresis to the decision-making process

[Figure: 2-bit predictor state machine with two "Predict Taken" states and two "Predict Not Taken" states; a taken (T) outcome moves toward "Predict Taken", a not-taken (NT) outcome moves toward "Predict Not Taken", so the prediction flips only after two wrong outcomes in a row]
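Repeating the loop experiment with a 2-bit saturating counter shows the hysteresis: only the loop exit is mispredicted on each pass (again a toy model):

```python
# 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken;
# the prediction flips only after two consecutive mispredictions.
def two_bit_mispredictions(outcomes, state=3):   # start in "strongly taken"
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

loop = [True] * 9 + [False]
print(two_bit_mispredictions(loop * 2))  # 2: one miss per pass, at the exit only
```

Compared with the 1-bit predictor's two misses per pass, the 2-bit counter halves the loop-branch misprediction rate.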

BHT Accuracy

- A misprediction occurs because either:
  - The guess for that branch was wrong, or
  - The branch history of a different branch was read when indexing the table
- 4096-entry table:

[Figure: misprediction rates of a 4096-entry 2-bit BHT on SPEC89. Integer benchmarks: eqntott 18%, espresso 5%, gcc 12%, li 10%, spice 9%. Floating-point benchmarks: doduc 5%, spice 9%, fpppp 9%, matrix300 0%, nasa7 1%]

Page 11: Introduction - cse.msu.edu

Correlated Branch Prediction

- Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
- In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters
  - Thus, the old 2-bit BHT is a (0,2) predictor
- Global branch history: an m-bit shift register keeping the taken/not-taken status of the last m branches
- Each table entry has 2^m n-bit predictors
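A sketch of an (m,n) = (2,2) predictor as described above; the table size, the PC-modulo indexing, and the class name are illustrative simplifications, not a real hardware layout:

```python
# (2,2) predictor: a 2-bit global history selects one of four 2-bit
# saturating counters per branch entry; only the selected counter updates.
class CorrelatingPredictor:
    def __init__(self, entries=1024, m=2):
        self.history = 0                     # m-bit global shift register
        self.mask = (1 << m) - 1
        self.table = [[2] * (1 << m) for _ in range(entries)]  # weakly taken

    def predict(self, pc):
        return self.table[pc % len(self.table)][self.history] >= 2

    def update(self, pc, taken):
        ctrs = self.table[pc % len(self.table)]
        h = self.history
        ctrs[h] = min(ctrs[h] + 1, 3) if taken else max(ctrs[h] - 1, 0)
        self.history = ((h << 1) | taken) & self.mask  # shift in the outcome

p = CorrelatingPredictor()
p.update(0x40, True)
print(p.predict(0x40))  # True
```

Because each history value selects its own counter, the same branch can be predicted differently depending on what the preceding branches did.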

Correlating Branches

- (2,2) predictor: the behavior of the two most recent branches selects among four predictions for the next branch, updating just that prediction

[Figure: the branch address indexes a row of four 2-bit per-branch predictors; the 2-bit global branch history selects which of the four supplies the prediction]

Page 12: Introduction - cse.msu.edu

Accuracy of Different Schemes

[Figure: frequency of mispredictions on SPEC89 (nasa7, matrix300, doduc, spice, fpppp, gcc, espresso, eqntott, li, tomcatv) for three schemes: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) predictor. Rates range from 0-1% on the FP codes up to about 11% on the hardest integer codes; the (2,2) predictor, despite using fewer bits, generally matches or beats even the unlimited 2-bit BHT]

Tournament Predictors

- Multilevel branch predictor
- Uses an n-bit saturating counter to choose between predictors
- The usual choice is between global and local predictors

Page 13: Introduction - cse.msu.edu

Tournament Predictors

A tournament predictor using, say, 4K 2-bit counters indexed by the local branch address chooses between:

- Global predictor
  - 4K entries, indexed by the history of the last 12 branches (2^12 = 4K)
  - Each entry is a standard 2-bit predictor
- Local predictor
  - Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
  - The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters

Comparing Predictors (Fig. 2.8)

- The advantage of a tournament predictor is its ability to select the right predictor for a particular branch
  - Particularly crucial for integer benchmarks
  - A typical tournament predictor selects the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks

Page 14: Introduction - cse.msu.edu

Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)

[Figure: branch mispredictions per 1000 instructions on the Pentium 4. The SPECint2000 benchmarks (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) cluster around 7-13, while the SPECfp2000 benchmarks (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa) are far lower, several near 0]

- ≈6% misprediction rate per branch for SPECint (19% of INT instructions are branches)
- ≈2% misprediction rate per branch for SPECfp (5% of FP instructions are branches)

28

•  Branch target calculation is costly and stalls the instruction fetch.

•  BTB stores PCs the same way as caches •  The PC of a branch is sent to the BTB •  When a match is found

the corresponding Predicted PC is returned •  Even if the branch was predicted taken,

instruction fetch continues at the returned Predicted PC, i.e. no stall

Branch Target Buffers (BTB)

Page 15: Introduction - cse.msu.edu

Branch Target Buffers

- Organization?
- Proceed normally?
- PC of the instruction to fetch?
- What if the prediction is wrong?

Dynamic Branch Prediction Summary

- Prediction is becoming an important part of execution
- Branch History Table: 2 bits for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
  - Either different branches (GA)
  - Or different executions of the same branch (PA)
- Tournament predictors take this insight to the next level by using multiple predictors
  - Usually one based on global information and one based on local information, combined with a selector
  - In 2006, tournament predictors using ≈30K bits were in processors like the Power5 and Pentium 4
- Branch Target Buffer: includes the branch address & prediction

Page 16: Introduction - cse.msu.edu

Branch Prediction

- Basic 2-bit predictor:
  - For each branch, predict taken or not taken
  - If the prediction is wrong two consecutive times, change the prediction
- Correlating predictor:
  - Multiple 2-bit predictors for each branch
  - One for each possible combination of outcomes of the preceding n branches
- Local predictor:
  - Multiple 2-bit predictors for each branch
  - One for each possible combination of outcomes of the last n occurrences of this branch
- Tournament predictor:
  - Combines a correlating predictor with a local predictor

Branch Prediction Performance

[Figure: branch predictor performance comparison]

Page 17: Introduction - cse.msu.edu

Dynamic Scheduling

- Rearrange the order of instructions to reduce stalls while maintaining data flow
- Advantages:
  - The compiler doesn't need knowledge of the microarchitecture
  - Handles cases where dependences are unknown at compile time
- Disadvantages:
  - Substantial increase in hardware complexity
  - Complicates exceptions

Dynamic Scheduling

- Dynamic scheduling implies:
  - Out-of-order execution
  - Out-of-order completion
  - This creates the possibility of WAR and WAW hazards
- Tomasulo's approach:
  - Tracks when operands are available
  - Introduces register renaming in hardware
  - Minimizes WAW and WAR hazards

Page 18: Introduction - cse.msu.edu

Register Renaming

- Example:

      DIV.D  F0,F2,F4
      ADD.D  F6,F0,F8     ; reads F8 -> antidependence with SUB.D
      S.D    F6,0(R1)     ; reads F6 -> antidependence with MUL.D
      SUB.D  F8,F10,F14
      MUL.D  F6,F10,F8

- Plus a name (output) dependence with F6: ADD.D and MUL.D both write it

Register Renaming

- Example, renamed (S and T are fresh temporary registers):

      DIV.D  F0,F2,F4
      ADD.D  S,F0,F8
      S.D    S,0(R1)
      SUB.D  T,F10,F14
      MUL.D  F6,F10,T

- Now only RAW hazards remain, and these can be strictly ordered
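Renaming can be sketched as a single pass that maps each architectural destination to a fresh tag; T1, T2, ... play the role of S and T in the example, and the store is omitted for brevity (all names here are illustrative):

```python
# Rename each destination register to a fresh tag, reading sources through
# the current mapping. WAR and WAW name dependences disappear; RAW remains.
def rename(instructions):
    mapping, counter, out = {}, 0, []
    for op, dst, srcs in instructions:
        srcs = [mapping.get(s, s) for s in srcs]  # current names (keeps RAW)
        counter += 1
        mapping[dst] = f"T{counter}"              # fresh name for each write
        out.append((op, mapping[dst], srcs))
    return out

code = [("DIV.D", "F0", ["F2", "F4"]),
        ("ADD.D", "F6", ["F0", "F8"]),
        ("SUB.D", "F8", ["F10", "F14"]),
        ("MUL.D", "F6", ["F10", "F8"])]
for ins in rename(code):
    print(ins)
```

In the output, ADD.D reads the old F8 (still named F8) while MUL.D reads SUB.D's fresh tag, and the two writes to F6 get distinct tags, which is exactly why only the RAW chains survive.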

Page 19: Introduction - cse.msu.edu

Register Renaming

- Register renaming is provided by reservation stations (RS), each of which contains:
  - The instruction
  - Buffered operand values (when available)
  - The reservation station number of the instruction providing each operand value
- An RS fetches and buffers an operand as soon as it becomes available (not necessarily via the register file)
- Pending instructions designate the RS to which they will send their output
  - Result values are broadcast on a result bus, called the common data bus (CDB)
- Only the last output updates the register file
- As instructions are issued, the register specifiers are renamed to reservation stations
- There may be more reservation stations than registers

Tomasulo's Algorithm

- Load and store buffers
  - Contain data and addresses; act like reservation stations
- Top-level design: [figure on slide]

Page 20: Introduction - cse.msu.edu

Tomasulo's Algorithm

- Three steps:
  1. Issue
     - Get the next instruction from the FIFO queue
     - If a reservation station is available, issue the instruction to it, with operand values if available; operands not yet computed are recorded as the stations that will produce them
     - If no reservation station is available, stall the instruction (structural hazard)
  2. Execute
     - When an operand becomes available, store it in any reservation stations waiting for it
     - When all operands are ready, execute the instruction
     - Loads and stores are maintained in program order through the effective address
     - No instruction is allowed to initiate execution until all branches that precede it in program order have completed
  3. Write result
     - Write the result on the CDB into the reservation stations and store buffers
     - (Stores must wait until both address and value are received)
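The issue step can be sketched as follows; the dictionaries standing in for the register file, the register-status (producer) table, and a reservation station are illustrative simplifications of the hardware:

```python
# Issue: capture each operand as a value if it is ready, otherwise as the
# tag (name) of the reservation station that will produce it.
def issue(inst, regs, producers, station):
    op, dst, srcs = inst
    station["op"] = op
    station["V"], station["Q"] = [], []
    for s in srcs:
        if s in producers:                 # still being computed: record tag
            station["V"].append(None)
            station["Q"].append(producers[s])
        else:                              # ready: copy value from registers
            station["V"].append(regs[s])
            station["Q"].append(None)
    producers[dst] = station["name"]       # later readers wait on this RS

regs = {"F2": 2.0, "F4": 4.0}
producers = {}
rs = {"name": "Mult1"}
issue(("MUL.D", "F0", ["F2", "F4"]), regs, producers, rs)
print(rs["Q"], producers)  # [None, None] {'F0': 'Mult1'}
```

A subsequent instruction reading F0 would record the tag "Mult1" in its Q field and wait for that result on the CDB, which is the renaming at the heart of the algorithm.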

Advantages of Dynamic Scheduling

- Dynamic scheduling: the hardware rearranges instruction execution to reduce stalls while maintaining data flow and exception behavior
- It handles cases where dependences are unknown at compile time
  - It allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve
- It allows code compiled for one pipeline to run efficiently on a different pipeline
- It simplifies the compiler
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (next lecture)

Page 21: Introduction - cse.msu.edu

HW Schemes: Instruction Parallelism

- Key idea: allow instructions behind a stall to proceed

      DIVD  F0,F2,F4
      ADDD  F10,F0,F8
      SUBD  F12,F8,F14

- Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
- In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
- We will distinguish when an instruction begins execution and when it completes execution; between those two times, the instruction is in execution
- Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder

Dynamic Scheduling Step 1

- The simple pipeline had one stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
- Split the ID stage of the simple 5-stage pipeline into 2 stages:
  - Issue: decode instructions, check for structural hazards
  - Read operands: wait until no data hazards, then read operands

Page 22: Introduction - cse.msu.edu

A Dynamic Algorithm: Tomasulo's

- Designed for the IBM 360/91 (before caches!)
  - ⇒ Long memory latency
- Goal: high performance without special compilers
- The small number of floating-point registers (4 in the 360) prevented interesting compiler scheduling of operations
  - This led Tomasulo to ask how to get more registers: renaming in hardware!
- Why study a 1966 computer?
  - Its descendants have flourished!
    - Alpha 21264, Pentium 4, AMD Opteron, Power 5, ...

Tomasulo Algorithm

- Control & buffers are distributed with the functional units (FUs)
  - FU buffers are called "reservation stations"; they hold pending operands
- Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming
  - Renaming avoids WAR and WAW hazards
  - There are more reservation stations than registers, so the hardware can do optimizations compilers can't
- Results go to FUs from RSs, not through the registers, over the Common Data Bus, which broadcasts results to all FUs
  - RAW hazards are avoided by executing an instruction only when its operands are available
- Loads and stores are treated as FUs with RSs as well
- Integer instructions can go past branches (predict taken), allowing FP ops beyond the basic block in the FP queue

Page 23: Introduction - cse.msu.edu

Tomasulo Organization

[Figure: instructions flow from the FP op queue into reservation stations, Add1-Add3 in front of the FP adders and Mult1-Mult2 in front of the FP multipliers. Load buffers (Load1-Load6) bring operands from memory; store buffers send results to memory. The FP registers, reservation stations, and buffers are all connected by the Common Data Bus (CDB), which broadcasts every result]

Reservation Station Components

- Op: the operation to perform in the unit (e.g., + or -)
- Vj, Vk: the values of the source operands
  - Store buffers have a V field: the result to be stored
- Qj, Qk: the reservation stations producing the source operands (the values to be written)
  - Note: Qj,Qk = 0 => ready
  - Store buffers have only Qi, for the RS producing the result
- Busy: indicates that the reservation station or FU is busy
- Register result status: indicates which functional unit will write each register, if any; blank when no pending instruction will write that register
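The fields above map naturally onto a small data structure. A sketch using Python's dataclasses; representing "ready" as None rather than 0 is an implementation choice here, not the slide's encoding:

```python
# Reservation-station fields from the slide: Op, Vj/Vk (operand values),
# Qj/Qk (tags of the producing stations; None means the operand is ready).
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    op: Optional[str] = None      # operation to perform (e.g. "+", "-")
    vj: Optional[float] = None    # value of the first source operand
    vk: Optional[float] = None    # value of the second source operand
    qj: Optional[str] = None      # station producing Vj (None -> ready)
    qk: Optional[str] = None      # station producing Vk (None -> ready)
    busy: bool = False

    def ready(self) -> bool:
        """Both operands captured: the instruction may begin execution."""
        return self.busy and self.qj is None and self.qk is None

rs = ReservationStation(op="+", vj=1.5, qk="Load2", busy=True)
print(rs.ready())  # False: still waiting on Load2 to broadcast Vk
```

When Load2 broadcasts on the CDB, the station would copy the value into vk, clear qk, and become ready.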

Page 24: Introduction - cse.msu.edu

Three Stages of Tomasulo Algorithm

1. Issue: get an instruction from the FP op queue
   - If a reservation station is free (no structural hazard), control issues the instruction & sends operands (renames registers)
2. Execute: operate on operands (EX)
   - When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
3. Write result: finish execution (WB)
   - Write on the Common Data Bus to all awaiting units; mark the reservation station available
- A normal data bus carries data + destination (a "go to" bus)
- The Common Data Bus carries data + source (a "come from" bus)
  - 64 bits of data + 4 bits of functional-unit source address
  - A unit writes the value in if the source matches the functional unit it expects (the one producing its result)
  - The bus does the broadcast
- Example speeds: 3 clocks for FP +,-; 10 for *; 40 clocks for /

Tomasulo Example

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)
  LD    F2,45+(R3)
  MULTD F0,F2,F4
  SUBD  F8,F6,F2
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2

Load buffers (3): Load1 free; Load2 free; Load3 free
Reservation stations (3 FP adder RSs, 2 FP multiplier RSs):
  Add1 free; Add2 free; Add3 free; Mult1 free; Mult2 free
Register result status (clock 0): F0 F2 F4 F6 F8 F10 F12 ... F30, all FU fields empty

(The instruction stream enters in order; the clock-cycle counter advances each cycle, and each busy FU counts down its remaining execution time.)

Page 25: Introduction - cse.msu.edu

Tomasulo Example Cycle 1

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1
  LD    F2,45+(R3)
  MULTD F0,F2,F4
  SUBD  F8,F6,F2
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2

Load buffers: Load1 Busy Addr=34+R2; Load2 free; Load3 free
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 all free
Register result status (clock 1): F6=Load1

Tomasulo Example Cycle 2

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1
  LD    F2,45+(R3)              2
  MULTD F0,F2,F4
  SUBD  F8,F6,F2
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2

Load buffers: Load1 Busy Addr=34+R2; Load2 Busy Addr=45+R3; Load3 free
Reservation stations: all free
Register result status (clock 2): F2=Load2  F6=Load1

- Note: multiple loads can be outstanding at once

Page 26: Introduction - cse.msu.edu

Tomasulo Example Cycle 3

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3
  LD    F2,45+(R3)              2
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2

Load buffers: Load1 Busy Addr=34+R2; Load2 Busy Addr=45+R3; Load3 free
Reservation stations:
  Mult1 Busy MULTD  Vk=R(F4)  Qj=Load2
Register result status (clock 3): F0=Mult1  F2=Load2  F6=Load1

- Note: register names are removed ("renamed") in the reservation stations; MULTD issued
- Load1 is completing; what is waiting for Load1?

Tomasulo Example Cycle 4

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4
  DIVD  F10,F0,F6
  ADDD  F6,F8,F2

Load buffers: Load1 free; Load2 Busy Addr=45+R3; Load3 free
Reservation stations:
  Add1  Busy SUBD   Vj=M(A1)  Qk=Load2
  Mult1 Busy MULTD  Vk=R(F4)  Qj=Load2
Register result status (clock 4): F0=Mult1  F2=Load2  F6=M(A1)  F8=Add1

- Load2 is completing; what is waiting for Load2?

Page 27: Introduction - cse.msu.edu

Tomasulo Example Cycle 5

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2

Load buffers: all free
Reservation stations:
  Add1  Busy SUBD   Vj=M(A1)  Vk=M(A2)             (Time 2)
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 10)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 5): F0=Mult1  F2=M(A2)  F6=M(A1)  F8=Add1  F10=Mult2

- The countdown timers start for Add1 and Mult1

Tomasulo Example Cycle 6

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2                6

Reservation stations:
  Add1  Busy SUBD   Vj=M(A1)  Vk=M(A2)             (Time 1)
  Add2  Busy ADDD   Vk=M(A2)  Qj=Add1
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 9)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 6): F0=Mult1  F2=M(A2)  F6=Add2  F8=Add1  F10=Mult2

- Issue ADDD here despite the name dependence on F6?

Page 28: Introduction - cse.msu.edu

Tomasulo Example Cycle 7

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4       7
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2                6

Reservation stations:
  Add1  Busy SUBD   Vj=M(A1)  Vk=M(A2)             (Time 0)
  Add2  Busy ADDD   Vk=M(A2)  Qj=Add1
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 8)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 7): F0=Mult1  F2=M(A2)  F6=Add2  F8=Add1  F10=Mult2

- Add1 (SUBD) is completing; what is waiting for it?

Tomasulo Example Cycle 8

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4       7          8
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2                6

Reservation stations:
  Add2  Busy ADDD   Vj=(M-M)  Vk=M(A2)             (Time 2)
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 7)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 8): F0=Mult1  F2=M(A2)  F6=Add2  F8=(M-M)  F10=Mult2

Page 29: Introduction - cse.msu.edu

Tomasulo Example Cycle 9

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4       7          8
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2                6

Reservation stations:
  Add2  Busy ADDD   Vj=(M-M)  Vk=M(A2)             (Time 1)
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 6)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 9): F0=Mult1  F2=M(A2)  F6=Add2  F8=(M-M)  F10=Mult2

Tomasulo Example Cycle 10

Instruction status:           Issue  ExecComp  WriteResult
  LD    F6,34+(R2)              1       3          4
  LD    F2,45+(R3)              2       4          5
  MULTD F0,F2,F4                3
  SUBD  F8,F6,F2                4       7          8
  DIVD  F10,F0,F6               5
  ADDD  F6,F8,F2                6      10

Reservation stations:
  Add2  Busy ADDD   Vj=(M-M)  Vk=M(A2)             (Time 0)
  Mult1 Busy MULTD  Vj=M(A2)  Vk=R(F4)             (Time 5)
  Mult2 Busy DIVD   Vk=M(A1)  Qj=Mult1
Register result status (clock 10): F0=Mult1  F2=M(A2)  F6=Add2  F8=(M-M)  F10=Mult2

- Add2 (ADDD) is completing; what is waiting for it?

Page 30: Introduction - cse.msu.edu

The University of Adelaide, School of Computer Science 8 October 2012

Chapter 2 — Instructions: Language of the Computer 30

59 59

Tomasulo Example Cycle 11

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  4   Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
  -   Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status (clock 11):
FU:  F0=Mult1  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2

•  Write result of ADDD here?
•  All quick instructions complete in this cycle!


Tomasulo Example Cycle 12

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  3   Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
  -   Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status (clock 12):
FU:  F0=Mult1  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2


Tomasulo Example Cycle 13

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  2   Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
  -   Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status (clock 13):
FU:  F0=Mult1  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2


Tomasulo Example Cycle 14

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  1   Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
  -   Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status (clock 14):
FU:  F0=Mult1  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2


Tomasulo Example Cycle 15

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3      15
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  0   Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
  -   Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status (clock 15):
FU:  F0=Mult1  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2

•  Mult1 (MULTD) completing; what is waiting for it?


Tomasulo Example Cycle 16

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3      15      16
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  -   Mult1  No
 40   Mult2  Yes   DIVD   M*F4   M(A1)  -      -

Register result status (clock 16):
FU:  F0=M*F4  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2

•  Just waiting for Mult2 (DIVD) to complete


Faster-than-light computation (skipping ahead to cycle 55)


Tomasulo Example Cycle 55

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3      15      16
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  -   Mult1  No
  1   Mult2  Yes   DIVD   M*F4   M(A1)  -      -

Register result status (clock 55):
FU:  F0=M*F4  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2


Tomasulo Example Cycle 56

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3      15      16
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5      56
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  -   Mult1  No
  0   Mult2  Yes   DIVD   M*F4   M(A1)  -      -

Register result status (clock 56):
FU:  F0=M*F4  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Mult2

•  Mult2 (DIVD) is completing; what is waiting for it?


Tomasulo Example Cycle 57

Instruction status:                    Exec    Write
Instruction              Issue   Comp    Result
LD    F6   34+  R2         1       3       4
LD    F2   45+  R3         2       4       5
MULTD F0   F2   F4         3      15      16
SUBD  F8   F6   F2         4       7       8
DIVD  F10  F0   F6         5      56      57
ADDD  F6   F8   F2         6      10      11

Load buffers: Load1 No, Load2 No, Load3 No

Reservation stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
  -   Add1   No
  -   Add2   No
  -   Add3   No
  -   Mult1  No
  -   Mult2  Yes   DIVD   M*F4   M(A1)  -      -

Register result status (clock 57):
FU:  F0=M*F4  F2=M(A2)  F6=(M-M+M)  F8=(M-M)  F10=Result (DIVD broadcasts on the CDB this cycle)

•  Once again: In-order issue, out-of-order execution and out-of-order completion.
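The cycle-by-cycle walk above hinges on the Common Data Bus broadcast: when a functional unit finishes, every reservation station waiting on that unit's tag captures the value in the same cycle. A minimal Python sketch of that mechanism (hypothetical class and field names, not from the slides):

```python
# Sketch only: reservation stations hold either operand values (Vj/Vk)
# or the tags of the producing stations (Qj/Qk), as in the tables above.
class Station:
    def __init__(self, name):
        self.name = name
        self.vj = self.vk = None   # operand values, once available
        self.qj = self.qk = None   # producing-station tags while waiting

    def ready(self):
        # an instruction can execute once neither operand is pending
        return self.qj is None and self.qk is None

def cdb_broadcast(stations, tag, value):
    """One CDB write: every station snooping the bus captures `value`
    if it was waiting on `tag` - all waiters wake simultaneously."""
    for s in stations:
        if s.qj == tag:
            s.vj, s.qj = value, None
        if s.qk == tag:
            s.vk, s.qk = value, None

# two instructions both waiting on Mult1's result (like SUBD/DIVD waiting on loads)
add1 = Station("Add1"); add1.qj = "Mult1"; add1.vk = 3.0
mult2 = Station("Mult2"); mult2.qj = "Mult1"; mult2.vk = 2.0
cdb_broadcast([add1, mult2], "Mult1", 10.0)
```

After the single broadcast, both waiters are ready, which is the "released simultaneously by broadcast on CDB" advantage listed two slides below.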


Why can Tomasulo overlap iterations of loops?
• Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
• Reservation stations
  - Permit instruction issue to advance past integer control-flow operations
  - Also buffer old values of registers, totally avoiding the WAR stall
• Other perspective: Tomasulo builds the data-flow dependency graph on the fly
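The "dynamic loop unrolling" point can be sketched in a few lines: each write to an architectural register grabs a fresh physical destination, so two iterations' writes to the same register never collide. A sketch with a hypothetical trace format, not a full renamer:

```python
# Sketch: rename a linear trace of (op, dest, sources). Every destination
# gets a fresh physical register, so WAW/WAR conflicts between loop
# iterations disappear; only true (RAW) dependences remain.
def rename(trace, n_phys=32):
    free = list(range(n_phys))        # pool of physical registers
    mapping = {}                      # architectural name -> physical number
    out = []
    for op, dst, srcs in trace:
        psrcs = [mapping.get(s, s) for s in srcs]  # read current mappings
        pdst = free.pop(0)            # fresh destination on every write
        mapping[dst] = pdst
        out.append((op, pdst, psrcs))
    return out

# two loop iterations, each writing F0 and F4
trace = [("LD",    "F0", ["R1"]), ("MULTD", "F4", ["F0", "F2"]),   # iter 1
         ("LD",    "F0", ["R1"]), ("MULTD", "F4", ["F0", "F2"])]   # iter 2
renamed = rename(trace)
```

Iteration 2's F0 and F4 land in different physical registers than iteration 1's, so both iterations can be in flight at once.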


Tomasulo's scheme offers two major advantages
1. Distribution of the hazard detection logic
   - Distributed reservation stations and the CDB
   - If multiple instructions are waiting on a single result, and each instruction has its other operands, then the instructions can be released simultaneously by a broadcast on the CDB
   - If a centralized register file were used, the units would have to read their results from the registers when register buses are available
2. Elimination of stalls for WAW and WAR hazards


Tomasulo Drawbacks
• Complexity
  - Complexity delayed the 360/91; MIPS 10000, Alpha 21264, and IBM PPC 620 appeared in CA:AQA 2/e, but not (yet) in silicon!
  - Many associative stores (CDB) at high speed
• Performance limited by the Common Data Bus
  - Each CDB must go to multiple functional units, so high capacitance and high wiring density
  - Number of functional units that can complete per cycle limited to one!
  - Multiple CDBs mean more FU logic for parallel associative stores
• Non-precise interrupts!
  - We will address this later


Example (Branch Prediction)
[Figure not reproduced in extraction]
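The example figure on this slide is not reproduced. As a stand-in, the classic two-bit saturating-counter predictor, the baseline dynamic branch-prediction scheme in this chapter, can be sketched (table size and indexing are illustrative assumptions):

```python
# Sketch: a table of 2-bit saturating counters indexed by low PC bits.
# States 0-1 predict not-taken, 2-3 predict taken; a single anomalous
# outcome does not flip a strongly-biased prediction.
class TwoBitPredictor:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)      # start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2   # True means predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = TwoBitPredictor()
p.update(0x40, True)
p.update(0x40, True)      # branch at 0x40 now strongly taken
```

One not-taken outcome after this leaves the prediction at taken; it takes two consecutive misses to flip it, which is the hysteresis that helps loop branches.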


Hardware-Based Speculation
• Execute instructions along predicted execution paths, but only commit the results if the prediction was correct
• Instruction commit: allowing an instruction to update the register file when the instruction is no longer speculative
• Need an additional piece of hardware to prevent any irrevocable action until an instruction commits
  - i.e., updating state or taking an exception


Reorder Buffer
• Reorder buffer: holds the result of an instruction between completion and commit
• Four fields:
  - Instruction type: branch/store/register
  - Destination field: register number
  - Value field: output value
  - Ready field: completed execution?
• Modify reservation stations: operand source is now the reorder buffer instead of the functional unit


Reorder Buffer
• Register values and memory values are not written until an instruction commits
• On misprediction:
  - Speculated entries in the ROB are cleared
• Exceptions:
  - Not recognized until the instruction is ready to commit
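The two reorder-buffer slides above can be condensed into a sketch, assuming exactly the four fields listed (type, destination, value, ready); the names are illustrative, not from any real design:

```python
# Sketch of a reorder buffer: results drain to architectural state
# strictly in program order, and a misprediction discards everything.
from collections import deque

class ROB:
    def __init__(self):
        self.buf = deque()            # entries in program order

    def allocate(self, itype, dest):
        entry = {"type": itype, "dest": dest, "value": None, "ready": False}
        self.buf.append(entry)
        return entry                  # functional unit fills value/ready later

    def commit(self, regfile):
        """Retire ready entries from the head only; an unready head
        blocks everything behind it (in-order commit)."""
        while self.buf and self.buf[0]["ready"]:
            e = self.buf.popleft()
            if e["type"] == "register":
                regfile[e["dest"]] = e["value"]   # state updated only here

    def flush(self):
        """Branch misprediction: clear all speculated entries."""
        self.buf.clear()
```

Note that a completed-but-younger instruction cannot commit past an older unready one, which is exactly what makes exceptions and mispredictions recoverable.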


Multiple Issue and Static Scheduling
• To achieve CPI < 1, need to complete multiple instructions per clock
• Solutions:
  - Statically scheduled superscalar processors
  - VLIW (very long instruction word) processors
  - Dynamically scheduled superscalar processors


Multiple Issue
[Figure not reproduced in extraction]


VLIW Processors
• Package multiple operations into one instruction
• Example VLIW processor:
  - One integer instruction (or branch)
  - Two independent floating-point operations
  - Two independent memory references
• Must be enough parallelism in code to fill the available slots
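The five-slot bundle shape described above (one integer/branch, two FP, two memory) can be sketched as a greedy slot filler; slot names and op format are illustrative:

```python
# Sketch: pack independent operations into one VLIW instruction word.
# Slots that cannot be filled become explicit no-ops, which illustrates
# the code-size cost of VLIW when parallelism is scarce.
SLOTS = ["int", "fp", "fp", "mem", "mem"]

def bundle(ops):
    """ops: list of (kind, text) pairs assumed independent of each other.
    Returns one instruction word with one entry per slot."""
    word = [None] * len(SLOTS)
    for kind, text in ops:
        for i, slot in enumerate(SLOTS):
            if slot == kind and word[i] is None:
                word[i] = text        # claim the first matching free slot
                break
    return [w if w is not None else "nop" for w in word]

word = bundle([("mem", "LD F0,0(R1)"),
               ("fp",  "ADDD F4,F0,F2"),
               ("int", "DADDIU R1,R1,#8")])
```

With only three independent operations available, two of the five slots are wasted on nops; finding enough parallelism statically is the compiler's whole burden here.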


VLIW Processors
• Disadvantages:
  - Statically finding parallelism
  - Code size
  - No hazard detection hardware
  - Binary code compatibility


Dynamic Scheduling, Multiple Issue, and Speculation
• Modern microarchitectures: dynamic scheduling + multiple issue + speculation
• Two approaches:
  - Assign reservation stations and update the pipeline control table in half clock cycles; only supports 2 instructions/clock
  - Design logic to handle any possible dependencies between the instructions
  - Hybrid approaches
• Issue logic can become a bottleneck


Overview of Design
[Figure not reproduced in extraction]


Multiple Issue
• Limit the number of instructions of a given class that can be issued in a "bundle"
  - e.g., one FP, one integer, one load, one store
• Examine all the dependencies among the instructions in the bundle
• If dependencies exist within the bundle, encode them in the reservation stations
• Also need multiple completion/commit


Example

Loop: LD     R2,0(R1)     ;R2 = array element
      DADDIU R2,R2,#1     ;increment R2
      SD     R2,0(R1)     ;store result
      DADDIU R1,R1,#8     ;increment pointer
      BNE    R2,R3,Loop   ;branch if not last element
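Read literally, the loop increments each element, stores it back, and keeps iterating while the incremented value differs from R3. A Python rendering of that behavior, under the assumption that R3 holds a sentinel value reached by the last element (the comment "branch if not last element" supports this reading, but it is an interpretation):

```python
def run_loop(a, r3):
    """Python rendering of the MIPS loop above. Assumes the sentinel in
    r3 is eventually matched; otherwise the original code would run past
    the array, and this sketch would raise IndexError."""
    i = 0
    while True:
        r2 = a[i] + 1       # LD + DADDIU R2,R2,#1
        a[i] = r2           # SD R2,0(R1)
        i += 1              # DADDIU R1,R1,#8 (pointer bump, one element)
        if r2 == r3:        # BNE falls through when R2 == R3
            break
    return a
```

For example, starting from [1, 2, 3] with r3 = 4, every element is incremented and the loop exits when the last incremented value hits the sentinel.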


Example (No Speculation)
[Figure not reproduced in extraction]


Example
[Figure not reproduced in extraction]


Branch-Target Buffer
• Need high instruction bandwidth!
• Branch-target buffers
  - Next-PC prediction buffer, indexed by the current PC
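A branch-target buffer can be sketched as a small map from fetch PC to predicted next PC; a real BTB is a tagged, set-associative memory consulted during the fetch cycle itself, so this dict-based API is only illustrative:

```python
# Sketch: BTB predicts the next fetch address in the same cycle as the
# fetch, before the instruction is even decoded as a branch.
class BTB:
    def __init__(self):
        self.entries = {}                  # fetch PC -> predicted target

    def predict(self, pc):
        # hit: redirect fetch to the stored target; miss: fall through
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        if taken:
            self.entries[pc] = target      # remember taken branches only
        else:
            self.entries.pop(pc, None)     # stop predicting a not-taken branch
```

Only taken branches are worth storing: a fall-through prediction costs nothing, so entries for not-taken branches are evicted.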


Branch Folding
• Optimization:
  - Larger branch-target buffer
  - Add the target instruction into the buffer to deal with the longer decoding time required by a larger buffer
  - "Branch folding"


Return Address Predictor
• Most unconditional branches come from function returns
• The same procedure can be called from multiple sites
  - Causes the buffer to potentially forget the return address from previous calls
• Create a return-address buffer organized as a stack
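The return-address stack from the slide can be sketched directly: calls push the return PC and return prediction pops it, so nested or repeated call sites do not evict each other the way a single BTB entry for the return instruction would (depth and names are illustrative):

```python
# Sketch: predict return targets with a small hardware stack.
class ReturnAddressStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        # pop the most recent call site; None models a stack underflow
        return self.stack.pop() if self.stack else None
```

Nested calls illustrate why a stack is the right structure: the predictions come back in last-in, first-out order, matching call/return nesting.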


Integrated Instruction Fetch Unit
• Design a monolithic unit that performs:
  - Branch prediction
  - Instruction prefetch (fetch ahead)
  - Instruction memory access and buffering (deal with crossing cache lines)


Register Renaming
• Register renaming vs. reorder buffers
  - Instead of virtual registers from reservation stations and the reorder buffer, create a single register pool containing visible registers and virtual registers
  - Use a hardware-based map to rename registers during issue
  - WAW and WAR hazards are avoided
  - Speculation recovery occurs by copying during commit
  - Still need a ROB-like queue to update the table in order
• Simplifies commit:
  - Record that the mapping between architectural register and physical register is no longer speculative
  - Free up the physical register used to hold the older value
  - In other words: swap physical registers on commit
• Physical register de-allocation is more difficult
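The commit rule above, freeing the physical register that held the *older* value rather than the newly written one, is the subtle part. A sketch with a map table and free list (register counts and names are assumptions for illustration):

```python
# Sketch of merged-register-file renaming: a map table from architectural
# to physical registers plus a free list. Renaming a destination returns
# both the new physical register and the OLD one, which must stay alive
# until the instruction commits (recovery may need the old value).
class Renamer:
    def __init__(self, n_arch=32, n_phys=64):
        self.map = {f"r{i}": i for i in range(n_arch)}   # identity at reset
        self.free = list(range(n_arch, n_phys))          # spare physicals

    def rename_dest(self, arch):
        old = self.map[arch]          # previous mapping, kept until commit
        new = self.free.pop(0)        # fresh physical destination
        self.map[arch] = new
        return new, old

    def commit(self, old_phys):
        # the older value can no longer be needed: recycle its register
        self.free.append(old_phys)
```

On a misprediction before commit, the map entry would be restored to `old` instead, which is why the old physical register cannot be freed at rename time.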


Integrated Issue and Renaming
• Combining instruction issue with register renaming:
  - Issue logic pre-reserves enough physical registers for the bundle (fixed number?)
  - Issue logic finds dependencies within the bundle, maps registers as necessary
  - Issue logic finds dependencies between the current bundle and already in-flight bundles, maps registers as necessary


How Much?
• How much to speculate
  - Mis-speculation degrades performance and power relative to no speculation
  - May cause additional misses (cache, TLB)
  - Prevent speculative code from causing costlier misses (e.g., in L2)
• Speculating through multiple branches
  - Complicates speculation recovery
  - No processor can resolve multiple branches per cycle


Energy Efficiency
• Speculation and energy efficiency
  - Note: speculation is only energy efficient when it significantly improves performance
• Value prediction
  - Uses: loads that load from a constant pool; instructions that produce a value from a small set of values
  - Has not been incorporated into modern processors
  - A similar idea, address aliasing prediction, is used on some processors