dynamic branch prediction and speculation

Page 1: Dynamic Branch Prediction and Speculation

ENGS 116 Lecture 9 1

Dynamic Branch Prediction and Speculation

Vincent H. Berk

October 10, 2005

Reading for today: Chapter 3.2 – 3.6

Reading for Wednesday: Chapter 3.7-3.9, 4.1

Homework #2: due Friday the 14th: 2.8, A.2, A.13, 3.6a&b, 3.10, 4.5, 4.8 (4.13 optional)

Project Proposals due Wednesday!

Page 2: Dynamic Branch Prediction and Speculation

Project Proposals

• 2 pages

• Names and Title

• Introduction to problem domain

• Research question

– goal of your project

• Work plan

– e.g.: 2 weeks programming, 1 week experiments, 1 week writing

• References

– books, websites, articles

Page 3: Dynamic Branch Prediction and Speculation

Dynamic Branch Prediction

• Control dependences limit ILP

• Performance = f(accuracy, cost of misprediction)

• Branches arrive much faster when multiple instructions are issued per clock

– Amdahl’s Law

• Want to predict the outcome of a branch as early as possible

• Methods:

– Branch history table (1 or more bits)

– Correlated branches

– Branch target buffer

Page 4: Dynamic Branch Prediction and Speculation

Dynamic Branch Prediction: A Simple Approach

• Branch History Table (BHT), also known as a Branch Prediction Buffer

– Lower bits of the PC address index a table of 1-bit values

– Each entry says whether or not the branch was taken last time

– No address check, so different branches that share the same low PC bits share an entry

• Problem: in a loop, a 1-bit BHT causes two mispredictions per execution of the loop

– First iteration, the next time the loop is entered: the entry still records the previous exit, so it predicts exit instead of looping

– Last iteration: it predicts looping as before, but the loop exits

[Figure: table indexed by the lower bits of the PC; each entry holds a single bit, T (taken) or NT (not taken).]
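The loop problem described on this slide can be reproduced with a few lines of Python. This is a minimal sketch that models a single 1-bit table entry; the function name, the starting value of the bit, and the 10-iteration loop are assumptions made for illustration.

```python
def one_bit_mispredictions(outcomes, initial=False):
    """Count mispredictions of one 1-bit BHT entry.

    outcomes: sequence of booleans, True = branch taken.
    initial:  assumed starting value of the bit (hypothetical).
    """
    last = initial
    misses = 0
    for taken in outcomes:
        if last != taken:   # the prediction is simply the previous outcome
            misses += 1
        last = taken        # remember what the branch did this time
    return misses


# Loop-closing branch: taken 9 times, then not taken when the loop exits.
one_run = [True] * 9 + [False]
# Execute the whole loop three times in a row.
trace = one_run * 3

# Two mispredictions per run: the first iteration and the exit.
print(one_bit_mispredictions(trace))  # 6
```

Every run of the loop costs two mispredictions, even though the branch is taken 90% of the time.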

Page 5: Dynamic Branch Prediction and Speculation

Dynamic Branch Prediction: A Better Way

Solution: 2-bit scheme where the prediction changes only after two mispredictions in a row.

Helps when target is known before result of condition.

[Figure: state diagram of the 2-bit scheme with two "Predict taken" states and two "Predict not taken" states, connected by "Taken" and "Not taken" arcs.]

Page 6: Dynamic Branch Prediction and Speculation

BHT General Case

• n-bit predictor:

– counter can hold values between 0 and 2^n − 1

– predict taken when the value is greater than or equal to half of the maximum value: 2^(n−1)

– The counter is incremented on each taken branch

– and decremented on each not taken branch
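The n-bit scheme can be sketched as a saturating counter. The sketch below assumes the usual definition (range 0 to 2^n − 1, predict taken at or above 2^(n−1), saturating at both ends) and an initial value of 0; the class and function names are illustrative only.

```python
class SaturatingCounter:
    """n-bit saturating counter for one branch (illustrative sketch)."""

    def __init__(self, n_bits, value=0):
        self.max_value = (1 << n_bits) - 1   # 2^n - 1
        self.threshold = 1 << (n_bits - 1)   # 2^(n-1)
        self.value = value                   # assumed starting state

    def predict(self):
        # Predict taken when the counter is in the upper half of its range.
        return self.value >= self.threshold

    def update(self, taken):
        # Increment on taken, decrement on not taken, never leaving the range.
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


def count_mispredictions(n_bits, outcomes):
    counter = SaturatingCounter(n_bits)
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Same loop-closing branch as before: 9 taken, 1 not taken, run three times.
trace = ([True] * 9 + [False]) * 3
print(count_mispredictions(1, trace))  # 6: two per run
print(count_mispredictions(2, trace))  # 5: three while warming up, then one per run
```

With n = 1 this reduces to the simple 1-bit table. With n = 2 the counter starts cold and mispredicts the first two iterations, after which only the loop exit is mispredicted.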

Page 7: Dynamic Branch Prediction and Speculation

BHT Accuracy

• A misprediction happens because either:

– The guess for that branch was wrong

– The table returned the history of a different branch that maps to the same entry

• 4096-entry table: misprediction rates vary by program from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%

• 4096 entries are about as good as an infinite number of entries

• 2-bit predictors work nearly as well as predictors with more bits

Page 8: Dynamic Branch Prediction and Speculation

Correlating Branches

• Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch

if (d == 0)

    d = 1;

if (d == 1)

Page 9: Dynamic Branch Prediction and Speculation

Correlated Branch Prediction

• Idea: record whether each of the m most recently executed branches was taken or not taken, and use that pattern to select the proper n-bit branch history table

• In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters

– Thus, the old 2-bit BHT is a (0,2) predictor

• Global Branch History: m-bit shift register keeping the T/NT status of the last m branches

• Each entry in the table has 2^m n-bit predictors, one per history pattern

Page 10: Dynamic Branch Prediction and Speculation

Correlating Branches

(2,2) predictor

– Behavior of recent branches selects between four predictions of next branch, updating just that prediction

[Figure: (2,2) predictor. The branch address (4 bits) selects a row of 2-bit per-branch predictors; the 2-bit global branch history selects which of the four predictors in that row supplies the prediction.]
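An (m,n) predictor can be sketched as a table of rows, each holding 2^m n-bit saturating counters, with an m-bit global shift register picking the counter within the row. The code below is a sketch under several assumptions: counters start at 0, the two branches of the `d` example sit at hypothetical table indices 0 and 1, each `if` compiles to a branch that is taken when its condition is false, and `d` alternates between 2 and 0.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history select one of 2**m
    n-bit saturating counters in the row indexed by the branch address."""

    def __init__(self, m, n, index_bits=4):
        self.max_value = (1 << n) - 1
        self.threshold = 1 << (n - 1)
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << m) - 1
        self.history = 0  # global shift register, newest outcome in bit 0
        self.table = [[0] * (1 << m) for _ in range(1 << index_bits)]

    def predict(self, pc):
        row = self.table[pc & self.index_mask]
        return row[self.history] >= self.threshold

    def update(self, pc, taken):
        row = self.table[pc & self.index_mask]
        if taken:
            row[self.history] = min(row[self.history] + 1, self.max_value)
        else:
            row[self.history] = max(row[self.history] - 1, 0)
        # Shift the outcome into the global history, keeping only m bits.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


B1_PC, B2_PC = 0, 1  # hypothetical table indices of the two branches


def run(predictor, n_iterations=20):
    """Replay the d example and count mispredictions."""
    misses = 0
    for i in range(n_iterations):
        d = 2 if i % 2 == 0 else 0
        b1_taken = d != 0        # branch around "d = 1" when d != 0
        if not b1_taken:
            d = 1
        b2_taken = d != 1        # branch around the second body when d != 1
        for pc, taken in ((B1_PC, b1_taken), (B2_PC, b2_taken)):
            if predictor.predict(pc) != taken:
                misses += 1
            predictor.update(pc, taken)
    return misses


print(run(CorrelatingPredictor(0, 2)))  # 20 of 40 branches mispredicted
print(run(CorrelatingPredictor(2, 2)))  # 4, all during warm-up
```

In this trace each branch alternates, so a plain 2-bit counter never settles, while the global history makes every (branch, history) pair perfectly regular.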

Page 11: Dynamic Branch Prediction and Speculation

Accuracy of Different Schemes (from second edition)

[Figure: bar chart of the frequency of mispredictions (axis 0% to 20%) on nasa7, matrix300, doduc, spice, fpppp, gcc, espresso, eqntott, li, and tomcatv for three schemes: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries (2,2).]

Page 12: Dynamic Branch Prediction and Speculation

Tournament Predictors

• Multilevel branch predictor

• Use an n-bit saturating counter to choose between predictors

• The usual choice is between a global and a local predictor
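A tournament predictor can be sketched as two component predictors plus a chooser counter per branch. The sketch below is a simplification, not a model of any real machine: all three structures are tables of 2-bit counters starting at 0, the local predictor is a plain per-branch counter, and the table sizes and names are assumptions.

```python
class TwoBitTable:
    """Table of 2-bit saturating counters, all starting at 0 (assumption)."""

    def __init__(self, entries):
        self.counters = [0] * entries

    def predict(self, index):
        return self.counters[index % len(self.counters)] >= 2

    def update(self, index, up):
        i = index % len(self.counters)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if up else max(c - 1, 0)


class TournamentPredictor:
    def __init__(self, entries=16, history_bits=4):
        self.local = TwoBitTable(entries)              # indexed by branch address
        self.global_ = TwoBitTable(1 << history_bits)  # indexed by global history
        self.chooser = TwoBitTable(entries)            # >= 2 means "use global"
        self.history = 0
        self.history_mask = (1 << history_bits) - 1

    def predict(self, pc):
        if self.chooser.predict(pc):
            return self.global_.predict(self.history)
        return self.local.predict(pc)

    def update(self, pc, taken):
        local_ok = self.local.predict(pc) == taken
        global_ok = self.global_.predict(self.history) == taken
        if local_ok != global_ok:
            # Move the chooser toward whichever predictor was right.
            self.chooser.update(pc, global_ok)
        self.local.update(pc, taken)
        self.global_.update(self.history, taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def run(predictor, pc, outcomes):
    """Return one flag per branch: True where the prediction was wrong."""
    flags = []
    for taken in outcomes:
        flags.append(predictor.predict(pc) != taken)
        predictor.update(pc, taken)
    return flags


# A branch that strictly alternates taken / not taken.
outcomes = [i % 2 == 0 for i in range(40)]
flags = run(TournamentPredictor(), 0x40, outcomes)
print(sum(flags))  # 6 mispredictions, all before the chooser switches to global
```

On this alternating branch the local counter alone is wrong on every taken outcome, so the chooser migrates to the global predictor and the mispredictions stop.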

Page 13: Dynamic Branch Prediction and Speculation

Tournament Predictors: DEC Alpha 21264

Tournament predictor using 4K 2-bit counters indexed by local branch address. Chooses between:

• Global predictor

– 4K entries indexed by the history of the last 12 branches (2^12 = 4K)

– Each entry is a standard 2-bit predictor

• Local predictor

– Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address

– The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters
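As a quick check of the sizes listed on this slide, the total predictor storage can be added up. This is plain arithmetic on the numbers given above; the variable names are illustrative.

```python
K = 1024

chooser_bits = 4 * K * 2         # 4K 2-bit counters that pick global vs. local
global_bits = 4 * K * 2          # 4K entries, each a 2-bit predictor
local_history_bits = 1 * K * 10  # 1024 entries of 10-bit branch history
local_counter_bits = 1 * K * 3   # 1K 3-bit saturating counters

total_bits = (chooser_bits + global_bits
              + local_history_bits + local_counter_bits)
print(total_bits, total_bits // K)  # 29696 bits, i.e. 29K bits
```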

Page 14: Dynamic Branch Prediction and Speculation

Branch Target Buffers

• Branch target calculation is costly and stalls the instruction fetch.

• A branch target buffer (BTB) stores branch PCs the same way a cache stores addresses

• The PC of a branch is sent to the BTB

• When a match is found, the corresponding predicted PC is returned

• If the branch was predicted taken, instruction fetch continues at the returned predicted PC

Page 15: Dynamic Branch Prediction and Speculation

Branch Target Buffers

Page 16: Dynamic Branch Prediction and Speculation

Figure 3.20 The steps involved in handling an instruction with a branch-target buffer

• IF: send PC to memory and branch-target buffer; is an entry found in the branch-target buffer?

• ID, no entry found: is the instruction a taken branch?

– No: normal instruction execution

– Yes: enter branch instruction PC and next PC into branch target buffer (EX)

• ID, entry found: send out predicted PC; branch taken?

– No: mispredicted branch, kill fetched instruction; restart fetch at other target; delete entry from target buffer (EX)

– Yes: branch correctly predicted; continue execution with no stalls (EX)
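The steps in Figure 3.20 can be sketched in code. The sketch below assumes an unbounded, fully associative buffer modeled as a Python dict, 4-byte instructions, and method names (`fetch`, `resolve`) invented for this illustration.

```python
class BranchTargetBuffer:
    """Toy branch-target buffer: maps a branch PC to its predicted target."""

    def __init__(self):
        self.entries = {}  # branch PC -> predicted next PC

    def fetch(self, pc, fall_through):
        """IF stage: on a hit fetch from the predicted PC, otherwise fall through."""
        return self.entries.get(pc, fall_through)

    def resolve(self, pc, taken, target):
        """Once the outcome is known, update the buffer.

        Returns True if the PC chosen at fetch time was correct.
        """
        if pc in self.entries:
            correct = taken and self.entries[pc] == target
            if not taken:
                del self.entries[pc]       # mispredicted: delete the entry
            elif not correct:
                self.entries[pc] = target  # taken, but to a different target
            return correct
        if taken:
            self.entries[pc] = target      # taken branch seen for the first time
            return False                   # fetch had fallen through
        return True                        # not a taken branch: normal execution


btb = BranchTargetBuffer()
print(hex(btb.fetch(0x100, 0x104)))     # 0x104: no entry yet
print(btb.resolve(0x100, True, 0x200))  # False: taken branch, now entered
print(hex(btb.fetch(0x100, 0x104)))     # 0x200: predicted target
print(btb.resolve(0x100, True, 0x200))  # True: correctly predicted
```

A real BTB has a fixed number of entries and compares tags on lookup; the dict hides both details.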

Page 17: Dynamic Branch Prediction and Speculation

Multiple Issue Machines

• Superscalar: multiple parallel dedicated pipelines:

– Varying number of instructions per cycle, scheduled by compiler and/or by hardware (Tomasulo)

– IBM PowerPC, Sun UltraSparc, DEC Alpha, IA32 Pentium

• VLIW (Very Long Instruction Word): multiple operations encoded in instruction:

– Instructions have wide template (4-16 operations)

– IA-64 Itanium

Page 18: Dynamic Branch Prediction and Speculation

Getting CPI < 1: Issuing Multiple Instructions/Cycle

• Superscalar DLX: 2 instructions, 1 FP & 1 anything else

– Fetch 64 bits/clock cycle; integer on left, FP on right

– Can only issue 2nd instruction if 1st instruction issues

– More ports for FP registers to do FP load & FP op in a pair

Type              Pipe stages
Int. instruction  IF   ID   EX   MEM  WB
FP instruction    IF   ID   EX   MEM  WB
Int. instruction       IF   ID   EX   MEM  WB
FP instruction         IF   ID   EX   MEM  WB
Int. instruction            IF   ID   EX   MEM  WB
FP instruction              IF   ID   EX   MEM  WB

• 1 cycle load delay expands to 3 instructions in superscalar DLX

– Instruction in right half can’t use it, nor instructions in next slot

Page 19: Dynamic Branch Prediction and Speculation

Multiple Issue Challenges

• While the integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:

– Exactly 50% FP operations

– No hazards

• If more instructions issue at the same time, decode and issue become more difficult

– Even a 2-way superscalar must examine 2 opcodes and 6 register specifiers, and decide if 1 or 2 instructions can issue

• VLIW: trade off instruction space for simple decoding

– The long instruction word has room for many operations

– By definition, all the operations the compiler puts in the long instruction word are independent, so they execute in parallel

– E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch; at 16 to 24 bits per field, that is 7*16 = 112 bits to 7*24 = 168 bits wide

– Need compiling technique that schedules across several branches

Page 20: Dynamic Branch Prediction and Speculation

Limits to Multi-Issue Machines

• Inherent limitations of instruction-level parallelism

– 1 branch in 5: How to keep a 5-way VLIW busy?

– Latencies of units: many operations must be scheduled

– Easy: More instruction bandwidth

– Easy: Duplicate functional units to get parallel execution

– Hard: Increase ports to register file (bandwidth); the VLIW example needs 7 reads and 3 writes for integer registers & 5 reads and 3 writes for FP registers

– Harder: Increase ports to memory (bandwidth)

– Decoding superscalar and impact on clock rate, pipeline depth?

Page 21: Dynamic Branch Prediction and Speculation

Hardware-Based Speculation

• Instead of only fetching and decoding instructions, also execute them based on the branch prediction.

• Execute instructions out of order as soon as their operands are available.

• Hold instruction commit until the branch is decided.

• Re-order instructions after execution and commit them in order

– reorder buffer or ROB

– register file not updated until commit

• Do not raise exceptions until instruction is committed

• ROB holds and provides operands until commit.

Page 22: Dynamic Branch Prediction and Speculation

Tomasulo with Speculation

1. Issue – Requires an empty reservation station and an empty ROB slot. Send operands to the reservation station from the register file or from the ROB. This stage is often referred to as dispatch.

2. Execute – Monitor the common data bus (CDB) for operands, check RAW hazards. When both operands are available, execute.

3. Write Result – When the result is available, write it on the CDB through to the ROB and any waiting reservation stations. Stores write to the value field in the ROB.

4. Commit – Three cases:

• Normal commit: write registers, committing in order

• Store: update memory

• Incorrect branch: flush the ROB and reservation stations, and restart execution at the correct PC
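The three commit cases can be sketched with a toy reorder buffer. This models only in-order commit and the flush on a mispredicted branch; reservation stations, the CDB, and operand forwarding are left out, and the entry fields and method names are assumptions for illustration.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: instructions finish in any order but commit in program order."""

    def __init__(self):
        self.entries = deque()  # program order, oldest at the left
        self.registers = {}     # architectural register file
        self.memory = {}

    def issue(self, kind, dest=None):
        entry = {"kind": kind, "dest": dest, "value": None,
                 "done": False, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        entry["value"] = value
        entry["done"] = True
        entry["mispredicted"] = mispredicted

    def commit(self):
        """Retire finished instructions from the head; return how many retired."""
        committed = 0
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            committed += 1
            if head["kind"] == "branch":
                if head["mispredicted"]:
                    self.entries.clear()  # flush all younger, speculative work
                    break
            elif head["kind"] == "store":
                self.memory[head["dest"]] = head["value"]     # update memory
            else:
                self.registers[head["dest"]] = head["value"]  # write register
        return committed


rob = ReorderBuffer()
add1 = rob.issue("alu", "r1")
br = rob.issue("branch")
add2 = rob.issue("alu", "r2")   # speculative: issued past the branch

rob.write_result(add2, 99)      # finishes first, out of order
print(rob.commit())             # 0: the head (add1) is not done yet
rob.write_result(add1, 7)
print(rob.commit())             # 1: add1 commits, the branch is still pending
rob.write_result(br, mispredicted=True)
print(rob.commit())             # 1: the branch commits and add2 is flushed
print(rob.registers)            # {'r1': 7}
```

The speculative result 99 never reaches the register file, which is the point of delaying the update until commit.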

Page 24: Dynamic Branch Prediction and Speculation

Problems with speculation

• Multi Issue Machines:

– Must be able to commit multiple instructions from ROB

– More registers, more renaming

• How much speculation:

– How many branches deep?

– What to do on a cache miss?

– TLB miss?

– Cache interference due to incorrect branch prediction

Page 25: Dynamic Branch Prediction and Speculation

Figure 3.41: Number of registers available for renaming.

Page 26: Dynamic Branch Prediction and Speculation

Figure 3.45: Window size, the number of instructions the issue unit may look ahead and schedule from.

Page 27: Dynamic Branch Prediction and Speculation

HW Support for More ILP

• Avoid branch prediction by turning branches into conditionally executed instructions:

if (X) then A = B op C else NOP

– If false, then neither store result nor cause exception

– Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction.

– IA-64: 64 1-bit condition fields, selected so that any instruction can be conditionally executed

• Drawbacks to conditional instructions

– Still takes a clock even if “annulled”

– Stall if condition evaluated late

– Complex conditions reduce effectiveness; condition becomes known late in pipeline

[Figure: condition X guarding the operation A = B op C]
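The idea of replacing a branch with a conditionally executed instruction can be illustrated in software. This is only an analogy, assuming "op" is addition: the second function always computes B op C and then selects the result without branching, much as a conditional move would.

```python
def branchy(x, a, b, c):
    """if (X) then A = B op C, written with a branch."""
    if x:
        a = b + c
    return a


def predicated(x, a, b, c):
    """Same effect without a branch on X."""
    result = b + c    # always computed, even when the result is discarded
    p = int(bool(x))  # 1-bit predicate
    return p * result + (1 - p) * a   # select the new or the old value of A


for x in (False, True):
    print(x, branchy(x, 1, 2, 3), predicated(x, 1, 2, 3))
```

The unconditional `b + c` mirrors the first drawback listed above: the operation still takes a clock even when it is annulled.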