eece476: computer architecture lecture 20: branch prediction chapter 6.6 + extra the university of...

EECE476: Computer Architecture

Lecture 20: Branch Prediction

Chapter 6.6 + extra

The University of British Columbia EECE 476 © 2005 Guy Lemieux


Control Hazards Summary

• We reduced branch/jump penalty to 1 cycle
• Still have 2 remaining problems

• Utilization problem
  – We may fetch the wrong instruction(s) after branch/jump
    • Option 1: stall after every branch/jump
    • Option 2: nullify-if-branch-taken (small performance improvement)
    • Option 3: declare as a “delay slot”, always-execute (avoid)
    • Option 4: new strategies?

• Forwarding problem
  – We may depend on result of instruction(s) just before branch
    • Option 1: stall when dependence detected (HDU)
    • Option 2: forward when dependence detected (FDU)


New Strategy: Nullify-if-Not-Taken

• Previously: Nullify-if-taken
  – Instruction after branch (PC+4) “sneaks” into pipeline
  – Nullify if branch is taken (T)

• Observation
  – Branch has 2 outcomes: taken, not-taken (T or NT)

• What about nullify-if-NT?
  – This is another valid strategy
  – We “sneak” instruction from PC+4+OFFSET

• Differences?
  – Can we predict which outcome is more likely? T or NT?
  – If so, we can “sneak” the right instruction into the pipeline
  – Reduces frequency of nullify operations


Nullify-if-T vs. Nullify-if-NT

• These are 2 forms of static branch prediction:
  – Nullify-if-T: always predict NT is likely
  – Nullify-if-NT: always predict T is likely

• Main Idea
  – Predict target where branch is going (T or NT)
  – Put useful (target) instructions in pipeline after branch
  – Nullify only if we predict wrong

• Performance impact?
  – More accurate predictions → better performance


Implementing Static Branch Prediction

• Simplest static branch prediction
  – Predict backward branches T, forward branches NT
    • Requires 0 instruction bits

• More sophisticated static branch prediction
  – Define two instruction types: BEQ-likely, BEQ-unlikely
  – For each individual branch, compiler decides if branch is likely or unlikely
    • Requires 1 instruction bit in ISA to encode “likely-vs-unlikely” into each branch
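
The two static schemes above can be sketched in a few lines. This is only an illustration: the function names, the signed `branch_offset` argument, and the `likely_bit` encoding (1 = likely taken) are assumptions made for the example, not part of any particular ISA.

```python
def predict_static(branch_offset: int) -> bool:
    """Backward-taken / forward-not-taken rule.

    branch_offset is a hypothetical signed distance from the branch to its
    target; a negative value means the target is earlier in the program
    (typically the bottom of a loop jumping back to the top).
    Returns True to predict taken (T), False to predict not taken (NT).
    """
    return branch_offset < 0


def predict_with_hint(likely_bit: int) -> bool:
    """Compiler-hint rule: one bit in the branch encoding picks T or NT.

    The encoding 1 = likely taken, 0 = unlikely is assumed for this sketch.
    """
    return likely_bit == 1


print(predict_static(-16))   # backward branch, predicted taken
print(predict_static(+24))   # forward branch, predicted not taken
```

The first rule needs no help from the compiler; the second moves the decision to compile time at the cost of one encoding bit per branch.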


Reducing Branch Pipeline Penalty

Static Methods Summary

1. Always stall
  – Works well, wastes CPU cycles.

2. Always execute (delayed branch)
  – Requires useful instruction to be scheduled by compiler

3. Nullify-if-taken (always predicts branch is NT)
  – Fetch from PC+4, PC+8, etc
  – Half of branch-forward instructions are NT
  – Some performance benefit

4. Nullify-if-not-taken (always predicts branch is T)
  – Fetch from PC+4+OFFSET, PC+8+OFFSET, etc
  – Almost all branch-backward instructions are T
  – Big performance benefit


Reducing Branch Pipeline Penalty

Dynamic Method

5. Nullify-if-mispredicted
  • Dynamically predict T or NT
  • To do this…
    – Need branch prediction
    – Predict direction based upon recent history
    – Must fetch from predicted direction (target address)
  • Note: no correctness problems arise if we mispredict (only performance)
  • Performance impact?
    – Depends on “prediction accuracy”
    – Want >= 80% to be useful

Somehow, must implement in ISA
  – ISA may adopt one or more of above policies for branch instructions
  – ISA may also adopt multiple policies (eg, multiple versions of same branch instruction)
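
To see how prediction accuracy feeds into performance, here is a rough back-of-the-envelope model. It assumes a base CPI of 1, a 1-cycle penalty per mispredicted branch (the penalty quoted earlier in the lecture), and a branch frequency of 20%, which is a made-up figure for illustration only. The function name and parameters are hypothetical.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  accuracy: float, penalty_cycles: float = 1.0) -> float:
    """Average cycles per instruction when only mispredicted branches pay a penalty."""
    mispredict_rate = 1.0 - accuracy
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: 20% of instructions are branches (illustrative number only).
for accuracy in (0.5, 0.8, 0.9):
    cpi = effective_cpi(base_cpi=1.0, branch_fraction=0.2, accuracy=accuracy)
    print(f"accuracy {accuracy:.0%}: CPI = {cpi:.3f}")
```

Under these assumptions the cost of branches scales directly with the misprediction rate, so each gain in accuracy removes a proportional share of the wasted cycles.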


Dynamic Branch Prediction

• Dynamic: predicted branch direction depends upon recent history
  – No history? Must guess
  – Execute same branch many times → history

Need state information to retain history


Overview of Dynamic Branch Prediction Schemes

• Many Types of Dynamic Branch Predictors
  – Basic
    • 1-bit predictor
    • 2-bit predictor (very good)
  – Generalization
    • N-bit saturating counter (not very good)
  – Hybrid/advanced (excellent)
    • Correlating predictors
    • Multilevel predictors
  – Perfect (prescient) predictor
    • Non-causal, only works in simulation
    • Used to measure effectiveness of other prediction schemes


Dynamic 1-bit Branch Prediction

Basic scheme

• 1-bit predictor
  – Remembers most recent execution of branch
    • Was it taken or not taken?
  – Assume same outcome next time
  – Where to store 1 bit?
    • In the instruction encoding?
    • 1 global bit (DFF) in the CPU?
    • Visit this again later…



• 1-bit Predictor Example

    A = 0                          * initialize registers
    Loop:
      A = A + 1                    Loop: ADD $1,$1,$2
      If A != 10 goto Loop               BNE $1,$3, Loop

• Prediction Accuracy?

                        Prediction   Outcome   Prediction Correct?
    Middle iterations   T            T         8 correct
    Last iteration      T            NT        1 wrong
    First iteration     NT           T         1 wrong

• Last iteration NT, so next time, first iteration assumes NT
• Result: 80% accuracy (20% mispredictions)
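
The 80% figure can be checked with a small simulation. This is a sketch, not a model of any real pipeline: the function name is made up, and the starting state (NT) assumes the loop has already run once before, as the slide does.

```python
def run_one_bit(outcomes, last_outcome):
    """Predict each branch as a repeat of the previous outcome.

    Returns (number of correct predictions, final stored bit).
    """
    hits = 0
    for taken in outcomes:
        if last_outcome == taken:
            hits += 1
        last_outcome = taken  # the single history bit remembers only this outcome
    return hits, last_outcome


# One pass through the 10-iteration loop: taken 9 times, then falls through once.
loop_outcomes = [True] * 9 + [False]

# Start from NT, the state left behind by a previous pass through the same loop.
hits, state = run_one_bit(loop_outcomes, last_outcome=False)
accuracy = hits / len(loop_outcomes)
print(f"1-bit accuracy: {accuracy:.0%}")  # prints "1-bit accuracy: 80%"
```

The two misses are the first iteration (stale NT from the previous exit) and the last iteration (the exit itself).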



Two basic schemes

• Simple: 2-bit “saturating counter” predictor
  – Remember two most recent outcomes?
    • History (prev,curr)
      – (T,T) Predicts T
      – (NT,NT) Predicts NT
      – (T,NT) Predicts ?
      – (NT,T) Predicts ?
  – Although a possibility, this scheme is not usually used

• Better: 2-bit “sequence” predictor
  – Mispredict twice before changing prediction


Dynamic 2-bit Sequence Prediction

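• Saturating
  – Repeating T stays in ‘11’ state
  – Repeating NT stays in ‘00’ state

• Two-in-a-row to change prediction
  – (T,NT) won’t change prediction
  – (NT,T) won’t change prediction

[Figure: state diagram with four states: 11 (predict T), 10 (predict T), 01 (predict NT), 00 (predict NT); each state has one outgoing edge labeled T and one labeled NT]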
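
One possible encoding of such a state machine is sketched below. The transition table is an assumption: it follows a common textbook arrangement in which two mispredictions in a row flip the prediction, and it may differ in detail from the diagram on this slide. The names `TRANSITIONS`, `predict`, and `update` are hypothetical.

```python
# Assumed transition table: state -> (next state on T, next state on NT).
# States '11' and '10' predict T; states '01' and '00' predict NT.
TRANSITIONS = {
    "11": ("11", "10"),  # saturated T: repeated T stays here
    "10": ("11", "00"),  # one miss so far; a second NT flips the prediction
    "01": ("11", "00"),  # one miss so far; a second T flips the prediction
    "00": ("01", "00"),  # saturated NT: repeated NT stays here
}


def predict(state: str) -> bool:
    """The high bit of the state is the prediction (1 = taken)."""
    return state[0] == "1"


def update(state: str, taken: bool) -> str:
    """Advance the state machine with the actual branch outcome."""
    on_taken, on_not_taken = TRANSITIONS[state]
    return on_taken if taken else on_not_taken


state = "11"
for outcome in (True, False):  # the (T, NT) sequence from the slide
    state = update(state, outcome)
print(state, predict(state))   # state is '10', prediction is still T
```

A single surprise moves the machine to a "weak" state without changing the prediction; only a second consecutive surprise crosses over.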


Dynamic 2-bit Prediction Example

• 2-bit Predictor Example

    A = 0                          * initialize registers
    Loop:
      A = A + 1                    Loop: ADD $1,$1,$2
      If A != 10 goto Loop               BNE $1,$3, Loop

• Prediction Accuracy?

                        Prediction   Outcome   Prediction Correct?
    Middle iterations   T            T         8 correct
    Last iteration      T            NT        1 wrong, but next prediction still T
    First iteration     T            T         1 correct

• Last iteration is 1st mispredict, so next time, 1st iteration still predicts T
• Result: 90% accuracy (10% mispredictions)
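
The 90% result can be reproduced over several passes through the loop. The sketch below uses a 2-bit up/down counter (values 2 and 3 predict taken) as one possible realization of a 2-bit predictor; that choice, the function name, and the number of passes are assumptions for illustration. On this particular loop any scheme that needs two misses to flip behaves the same way.

```python
def run_two_bit(outcomes, counter):
    """Run a 2-bit counter (0..3, predict taken when >= 2) over branch outcomes.

    Returns (number of correct predictions, final counter value).
    """
    hits = 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            hits += 1
        # Saturate at 0 and 3 so a long streak cannot overflow the 2 bits.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return hits, counter


loop_outcomes = [True] * 9 + [False]  # taken 9 times, then the loop exits
counter = 2        # state left by an earlier pass: saturated at 3, then one NT
passes = 5         # arbitrary number of repetitions of the whole loop
total_hits = 0
for _ in range(passes):
    hits, counter = run_two_bit(loop_outcomes, counter)
    total_hits += hits

accuracy = total_hits / (passes * len(loop_outcomes))
print(f"2-bit accuracy: {accuracy:.0%}")  # prints "2-bit accuracy: 90%"
```

The only miss per pass is the loop exit; the single NT is not enough to change the prediction for the next first iteration.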


Dynamic 2-bit Prediction Results

• Effectiveness?
• Mispredictions in SPEC89 with 4096-entry branch prediction table:
  – Nasa7: 1%
  – Matrix300: 0%
  – Tomcatv: 1%
  – Doduc: 5%
  – Spice: 9%
  – FPPPP: 9%
  – Gcc: 12%
  – Espresso: 5%
  – Eqntott: 18%
  – Li: 10%

• About 90% effective!


Dynamic 2-bit Prediction Results

• Mispredictions in SPEC89 with N-entry branch prediction table:

                    N=4096   N=Infinity
    – Nasa7:        1%       0%
    – Matrix300:    0%       0%
    – Tomcatv:      1%       0%
    – Doduc:        5%       5%
    – Spice:        9%       9%
    – FPPPP:        9%       9%
    – Gcc:          12%      11%
    – Espresso:     5%       5%
    – Eqntott:      18%      18%
    – Li:           10%      10%

• Still about 90% effective!


Dynamic N-bit Prediction Scheme

• We can try to generalize the 2-bit approach

• N-bit “saturating counter” predictor
  – Increment on taken branch
  – Decrement on untaken branch
  – Predictions
    • Counter value >= (2^N)/2, predict T
    • Counter value < (2^N)/2, predict NT

• N-bit “sequence” predictor
  – X-mispredicts-in-a-row to change
  – How big is X (relative to N)? Possible?

• Effectiveness?
  – Not very… 2-bit predictors good enough!
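
The N-bit saturating counter rules above translate almost directly into code. This is a minimal sketch; the class name and its fields are chosen for the example, and the initial value of 0 is arbitrary.

```python
class SaturatingCounter:
    """N-bit up/down counter that predicts taken in the upper half of its range."""

    def __init__(self, n_bits: int, value: int = 0):
        self.max_value = (1 << n_bits) - 1    # largest value that fits in N bits
        self.threshold = (1 << n_bits) // 2   # (2^N)/2 from the slide
        self.value = value

    def predict(self) -> bool:
        return self.value >= self.threshold

    def update(self, taken: bool) -> None:
        if taken:
            self.value = min(self.value + 1, self.max_value)  # saturate at the top
        else:
            self.value = max(self.value - 1, 0)               # saturate at the bottom


counter = SaturatingCounter(n_bits=3)
for _ in range(10):
    counter.update(True)       # a long taken streak saturates at 7
for _ in range(3):
    counter.update(False)      # three NT outcomes bring it down to 4
print(counter.value, counter.predict())  # still predicts taken
```

With more bits, a saturated counter needs more contrary outcomes before its prediction flips, which makes it slower to adapt when a branch really does change behavior.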


N-bit “Saturating Counter” Predictor

[Figure: state diagram of a 3-bit saturating counter with states 000, 001, 010, 011 (predict NT) and 100, 101, 110, 111 (predict T); edges between states are labeled T and NT]


Storing Branch History

• Where? In instruction memory?
  – Must write 1 or 2 bits into instruction, not good!

• Use special branch prediction table memory
  – Eg, 4096 entries of 2 bits each
    • Not enough for one entry per branch instruction in your program
      – Or is it?
  – Which entry goes with which branch?
    • Use lower bits of program counter (hash function)
    • Some branches will use the *same* table entry
    • Is this incorrect? No!
      – Some branches will be predicted with less accuracy…ie, slower program execution
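
A table indexed by the low bits of the PC might look like the sketch below. Several details are assumptions: instructions are taken to be 4 bytes and word-aligned (so the two lowest PC bits are dropped before indexing), each entry is a 2-bit up/down counter, and the initial counter value is arbitrary. All names are hypothetical.

```python
TABLE_ENTRIES = 4096                 # table size from the slide
table = [2] * TABLE_ENTRIES          # 2-bit counters, arbitrary initial value


def table_index(pc: int) -> int:
    """Hash a branch address to a table entry using its low-order bits.

    Assumes 4-byte aligned instructions, so bits 0-1 carry no information.
    """
    return (pc >> 2) % TABLE_ENTRIES


def predict(pc: int) -> bool:
    return table[table_index(pc)] >= 2


def update(pc: int, taken: bool) -> None:
    i = table_index(pc)
    table[i] = min(table[i] + 1, 3) if taken else max(table[i] - 1, 0)


branch_a = 0x1000
branch_b = branch_a + 4 * TABLE_ENTRIES   # a different branch, same low bits
print(table_index(branch_a) == table_index(branch_b))  # both share one entry

update(branch_a, False)
update(branch_a, False)
print(predict(branch_b))  # branch_b now inherits branch_a's NT history
```

Because there is no tag check, aliasing branches silently share history. The result can be a wrong prediction, never a wrong program.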


Advanced Branch Prediction 1

• Correlating Predictors
  – Create 8 branch prediction tables
    • Each table may contain ~1024 entries, 2-bits of history each entry
    • Each table is “local history”
  – 3 global bits in CPU form “global history”
    • Simple, small shift register
    • Stores outcome of 3 most recently executed branches (of all branches)
  – Key idea
    • “global history” determines which branch prediction table to use
    • “local history” works like “2-bit predictor”
  – Called a (3,2) branch-prediction buffer
    • Regular 2-bit predictor is a (0,2) predictor
  – Works better than (0,2) predictor
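
A (3,2) buffer as described above can be sketched as follows. The class and method names are made up, the per-entry predictor is assumed to be a 2-bit up/down counter, and the PC is assumed to be word-aligned for indexing; none of these details come from the slide.

```python
class CorrelatingPredictor:
    """(m, 2) predictor: m global history bits choose one of 2^m tables."""

    def __init__(self, history_bits: int = 3, entries: int = 1024):
        self.history_bits = history_bits
        self.entries = entries
        self.global_history = 0  # shift register of the most recent outcomes
        # One table of 2-bit counters per global-history pattern (8 when m = 3).
        self.tables = [[1] * entries for _ in range(1 << history_bits)]

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # assumes 4-byte aligned instructions

    def predict(self, pc: int) -> bool:
        return self.tables[self.global_history][self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        table = self.tables[self.global_history]
        i = self._index(pc)
        table[i] = min(table[i] + 1, 3) if taken else max(table[i] - 1, 0)
        # Shift the newest outcome into the global history, dropping the oldest.
        mask = (1 << self.history_bits) - 1
        self.global_history = ((self.global_history << 1) | int(taken)) & mask


predictor = CorrelatingPredictor()
pc = 0x400
correct = 0
for i in range(60):
    taken = (i % 2 == 0)             # a branch that strictly alternates T, NT
    if i >= 40:                      # score only after a warm-up period
        correct += predictor.predict(pc) == taken
    predictor.update(pc, taken)
print(correct, "of 20 correct after warm-up")
```

An alternating branch defeats a plain 2-bit predictor, but here each history pattern gets its own counter, so the pattern itself becomes predictable.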


Advanced Branch Prediction 2

• Multilevel Branch Prediction
  – Eg, Tournament Predictors:
    • Use 2 different branch predictors per entry
    • Choose the best between them
  – How to decide which is best?
    • Use a third 2-bit predictor
      – Like any 2-bit predictor, eg “sequence”
      – This one says “use predictor 1” or “use predictor 2”
    • Change if current predictor is wrong (but other one was right) twice in a row
  – Works better than Correlating Predictor
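
The selection logic of a tournament predictor can be sketched on its own, leaving the two component predictors abstract. The `Chooser` class below is hypothetical and uses a 2-bit up/down counter as the selector, which is an assumption; it approximates the "wrong twice in a row" rule when starting from a saturated state.

```python
class Chooser:
    """2-bit selector between two predictions for the same branch."""

    def __init__(self):
        self.counter = 0  # 0 or 1: trust predictor 1; 2 or 3: trust predictor 2

    def select(self, prediction_1: bool, prediction_2: bool) -> bool:
        return prediction_2 if self.counter >= 2 else prediction_1

    def update(self, prediction_1: bool, prediction_2: bool, taken: bool) -> None:
        right_1 = prediction_1 == taken
        right_2 = prediction_2 == taken
        # Only move when exactly one predictor was right; ties teach nothing.
        if right_2 and not right_1:
            self.counter = min(self.counter + 1, 3)
        elif right_1 and not right_2:
            self.counter = max(self.counter - 1, 0)


chooser = Chooser()
# Predictor 1 says T, predictor 2 says NT, and the branch is actually NT, twice.
for _ in range(2):
    chooser.update(prediction_1=True, prediction_2=False, taken=False)
print(chooser.select(True, False))  # now follows predictor 2
```

In a full design each table entry would hold such a selector alongside the two component predictors, all updated after the branch resolves.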


Predictors Summary

• Static
  – Stall
  – Always execute (delay slots)
  – Nullify-if-T (Execute-if-NT)
  – Nullify-if-NT (Execute-if-T)

• Dynamic
  – Nullify-if-mispredicted
    • 2-bit, N-bit “saturating counter” predictor
    • 2-bit “sequence” predictor (N-bit possible?)
    • Correlating predictor
      – Concept of global / local history
    • Multilevel predictor
      – Eg, Tournament predictor
