lesson 2 branch prediction

7/24/2019 Lesson 2 Branch Prediction

1/76

Branch Prediction Techniques

Prof. Cristina Silvano

Politecnico di Milanoemail: [email protected]

Course on: Advanced Computer Architectures


2/76

Prof. Cristina Silvano Politecnico di Milano - 2 -

Outline

Dealing with Branches in the Processor Pipeline

The Problem of Branch Hazards Branch Prediction Techniques

Static Branch Prediction

Dynamic Branch Prediction

Performance of Branch Schemes


3/76


Conditional Branch Instruction: the branch is taken onlyif the condition is satisfied.

The branch target address is stored in the ProgramCounter (PC) instead of the address of the next instructionin the sequential instruction stream.

Examples of branches for MIPS processor:beq(branch on equal) andbne (branch on not equal)

beq $s1, $s2, L1 # go to L1 if ($s1 == $s2)

bne $s1, $s2, L1 # go to L1 if ($s1 != $s2)

Conditional Branch Instructions


4/76


Conditional branches for MIPS processor:

I (Immediate) Formatop rs rt

6 bit 5 bit 5 bit 16 bit

address

For conditional branches, the instruction fields are:

op (opcode) identifies the instruction type;

rs first register to compare; rt second register to compare;

address (16-bit) indicates the word offset relative to thePC (PC-relative word address)

The offset corresponding to the L1 label (Branch TargetAddress) is relative to the Program Counter (PC): (L1- PC) /4

beq $s1, $s2, L1


5/76


Execution of conditional branches for 5-stage MIPS

pipeline

Instruction fetch and PC increment

Registers read ($x and $y) from Register File. ALU operation to compare registers ($x and $y) to derive Branch

Outcome (branch taken or branch not taken).

Computation of Branch Target Address (PC+4+offset): the value

(PC+4) is added to the least significant 16 bit of the instruction aftersign extension

The result of registers comparison from ALU (Branch Outcome) is usedto decide the value to be stored in the PC: (PC+4) or (PC+4+offset).

beq $x,$y,offset

Register Read$xe $y

Instr. Fetch& PC Increm.

ALU Op. ($x-$y)& (PC+4+offset)

Write ofPC


6/76


Execution of conditional branches for 5-stage MIPS

pipelineID

Instruction DecodeIF

Instruction FetchEX

ExecutionME

Memory AccessWB

Write Back

Branch Outcome and Branch Target Address are ready atthe end of the EX stage (3th stage)

Conditional branches are solved when PC is updated atthe end of the ME stage (4th stage)

beq $x,$y,offset

Register Read$xe $y

Instr. Fetch& PC Increm.

ALU Op. ($x-$y)& (PC+4+offset)

Write ofPC


7/76


Execution of conditional branches for MIPS

Processor resources to execute conditional branches:

PC + 4

(from fetch unit)

WriteData

RegisterRead 2

RegisterRead 1

Registerwrite

16 bit

Zero

32 bit

Istruction

ALURegister File

SignExtension

Contentregistrer 2

Contentregister 1

Branch TargetAddress

To control logic ofconditional branch

2-bit LeftShifter

Adder

[15-0]

[25-21]

[20-16]

WR


8/76


Implementation of the 5-stage MIPS Pipeline

Contentregister 2

Contentregister 1

[15-0]

[25-21]

[20-16]

[15-11]

Read Data

WriteData

RegisterRead 2

RegisterRead 1

RegisterWrite

Zero

32 bit16 bit

MUX

MUX

Result

ALU

RF

WriteData

WriteAddress

ReadAddress

Data

Memory

Signextension

Instruction

Memory

Instruction

ReadAddress

+4Adder

PC

2-bit LeftShifter

MUX

MUX

ID/EXIF/ID MEM/WBEX/MEM

IF Instruction Fetch

ID Instruction Decode EX Execution WB

Write Back

MEM Memory Access

WR

WR RDOP

Adder

Branch Target Address

BranchOutcome

PC Write


9/76


The Problem of Control Hazards

Control hazards: Attempt to make a decision on thenext instruction to fetch before the branch condition is

evaluated. Control hazards arise from the pipelining of conditional

branches and other instructions changing the PC.

Control hazards reduce the performance from the idealspeedup gained by the pipelining since they can make itnecessary to stall the pipeline.


10/76


Branch Hazards

To feed the pipeline we need to fetch a new instructionat each clock cycle, but the branch decision (to change

or not change the PC) is taken during the MEM stage. This delay to determine the correct instruction to fetch

is called Control Hazard or Conditional Branch Hazard

If a branch changes the PC to its target address, it is ataken branch

If a branch falls through, it is not taken or untaken.


11/76


Branch Hazards: Example

The branch instruction may or may not change the PC in MEM stage,but the next 3 instructions are fetched and their execution isstarted.

If the branch is not taken, the pipeline execution is OK If the branch is taken, it is necessary to f lushthe next 3 instructions

in the pipeline and fetched the lw instruction at the branch targetaddress (L1)

IF ID EX ME WBbeq $1, $3, L1

IF ID EX ME WBand $12, $2, $5

IF ID EX ME WBor $13, $6, $2

IF ID EX ME WBadd $14, $2, $2

IF ID EX ME WBL1: lw $4, 50($7)


12/76


Branch Hazards: Solutions

Conservative assumption: To stall the pipeline until thebranch decision is taken (stalling until resolution) and

then fetch the correct instruction flow. Without forwarding : for three clock cycles

With forwarding: for two clock cycles

If the branch is not taken, the three cycles penalty is notjustified throughput reduction.

We can assume the br anch not t aken, and flush the

next 3 instructions in the pipeline only if the branch willbe taken. (We cannot assume the br anch t aken becausewe dont know the branch target address)


13/76


Branch Stalls without Forwarding

stall





stall stall

Conservative assumption: Stalling until resolution at theend of the ME stage.

Each branch costs three stalls to fetch the correctinstruction flow: (PC+4) or Branch Target Address


14/76


Branch Stalls with Forwarding

stall





stall

Conservative assumption: Stalling until resolution at theend of the EX stage.

Each branch costs two stalls to fetch the correctinstruction flow: (PC+4) or Branch Target Address


15/76


Early Evaluation of the PC

To improve performance in case of branch hazards, weneed to add hardware resources to:

1. Compare registers to derive the Branch Outcome2. Compute the Branch Target Address

3. Update the PC register

as soon as possible in the pipeline.

MIPS processor compares registers, computes branchtarget address and updates PC dur i ng ID st age.


16/76


MIPS Processor: Early Evaluation of the PC

Cont.

Reg. 2

Cont.

Reg. 1

Write

Data

Register

Read 2

Register

Read 1

Register

Write

32 bit16 bit

M

U

XRegister File

DataMemory

Sign

Extension

Instruction

Memory

Instruction

Read

Address

+4

Adder

PC

M

U

X

M

U

X

ID/EX

IF/ID

MEM/WBEX/MEM

WR

2-bit Left

Shift

Adder

M

U

X

Result

ALU

OP

=

Branch Target Address

BranchOutcome

PC Write


17/76



stall





Conservative assumption: Stalling until resolution at the

end of the ID stage. Each branch costs one stall to fetch the correct

instruction flow: (PC+4) or Branch Target Address


18/76



Consequence of early evaluation of the branch decisionin ID stage:

In case of addinstruction followed by abranchtesting the result we need to introduce one stallbefore ID stage of branch to enable the forwarding

(EX-ID) of the result from EX stage of previousinstruction. As usual we need one stall after thebranch for branch resolution.

stall

IF ID EX ME WBaddi $1, $1, 4


IF ID EX ME WBstalland $12, $2, $5


19/76



Consequence of early evaluation of the branch decisionin ID stage:

In case of load instruction followed by abranchtesting the result we need to introduce two stallsbefore ID stage of branch to enable the forwarding

(ME-ID) of the result from EX stage of previousinstruction. As usual we need one stall after thebranch for branch resolution.

stall

IF ID EX ME WBlw $1, BASE($2)


IF ID EX ME WBstalland $12, $2, $5

stall


20/76



With the branch decision made during ID stage, there is areduction of the cost associated with each branch

(branch penalty): We need only one-clock-cycle stall after each branch

Or a flush of only one instruction following thebranch

One-cycle-delay for every branch still yields aperformance loss of 10% to 30% depending on the branchfrequency:

Pipeline Stall Cycles per Instruction due to Branches =Branch frequency x Branch Penalty

We will examine some techniques to deal with this

performance loss.


21/76




22/76



In general, the problem of the branch becomes more important fordeeply pipelined processors because the cost of incorrect predictionsincreases (the branches are solved several stages after the ID stage)

Main goal of branch prediction techniques: try to predict ASAP theoutcome of a branch instruction.

The performance of a branch prediction technique depends on:

Accuracy measured in terms of percentage of incorrect predictions givenby the predictor.

Cost of a incorrect prediction measured in terms of time lost to executeuseless instructions (misprediction penalty) given by the processor

architecture We also need to consider branch frequency: the importance of

accurate branch prediction is higher in programs with higher branchfrequency.


23/76



There are many methods to deal with the performanceloss due to branch hazards:

Static Branch Prediction Techniques: The actions fora branch are fixed for each branch during the entireexecution. The actions are fixed at compile time.

Dynamic Branch Prediction Techniques: Thedecision causing the branch prediction candynamically change during the program execution.

In both cases, care must be taken not to change theprocessor state and registers until the branch isdefinitely known.


24/76


Static Branch Prediction Techniques

Static Branch Prediction is used in processors where theexpectation is that the branch behavior is highly

predictable at compile time. Static Branch Prediction can also be used to assist

dynamic predictors.


25/76


Static Branch Prediction Techniques

1) Branch Always Not Taken (Predicted-Not-Taken)

2) Branch Always Taken (Predicted-Taken)

3) Backward Taken Forward Not Taken (BTFNT)

4) Profile-Driven Prediction

5) Delayed Branch


26/76


1) Branch Always Not Taken

We assume the branch will not be taken, thus thesequential instruction flow we have fetched can continue

as if the branch condition was not satisfied. If the condition in stage ID will result not satisfied (the

prediction is correct), we can preserve performance.

IF ID EX ME WBUntaken branch

IF ID EX ME WBInstruction i+1





27/76



If the condition in stage ID will result satisfied (theprediction is incorrect), the branch is taken:

We need to flush the next instruction already fetched (thefetched instruction is turned into a nop) and we restart theexecution by fetching the instruction at the branch targetaddress One-cycle penalty

IF ID EX ME WBTaken branch

IF idle idle idle idleInstruction i+1

IF ID EX ME WBBranch target

IF ID EX ME WBBranch target+1

IF ID EX ME WBBranch target+2


28/76



If on average the 50% of branches are not taken withthis hypothesis we can halve the cost of control hazards

If on average the probability that the branch is not takenis high with this hypothesis we can optimize the cost

of control hazards.


29/76


2) Branch Always Taken

An alternative scheme is to consider every branch as taken: assoon as the branch is decoded and the Branch Target Address

is computed, we assume the branch to be taken and we beginfetching and executing at the target.

The predicted-taken scheme makes sense for pipelines wherethe branch target address is known before the branch

outcome. In MIPS pipeline, we dont know the branch target address

earlier than the branch outcome, so there is no advantage in

this approach for this pipeline. We should anticipate the computation of BTA at the IF stage

or we need a Branch Target Buffer.


30/76


3) Backward Taken Forward Not Taken (BTFNT)

The prediction is based on the branch direction:

Backward-going branches are predicted as taken

Example: the branches at the end of loops go backat the beginning of the next loop iteration we assume the backward-going branches are

always taken. Forward-going branches are predicted as not taken


31/76


4) Profile-Driven Prediction

The branch prediction is based on profiling informationcollected from earlier runs.

The method can use compiler hints.


32/76


5) Delayed Branch Technique

Scheduling techniques: The compiler statically schedulesan independent instruction in the branch delay slot.

The instruction in the branch delay slot is executedwhether or not the branch is taken.

If we assume a branch delay of one-cycle (as for MIPS)

we have only one-delay slot Although it is possible to have for some deeply pipeline

processors a branch delay longer than one-cycle almost all processors with delayed branch have a single delay

slot (since it is usually difficult for the compiler to fill in morethan one delay slot).


33/76



The MIPS compiler always schedules a branchindependent instruction after the branch.

Example: A previous addinstruction with any effects onthe branch is scheduled in the Branch Delay Slot



IF ID EX ME WBlw $3, 300($0)



BRANCH DELAY SLOT


34/76



The behavior of the delayed branch is the same whetheror not the branch is taken.

If the branch is untaken execution continues withthe instruction after the branch

IF ID EX ME WBUntaken branchIF ID EX ME WBBranch delay instr.

IF ID EX ME WBInstr. i+1



BRANCH DELAY SLOT


35/76



If the branch is taken execution continues at thebranch target

IF ID EX ME WBTaken branch

IF ID EX ME WBBranch delay instr.

IF ID EX ME WBBranch target instr.

IF ID EX ME WBBranch target instr.+ 1

IF ID EX ME WBBranch target instr.+ 2

BRANCH DELAY SLOT


36/76



The job of the compiler is to make the instructionplaced in the branch delay slot valid and useful.

There are four ways in which the branch delay slot canbe scheduled:

1. From before

2. From target3. From fall-through

4. From after


37/76


5) Delayed Branch Technique: From Before

The branch delay slot is scheduled with an independentinstruction from before the branch

The instruction in the branch delay slot is alwaysexecuted (whether the branch is taken or untaken).

add $1, $2, $3

if $2 == 0 then

br. delay slot

if $2 == 0 then

add $1, $2, $3


38/76


5) Delayed Branch Technique: From Target

The use of $1 in the branch condition prevents addinstruction(whose destination is $1) from being moved after the branch.

The branch delay slot is scheduled from the target of the branch

(usually the target instruction will need to be copied because it canbe reached by another path).

This strategy is preferred when the branch is taken with highprobability, such as loop branches (backward branches).

sub $4, $5, $6

add $1, $2, $3

if $1 == 0 then

br. delay slot

sub $4, $5, $6

add $1, $2, $3

if $1 == 0 then

sub $4, $5, $6


39/76


5) Delayed Branch Technique: From Fall-Through

The use of $1 in the branch condition prevents add instruction(whose destination is $1) from being moved after the branch.

The branch delay slot is scheduled from the not-taken fall-through

path. This strategy is preferred when the branch is not taken with high

probability, such as forward branches.

add $1, $2, $3

if $1 == 0 then

br. delay slot

or $7, $8, $9

sub $4, $5, $6

add $1, $2, $3

if $1 == 0 then

or $7, $8, $9

sub $4, $5, $6


40/76



To make the optimization legal for the target an fall-through cases, it must be OK to execute the moved

instruction when the branch goes in the unexpecteddirection.

By OK we mean that the instruction in the branch delay

slot is executed but the work is wasted (the program willstill execute correctly).

For example, if the destination register is an unusedtemporary register when the branch goes in theunexpected direction.


41/76



In general, the compilers are able to fill about 50% ofdelayed branch slots with valid and useful instructions,

the remaining slots are filled with nops. In deeply pipeline, the delayed branch is longer that one

cycle: many slots must be filled for every branch, thus it

is more difficult to fill all the slots with usefulinstructions.


42/76



The main limitations on delayed branch scheduling arisefrom:

The restrictions on the instructions that can bescheduled in the delay slot.

The ability of the compiler to statically predict the

outcome of the branch.


43/76



To improve the ability of the compiler to fill the branchdelay slot most processors have introduced a

canceling or nullifying branch: the instruction includesthe direction that the branch was predicted. When the branch behaves as predicted the instruction in the

branch delay slot is executed normally.

When the branch is incorrectly predicted the instruction in thebranch delay slot is turned into a NOP

In this way, the compiler need not be as conservative

when filling the delay slot.


44/76



MIPS architecture has the branch-likely instruction, thatbehaves as cancel-if-not-taken branch:

The instruction in the branch delay slot is executed whether thebranch is taken.

The instruction in the branch delay slot is not executed (it isturned to NOP) whether the branch is untaken.

Useful approach for backward branches (such as loopbranches).

The branch delay slot must be filled from target.


45/76


Dynamic Branch Prediction Techniques

D i B h P di ti


46/76


Dynamic Branch Prediction

Basic Idea: To use the past branch behavior to predictthe future.

We use hardware to dynamically predict the outcome ofa branch: the prediction will depend on the behavior ofthe branch at run time and will change if the branch

changes its behavior during execution. We start with a simple branch prediction scheme and

then examine approaches that increase the branchprediction accuracy.

D i B h P di ti S h


47/76


Dynamic Branch Prediction Schemes

Dynamic branch prediction is based on two interactingmechanisms:

Branch Outcome Predictor: To predict the direction of a branch (i.e. taken or not

taken).

Branch Target Predictor or Branch Target Buffer:

To predict the branch target address in case of takenbranch.

These modules are used by the Instruction Fetch Unit to

predict the next instruction to read in the I-cache. If branch is not taken PC is incremented.

If branch is taken BTP gives the target address

B h Hi t T bl


48/76


Branch History Table

Branch History Table (or Branch Prediction Buffer):

Table containing 1 bit for each entry that says whether thebranch was recently taken or not.

Table indexed by the lower portion of the address of the branchinstruction.

Prediction: hint that it is assumed to be correct, and fetching begins

in the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted

and stored back. The pipeline is flushed and the correctsequence is executed.

The table has no tags (every access is a hit) and the prediction bitcould has been put there by another branch with the same low-orderaddress bits: but it doesnt matter. The prediction is just a hint!



49/76



T/NT

BHT

k-bit Branch Address 2k entries

n-bit Branch Address

Accuracy of the Branch History Table


50/76


Accuracy of the Branch History Table

A misprediction occurs when:

The prediction is incorrect for that branch

or The same index has been referenced by two different

branches, and the previous history refers to the other

branch. To solve this problem it is enough to increase the

number of rows in the BHT or to use a hashing

function (such as in GShare).


51/76


1-bit Branch History Table

Shortcoming of the 1-bit BHT:

In a loop branch, even if a branch is almost always taken and then nottaken once, the 1-bit BHT will mispredict twice (rather than once)

when it is not taken. That scheme causes two wrong predictions:

At the last loop iteration, since the prediction bit will say taken,while we need to exit from the loop.

When we re-enter the loop, at the end of the first loop iterationwe need to take the branch to stay in the loop, while theprediction bit say to exit from the loop, since the prediction bitwas flipped on previous execution of the last iteration of the loop.

For example, if we consider a loop branch whose behavior is takennine times and not taken once, the prediction accuracy is only 80%(due to 2 incorrect predictions and 8 correct ones).

2 bit Branch History Table


52/76


2-bit Branch History Table

The prediction must miss twice before it is changed.

In a loop branch, at the last loop iteration, we do not

need to change the prediction. For each index in the table, the 2 bits are used to

encode the four states of a finite state machine.

FSM for 2-bit Branch History Table


53/76


FSM for 2-bit Branch History Table

n-bit Branch History Table


54/76


n-bit Branch History Table

Generalization: n-bit saturating counter for each entry inthe prediction buffer.

The counter can take on values between 0and 2n

-1 When the counter is greater than or equal to one-half of its

maximum value (2n-1), the branch is predicted as taken.

Otherwise, it is predicted as untaken.

As in the 2-bit scheme, the counter is incremented on ataken branch and decremented on an untaken branch.

Studies on n-bit predictors have shown that 2-bit

predictors behave almost as well.

Accuracy of 2-bit Branch History Table


55/76


Accuracy of 2 bit Branch History Table

For IBM Power architecture executing SPEC89benchmarks , a 4K-entry BHT with 2-bit per entry results

in: Prediction accuracy from 99% to 82% (i.e.

misprediction rate from 1% to 18%)

Almost similar performance with respect to an infinitebuffer with 2-bit per entry.

Correlating Branch Predictors


56/76



The 2-bit BHT uses only the recent behavior of a singlebranch to predict the future behavior of that branch.

Basic Idea: the behavior of recent branches arecorrelated, that is the recent behavior of otherbranchesrather than just the current branch (we are trying topredict) can influence the prediction of the currentbranch.

Example of Correlating Branches


57/76


Example of Correlating Branches

I f ( a==2) a = 0; bb1L1: I f ( b==2) b = 0; bb2L2: I f ( a! =b) {}; bb3

subi r 3, r 1, 2bnez r 3, L1; bb1add r 1, r 0, r 0

L1: subi r 3, r 2, 2bnez r 3, L2; bb2add r 2, r 0, r 0

L2: sub r 3, r 1, r 2beqz r 3, L3; bb3

L3:

Branchbb3 is correlated to previous branchesbb1 andbb2.

If previous branches are both not taken,thenbb3 will be taken (a!=b)



58/76



Branch predictors that use the behavior of otherbranches to make a prediction are called Correlating

Predictors or 2-level Predictors. Example a (1,1) Correlating Predictors means a 1-bit

predictor with 1-bit of correlation: the behavior of lastbranch is used to choose among a pair of 1-bit branchpredictors.

Correlating Branch Predictors: Example


59/76


Correlating Branch Predictors: Example

1

0

....

1

1

....

Last Branch Result

Branch Address

T1: Branch History Tableif last branch taken

T2: Branch History Tableif last branch not taken

Branch Prediction



60/76



Record if the k most recently executed branches havebeen taken o not taken.

The branch is predicted based on the previous executedbranch by selecting the appropriate 1-bit BHT: One prediction is used if the last branch executed was taken

Another prediction is used if the last branch executed was nottaken.

In general, the last branch executed is not the sameinstruction as the branch being predicted (although this

can occur in simple loops with no other branches in theloops).

(m, n) Correlating Branch Predictors


61/76


( , ) g

In general (m, n) correlating predictor recordslast m branches to choose from 2m BHTs, each of

which is a n-bit predictor. The branch prediction buffer can be indexed by

using a concatenation of low-order bits from the

branch address with m-bit global history (i.e.global history of the most recent m branches).

(2, 2) Correlating Branch Predictors


62/76


( , ) g

A (2, 2) correlating predictor has 4 2-bit Branch History Tables.

It uses the 2-bit global history to choose among the 4 BHTs.

1 0

0 0

....

1 0

1 1

....

2-bit global branch history

2-bit Prediction

1 0

1 1

....

1 0

1 1

....

2k entriesBranch Address(k bit)

Example of (2, 2) Correlating Predictor


63/76


p ( ) g

Example: a (2, 2) correlating predictor with 64 totalentries 6-bit index composed of: 2-bit global historyand 4-bit low-order branch address bits

1 0

0 0

....

1 0

1 1

....

2-bit global branch history

2-bit Prediction

1 0

1 1

....

1 0

1 1

....

24

entries

4-bit branch

address

Example of (2, 2) Correlating Predictor


64/76


Each BHT is composed of 16 entries of 2-bit each.

The 4-bit branch address is used to choose four entries

(a row). 2-bit global history is used to choose one of four entries

in a row (one of four BHTs)

Accuracy of Correlating Predictors


65/76


A 2-bit predictor with no global history is simply a (0, 2)predictor.

By comparing the performance of a 2-bit simplepredictor with 4K entries and a (2,2) correlatingpredictor with 1K entries.

The (2,2) predictor not only outperforms the simply 2-bitpredictor with the same number of total bits (4K totalbits), it often outperforms a 2-bit predictor with anunlimited number of entries.

Accuracy of Correlating Predictors


66/76


FrequencyofMispredictions

0%

2%

4%

6%

8%

10%

12%

14%

16%

18%

nasa7

matrix300

tomcatv

doducd

spice

fpppp

gcc

espresso

eqntott li

0%

1%

5%6% 6%

11%

4%

6%5%

1%

4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2)

Two-Level Adaptive Branch Predictors


67/76


The first level history is recorded in one (or more) k-bitshift register called Branch History Register (BHR),which records the outcomes of the k most recent

branches The second level history is recorded in one (or more)

tables called Pattern History Table (PHT) of two-bit

saturating counters The BHR is used to index the PHT to select which 2-bit

counter to use.

Once the two-bit counter is selected, the prediction ismade using the same method as in the two-bit counterscheme.

GA Predictor


68/76


BHT: Local predictor

Indexed by the low-

order bits of PC (branchaddress)

GAs: Local and global

predictor 2-level predictor: PHT

Indexed by the contentof BHR (global history)

k-bitPC

T/NT

T NT NT T T/NT

BHT

PHT

BHR

GShare Predictor


69/76


GShare: Local XOR globalinformation

Indexed by theexclusive OR of thelow-order bits of PC(branch address) andthe content of BHR(global history)

T NT NT T

K-bit

PC

T/NT

XOR

BHR

PHT

Branch Target Buffer


70/76


Branch Target Buffer (Branch Target Predictor) is acache storing the predicted branch target address forthe next instruction after a branch

We access the BTB in the IF stage using the instructionaddress of the fetched instruction (a possible branch) toindex the cache.

Typical entry of the BTB:

The predicted target address is expressed as PC-relative

Exact Address of a Branch Predicted target address

Structure of a Branch Target Buffer


71/76


PC of fetched instruction

=

Associative lookup Predicted target address

No, instruction is not predictedTo be a branch, proceed normally

Yes, instruction is a branch,PC should be used as next PC

Use lowest bitsOf the PC

Need also some validity bits

Structure of a Branch Target Buffer


72/76


BTB entry: Tag + Predicted target

address (expressed as PC-relative for conditionalbranches) + Predictionstate bits as in a BranchOutcome Predictor(optional).

T-T-NT

PC

Tag Target Stat

.

=

PresentTargetAddress T/NT

In the BTB we need to store the predicted target addressonly for taken branches.

Speculation


73/76


Without branch prediction, the amount of parallelism is quitelimited, since it is limited to within a basic block a straight-linecode sequence with no branches in except to the entry and no

branches out except at the exit. Branch prediction techniques can help to achieve significant amount

of parallelism.

We can further exploit ILP across multiple basic blocks overcoming

control dependences by speculating on the outcome of branches andexecuting instructions as if our guesses were correct.

With speculation, we fetch, issue and execute instructions as if outbranch predictions were always correct, providing a mechanism tohandle the situation where the speculation is incorrect.

Speculation can be supported by the compiler or by the hardware.

Hardware-Based Speculation


74/76


Hardware-based speculation extends the ideas ofdynamic scheduling and combines three keys ideas:

Dynamic Branch Prediction to choose which instruction to

execute; Speculation to execute instructions before control dependences

are solved (with the ability to undo the effects of an incorrectlyspeculated sequence);

Dynamic Scheduling to deal with the scheduling of differentcombinations of basic blocks.

Hardware-based speculation follows the predicted flowof data values to choose when to execute instructions:

operations execute as soon as their operands areavailable (data flow execution).

Hardware-Based Speculation


75/76


When an instruction is no longer speculative, we allow itto update the register file or memory (instructioncommit).

The key idea behind speculation is to allow instructionsto execute out-of order but to force them to commitin-order and to prevent any irrevocable action (such as

updating state or taking an exception) until aninstructions commits.

Reorder buffer (ROB) to hold the results of instructionsthat have completed execution but have not committed

or to pass results among instructions that may bespeculated.

References


76/76


An introduction to the branch prediction problem can be found inChapter 3 of: J. Hennessy and D. Patterson, ComputerArchitecture, a Quantitative Approach, Morgan Kaufmann, Thirdedition, May 2002.

A survey of basic branch prediction techniques can be found in:D. J. Lalja, Reducing the Branch Penalty in Pipelined Processors,Computer, pages 47-55, July 1988.

A more detailed and advanced survey of the most used branchpredictor architectures can be found in:M. Evers and T.-Y. Yeh, Understanding Branches and DesigningBranch Predictors for High-Performance Microprocessors,Proceedings of the IEEE, Vol. 89, No. 11, pages 1610-1620,November 2001.

lesson 2 branch prediction

Documents