graduate computer architecture i

Lecture 3: Branch Prediction

Young Cho

Graduate Computer Architecture I

CSE/ESE 560M – Graduate Computer Architecture I

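Cycles Per Instruction

Average cycles per instruction:

\[ \text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}} \]

\[ \text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j \]

Instruction frequency:

\[ \text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j \quad \text{where} \quad F_j = \frac{I_j}{\text{Instruction Count}} \]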
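As a rough illustration of the weighted-CPI formula, the sketch below computes the average CPI and the CPU time for a made-up instruction mix. The instruction classes, counts, per-class CPIs, and the 1 ns cycle time are all assumptions chosen for the example, not figures from the lecture.

```python
# Hypothetical instruction mix: class -> (instruction count I_j, cycles per instruction CPI_j).
MIX = {
    "alu":    (500, 1),
    "load":   (200, 2),
    "store":  (100, 2),
    "branch": (200, 3),
}

def average_cpi(mix):
    """CPI = sum_j CPI_j * F_j, with F_j = I_j / instruction count."""
    total = sum(count for count, _ in mix.values())
    return sum(cpi * count / total for count, cpi in mix.values())

def cpu_time(mix, cycle_time):
    """CPU time = cycle time * sum_j CPI_j * I_j."""
    return cycle_time * sum(cpi * count for count, cpi in mix.values())

cpi = average_cpi(MIX)
time_ns = cpu_time(MIX, cycle_time=1.0)  # assumed 1 ns cycle
print(f"CPI = {cpi:.2f}, CPU time = {time_ns:.0f} ns")
```

With this mix, the branch class contributes more than a third of all cycles even though it is only 20% of the instructions, which is why branch penalties get so much attention.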


Typical Load/Store Processor

[Figure: pipelined datapath with PC, Instruction Memory, Register File, ALU, Data Memory, and Control, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]


Pipelining Laundry

[Figure: timeline of three overlapped loads, with segments of 30, 35, 35, 35, and 25 minutes]

Three sets of clean clothes in 2 hours 40 minutes

With a large number of sets, each load takes an average of ~35 minutes to wash

3X increase in productivity!


Introducing Problems

• Hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: the hardware cannot support this combination of instructions (a single person cannot dry and iron clothes simultaneously)
  – Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock: both are needed before putting them away)
  – Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps)


Data Hazards

• Read After Write (RAW): Instr2 tries to read an operand before Instr1 writes it
  – Called a “dependence” in compiler terms
• Write After Read (WAR): Instr2 writes an operand before Instr1 reads it
  – Called an “anti-dependence” in compiler terms
• Write After Write (WAW): Instr2 writes an operand before Instr1 writes it
  – Called an “output dependence” in compiler terms
• WAR and WAW hazards arise in more complex systems
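The three hazard types can be told apart just by comparing the registers two instructions read and write. The sketch below is a minimal illustration under that assumption; the `Instr` record and the example instructions are hypothetical, and whether a dependence actually causes a stall depends on the pipeline.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    """Hypothetical minimal instruction record: one destination, a set of sources."""
    dest: str
    srcs: frozenset

def hazards(first: Instr, second: Instr) -> set:
    """Return the hazard types that `second` has with respect to the earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")   # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")   # second writes what first reads
    if second.dest == first.dest:
        found.add("WAW")   # both write the same register
    return found

i1 = Instr("r1", frozenset({"r2", "r3"}))   # add r1,r2,r3
i2 = Instr("r4", frozenset({"r1", "r5"}))   # sub r4,r1,r5
i3 = Instr("r2", frozenset({"r6", "r7"}))   # or  r2,r6,r7
print(hazards(i1, i2))  # RAW on r1
print(hazards(i1, i3))  # WAR on r2
```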


Branch Hazard (Control)

10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11

[Figure: pipeline diagram showing the Ifetch, Reg, ALU, DMem, Reg stages for each of the five instructions]

Three instructions are in the pipeline before the new instruction can be fetched.


Branch Hazard Alternatives

• Stall until the branch direction is clear
• Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in the pipeline if the branch is actually taken
  – Takes advantage of late pipeline state update
  – 47% of DLX branches are not taken on average
  – PC+4 is already calculated, so use it to fetch the next instruction
• Predict Branch Taken
  – 53% of DLX branches are taken on average
  – DLX still incurs a 1-cycle branch penalty
  – Other machines: branch target known before outcome


Branch Hazard Alternatives

• Delayed Branch
  – Define the branch to take place AFTER a following instruction (fill in the branch delay slot)

      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n      (branch delay of length n)
      branch target if taken

  – A 1-slot delay allows the proper decision and the branch target address to be resolved in a 5-stage pipeline


Evaluating Branch Alternatives

Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31

Conditional and unconditional branches = 14% of instructions; 65% of them change the PC

\[ \text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}} \]
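The CPI column can be reproduced from the speedup formula with a few lines of arithmetic. The sketch below assumes a pipeline depth of 5 and assumes that predict-not-taken pays its 1-cycle penalty only on the 65% of branches that change the PC; neither assumption is stated explicitly on the slide. The speedups computed this way can differ from the table in the last digit, since the table's rounding is not given.

```python
PIPELINE_DEPTH = 5        # assumed 5-stage pipeline
BRANCH_FREQ = 0.14        # conditional and unconditional branches
TAKEN_FRACTION = 0.65     # fraction of branches that change the PC

# scheme -> effective penalty in cycles per branch
# Assumption: predict-not-taken pays its 1-cycle penalty only on taken branches.
SCHEMES = {
    "Stall pipeline":    3.0,
    "Predict taken":     1.0,
    "Predict not taken": 1.0 * TAKEN_FRACTION,
    "Delayed branch":    0.5,
}

def cpi(penalty):
    """CPI = 1 + branch frequency * branch penalty."""
    return 1.0 + BRANCH_FREQ * penalty

def speedup(penalty):
    """Pipeline speedup over an unpipelined machine = depth / CPI."""
    return PIPELINE_DEPTH / cpi(penalty)

for name, penalty in SCHEMES.items():
    print(f"{name:18s} CPI = {cpi(penalty):.2f}  speedup = {speedup(penalty):.2f}")
```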


Solutions to Hazards

• Structural Hazards
  – Delay the hardware-dependent instruction
  – Increase resources (e.g., dual-port memory)
• Data Hazards
  – Data forwarding
  – Software scheduling
• Control Hazards
  – Pipeline stalling
  – Predict and flush
  – Fill delay slots with previous instructions


Administrative

• Literature Survey
  – One Q&A per paper
  – Q&A should show that you read the paper
• Changes in Schedule
  – Need to be out of town on Oct 4th (Tuesday)
  – Quiz 2 moved up 1 lecture
• Tool and VHDL help


Typical Pipeline

• Example: MIPS R4000

[Figure: pipeline with shared IF, ID, MEM, and WB stages and four parallel execution units: integer unit (ex), FP/int multiply (m1 to m7), FP adder (a1 to a4), and FP/int divider (Div, latency = 25, initiation interval = 25)]


Prediction

• Easy to fetch multiple (consecutive) instructions per cycle
  – Essentially speculating on sequential flow
• Jump: unconditional change of control flow
  – Always taken
• Branch: conditional change of control flow
  – Taken typically ~50% of the time in applications
    • Backward: 30% of branches × 80% taken = ~24%
    • Forward: 70% of branches × 40% taken = ~28%
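Adding the backward and forward contributions gives a quick check of the ~50% figure. This is only arithmetic on the slide's own numbers:

```latex
\[
P(\text{taken}) = 0.30 \times 0.80 + 0.70 \times 0.40 = 0.24 + 0.28 = 0.52
\]
```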


Current Ideas

• Reactive
  – Adapt current action based on the past
  – TCP windows
  – URL completion, ...
• Proactive
  – Anticipate future action based on the past
  – Branch prediction
  – Long cache blocks
  – Tracing


Branch Prediction Schemes

• Static Branch Prediction
• Dynamic Branch Prediction
  – 1-bit Branch-Prediction Buffer
  – 2-bit Branch-Prediction Buffer
  – Correlating Branch Prediction Buffer
  – Tournament Branch Predictor
• Branch Target Buffer
• Integrated Instruction Fetch Units
• Return Address Predictors


Static Branch Prediction

• Execution profiling
  – Very accurate if you actually take the time to profile
  – Inconvenient
• Heuristics based on nesting and coding
  – Simple heuristics are very inaccurate
• Programmer-supplied hints...
  – Inconvenient and potentially inaccurate


Dynamic Branch Prediction

• Performance = ƒ(accuracy, cost of misprediction)
• 1-bit Branch History Table
  – Bitmap indexed by the lower bits of the PC address
  – Says whether or not the branch was taken last time
  – If the instruction is a branch, predict and update the table
• Problem
  – A 1-bit BHT causes 2 mispredictions for each loop
    • First time through the loop, it predicts exit instead of loop
    • At the end of the loop, it predicts loop instead of exit
  – Average is 9 iterations before exit
    • Only 80% accuracy even if the loop branch is taken 90% of the time
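The 80% figure can be checked with a small simulation. The sketch below models a single 1-bit entry and an assumed loop branch that is taken 9 times and then falls through once; the trace and the function name are illustrative only.

```python
def one_bit_accuracy(outcomes, initial=False):
    """Predict each outcome with a single bit holding the previous outcome."""
    last = initial          # True = taken, False = not taken
    correct = 0
    for taken in outcomes:
        if last == taken:
            correct += 1
        last = taken        # 1-bit scheme: always remember the latest outcome
    return correct / len(outcomes)

# Assumed loop branch: taken 9 times, then not taken once (loop exit), run 100 times.
loop = ([True] * 9 + [False]) * 100
print(f"1-bit accuracy: {one_bit_accuracy(loop):.1%}")
```

Each pass through the loop costs two mispredictions, one on re-entry and one on exit, so 8 of every 10 predictions are correct.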


N-bit Dynamic Branch Prediction

• N-bit scheme: change the prediction only after N mispredictions

[Figure: state diagram with two "Predict Taken" states and two "Predict Not Taken" states; T (taken) and NT (not taken) outcomes move between them]

• 2-bit scheme: the prediction saturates, so it takes 2 mispredictions to change it
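A 2-bit scheme is commonly built as a saturating counter. The sketch below uses that variant, which may differ in detail from the state diagram on the slide, and runs it on the same assumed loop trace (9 taken, then 1 not taken) to show that only the loop exit is mispredicted once the counter has warmed up.

```python
def two_bit_accuracy(outcomes, counter=0):
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    correct = 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            correct += 1
        # Move toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

# Assumed loop branch: taken 9 times, then not taken once, run 100 times.
loop = ([True] * 9 + [False]) * 100
print(f"2-bit accuracy: {two_bit_accuracy(loop):.1%}")
```

The result lands close to 90%, compared with 80% for a 1-bit entry on the same trace, because the single not-taken outcome at the loop exit no longer flips the prediction.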


Correlating Branches

• (2,2) predictor
  – 2-bit global: indicates the behavior of the last two branches
  – 2-bit local (2-bit dynamic branch prediction)
• Branch History Table
  – Global branch history is used to choose one of four history bitmap tables
  – Predicts the branch behavior, then updates only the selected bitmap table

[Figure: the branch address (4 bits) and the 2-bit recent global branch history (01 = not taken then taken) together select the prediction]
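One possible software model of a (2,2) predictor is sketched below. The class name, the 4-bit index taken from the word-aligned PC, the initial counter values, and the alternating test trace are assumptions for illustration, not details from the lecture.

```python
class CorrelatingPredictor:
    """Sketch of a (2,2) predictor: 2 bits of global history select one of
    four tables of 2-bit counters, each indexed by low branch-address bits."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.history = 0                     # last two outcomes, newest in bit 0
        # Four tables of counters, all starting at 1 (weakly not taken).
        self.tables = [[1] * (1 << index_bits) for _ in range(4)]

    def _index(self, pc):
        return (pc >> 2) & self.mask         # assumes 4-byte instructions

    def predict(self, pc):
        return self.tables[self.history][self._index(pc)] >= 2

    def update(self, pc, taken):
        row = self.tables[self.history]      # only the selected table is updated
        i = self._index(pc)
        row[i] = min(row[i] + 1, 3) if taken else max(row[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & 0b11

def accuracy(predictor, trace):
    """Fraction of branches in `trace` predicted correctly, training as we go."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)

# Assumed trace: one branch at a made-up address that alternates T, NT, T, NT, ...
trace = [(0x40, i % 2 == 0) for i in range(200)]
print(f"(2,2) accuracy on alternating branch: {accuracy(CorrelatingPredictor(), trace):.1%}")
```

An alternating branch defeats a plain 2-bit counter, but here the global history distinguishes "last outcome taken" from "last outcome not taken", so each case trains its own counter.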


Accuracy of Different Schemes

[Figure: bar chart of the frequency of mispredictions (0% to 20%) for three predictors (4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, 1024-entry (2,2) BHT) on the benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li]


BHT Accuracy

• Mispredict because either:
  – Wrong guess for the branch
  – Wrong index for the branch
• 4096-entry table
  – Programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• For SPEC92
  – 4096 entries are about as good as an infinite table


Tournament Branch Predictors

• Correlating Predictor
  – 2-bit predictor failed on important branches
  – Better results by also using global information
• Tournament Predictors
  – 1 predictor based on global information
  – 1 predictor based on local information
  – Use the predictor that guesses better

[Figure: the branch address (addr) feeding Predictor A and Predictor B]
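The selection idea can be sketched with three tables of 2-bit counters: one local, one global, and one chooser. Everything below (table sizes, indexing, initial values, the rule that the chooser trains only when the two predictors disagree, and the test trace) is an assumption made for the example, not the design of any particular processor.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter toward 3 (up) or 0 (down)."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)

class Tournament:
    def __init__(self, bits=10):
        size = 1 << bits
        self.mask = size - 1
        self.history = 0
        self.local = [1] * size     # indexed by branch address
        self.glob = [1] * size      # indexed by global history
        self.choice = [1] * size    # >= 2 means "trust the global predictor"

    def step(self, pc, taken):
        """Predict the branch at `pc`, then train on the real outcome.
        Returns the prediction that was made."""
        li, gi = pc & self.mask, self.history
        p_local, p_global = self.local[li] >= 2, self.glob[gi] >= 2
        prediction = p_global if self.choice[li] >= 2 else p_local
        if p_local != p_global:     # chooser learns only when they disagree
            self.choice[li] = bump(self.choice[li], p_global == taken)
        self.local[li] = bump(self.local[li], taken)
        self.glob[gi] = bump(self.glob[gi], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask
        return prediction

predictor = Tournament()
trace = [(0x40, i % 2 == 0) for i in range(400)]   # assumed alternating branch
hits = sum(predictor.step(pc, taken) == taken for pc, taken in trace)
print(f"tournament accuracy: {hits / len(trace):.1%}")
```

On this trace the local counter keeps mispredicting, the history-indexed predictor learns the alternation, and the chooser shifts toward the global side after a short warm-up.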


Alpha 21264

• 4K 2-bit counters choose between a global predictor and a local predictor
• Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
  – 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
• Local predictor consists of a 2-level predictor:
  – Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
  – Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
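The total-size line can be verified directly, taking K = 1024. The dictionary keys below are descriptive labels chosen for this sketch.

```python
K = 1024

# (entries, bits per entry) for each structure listed above
structures = {
    "choice counters":     (4 * K, 2),
    "global predictor":    (4 * K, 2),
    "local history table": (1 * K, 10),
    "local counters":      (1 * K, 3),
}

total_bits = sum(entries * width for entries, width in structures.values())
print(f"{total_bits} bits = {total_bits // K}K bits")  # 29696 bits = 29K bits
```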


Branch Prediction Accuracy

[Figure: bar chart of prediction accuracy (0% to 100%) for Profile-based, 2-bit dynamic, and Tournament predictors on gcc, espresso, li, fpppp, doduc, and tomcatv; plotted values range from 70% to 100%]


Accuracy versus Size

[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for Local, Correlating, and Tournament predictors]


Branch Target Buffer

• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
  – Note: must check for a branch match now, since the wrong branch address cannot be used

[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC field of the BTB, which also holds the Predicted PC and extra prediction state bits]

• Yes: the instruction is a branch; use the predicted PC as the next PC
• No: branch not predicted; proceed normally (Next PC = PC+4)
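The lookup described above can be modeled as a small direct-mapped table with a tag check. In the sketch below, the class, the table size, the word-aligned indexing, and the addresses are hypothetical, and the extra prediction state bits are left out for brevity.

```python
class BranchTargetBuffer:
    """Sketch of a direct-mapped BTB: each entry stores (branch PC, predicted PC)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries      # assumes 4-byte instructions

    def next_pc(self, pc):
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:   # the "=?" tag check
            return entry[1]                        # hit: use the predicted PC
        return pc + 4                              # miss: proceed normally

    def record_taken(self, pc, target):
        """Remember the target of a branch that was taken."""
        self.table[self._index(pc)] = (pc, target)

btb = BranchTargetBuffer()
btb.record_taken(pc=40, target=144)          # made-up addresses
print(btb.next_pc(40), btb.next_pc(44), btb.next_pc(104))  # 144 48 108
```

The third lookup maps to the same entry as the recorded branch but fails the tag check, so it falls back to PC+4 instead of using another branch's target.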


Predicated Execution

• Built-in hardware support
  – Bit for predicated instruction execution
  – Both paths are in the code
  – Execution based on the result of the condition
• No branch prediction is required
  – Instructions not selected are ignored
  – Similar to inserting a nop
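As a loose software analogy only (real predication is done by the hardware on individual instructions), the sketch below contrasts a branching computation with a version where both paths are evaluated and a predicate selects which result is kept. The function names are made up for the example.

```python
def abs_diff_branching(a, b):
    """Branch version: only one path executes."""
    if a > b:
        return a - b
    return b - a

def abs_diff_predicated(a, b):
    """Predicated version: both paths are evaluated; the predicate decides
    which result is committed."""
    p = a > b                 # predicate bit set by the compare
    r_true = a - b            # would execute under predicate p
    r_false = b - a           # would execute under predicate not p
    return r_true if p else r_false   # the unselected result is discarded, like a nop

print(abs_diff_branching(3, 7), abs_diff_predicated(3, 7))  # 4 4
```

The trade-off is that the predicated form always spends work on both paths, which pays off mainly when the paths are short and the branch is hard to predict.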


Zero Cycle Jump

• What really has to be done at runtime?
  – Once an instruction has been detected as a jump or JAL, we might recode it in the internal cache.
  – Very limited form of dynamic compilation?

A:  and  r3,r1,r5
    addi r2,r3,#4
    sub  r4,r2,r1
    jal  doit
    subi r1,r1,#1

Internal cache state:

    and  r3,r1,r5    N    A+4
    addi r2,r3,#4    N    A+8
    sub  r4,r2,r1    L    doit
    ---              --   ---
    subi r1,r1,#1    N    A+20

• Use of “pre-decoded” instruction cache
  – Called “branch folding” in the Bell Labs CRISP processor.
  – Original CRISP cache had two addresses and could thus fold a complete branch into the previous instruction
  – Notice that JAL introduces a structural hazard on write


Dynamic Branch Prediction Summary

• Prediction is becoming an important part of scalar execution
• Branch History Table
  – 2 bits for loop accuracy
• Correlation
  – Recently executed branches are correlated with the next branch
  – Either different branches
  – Or different executions of the same branch
• Tournament Predictor
  – Devote more resources to competing solutions and pick between them
• Branch Target Buffer
  – Branch address and prediction
• Predicated Execution
  – No need for prediction
  – Hardware support needed
