Graduate Computer Architecture I
Lecture 3: Branch Prediction
Young Cho
Graduate Computer Architecture I
2 - CSE/ESE 560M – Graduate Computer Architecture I
Cycles Per Instruction
“Average Cycles per Instruction”
CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count
$$\text{CPU time} = \text{Cycle time} \times \sum_{j=1}^{n} CPI_j \times I_j$$
“Instruction Frequency”
$$CPI = \sum_{j=1}^{n} CPI_j \times F_j \quad \text{where} \quad F_j = \frac{I_j}{\text{Instruction Count}}$$
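As a minimal sketch of the CPI definitions on this slide, the Python below computes the average CPI as a frequency-weighted sum of per-class CPIs, and the CPU time from the total cycle count. The instruction mix, the per-class cycle counts, and the function names are hypothetical values chosen for illustration; they do not come from the lecture.

```python
# Hypothetical instruction mix (not from the lecture):
# class name -> (instruction count I_j, cycles per instruction CPI_j)
mix = {
    "alu":    (500, 1),
    "load":   (200, 2),
    "store":  (100, 2),
    "branch": (200, 3),
}

def average_cpi(mix):
    """CPI = sum_j CPI_j * F_j, where F_j = I_j / instruction count."""
    total = sum(count for count, _ in mix.values())
    return sum(cpi * (count / total) for count, cpi in mix.values())

def cpu_time(mix, cycle_time):
    """CPU time = cycle time * sum_j CPI_j * I_j."""
    return cycle_time * sum(cpi * count for count, cpi in mix.values())

cpi = average_cpi(mix)          # 1.7 for the mix above
seconds = cpu_time(mix, 1e-9)   # assumes a 1 ns cycle time
```

With this mix the branch class contributes 0.6 of the 1.7 cycles per instruction even though it is only 20% of the instructions, which is one way to see why branch cost matters.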
Typical Load/Store Processor
[Figure: pipelined datapath with PC, Control, Instruction Memory, Register File, ALU, and Data Memory, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
Pipelining Laundry
[Figure: laundry timeline with segments of 30, 35, 35, 35, and 25 minutes]
Three sets of clean clothes in 2 hours 40 minutes
With a large number of sets, each load takes an average of ~35 min to wash
3X increase in productivity!!!
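The laundry numbers can be checked with a short calculation. This is a sketch under an assumption: the timeline is read as three stages of 30, 35, and 25 minutes (the stage names wash, dry, and fold are guesses, not stated on the slide), with the 35-minute stage as the bottleneck.

```python
def pipelined_time(stage_minutes, loads):
    """Total minutes when each load enters a stage as soon as that stage is free.

    For identical loads this is one full pass through all stages plus one
    bottleneck-stage delay for every additional load.
    """
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)

def sequential_time(stage_minutes, loads):
    """Total minutes when each load finishes completely before the next starts."""
    return sum(stage_minutes) * loads

stages = [30, 35, 25]  # assumed wash, dry, fold times in minutes

three_loads = pipelined_time(stages, 3)       # 160 minutes = 2 hours 40 minutes
unpipelined = sequential_time(stages, 3)      # 270 minutes
```

Under these assumptions the steady-state cost per load approaches the bottleneck time of 35 minutes, against 90 minutes unpipelined, a ratio of about 2.6 that the slide rounds to 3X.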
Introducing Problems
• Hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (single person to dry and iron clothes simultaneously)
  – Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock: needs both before putting them away)
  – Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
Data Hazards
• Read After Write (RAW)
  – Instr2 tries to read operand before Instr1 writes it
  – Caused by a “dependence” in compiler terms
• Write After Read (WAR)
  – Instr2 writes operand before Instr1 reads it
  – Called an “anti-dependence” in compiler terms
• Write After Write (WAW)
  – Instr2 writes operand before Instr1 writes it
  – “Output dependence” in compiler terms
• WAR and WAW occur in more complex systems
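The three data hazard definitions can be sketched as a small classifier. The `Instr` type and the `hazards` function are hypothetical names introduced for illustration; the sketch only compares register names between two instructions and ignores pipeline timing, so it reports potential hazards rather than actual stalls.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    dest: str                # register written by the instruction
    srcs: Tuple[str, ...]    # registers read by the instruction

def hazards(first, second):
    """Return the potential hazards when `second` follows `first` in program order."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")     # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")     # second writes what first reads
    if second.dest == first.dest:
        found.add("WAW")     # both write the same register
    return found

add = Instr(dest="r1", srcs=("r2", "r3"))   # add r1,r2,r3
sub = Instr(dest="r4", srcs=("r1", "r5"))   # sub r4,r1,r5
raw_case = hazards(add, sub)                # {"RAW"}
```

For example, `hazards(add, Instr("r2", ("r6", "r7")))` reports a WAR hazard, since the second instruction overwrites `r2` that `add` still needs to read.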
Branch Hazard (Control)
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[Figure: pipeline diagram showing each instruction passing through Ifetch, Reg, ALU, DMem, and Reg in successive cycles]
3 instructions are in the pipeline before the new instruction can be fetched.
Branch Hazard Alternatives
• Stall until branch direction is clear
• Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% DLX branches not taken on average
  – PC+4 already calculated, so use it to get next instr
• Predict Branch Taken
  – 53% DLX branches taken on average
  – DLX still incurs 1 cycle branch penalty
  – Other machines: branch target known before outcome
Branch Hazard Alternatives
• Delayed Branch
  – Define branch to take place AFTER a following instruction (fill in the branch delay slot)
      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n
      branch target if taken
    (successors 1 to n form a branch delay of length n)
  – 1 slot delay allows proper decision and branch target address in 5 stage pipeline
Evaluating Branch Alternatives

| Scheduling scheme | Branch penalty | CPI | Speedup v. unpipelined | Speedup v. stall |
|---|---|---|---|---|
| Stall pipeline | 3 | 1.42 | 3.5 | 1.0 |
| Predict taken | 1 | 1.14 | 4.4 | 1.26 |
| Predict not taken | 1 | 1.09 | 4.5 | 1.29 |
| Delayed branch | 0.5 | 1.07 | 4.6 | 1.31 |

Conditional & unconditional branches = 14%, 65% change PC

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
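The CPI column of the table can be reproduced from the speedup formula. This sketch makes two assumptions not stated on the slide: the ideal CPI is 1, and for predict-not-taken the penalty is paid only by the 65% of branches that change the PC. The function and variable names are illustrative.

```python
def branch_cpi(branch_freq, penalty, base_cpi=1.0):
    """CPI including branch stalls: base CPI + branch frequency * branch penalty."""
    return base_cpi + branch_freq * penalty

def pipeline_speedup(depth, branch_freq, penalty):
    """Pipeline speedup = depth / (1 + branch frequency * branch penalty)."""
    return depth / (1 + branch_freq * penalty)

BRANCH_FREQ = 0.14      # conditional and unconditional branches
TAKEN_FRACTION = 0.65   # fraction of branches that change the PC

# Average penalty in cycles per branch for each scheme.
penalties = {
    "stall pipeline": 3,
    "predict taken": 1,
    "predict not taken": 1 * TAKEN_FRACTION,  # assumption: penalty only when the PC changes
    "delayed branch": 0.5,
}

cpis = {name: branch_cpi(BRANCH_FREQ, p) for name, p in penalties.items()}
stall_speedup = pipeline_speedup(5, BRANCH_FREQ, 3)  # assumes a 5-stage pipeline
```

Rounded to two decimals, `cpis` gives 1.42, 1.14, 1.09, and 1.07, matching the table's CPI column.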
Solution to Hazards
• Structural Hazards
  – Delaying HW Dependent Instruction
  – Increase Resources (e.g. dual port memory)
• Data Hazards
  – Data Forwarding
  – Software Scheduling
• Control Hazards
  – Pipeline Stalling
  – Predict and Flush
  – Fill Delay Slots with Previous Instructions
Administrative
• Literature Survey
  – One Q&A per Literature
  – Q&A should show that you read the paper
• Changes in Schedule
  – Need to be out of town on Oct 4th (Tuesday)
  – Quiz 2 moved up 1 lecture
• Tool and VHDL help
Typical Pipeline
• Example: MIPS R4000
[Figure: IF and ID feed four parallel execution units before MEM and WB: the integer unit (ex), the FP/int multiply unit (m1 to m7), the FP adder (a1 to a4), and the FP/int divider (Div: lat = 25, Init inv = 25)]
Prediction
• Easy to fetch multiple (consecutive) instructions per cycle
  – Essentially speculating on sequential flow
• Jump: unconditional change of control flow
  – Always taken
• Branch: conditional change of control flow
  – Taken typically ~50% of the time in applications
    • Backward: 30% of branches × 80% taken = ~24%
    • Forward: 70% of branches × 40% taken = ~28%
Current Ideas
• Reactive
  – Adapt Current Action based on the Past
  – TCP windows
  – URL completion, ...
• Proactive
  – Anticipate Future Action based on the Past
  – Branch prediction
  – Long Cache block
  – Tracing
Branch Prediction Schemes
• Static Branch Prediction
• Dynamic Branch Prediction
  – 1-bit Branch-Prediction Buffer
  – 2-bit Branch-Prediction Buffer
  – Correlating Branch Prediction Buffer
  – Tournament Branch Predictor
• Branch Target Buffer
• Integrated Instruction Fetch Units
• Return Address Predictors
Static Branch Prediction
• Execution profiling
  – Very accurate if you actually take the time to profile
  – Inconvenient
• Heuristics based on nesting and coding
  – Simple heuristics are very inaccurate
• Programmer supplied hints...
  – Inconvenient and potentially inaccurate
Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of mis-prediction)
• 1-bit Branch History Table
  – Bitmap indexed by the lower bits of the PC address
  – Says whether or not the branch was taken last time
  – If the instruction is a branch, predict and update the table
• Problem
  – 1-bit BHT will cause 2 mis-predictions for loops
    • First time through the loop, it predicts exit instead of loop
    • At the end of the loop, it predicts loop instead of exit
  – Avg is 9 iterations before exit
    • Only 80% accuracy even if the branch loops 90% of the time
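The 80% figure can be reproduced with a few lines of simulation. This is a minimal sketch of a single 1-bit history entry, assuming one loop branch that is taken 9 times and then falls through once, repeated many times; the function name and the outcome trace are illustrative and not from the lecture.

```python
def one_bit_accuracy(outcomes, initial=False):
    """Simulate one 1-bit entry: predict whatever the branch did last time."""
    last = initial          # True = taken, False = not taken
    correct = 0
    for taken in outcomes:
        if last == taken:
            correct += 1
        last = taken        # the single bit always records the latest outcome
    return correct / len(outcomes)

# Hypothetical loop branch: taken 9 times, then not taken once at loop exit.
loop = ([True] * 9 + [False]) * 100
accuracy = one_bit_accuracy(loop)   # 0.8
```

Each pass through the loop costs two mispredictions: one on the first iteration (the bit still says "exit" from the previous pass) and one at the exit itself.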
N-bit Dynamic Branch Prediction
• N-bit scheme where the prediction changes only after N mispredictions
[Figure: 2-bit state diagram with two Predict Taken states and two Predict Not Taken states, connected by T (taken) and NT (not taken) transitions]
2-bit Scheme: Saturates the prediction up to 2 times
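A 2-bit scheme can be sketched with a saturating counter. The encoding below (states 0 and 1 predict not taken, 2 and 3 predict taken) is the common saturating-counter variant and may differ in detail from the state diagram on the slide; the loop trace is the same hypothetical 9-taken, 1-not-taken pattern used for illustration.

```python
def two_bit_accuracy(outcomes, state=3):
    """Simulate one 2-bit saturating counter.

    States 0 and 1 predict not taken; states 2 and 3 predict taken.
    The counter moves one step toward the actual outcome after each branch.
    """
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# Hypothetical loop branch: taken 9 times, then not taken once at loop exit.
loop = ([True] * 9 + [False]) * 100
accuracy = two_bit_accuracy(loop)   # 0.9 when starting in the strongly-taken state
```

The single not-taken outcome at loop exit only moves the counter from 3 to 2, so the first iteration of the next pass is still predicted taken and only the exit itself is mispredicted.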
Correlating Branches
• (2,2) predictor
  – 2-bit global: indicates the behavior of the last two branches
  – 2-bit local (2-bit Dynamic Branch Prediction)
• Branch History Table
  – Global branch history is used to choose one of four history bitmap tables
  – Predicts the branch behavior, then updates only the selected bitmap table
[Figure: 4 bits of the branch address index into four tables; the 2-bit recent global branch history (01 = not taken then taken) selects which table supplies the prediction]
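The (2,2) organization can be sketched as four small tables of 2-bit counters selected by the global history. The class and method names, the initial counter value, and the use of PC bits 2 to 5 as the index are assumptions made for illustration, not details from the lecture.

```python
class CorrelatingPredictor:
    """(2,2) predictor sketch: 2 bits of global history select one of four
    tables of 2-bit counters, each indexed by 4 bits of the branch address."""

    def __init__(self, addr_bits=4, history_bits=2):
        self.addr_mask = (1 << addr_bits) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.history = 0  # most recent outcome in the low bit
        # Counters start at 1 (weakly not taken); this initial value is an assumption.
        self.tables = [[1] * (1 << addr_bits) for _ in range(1 << history_bits)]

    def _index(self, pc):
        return (pc >> 2) & self.addr_mask  # assumes 4-byte aligned instructions

    def predict(self, pc):
        return self.tables[self.history][self._index(pc)] >= 2

    def update(self, pc, taken):
        # Only the table selected by the current history is updated.
        table, i = self.tables[self.history], self._index(pc)
        table[i] = min(table[i] + 1, 3) if taken else max(table[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask

def count_mispredictions(predictor, pc, outcomes):
    """Predict, compare, then update for each outcome; return the number wrong."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong

predictor = CorrelatingPredictor()
warmup_misses = count_mispredictions(predictor, 0x40, [True, False] * 10)
steady_misses = count_mispredictions(predictor, 0x40, [True, False] * 50)
```

On a branch that strictly alternates taken and not taken, a plain 2-bit counter keeps flipping, while this sketch learns the pattern after a short warm-up because the two history values (01 and 10) select different counters.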
Accuracy of Different Schemes
[Figure: bar chart of frequency of mispredictions (0% to 20%) on nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li for three predictors: 4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1024-entry (2,2) BHT]
BHT Accuracy
• Mispredict because either:
  – Wrong guess for the branch
  – Wrong index for the branch
• 4096 entry table
  – Programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• For SPEC92
  – 4096 about as good as infinite table
Tournament Branch Predictors
• Correlating Predictor
  – 2-bit predictor failed on important branches
  – Better results by also using global information
• Tournament Predictors
  – 1 Predictor based on global information
  – 1 Predictor based on local information
  – Use the predictor that guesses better
[Figure: the branch address (addr) selects between Predictor A and Predictor B]
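The selection idea can be sketched with two small component predictors and a per-branch chooser counter. Everything below (table sizes, initial counter values, the index bits, and the rule of training the chooser only when the two components disagree) is an assumption chosen to keep the example short; it is not the organization of any particular processor.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter toward 3 (up) or toward 0 (down)."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)

class TournamentPredictor:
    def __init__(self, addr_bits=4, history_bits=2):
        self.addr_mask = (1 << addr_bits) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.history = 0
        self.local = [1] * (1 << addr_bits)           # indexed by branch address
        self.global_table = [1] * (1 << history_bits)  # indexed by global history
        self.chooser = [1] * (1 << addr_bits)          # >= 2 means trust the global predictor

    def _components(self, pc):
        i = (pc >> 2) & self.addr_mask  # assumes 4-byte aligned instructions
        return i, self.local[i] >= 2, self.global_table[self.history] >= 2

    def predict(self, pc):
        i, local_pred, global_pred = self._components(pc)
        return global_pred if self.chooser[i] >= 2 else local_pred

    def update(self, pc, taken):
        i, local_pred, global_pred = self._components(pc)
        if local_pred != global_pred:
            # Exactly one component was right; move the chooser toward it.
            self.chooser[i] = bump(self.chooser[i], global_pred == taken)
        self.local[i] = bump(self.local[i], taken)
        self.global_table[self.history] = bump(self.global_table[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask

def count_mispredictions(predictor, pc, outcomes):
    """Predict, compare, then update for each outcome; return the number wrong."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong

predictor = TournamentPredictor()
count_mispredictions(predictor, 0x40, [True, False] * 10)                  # warm-up
steady_misses = count_mispredictions(predictor, 0x40, [True, False] * 50)
```

On an alternating branch the local counter in this sketch is wrong every time while the history-indexed table is right, so the chooser saturates toward the global component and the combined predictor stops mispredicting after warm-up.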
Alpha 21264
• 4K 2-bit counters to choose from among a global predictor and a local predictor
• Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
  – 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
• Local predictor consists of a 2-level predictor:
  – Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
  – Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
Branch Prediction Accuracy
[Figure: bar chart of branch prediction accuracy (0% to 100%) on gcc, espresso, li, fpppp, doduc, and tomcatv for three predictors: Profile-based, 2-bit dynamic, and Tournament]
Accuracy versus Size
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for Local, Correlating, and Tournament predictors]
Branch Target Buffer
• Branch Target Buffer (BTB): the address of the branch is used as the index to get the prediction AND the branch target address (if taken)
  – Note: must check for a branch match now, since we can’t use the wrong branch address
[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC field; each entry holds Branch PC, Predicted PC, and extra prediction state bits]
• Yes: instruction is a branch, so use the predicted PC as the next PC
• No: branch not predicted, proceed normally (Next PC = PC+4)
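The lookup on this slide can be sketched with a dictionary keyed by the full branch PC, which stands in for the tag comparison. The class name, the method names, and the policy of dropping an entry when a branch is not taken are assumptions made for illustration; a real BTB is a fixed-size table with partial tags and extra prediction state.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}  # branch PC -> predicted target PC

    def next_pc(self, pc):
        """On a hit, fetch from the predicted target; otherwise fall through to PC + 4."""
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        """Record the outcome once the branch has resolved."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)  # assumed policy: forget branches that fall through

btb = BranchTargetBuffer()
before = btb.next_pc(10)   # miss: 14
btb.update(10, taken=True, target=36)
after = btb.next_pc(10)    # hit: 36
```

The addresses 10 and 36 are borrowed from the `beq r1,r3,36` example earlier in the lecture; any non-branch PC, such as 14, still yields PC + 4.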
Predicated Execution
• Built in Hardware Support
  – Bit for predicated instruction execution
  – Both paths are in the code
  – Execution based on the result of the condition
• No Branch Prediction is Required
  – Instructions not selected are ignored
  – Sort of inserting Nop
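The semantics of predication can be modeled in a few lines. This is only a behavioral sketch: the tuple format, the predicate names, and the `run_predicated` helper are hypothetical, and the Python `if` merely models the hardware's per-instruction predicate bit rather than removing a branch.

```python
def run_predicated(instructions, regs, preds):
    """Execute every instruction in order; an instruction whose predicate is
    false leaves the register state unchanged, like a nop."""
    for pred_name, dest, compute in instructions:
        if preds[pred_name]:
            regs[dest] = compute(regs)
    return regs

# Models: if (a > b) max = a; else max = b;
# Both paths are in the instruction stream; the predicates pick which one takes effect.
regs = {"a": 7, "b": 3, "max": 0}
preds = {"p": regs["a"] > regs["b"]}
preds["not_p"] = not preds["p"]

program = [
    ("p", "max", lambda r: r["a"]),       # executes only if p is true
    ("not_p", "max", lambda r: r["b"]),   # executes only if p is false
]
run_predicated(program, regs, preds)      # regs["max"] is now 7
```

Control flow never changes in this model: both instructions are always fetched, and the predicate only decides which result is written.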
Zero Cycle Jump
• What really has to be done at runtime?
  – Once an instruction has been detected as a jump or JAL, we might recode it in the internal cache.
  – Very limited form of dynamic compilation?
• Use of “Pre-decoded” instruction cache
  – Called “branch folding” in the Bell-Labs CRISP processor.
  – Original CRISP cache had two addresses and could thus fold a complete branch into the previous instruction
  – Notice that JAL introduces a structural hazard on write
A:
    and r3,r1,r5
    addi r2,r3,#4
    sub r4,r2,r1
    jal doit
    subi r1,r1,#1
Internal Cache state:
    and r3,r1,r5     A+4    N
    addi r2,r3,#4    A+8    N
    sub r4,r2,r1     doit   L
    ---              ---    --
    subi r1,r1,#1    A+20   N
Dynamic Branch Prediction Summary
• Prediction becoming important part of scalar execution
• Branch History Table
  – 2 bits for loop accuracy
• Correlation
  – Recently executed branches correlated with next branch.
  – Either different branches
  – Or different executions of same branches
• Tournament Predictor
  – More resources to competitive solutions and pick between them
• Branch Target Buffer
  – Branch address & prediction
• Predicated Execution
  – No need for Prediction
  – Hardware Support needed