1
A new case for the TAGE predictor
André Seznec, INRIA/IRISA
2
Branch prediction
• The simplest way to improve processor core performance:
Replacing the branch predictor with a more accurate one does not require changing the rest of the design
3
The TAGE branch predictor
• Introduced in 2006
State-of-the-art global history predictor
CBP-2 (2006), CBP-3 (2011)
4
TAGE: multiple tables, global history predictor
$L(0) = 0$, $L(i) = \alpha^{i-1} L(1)$
The set of history lengths forms a geometric series
most of the storage for short history !!
{0, 2, 4, 8, 16, 32, 64, 128}
Capture correlation on very long histories
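The geometric series of history lengths can be illustrated with a short sketch. The function name `geometric_history_lengths` and the rounding rule are assumptions made for illustration; the values L(1) = 2 and α = 2 are chosen only because they reproduce the example set shown on this slide.

```python
def geometric_history_lengths(num_tables, first_length, alpha):
    """Return [L(0), ..., L(num_tables - 1)] with L(0) = 0 and
    L(i) = alpha**(i - 1) * L(1), rounded to the nearest integer."""
    lengths = [0]  # L(0) = 0: the base predictor uses no history
    for i in range(1, num_tables):
        lengths.append(int(alpha ** (i - 1) * first_length + 0.5))
    return lengths


print(geometric_history_lengths(8, 2, 2))
# [0, 2, 4, 8, 16, 32, 64, 128]
```

A non-integer α (for instance 1.6) gives a denser set of lengths for the same maximum history; the rounding step is what keeps the lengths integral in that case.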
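5
TAGE: tagged tables and prediction by the longest history matching entry
[Figure: a tagless base predictor indexed with pc, and tagged tables indexed with pc and h[0:L1], h[0:L2], h[0:L3]; each tagged entry holds ctr, u and tag fields; a tag comparison (=?) on each table selects the prediction.]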
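The lookup shown on this slide can be sketched as follows. This is a minimal illustration, not the predictor's actual hardware hashing: the `fold` hash, the index and tag functions, the signed-counter convention (taken when the counter is non-negative) and all function names are assumptions. Only the selection rule, prediction by the matching entry with the longest history and fallback to the tagless base predictor, comes from the slide.

```python
def fold(history, length, bits):
    """Toy hash (assumption): XOR-fold the `length` most recent history bits
    into `bits` bits. Bit 0 of `history` is the most recent outcome."""
    h = history & ((1 << length) - 1)
    folded = 0
    while h:
        folded ^= h & ((1 << bits) - 1)
        h >>= bits
    return folded


def index_and_tag(pc, history, length, table_size, tag_bits=8):
    """Hypothetical index/tag functions; table_size is a power of two >= 2."""
    index_bits = table_size.bit_length() - 1
    index = (pc ^ fold(history, length, index_bits)) % table_size
    tag = (pc ^ (pc >> 3) ^ fold(history, length, tag_bits)) & ((1 << tag_bits) - 1)
    return index, tag


def tage_predict(pc, history, base, tagged, lengths, tag_bits=8):
    """Return (taken, provider). provider is 0 for the base predictor, or the
    number (1..N) of the matching tagged table with the longest history.
    `tagged` and `lengths` are ordered by increasing history length."""
    taken = base[pc % len(base)] >= 0          # tagless base predictor
    provider = 0
    for number, (table, length) in enumerate(zip(tagged, lengths), start=1):
        index, tag = index_and_tag(pc, history, length, len(table), tag_bits)
        entry = table[index]                   # None, or {"ctr", "u", "tag"}
        if entry is not None and entry["tag"] == tag:
            taken, provider = entry["ctr"] >= 0, number  # longer history wins
    return taken, provider
```

Because the loop walks the tables from the shortest to the longest history, the last tag match overrides earlier ones, which is one simple way to express "longest matching history provides the prediction".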
6
Why TAGE
• State-of-the-art global history predictor
• But also:
Large cost-effective design space:
32 Kbits to 512 Kbits
4 to 12 tables
100 to 1000 history bits
Confidence estimator for free [HPCA2011]
7
And more in this presentation
• Cost-effective hardware complexity and energy consumption
Limited number of accesses to predictor tables
Use of single-ported predictor tables
• Improving TAGE accuracy with a small side predictor
Tracking statistical correlation
Using local history
8
The implicit 3-access scenario in academic studies
• A prediction on the right path:
Read at prediction time
Update at retire time: re-read, then update
3 accesses on the same predictor entry !!
Might lead to the usage of 3-ported components
9
Why not only 2 accesses, by propagating the values read at prediction time?
• A loop, a bimodal predictor
[Figure: Fetch / Execute / Retire timeline of successive instances of the loop branch with bimodal counter C; labels: « C=0, misprediction », « C=1, misprediction », « C=2, correct prediction ».]
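The effect of propagating stale counter values can be reproduced with a toy model. Everything here is an assumption made for illustration: an always-taken loop branch, a 2-bit counter starting at 0 that predicts taken when it is at least 2, and a fixed number `delay` of branches in flight between fetch and retire. It is not the exact scenario drawn on the slide.

```python
from collections import deque


def count_mispredictions(n_branches, delay, reread_at_retire):
    """Count mispredictions of an always-taken branch on a 2-bit counter.

    With reread_at_retire=True the update increments the current counter
    (3 accesses); otherwise it increments the value read at prediction time
    and writes that back (2 accesses)."""
    counter = 0
    inflight = deque()          # counter values read at fetch, oldest first
    mispredictions = 0
    for _ in range(n_branches):
        if len(inflight) == delay:              # oldest branch retires
            value_at_fetch = inflight.popleft()
            start = counter if reread_at_retire else value_at_fetch
            counter = min(3, start + 1)         # outcome is always taken
        value_read = counter                    # fetch: read and predict
        if value_read < 2:
            mispredictions += 1
        inflight.append(value_read)
    return mispredictions


print(count_mispredictions(20, 4, reread_at_retire=True))    # 5
print(count_mispredictions(20, 4, reread_at_retire=False))   # 8
```

In this toy setting the stale writes keep overwriting the counter with 1 until the first branch that read 1 retires, which is why warming up takes longer without the re-read.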
10
Is it that important for global history predictors?
• Only 2 accesses:
33 % increase in misprediction rate on gshare
17 % increase in misprediction rate on GEHL
4 % increase in misprediction rate on TAGE
Using 3rd Championship Branch Prediction framework
11
Reducing the number of predictor writes
• At retire time: lots of silent updates (rewriting saturated counters) [Baniasadi and Moshovos 2003]
~2.1 writes per misprediction for TAGE
fewer than 10 writes per 100 branches
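A silent update is an update that would write back the value already stored. A minimal sketch of filtering such writes is shown below; the 3-bit signed counter range and the name `update_counter` are assumptions for illustration.

```python
def update_counter(ctr, taken, ctr_min=-4, ctr_max=3):
    """Return (new_ctr, write_needed) for a saturating counter update."""
    new_ctr = min(ctr + 1, ctr_max) if taken else max(ctr - 1, ctr_min)
    # The update is silent when the stored value would not change,
    # for example when a saturated counter is confirmed once more.
    return new_ctr, new_ctr != ctr


print(update_counter(3, True))   # (3, False): saturated, the write is skipped
print(update_counter(2, True))   # (3, True): the value changes, a write is needed
```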
12
Eliminating most of the reads at retire time
• On correct predictions, do not re-read, but use the values read at prediction time:
gshare: +4.5 % mispredictions
GEHL: +8.8 % mispredictions
TAGE: +1.3 % mispredictions
TAGE: 1.13 accesses {prediction + (read at retire) + update} per prediction on the correct path
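The retire-time policy on this slide can be sketched as a function that reports how many table accesses one retiring branch costs. This is a simplified single-counter model under stated assumptions: the name `retire_accesses`, the `read_entry` callback and the 3-bit counter range are hypothetical, and a real TAGE update touches several fields and possibly several tables.

```python
def retire_accesses(mispredicted, ctr_at_prediction, taken, read_entry):
    """Return (reads, writes, new_ctr) for one branch at retire time.

    read_entry is a callable that returns the counter currently stored
    in the table; it is only invoked on a misprediction."""
    reads = 0
    ctr = ctr_at_prediction            # value propagated from prediction time
    if mispredicted:
        ctr = read_entry()             # re-read: the entry may have changed
        reads = 1
    new_ctr = min(ctr + 1, 3) if taken else max(ctr - 1, -4)
    writes = 1 if new_ctr != ctr else 0   # silent updates are not written
    return reads, writes, new_ctr


# Correct prediction on a saturated counter: no access at all at retire.
print(retire_accesses(False, 3, True, read_entry=lambda: 3))   # (0, 0, 3)
# Misprediction: one re-read and one write.
print(retire_accesses(True, 0, False, read_entry=lambda: 1))   # (1, 1, 0)
```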
13
Opens the opportunity to use single-ported memory components
• Cycle stealing: wait for free cycles to update
Mispredictions
Fetch gating
Front-end stalls
Complex management... and impact on accuracy?
14
A simple and general scheme using single-ported components
15
4-way interleaved single ported
• 4 banks per predictor table
• Guarantee that 3 consecutive predictions are done by 3 different banks:
Prediction for Z after X and Y:
b(Z) = Z & 3
while ((b(Z) == b(X)) || (b(Z) == b(Y))) b(Z) = (b(Z) + 1) & 3
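The bank selection rule on this slide can be written as a runnable sketch. The names `select_bank` and `assign_banks` are hypothetical, and treating Z as a plain integer whose low bits give the default bank is an assumption; how the entry is then located inside the chosen bank is not covered here.

```python
def select_bank(z, recent_banks, num_banks=4):
    """Pick the bank for prediction Z, given the banks used by the (up to two)
    previous predictions X and Y. num_banks must be a power of two."""
    bank = z & (num_banks - 1)
    # At most two banks are excluded, so with four banks the loop terminates.
    while bank in recent_banks:
        bank = (bank + 1) & (num_banks - 1)
    return bank


def assign_banks(values):
    """Assign a bank to each prediction in sequence."""
    banks = []
    for z in values:
        banks.append(select_bank(z, banks[-2:]))
    return banks


print(assign_banks([0, 0, 0, 0, 4, 1]))
# [0, 1, 2, 0, 1, 2]
```

Even when every prediction would default to bank 0, any three consecutive predictions end up in three different banks.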
16
• Read at prediction has priority
• Read at retire is delayed by at most one cycle
• Write update is delayed by at most two cycles
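The priority rules above can be illustrated with a toy arbiter for one single-ported bank. The names `PRIORITY` and `serve_one_bank` are hypothetical. The example relies on the interleaving guarantee stated on the previous slide: after a prediction hits a bank, no new prediction targets that bank during the next two cycles, so none arrives in the example.

```python
PRIORITY = ("predict", "retire_read", "update")   # highest priority first


def serve_one_bank(arrivals, num_cycles):
    """Serve at most one access per cycle on a single-ported bank.

    arrivals maps a cycle number to the request kinds arriving that cycle.
    Returns a list of (cycle, kind, delay_in_cycles) for served accesses."""
    pending = []                                   # (kind, arrival_cycle)
    served = []
    for cycle in range(num_cycles):
        pending += [(kind, cycle) for kind in arrivals.get(cycle, [])]
        if pending:
            choice = min(pending, key=lambda req: PRIORITY.index(req[0]))
            pending.remove(choice)
            served.append((cycle, choice[0], cycle - choice[1]))
    return served


# All three access kinds target the same bank in the same cycle.
print(serve_one_bank({0: ["update", "retire_read", "predict"]}, 3))
# [(0, 'predict', 0), (1, 'retire_read', 1), (2, 'update', 2)]
```

In this situation the read at retire waits one cycle and the write waits two, which matches the bounds listed in the bullets.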
17
Worst case for an update
[Figure: banks B0, B1, B2 and B3; accesses labelled Pa, Rt and Un at T=0.]
Prediction has priority
no prediction for at least 2 cycles
no extra read at retire time and no update for 2 cycles
20
4-way interleaved vs 3-ported TAGE predictor
• 0.5 % increase in misprediction rate
• 3.3x decrease in silicon area of the predictor tables
• 2x decrease in energy per table access
Also works for the other global history predictors
22
Two classes of branches not that well predicted by TAGE
« Statistically » correlated branches:
• Not really correlated with the global history, but exhibit a bias
• Sometimes better predicted by a single wide PC-indexed counter than by TAGE
Branches correlated with local history:
• No problem if very regular global history
• TAGE cannot learn the pattern if irregular
• Not just the loops with constant iteration numbers
23
The Statistical Corrector predictor (from 3rd Championship Branch Prediction)
• Poor correlation with global history, but some bias
Track cases such that:
« In this case (PC, history, prediction), TAGE is likely (>50 %) to mispredict »
AND REVERSE THE PREDICTION !!
Tree of adders captures the « average behavior »
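The corrector idea can be sketched as a small GEHL-like adder tree whose tables are indexed with the PC, a slice of history and the TAGE prediction. This is a simplified illustration under several assumptions: the hash in `sc_index`, the way the TAGE counter enters the sum, the counter range and the always-train update policy are all hypothetical choices, not the scheme used in the championship predictor.

```python
def sc_index(pc, history, tage_taken, length, size):
    """Toy hash (assumption) of (PC, TAGE prediction, `length` history bits)."""
    key = (pc << 1) | int(tage_taken)
    history_slice = history & ((1 << length) - 1)
    return (key ^ ((history_slice * 0x9E3779B1) >> 5)) % size


def sc_predict(pc, history, tage_taken, tage_ctr, tables, lengths):
    """Sum centered counters; the sign of the sum is the final prediction."""
    total = 2 * tage_ctr + 1                     # TAGE counter value, centered
    for table, length in zip(tables, lengths):
        ctr = table[sc_index(pc, history, tage_taken, length, len(table))]
        total += 2 * ctr + 1
    return total >= 0, total


def sc_update(pc, history, tage_taken, tables, lengths, taken, lo=-32, hi=31):
    """Train every selected counter toward the outcome (simplified policy)."""
    for table, length in zip(tables, lengths):
        i = sc_index(pc, history, tage_taken, length, len(table))
        table[i] = min(table[i] + 1, hi) if taken else max(table[i] - 1, lo)


tables = [[0] * 64 for _ in range(3)]
lengths = [0, 4, 8]
print(sc_predict(0x40, 0b1010, True, 0, tables, lengths))   # (True, 4)
for _ in range(4):                # TAGE says taken, the branch is not taken
    sc_update(0x40, 0b1010, True, tables, lengths, taken=False)
print(sc_predict(0x40, 0b1010, True, 0, tables, lengths))   # (False, -20)
```

After a few outcomes that contradict TAGE in the same (PC, history, prediction) context, the sum changes sign and the final prediction is the reverse of the TAGE prediction.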
24
Statistical Corrector Predictor
[Figure: TAGE (inputs H, A) feeds « Prediction + ctr value » into the Stat. Corr., a GEHL-like adder tree with inputs H, A and Pred.]
2.5 % misprediction rate decrease
25
Use the same principle for local history biased branches !
26
Local Statistical Corrector Predictor
[Figure: TAGE (inputs H, A) feeds « Prediction + ctr value » into the Local Stat. Corr., an LGEHL-like adder tree with inputs LH (local hist.), A and Pred; size labels: 478 Kbits and 30 Kbits.]
27
Local Statistical Corrector Predictor
• 8-9 % misprediction rate decrease over TAGE
local history correlation AND statistically biased branches
• No need for loop predictor
• Small local history tables (32-64 entries)
State-of-the-art prediction accuracy, without the unrealistic tricks used at 3rd CBP
28
Managing speculative local history: not that easy
[Figure: a window of inflight branches, each entry holding S(peculative) H(istory) and P(rogram) C(ounter); a direct-mapped local history table indexed with PC; the local history and the TAGE prediction feed the Stat. Corr., which delivers the prediction.]
SH = (SH << 1) + pred
29
Major local history management cost
• The associative search on the inflight branches
Can be leveraged for another goal !!
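The associative search can be sketched as follows. It is a minimal software model under stated assumptions: the inflight window is a Python list ordered oldest first, entries are matched on the local history table index derived from the PC, and the names `speculative_local_history` and `predict_and_record` as well as the 11-bit history width are hypothetical.

```python
def speculative_local_history(pc, inflight, local_table):
    """Return the most recent local history for this branch.

    inflight holds (pc, sh) pairs for not-yet-retired branches, oldest first;
    sh is the speculative history after that branch's own prediction."""
    index = pc % len(local_table)
    for other_pc, sh in reversed(inflight):        # youngest match wins
        if other_pc % len(local_table) == index:
            return sh
    return local_table[index]                      # committed history


def predict_and_record(pc, pred, inflight, local_table, hist_bits=11):
    """Fetch-time step: get the history used to predict, then append the
    new speculative history SH = (SH << 1) + pred to the inflight window."""
    history = speculative_local_history(pc, inflight, local_table)
    sh = ((history << 1) + int(pred)) & ((1 << hist_bits) - 1)
    inflight.append((pc, sh))
    return history


local_table = [0] * 32
local_table[5] = 0b101
inflight = []
print(bin(predict_and_record(5, True, inflight, local_table)))    # 0b101
print(bin(predict_and_record(5, False, inflight, local_table)))   # 0b1011
```

The second call finds the first one in the inflight window, so it sees the speculative bit that has not yet reached the committed table.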
30
The « late update » mispredictions
• Issue: some mispredictions are due to late updates at retirement (later than resolution time)
• Immediate Update Mimicker: Try to catch these cases
31
The Immediate Update Mimicker for TAGE
[Figure: window of inflight branches, from Fetch onward, each entry marked PTA or ETA: P(rediction) or E(xecuted), T(able), A(ddress in the table); labels: « Same table, same entry », « Misprediction ».]
1 % misp. rate decrease
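One way to read the « same table, same entry » mechanism is sketched below. This is a simplified interpretation, not a description of the actual hardware: the tuple layout of the inflight entries and the name `ium_predict` are assumptions, and the sketch simply reuses the outcome of the youngest already-executed inflight branch that was predicted by the same table and entry.

```python
def ium_predict(tage_taken, table, address, inflight):
    """Return the prediction after consulting the inflight window.

    inflight holds tuples (state, table, address, outcome), oldest first,
    where state is "P" (predicted only) or "E" (executed, not yet retired)
    and outcome is None until the branch has executed."""
    for state, t, a, outcome in reversed(inflight):     # youngest first
        if state == "E" and (t, a) == (table, address):
            # The predictor entry has not been updated yet (late update at
            # retirement), so mimic the update with the known outcome.
            return outcome
    return tage_taken


inflight = [("E", 3, 17, False), ("P", 3, 17, None), ("P", 1, 4, None)]
print(ium_predict(True, 3, 17, inflight))   # False: same table, same entry
print(ium_predict(True, 2, 17, inflight))   # True: no match, keep TAGE
```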
32
The Immediate Update Mimicker
• Marginal accuracy gain
• But can be combined with speculative local history management
34
Against alternative predictors
• Outperforms the (not so realistic) podium of 3rd Championship Branch Prediction
ISL-TAGE
FTL++: GEHL+LGEHL based
OH-SNAP: piecewise linear + varying weights
Particularly on the most predictable benchmarks
35
Putting all together
• Complexity and energy
4-way interleaved tables
Reduced accesses at retire time
• Accuracy
Local Statistical Corrector Predictor
Immediate Update Mimicker
≈ State-of-the-art predictor
Cost effective: silicon, energy
36
Conclusion
• Made a new case for TAGE:
Already known:
State-of-the-art global history predictor
Confidence estimation for free
Established:
Area and energy effective implementation with single-ported components
Accuracy improved with Local Statistical Corrector Predictor