

1

A New Case for the TAGE Predictor

André Seznec

INRIA/IRISA

2

Branch prediction

• The simplest way to improve processor core performance:

Replacing the branch predictor with a more accurate one does not require changing the rest of the design

3

The TAGE branch predictor

• Introduced in 2006

State-of-the-art global history predictor

CBP-2 (2006), CBP-3 (2011)

4

TAGE: multiple tables, global history predictor

$L(0) = 0$, $L(i) = \alpha^{i-1} L(1)$

The set of history lengths forms a geometric series

Most of the storage goes to short histories!

{0, 2, 4, 8, 16, 32, 64, 128}

Capture correlation on very long histories
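The geometric series of history lengths can be generated in a few lines. This is only a sketch: the function name and its parameters (`first_length` for L(1), `alpha` for the ratio) are illustrative, and rounding to the nearest integer is an assumption for non-integer ratios.

```python
def geometric_history_lengths(first_length, alpha, num_tagged_tables):
    """Return [L(0), L(1), ..., L(n)] with L(0) = 0 and L(i) = alpha**(i-1) * L(1)."""
    lengths = [0]  # L(0): the base predictor uses no history
    for i in range(1, num_tagged_tables + 1):
        # Round to the nearest integer because alpha need not be an integer.
        lengths.append(int(alpha ** (i - 1) * first_length + 0.5))
    return lengths


print(geometric_history_lengths(2, 2, 7))
```

With L(1) = 2 and a ratio of 2, this reproduces the set {0, 2, 4, 8, 16, 32, 64, 128} shown on the slide.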

5

TAGE: tagged, and prediction by the longest history matching entry

[Figure: a tagless base predictor indexed with pc, and tagged tables indexed with pc and h[0:L1], pc and h[0:L2], pc and h[0:L3]; each tagged entry holds (ctr, u, tag); tag comparators (=?) select the prediction]
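The lookup can be sketched as follows. This is a toy model, not the hardware design: the `fold` hash, the table sizes, the counter encodings and the names `TaggedEntry` and `tage_predict` are all assumptions made for illustration. Only the selection rule (the hit with the longest history provides the prediction, the tagless base table is the fallback) comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class TaggedEntry:
    ctr: int = 0    # signed prediction counter; >= 0 means "taken"
    u: int = 0      # useful counter, only used by replacement (not modeled here)
    tag: int = -1   # -1 marks an empty entry


def fold(pc, history_bits, length, width):
    """Toy hash of pc and the `length` most recent history bits into `width` bits."""
    mask = (1 << width) - 1
    h = history_bits & ((1 << length) - 1)
    value = pc ^ (pc >> width)
    while h:
        value ^= h & mask
        h >>= width
    return value & mask


def tage_predict(pc, history_bits, base, tagged, lengths, index_bits=10, tag_bits=8):
    """Return (taken, provider); provider is 0 for the base table, else the table number."""
    # Default prediction: tagless base predictor (2-bit counters) indexed with pc only.
    taken, provider = base[pc % len(base)] >= 2, 0
    # Tables are ordered by increasing history length, so the last hit wins.
    for number, (table, length) in enumerate(zip(tagged, lengths), start=1):
        entry = table[fold(pc, history_bits, length, index_bits)]
        if entry.tag == fold(pc * 31, history_bits, length, tag_bits):
            taken, provider = entry.ctr >= 0, number
    return taken, provider
```

Each tagged table is expected to hold `2**index_bits` entries; allocation and update of entries are left out of the sketch.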

6

Why TAGE

• State-of-the-art global history predictor

• But also: Large cost-effective design space:

32 Kbits to 512 Kbits

4 to 12 tables

100 to 1000 history bits

Confidence estimator for free [HPCA2011]

7

And more in this presentation

• Cost-effective hardware complexity and energy consumption

Limited number of accesses to predictor tables

Use of single-ported predictor tables

• Improving TAGE accuracy with a small side predictor

Tracking statistical correlation

Using local history

8

The implicit 3-access scenario in academic studies

• A prediction on the right path:

Read at prediction time

Update at retire time: re-read, then update

3 accesses on the same predictor entry!

Might lead to the use of 3-ported components

9

Why not only 2 accesses, by propagating the values read at prediction time?

• A loop, a bimodal predictor

[Figure: Fetch / Execute / Retire timeline for successive iterations of the loop branch, annotated with the counter value C seen by each prediction: C=0, misprediction; C=1, misprediction; C=2, correct prediction]
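The effect of dropping the re-read can be reproduced with a toy model of the loop example: one always-taken branch, one 2-bit counter, and a fixed number of branches in flight between prediction and retirement. The function name, the fixed `retire_delay` and the ordering (retire first, then predict) are assumptions of this sketch.

```python
def count_mispredictions(num_branches, retire_delay, reread_at_retire):
    """Always-taken loop branch predicted by one 2-bit counter (taken if >= 2)."""
    counter = 0               # starts strongly not-taken
    value_at_prediction = []  # counter value read when each branch was predicted
    mispredictions = 0
    for t in range(num_branches):
        # Branch (t - retire_delay) retires first and updates the counter.
        if t >= retire_delay:
            if reread_at_retire:
                old = counter                                # third access: fresh read
            else:
                old = value_at_prediction[t - retire_delay]  # propagated, possibly stale
            counter = min(3, old + 1)                        # the outcome is always taken
        # Then branch t is predicted with the current counter value.
        value_at_prediction.append(counter)
        if counter < 2:
            mispredictions += 1
    return mispredictions
```

With 20 branches and a delay of 4, the model gives 5 mispredictions with the re-read and 8 without it: each update computed from a stale value only rewrites what an earlier update already wrote, so the counter warms up more slowly.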

10

Is it that important for global history predictors?

• Only 2 accesses:

33 % increase in misprediction rate on gshare

17 % increase in misprediction rate on GEHL

4 % increase in misprediction rate on TAGE

Measured with the 3rd Championship Branch Prediction framework

11

Reducing the number of predictor writes

• At retire time:

Lots of silent updates (rewriting saturated counters) [Baniasadi and Moshovos 2003]

~2.1 writes per misprediction for TAGE

Fewer than 10 writes per 100 branches
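A silent update is one that would rewrite the value already stored, typically on a saturated counter. Filtering it can be sketched as below; `update_counter` and the 2-bit range are illustrative assumptions.

```python
def update_counter(table, index, taken, max_value=3):
    """Update a saturating counter; return True only if a write was needed."""
    old = table[index]
    new = min(max_value, old + 1) if taken else max(0, old - 1)
    if new == old:
        return False    # silent update: the counter is already saturated, skip the write
    table[index] = new  # real write
    return True


table = [2]
writes = sum(update_counter(table, 0, taken=True) for _ in range(100))
print(writes)
```

Here a branch that is taken 100 times in a row costs a single write, the one that saturates the counter.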

12

Eliminating most of the reads at retire time

• On correct predictions, do not re-read, but use the values read at prediction time:

gshare: +4.5 % mispredictions

GEHL: +8.8 % mispredictions

TAGE: +1.3 % mispredictions

TAGE: 1.13 accesses {prediction + (read at retire) + update} per prediction on the correct path
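The policy can be sketched for a single 2-bit counter: re-read only after a misprediction, otherwise reuse the value propagated from prediction time, and skip the write when the value does not change. `retire_accesses` is a hypothetical helper; the access count it returns comes on top of the one read made at prediction time.

```python
def retire_accesses(table, index, value_at_prediction, taken, mispredicted):
    """Apply the retire-time update; return how many table accesses it cost."""
    accesses = 0
    if mispredicted:
        old = table[index]         # re-read only after a misprediction
        accesses += 1
    else:
        old = value_at_prediction  # reuse the value read at prediction time
    new = min(3, old + 1) if taken else max(0, old - 1)
    if new != old:                 # silent updates are not written
        table[index] = new
        accesses += 1
    return accesses
```

A correctly predicted branch on a saturated counter therefore costs no access at retire time, while a misprediction costs at most two.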

13

Opens the opportunity to use single-ported memory components

• Cycle stealing: wait for free cycles to update

Mispredictions

Fetch gating

Front-end stalls

Complex management... and impact on accuracy?

14

A simple and general scheme using single-ported components

15

4-way interleaved, single-ported

• 4 banks per predictor table

• Guarantee that 3 consecutive predictions are done by 3 different banks:

Prediction for Z after X and Y:

b(Z) = Z & 3

while ((b(Z) == b(X)) || (b(Z) == b(Y))) b(Z) = (b(Z) + 1) & 3
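The bank selection rule can be written as a small function and checked on a sequence of predictions. The names `select_bank` and `assign_banks` are illustrative, and using `None` for "no previous prediction" is an assumption of this sketch.

```python
NUM_BANKS = 4


def select_bank(z, bank_x, bank_y):
    """Bank for prediction Z, given the banks used by the two previous predictions."""
    bank = z & (NUM_BANKS - 1)
    # At most two banks are excluded, so the loop ends after at most two steps.
    while bank == bank_x or bank == bank_y:
        bank = (bank + 1) & (NUM_BANKS - 1)
    return bank


def assign_banks(addresses):
    """Assign a bank to each prediction of a sequence, in order."""
    banks = []
    for z in addresses:
        prev1 = banks[-1] if len(banks) >= 1 else None
        prev2 = banks[-2] if len(banks) >= 2 else None
        banks.append(select_bank(z, prev2, prev1))
    return banks


print(assign_banks([0, 0, 0, 0]))
```

Even when four consecutive predictions map to the same natural bank, the result is [0, 1, 2, 0]: any 3 consecutive predictions use 3 different banks.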

16

• Read at prediction has priority

• Read at retire is delayed by at most one cycle

• Write update is delayed by at most two cycles

17

[Figure: banks B0, B1, B2, B3; one bank has three pending accesses: Pa (prediction), Rt (read at retire), Un (update)]

T=0

Prediction has priority

No prediction for at least 2 cycles

Worst case for an update

No extra read at retire time and no update for 2 cycles

18

[Figure: banks B0, B1, B2, B3; pending accesses on the bank: Rt, Un]

T=1

No prediction by construction

Read at retire time

19

[Figure: banks B0, B1, B2, B3; pending access on the bank: Un]

T=2

No prediction and no read at retire time by construction
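The worst case of the three slides above can be replayed with a toy arbiter for one single-ported bank. The priority order comes from the slides; the function `serve_bank`, its input format, and the assumption that no other access targets this bank during the three cycles are illustrative.

```python
PRIORITY = {"P": 0, "Rt": 1, "Un": 2}  # prediction > read at retire > update


def serve_bank(arrivals, num_cycles):
    """arrivals: {cycle: [access kinds]} for one bank. Returns [(kind, arrived, served)]."""
    pending, served = [], []
    for cycle in range(num_cycles):
        pending.extend((PRIORITY[kind], cycle, kind) for kind in arrivals.get(cycle, []))
        if pending:
            pending.sort()                     # highest priority first, then oldest
            _, arrived, kind = pending.pop(0)  # single port: one access per cycle
            served.append((kind, arrived, cycle))
    return served


print(serve_bank({0: ["Un", "Rt", "P"]}, num_cycles=4))
```

When the three accesses collide at T=0, the prediction is served at once, the read at retire one cycle late, and the update two cycles late, which matches the bounds stated earlier.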

20

4-way interleaved vs 3-ported TAGE predictor

• 0.5 % increase in misprediction rate

• 3.3x reduction in silicon area of the predictor tables

• 2x reduction in energy per table access

Also works for the other global history predictors

21

Improving TAGE accuracy with a small side predictor

22

Two classes of branches not that well predicted by TAGE

« Statistically » correlated branches:

• Not really correlated with the global history, but exhibit a bias

• Sometimes better predicted by a single wide PC-indexed counter than by TAGE

Branches correlated with local history:

• No problem if very regular global history

• TAGE cannot learn the pattern if irregular

• Not just the loops with constant iteration numbers

23

The Statistical Corrector predictor (from the 3rd Championship Branch Prediction)

• Poor correlation with global history, but some bias

Track cases such that:

« In this case (PC, history, prediction), TAGE is likely (>50 %) to mispredict »

AND REVERSE THE PREDICTION!

Tree of adders captures the « average behavior »

24

Statistical Corrector Predictor

[Figure: TAGE provides its prediction and counter value; a GEHL-like Stat. Corr. component is indexed with history (H), address (A) and the TAGE prediction (Pred); adders (+) combine the outputs]

2.5 % misprediction rate decrease
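A much simplified corrector can illustrate the principle. In this sketch the counters record whether TAGE tends to be right for a given (PC, history, prediction) case, their centered sum plays the role of the tree of adders, and a negative sum reverses the prediction. The class name, hash, table sizes, threshold and the "agree / disagree" encoding are all assumptions; the TAGE counter value, which the figure also feeds to the adders, is left out.

```python
class StatisticalCorrector:
    def __init__(self, history_lengths=(0, 4, 8), index_bits=8, ctr_max=31, threshold=8):
        self.history_lengths = history_lengths
        self.size = 1 << index_bits
        self.ctr_max = ctr_max
        self.threshold = threshold
        self.tables = [[0] * self.size for _ in history_lengths]

    def _indices(self, pc, history, tage_taken):
        for length in self.history_lengths:
            h = history & ((1 << length) - 1)
            # The TAGE prediction is part of the index, as in (PC, history, prediction).
            yield (((pc ^ (h * 7)) << 1) | int(tage_taken)) % self.size

    def _sum(self, pc, history, tage_taken):
        indices = self._indices(pc, history, tage_taken)
        # Centered counters (2c + 1), summed over all tables.
        return sum(2 * table[i] + 1 for table, i in zip(self.tables, indices))

    def predict(self, pc, history, tage_taken):
        """Follow TAGE unless the counters say it is likely to mispredict."""
        agree = self._sum(pc, history, tage_taken) >= 0
        return tage_taken if agree else not tage_taken

    def update(self, pc, history, tage_taken, taken):
        total = self._sum(pc, history, tage_taken)
        tage_correct = tage_taken == taken
        final_correct = (total >= 0) == tage_correct
        if final_correct and abs(total) >= self.threshold:
            return  # confident and correct: no update needed
        step = 1 if tage_correct else -1
        for table, i in zip(self.tables, self._indices(pc, history, tage_taken)):
            table[i] = max(-self.ctr_max - 1, min(self.ctr_max, table[i] + step))
```

A branch that TAGE keeps predicting taken while it resolves not-taken drives the sum negative after a few updates, and the corrector then reverses the TAGE prediction for that case only.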

25

Use the same principle for local history biased branches!

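26

Local Statistical Corrector Predictor

[Figure: TAGE provides its prediction and counter value; an LGEHL-like Local Stat. Corr. component is indexed with local history (L), history (H), address (A) and the TAGE prediction (Pred); adders (+) combine the outputs. Storage labels: 478 Kbits, 30 Kbits]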

27

Local Statistical Corrector Predictor

• 8-9 % misprediction rate decrease over TAGE

Local history correlation AND statistically biased branches

• No need for a loop predictor

• Small local history tables (32-64 entries)

State-of-the-art prediction accuracy, without the unrealistic tricks used at the 3rd CBP

28

Managing speculative local history: not that easy

[Figure: a window of in-flight branches, each holding S(peculative) H(istory) and P(rogram) C(ounter), is searched with the PC in parallel with a direct-mapped Local History Table; the selected local history and the TAGE prediction feed the Stat. Corr., which outputs the prediction; the new entry is written with SH = (SH << 1) + pred]
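The mechanism in the figure can be sketched in software: the youngest in-flight branch with the same PC provides the speculative history, otherwise the direct-mapped table does. The class and method names, the 64-entry table, the history width and the repair policy are assumptions of this sketch; only the rule SH = (SH << 1) + pred is taken from the slide.

```python
class SpeculativeLocalHistory:
    def __init__(self, num_entries=64, history_bits=11):
        self.table = [0] * num_entries  # direct-mapped, written at retire time
        self.mask = (1 << history_bits) - 1
        self.inflight = []              # oldest first: (pc, speculative history)

    def lookup(self, pc):
        """Local history to use for predicting the branch at pc."""
        for inflight_pc, sh in reversed(self.inflight):  # associative search, youngest first
            if inflight_pc == pc:
                return sh
        return self.table[pc % len(self.table)]

    def push(self, pc, pred):
        """Record a newly predicted branch: SH = (SH << 1) + pred."""
        sh = ((self.lookup(pc) << 1) + int(pred)) & self.mask
        self.inflight.append((pc, sh))

    def retire_oldest(self):
        """Commit the oldest in-flight branch to the local history table."""
        pc, sh = self.inflight.pop(0)
        self.table[pc % len(self.table)] = sh

    def repair(self, position, taken):
        """On a misprediction, squash younger branches and fix the last history bit."""
        pc, sh = self.inflight[position]
        self.inflight = self.inflight[:position] + [(pc, (sh & ~1) | int(taken))]
```

The linear search in `lookup` stands in for the associative search on in-flight branches, which is the costly part in hardware.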

29

Major local history management cost

• The associative search on the in-flight branches

Can be leveraged for another goal!

30

The « late update » mispredictions

• Issue: some mispredictions are due to late updates, performed at retirement (later than resolution time)

• Immediate Update Mimicker: Try to catch these cases

31

The Immediate Update Mimicker for TAGE

[Figure: queue of in-flight branches, starting at Fetch, each labeled PTA or ETA: P(rediction) or E(xecuted), T(able), A(ddress in the table). Annotations: « Same table, same entry », « Misprediction »]

1 % misprediction rate decrease

32

The Immediate Update Mimicker

• Marginal accuracy gain

• But can be combined with speculative local history management
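One possible reading of the Immediate Update Mimicker is sketched below: when a newly fetched branch hits the same table and the same entry as an in-flight branch that has already executed, the executed direction is used instead of the not-yet-updated TAGE prediction. The record layout, the function name and the choice of the youngest executed match are assumptions, not a description of the actual hardware.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InflightBranch:
    table: int                       # predictor table that provided the prediction
    index: int                       # entry in that table
    predicted: bool                  # direction predicted at fetch
    executed: Optional[bool] = None  # resolved direction, None until execution


def ium_predict(inflight, table, index, tage_taken):
    """Return the executed direction of a matching in-flight branch, else the TAGE prediction."""
    for branch in reversed(inflight):  # youngest first
        same_entry = (branch.table, branch.index) == (table, index)
        if same_entry and branch.executed is not None:
            return branch.executed     # mimic the update that retirement has not done yet
    return tage_taken
```

The search over in-flight branches is the same kind of associative search that speculative local history management already needs, which is why the two can share it.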

33

[Figure: MPPKI as a function of storage budget]

34

Against alternative predictors

• Outperforms the (not so realistic) podium of the 3rd Championship Branch Prediction:

ISL-TAGE

FTL++: GEHL + LGEHL based

OH-SNAP: piecewise linear + varying weights

Particularly on the most predictable benchmarks

35

Putting all together

• Complexity and energy

4-way interleaved tables

Reduced accesses at retire time

• Accuracy

Local Statistical Corrector Predictor

Immediate Update Mimicker

≈ State-of-the-art predictor

Cost-effective: silicon, energy

36

Conclusion

• Made a new case for TAGE:

Already known:

State-of-the-art global history predictor

Confidence estimation for free

Established:

Area- and energy-effective implementation with single-ported components

Accuracy improved with the Local Statistical Corrector Predictor

37

38

Some « hope » on less predictable benchmarks