
Extending the PPM Branch Predictor

Zenaide Carvalho da Silva, Univ. Federal do Pará, Depto de Informática, Belém - PA - Brazil, [email protected]
João Angelo Martini, Univ. Estadual de Maringá, Depto de Informática, Maringá - PR - Brazil, [email protected]
Ronaldo Augusto Lara Gonçalves, Univ. Estadual de Maringá, Depto de Informática, Maringá - PR - Brazil, [email protected]

Abstract

In superscalar architectures, branch prediction techniques are necessary to handle control dependences, boosting instruction fetch and increasing the number of useful instructions available for parallel execution. Nowadays, most branch predictors use some kind of table containing branch histories and target addresses. These histories generate different patterns that appear many times, with probabilities that depend on the program execution flow. The PPM (Prediction by Partial Matching) predictor, which works with branch pattern probabilities, was analyzed and used as the base for the development of a more aggressive model, denominated TDPP (Transition Dependent Probability Predictor). This new model was analyzed and evaluated on the SimpleScalar Tool Set platform. The results obtained on the SPEC 2000 benchmarks reached average hit rates of about 98% for 16-bit history sizes. The TDPP model was more efficient than PPM and is appropriate for real implementation in the near future.

1. Introduction

Nowadays, superscalar architectures are used in most commercial processors, allowing the exploitation of instruction-level parallelism and making it possible to execute independent instructions in parallel [1]. However, the performance of these processors is limited by branch instructions, which can change the program execution flow and thus hinder the detection of parallel instructions, since this ability depends on knowing which path will be followed after the fetch of each branch instruction. This situation is known as control dependence.

To minimize this problem and increase the number of instructions issued per cycle, superscalar processors use branch prediction mechanisms [2] to forecast the path that will be followed, and so to fetch instructions with few stalls, fill the pipeline with useful instructions and execute parallel instructions in advance.

Considering that branch instructions generate different patterns and these patterns have different occurrence probabilities, the PPM predictor [3] can be used to take advantage of this situation [4]. In the present work, the PPM predictor was first simulated and analyzed under different conditions, and a more aggressive and efficient predictor, called TDPP, was proposed and evaluated. The TDPP predictor executes fewer steps than PPM and acts on the different types of branch instructions, providing better performance.

This paper is organized as follows. Section 2 presents the PPM predictor and experimental simulation results. Section 3 presents the proposed TDPP predictor. Section 4 analyzes TDPP performance and points out its benefits over the PPM predictor. Section 5 presents conclusions and future work. References appear in the last section.

2. PPM predictor

Modern data compression techniques can represent a large number of binary symbols, frequently in a string, through a small number of bits, and so reduce the total volume of data. These same techniques have been used for conditional branch prediction [5, 6] because the heuristic they use usually defines a probabilistic model that predicts the next symbol of a string with high precision.

In this context, the PPM (Prediction by Partial Matching) predictor [4] appears. It examines the branch outcomes in a previously observed sequence (1 for taken and 0 for not-taken) to predict the next one. A PPM predictor of order N is formed by N+1 Markov predictors, of orders N, N-1, ..., 0.

Proceedings of the 14th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP'06) 1066-6192/06 $20.00 © 2006 IEEE

A Markov predictor of order 0 predicts the next branch without considering any previous history; in other words, it is a static predictor. A Markov predictor of order k predicts the next branch based on the k most recent branches (the last k branches resolved up to that moment), which represent the current branch history. The next branch to be resolved enters on the right side of this sequence, shifting the leftmost bit out of the current history. After each step, the history is updated in its rightmost bit and, successively, different histories appear as the Markov predictor executes.

During program execution, the same history appears many times, and thus it generates a pattern that has an occurrence frequency. Therefore, each branch pattern has an occurrence probability. For the prediction, the current history suggests 2 possible candidates to be the new current history, depending on whether the predicted branch result is 1 or 0. Thus, the transition from the current history to the next current history happens with a certain probability.

The Markov predictor builds the transition frequencies by registering, in a counter table, the number of times that 1 or 0 appears after each current history during program execution. This table is indexed by history. To predict a branch, the Markov predictor indexes the table with the current history and uses the most frequent transition found there.
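As a concrete illustration, an order-k Markov predictor of this kind can be sketched as follows. This is a behavioral sketch, not the paper's hardware organization; the dict-based table and the class name are illustrative assumptions.

```python
# Sketch of an order-k Markov branch predictor (illustrative, not the
# paper's hardware): a counter table indexed by the k-bit history.

class MarkovPredictor:
    def __init__(self, k):
        self.k = k
        # counts[history] = [times followed by 0, times followed by 1]
        self.counts = {}
        self.history = 0  # k most recent outcomes packed into an integer

    def predict(self):
        c = self.counts.get(self.history)
        if c is None or c == [0, 0]:
            return None  # history never seen: no prediction possible
        return 1 if c[1] >= c[0] else 0  # most frequent transition wins

    def update(self, outcome):
        # Record the observed transition, then shift it into the history.
        c = self.counts.setdefault(self.history, [0, 0])
        c[outcome] += 1
        mask = (1 << self.k) - 1
        self.history = ((self.history << 1) | outcome) & mask


# On a repeating taken-taken-not-taken pattern, the predictor quickly
# learns that history 110 is always followed by 1.
p = MarkovPredictor(k=3)
for b in [1, 1, 0] * 4:
    p.update(b)
```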

A PPM predictor of order N first tries to predict using the Markov predictor of order N. If that is not possible, in other words, if the current history of N bits has never occurred (it contains 0 in the counters of both fields 0 and 1), the PPM predictor tries to predict using the Markov predictor of order N-1 on a history of N-1 bits. If that is not possible either, the PPM predictor reduces the history by one more bit, and so forth, until the corresponding prediction can be made, in the worst case by the Markov predictor of order 0. We call this feature "dynamic order reduction".
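The dynamic order reduction can be sketched on top of such counter tables. This is a minimal behavioral illustration assuming one dict-based table per order; the fallback static guess at order 0 is an assumption, not specified by the paper.

```python
# Sketch of an order-N PPM predictor: N+1 Markov tables, falling back
# from order N down to order 0 until a recorded history is found.
# Illustrative code, not the paper's hardware implementation.

class PPMPredictor:
    def __init__(self, n):
        self.n = n
        # One counter table per order; tables[k][history] = [count0, count1]
        self.tables = [dict() for _ in range(n + 1)]
        self.history = 0

    def predict(self):
        for k in range(self.n, -1, -1):        # dynamic order reduction
            h = self.history & ((1 << k) - 1)  # k most recent outcomes
            c = self.tables[k].get(h)
            if c and (c[0] or c[1]):
                return 1 if c[1] >= c[0] else 0
        return 1  # even the order-0 table is empty: fixed static guess

    def update(self, outcome):
        for k in range(self.n + 1):
            h = self.history & ((1 << k) - 1)
            self.tables[k].setdefault(h, [0, 0])[outcome] += 1
        self.history = ((self.history << 1) | outcome) & ((1 << self.n) - 1)
```

Note that the order-0 table has a single entry counting all outcomes, which matches the static-predictor behavior described above.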

3. TDPP predictor

During the simulations with the PPM predictor we disabled the dynamic order reduction, in order to test its behavior, and this caused no performance loss. The results showed that a PPM predictor of order N needs just one Markov predictor of order N to reach the best performance. We also observed that the different types of branches lead to different prediction hit rates. Based on this analysis, the TDPP predictor was developed [9].

This model (see Figure 1) uses a register to keep the current global history (N bits holding the outcomes of the last N branches, independently of the branch address or type) and several probability tables, one for each branch type. Each table has two columns, in order to keep branch frequencies separated by transition, as explained below.

In order to make the next prediction, TDPP supposes the next current global history as follows. A copy of the global history is shifted left, and a 0 or a 1 is put in the rightmost bit, generating 2 candidates to be the next current global history. However, each of the 2 candidate histories could also have been generated from two other histories. For example, the candidate history 1010 can be generated from histories 0101 or 1101.

Thus, the probability tables provide the occurrence frequencies of both transitions that can generate the two candidate histories, that is, column 0 and column 1. The bit shifted out on the left, called the transition bit, is used to address the column of the probability table associated with the respective branch instruction type, identifying which transition must be considered. Thus, the prediction is based on the branch type, the global branch history and also the transitions among histories.
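The candidate-history generation described above can be sketched in a few lines; the function name and bit layout here are illustrative assumptions.

```python
# Generating the two candidate histories and the transition bit for an
# n-bit global history (illustrative sketch of the mechanism in the text).

def candidates(history, n):
    mask = (1 << n) - 1
    transition_bit = (history >> (n - 1)) & 1  # leftmost bit, shifted out
    cand0 = ((history << 1) | 0) & mask        # supposes next outcome 0
    cand1 = ((history << 1) | 1) & mask        # supposes next outcome 1
    return transition_bit, cand0, cand1

# History 0101 yields candidates 1010 and 1011. Candidate 1010 can arise
# from either 0101 or 1101; the shifted-out transition bit (0 vs. 1)
# distinguishes which predecessor actually produced it.
t, c0, c1 = candidates(0b0101, 4)
```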

[Figure: the global history register generates two candidate histories by supposing the next outcome (0 or 1); the last outcome selects the transition column, the branch type selects one of the per-type tables, the candidate histories address the table entries, and the winner becomes the next history while the loser is discarded.]

Figure 1. TDPP's model organization

After that, the candidate history that has the highest occurrence probability is selected. If the selected candidate history was obtained with the insertion of 0, the predictor predicts 0 (not-taken branch); otherwise it predicts 1 (taken branch). Remember that each branch type has its own probability table, with two columns, containing 2^N entries each. After each branch commits, the frequency for the correct tuple (table, line, transition) must be updated.
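Putting the pieces together, the TDPP prediction and update steps might look as follows in a behavioral sketch. The class, the dict-based tables and the tie-breaking rule are illustrative assumptions, not the paper's hardware design.

```python
# Behavioral sketch of TDPP: per-branch-type tables indexed by candidate
# history, with the column selected by the transition bit (illustrative).

class TDPP:
    def __init__(self, n, branch_types):
        self.n = n
        self.mask = (1 << n) - 1
        # tables[type][history] = [freq via transition 0, freq via transition 1]
        self.tables = {t: {} for t in branch_types}
        self.history = 0

    def predict(self, br_type):
        table = self.tables[br_type]
        transition = (self.history >> (self.n - 1)) & 1  # bit shifted out
        cand0 = (self.history << 1) & self.mask          # supposes outcome 0
        cand1 = ((self.history << 1) | 1) & self.mask    # supposes outcome 1
        f0 = table.get(cand0, [0, 0])[transition]
        f1 = table.get(cand1, [0, 0])[transition]
        return 1 if f1 >= f0 else 0                      # winner candidate

    def update(self, br_type, outcome):
        # After commit, bump only the correct (table, line, transition) tuple.
        transition = (self.history >> (self.n - 1)) & 1
        new_hist = ((self.history << 1) | outcome) & self.mask
        self.tables[br_type].setdefault(new_hist, [0, 0])[transition] += 1
        self.history = new_hist
```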


4. Simulation and performance evaluation

The PPM predictor was tested with the sim-bpred simulator of the SimpleScalar Tool Set [7] platform and evaluated using 13 benchmarks of the SPEC 2000 suite [8]. The history sizes were varied from 2 to 16. After that, the TDPP predictor was simulated under the same conditions as the PPM predictor. The graph in Figure 2 presents the performance of the TDPP predictor. The behavior of the TDPP predictor on all benchmarks is analogous to that obtained with the PPM predictor, but the performance is better.

[Figure: average accuracy (%) versus history size (2, 4, 8, 16), AVG curve over the benchmarks.]

Figure 2. Global performance of the TDPP

On average over the benchmarks, the TDPP predictor was more efficient than the PPM predictor for all history sizes. Figure 3 shows the speedup of the TDPP predictor over the PPM predictor, that is, the percentage that the difference in accuracy represents. A speedup above 4% can be observed for history sizes of 4 bits. However, for history sizes of 16 bits the speedup reaches only 1%. It is important to point out that TDPP trades the computational effort that PPM spends maintaining the dynamic order reduction for the additional expense of one table per branch type.

[Figure: speedup (%) of TDPP over PPM versus history size (2, 4, 8, 16).]

Figure 3. Speedup of TDPP over PPM

However, if the table entry size can be reduced, then this additional expense may not be so important. The effect would be the same as substituting several narrow tables, used by TDPP, for the single wide table used by PPM, with better performance. The next section investigates these questions.

4.1 Exploring the limits of hardware

In the simulations previously described, the entry sizes of the TDPP tables, which register the frequencies of the transitions, were not limited. In practice, working with no limits is unrealistic. To evaluate the impact of this limitation on the performance of the TDPP predictor, new simulations were made for entry sizes varying from 2 to 16 bits, because beyond 16 bits the performance results are almost unaltered. Thus, the frequency values can reach maximums from 3 to 65535, a limit called Max (maximum limit of the frequency). Only the 16-bit history was considered, since it provides the best performance.

Additionally, a new feature was inserted into the TDPP predictor: frequency resetting. This feature works as follows. After a branch is resolved, the table entry associated with the updated history must be incremented. However, this increment is made only if the entry value has not yet reached Max (all bits set to 1). After this increment, the value of this entry and the value of the entry associated with the losing candidate history are compared with Max. If both equal Max, these entries are reset to zero. This version of the predictor is called the TDPPR predictor. The results of the simulations are presented as well. The graph in Figure 4 shows the performance of the TDPPR predictor over all benchmarks. Notice that the best average performance, 96.69%, is obtained when Max is 15, that is, with counters of only 4 bits. The hit rates decrease as Max increases.
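The frequency-resetting rule can be sketched for a single pair of counters as follows, assuming 4-bit counters (Max = 15, the best-performing limit in the experiments); the function is a hypothetical illustration of the rule, not the hardware.

```python
# Sketch of the TDPPR saturate-and-reset rule for one counter pair
# (illustrative): 4-bit counters, so Max is 15 (all bits set to 1).

MAX = 15

def commit(correct_entry, loser_entry):
    """Update two frequency counters after a branch resolves.

    correct_entry: counter for the candidate history that actually occurred.
    loser_entry:   counter for the losing candidate history.
    """
    if correct_entry < MAX:          # saturating increment
        correct_entry += 1
    if correct_entry == MAX and loser_entry == MAX:
        correct_entry = loser_entry = 0   # frequency resetting
    return correct_entry, loser_entry
```

Resetting both counters only when they saturate together discards stale frequencies while preserving any bias between the two candidates for as long as one of them lags behind.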

Besides, we have noted that for the floating-point benchmarks, when Max increases from 7 to 1023, the hit rate stays almost invariant. After that, the hit rate decreases, harming the performance. The value of Max that allowed the greatest hit rate was 31, reaching 98.45%. In this case, there is no reason to use more than 5 bits in each table entry. Likewise, for the integer benchmarks, we have noted that the hit rates decrease when Max is greater than 15, that is, with more than 4 bits in each table entry. At this point the predictor reaches its highest hit rate: 95.18%.

The speedup of the TDPPR predictor with Max 15 over the same predictor with unlimited counters reaches 0.8%. This means that even if large tables were available, there would be no advantage, because the uncontrolled growth of the frequency values damages the performance. This happens because the frequency counters preserve stale values, accumulated in earlier parts of the dynamic code, which are inadequate for the current execution. In this sense, resetting the frequencies after a certain time improves the performance by ending this "addiction" of the frequency counters. Most importantly, imposing limits on the entry sizes of the probability table promotes a significant reduction of hardware volume.

[Figure: average accuracy (%) versus Max size (3, 15, 63, 255, 1023, 4095, 16383, 65535), AVG curve over the benchmarks.]

Figure 4. Global performance of the TDPPR

We also applied the Max idea to the 16-bit PPM predictor, and the average results over all benchmarks are shown in Figure 5. In this simulation, the best performance (96.02%) was reached with Max 7, that is, with just 3 bits per table entry, against 95% when using unlimited frequency counters. That is around 1.0% of speedup, almost the same speedup as TDPP over PPM. This result shows there is no advantage in predicting branches by instruction type when using 16-bit history sizes; just using Max is enough.

[Figure: average accuracy (%) versus Max size (3, 15, 63, 255, 1023, 4095, 16383, 65535), AVG curve over the benchmarks.]

Figure 5. Global performance of the PPMR

5. Conclusions

Modern processors suffer penalties due to the branch instructions that exist in application code. Modifying existing code to remove these instructions, or writing new programs with no branch instructions, is simply not possible. Applications depend on data to make decisions, and the data are calculated and modified at runtime in order to fulfill the purpose of the applications. It is not realistic to think that branch-free applications can be written. An acceptable alternative is to develop hardware mechanisms that allow processors to handle branches intelligently and provide high performance, implicitly benefiting all applications.

The TDPP predictor was proposed and evaluated on benchmarks of SPEC 2000, presenting average hit rates higher than 95%, better than its predecessor PPM, and reaching maximums of 98%. We observed that increasing the history size leads to a proportional increase in performance, but using tables with large entries spends resources with no performance gain. In our experiments, the simulations showed that 4-bit frequency counters are enough to provide the best hit rates. In addition, prediction based on branch types provides a reasonable speedup, depending on the history size.

6. Acknowledgments

The authors would like to thank Renata C. L. Silva for helping to translate this paper, and CAPES, CNPq and the Araucária Foundation for financial support.

7. References

[1] Smith, J.E.; Sohi, G.S. The Microarchitecture of Superscalar Processors. Proceedings of the IEEE, v.83, n.12, Dec. 1995.
[2] Yeh, T.Y.; Patt, Y. A Comparison of Dynamic Branch Predictors that Use Two Levels of Branch History. In Proceedings of the International Symposium on Computer Architecture, May 1993, pages 257-266.
[3] Moffat, A. Implementing the PPM Data Compression Scheme. IEEE Transactions on Communications, November 1990, 38(11):1917-1921.
[4] Chen, I.C.K.; Coffey, J.T.; Mudge, T.N. Analysis of Branch Prediction via Data Compression. In Proceedings of ASPLOS, October 1995, pages 128-137.
[5] Federovsky, E.; Feder, M.; Weiss, S. Branch Prediction Based on Universal Data Compression Algorithms. In Proceedings of the International Symposium on Computer Architecture, June 1998, pages 62-72.
[6] Kalamatianos, J.; Kaeli, D.R. Indirect Branch Prediction Using Data Compression. Journal of Instruction Level Parallelism, Vol. 1, 1999.
[7] Burger, D.; Austin, T.M. The SimpleScalar Tool Set, Version 2.0. Technical Report #1342, Computer Sciences Department, University of Wisconsin-Madison, Jun. 1997.
[8] Henning, J.L. SPEC CPU2000: Measuring CPU Performance in the New Millennium. IEEE Computer, 2000, pages 28-35.
[9] Silva, Z.C. et al. Branch Prediction Based on Both Branch Type and History Transition Probability (in Portuguese). WSCAD (Workshop sobre Sistemas Computacionais de Alto Desempenho), Brazil, Oct. 2004.
