

A Bayesian Network Approach for Compiler

Auto-tuning for Embedded Processors

Amir Hossein Ashouri, Giovanni Mariani, Gianluca Palermo and Cristina Silvano

Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Italy

{amirhossein.ashouri, giovannisiro.mariani, gianluca.palermo, cristina.silvano}@polimi.it

Abstract—The complexity and diversity of today's architectures require an additional effort from programmers in porting and tuning application code across different platforms. The problem is even more complex considering that the compiler itself also requires some tuning, since standard optimization options have been customized for specific architectures or designed for the average case. This paper proposes a machine-learning approach for reducing the cost of the compiler auto-tuning phase and for speeding up application performance in embedded architectures. The proposed framework is based on an application characterization done dynamically with micro-architecture independent features and on the usage of Bayesian Networks. The main characteristic of the Bayesian Network approach is that it does not describe the solution as a strict set of compiler transformations to be applied, but as a complex probability distribution function to be sampled. Experimental results, carried out on an ARM platform and the GCC transformation space, prove the effectiveness of the proposed methodology for the selected benchmarks. The selected set of solutions (less than 10% of the search space) proved to be very close to the optimal sequence of transformations, showing an application performance speedup of up to 2.8 (1.5 on average) with respect to -O2 and -O3 for the cBench suite. Additionally, the proposed method demonstrated a 3x speedup in terms of search time with respect to an iterative compilation approach, given the same quality of the solutions¹.

I. INTRODUCTION

Nowadays, most software applications are first developed in a high-level source programming language (e.g. C, C++) and then compiled to generate their executable binary implementations. Compiler optimizations (e.g. loop unrolling, register allocation, etc.) might lead to substantial benefits with respect to several performance metrics (e.g. execution time, power consumption, memory footprint). Application developers rely on the compiler's intelligence to optimize application performance, while often being unaware of how the compiler optimization process can modify the application itself. Today's compiler interfaces expose different options that allow the programmer to specify the optimizations to be applied. It is usually possible to specify optimization levels that automatically include a set of predefined optimizations, known to be beneficial for application performance in most cases [12]. There are also several specific compiler optimizations beyond the ones included in the predefined optimization levels. The effects of enabling these compiler optimizations are quite complex. These effects mostly depend on the features of the target application and on whether or not some other compiler optimizations are enabled. It is rather hard to decide the

¹This work was supported in part by the EC under the grants FP7-HARPA-612069 and FP7-CONTREX-611146.

compiler optimizations to be enabled to maximize application performance.

When considering application-specific embedded systems, where the application is compiled once and then deployed on the market, exploiting the available compiler optimizations to achieve the best performance prior to deployment represents an even more crucial task. So far, researchers have proposed two main solutions to identify the best compiler optimization strategy: a) iterative compilation [5] and b) machine-learning predictive techniques [22]. The former approach consists of recompiling the application several times using different compiler optimizations and then choosing the executable binary providing the best performance. However, due to the high number of compilation runs to be carried out, iterative compilation is a time-consuming procedure. The latter approach consists of machine-learning techniques, where some compiler optimizations are applied based on some software features. Once the machine-learning model has been trained, given a target application, the model predicts a sequence of compiler optimizations expected to maximize performance. Once trained, machine learning does not require carrying out several compilations. However, the performance of the final executable binary is typically worse than the one found with iterative compilation.
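As a sketch of the iterative-compilation baseline described above, the loop below repeatedly samples a random subset of optimization flags, "builds and measures" the application, and keeps the fastest configuration. The `evaluate` stub and its synthetic cost model are purely illustrative stand-ins for a real compile-and-time step (e.g. invoking GCC and timing the binary); the flag names are real GCC flags but their effect here is invented.

```python
import random

# Hypothetical flag subset; in a real setup these would be the compiler
# optimizations under study (see Table I in the paper).
FLAGS = ["-funroll-loops", "-fivopts", "-finline-functions", "-fgcse"]

def evaluate(flags):
    # Stand-in for compiling with `flags` and measuring execution time.
    # Synthetic cost so the example is self-contained: more flags -> faster.
    return 10.0 - 1.5 * len(flags) + (0.5 if "-fgcse" in flags else 0.0)

def iterative_compilation(budget, seed=0):
    """Plain iterative compilation: uniform random search over flag subsets."""
    rng = random.Random(seed)
    best_flags, best_time = None, float("inf")
    for _ in range(budget):
        flags = [f for f in FLAGS if rng.random() < 0.5]  # uniform sampling
        t = evaluate(flags)
        if t < best_time:
            best_flags, best_time = flags, t
    return best_flags, best_time

best, t = iterative_compilation(budget=50)
```

The `budget` parameter makes the cost of this approach explicit: each iteration is one full compile-and-run, which is exactly what the proposed method tries to reduce.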

In this work, we propose an innovative approach to tackle the problem of identifying the compiler optimizations that maximize the performance of a target application. We apply a statistical methodology to infer the probability distribution of the compiler optimizations to be enabled to achieve the best performance. We then drive an iterative compilation process by sampling from this probability distribution. Similarly to machine-learning approaches, we use a set of training applications to learn the statistical relationships between application features and compiler optimizations. Bayesian networks are used to learn the statistical model. Given a new application not included in the training set, its software features are fed to the Bayesian network as evidence on the distribution. This evidence generates a bias on the distribution, since compiler optimizations are correlated to software features.

The obtained probability distribution is application-specific and effectively focuses the iterative compilation process on the most promising compiler optimizations. Experimental results carried out on an embedded ARM-based computing platform demonstrate a 3× speedup with respect to state-of-the-art iterative compilation techniques [5]. To summarize, the contributions of this work are:

978-1-4799-6307-2/14/$31.00 ©2014 IEEE

• A Bayesian model capable of capturing the relationships among different compiler optimizations and program features. This approach is based on statistical analysis and represents the learned relationships as a directed acyclic graph that can be easily inspected and understood by humans.

• The integration of the Bayesian network model within a compiler optimization framework. The probability distribution of the compiler optimizations to be used for a new, previously unobserved, program is inferred by means of the Bayesian network to focus the optimization itself.

• The application and assessment of the proposed methodology for the optimization of applications on an embedded ARM-based platform.

The rest of the paper is organized as follows. Section II provides a brief discussion of the related work. In Section III, we introduce the proposed statistical approach to iterative optimization. Section IV presents the experimental evaluation of the proposed methodology on an ARM platform. Finally, Section V summarizes the outcomes of the work and envisions some future research paths.

II. RELATED WORK

Optimizations carried out at compilation time are used broadly, especially in embedded computing applications. This makes such techniques especially interesting, and researchers are investigating more efficient techniques for identifying the best compiler optimizations to be applied given the target architecture. The related work in this field can be categorized into a) iterative compilation and b) machine-learning-based approaches.

Iterative compilation was introduced as a technique capable of outperforming static handcrafted optimization sequences, such as those usually exposed by compiler interfaces as optimization levels. Since its introduction [3], [14], the goal of iterative compilation has been to identify the most appropriate compiler passes for a target application.

More recent literature discusses down-sampling techniques to reduce the search space [21], with the goal of identifying compiler optimization sequences per application.

Other authors exploit iterative compilation jointly with architectural design space exploration for VLIW architectures [2]. Their intuition is that the performance of a computer architecture depends on the executable binary which, in turn, depends on the optimizations applied at compilation time. Thus, by studying the two problems jointly, the final architecture is optimal with respect to the compilation technique in use, and the effects of different compiler optimizations are identified in the early design phases.

Given that compilation is a time-consuming task, several researchers have proposed techniques to predict the best compiler optimization sequences rather than applying a trial-and-error process as in iterative compilation. These prediction methodologies are generally based on machine-learning techniques [1], [8], [18]. Milepost [18] is a machine-learning-based compiler that automatically adjusts its optimization heuristics to improve the execution time, code size, or compilation time of specific programs on different architectures [8]. The authors of [16] introduced a methodology for learning static code features to be used within the compiler heuristics that drive the optimization process.

There are previous works focusing on the use of empirical modeling to build application-specific performance models for compiler optimizations. The authors of [24] tried to capture the interactions between compiler flags and micro-architectural parameters without any prior knowledge. The outcome could help compiler writers exploit the most effective parts of the design space to further improve performance.

In [23], the authors suggest including iterative compilation within the compiler itself by embedding the heuristics that implement the iterative compilation mechanism. In this way, iterative compilation is carried out at the level of each code segment, but it is significantly sped up by focusing the search as defined by the heuristics embedded within the compiler. The authors of [5] evaluate the effects of dataset variations on the iterative compilation process. It turns out that iterative compilation is a robust approach, capable of achieving more than 83% of the potential speedup for all datasets of the selected benchmarks.

Our approach is significantly different from the previous ones, given that it applies a statistical methodology to learn the relationships between application features and compiler optimizations, as well as between different compiler optimizations. Our work is, to some extent, related to the one proposed in [1], where machine learning was used to capture the probability distribution of different compiler transformations. However, in this work we propose the use of Bayesian networks as a framework enabling statistical inference on the probability distribution given the evidence of application features. Given a target application, its features are fed to the Bayesian network to induce an application-specific bias on the probability distribution of compiler optimizations. Additionally, in our approach, program features are obtained through micro-architecture independent dynamic profiling [11] rather than static code analysis. The adoption of dynamic profiling provides insight into the actual program execution, giving more weight to the code segments executed more often (i.e. the code segments whose optimization would lead to higher benefits).

III. THE PROPOSED METHODOLOGY

The main goal of the proposed approach is to identify the best compiler optimizations to be applied to a target application. Each application is passed through a characterization phase that generates a parametric representation of its dynamic features. A statistical model based on Bayesian networks correlates these features to the compiler optimizations so as to maximize application performance.

The optimization flow is represented in Figure 1 and consists of two main phases. During the initial training phase, the Bayesian network is learned on the basis of a set of training applications (see Figure 1a). During the final exploitation phase, new applications are optimized by exploiting the knowledge stored in the Bayesian network.

(a) Training the Bayesian network

(b) Optimization process for a new target application

Fig. 1: Overview of the proposed methodology.

During both phases, an optimization process is necessary to identify the best compiler optimizations to be enabled to achieve the best performance. This is done for learning purposes during the training phase and for optimization purposes during the exploitation phase. To implement the optimization process, a Design Space Exploration (DSE) engine has been used. This DSE engine compiles, executes and measures application performance by enabling and disabling different compiler optimizations. The compiler optimizations to be enabled are decided by a Design of Experiments (DoE) plan. In our approach, the DoE plan is obtained by sampling from a given probability distribution that is either a uniform distribution (during the training phase, as in Figure 1a) or an application-specific distribution inferred through the Bayesian network (during the exploitation phase, as in Figure 1b).

The uniform distribution adopted during the training phase allows us to explore the compiler optimization space O uniformly, thus discovering and learning the most promising regions of this space. The application-specific distribution used during the exploitation phase allows us to speed up the optimization by focusing on the most promising region of the compiler optimization space O.
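The DoE sampling step during the training phase can be sketched as follows: optimization sequences are drawn uniformly at random from O = {0, 1}^n. During the exploitation phase, the same engine would instead draw from the biased, application-specific distribution inferred by the Bayesian network. This is a minimal illustrative sketch; the function name and parameters are not from the paper.

```python
import random

def sample_uniform_doe(n_flags, n_samples, seed=42):
    """Uniform DoE: each of the n_flags optimizations is independently
    enabled (1) or disabled (0) with probability 0.5."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_flags)]
            for _ in range(n_samples)]

# 20 candidate optimization sequences over 7 flags (the paper's setup
# uses 7 compiler optimizations, see Table I).
doe = sample_uniform_doe(n_flags=7, n_samples=20)
```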

A. Application characterization

In this work, we use a PIN-based [17] dynamic profiling framework to analyze the behavior of the different applications at execution time. In particular, the selected profiling framework provides a high-level Micro-architecture Independent Characterization of Applications (MICA) [11] that is suitable for characterizing applications in a cross-platform manner. Furthermore, there is no static syntactic analysis: the whole framework is based solely on MICA profiling.

In our experimental setup, the application is compiled and profiled on an x86 host processor, while the target architecture where the application executes (i.e. the architecture for which the application shall be optimized) is an embedded ARM-based platform. Thanks to the high-level abstraction of the application characterization carried out with MICA, we can easily change the target architecture without the need to change the profiling infrastructure.

The MICA framework reports data about instruction types, memory and register access patterns, potential instruction-level parallelism and a dynamic control-flow analysis in terms of branch predictability. Overall, the MICA framework characterizes an application with respect to 99 different metrics (or features). However, many of these 99 features are strongly correlated (e.g. the number of memory reads with stride smaller than 1K is bounded by the number of reads with stride smaller than 2K). Furthermore, generating a Bayesian network is a process whose time complexity grows super-linearly with the number of parameters in use, making it impractical to include all 99 features in the Bayesian network model. Thus, with the goal of speeding up the construction of the Bayesian network, we apply Principal Component Analysis (PCA) to reduce the number of parameters characterizing the application. PCA is a technique to transform a set of correlated parameters (application features) into a set of orthogonal (i.e. uncorrelated) principal components [13]. The PCA transformation sorts the principal components in descending order of variance; the first components thus capture most of the input data variability, i.e. they represent most of the information contained in the input data. To reduce the number of input features while keeping most of the information contained in the input data, it is sufficient to use the first k principal components, as suggested in [11]. In particular, we set k = 10 to trade off the information stored in the application characterization against the time required to train the Bayesian network.
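The idea behind this PCA step can be illustrated with a tiny pure-Python sketch: the top principal component of a feature matrix is the leading eigenvector of its covariance matrix, found here by power iteration. A real setup (the 99 MICA features reduced to k = 10 components) would use a library routine such as scikit-learn's PCA; this sketch only computes the first component (k = 1) and all data is synthetic.

```python
def first_principal_component(X, iters=200):
    """Return the first principal component (unit vector) of the rows of X."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]  # center data
    # Sample covariance matrix (d x d)
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / (n - 1)
          for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):  # power iteration converges to the top eigenvector
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Two strongly correlated features (roughly y = 2x): the first component
# points along the diagonal, capturing almost all of the variance.
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1], [5.0, 9.8]]
pc1 = first_principal_component(X)
```

Projecting each feature vector onto the first k such components gives the reduced characterization vector α used in the rest of the methodology.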

B. The Bayesian framework

Background on Bayesian networks. A Bayesian network is a powerful tool to represent the probability distribution of the different variables that characterize a certain phenomenon. The phenomenon investigated in this work is the optimality of compiler optimization sequences.

Let us define a Boolean vector o whose elements oi are the different compiler optimizations. Each optimization oi can be either enabled (oi = 1) or disabled (oi = 0). In this work the phase-ordering problem [15] is not taken into account; rather, we consider the different optimizations oi to be organized in a predefined order embedded in the compiler. A compiler optimization sequence represented by the vector o belongs to the n-dimensional Boolean space O = {0, 1}^n, where n represents the number of compiler optimizations under study.

An application is parametrically represented by the vector α, whose elements αi are the first k principal components of its dynamic profiling features. Note that the elements αi of the vector α generally belong to the continuous domain.
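The notation above can be made concrete in a few lines: an optimization sequence o is a Boolean vector in O = {0, 1}^n, so the full space has 2^n sequences, which is why exhaustive search quickly becomes expensive. The values below are illustrative only.

```python
from itertools import product

# The compiler optimization space O = {0, 1}^n for n flags.
n = 7  # e.g. the 7 compiler optimizations studied in the paper
O = list(product([0, 1], repeat=n))

# One sequence o in O: enable the 1st and 3rd optimizations, disable the rest.
o = [1, 0, 1, 0, 0, 0, 0]
```

Even for n = 7 there are 128 sequences; for the 24-flag space mentioned later in the paper there would be over 16 million, motivating the sampled, distribution-driven search.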

The optimal compiler optimization sequence o ∈ O that maximizes the performance of an application is generally unknown. However, it is known that the effects of a compiler optimization oi may depend on whether or not another optimization oj is applied. Additionally, it is known that the compiler optimization sequence that maximizes the performance of a given application depends on the application itself.

The reason why the optimal compiler optimization sequence o is unknown a priori is that it is not possible to capture in a deterministic way the dependencies among the variables in the vectors o and α. There is no way to identify an analytic model that exactly fits the vector function o(α). In fact, the best optimization sequence o also depends on other factors that are somewhat outside our comprehension: the unknown. It is exactly to deal with the unknown that we propose not to predict the best optimization sequence o, but rather to infer its probability distribution. The uncertainty stored in the probability distribution models the effects of the unknown.

Given the peculiarities of the problem, we selected Bayesian networks as the underlying probabilistic model. In particular, Bayesian networks have the following features of interest for the target problem:

• Their expressiveness allows the same framework to include heterogeneous variables such as Boolean variables (in the optimization vector o) and continuous variables (in the application characterization α).

• They are suitable for modeling cause-effect dependencies: for the target problem, we expect that the benefits of some compiler optimizations (effects) are due to the presence of some application features (causes).

• It is possible to graphically investigate the model to visualize the dependencies among different compiler optimizations. If needed, it is even possible to manually edit the graph to include some a priori knowledge.

• It is possible to bias the probability distribution of some variables (the optimization vector o) given the evidence on other variables (the application characterization α). This makes it possible to infer an application-specific distribution for the vector o from the vector α observed by analyzing the target application.

• It is a general optimization model, making it feasible to infer a probabilistic model both with and without the kernel characterization, i.e. static analysis or micro-architecture independent characterization.

A Bayesian network is a directed acyclic graph whose nodes represent variables and whose edges are the dependencies between these variables. Figure 2 reports a simple example with one variable α1 representing the application features and two variables o1, o2 representing different compiler optimizations. Let us consider that α1 represents the number of profiled HW-loop instructions, while o1, o2 are two optimizations operating on loops (e.g. loop unrolling o1 and loop tiling o2). Of course, the presence of loops can be observed by profiling α1, and it is the primary cause of the loop optimizations' effects. Additionally, the effectiveness of applying loop tiling might depend on the decision taken about whether or not to apply loop unrolling. All these dependencies can be automatically discovered by training the Bayesian model and are represented as edges in the network topology graph (Figure 2). Note that we use dashed lines for nodes representing observable variables whose value can be input as evidence to the network. In this example, the variable α1 can be observed and, by introducing its evidence, it is possible to bias the probability distributions of the other variables.

Training the Bayesian model. Off-the-shelf tools exist to construct Bayesian networks automatically by fitting the distribution of some training data [19]. To do so, first the graph topology shall be identified, and then the probability distribution of the variables, including their dependencies, shall be estimated.

The identification of the graph topology is particularly complex and time consuming. To avoid unreasonable training times, we decided to reduce the number of nodes in the graph by limiting the number k of elements in the vector α. This is the reason for using PCA during application characterization. Moreover, for efficiency reasons, the algorithm used for selecting the graph topology is a greedy algorithm, named K2, initialized with the Maximum Weight Spanning Tree (MWST) ordering method, as suggested in [10]. Note that the initial ordering of the nodes for the MWST algorithm is given so that the elements of α appear first, followed by the elements of o. Even if the final topological sorting of the nodes changes according to the algorithm described in [10], with this initialization criterion the dependencies are always directed from elements of α to elements of o and not vice versa.

The coefficients of the functions describing the probability distributions of each variable, as well as their dependencies, are tuned automatically to fit the distribution of the training data [19]. Within the proposed Bayesian model, a relationship between two different variables (e.g. between a program feature αj and a compiler optimization oi) is modeled as a conditional probability function P(oi = b | αj = x). To estimate these functions, training data are gathered by analyzing a set A of training applications (Figure 1a). First, application features are computed for each application a ∈ A to enable the principal component analysis. Thus, each application is characterized by its own principal component vector α. Then, an experimental compilation campaign is carried out for each application by sampling several compiler optimization sequences from the compiler optimization space O with a uniform distribution. For each application, we select the 15% best-performing compilation sequences among the sampled ones. The distribution of these sequences is learned by the Bayesian network framework in relation to the vector α characterizing the application.
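The training-data selection step described above (keep the best-performing 15% of the uniformly sampled sequences per application) can be sketched as follows. The measurements here are synthetic placeholders, not data from the paper.

```python
def top_fraction(samples, fraction=0.15):
    """samples: list of (sequence, execution_time) pairs; lower time is better.
    Returns the sequences in the best-performing `fraction` of the samples."""
    ranked = sorted(samples, key=lambda s: s[1])
    keep = max(1, int(len(ranked) * fraction))  # always keep at least one
    return [seq for seq, _ in ranked[:keep]]

# Synthetic (sequence, time) measurements for one training application.
samples = [([1, 0, 1], 4.2), ([0, 0, 0], 9.1), ([1, 1, 1], 3.7),
           ([0, 1, 0], 8.5), ([1, 1, 0], 5.0), ([0, 0, 1], 7.9)]
best = top_fraction(samples, fraction=0.34)  # keep ~2 of 6 for illustration
```

The Bayesian network is then fit to the distribution of these "good" sequences, in relation to the application's feature vector α.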

Fig. 2: A Bayesian network example.

Inferring an application-specific distribution. Once the Bayesian network is trained, the principal component vector α obtained for a new application can be fed as evidence to the framework to bias the distribution of the compiler optimization vector o. To sample a compiler optimization sequence from this biased distribution, we proceed as follows. The nodes in the directed acyclic graph describing the Bayesian network are sorted in topological order, i.e. if a node at position i has some predecessors, those appear at positions j, j < i. At this point, all nodes representing the variables α appear in the first positions². The value of each compiler optimization oi is sampled in sequence following the topological order, such that all its parent nodes have already been decided. Thus, the marginal probabilities P(oi = 0 | P) and P(oi = 1 | P) can be computed on the basis of the parent node vector value P (each parent being either an evidence αj or a previously sampled compiler optimization oj). Similarly, by using the maximum likelihood method, it is possible to compute the most probable vector from this biased probability distribution. When sampling from the application-specific probability distribution inferred through the Bayesian network, we always return the most probable optimization sequence as the first sample.
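The ancestral-sampling procedure above can be sketched on a toy network: nodes are visited in topological order, so every parent is already fixed (either an observed feature or a previously sampled flag) when a node's conditional distribution is looked up. The two-flag network and its conditional probability tables below are invented for illustration, and the greedy per-node maximum-likelihood pass is only a sketch of computing the most probable vector.

```python
import random

def p_o1(a1):            # P(o1 = 1 | a1): o1's only parent is the feature a1
    return 0.9 if a1 > 0.5 else 0.2

def p_o2(a1, o1):        # P(o2 = 1 | a1, o1): o2 depends on a1 and on o1
    return 0.8 if (a1 > 0.5 and o1 == 1) else 0.1

def sample_sequence(a1, rng):
    """Ancestral sampling in topological order: a1, then o1, then o2."""
    o1 = 1 if rng.random() < p_o1(a1) else 0
    o2 = 1 if rng.random() < p_o2(a1, o1) else 0
    return [o1, o2]

def most_probable(a1):
    """Greedy maximum-likelihood assignment, same topological order."""
    o1 = 1 if p_o1(a1) >= 0.5 else 0
    o2 = 1 if p_o2(a1, o1) >= 0.5 else 0
    return [o1, o2]

rng = random.Random(0)
first = most_probable(0.8)                # returned as the first sample
rest = [sample_sequence(0.8, rng) for _ in range(5)]
```

With the evidence a1 = 0.8, the distribution is biased toward enabling both loop-related flags, mirroring how the observed feature vector α focuses the search.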

IV. EXPERIMENTAL EVALUATION

A. Experimental Setup

The experimental results have been generated using an ARMv7 Cortex-A9 architecture on a TI OMAP 4430 processor [20] running ArchLinux and GCC-ARM 4.6.3. We considered the cBench benchmark suite [7], including 24 applications and 5 datasets. Table I reports the compiler optimizations analyzed in this paper. This set of compiler transformations has been derived from [5]; in particular, we selected the compiler optimizations whose speedup factor is reported to be greater than 1.10. These compiler optimizations can be enabled/disabled by means of their respective compiler optimization flags. They are applied to improve application performance beyond the predefined optimization level -O3. The application execution time is measured using the Linux perf tool. The execution time required to process a given dataset by a compiled application binary is estimated by averaging over five different executions.

B. Bayesian Model Analysis

In our experimental setup, we used the Matlab environment [19] to train the Bayesian network by correlating the first 10 Principal Components (PCs) of the application features to the 7 compiler optimizations listed in Table I. This step takes about 1.5 minutes on an Intel Core i7 processor at 2.4 GHz. An analysis of the feasibility of applying the approach to a larger number of compiler optimization flags has been carried out and demonstrated that training a Bayesian model including 24 compiler optimizations takes about 1 hour per application.

In this work, the validation has been carried out through cross-validation. We train different Bayesian networks, each one excluding one of the applications from the training set. Validation data are then gathered on the application excluded from the training set.
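This leave-one-application-out scheme can be sketched generically. The `train` and `evaluate` callables below are toy stand-ins for the actual Bayesian network training and validation steps, and the application names are examples only.

```python
def leave_one_app_out(applications, train, evaluate):
    """Train one model per held-out application and collect its score."""
    scores = {}
    for held_out in applications:
        training_set = [a for a in applications if a != held_out]
        model = train(training_set)
        scores[held_out] = evaluate(model, held_out)
    return scores

# Toy stand-ins: the "model" is just the set of training apps, and the
# evaluation checks that the held-out app was indeed excluded from it.
apps = ["bitcount", "qsort", "dijkstra"]
scores = leave_one_app_out(apps, train=lambda s: set(s),
                           evaluate=lambda m, a: a not in m)
print(scores)  # {'bitcount': True, 'qsort': True, 'dijkstra': True}
```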

With the purpose of graphically investigating the dependencies between the variables involved in the compiler optimization problem, we trained a final Bayesian network including all the applications in the training set. The resulting network

2This is by construction, due to the initialization of the MWST and K2 algorithms used to discover the network topology.

Fig. 3: Topology of the Bayesian network (PC.1-PC.10 feed the compiler optimization nodes math-opt, fn-gss-br, fn-ivopt, fn-tree-opt, fn-inline, funroll-lo, and O2).

topology is shown in Figure 3. By removing one application from the training set, the graph topology changes slightly, mainly in terms of the edges connecting PC nodes to compiler optimization nodes. This effect is due to the removal of the application from the principal component analysis, and thus to the fact that the principal components are computed in a different way. For conciseness, we do not report the graph topologies for each one of the trained Bayesian networks.

The nodes of the topology graph reported in Figure 3 are organized in layers. The first layer reports the PCs, which are the observable variables (drawn with dashed lines). The second layer represents the compiler optimizations whose parents are the PC nodes; thus the effects of these compiler optimizations depend only on the application characterization in terms of its features. The third layer contains compiler optimization nodes whose parents include optimization nodes from the second layer. Once a new application-dataset pair is characterized, the evidence related to the PCs of its features is fed to the network in the first layer. Then, the probability distributions of the other nodes can be inferred in turn on the second and third layers.

There are two nodes in the third layer of Figure 3. The first one is the fn-gss-br node, which depends on funroll-lo because unrolling loops impacts the predictability of the branches implementing these loops. Moreover, funroll-lo impacts the effectiveness of the heuristic branch probability estimation, and thus fn-gss-br. The second node in the third layer is the fn-ivopt node, which depends on fn-tree-opt as a parent node in the second layer. Both of these optimizations work on trees, and therefore their effects are interdependent.

While sampling compiler optimizations from the Bayesian network, the decisions of whether or not to apply fn-gss-br and fn-ivopt are taken after deciding whether or not to apply funroll-lo and fn-tree-opt.

C. Comparison Results

The proposed methodology samples different compiler optimization sequences from the Bayesian network. The performance achieved by the best application binary depends on the number of sequences sampled from the model. Performance


TABLE I: Compiler optimizations under analysis, to be introduced beyond -O3

-funsafe-math-optimizations (math-opt): Allow optimizations for floating-point arithmetic that (a) assume valid arguments and results and (b) may violate IEEE or ANSI standards.
-fno-guess-branch-probability (fn-gss-br): Do not guess branch probabilities using heuristics.
-fno-ivopts (fn-ivopt): Disable induction variable optimizations on trees.
-fno-tree-loop-optimize (fn-tree-opt): Disable loop optimizations on trees.
-fno-inline-functions (fn-inline): Disable the optimization that inlines all simple functions.
-funroll-all-loops (funroll-lo): Unroll all loops, even if their number of iterations is uncertain.
-O2 (O2): Override the -O3 optimization level by disabling some optimizations involving a space-speed trade-off.

Fig. 4: Performance improvement obtained by sampling 8 optimization sequences from the probability distribution inferred by the proposed Bayesian model (in reference to -O2 and -O3).

speedup is measured in reference to -O2 and -O3, which are the optimization levels available in GCC.

Figure 4 reports these speedups by considering a sample of 8 different compiler optimization sequences (results are averaged over the different datasets). All applications have been sped up in reference to -O2, and only consumer-jpeg-d has been slowed down in reference to -O3. For two of the five datasets tested for this application, -O3 returns the best-performing executable binary, and it is not possible to exceed its performance. On average, the speedups are 1.56 and 1.48 in reference to -O2 and -O3 respectively, and maximum speedups of 2.85× and 2.7× have been observed for network-patricia w.r.t. -O2 and -O3 respectively.

It is known that iterative compilation can generally improve application performance in reference to static hand-crafted compiler optimization sequences [1]. Additionally, given the complexity of the iterative compilation problem, it has been proved that drawing compiler optimization sequences at random is as good as applying other optimization algorithms such as genetic algorithms or simulated annealing [1], [4], [5]. For these reasons, to evaluate the proposed approach we compare it to a random methodology that samples compiler optimization sequences from the uniform distribution.

Let us define the normalized speed-up τ as the ratio of the achieved performance improvement over the whole potential performance improvement:

τ = (E_ref − E) / (E_ref − E_best)    (1)

where E is the execution time achieved by the methodology under consideration, E_ref is the execution time achieved with a reference compilation methodology, and E_best is the best execution time computed through an exhaustive exploration of all possible compiler optimization sequences (in our case, 128 different sequences). Note that, as the execution time E of the iterative compilation methodology under analysis gets closer to the reference execution time E_ref, τ gets closer to 0, reporting that no improvement is returned. In the same way, as E gets closer to the best execution time E_best, τ gets closer to 1, reporting that the whole potential performance improvement has been achieved.
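The metric of Equation (1) and its two limiting cases can be checked with a few lines of Python (the execution time values are invented for illustration):

```python
def normalized_speedup(E, E_ref, E_best):
    """tau = (E_ref - E) / (E_ref - E_best), per Equation (1)."""
    return (E_ref - E) / (E_ref - E_best)

# E equal to the reference time gives tau = 0 (no improvement);
# E equal to the best time gives tau = 1 (full potential reached).
print(normalized_speedup(E=10.0, E_ref=10.0, E_best=8.0))  # 0.0
print(normalized_speedup(E=8.0,  E_ref=10.0, E_best=8.0))  # 1.0
print(normalized_speedup(E=9.0,  E_ref=10.0, E_best=8.0))  # 0.5
```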

Figure 5 reports the τ achieved by the proposed optimization technique and by the random iterative compilation technique in reference to the execution time obtained by -O3 (E_ref). Similarly, Figure 6 reports the normalized speed-up τ of the proposed approach in reference to the random iterative compilation technique (E_ref). Moreover, to highlight the benefits of including the application characterization based on MICA in the Bayesian model, Figure 6 reports the evaluation of the proposed approach when including (continuous line) or not including (dotted line) it in the model. The comparisons reported in Figures 5 and 6 have been carried out by considering


Fig. 5: Normalized speed-up of the proposed method w.r.t. -O3, as a function of the number of extractions (proposed optimization vs. random iterative compilation).

Fig. 6: Normalized speed-up of the proposed method (with and without MICA) w.r.t. random iterative compilation, as a function of the number of extractions.

the same number of compiler optimization sequences sampled for both the random iterative compilation and the proposed approach.

We can observe in Figure 5 that, by sampling more than 4 compiler optimization sequences, both approaches under analysis outperform the reference optimization level -O3, reporting positive values of τ. The normalized speed-up with respect to -O3 grows quickly, reaching 80% of the potential performance when extracting 20 and 24 compiler optimization sequences respectively for the proposed approach and the random iterative compilation approach. We can also observe that the proposed approach outperforms the iterative methodology. By increasing the number of extractions, the achieved normalized potential improvement advantage goes from about 15% (from 10 to 20 extractions) to about 10% (above 20 extractions). This is reported more clearly in Figure 6. When targeting the extraction of very few compiler optimization sequences (x axis, from 1 to 9 extractions), the proposed methodology achieves about 25% of the potential performance improvement w.r.t. the random iterative compilation methodology. Furthermore, the proposed methodology has been evaluated both including and excluding the platform-independent characterization, and the performance improvement obtained by using MICA is clearly observable.

Evaluation of a practical usage. When using iterative compilation in practice, a typical approach is to decide the effort to be put into the optimization itself. This effort can be measured in terms of optimization time, which is directly proportional to the number of compilations to be executed. Thus, we evaluated the proposed optimization approach in

Fig. 7: Box-plot of application speedup varying the exploration efforts (extractions) of the reference random iterative compilation, while keeping the exploration effort of the proposed methodology constant at 8 extractions.

terms of the application performance reached after a fixed number of compilations. In particular, we fix this number to 8, which represents 6.25% of the overall optimization space.

This evaluation has been carried out in reference to the random iterative compilation methodology, as it has been demonstrated to outperform static/dynamic compilation techniques in the long run [1], [6], [9]. The goal is to estimate the speedup achieved in terms of both application performance and compilation effort. Figure 7 reports the box-plot analysis of the application speedup, keeping the compilation effort of the proposed methodology at 8 compilations (or extractions) and varying the compilation effort of the random iterative compilation. Each box represents the data excluding the outliers, having set the two quantiles respectively to 90% and 10%. The central box notch reports the median.

When using the same compilation effort of 8 extractions for the random approach, we obtain an application speedup of 1.09 on average. The box-plot distribution demonstrates that a slowdown in reference to the random iterative compilation happens very rarely when considering the same compilation effort.

Of course, by increasing the compilation effort of the random iterative compilation above 8 extractions while keeping the exploration effort of the proposed approach constant, the application speedup decreases. The random iterative compilation needs an effort of 24 extractions to achieve, on average, the application performance obtained with 8 extractions from the proposed methodology. This means that the proposed methodology provides a speedup of 3× in terms of optimization effort.

V. CONCLUSIONS

This paper presents a methodology to infer, within a Bayesian framework, the best compiler optimizations to be applied for optimizing the performance of a target application. The methodology has been validated on an ARM-based platform considering the GCC compiler. On average, it delivers 1.56 and 1.48 performance speedup respectively in reference to the optimization levels -O2 and -O3, when considering an optimization effort of 8 compilations (about 6% of the considered compiler optimization space). A factor of 3× speedup in terms of optimization effort has been empirically


measured in reference to state-of-the-art iterative compilation approaches.

REFERENCES

[1] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. O'Boyle, J. Thomson, M. Toussaint, and C. K. Williams. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization, pages 295–305. IEEE Computer Society, 2006.

[2] A. H. Ashouri, V. Zaccaria, S. Xydis, G. Palermo, and C. Silvano. A framework for compiler level statistical analysis over customized VLIW architecture. In Very Large Scale Integration (VLSI-SoC), 2013 IFIP/IEEE 21st International Conference on, pages 124–129. IEEE, 2013.

[3] F. Bodin, T. Kisuki, P. Knijnenburg, M. O'Boyle, E. Rohou, et al. Iterative compilation in a non-linear optimisation space. In Workshop on Profile and Feedback-Directed Compilation, 1998.

[4] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. F. P. O'Boyle, and O. Temam. Rapidly selecting good compiler optimizations using performance counters. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '07, pages 185–197, Washington, DC, USA, 2007. IEEE Computer Society.

[5] Y. Chen, S. Fang, Y. Huang, L. Eeckhout, G. Fursin, O. Temam, and C. Wu. Deconstructing iterative optimization. ACM Transactions on Architecture and Code Optimization (TACO), 9(3):21, 2012.

[6] Y. Chen, Y. Huang, L. Eeckhout, G. Fursin, L. Peng, O. Temam, and C. Wu. Evaluating iterative optimization across 1000 datasets. In Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '10, pages 448–459, New York, NY, USA, 2010. ACM.

[7] G. Fursin. Collective benchmark (cBench), a collection of open-source programs with multiple datasets assembled by the community to enable realistic benchmarking and research on program and architecture optimization, 2010.

[8] G. Fursin, C. Miranda, O. Temam, M. Namolaru, E. Yom-Tov, A. Zaks, B. Mendelson, E. Bonilla, J. Thomson, H. Leather, et al. Milepost GCC: machine learning based research compiler. In GCC Summit, 2008.

[9] G. Fursin, M. F. O'Boyle, and P. M. Knijnenburg. Evaluating iterative compilation. In Languages and Compilers for Parallel Computing, pages 362–376. Springer, 2005.

[10] D. Heckerman and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. In Machine Learning, pages 20–197, 1995.

[11] K. Hoste and L. Eeckhout. Microarchitecture-independent workload characterization. IEEE Micro, 27(3):63–72, 2007.

[12] K. Hoste and L. Eeckhout. COLE: compiler optimization level exploration. In Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 165–174. ACM, 2008.

[13] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis, volume 5. Prentice Hall, Upper Saddle River, NJ, 2002.

[14] T. Kisuki, P. M. Knijnenburg, M. F. O'Boyle, F. Bodin, and H. A. Wijshoff. A feasibility study in iterative compilation. In High Performance Computing, pages 121–132. Springer, 1999.

[15] P. A. Kulkarni, D. B. Whalley, G. S. Tyson, and J. W. Davidson. Practical exhaustive optimization phase order exploration and evaluation. ACM Trans. Archit. Code Optim., 6(1):1:1–1:36, Apr. 2009.

[16] H. Leather, E. Bonilla, and M. O'Boyle. Automatic feature generation for machine learning based optimizing compilation. In Code Generation and Optimization, 2009. CGO 2009. International Symposium on, pages 81–91. IEEE, 2009.

[17] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '05, pages 190–200, New York, NY, USA, 2005. ACM.

[18] The Milepost website. http://www.milepost.eu/.

[19] K. P. Murphy. The Bayes Net Toolbox for Matlab. Computing Science and Statistics, 33:2001, 2001.

[20] The PandaBoard ArchLinux. http://www.pandaboard.org.

[21] S. Purini and L. Jain. Finding good optimization sequences covering program space. ACM Transactions on Architecture and Code Optimization (TACO), 9(4):56, 2013.

[22] M. Tartara and S. Crespi Reghizzi. Continuous learning of compiler heuristics. ACM Trans. Archit. Code Optim., 9(4):46:1–46:25, Jan. 2013.

[23] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. I. August. Compiler optimization-space exploration. In Code Generation and Optimization, 2003. CGO 2003. International Symposium on, pages 204–215. IEEE, 2003.

[24] K. Vaswani, M. J. Thazhuthaveetil, Y. N. Srikant, and P. J. Joseph. Microarchitecture sensitive empirical models for compiler optimizations. In Code Generation and Optimization, 2007. CGO '07. International Symposium on, pages 131–143, March 2007.
