
A Parallel Bandit-Based Approach for Autotuning FPGA Compilation

Chang Xu1,∗, Gai Liu2, Ritchie Zhao2, Stephen Yang3, Guojie Luo1, Zhiru Zhang2,†

1 Center for Energy-Efficient Computing and Applications, Peking University, Beijing, China
2 School of Electrical and Computer Engineering, Cornell University, Ithaca, USA

3 Xilinx, Inc., San Jose, USA
∗[email protected], †[email protected]

Abstract

Mainstream FPGA CAD tools provide an extensive collection of optimization options that have a significant impact on the quality of the final design. These options together create an enormous and complex design space that cannot effectively be explored by human effort alone. Instead, we propose to search this parameter space using autotuning, which is a popular approach in the compiler optimization domain. Specifically, we study the effectiveness of applying the multi-armed bandit (MAB) technique to automatically tune the options for a complete FPGA compilation flow from RTL to bitstream, including RTL/logic synthesis, technology mapping, placement, and routing. To mitigate the high runtime cost incurred by the complex FPGA implementation process, we devise an efficient parallelization scheme that enables multiple MAB-based autotuners to explore the design space simultaneously. In particular, we propose a dynamic solution space partitioning and resource allocation technique that intelligently allocates computing resources to promising search regions based on the runtime information of search quality from previous iterations. Experiments on academic and commercial FPGA CAD tools demonstrate promising improvements in quality and convergence rate across a variety of real-life designs.

1. Introduction

Over the last three decades, FPGAs have evolved from a small chip with a few thousand logic blocks to billion-transistor system-on-chips containing hardened DSP blocks, embedded memories, multicore processors, alongside millions of programmable logic elements. Concurrently, FPGA development tools have also grown into sophisticated design environments. Compiling an RTL design into bitstream typically involves heuristically solving a sequence of complex combinatorial optimization problems such as logic synthesis, technology mapping, placement, and routing [7].

To meet the stringent yet diverse design requirements from different domains and use cases, modern FPGA CAD tools commonly provide users with a large collection of optimization options (or parameters) that have a significant impact on the quality of the final design. For instance, the placement step alone in the Xilinx Vivado Design Suite offers up to 20 different parameters, translating to a search space of more than 10^6 design points [3]. In addition, multiple options may interact in subtle ways resulting in unpredictable effects on solution quality. Traditionally, navigating through such an enormous design space requires designers to rely on either prior design experience or vendor-supplied guidelines. Such ad hoc design practices incur costly manual effort to achieve the desired quality of results (QoR). Worse, each new design may require a drastically different set of options to achieve the best QoR [24].

One solution to improve design productivity is employing meta-heuristic search techniques to explore the parameter space automatically. Figure 1 shows the improvement of the worst negative slack (WNS) of three designs generated by Vivado, each tuned using three different search techniques: active learning, Bayes classification, and greedy mutation. From our experiments, it is evident that the most effective search technique (in terms of the number of Vivado runs needed to close timing) varies across different designs. Intuitively, distinct designs often present vastly different structures of the search space. Besides, different phases of the design space exploration benefit from different search techniques. For example, stochastic methods such as genetic algorithms may be more useful during the initial phase of the search, while first-order optimizations like gradient descent are very efficient in finding local minima once the promising search space is narrowed.

The above observations clearly motivate the use of an ensemble of search heuristics rather than one particular technique to effectively explore the design space of FPGA compilation. Similar insights were also gained in the OpenTuner project, which aimed to provide an extensible open-source framework for software program autotuning [4]. OpenTuner currently incorporates a collection of search techniques to provide robustness against different search spaces and uses the multi-armed bandit (MAB) algorithm [11] to determine the allocation of trials between the available techniques dynamically.

Figure 1: The search traces of three designs using three different search algorithms with the goal of improving worst negative slack — We use meta-heuristic algorithms to analyze results from Vivado, and guide the selection of Vivado configuration parameters. The x-axis denotes the number of Vivado runs. (a) Greedy mutation, a simple genetic algorithm, is the first to close timing for binary GCD; (b) active learning, a semi-supervised machine learning technique, is the first to close timing for computational fluid dynamics; (c) Bayes classification, the naïve Bayes classifier, is the first to close timing for the bubble sort design.

In addition to applications in program autotuning [4], MAB has already been applied to many important optimization problems in various fields, such as artificial intelligence [19, 22] and operations research [9, 15].

Since FPGA CAD tools usually require long execution times (minutes to hours for real-world designs), it is crucial to significantly speed up the MAB-guided search without sacrificing the final QoR. An intuitive approach is launching multiple machines simultaneously, each conducting a MAB-guided search within the solution space independently. Alternatively, one can use a more efficient scheme that dynamically partitions the solution space into multiple partitions, and allocates additional computing resources to regions that are more likely to generate high-quality solutions.

In this paper, we propose DATuner — a parallel bandit-based framework for autotuning FPGA compilation. DATuner is built on OpenTuner but instead focuses on improving the productivity and quality of FPGA-targeted hardware designs. We also propose scalable and effective parallelization techniques based on dynamic solution space partitioning to speed up the convergence of DATuner. Our main contributions are as follows:

1. We adapt OpenTuner to tune the CAD tool parameters for FPGA compilation and demonstrate the effectiveness of the bandit-based approach in improving the design QoR.

2. We propose a scalable parallelization scheme which accomplishes the following: (1) efficiently partitions the global solution space into promising subspaces; (2) allocates compute resources among subspaces to balance the exploration of unknown subspaces and the exploitation of subspaces with known high-quality solutions.

3. Experiments with DATuner on academic and commercial FPGA CAD tools demonstrate very encouraging improvements in design quality across a variety of real-life benchmarks. We believe that our framework is also applicable to many other EDA problems.

The rest of the paper is organized as follows: Section 2 introduces the preliminaries that serve as the basics of this work; Section 3 discusses our proposed techniques; Section 4 presents the experimental results; Section 5 summarizes the related work, followed by conclusions in Section 6.

2. Preliminaries

In this section, we provide an overview of the MAB problem formulation, its usage in OpenTuner, as well as the basics of the FPGA compilation process.

2.1 Multi-Armed Bandit Approach

The MAB problem and its solutions are extensively studied in statistics and machine learning [11]. The classic problem is formulated as a game, where the player has a fixed number of actions to choose from (i.e., arms), each of which has a reward given by an unknown probability distribution with an unknown expected value. The game progresses in rounds, and in each round, the player chooses one action (i.e., pulls an arm) and obtains a reward sampled from the corresponding distribution. The reward loosely captures the effectiveness of an arm, and crucially, its probability distribution is learned during the process of the game. The objective is to maximize the total payoff after all the rounds. An effective MAB algorithm must find the right balance between exploitation (choosing the known best arm to obtain the highest expected reward) and exploration (selecting an infrequently used arm to gain more information about its reward distribution). Choosing an infrequently used arm sacrifices short-term gain for the possibility of discovering greater payoff in the long run. Existing methods are usually randomized algorithms, which pick an action in each round based on the history of chosen actions and observed rewards so far [5, 6]. The quality metric of an MAB algorithm is regret, which is the ratio between the optimal payoff (obtained by pulling the optimal arm every round) and that generated from the MAB algorithm. Several known MAB algorithms can achieve a regret of O(log N) for an N-round MAB, which has been shown to be asymptotically optimal [8].
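To make the game loop concrete, here is a minimal sketch (not taken from the paper) that plays a three-armed Bernoulli bandit with a simple epsilon-greedy rule; the arm probabilities, epsilon, and round count are arbitrary illustration values rather than anything used by DATuner.

```python
import random

# Three arms with hidden Bernoulli reward probabilities; the player keeps
# running estimates and trades exploitation off against exploration with a
# simple epsilon-greedy rule.
TRUE_MEANS = [0.3, 0.5, 0.8]   # unknown to the player
EPSILON = 0.1                  # fraction of rounds spent exploring
ROUNDS = 1000

counts = [0] * len(TRUE_MEANS)       # pulls per arm
estimates = [0.0] * len(TRUE_MEANS)  # running mean reward per arm
total_payoff = 0.0

for t in range(ROUNDS):
    if random.random() < EPSILON:
        arm = random.randrange(len(TRUE_MEANS))                        # explore
    else:
        arm = max(range(len(TRUE_MEANS)), key=lambda a: estimates[a])  # exploit
    reward = 1.0 if random.random() < TRUE_MEANS[arm] else 0.0         # pull the arm
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]          # incremental mean
    total_payoff += reward

print("pulls per arm:", counts)
print("estimated means:", [round(e, 2) for e in estimates])
print("total payoff:", total_payoff, "  optimal expectation:", max(TRUE_MEANS) * ROUNDS)
```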

Recently, an open-source autotuning framework called OpenTuner has adopted MAB to improve the runtime of software benchmarks [4]. Specifically, OpenTuner incorporates an ensemble of meta-heuristics for effectively searching the space of software compiler options. Examples of these heuristic methods include differential evolution, genetic algorithms, particle swarm optimization, and simulated annealing. The MAB algorithm used in OpenTuner treats each search method as an arm, and measures its reward using the area under curve mechanism — if an arm has yielded a new global best, an upward line is drawn; otherwise, a flat line is drawn. The area under this curve (scaled to a maximum value of 1) is the total payoff attributed to the corresponding arm. To balance exploitation and exploration, OpenTuner ranks each arm with a weighted sum of the area under curve metric and the frequency of its previous uses. Notably, OpenTuner reported up to 2.8x speedup at no programming cost by automatically tuning GCC flags.
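The sketch below illustrates one plausible reading of this credit-assignment scheme; it is not OpenTuner's actual implementation, and the exploration term and its weight in rank_arms are assumptions made purely for illustration.

```python
def auc_credit(history):
    """Area-under-curve credit for one arm.

    `history` holds one boolean per trial run by this arm, True when that
    trial produced a new global best. The curve steps up on an improvement
    and stays flat otherwise; the area is scaled so that an arm improving
    on every trial scores 1.0.
    """
    if not history:
        return 0.0
    area, height = 0.0, 0
    for improved in history:
        if improved:
            height += 1
        area += height
    n = len(history)
    return area / (n * (n + 1) / 2.0)   # divide by the best possible area

def rank_arms(histories, uses, exploration_weight=0.5):
    """Rank arms by a weighted sum of AUC credit and an exploration bonus
    that favors rarely used arms (the exact weighting is an assumption)."""
    total_uses = sum(uses.values()) or 1
    scores = {arm: auc_credit(hist) + exploration_weight * (1.0 - uses[arm] / total_uses)
              for arm, hist in histories.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Two search techniques treated as arms: the genetic arm found two new global
# bests in its four trials, the annealing arm one in its two trials.
histories = {"genetic": [True, False, True, False], "annealing": [False, True]}
uses = {"genetic": 4, "annealing": 2}
print(rank_arms(histories, uses))   # -> ['genetic', 'annealing']
```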

2.2 FPGA Compilation Flow

The mainstream FPGA compilation flow takes an RTL design as input and generates a device-specific bitstream. This process involves several distinct and modular steps: logic synthesis, technology mapping, placement, and routing. Synthesis lowers the RTL design into a technology-independent logic or gate-level network. Technology mapping then maps this network into a netlist of look-up tables (LUTs). Placement determines the physical location of each LUT in the netlist, and routing connects all signal paths using the available programmable interconnects.

Many of these steps involve NP-hard problems. To tackle the difficulty of solving these problems, experts propose approximate solutions based on heuristic methods. FPGA CAD tools often provide designers with a set of configuration parameters that select between heuristics or influence the behavior of a heuristic. Examples of parameters include enabling remapping or retiming in logic synthesis, deciding how much to spread logic for congestion, or how much to weight wire delay. Table 1 shows the tunable parameters available in the open-source Verilog-to-Routing (VTR) toolflow [16], covering logic synthesis, packing, placement, and routing. With commercial FPGA CAD tools from Altera/Intel and Xilinx, a much larger collection of switches is available (roughly 60 to 80 options are exposed to designers).

Table 1: List of tunable VTR configuration parameters.

Parameter                  Value                                Stage
resyn                      {on, off}                            logic synthesis
resyn2                     {on, off}                            logic synthesis
resyn3                     {on, off}                            logic synthesis
alpha_clustering           [0, 1]                               packing
beta_clustering            [0, 1]                               packing
allow_unrelated_cluster    {on, off}                            packing
connection                 {on, off}                            packing
alpha_t                    [0.5-0.9]                            placement
seed                       {1, 2, 3, 4, 5}                      placement
inner_num                  {1, 10, 100}                         placement
timing_tradeoff            [0.3-0.7]                            placement
inner_loop_recompute       {0, 1, 5}                            placement
td_place_exp_first         {0, 1, 3}                            placement
td_place_exp_last          {5, 8, 10}                           placement
max_router_iterations      {20, 50, 80}                         routing
initial_pres_fac           [0.3-100]                            routing
pres_fac_mult              [1.2-2]                              routing
acc_fac                    [1-2]                                routing
bb_factor                  {1, 3, 5}                            routing
astar_fac                  [1-2]                                routing
max_criticality            [0.8-1]                              routing
criticality_exp            [0.8-1]                              routing
base_cost_type             {'demand_only', 'delay_normalized'}  routing

2.3 Autotuning FPGA Compilation Parameters

Obviously, these tool options together create an enormous design space which cannot be effectively explored by human effort alone. Besides, there is no single set of compilation parameters that works for all designs [24]. Thus, we propose to automatically configure the FPGA tool options to achieve faster design closure and better QoRs. The QoR metrics can be timing slack, resource usage, or power consumption. Due to the slow runtime of FPGA compilation, effective parallelization is paramount to ensure the viability of autotuning.
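As a rough illustration of what such an autotuning setup looks like, the sketch below represents a few of the Table 1 options as a mixed discrete/continuous search space and wraps a (faked) CAD run as the QoR objective; the run_cad_flow stub and its return value are placeholders, not a real VTR or Vivado invocation.

```python
import random

# The tool options form a mixed discrete/continuous search space (the names
# below follow Table 1), and a single evaluation is one full CAD run whose
# reported QoR becomes the objective value. The CAD run is faked here.
SPACE = {
    "resyn":            ["on", "off"],    # logic synthesis
    "alpha_clustering": (0.0, 1.0),       # packing, continuous range
    "seed":             [1, 2, 3, 4, 5],  # placement
    "inner_num":        [1, 10, 100],     # placement
    "astar_fac":        (1.0, 2.0),       # routing, continuous range
}

def sample_config(space):
    """Draw one random configuration from the mixed search space."""
    return {name: random.uniform(*dom) if isinstance(dom, tuple) else random.choice(dom)
            for name, dom in space.items()}

def run_cad_flow(config):
    """Placeholder for a real synthesis/pack/place/route run.

    A real integration would build the tool's command line from `config`,
    execute the flow, and parse a QoR metric (e.g., critical-path delay,
    worst negative slack, or resource usage) from the reports.
    """
    return random.uniform(5.0, 10.0)      # fake critical-path delay in ns

qors = [run_cad_flow(sample_config(SPACE)) for _ in range(20)]
print("best QoR over 20 random samples: %.2f ns" % min(qors))
```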

In this paper, we propose a parallel search methodology named DATuner. DATuner adopts OpenTuner as its core search engine to leverage the advantage of an ensemble over a single technique.

3. DATuner Techniques

In this section, we first propose a dynamic solution space partitioning method for parallelization. We also provide a MAB-based method for uneven computing resource allocation. Then we illustrate our parallelization framework.

3.1 Motivation

We formulate the EDA autotuning problem as a search problem in an N-dimensional solution space called S0, where each dimension can be either continuous or discrete. Obviously, S0 grows exponentially with the dimensionality of the solution space. When N is large, it is impractical to exhaustively traverse the search space to find the optimal solution. A simple approach is to partition S0 into subspaces, launch multiple parallel searches, and assign each search instance to explore within one subspace. However, since some subspaces are more promising than others in terms of yielding good solutions, it is vital to properly partition the solution space in a way that the majority of search instances are assigned to the most promising subspaces. Of course, just as there is no single set of tool parameters that works well for all designs, no static partitioning of the solution space is optimal across designs.

Finding a suitable partitioning of the search space is key to improving the quality of the search. But doing so manually is usually hard, and requires adequate domain knowledge. Thus, we propose a novel parallelization method, where we gradually identify promising subspaces via dynamic partitioning based on QoR samples obtained at runtime. We further propose an MAB-based compute resource allocation method for uneven sampling — more compute resources are assigned to promising subspaces to increase sampling quality.

3.2 Dynamic Solution Space Partition

We propose to partition the solution space dynamically, where the partitioning does not rely on any prior knowledge; instead it is decided by posterior knowledge learned during runtime. More specifically, our partitioning method iteratively constructs a space partitioning tree (SP tree), where the root node of the tree represents the initial solution space, and each intermediate node represents a subspace. The leaf nodes of the SP tree collectively form the active partitioning that is currently used by the parallel search instances. In each iteration, we select a subspace and divide it into multiple smaller partitions. As a result, a leaf in the SP tree becomes an intermediate node that branches out to multiple new children, with each "newborn" representing a newly created subspace.
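A minimal data-structure sketch of such an SP tree is shown below; the field names and the per-value splitting rule are assumptions for illustration and need not match DATuner's actual implementation.

```python
class SPNode:
    """One node of the space-partitioning (SP) tree described above.

    Each node owns a subspace (a dict mapping dimension name to its allowed
    values) plus the QoR samples observed inside it. Leaves form the active
    partitioning used by the parallel search instances.
    """
    def __init__(self, subspace):
        self.subspace = subspace      # {dimension: list of allowed values}
        self.samples = []             # (config, qor) pairs seen in this subspace
        self.children = []            # empty for leaves

    def is_leaf(self):
        return not self.children

    def split(self, dimension):
        """Turn this leaf into an intermediate node, one child per value bin."""
        for value in self.subspace[dimension]:
            child_space = dict(self.subspace)
            child_space[dimension] = [value]
            child = SPNode(child_space)
            # Hand down the samples that fall inside the child's subspace.
            child.samples = [(c, q) for c, q in self.samples if c[dimension] == value]
            self.children.append(child)
        return self.children

def leaves(root):
    """Collect the active partitioning (all current leaves) of the SP tree."""
    if root.is_leaf():
        return [root]
    return [leaf for child in root.children for leaf in leaves(child)]

# Usage: start from a toy 2-D space and split the root along one dimension.
root = SPNode({"x": ["lo", "hi"], "y": ["lo", "hi"]})
root.samples = [({"x": "lo", "y": "hi"}, 0.9), ({"x": "hi", "y": "lo"}, 0.2)]
root.split("x")
print(len(leaves(root)))   # -> 2 active subspaces
```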

Given an N-dimensional solution space, we propose to dynamically partition the search space into subspaces and allocate more computing resources for searching within promising subspaces. This dynamic partitioning process is illustrated in Figure 2. S0 represents the initial solution space and S1, S2, ..., Sn are the subspaces iteratively created during the space partitioning process. Figure 2(a) shows the known samples that have been explored, where each grey dot indicates a sample with a good QoR and red crosses are those with poor QoRs. At each step of the partitioning process, a key decision is to select the most profitable dimension out of the N dimensions to partition, such that we can gain as much information about the solution space as possible. Here we propose to examine the entropy and information gain [20] when partitioning a specific dimension. Specifically, for each dimension i in the N-dimensional search space, we compute the conditional entropy assuming we partition along dimension i and derive the corresponding information gain. We select the dimension with the highest information gain to partition.

Formally, for a subspace Si, we define the set of known samples within Si as Di. We further label each sample as good or bad based on its associated QoR, and use Dgi and Dbi to denote the subsets of good and bad samples within Di, respectively (note that Dgi ∪ Dbi = Di). Then we define the entropy of Di as

H(D_i) = -\left( \frac{|D_{gi}|}{|D_i|} \log\frac{|D_{gi}|}{|D_i|} + \frac{|D_{bi}|}{|D_i|} \log\frac{|D_{bi}|}{|D_i|} \right)

where |Di|, |Dgi|, |Dbi| are the cardinalities of the sets Di, Dgi, and Dbi, respectively. We next define the conditional entropy of Di conditioned on a specific dimension of Si. Suppose d represents the parameter chosen for the d-th dimension of Si, and we assume d has k possible discrete values. If we further define H(Di | d = j) as the entropy of Di conditioned on d taking the value of j, we will have

H(D_i \mid d) = \sum_{j=1}^{k} \frac{|D_{i,d=j}|}{|D_i|} H(D_i \mid d = j)

Here |D_{i,d=j}| is the cardinality of the set of samples in Si with d set to j. With the above notations, we can formally define the information gain along dimension d as G(Di, d) = H(Di) − H(Di | d). Our dynamic space partitioning algorithm creates a new subspace by partitioning along the dimension with the highest information gain. If the chosen dimension has k possible parameter values (continuous values will be discretized), it will get partitioned and become an intermediate node in the SP tree with (k − 1) new children.

For the example in Figure 2(a), we have the following calculations when deciding whether to partition along the x or y dimension. Here we assume we have sampled 14 design points and eight of them have good QoRs. Therefore, the entropy of the initial solution space S0 is

H(D_0) = -\left( \frac{8}{14}\log\frac{8}{14} + \frac{6}{14}\log\frac{6}{14} \right) = 0.986

We then compute the conditional entropy of D0 conditioned on dimensions x and y as

H(D_0 \mid x) = \frac{8}{14} \times 0.81 + \frac{6}{14} \times 0.91 = 0.853
H(D_0 \mid y) = \frac{6}{14} \times 1.00 + \frac{8}{14} \times 0.96 = 0.977

Finally, we compute the information gain for partitioning along the x or y dimension as G(D0, x) = 0.133 and G(D0, y) = 0.009. Since the former information gain is higher, we decide to partition S0 along the x dimension, which results in two new subspaces S1 and S2, as shown in Figure 2(b,c). In the next iteration of this process, we follow the same method and choose to further partition along the y dimension in S2 as shown in Figure 2(d).
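The entropy-based dimension selection can be written compactly; the sketch below follows the definitions of Section 3.2 (using base-2 logarithms, which reproduce the 0.986 figure above), with a toy usage example on hypothetical, already-discretized samples.

```python
from math import log2

def entropy(good, bad):
    """Binary entropy of a sample set with `good` good and `bad` bad QoR samples."""
    total = good + bad
    h = 0.0
    for count in (good, bad):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

def information_gain(samples, dim):
    """G(D, dim) = H(D) - H(D | dim); samples are (config, is_good) pairs."""
    good = sum(1 for _, g in samples if g)
    h_total = entropy(good, len(samples) - good)
    h_cond = 0.0
    for value in {cfg[dim] for cfg, _ in samples}:        # group by the value along dim
        subset = [g for cfg, g in samples if cfg[dim] == value]
        h_cond += len(subset) / len(samples) * entropy(sum(subset), len(subset) - sum(subset))
    return h_total - h_cond

def best_dimension(samples, dims):
    """Pick the dimension to split on, as in Section 3.2."""
    return max(dims, key=lambda d: information_gain(samples, d))

# Toy usage with already-discretized values: splitting on x separates good
# from bad samples perfectly, so x has the higher information gain.
samples = [({"x": "lo", "y": "hi"}, True), ({"x": "lo", "y": "lo"}, True),
           ({"x": "hi", "y": "hi"}, False), ({"x": "hi", "y": "lo"}, False)]
print(best_dimension(samples, ["x", "y"]))   # -> "x"
```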

Figure 2: Conceptual illustration of the solution space partitioning method — The solution space is 2D with each dimension constrained to [0,1]. Grey circles represent samples with good QoR and red crosses are poor samples. (a) and (c) show the partitions and the known samples at each iteration. (b) and (d) show the corresponding SP trees. Given four cores, we assign a different number of cores to each subspace based on the quality of solutions in that subspace. At each step, we choose the dimension with the highest information gain to partition.

3.3 MAB-directed Computing Resource Allocation

Once we obtain an SP tree, we need to properly allocate the available compute resources to the subspaces to maximize the overall sampling rewards. The key here is to balance exploitation (i.e., allocating more search instances to the most promising subspace) with exploration (i.e., sampling less promising subspaces to obtain a better estimate of enclosed solutions). Just as we can leverage MAB to choose the search technique when exploring within a subspace, we also propose to use it to solve the resource allocation problem. More concretely, we treat the subspaces as arms. To calculate the reward, we employ the UCB1 algorithm [5], which defines the reward of a subspace Si at round t as

payoff(S_i, t) = \bar{x}_i + \sqrt{\frac{2 \ln t}{n_i}}

In this formulation, \bar{x}_i is the average QoR of samples in Si and n_i is the number of times (i.e., frequency) Si has been chosen so far by the MAB. Essentially, the QoR and frequency terms are used to balance exploitation and exploration in the compute resource allocation.
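One possible way to turn these UCB1 scores into a core allocation is sketched below; the greedy one-core-at-a-time assignment and the illustrative statistics are assumptions rather than DATuner's exact policy, though with these numbers the four cores split 1 and 3 as in Figure 2(b).

```python
from math import log, sqrt

def ucb1_payoff(avg_qor, n_chosen, t):
    """payoff(Si, t) = average QoR of Si plus the UCB1 exploration bonus."""
    if n_chosen == 0:
        return float("inf")              # always try an unsampled subspace first
    return avg_qor + sqrt(2.0 * log(t) / n_chosen)

def allocate_cores(subspaces, num_cores, t):
    """Greedy sketch: hand cores out one at a time to the highest-payoff subspace.

    `subspaces` maps a subspace name to its average normalized QoR (`avg`,
    higher is better) and how often it has been chosen so far (`n`). Each
    assignment counts as one more choice, so a subspace's exploration bonus
    shrinks as it accumulates cores.
    """
    counts = {name: info["n"] for name, info in subspaces.items()}
    allocation = {name: 0 for name in subspaces}
    for _ in range(num_cores):
        best = max(subspaces, key=lambda s: ucb1_payoff(subspaces[s]["avg"], counts[s], t))
        allocation[best] += 1
        counts[best] += 1
    return allocation

# Illustrative statistics for the two subspaces of Figure 2(b): S2 holds the
# better samples and attracts three of the four cores, while S1 still gets one.
print(allocate_cores({"S1": {"avg": 0.4, "n": 4}, "S2": {"avg": 0.7, "n": 6}}, 4, t=10))
# -> {'S1': 1, 'S2': 3}
```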

3.4 Parallelization Framework

Figure 3 shows the overall flow of the proposed parallelization scheme, which follows a master-slave model. The master is responsible for distributing the parallel slave processes (search instances) to different subspaces. It starts with the original global space, and after collecting a sufficient number of samples, it performs dynamic space partitioning and allocates search instances to appropriate subspaces. A slave process invokes its search techniques to explore a specific subspace, and reports its results back to the master. The master then further partitions the search space and reallocates the slaves. This process iterates until attaining the target QoR or reaching timeout.

Figure 3: DATuner parallelization framework — We apply a master/slave parallelization model and use MPI for communication. The master sends tasks to slaves. A task specifies which subspace the slave instance should explore and may contain a historical best-found QoR sample (if one exists) in that subspace for the slave instance to use as a search seed.
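A minimal sketch of such a master/slave loop with mpi4py is given below; the stubbed search, the fixed subspace list, and the tag values are placeholders, and DATuner's real protocol (dynamic partitioning, seeding, termination checks) is only hinted at in the comments.

```python
"""Minimal master/slave sketch in the spirit of Figure 3, using mpi4py.
The search itself is stubbed with random numbers; a real slave would run an
OpenTuner instance on the assigned subspace and report the QoR it found.
Run with, e.g.: mpirun -np 9 python datuner_mpi_sketch.py
"""
import random
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
ITERATIONS = 10
TAG_TASK, TAG_RESULT, TAG_STOP = 1, 2, 0

def run_local_search(subspace, seed=None):
    """Stub for a slave's tuner run; returns a fake QoR for the subspace."""
    return random.random()

if rank == 0:                                   # ----- master -----
    subspaces = ["S1", "S2"]                    # would come from dynamic partitioning
    results = []
    for it in range(ITERATIONS):
        for slave in range(1, size):            # resource allocation decides this mapping
            comm.send({"subspace": random.choice(subspaces)}, dest=slave, tag=TAG_TASK)
        for _ in range(1, size):
            status = MPI.Status()
            qor = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT, status=status)
            results.append(qor)
        # ... here the master would update the SP tree and re-partition ...
    for slave in range(1, size):
        comm.send(None, dest=slave, tag=TAG_STOP)
    print("best QoR seen:", max(results))
else:                                           # ----- slave -----
    while True:
        task = comm.recv(source=0, tag=MPI.ANY_TAG)
        if task is None:
            break
        comm.send(run_local_search(task["subspace"]), dest=0, tag=TAG_RESULT)
```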

4. Experimental Results

In this section we evaluate the QoR improvement and convergence time of DATuner on two widely-used FPGA CAD tools: (1) the academic Verilog-to-Routing (VTR) framework [16] (version 7.0) and (2) the Xilinx Vivado Design Suite [3] (version 2016.1). We run DATuner with eight machines (each machine with one search instance). Each machine has a quad-core Xeon processor running at 2.8GHz.

We make use of a set of real-life benchmarks from several different domains including image processing, general-purpose computing, communication, security, and computer vision. For the experiments targeting VTR, we choose 10 benchmarks covering different domains from the VTR-7.0 benchmark suite [16]. For the experiments using Vivado, we use five large industry designs. The characteristics of the VTR benchmarks and industry designs are summarized in Table 2 and Table 3, respectively.

Table 2: Profiles of the benchmarks used for VTR tuning.

Circuit          LUT    BRAM    Description
MkPktMerge        239    7344   Packet processing
Diffeq1           362       0   Scientific computing
Ch_intrinsics     425     256   Memory system
Raygentop        1884    5376   Ray tracing
MkSMAdapter4B    1960    4456   Packet processing
Sha              2001       0   Cryptography
Boundtop         3053   32768   Ray tracing
Or1200           3075    2048   RISC processor
Blob_merge       8067       0   Image processing
Stereovision0    9567       0   Computer vision

Table 3: Profiles of the industry benchmarks used for Vivado tuning.

Circuit    FF      LUT     Constraint (ns)   Device
Design1    14545   14122   2.60              Virtex 7K160T
Design2    17847   29012   6.55              Virtex 7K70T
Design3    18204   28361   2.00              Virtex 7K160T
Design4    26098   17242   2.65              Virtex 7VX330T
Design5    27873   38261   4.30              Virtex 7VX330T

We select 23 tunable parameters from the VTR-7.0 manual [16] and 9 tunable parameters from the Vivado flow, covering logic synthesis, packing, placement, and routing. The lists of parameters for VTR and Vivado are shown in Table 1 and Table 4, respectively.

4.1 Tuning VTR

We first compare DATuner with a parallel baseline that performs a static partitioning scheme, denoted as Static-Part, in Section 4.1.1. Then we compare DATuner with a serial baseline running OpenTuner on one machine, denoted as Ser-MAB, in Section 4.1.2.

4.1.1 Comparison with Static Partitioning

For the case of static partitioning, we choose three parameters from the VTR parameter list as the pivots for partitioning and partition the solution space into eight subspaces. We empirically choose the partitioning pivots to be alpha_t, allow_unrelated_clustering, and base_cost_type, which usually have a large impact on the timing quality.
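For illustration, the static eight-way split can be expressed as a simple cross product of the three pivots; the concrete value bins below (e.g., halving the alpha_t range) are assumptions paraphrasing Table 1 rather than the exact partitioning used in the experiments.

```python
from itertools import product

# Three pivot parameters with two value choices each give 2 x 2 x 2 = 8
# subspaces, one per machine; all other parameters stay free within every
# subspace.
PIVOTS = {
    "alpha_t":                 [(0.5, 0.7), (0.7, 0.9)],   # split the placement range
    "allow_unrelated_cluster": ["on", "off"],
    "base_cost_type":          ["demand_only", "delay_normalized"],
}

subspaces = [dict(zip(PIVOTS, combo)) for combo in product(*PIVOTS.values())]
assert len(subspaces) == 8
for machine, sub in enumerate(subspaces):
    print(f"machine {machine}: {sub}")
```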

In Figure 4, we compare DATuner with Static-Part, where both methods use eight machines for parallel searching. We conduct 100 iterations of searching for 10 VTR benchmarks and show the best-found frequency at different iterations. DATuner achieves better QoR as well as a faster convergence rate than Static-Part for eight out of the ten designs.

Table 4: List of tunable Vivado configuration parameters.

Parameter           Value                                  Stage
OptDirective        {Explore, ExploreSequentialArea,       logic synthesis
                     AddRemap, ExploreArea, Default}
PlaceDirective      {Explore, ExtraNetDelay_high,          placement
                     ExtraNetDelay_medium,
                     ExtraNetDelay_low,
                     ExtraPostPlacementOpt,
                     WLDrivenBlockPlacement,
                     LateBlockPlacement,
                     SSI_SpreadLogic_high,
                     SSI_SpreadLogic_low,
                     AltSpreadLogic_low,
                     AltSpreadLogic_medium,
                     AltSpreadLogic_high,
                     ExtraTimingOpt, Default}
Fanout_opt          {on, off}                              post-placement
Placement_opt       {on, off}                              post-placement
Critical_cell_opt   {on, off}                              post-placement
Critical_pin_opt    {on, off}                              post-placement
Retime              {on, off}                              post-placement
Rewire              {on, off}                              post-placement
RouteDirective      {Explore, HigherDelayCost, Default}    routing

Besides, DATuner shows a significant improvement in average QoR over Static-Part, further demonstrating the benefits of dynamic partitioning. From our experiments, the SP trees learnt by DATuner at runtime through dynamic partitioning are indeed very different across different designs.

Table 5 provides a more detailed comparison between the best configurations found by DATuner and static partitioning in terms of the clock frequency, runtime, and area increase over the results from the default setting. It is not surprising to see frequency improvements from both dynamic and static partitioning, which indicates the effectiveness of using MAB-guided search ensembles. In addition, DATuner with dynamic partitioning outperforms the static scheme in the majority of designs in terms of both frequency and LUT counts. The runtime overheads are also similar.

4.1.2 Comparison with Serial Search

We further compare DATuner with Ser-MAB in Figure 5. For a fair comparison, we constrain DATuner and Ser-MAB to the same amount of compute effort, where Ser-MAB runs on one machine for 800 iterations and DATuner runs eight search instances of 100 iterations each. While each MAB instance makes use of an ensemble of heuristic searches to avoid getting easily stuck in local optima, our parallelization scheme further enables multiple MAB search instances to explore additional promising regions that cannot be quickly reached by a single search.

Figure 4: Comparison of dynamic and static partitioning schemes for VTR tuning — Par-MAB uses the Static-Part parallelization scheme and Par-MAB+Dyn-Part is the DATuner method. "best" is the best frequency found up to the current iteration and "avg" is the average frequency of all design points found up to the current iteration.

Table 5: Profiles of the best-found VTR configurations by DATuner and static partitioning — Freq., RT, and LUT are the ratios of the best configuration over default on clock frequency, runtime, and LUT count, respectively.

                  DATuner                     Static-Part
Circuit           Freq.   RT     LUT          Freq.   RT     LUT
Blob_merge        1.11    5.22   1.01         1.09    2.54   1.01
Boundtop          1.17    3.85   1.08         1.14    3.21   1.06
Diffeq1           1.16    1.16   1.06         1.13    1.23   1.36
Ch_intrinsics     1.06    3.51   0.97         1.06    4.59   1.73
MkPktMerge        1.09    0.89   1.20         1.14    0.68   1.00
MkSMAdapter4B     1.04    1.86   0.99         1.01    0.69   1.00
Sha               1.12    2.28   1.00         1.06    1.69   1.00
Raygentop         1.12    3.62   0.99         1.07    2.09   1.42
Or1200            1.08    0.95   1.00         1.06    2.18   1.12
Stereovision0     1.18    1.74   1.29         1.21    2.04   1.24
Avg.              1.11    2.51   1.06         1.09    2.10   1.19

In Figure 6, we attempt to visualize the search trajectory of one VTR design through dimensionality reduction. Here the black dashed line captures the search trajectory of Ser-MAB, which is mostly trapped in one region. Other colored lines represent search trajectories of different search instances instantiated by DATuner, where they are simultaneously exploring different promising regions of the design space. We believe that this partly explains the results in Figure 5, where Ser-MAB results in a worse QoR than that from DATuner in six out of the ten VTR designs.

In Figure 7, we further compare DATuner with Ser-MAB in terms of the runtime to achieve a specific QoR target. We set the QoR target to be a 2% improvement in design frequency over the default design, and measure the speedup in runtime of DATuner (with eight search instances) over Ser-MAB. DATuner reaches the target QoR 11X faster than Ser-MAB.

4.2 Tuning Vivado

We have also applied DATuner to resolve the timing closure problem for five large industry designs, which are listed in Table 3. These designs are specified with very tight timing constraints, and are known to be challenging benchmarks in terms of meeting timing.

In addition to showing the timing improvement over the default settings of Vivado, we also experiment with the Vivado exploration mode, a tool option that explores various optimizations in the placement and routing stages for improving timing.

Figure 5: Comparison of DATuner and sequential search (Ser-MAB) for VTR tuning — We use the same total number of search iterations for DATuner and Ser-MAB.

Figure 6: Search trajectory of VTR tuning for Diffeq1 — We use the t-SNE package [23] to reduce the dimensionality of the search space for the VTR design Diffeq1 from 23 to 2 for the sake of visualization. Different colors represent samples of different search engines. Arrows indicate the search time sequence. The black dashed line represents the trajectory of Ser-MAB. Other colored lines capture traces of DATuner. Ser-MAB is stuck in one promising region while DATuner leads to multiple promising regions.
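For readers who want to reproduce this kind of plot, the sketch below projects (randomly generated stand-in) 23-dimensional configuration samples to 2-D with scikit-learn's t-SNE and colors them by search instance; the data, perplexity, and plotting details are illustrative assumptions, not the settings used for Figure 6.

```python
import numpy as np
from sklearn.manifold import TSNE       # the t-SNE of reference [23], via scikit-learn
import matplotlib.pyplot as plt

# Each evaluated configuration is treated as a 23-dimensional vector (one
# entry per VTR parameter, discrete options encoded numerically), tagged with
# the search instance that produced it. Random data stands in for real logs.
rng = np.random.default_rng(0)
configs = rng.random((400, 23))                 # placeholder configurations
engine = rng.integers(0, 8, size=400)           # which of 8 search instances found each

embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(configs)

plt.scatter(embedded[:, 0], embedded[:, 1], c=engine, cmap="tab10", s=10)
plt.title("Tuning samples projected to 2-D with t-SNE")
plt.savefig("search_trajectory.png")
```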

Figure 8 shows that this mode improves the WNS for four designs, but they still fail to meet the timing constraint. In contrast, DATuner helps close timing for all designs.

We also study the profiles of the best-found configurations by DATuner. Figure 9 shows that the average similarity among the five designs is only 46%, indicating that a one-size-fits-all solution indeed does not exist.

Figure 7: Runtime improvement using DATuner for VTR tuning — Compared with Ser-MAB, DATuner reaches the same frequency target 11.1X faster on average using eight machines.

Table 6 further shows the ratios of worst negative slack (WNS), runtime, and area between the best-found configurations and the default. On average, we reduce WNS by 11% and increase the runtime by a factor of 3.7X. The resource utilization is almost the same as the default configuration.


Figure 8: Vivado tuning by DATuner for large-scale industry designs — These designs fail to meet timing with both Vivado-default and Vivado-explore modes, whereas DATuner closes timing for all of them.

Figure 9: Similarity matrix of best-found configurations for Vivado designs — the average similarity is 46%.

Table 6: Profiles of the best-found Vivado configurations by DATuner — WNS, RT, LUT, and FF are the ratios of the best configuration over the default one in terms of worst negative slack, runtime, LUT count, and flip-flop count, respectively.

Circuit    WNS ratio   RT ratio   LUT ratio   FF ratio
Design1    0.77        2.38       0.83        1.00
Design2    0.93        6.80       1.00        1.00
Design3    0.94        3.26       1.00        1.00
Design4    0.88        4.30       1.03        1.00
Design5    0.90        1.62       1.00        1.00
Avg.       0.89        3.67       0.97        1.00

5. Related Work

Intelligent design space exploration and autotuning-based techniques have been proposed in the domains of high-performance computing (e.g., stencil computation [10] and matrix computation [21]) and compiler optimization (e.g., a domain-specific compiler for image processing applications [18] and the compiler for the Java Virtual Machine [12]).

Similar research efforts have recently emerged in the EDA field to configure CAD tool options to improve design quality automatically. Xilinx SmartXplorer [1] and Altera Design Space Explorer [2] use predefined or user-specified configuration bundles for design space exploration. They support parallel CAD runs with fixed sets of parameters and automatically report the best solution found. Unlike the fixed heuristics used in these tools, DATuner dynamically determines the best search method for the current design.

InTime [13, 14] is a commercial autotuning tool for FPGA timing closure based on the naïve Bayesian classifier, and has recently been extended to include other machine learning techniques as well [24]. InTime builds a database of configurations from a series of preliminary runs and learns to predict the next set of CAD tool options to improve timing results, achieving 30% improvement in timing results compared to vendor-supplied design space exploration tools. The authors in [17] also propose machine learning techniques such as linear regression and random forest to autotune the performance and power consumption of FPGA designs. We note that the learning-based sampling and classification techniques used in InTime [14] and [17] are complementary to our proposal. It is possible to integrate these methods into DATuner as an additional arm in the MAB algorithm. Besides timing closure, our framework can also be applied to other EDA tools.

OpenTuner [4] leverages the MAB method to dynamically select the best search technique from an ensemble of search strategies to tune global compiler switches, which improves the performance of software compilers such as GCC by up to 2.8x over the baseline using gcc -O3. Our work builds on OpenTuner but targets hardware synthesis rather than software compiler optimization, and investigates parallelization techniques to speed up the tuning process.

6. Conclusions

In this paper we propose DATuner, a parallel autotuning framework for FPGA compilation using the multi-armed bandit technique. To mitigate the high runtime cost incurred by the complex CAD optimization process, we devise an efficient parallelization scheme that enables many MAB-based autotuners to explore the design space simultaneously. Concretely, DATuner dynamically partitions the solution space into promising subspaces based on information gain, and allocates compute resources among subspaces to balance the exploration of unknown subspaces and the exploitation of subspaces with known high-quality solutions. Applications of DATuner to VTR and the Xilinx Vivado tools have demonstrated promising improvements in quality and convergence rate across a variety of academic and industry designs.

7. Acknowledgements

This research was supported in part by DARPA Young Faculty Award D15AP00096, Intel Corporation under the ISRA Program, a research gift from Xilinx, Inc., National Natural Science Foundation of China (NSFC) Grant 61520106004, Beijing Natural Science Foundation (BJNSF) Grant 4142022, and the Chinese Scholarship Council.

References

[1] SmartXplorer for ISE Project Navigator Users Tutorial. Xilinx Inc., 2010.
[2] Quartus II Handbook Volume 1: Design and Synthesis. Altera Corporation, 2014.
[3] Vivado Design Suite User Guide. Xilinx Inc., 2015.
[4] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O'Reilly, and S. Amarasinghe. OpenTuner: An Extensible Framework for Program Autotuning. Int'l Conf. on Parallel Architectures and Compilation Techniques (PACT), 2014.
[5] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002.
[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, 2002.
[7] D. Chen, J. Cong, and P. Pan. FPGA Design Automation: A Survey. Foundations and Trends in Electronic Design Automation, 1(3):139–169, 2006.
[8] W. Chen, Y. Wang, and Y. Yuan. Combinatorial Multi-Armed Bandit: General Framework and Applications. Int'l Conf. on Machine Learning (ICML), 2013.
[9] V. F. Farias and R. Madan. The Irrevocable Multiarmed Bandit Problem. Operations Research, 59:383, 2011.
[10] J. D. Garvey and T. S. Abdelrahman. Automatic Performance Tuning of Stencil Computations on GPUs. Int'l Conf. on Parallel Processing (ICPP), pages 300–309, Sept 2015.
[11] J. Gittins, K. Glazebrook, and R. Weber. Multi-Armed Bandit Allocation Indices. 2011.
[12] S. Jayasena, M. Fernando, T. Rusira, C. Perera, and C. Philips. Auto-Tuning the Java Virtual Machine. Parallel and Distributed Processing Symposium Workshop (IPDPSW), pages 1261–1270, 2015.
[13] N. Kapre, B. Chandrashekaran, H. Ng, and K. Teo. Driving Timing Convergence of FPGA Designs through Machine Learning and Cloud Computing. IEEE Symp. on Field Programmable Custom Computing Machines (FCCM), 2015.
[14] N. Kapre, H. Ng, K. Teo, and J. Naude. InTime: A Machine Learning Approach for Efficient Selection of FPGA CAD Tool Parameters. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), 2015.
[15] L. Lai, H. Jiang, and H. V. Poor. Medium Access in Cognitive Radio Networks: A Competitive Multi-Armed Bandit Framework. Asilomar Conf. on Signals, Systems and Computers, 2008.
[16] J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk, M. Nasr, S. Wang, T. Liu, N. Ahmed, K. B. Kent, J. Anderson, J. Rose, and V. Betz. VTR 7.0: Next Generation Architecture and CAD System for FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), June 2014.
[17] A. Mametjanov, P. Balaprakash, C. Choudary, P. D. Hovland, S. M. Wild, and G. Sabin. Autotuning FPGA Design Parameters for Performance and Power. IEEE Symp. on Field Programmable Custom Computing Machines (FCCM), pages 84–91, 2015.
[18] R. T. Mullapudi, V. Vasista, and U. Bondhugula. PolyMage: Automatic Optimization for Image Processing Pipelines. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 429–443, 2015.
[19] S. Ontanon. The Combinatorial Multi-Armed Bandit Problem and its Application to Real-Time Strategy Games. Artificial Intelligence and Interactive Digital Entertainment Conference, 2013.
[20] C. E. Shannon. A Mathematical Theory of Communication. ACM SIGMOBILE Mobile Computing and Communications Review, 2001.
[21] O. Spillinger, D. Eliahu, A. Fox, and J. Demmel. Matrix Multiplication Algorithm Selection with Support Vector Machines. 2015.
[22] L. Tran-Thanh, S. Stein, A. Rogers, and N. R. Jennings. Efficient Crowdsourcing of Unknown Experts using Bounded Multi-Armed Bandits. Artificial Intelligence, 214:89–111, 2014.
[23] L. Van der Maaten and G. Hinton. Visualizing Data Using t-SNE. Journal of Machine Learning Research, 2008.
[24] Q. Yanghua, N. Kapre, H. Ng, and K. Teo. Improving Classification Accuracy of a Machine Learning Approach for FPGA Timing Closure. IEEE Symp. on Field Programmable Custom Computing Machines (FCCM), 2016.