
Noname manuscript No. (will be inserted by the editor)

Hyperparameter Optimization for Effort Estimation

Tianpei Xia · Rahul Krishna · Jianfeng Chen · George Mathew · Xipeng Shen · Tim Menzies

the date of receipt and acceptance should be inserted later

Abstract Software analytics has been widely used in software engineering for many tasks such as generating effort estimates for software projects. One of the "black arts" of software analytics is tuning the parameters controlling a data mining algorithm. Such hyperparameter optimization has been widely studied in other software analytics domains (e.g. defect prediction and text mining) but, so far, has not been extensively explored for effort estimation. Accordingly, this paper seeks simple, automatic, effective and fast methods for finding good tunings for automatic software effort estimation.

We introduce a hyperparameter optimization architecture called OIL (Optimized Inductive Learning). We test OIL on a wide range of hyperparameter optimizers using data from 945 software projects. After tuning, large improvements in effort estimation accuracy were observed (measured in terms of standardized accuracy).

Tianpei Xia, Department of Computer Science, North Carolina State University, Raleigh, NC, USA. E-mail: [email protected]

Rahul Krishna, Department of Computer Science, North Carolina State University, Raleigh, NC, USA. E-mail: [email protected]

Jianfeng Chen, Department of Computer Science, North Carolina State University, Raleigh, NC, USA. E-mail: [email protected]

George Mathew, Department of Computer Science, North Carolina State University, Raleigh, NC, USA. E-mail: [email protected]

Xipeng Shen, Department of Computer Science, North Carolina State University, Raleigh, NC, USA. E-mail: [email protected]

Tim Menzies, Department of Computer Science, North Carolina State University, Raleigh, NC, USA. E-mail: [email protected]

arXiv:1805.00336v3 [cs.SE] 9 Oct 2018


From those results, we recommend using regression trees (CART) tuned by differential evolution, combined with the default analogy-based estimator. This particular combination of learner and optimizer often achieves in a few hours what other optimizers need days to weeks of CPU time to accomplish.

An important part of this analysis is its reproducibility and refutability. All our scripts and data are on-line. It is hoped that this paper will prompt and enable much more research on better methods to tune software effort estimators.

Keywords Effort Estimation · Optimization · Evolutionary Algorithms

1 Introduction

This paper explores methods to improve algorithms for software effort estimation. This is needed since software effort estimates can be wildly inaccurate [32]. Effort estimates need to be accurate (if for no other reason) since many government organizations demand that the budgets allocated to large publicly funded projects be double-checked by some estimation model [51]. Non-algorithmic techniques that rely on human judgment [28] are much harder to audit or dispute (e.g., when the estimate is generated by a senior colleague but disputed by others).

Sarro et al. [63] assert that effort estimation is a critical activity for planning and monitoring software project development in order to deliver the product on time and within budget [10, 36, 76]. The competitiveness (and occasionally the survival) of software organizations depends on their ability to accurately predict the effort required for developing software systems; both over- or under-estimates can negatively affect the outcome of software projects [76, 46, 47, 70].

Hyperparameter optimizers tune the control parameters of a data mining algorithm. It is well established that classification tasks like software defect prediction or text classification are improved by such tuning [21, 75, 2, 1]. This paper investigates hyperparameter optimization using data from 945 projects. To the best of our knowledge, this study is the most extensive exploration of hyperparameter optimization and effort estimation yet undertaken.

We assess our results with respect to recent findings by Arcuri & Fraser [6]. They caution that, to transition hyperparameter optimizers to industry, such optimizers need to be fast:

A practitioner, that wants to use such tools, should not be required to run large tuning phases before being able to apply those tools [6].

Also, according to Arcuri & Fraser, optimizers must be useful:

At least in the context of test data generation, (tuning) does not seem easy to find good settings that significantly outperform "default" values. ... Using "default" values is a reasonable and justified choice, whereas parameter tuning is a long and expensive process that might or might not pay off [6].

Hence, to assess such optimization for effort estimation, we ask four questions.

RQ1: To address one concern raised by Arcuri & Fraser, we must first ask: is it best to just use "off-the-shelf" defaults? We will find that tuned learners provide better estimates than untuned learners. Hence, for effort estimation:


Lesson1: “off-the-shelf” defaults should be deprecated.

RQ2: Can tuning effort be avoided by replacing old defaults with new defaults? This checks if we can run tuning once (and once only) and then use those new defaults ever after. We will observe that effort estimation tunings differ extensively from data set to data set. Hence, for effort estimation:

Lesson2: Overall, there are no “best” default settings.

RQ3: The first two research questions tell us that we must retune our effort estimators whenever new data arrives. Accordingly, we must now address the other concern raised by Arcuri & Fraser, about CPU cost. Hence, in this question we ask: can we avoid slow hyperparameter optimization? The answer to RQ3 will be "yes" since our results show that for effort estimation:

Lesson3: Overall, our slowest optimizers perform no better than certain faster ones.

RQ4: The final question to answer is: what hyperparameter optimizers should be used for effort estimation? Here, we report that a certain combination of learners and optimizers usually produces the best results. Further, this particular combination often achieves in a few hours what other optimizers need days to weeks of CPU time to achieve. Hence, we will recommend the following combination for effort estimation:

Lesson4: For new data sets, try a combination of CART with the differential evolution optimizer and the default analogy-based estimator.

(Note: The italicized words are explained below.)

In summary, unlike the test case generation domains explored by Arcuri & Fraser, hyperparameter optimization for effort estimation is both useful and fast.

Overall, the contributions of this paper are:

– A demonstration that default settings are not the best way to perform effort estimation. Hence, when new data is encountered, some tuning process is required to learn the best settings for generating estimates from that data.

– A recognition of the inherent difficulty associated with effort estimation. Since there is no one universally best effort estimation method, commissioning a new effort estimator requires extensive testing. As shown below, this can take hours to days to weeks of CPU time.

– A new criterion for assessing effort estimators. Given the inherent cost of commissioning an effort estimator, it is now important to assess effort estimation methods not only on their predictive accuracy, but also on the time required to generate those estimates.

– The identification of a combination of learner and optimizer that works as well as anything else, and which takes just an hour or two to learn an effort estimator.

– An extensible open-source architecture called OIL that enables the commissioning of effort estimation methods. OIL makes our results repeatable and refutable.


The rest of this paper is structured as follows. The next section discusses different methods for effort estimation and how to optimize the parameters of effort estimation methods. This is followed by a description of our data, our experimental methods, and our results. After that, a discussion section explores open issues with this work.

From all of the above, we can conclude that (a) while Arcuri & Fraser's pessimism about hyperparameter optimization applies to their test data generation domain, there exist (b) other domains (e.g. effort estimation) where hyperparameter optimization is both useful and fast. Hence, we hope that OIL, and the results of this paper, will prompt and enable more research on methods to tune software effort estimators.

Note that OIL and all the data used in this study are freely available for download from https://github.com/ai-se/magic101.

2 About Effort Estimation

Software effort estimation is the process of predicting the most realistic amount of human effort (usually expressed in terms of hours, days or months of human work) required to plan, design and develop a software project, based on the information collected in previous related software projects. It is important to allocate resources properly in software projects to avoid waste. In some cases, inadequate or overfull funding can cause a considerable waste of resources and time. For example, NASA canceled its Check-out Launch Control System project after the initial $200M estimate was exceeded by another $200M [15]. As shown below, effort estimation can be categorized into (a) human-based and (b) algorithm-based methods [37, 66].

2.1 Human-based Methods

There are many human-based estimation methods [28], including methods for agile projects.

In agile development, effort estimation is often done by project teams applying techniques based on story points [13, 77]. Story points are a unit of measure for expressing an estimate of the overall effort that will be required to fully implement a piece of work. To estimate these story points, techniques such as planning poker are applied [45]. In planning poker, the team first chooses a user story to discuss. Then the developers individually choose a number of story points. Once all team members have chosen their estimates, the choices are disclosed. The developers who chose the lowest and the highest story points must justify their choices. This process repeats until a consensus is achieved and the agreed number of story points is assigned to the user story.

It should be noted that planning poker is used to incrementally assess effort for particular tasks in the scrum backlog. This is a different and simpler task than the initial estimation of software projects. This is an important issue since the initial budget allocation may require a significant amount of intra-organizational lobbying between groups with competing concerns (this is particularly true for larger projects). Since it can be challenging to change such initial budget allocations, it is important to get the initial estimate as accurate as possible.


For several reasons, this paper does not explore human-based estimation methods. Firstly, it is known that humans rarely update their human-based estimation knowledge based on feedback from new projects [30]. Secondly, algorithm-based methods are preferred when estimates have to be audited or debated (since the method is explicit and available for inspection). Thirdly, algorithm-based methods can be run many times (each time applying small mutations to the input data) to understand the range of possible estimates. Even very strong advocates of human-based methods [29] acknowledge that algorithm-based methods are useful for learning the uncertainty about particular estimates.

2.2 Algorithm-based Methods

There are many algorithmic estimation methods. Some, such as COCOMO [9], make assumptions about the attributes in the model. For example, COCOMO requires that data include 22 specific attributes such as analyst capability (acap) and software complexity (cplx). These attribute assumptions restrict how much data is available for studies like this paper. For example, here we explore 945 projects expressed using a wide range of attributes. If we used COCOMO, we could only have accessed an order of magnitude fewer projects.

Due to its attribute assumptions, this paper does not study COCOMO data. All the following learners can accept projects described using any attributes, just as long as one of those attributes is some measure of project development effort.

Whigham et al.'s ATLM method [79] is a multiple linear regression model which calculates the effort as effort = β0 + Σi βi × ai + εi, where the ai are explanatory attributes and the εi are errors with respect to the actual value. The prediction weights βi are determined using least squares estimation [57]. Additionally, transformations are applied to the attributes to further minimize the error of the model. In the case of categorical attributes, the standard approach of "dummy variables" [26] is applied. For continuous attributes, transformations such as logarithmic, square root, or no transformation are employed such that the skewness of the attribute is minimized. It should be noted that ATLM does not consider relatively complex techniques like using model residuals, box transformations or step-wise regression (which are standard) when developing a linear regression model. The authors make this decision since they intend ATLM to be a simple baseline model rather than the "best" model.
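For concreteness, here is a minimal sketch of an ATLM-style baseline, assuming pandas and scikit-learn are available (OIL's base layer uses Scikit-learn). This is our reading of the description above, not Whigham et al.'s implementation; the per-attribute transformation search is simplified to picking whichever of {none, log, sqrt} minimizes skewness.

```python
# A minimal sketch of an ATLM-style baseline (not the authors' implementation):
# multiple linear regression with dummy-coded categoricals and, per numeric
# attribute, whichever of {none, log, sqrt} minimizes skewness.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def least_skewed(col):
    """Return the transform of a non-negative numeric column with least |skewness|."""
    candidates = {"none": col,
                  "log": np.log1p(col),
                  "sqrt": np.sqrt(col)}
    return min(candidates.values(), key=lambda c: abs(pd.Series(c).skew()))

def fit_atlm_like(df, effort="Effort"):
    X = df.drop(columns=[effort]).copy()
    for c in X.columns:
        if X[c].dtype.kind in "if":               # numeric: reduce skewness
            X[c] = least_skewed(X[c].clip(lower=0))
    X = pd.get_dummies(X)                         # categoricals -> dummy variables
    model = LinearRegression().fit(X, df[effort]) # least-squares weights beta_i
    return model, list(X.columns)
```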

Sarro et al. proposed a method named Linear Programming for Effort Estimation (LP4EE) [62], which aims to achieve the best outcome from a mathematical model with a linear objective function subject to linear equality and inequality constraints. The feasible region is given by the intersection of the constraints, and the Simplex linear programming algorithm is able to find a point in that polyhedron where the function has the smallest error in polynomial time. For effort estimation, this model minimizes the Sum of Absolute Residuals (SAR). When a new project is presented to the model, LP4EE predicts the effort as effort = a1 × x1 + a2 × x2 + ... + an × xn, where xi is the value of a given project feature and ai is the corresponding coefficient found by linear programming. LP4EE is suggested as another baseline model for effort estimation since it provides similar or more accurate estimates than ATLM and is much less sensitive than ATLM to multiple data splits and different cross-validation methods.
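Minimizing the Sum of Absolute Residuals can be expressed as a standard linear program. The sketch below is an illustration of that idea (not Sarro et al.'s code), using scipy.optimize.linprog with auxiliary residual variables.

```python
# A sketch of an LP4EE-style estimator: fit coefficients a_i that minimize the
# sum of absolute residuals via linear programming, then predict
# effort = sum_i a_i * x_i.
import numpy as np
from scipy.optimize import linprog

def fit_lp4ee_like(X, y):
    m, p = X.shape
    # variables: p coefficients (free sign) followed by m residuals (>= 0)
    c = np.r_[np.zeros(p), np.ones(m)]           # minimize the sum of residuals
    A_ub = np.r_[np.c_[ X, -np.eye(m)],          #  X a - r <= y
                 np.c_[-X, -np.eye(m)]]          # -X a - r <= -y
    b_ub = np.r_[y, -y]
    bounds = [(None, None)] * p + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]                             # the coefficients a_i

def predict_lp4ee_like(coef, X):
    return X @ coef
```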


Table 1 CART's parameters.

Parameter          Default   Tuning Range   Notes
max feature        None      [0.01, 1]      The number of features to consider when looking for the best split.
max depth          None      [1, 12]        The maximum depth of the tree.
min sample split   2         [0, 20]        The minimum number of samples required to split an internal node.
min samples leaf   1         [1, 12]        The minimum number of samples required to be at a leaf node.


Another kind of algorithm-based estimation method is the regression tree, such as CART [42]. CART is a tree learner that divides a data set, then recurses on each split. If data contain more than min sample split rows, then a split is attempted. On the other hand, if a split contains no more than min samples leaf rows, then the recursion stops. CART finds the attributes whose ranges contain rows with the least variance in effort. If an attribute range ri is found in ni rows, each with an effort variance of vi, then CART seeks the attribute with a split that most minimizes Σi ( √vi × ni / Σi ni ). For more details on the CART parameters, see Table 1.
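The Table 1 tuning space maps directly onto scikit-learn's DecisionTreeRegressor (the CART implementation in OIL's base layer). A minimal sketch of sampling one configuration follows; note that scikit-learn requires min_samples_split ≥ 2 even though Table 1 lists [0, 20], and the function names here are ours.

```python
# Sample one CART configuration from the Table 1 tuning ranges.
import random
from sklearn.tree import DecisionTreeRegressor

def random_cart_config(rnd=random):
    return dict(
        max_features=rnd.uniform(0.01, 1.0),   # fraction of features per split
        max_depth=rnd.randint(1, 12),          # maximum tree depth
        min_samples_split=rnd.randint(2, 20),  # Table 1 lists [0, 20]; sklearn needs >= 2
        min_samples_leaf=rnd.randint(1, 12))   # samples required at a leaf

def build_cart(config):
    return DecisionTreeRegressor(**config)
```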

Yet another algorithm-based estimator is the analogy-based estimation (ABE) method advocated by Shepperd and Schofield [68]. ABE is widely used [59, 38, 27, 35, 51], in many forms. We say that "ABE0" is the standard form seen in the literature and "ABEN" denotes the 6,000+ variants of ABE defined below. The general form of ABE (which applies to ABE0 or ABEN) is to first form a table of rows of past projects. The columns of this table are composed of independent variables (the features that define projects) and one dependent feature (project effort). From this table, we learn which similar projects (analogies) to use from the training set when examining a new test instance. For each test instance, ABE then selects k analogies out of the training set. Analogies are selected via a similarity measure. Before calculating similarity, ABE normalizes numerics min..max to 0..1 (so all numerics get an equal chance to influence the dependent). Then, ABE uses feature weighting to reduce the influence of less informative features. Finally, some adaptation strategy is applied to return a combination of the dependent effort values seen in the k nearest analogies. For details on ABE0 and ABEN, see Figure 1 & Table 2.
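A minimal sketch of ABE0 as described above (uniform weights, min-max normalization, unweighted Euclidean distance, k = 1); the function name is ours.

```python
# ABE0: min-max normalize the numerics, use unweighted Euclidean distance,
# and return the effort of the single (k=1) nearest training project.
import numpy as np

def abe0(train_X, train_effort, test_x):
    lo, hi = train_X.min(axis=0), train_X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)             # avoid divide-by-zero
    norm = lambda X: (X - lo) / span                   # map each numeric to 0..1
    dists = np.sqrt(((norm(train_X) - norm(test_x)) ** 2).sum(axis=1))
    return train_effort[np.argmin(dists)]              # k=1, no further adaptation
```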

2.3 Effort Estimation and Hyperparameter Optimization

Note that we do not claim that the above represents all methods for effort estimation. Rather, we say that (a) all the above are either prominent in the literature or widely used; and (b) anyone with knowledge of the current effort estimation literature would be tempted to try some of the above.

While the above list is incomplete, it is certainly very long. Consider, for example, just the ABEN variants documented in Table 2. There are 2×8×3×6×4×6 = 6,912 such variants. Some can be ignored; e.g. at k = 1, adaptation mechanisms return the same result, so they are not necessary.


Table 2 Variations on analogy. Visualized in Figure 1.

– To measure similarity between x, y, ABE uses √( Σi wi (xi − yi)² ) where wi corresponds to feature weights applied to independent features. ABE0 uses a uniform weighting where wi = 1. ABE0's adaptation strategy is to return the effort of the nearest k = 1 item.
– Two ways to find training subsets: (a) Remove nothing: usually, effort estimators use all training projects [12]. Our ABE0 uses this variant; (b) Outlier methods: prune training projects with (say) suspiciously large values [34]. Typically, this removes a small percentage of the training data.
– Eight ways to do feature weighting: Li et al. [44] and Hall and Holmes [25] review 8 different feature weighting schemes.
– Three ways to discretize (summarize numeric ranges into a few bins): some feature weighting schemes require an initial discretization of continuous columns. There are many discretization policies in the literature, including: (1) equal frequency, (2) equal width, (3) do nothing.
– Six ways to choose similarity measurements: Mendes et al. [48] discuss three similarity measures, including the weighted Euclidean measure described above, an unweighted variant (where wi = 1), and a "maximum distance" measure that focuses on the single feature that maximizes inter-project distance. Frank et al. [20] use a triangular distribution that sets the weight to zero after the distance is more than "k" neighbors away from the test instance. A fifth and sixth similarity measure are the Minkowski distance measure used in [4] and the mean value of the ranking of each project feature used in [78].
– Four ways for adaptation mechanisms: (1) median effort value, (2) mean dependent value, (3) summarize the adaptations via a second learner (e.g., linear regression) [44, 49, 7, 60], (4) weighted mean [48].
– Six ways to select analogies: analogy selectors are fixed or dynamic [37]. Fixed methods use k ∈ {1, 2, 3, 4, 5} nearest neighbors while dynamic methods use the training set to find which 1 ≤ k ≤ N − 1 is best for N examples.

Also, not all feature weighting techniques use discretization. But even after those discards, there are still thousands of possibilities.

Given that the space to explore is so large, some researchers have offered automatic support for that exploration. Some of that prior work suffered from being applied to limited data [44] or from optimizing with algorithms that are not representative of the state-of-the-art in multi-objective optimization [44].

Another issue with prior work is that researchers use methods deprecated in the literature. For example, some use grid search [17, 71], which is a set of nested for-loops that iterate over a range of options. Grid search is slow and, due to the size of the increment in each loop, can miss important settings [8].

Some research also recommends tabu search (TS) for hyperparameter optimization in software effort estimation [14]. Tabu search, as proposed by Glover, aims to overcome some limitations of local search [24]. It is a meta-heuristic relying on adaptive memory and responsive exploration of the search space. However, to use TS, a good understanding of the problem structure is required, and domain-specific knowledge is needed for the selection of tabus and aspiration criteria. We had planned to include TS in this analysis, then realized that none of the earlier advocates of TS for effort estimation were still using it (e.g. Sarro, Corazza et al. [14] explored TS in their 2013 paper but elected not to use it in any of their subsequent experiments [63, 62]).

Other researchers assume that the effort model is a specific parametric form (e.g. the COCOMO equation) and propose mutation methods to adjust the parameters of


Fig. 1 OIL's feature model of the space of machine learning options for ABEN. In this model, SubsetSelection, Similarity, AdaptionMechanism and AnalogySelection are the mandatory features, while the FeatureWeighting and DiscretizationMethod features are optional. To avoid making the graph too complex, some cross-tree constraints are not presented.

that equation [3, 53, 69, 11, 61]. As mentioned above, this approach is hard to test since there are very few data sets using the pre-specified COCOMO attributes.

Further, all that prior work needs to be revisited given the existence of recent and very prominent methods; i.e. ATLM from TOSEM'15 [79] or LP4EE from TOSEM'18 [62].

Accordingly, this paper conducts a more thorough investigation of hyperparameter optimization for effort estimation.

– We use methods with no data feature assumptions (i.e. no COCOMO data);
– That vary many parameters (6,000+ combinations);
– That also test results on 9 different sources with data on 945 software projects;
– Which use optimizers representative of the state-of-the-art (NSGA-II [16], MOEA/D [80], DE [72]);
– And which benchmark results against prominent methods such as ATLM and LP4EE.

2.4 OIL

OIL is our architecture for exploring hyperparameter optimization for effort estimation. Initially, our plan was to use standard hyperparameter tuning for this task. Then we learned that (a) standard data mining toolkits like Scikit-learn [58] did not include many of the effort estimation techniques; and (b) standard hyperparameter tuners can be slow (Scikit-learn recommends a default runtime of 24 hours). Hence, we built OIL:

– At the base library layer, we use Scikit-learn [58].
– Above that, OIL has a utilities layer containing all the algorithms missing in Scikit-learn (e.g., ABEN required numerous additions at the utilities layer).
– Higher up, OIL's modelling layer uses an XML-based domain-specific language to specify a feature map of data mining options. These feature models are single-parent and-or graphs with (optional) cross-tree constraints showing what options require or exclude other options. A graphical representation of the feature model used in this paper is shown in Figure 1 (a small Python rendering of that option space is sketched after this list).
– Finally, at the top-most optimizer layer, there is some optimizer that makes decisions across the feature map. An automatic mapper facility then links those decisions down to the lower layers to run the selected algorithms.
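For illustration, the ABEN option space of Figure 1 and Table 2 can be rendered as a plain Python map. OIL itself expresses this as an XML feature model; the individual weighting-scheme names below are placeholders, since the eight schemes are only cited (not enumerated) above.

```python
# A Python rendering of the ABEN option space of Figure 1 / Table 2.
ABEN_SPACE = {
    "subset_selection":  ["remove_nothing", "outlier_pruning"],                        # 2
    "feature_weighting": ["uniform", "w2", "w3", "w4", "w5", "w6", "w7", "w8"],         # 8 (placeholder names)
    "discretization":    ["equal_frequency", "equal_width", "none"],                    # 3
    "similarity":        ["euclidean_weighted", "euclidean_unweighted", "max_distance",
                          "triangular", "minkowski", "feature_rank_mean"],              # 6
    "adaptation":        ["median", "mean", "second_learner", "weighted_mean"],         # 4
    "analogy_selection": ["k1", "k2", "k3", "k4", "k5", "dynamic"],                     # 6
}

def space_size(space):
    n = 1
    for options in space.values():
        n *= len(options)
    return n

assert space_size(ABEN_SPACE) == 2 * 8 * 3 * 6 * 4 * 6   # = 6912 variants
```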

2.5 Optimizers

Once OIL's layers were built, it was simple to "pop the top" and replace the top layer with another optimizer. Nair et al. [55] advise that for search-based SE studies, optimizers should be selected via a "dumb+two+next" rule. Here:

– "Dumb" is some baseline method;
– "Two" are some well-established optimizers;
– "Next" is a more recent method which may not have been applied before to this domain.

For our "dumb" optimizer, we used Random Choice (hereafter, RD). To find N valid configurations, RD selects leaves at random from Figure 1. All these N variants are executed and the best one is selected for application to the test set. To maintain parity with the DE2 and DE8 systems described below, OIL uses N ∈ {40, 160} (denoted RD40 and RD160).
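A sketch of the RD baseline under these assumptions; the evaluate function, which scores a configuration on the training data, is left abstract.

```python
# RD: sample N configurations at random from the feature map, evaluate each,
# keep the best one.
import random

def rd_search(space, evaluate, n=40, rnd=random):
    """space: dict of option lists; evaluate: config -> score (higher is better)."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(n):                      # n = 40 for RD40, 160 for RD160
        cfg = {key: rnd.choice(opts) for key, opts in space.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg
```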

Moving on, our "two" well-established optimizers are differential evolution (hereafter, DE [72]) and NSGA-II [16]. These have been used frequently in the SE literature [21, 2, 1, 65, 64]. NSGA-II is a standard genetic algorithm (for N generations: mutate, crossover, select the best candidates for the next generation) with a fast select operator. All candidates that dominated i other items are grouped together in "band" i. When selecting C candidates for the next generation, the top bands 1..i with M < C candidates are all selected. Next, using a near-linear time pruning operator, the i+1 band is pruned down to C − M items, all of which are selected.

The premise of DE is that the best way to mutate the existing tunings is to extrapolate between current solutions. Three solutions a, b, c are selected at random. For each tuning parameter k, at some probability cr, we replace the old tuning xk with yk. For booleans, yk = ¬xk and for numerics, yk = ak + f × (bk − ck) where f is a parameter controlling the differential weight. The main loop of DE runs over the population of size np, replacing old items with new candidates (if the new candidate is better). This means that, as the loop progresses, the population is full of increasingly more valuable solutions (which, in turn, helps extrapolation). As to the control parameters of DE, using advice from Storn [72], we set {np, f, cr} = {20, 0.75, 0.3}. The number of generations gen ∈ {2, 8} was set as follows. A small number (2) was used to test the effects of a very CPU-light effort estimator. A larger number (8) was used to check if anything was lost by restricting the inference to just two generations. These two versions are denoted DE2 and DE8.
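To make the loop concrete, here is a minimal sketch under the settings above (a candidate is treated as a plain list of numeric tuning values and `score` is any larger-is-better fitness function; this is an illustration, not OIL's exact implementation).

```python
# A sketch of the DE loop described above, for numeric tuning parameters
# (boolean parameters would be flipped instead of extrapolated).
import random

def de(init_pop, score, f=0.75, cr=0.3, generations=2, rnd=random):
    pop = [list(p) for p in init_pop]                  # np candidates, e.g. 20
    for _ in range(generations):                       # 2 for DE2, 8 for DE8
        for i, x in enumerate(pop):
            a, b, c = rnd.sample([p for j, p in enumerate(pop) if j != i], 3)
            y = [a[k] + f * (b[k] - c[k]) if rnd.random() < cr else x[k]
                 for k in range(len(x))]               # extrapolate at probability cr
            if score(y) > score(x):                    # keep improvements only
                pop[i] = y
    return max(pop, key=score)
```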

There are many other variants of DE besides the above DE2 and DE8; with different mutation and crossover strategies, different numbers of generations, and with or without an early stopping rule, the performance of DE can be very different. Our third DE method uses


the same parameters as DE2 and DE8 ({20, 0.75, 0.3}), but for the mutation strategy, a new equation is used to find the replacement yk:

yk = bestk + f × (ak − bk) + f × (ck − dk)

instead of the original yk = ak + f × (bk − ck). In this new equation, four mutually exclusive solutions a, b, c, d are randomly selected, and bestk comes from the solution with the best performance among the current solutions. For the crossover strategy, we use binomial crossover, which is performed on each of the decision variables whenever a randomly generated number between 0 and 1 is less than or equal to the crossover rate (cr). We call this DE method "DE NEW".
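The DE NEW mutation can be sketched as a drop-in replacement for the mutation step of the plain DE loop above (again an illustration; the names are ours).

```python
# DE NEW mutation: extrapolate around the best current candidate using four
# mutually exclusive solutions a, b, c, d, with binomial crossover at rate cr.
import random

def de_new_mutant(pop, i, score, f=0.75, cr=0.3, rnd=random):
    best = max(pop, key=score)
    a, b, c, d = rnd.sample([p for j, p in enumerate(pop) if j != i], 4)
    x = pop[i]
    return [best[k] + f * (a[k] - b[k]) + f * (c[k] - d[k])
            if rnd.random() <= cr else x[k]
            for k in range(len(x))]
```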

Table 3 Data in this study. For details on the features, see Table 4.

             Projects   Features
kemerer            15          6
albrecht           24          7
isbsg10            37         11
finnish            38          7
miyazaki           48          7
maxwell            62         25
desharnais         77          6
kitchenham        145          6
china             499         16
total             945

As to our "next" optimizer, we used MOEA/D [80]. This is a decomposition approach that runs several simultaneous problems at once, as follows. Prior to inference, all candidates are assigned random weights for all goals. Candidates with similar weights are said to be in the same "neighborhood". When any candidate finds a useful mutation, then this candidate's values are copied to all neighbors that are further away than the candidate from the best "Utopian point". Note that these neighborhoods can be pre-computed and cached prior to evolution. Hence, MOEA/D runs very quickly. MOEA/D has not been previously applied to effort estimation.

Of the optimizers we used, DE and RD have the fixed evaluation budgets described above. The other evolutionary treatments (NSGA-II, MOEA/D) were run until the results from new generations were no better than before.

3 Empirical Study

3.1 Data

To assess OIL, we applied it to the 945 projects seen in nine datasets from the SEACRAFT repository (http://tiny.cc/seacraft); see Table 3 and Table 4. This data was selected since it has been widely used in previous estimation research. Also, it is quite diverse since it differs in:

– Number of observations (from 15 to 499 projects).
– Number and type of features (from 6 to 25 features, including a variety of features describing the software projects, such as the number of developers involved in the project and their experience, technologies used, size in terms of Function Points, etc.). Note that some features of the original datasets are not used in our experiment because they are naturally irrelevant to the effort values. For example, the feature "ID" in the "china" dataset is just the order number of the projects; we drop these kinds of features in our experiments. Other cases are "Syear" in "maxwell" (which denotes "start year"), "ID" in "miyazaki/kemerer/china" and "defects/months" in "nasa93/coc81".


Table 4 Descriptive Statistics of the Datasets

  feature       min     max    mean     std

kemerer
  Duration        5      31    14.3     7.5
  KSLOC          39     450   186.6   136.8
  AdjFP         100    2307   999.1   589.6
  RAWFP          97    2284   993.9   597.4
  Effort         23    1107   219.2   263.1

albrecht
  Input           7     193    40.2    36.9
  Output         12     150    47.2    35.2
  Inquiry         0      75    16.9    19.3
  File            3      60    17.4    15.5
  FPAdj           1       1     1.0     0.1
  RawFPs        190    1902   638.5   452.7
  AdjFP         199    1902   647.6   488.0
  Effort          0     105    21.9    28.4

isbsg10
  UFP             1       2     1.2     0.4
  IS              1      10     3.2     3.0
  DP              1       5     2.6     1.1
  LT              1       3     1.6     0.8
  PPL             1      14     5.1     4.1
  CA              1       2     1.1     0.3
  FS             44    1371   343.8   304.2
  RS              1       4     1.7     0.9
  FPS             1       5     3.5     0.7
  Effort         87   14453    2959    3518

finnish
  hw              1       3     1.3     0.6
  at              1       5     2.2     1.5
  FP             65    1814   763.6   510.8
  co              2      10     6.3     2.7
  prod            1      29    10.1     7.1
  lnsize          4       8     6.4     0.8
  lneff           6      10     8.4     1.2
  Effort        460   26670    7678    7135

china
  AFP             9   17518   486.9    1059
  Input           0    9404   167.1   486.3
  Output          0    2455   113.6   221.3
  Enquiry         0     952    61.6   105.4
  File            0    2955    91.2   210.3
  Interface       0    1572    24.2    85.0
  Added           0   13580   360.4   829.8
  Changed         0    5193    85.1   290.9
  Deleted         0    2657    12.4   124.2
  PDR A           0      84    11.8    12.1
  PDR U           0      97    12.1    12.8
  NPDR A          0     101    13.3    14.0
  NPDU U          0     108    13.6    14.8
  Resource        1       4     1.5     0.8
  Dev.Type        0       0     0.0     0.0
  Duration        1      84     8.7     7.3
  Effort         26   54620    3921    6481

miyazaki
  KLOC            7     390    63.4    71.9
  SCRN            0     150    28.4    30.4
  FORM            0      76    20.9    18.1
  FILE            2     100    27.7    20.4
  ESCRN           0    2113   473.0   514.3
  EFORM           0    1566   447.1   389.6
  EFILE          57    3800   936.6   709.4
  Effort          6     340    55.6    60.1

maxwell
  App             1       5     2.4     1.0
  Har             1       5     2.6     1.0
  Dba             0       4     1.0     0.4
  Ifc             1       2     1.9     0.2
  Source          1       2     1.9     0.3
  Telon.          0       1     0.2     0.4
  Nlan            1       4     2.5     1.0
  T01             1       5     3.0     1.0
  T02             1       5     3.0     0.7
  T03             2       5     3.0     0.9
  T04             2       5     3.2     0.7
  T05             1       5     3.0     0.7
  T06             1       4     2.9     0.7
  T07             1       5     3.2     0.9
  T08             2       5     3.8     1.0
  T09             2       5     4.1     0.7
  T10             2       5     3.6     0.9
  T11             2       5     3.4     1.0
  T12             2       5     3.8     0.7
  T13             1       5     3.1     1.0
  T14             1       5     3.3     1.0
  T15             1       5     3.3     0.7
  Dura.           4      54    17.2    10.7
  Size           48    3643   673.3   784.1
  Time            1       9     5.6     2.1
  Effort        583   63694    8223   10500

desharnais
  TeamExp         0       4     2.3     1.3
  MngExp          0       7     2.6     1.5
  Length          1      36    11.3     6.8
  Trans.s         9     886   177.5   146.1
  Entities        7     387   120.5    86.1
  AdjPts         73    1127   298.0   182.3
  Effort        546   23940    4834    4188

kitchenham
  code            1       6     2.1     0.9
  type            0       6     2.4     0.9
  duration       37     946   206.4   134.1
  fun pts        15   18137   527.7    1522
  estimate      121   79870    2856    6789
  esti mtd        1       5     2.5     0.9
  Effort        219  113930    3113    9598


– Technical characteristics (software projects developed in different programming languages and for different application domains, ranging from telecommunications to commercial information systems).

– Geographical locations (software projects coming from Canada, China, Finland).

3.2 Cross-Validation

Each data set was treated in a variety of ways. Each treatment is an M*N-way cross-validation test of some learner, or of some learner and optimizer. That is, M times, shuffle the data randomly (using a different random number seed) then divide the data into N bins. For i ∈ N, bin i is used to test a model built from the other bins. Following the advice of Nair et al. [55], for the smaller data sets (with 40 rows or less), we use N = 3 bins while for the others, we use N = 10 bins.
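A sketch of this M*N rig (the number of repeats M is a parameter of the study, and the seed handling here is illustrative).

```python
# M*N-way cross-validation: M random shuffles, each split into N bins
# (N=3 when a data set has 40 rows or fewer, else N=10).
import random

def mxn_splits(n_rows, m, seed0=0):
    n_bins = 3 if n_rows <= 40 else 10
    for m_i in range(m):
        rows = list(range(n_rows))
        random.Random(seed0 + m_i).shuffle(rows)   # a different seed per repeat
        for i in range(n_bins):
            test = rows[i::n_bins]                 # bin i is the test set
            train = [r for r in rows if r not in test]
            yield train, test
```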


Table 5 Performance scores: Standardized Accuracy

SA is defined in terms of

MAE = (1/N) × ΣNi=1 |REi − EEi|

where N is the number of projects used for evaluating the performance, and REi and EEi are the actual and estimated effort, respectively, for project i. SA uses MAE as follows:

SA = (1 − MAEPj / MAErguess) × 100

where MAEPj is the MAE of the approach Pj being evaluated and MAErguess is the MAE of a large number (e.g., 1000 runs) of random guesses. Over many runs, MAErguess will converge on simply using the sample mean [67]. That is, SA represents how much better Pj is than random guessing. Values near zero mean that the prediction model Pj is practically useless, performing little better than random guesses [67].
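A direct transcription of Table 5 into code; the random-guessing procedure follows the description above, with runs=1000 as in the example.

```python
# Standardized Accuracy (SA): MAE of the estimator versus the MAE of many
# random guesses drawn from the training efforts.
import random

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def sa(actual, predicted, train_effort, runs=1000, rnd=random):
    # MAE_rguess: predict each project with the effort of a random training project
    guesses = [mae(actual, [rnd.choice(train_effort) for _ in actual])
               for _ in range(runs)]
    mae_rguess = sum(guesses) / runs     # converges on using the sample mean [67]
    return (1 - mae(actual, predicted) / mae_rguess) * 100
```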

As a procedural detail, first we divided the data and then we applied the treatments. That is, all treatments saw the same training and test data.

3.3 Scoring Metrics

The results from each effort estimator can be scored in many ways, including the Standardized Accuracy (SA) measure defined in Table 5. We report results in terms of SA since this measure has been adopted in recent high-profile publications [41, 67]. Note that for the SA evaluation measure, larger values are better.

From the cross-validations, we report the median (termed med), which is the 50th percentile of the test scores seen in the M*N results. Also reported is the inter-quartile range (termed IQR), which is the (75-25)th percentile. The IQR is a non-parametric description of the variability about the median value.

For each data set, the results from an M*N-way cross-validation are sorted by their median value, then ranked using the Scott-Knott test recommended for ranking effort estimation experiments by Mittas et al. in TSE'13 [52]. For full details on the Scott-Knott test, see Table 6. In summary, Scott-Knott is a top-down bi-clustering method that recursively divides sorted treatments. Division stops when there is only one treatment left or when a division of numerous treatments generates splits that are statistically indistinguishable. To judge when two sets of treatments are indistinguishable, we use a conjunction of both a 95% bootstrap significance test [18] and an A12 test for a non-small effect size difference in the distributions [51]. These tests were used since their non-parametric nature avoids issues with non-Gaussian distributions.

Table 7 shows an example of the report generated by our Scott-Knott procedure. Note that when multiple treatments tie for Rank=1, we use the treatments' runtimes to break the tie. Specifically, for all treatments in Rank=1, we mark the faster ones as Rank=1*.


Table 6 Explanation of Scott-Knott test.

This study ranks methods using the Scott-Knott procedure recommended by Mittas & Angelis in their 2013 IEEE TSE paper [52]. This method sorts a list of l treatments with ls measurements by their median score. It then splits l into sub-lists m, n in order to maximize the expected value of differences in the observed performances before and after divisions. For example, we could sort ls = 4 methods based on their median score, then divide them into sub-lists of size ms, ns ∈ {(1, 3), (2, 2), (3, 1)}. Scott-Knott would declare one of these divisions to be "best" as follows. For lists l, m, n of size ls, ms, ns where l = m ∪ n, the "best" division maximizes E(∆); i.e. the difference in the expected mean value before and after the split:

E(∆) = (ms/ls) × abs(m.µ − l.µ)² + (ns/ls) × abs(n.µ − l.µ)²

Scott-Knott then checks if that "best" division is actually useful. To implement that check, Scott-Knott applies some statistical hypothesis test H to check if m, n are significantly different. If so, Scott-Knott then recurses on each half of the "best" division.
For a more specific example, consider the results from l = 5 treatments:

rx1 = [0.34, 0.49, 0.51, 0.6]
rx2 = [0.6, 0.7, 0.8, 0.9]
rx3 = [0.15, 0.25, 0.4, 0.35]
rx4 = [0.6, 0.7, 0.8, 0.9]
rx5 = [0.1, 0.2, 0.3, 0.4]

After sorting and division, Scott-Knott declares:

– Ranked #1 is rx5 with median = 0.25
– Ranked #1 is rx3 with median = 0.3
– Ranked #2 is rx1 with median = 0.5
– Ranked #3 is rx2 with median = 0.75
– Ranked #3 is rx4 with median = 0.75

Note that Scott-Knott found little difference between rx5 and rx3. Hence, they have the same rank, even though their medians differ.
Scott-Knott is preferred to, say, an all-pairs hypothesis test of all methods; e.g. six treatments can be compared (6² − 6)/2 = 15 ways. A 95% confidence test run for each comparison has a very low total confidence: 0.95¹⁵ = 46%. To avoid an all-pairs comparison, Scott-Knott only calls on hypothesis tests after it has found splits that maximize the performance differences.
For this study, our hypothesis test H was a conjunction of the A12 effect size test and non-parametric bootstrap sampling; i.e. our Scott-Knott divided the data if both bootstrapping and an effect size test agreed that the division was statistically significant (95% confidence) and not a "small" effect (A12 ≥ 0.6).
For a justification of the use of non-parametric bootstrapping, see Efron & Tibshirani [18, p220-223]. For a justification of the use of effect size tests see Shepperd & MacDonell [67]; Kampenes [31]; and Kocaguneli et al. [33]. These researchers warn that even if a hypothesis test declares two populations to be "significantly" different, that result is misleading if the "effect size" is very small. Hence, to assess the performance differences we first must rule out small effects. Vargha and Delaney's non-parametric A12 effect size test explores two lists M and N of size m and n:

A12 = ( Σx∈M,y∈N { 1 if x > y; 0.5 if x == y; else 0 } ) / (mn)

This expression computes the probability that numbers in one sample are bigger than in another. This test was recently endorsed by Arcuri and Briand at ICSE'11 [5].
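The two statistics used inside this Scott-Knott procedure can be sketched directly from the formulas in Table 6.

```python
# A12 effect size and the E(delta) value that scores a candidate split of the
# sorted treatments (m, n are lists of performance scores; l = m + n).
def a12(xs, ys):
    """Probability that a value drawn from xs is bigger than one from ys."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    return wins / (len(xs) * len(ys))

def e_delta(m, n):
    """Expected difference in means if the sorted list l = m + n is split here."""
    l = m + n
    mu = lambda xs: sum(xs) / len(xs)
    return (len(m) / len(l)) * abs(mu(m) - mu(l)) ** 2 + \
           (len(n) / len(l)) * abs(mu(n) - mu(l)) ** 2
```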


Table 7 Example of Scott-Knott results: SA scores seen in the China data set, sorted by their median value. Here, larger values are better. Med is the 50th percentile and IQR is the inter-quartile range; i.e., 75th-25th percentile. Lines with a dot in the middle show median values with the IQR. For the Ranks, smaller values are better. Ranks are computed via the Scott-Knott procedure from TSE'13 [52]. Rows with the same rank are statistically indistinguishable. 1* denotes rows of the fastest best-ranked treatments.

Standardized Accuracy
Rank  China          Med  IQR
1*    CART DE8        93    2
1*    CART DE2        93    2
2     CART RD160      90    4
2     ABEN NSGA2      90    5
2     CART DE NEW     87    9
2     CART RD40       86   12
2     CART0           85    7
2     ABEN DE8        85   13
2     ABEN RD160      83   10
2     ABEN DE2        82   15
2     ABEN RD40       80   12
3     ABE0            61   11
4     CART MOEAD      54   19
4     LP4EE           51   13
4     ATLM            48    7
4     CART NSGA2      44   14

3.4 Terminology for Optimizers

Some treatments are named "X Y", which denotes learner "X" tuned by optimizer "Y". In the following:

X ∈ {CART, ABE}
Y ∈ {DE2, DE8, DE NEW, MOEA/D, NSGA2, RD40, RD160}

Note that we do not tune ATLM and LP4EE since they were designed to be used "off-the-shelf". Whigham et al. [79] declare that one of ATLM's most important features is that it does not need tuning.

4 Results

4.1 Observations

Table 8 shows the runtimes (in minutes) for one of our 30 N*M experiments for each dataset. From the last column of that table, we see that the median to maximum runtimes per dataset are:

– 139 to 680 minutes, for one way;
– Hence, 70 to 340 hours for the 30 repeats of our N*M experiments.

Performance scores for all data sets are shown in Table 9. For space reasons, all the slower treatments (as defined in Table 8) that were not ranked first have been deleted.


Table 8 Average runtime (in minutes) for one way out of an N*M cross-validation experiment. Executing on a 2GHz processor, with 8GB RAM, running Windows 10. Note that LP4EE and ATLM have no tuning results since the authors of these methods stress that it is advantageous to use their baseline methods without any tuning. The last column reports totals for each dataset. Columns, left to right: ABE0, CART0, ATLM, LP4EE, CART RD40, CART RD160, CART DE2, CART DE8, CART DE NEW, CART MOEAD (the faster treatments); then CART NSGA2, ABEN RD40, ABEN RD160, ABEN DE2, ABEN DE8, ABEN NSGA2 (the slower treatments).

kemerer      <1  <1  <1  <1  <1  <1  <1  <1  <1  <1    5    2    5    2    4   17 |  45
albrecht     <1  <1  <1  <1  <1  <1  <1  <1  <1  <1    5    2    8    3    6   24 |  58
finnish      <1  <1  <1  <1  <1  <1  <1  <1  <1  <1    5    3   10    4    8   30 |  70
miyazaki     <1  <1  <1  <1  <1  <1  <1  <1  <1  <1    5    4   14    5   10   37 |  85
desharnais   <1  <1  <1  <1  <1  <1  <1  <1  <1  <1    4    8   19   12   22   64 | 139
isbsg10      <1  <1  <1  <1  <1  <1  <1  <1  <1  <1    5    4   11    4   95   33 | 162
maxwell      <1  <1  <1  <1  <1  <1  <1  <1  <1  <1    6   13   38   16   29  111 | 223
kitchenham   <1  <1  <1  <1  <1  <1  <1  <1  <1  <1    6   15   32   26   48  126 | 263
china        <1  <1  <1  <1  <1  <1  <1  <1  <1  <1    6   36   86   82  131  329 | 680
total         3   4   4   4   4   5   5   5   5   6   47   87  223  154  267  771 |

Rationale: such sub-optimal and slower treatments need not be discussed further. Please see https://ibb.co/ftCOD9 for a display of all the results.

In Table 9, we observe that ATLM and LP4EE performed as expected. Whigham et al. [79] and Sarro et al. [62] designed these methods to serve as baselines against which other treatments can be compared. Hence, it might be expected that these methods will perform comparatively below other methods. This was certainly the case here: as seen in Table 9, these baseline methods are top-ranked in only 3/9 datasets.

Another thing to observe in Table 9 is that random search (RD) performed as expected; i.e. it was never top-ranked. This is a gratifying result since, had random search performed otherwise, that would tend to negate the value of hyperparameter optimization.

Another interesting result is that traditional estimation-by-analogy might also be termed a baseline method. Note that ABE0 scores well in 3/9 data sets; i.e. just as often as LP4EE and ATLM.

4.2 Answers to Research Questions

Finally, we turn to the research questions listed in the introduction.

RQ1: Is it best just to use the "off-the-shelf" defaults?

As mentioned in the introduction, Arcuri & Fraser note that for test case generation, using the default settings can work just as well as anything else. We can see some evidence of this effect in Table 9. Observe, for example, the isbsg10 results where the untuned ABE0 treatment achieves Rank=1*.

However, overall, Table 9 is negative on the use of default settings. For example, in the datasets china/finnish/miyazaki, not even one treatment that uses the defaults is found at Rank=1*.


Table 9 % SA results from our cross-validation studies. Same format as Table 7. Note that:
– Larger SA values are better.
– The rows marked Rank=1* (gray in the original) are the results recommended for each data set.
– "out-of-range" denotes results that are so bad that they fall outside of the 0%..100% range shown here.
– Not shown here are any Rank > 1 methods that are listed as slower in Table 8. Rationale: (a) space reasons and (b) such sub-optimal and slower treatments need not be discussed further.
– For all results, see https://ibb.co/ftCOD9.

Rank  Using          Med  IQR

china
1*    CART DE8        93    2
1*    CART DE2        93    2
2     CART RD160      90    4
2     CART DE NEW     87    9
2     CART RD40       86   12
2     CART0           85    7
3     ABE0            61   11
4     CART MOEAD      54   19
4     LP4EE           51   13
4     ATLM            48    7

albrecht
1*    CART MOEAD      65   21
1     ABEN DE2        63   11
1     ABEN DE8        62   18
1     ABEN RD160      60   18
1*    ABE0            60   16
2     CART DE NEW     54   32
2     CART DE8        53   19
2     CART RD160      52   43
2     CART DE2        50   52
2     CART RD40       49   47
2     CART0           45   40
3     ATLM            18   30
3     LP4EE           17   34

desharnais
1*    CART DE NEW     44   24
1     ABEN DE8        44   24
1     ABEN NSGA2      43   27
1*    ATLM            42   13
1     ABEN RD40       42   27
1*    CART DE8        42   27
1     ABEN DE2        41   28
1*    CART DE2        40   26
1*    LP4EE           40   24
1     ABEN RD160      40   31
2     CART MOEAD      35   31
2     ABE0            30   42
2     CART RD160      25   16
2     CART RD40       23   19
2     CART0           23   18

finnish
1*    CART DE NEW     86   13
2     CART DE8        81   14
2     CART DE2        74   10
2     CART RD160      74   16
2     CART RD40       71   12
2     CART MOEAD      71   11
2     CART0           69   19
3     ABE0            49   22
4     ATLM            40   49
4     LP4EE           36   41

kemerer
1*    CART MOEAD      47   47
1     ABEN DE8        41   37
1     ABEN DE2        39   30
1*    ABE0            37   41
1     ABEN RD160      37   46
1     ABEN RD40       36   18
2     CART DE NEW     31   72
2     CART RD160      30   85
2     CART DE2        30   82
2     CART DE8        26  102
2     CART RD40       24   77
3     LP4EE           14   28
3     CART0           12   48
4     ATLM           -48  507   out-of-range

maxwell
1*    CART MOEAD      60   33
1*    CART DE NEW     58   31
1*    LP4EE           57   23
1*    CART DE2        56   28
1*    CART DE8        56   31
2     CART RD160      43   36
2     CART RD40       41   33
2     ABE0            39   37
2     ATLM            33   21
3     CART0           16   21

miyazaki
1*    CART MOEAD      51   31
1*    CART DE NEW     49   31
1     ABEN NSGA2      49   34
2     CART DE8        45   36
3     CART RD160      42   36
3     CART RD40       42   33
3     LP4EE           42   34
3     CART DE2        41   36
3     ABE0            40   35
4     ATLM            23   56
5     CART0           10   21

isbsg10
1*    CART MOEAD      38   11
1     CART NSGA2      33   21
1     ABEN RD40       33   30
1     ABEN DE2        31   28
1     ABEN RD160      30   49
1     ABEN DE8        28   57
1*    ATLM            28   19
1*    ABE0            28   29
1     ABEN NSGA2      26   33
2     CART DE NEW     22   26
2     CART DE2        19   40
2     LP4EE           17   37
2     CART DE8        16   20
2     CART RD160      16   21
2     CART RD40       15   22
2     CART0           15   24

kitchenham
1*    LP4EE           81   10
1*    ATLM            79    8
1*    CART DE NEW     79   14
1*    CART DE2        77   12
1*    CART DE8        76   12
2     CART RD160      73   11
2     CART RD40       71   13
2     ABE0            64   14
3     CART MOEAD      58   22
3     CART0           56   15


Overall, if we always used just one of the methods using defaults (LP4EE, ATLM, ABE0), then that would achieve best ranks in only 3/9 datasets (see RQ4 for a discussion of what happens if we use all of the default methods).

Another aspect to note in the Table 9 results is the large difference in performance scores between the best and worst treatments (exceptions: the desharnais and isbsg10 SA scores do not vary much). That is, there is much to be gained by using the Rank=1* treatments and deprecating the rest.

In summary, using the defaults is recommended only for a minority of the datasets. Also, in terms of better test scores, there is much to be gained from tuning. Hence:

Lesson1: “Off-the-shelf” defaults should be deprecated.

RQ2: Can we replace the old defaults with new defaults?

If the hyperparameter tunings found by this paper were nearly always the same, then this study could conclude by recommending better values for the default settings. This would be a most convenient result since, in future, when new data arrives, the complexities of this study would not be needed.

Unfortunately, this turns out not to be the case. Table 10 shows the percent frequencies with which some tuning decision appears in our M*N-way cross-validations (this table uses results from DE8 tuning CART since, as shown below, this usually leads to the best results). Note that in those results it is not true that across most datasets there is a setting that is usually selected (though min samples leaf less than 3 is often a popular setting). Accordingly, we say that Table 10 shows that there is much variation in the best tunings. Hence, for effort estimation:

Table 10 Tunings discovered by hyperparameter selection (CART+DE8). Table rows are sorted by the number of rows in the data sets (smallest on top). Cells show the percent of times a particular choice was made; in the original, white text on black denotes choices made in more than 50% of tunings. Columns, left to right: max features of 25%, 50%, 75%, 100% of the features (selected at random; 100% means "use all"); max depth (of trees) ≤03, ≤06, ≤09, ≤12; min sample split (continuation criterion) ≤5, ≤10, ≤15, ≤20; min samples leaf (termination criterion) ≤03, ≤06, ≤09, ≤12.

kemerer      18 32 23 27 | 57 37 05 00 | 95 02 03 00 | 92 02 05 02
albrecht     13 23 20 43 | 63 28 08 00 | 68 32 00 00 | 83 15 02 00
isbsg10      12 35 28 25 | 57 33 08 00 | 47 23 15 15 | 60 27 10 03
finnish      07 03 27 63 | 32 56 12 00 | 73 18 05 03 | 78 17 05 00
miyazaki     10 22 27 40 | 31 46 20 03 | 42 24 18 16 | 78 13 07 02
maxwell      04 16 40 40 | 18 60 20 02 | 44 27 17 12 | 50 33 14 04
desharnais   25 23 27 25 | 40 46 11 02 | 36 26 13 25 | 32 26 24 19
kitchenham   01 12 32 56 | 03 42 45 10 | 43 30 17 10 | 48 35 12 04
china        00 04 25 71 | 00 00 25 75 | 56 30 10 02 | 68 28 04 00


Lesson2: Overall, there are no “best” default settings.

Before going on, one curious aspect of the Table 10 results is the max features column: it was rarely most useful to use all features. Except for finnish and china, the best results were often obtained after discarding (at random) a quarter to three-quarters of the features. This is a clear indication that, in future work, it might be advantageous to explore more feature selection for CART models.

RQ3: Can we avoid slow hyperparameter optimization?

"Kilo-optimizers" such as NSGA-II examine 10³ candidates or more since they explore population sizes of 10² for many generations. Hence, as shown in Table 8, they can be very slow.

Is it possible to avoid such slow runtimes? There are many heuristic methods for speeding up kilo-optimization:

– Active learners explore a few most informative candidates [39];
– Decomposition learners like MOEA/D convert large objective problems into multiple smaller problems;
– Other optimizers simply explore fewer candidates (DE & RD).

Kilo-optimization is only necessary when its exploration of more candidates leads to better solutions than heuristic exploration. Such better solutions from kilo-optimization are rarely found in Table 9 (only in 3/9 cases). Further, the size of the improvements seen with kilo-optimizers over the best Rank=2 treatments is very small. Those improvements come at significant runtime cost (in Table 8, the kilo-optimizers are one to two orders of magnitude slower than other methods). Hence we say that for effort estimation:

Lesson3: Overall, our slowest optimizers perform no better than certain faster ones.

RQ4: What hyperparameter optimizers should be used for effort estimation?

When we discuss this work with our industrial colleagues, they want to know "the bottom line"; i.e. what they should use or, at the very least, what they should not use. This section offers that advice. We stress that this section is based on the above results so, clearly, these recommendations would need to be revised whenever new results come to hand.

Based on the above, we can assert that using all the estimators mentioned above is contraindicated:

– For one thing, many of them never appear in our top-ranked results.
– For another thing, testing all of them on new data sets would be needlessly expensive. Recall our rig: 30 repeats over the data, where each of those repeats includes the very slow estimators shown in Table 8. As seen in that table, the median to maximum runtimes for such an analysis of a single dataset would take 70 to 340 hours (i.e. days to weeks).

Similarly, using just one of the methods is also contraindicated. The following lists the best that can be expected if an engineer chooses just one of our estimators and applies it to all our data sets. The fractions shown at left come from counting optimizer frequencies in the top ranks of Table 9:

Page 19: Hyperparameter Optimization for Effort Estimation · 2020. 4. 1. · In summary, unlike the test case generation domains explored by Acuri & Fraser, hyperparamter optimization for

Hyperparameter Optimization for Effort Estimation 19

3/9 : A = {LP4EE, ATLM, ABE0}
4/9 : B = {CART DE2, CART DE8}
5/9 : C = {CART DE NEW, CART MOEAD}

Note that:

– None of these best solo estimators are the untuned baseline methods (ATLM, LP4EE). Hence, we cannot endorse their use for generating estimates to be shown to business managers. That said, we do still endorse their use as baseline methods, for methodological reasons, in effort estimation research (they are useful for generating a quick result against which we can compare other, better, methods).

– These best solo methods barely cover half our data sets. Hence, below, we willrecommend combinations of a minimal number of estimators.

Since we cannot endorse the use of all of our estimators, and we cannot endorse the use of just one, we now turn to discussing what minimal combination of estimators offers the most value. Suppose we allow ourselves just two estimators. From Table 9, we can count what pairs of estimators appear in Rank=1* for most data sets. These are:

8/9 : D = {CART DE2 + CART MOEAD}
8/9 : E = {CART DE8 + CART MOEAD}
8/9 : F = {CART MOEAD + CART DE NEW}
8/9 : G = {ABE0 + CART DE NEW}

Also, if we allow ourselves just three estimators, then at most, these would appear in Rank=1* results the following number of times:

9/9 : H = {ABE0 + CART DE NEW + CART DE2}
9/9 : I = {ABE0 + CART DE NEW + CART DE8}
9/9 : J = {CART MOEAD + CART DE NEW + CART DE2}
9/9 : K = {CART MOEAD + CART DE NEW + CART DE8}

Note that, to simplify the implementation of our estimators, we might want to avoid MOEA/D. Rationale: the same level of results (9/9) can be achieved without adding a third style of optimizer (where ABE0 is one “style”, DE is another, and MOEA/D is a third). So the best triple combinations with the simplest implementations are the sets {H, I}:

9/9 : H = {ABE0 + CART DE NEW + CART DE2}
9/9 : I = {ABE0 + CART DE NEW + CART DE8}

(To reinforce a point made above, we note that the baseline methods (LP4EE and ATLM) are never the “best”, if “best” means generating top-ranked results in most datasets.)
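For readers who wish to repeat this kind of bookkeeping on their own results, the following small Python sketch shows the counting procedure: given, for each data set, the set of estimators that reached Rank=1*, it reports the pair and triple with the widest coverage. The rank table shown is a hypothetical stand-in, not the contents of Table 9.

from itertools import combinations

rank1 = {                                     # data set -> Rank=1* estimators
    "d1": {"ABE0", "CART_DE_NEW"},
    "d2": {"CART_DE2", "CART_MOEAD"},
    "d3": {"ABE0", "CART_DE8"},
}   # ...one entry per data set (hypothetical values)
estimators = set().union(*rank1.values())

def coverage(combo):
    """Count the data sets where at least one member of `combo` is top ranked."""
    return sum(1 for tops in rank1.values() if set(combo) & tops)

for size in (2, 3):
    best = max(combinations(sorted(estimators), size), key=coverage)
    print(size, best, "%d/%d" % (coverage(best), len(rank1)))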


Applying the same “simplify the implementation” approach to our two-estimator results, we might select the set G = ABE0 + CART DE NEW as our preferred simplest pair. Note that, on our machines,

– the fastest preferred pair G = ABE0 + CART DE NEW takes 1.5 hours to complete a 10x3-way study;

– the fastest preferred triple H = ABE0 + CART DE NEW + CART DE2 takes 2.5 hours.

We see here that the fastest preferred triple H includes the fastest preferred pair G. Hence, we recommend the H triple:

Lesson4: For new data sets, try a combination of CART tuned by differential evolution, plus the default analogy-based estimator (ABE0).
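To illustrate what the ABE0 part of this recommendation involves, here is a minimal sketch of a default analogy-based estimate: min-max normalize the features, find the k nearest past projects by Euclidean distance, and return the median of their efforts. The exact ABE0 defaults (value of k, distance, and adaptation mechanism) are those defined earlier in this paper; the k=3 default and toy data below are illustrative assumptions only.

import numpy as np

def abe0_estimate(past_X, past_y, new_x, k=3):
    # Min-max normalize features using the ranges seen in the past projects.
    past_X = np.asarray(past_X, dtype=float)
    new_x = np.asarray(new_x, dtype=float)
    lo, hi = past_X.min(axis=0), past_X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)        # guard against zero ranges
    dist = np.linalg.norm((past_X - lo) / span - (new_x - lo) / span, axis=1)
    nearest = np.argsort(dist)[:k]                # k closest analogies
    return float(np.median(np.asarray(past_y, dtype=float)[nearest]))

# Toy usage: three past projects (features, effort) and one new project.
past_X = [[10, 2, 100], [40, 5, 300], [12, 3, 120]]
past_y = [200, 900, 260]
print(abe0_estimate(past_X, past_y, [11, 2, 110], k=2))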

5 Threats to Validity

Internal Bias: Many of our methods contain stochastic operators. To reduce the bias from such random operators, we repeated our experiments 20 times and applied statistical tests to remove spurious distinctions.

Parameter Bias: For other studies, this is a significant question since (as shown above) the settings of the control parameters of the learners can have a dramatic effect on the efficacy of the estimation. That said, recall that much of the technology of this paper concerns methods to explore the space of possible parameters. Hence we assert that this study suffers much less parameter bias than other studies.

Sampling Bias: While we tested OIL on nine datasets, it would be inappropriate to conclude that OIL tuning always performs better than other methods for all data sets. As researchers, what we can do to mitigate this problem is to carefully document our method, release our code, and encourage the community to try this method on more datasets, as the occasion arises.

6 Related Work

In software engineering, hyperparameter optimization techniques have been applied to some sub-domains, but have yet to be adopted in many others. One way to characterize this paper is as an attempt to adapt recent work on hyperparameter optimization in software defect prediction to effort estimation. Note that, as in defect prediction, this article has also concluded that differential evolution is a useful method.

Several SE defect prediction techniques rely on static code attributes [40, 56, 73]. Much of that work has focused on finding and employing complex, “off-the-shelf” machine learning models [50, 54, 19], without any hyperparameter optimization. According to a literature review by Fu et al. [22], as shown in Figure 2, nearly 80% of highly cited papers in defect prediction do not mention parameter tuning (so they rely on the default parameter settings of the data miners).


Fig. 2 Literature review of hyperparameter tuning in 52 top defect prediction papers [22]. (Pie-chart breakdown: never mention tuning, 78%; just mention tuning, 14%; grid search, 4%; manual tuning, 2%; DE, 2%.)

Gao et al. [23] acknowledged the impact of parameter tuning on software quality prediction. For example, in their study, the “distanceWeighting” parameter was set to “Weight by 1/distance”, the KNN parameter “k” was set to “30”, and the “crossValidate” parameter was set to “true”. However, they did not provide any further explanation of their tuning strategies.

As to methods of tuning, Bergstra and Bengio [8] comment that grid search (for N tunable options, run N nested for-loops to explore their ranges) is very popular since (a) such a simple search gives researchers some degree of insight; (b) grid search has very little technical overhead for its implementation; (c) it is simple to automate and parallelize; (d) on a computing cluster, it can find better tunings than sequential optimization (in the same amount of time). That said, Bergstra and Bengio deprecate grid search since that style of search is no more effective than more randomized searches if the underlying search space is inherently low dimensional.
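As a concrete (toy) illustration of that style of search: for N tunable options, grid search crosses all of their value lists, which is equivalent to N nested for-loops. The option grids and the scoring function below are placeholders, not settings we endorse.

from itertools import product

grid = {"max_depth": [3, 6, 12],              # hypothetical option grids
        "min_samples_leaf": [1, 4, 8],
        "max_features": [0.25, 0.5, 1.0]}

def score(params):
    # Placeholder for a real evaluation (e.g. cross-validated MAE of CART
    # built with these settings); any deterministic function works here.
    return abs(params["max_depth"] - 6) + abs(params["max_features"] - 0.5)

names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]
best = min(candidates, key=score)
print(len(candidates), "candidates explored; best:", best)

Even this tiny grid already needs 3 x 3 x 3 = 27 evaluations; adding one more option with three values triples that count, which is one reason grid search scales poorly as the number of tunable options grows.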

Lessmann et al. [43] used grid search to tune parameters as part of their extensive analysis of different algorithms for defect prediction. However, they only tuned a small set of their learners and used the default settings for the rest. Our conjecture is that the overall cost of such tuning was too high, so they chose to tune only the most critical parts.

Two recent studies investigating the effects of parameter tuning on defect prediction were conducted by Tantithamthavorn et al. [74, 75] and Fu et al. [21]. Tantithamthavorn et al. also used grid search, while Fu et al. used differential evolution. Both papers concluded that tuning rarely makes performance worse across a range of performance measures (precision, recall, etc.). Fu et al. [21] also report that different data sets require different hyperparameters to maximize performance.

One major difference between the studies of Fu et al. [21] and Tantithamthavorn etal. [74] was the computational costs of their experiments. Since Fu et al.’s differentialevolution based method had a strict stopping criterion, it was significantly faster.



Note that there are several other methods for hyperparameter optimization, and we aim to explore several of them as part of future work. But, as shown here, it requires much work to create and extract conclusions from a hyperparameter optimizer. One goal of this work, which we think we have achieved, was to identify a simple baseline method against which subsequent work can be benchmarked.

7 Conclusions and Future Work

Hyperparameter optimization is known to dramatically improve the performance of many software analytics tasks such as software defect prediction or text classification [1, 2, 21, 75]. But as discussed in §2.3, the benefits of hyperparameter optimization for effort estimation have not been extensively studied. Prior work in this area only explored very small data sets [44] or used optimization algorithms that are not representative of the state-of-the-art in multi-objective optimization [17, 44, 71]. Other researchers assume that the effort model has a specific parametric form (e.g. the COCOMO equation), which greatly limits the amount of data that can be studied. Further, all that prior work needs to be revisited given the existence of recent and very prominent methods; i.e. ATLM from TOSEM’15 [79] and LP4EE from TOSEM’18 [62].

Accordingly, this paper conducts a more thorough investigation of hyperparameter optimization for effort estimation using methods (a) with no data feature assumptions (i.e. no COCOMO data); (b) that vary many parameters (6,000+ combinations); (c) that test results on 9 different sources with data on 945 software projects; (d) that use optimizers representative of the state-of-the-art (NSGA-II [16], MOEA/D [80], DE [72]); and (e) that benchmark results against prominent methods such as ATLM and LP4EE.

These results were assessed with respect to Arcuri and Fraser’s concerns mentioned in the introduction; i.e. that sometimes hyperparameter optimization can be both too slow and not effective. Such pessimism may indeed apply to the test data generation domain. However, the results of this paper show that there exist other domains, like effort estimation, where hyperparameter optimization is both useful and fast. After applying hyperparameter optimization, large improvements in effort estimation accuracy were observed (measured in terms of standardized accuracy). From those results, we can recommend using a combination of regression trees (CART) tuned by differential evolution and the default analogy-based estimator. This particular combination of learner and optimizers can achieve in a few hours what other optimizers need days to weeks of CPU time to accomplish.

To the best of our knowledge, this study is the most extensive exploration ofhyperparameter optimization and effort estimation yet undertaken. That said, thereare very many options not explored here. Our current plans for future work includethe following.

– Try other learners: e.g. neural nets, Bayesian learners, or random forests;
– Try other data pre-processors. We mentioned above how it was curious that %max features was often less than 100%. This is a clear indication that we might be able to further improve our estimation results by adding more intelligent feature selection to, say, CART.

– Other optimizers. For example, combining DE and MOEA/D might be a fruitfulway to proceed.

– Yet another possible future direction could be hyper-hyperparameter optimization. In the above, we used optimizers like differential evolution to tune learners. But these optimizers have their own control parameters. Perhaps there are better settings for the optimizers, which could be found via hyper-hyperparameter optimization?

Hyper-hyperparameter optimization could be a very slow process. Hence, results like those of this paper could be most useful, since here we have identified which optimizers are very fast and which are very slow (and the latter would not be suitable for hyper-hyperparameter optimization).

In any case, we hope that OIL and the results of this paper will prompt and enablemuch more research on better methods to tune software effort estimators. To that end,we have placed all our scripts and data on-line at https://github.com/ai-se/magic101

Acknowledgement

This work was partially funded by a National Science Foundation Award 1703487.

References

1. Agrawal A, Menzies T (2018) "Better data" is better than "better data miners" (benefits of tuning SMOTE for defect prediction). In: ICSE'18

2. Agrawal A, Fu W, Menzies T (2018) What is wrong with topic modeling? andhow to fix it using search-based software engineering. IST Journal

3. Aljahdali S, Sheta AF (2010) Software effort estimation by tuning COCOMO model parameters using differential evolution. In: Computer Systems and Applications (AICCSA), 2010 IEEE/ACS International Conference on, IEEE, pp 1–6

4. Angelis L, Stamelos I (2000) A simulation tool for efficient analogy based costestimation. EMSE 5(1):35–68

5. Arcuri A, Briand L (2011) A practical guide for using statistical tests to assess randomized algorithms in software engineering. In: Software Engineering (ICSE), 2011 33rd International Conference on, IEEE, pp 1–10

6. Arcuri A, Fraser G (2013) Parameter tuning or default values? An empirical investigation in search-based software engineering. ESE 18(3):594–623

7. Baker DR (2007) A hybrid approach to expert and model based effort estimation.West Virginia University

8. Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. JMach Learn Res 13(1):281–305

9. Boehm BW (1981) Software engineering economics. Prentice-Hall
10. Briand LC, Wieczorek I (2002) Resource estimation in software engineering. Encyclopedia of Software Engineering


11. Chalotra S, Sehra SK, Brar YS, Kaur N (2015) Tuning of COCOMO model parameters by using bee colony optimization. Indian Journal of Science and Technology 8(14)
12. Chang CL (1974) Finding prototypes for nearest neighbor classifiers. TC 100(11)
13. Cohn M (2005) Agile estimating and planning. Pearson Education
14. Corazza A, Di Martino S, Ferrucci F, Gravino C, Sarro F, Mendes E (2013) Using tabu search to configure support vector regression for effort estimation. Empirical Software Engineering 18(3):506–546

15. Cowing K (2002) Nasa to shut down checkout & launch control system.http://www.spaceref.com/news/viewnews.html?id=475

16. Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2):182–197

17. Dejaeger K, Verbeke W, Martens D, Baesens B (2012) Data mining techniquesfor software effort estimation: A comparative study. IEEE Trans SE 38(2):375–397

18. Efron B, Tibshirani J (1993) An introduction to the bootstrap. Chapman & Hall
19. Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. J Syst Softw 81(5):649–660, DOI 10.1016/j.jss.2007.07.040

20. Frank E, Hall M, Pfahringer B (2002) Locally weighted naive bayes. In: 19thconference on Uncertainty in Artificial Intelligence, pp 249–256

21. Fu W, Menzies T, Shen X (2016) Tuning for software analytics: Is it really necessary? IST Journal 76:135–146

22. Fu W, Nair V, Menzies T (2016) Why is differential evolution better than grid search for tuning defect predictors? arXiv preprint arXiv:1609.02613

23. Gao K, Khoshgoftaar TM, Wang H, Seliya N (2011) Choosing software metricsfor defect prediction: an investigation on feature selection techniques. Software:Practice and Experience 41(5):579–606

24. Glover F, Laguna M (1997) Tabu Search. Kluwer Academic Publishers, Norwell,MA, USA

25. Hall MA, Holmes G (2003) Benchmarking attribute selection techniques. TKDE15(6):1437–1447

26. Hardy MA (1993) Regression with dummy variables, vol 93. Sage
27. Hihn J, Menzies T (2015) Data mining methods and cost estimation models: Why is it so hard to infuse new ideas? In: ASEW, pp 5–9, DOI 10.1109/ASEW.2015.27
28. Jørgensen M (2004) A review of studies on expert estimation of software development effort. JSS 70(1-2):37–60
29. Jørgensen M (2015) The world is skewed: Ignorance, use, misuse, misunderstandings, and how to improve uncertainty analyses in software development projects

30. Jørgensen M, Gruschke TM (2009) The impact of lessons-learned sessions oneffort estimation and uncertainty assessments. TSE 35(3):368–383

31. Kampenes VB, Dyba T, Hannay JE, Sjøberg DI (2007) A systematic review of effect size in software engineering experiments. Information and Software Technology 49(11-12):1073–1086
32. Kemerer CF (1987) An empirical validation of software cost estimation models. CACM 30(5):416–429
33. Keung J, Kocaguneli E, Menzies T (2013) Finding conclusion stability for selecting the best effort predictor in software effort estimation. ASE 20(4):543–567, DOI 10.1007/s10515-012-0108-5

34. Keung JW, Kitchenham BA, Jeffery DR (2008) Analogy-x: Providing statisticalinference to analogy-based software cost estimation. TSE 34(4):471–484

35. Kocaguneli E, Menzies T (2011) How to find relevant data for effort estimation?In: ESEM, pp 255–264, DOI 10.1109/ESEM.2011.34

36. Kocaguneli E, Misirli AT, Caglayan B, Bener A (2011) Experiences on developerparticipation and effort estimation. In: SEAA’11, IEEE, pp 419–422

37. Kocaguneli E, Menzies T, Bener A, Keung JW (2012) Exploiting the essentialassumptions of analogy-based effort estimation. TSE 38(2):425–438

38. Kocaguneli E, Menzies T, Mendes E (2015) Transfer learning in effort estimation. ESE 20(3):813–843, DOI 10.1007/s10664-014-9300-5

39. Krall J, Menzies T, Davies M (2015) GALE: Geometric active learning for search-based software engineering. IEEE Transactions on Software Engineering 41(10):1001–1018, DOI 10.1109/TSE.2015.2432024

40. Krishna R, Menzies T, Fu W (2016) Too much automation? the bellwether effectand its implications for transfer learning. In: IEEE/ACM ICSE, ASE 2016, DOI10.1145/2970276.2970339

41. Langdon WB, Dolado J, Sarro F, Harman M (2016) Exact mean absolute errorof baseline predictor, marp0. IST 73:16–18

42. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth

43. Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classificationmodels for software defect prediction: A proposed framework and novel findings.IEEE Transactions on Software Engineering 34(4):485–496, DOI 10.1109/TSE.2008.35

44. Li Y, Xie M, Goh TN (2009) A study of project selection and feature weightingfor analogy based software cost estimation. JSS 82(2):241–252

45. Mahnic V, Hovelja T (2012) On using planning poker for estimating user stories.Journal of Systems and Software 85(9):2086–2095

46. McConnell S (2006) Software estimation: demystifying the black art. Microsoftpress

47. Mendes E, Mosley N (2002) Further investigation into the use of cbr and stepwiseregression to predict development effort for web hypermedia applications. In:ESEM’02, IEEE, pp 79–90

48. Mendes E, Watson I, Triggs C, Mosley N, Counsell S (2003) A comparative study of cost estimation models for web hypermedia applications. ESE 8(2):163–196

49. Menzies T, Chen Z, Hihn J, Lum K (2006) Selecting best practices for effortestimation. TSE 32(11):883–895

50. Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes tolearn defect predictors. IEEE Transactions on Software Engineering 33(1):2–13,DOI 10.1109/TSE.2007.256941


51. Menzies T, Yang Y, Mathew G, Boehm B, Hihn J (2017) Negative results for software effort estimation. ESE 22(5):2658–2683, DOI 10.1007/s10664-016-9472-2

52. Mittas N, Angelis L (2013) Ranking and clustering software cost estimationmodels through a multiple comparisons algorithm. IEEE Trans SE 39(4):537–551, DOI 10.1109/TSE.2012.45

53. Moeyersoms J, Junque de Fortuny E, Dejaeger K, Baesens B, Martens D (2015)Comprehensible software fault and effort prediction. J Syst Softw 100(C):80–90,DOI 10.1016/j.jss.2014.10.032

54. Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiencyof change metrics and static code attributes for defect prediction. In: 30th ICSE,DOI 10.1145/1368088.1368114

55. Nair V, Agrawal A, Chen J, Fu W, Mathew G, Menzies T, Minku LL, Wagner M,Yu Z (2018) Data-driven search-based software engineering. In: MSR

56. Nam J, Kim S (2015) Heterogeneous defect prediction. In: 10th FSE, ESEC/FSE2015, DOI 10.1145/2786805.2786814

57. Neter J, Kutner MH, Nachtsheim CJ, Wasserman W (1996) Applied linear statistical models, vol 4. Irwin Chicago

58. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J (2011) Scikit-learn: Machine learning in Python. JMLR 12(Oct):2825–2830

59. Peters T, Menzies T, Layman L (2015) Lace2: Better privacy-preserving datasharing for cross project defect prediction. In: ICSE, vol 1, pp 801–811, DOI10.1109/ICSE.2015.92

60. Quinlan JR (1992) Learning with continuous classes. In: 5th Australian jointconference on artificial intelligence, Singapore, vol 92, pp 343–348

61. Rao GS, Krishna CVP, Rao KR (2014) Multi objective particle swarm optimization for software cost estimation. In: ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India - Vol I, Springer, pp 125–132
62. Sarro F, Petrozziello A (2018) Linear programming as a baseline for software effort estimation. ACM Transactions on Software Engineering and Methodology (TOSEM), to appear
63. Sarro F, Petrozziello A, Harman M (2016) Multi-objective software effort estimation. In: ICSE, ACM, pp 619–630

64. Sayyad AS, Ammar H (2013) Pareto-optimal search-based software engineering(posbse): A literature survey. In: RAISE’13, IEEE, pp 21–27

65. Sayyad AS, Menzies T, Ammar H (2013) On the value of user preferences insearch-based software engineering: a case study in software product lines. In:ICSE’13, IEEE Press, pp 492–501

66. Shepperd M (2007) Software project economics: a roadmap. In: 2007 Future ofSoftware Engineering, IEEE Computer Society, pp 304–315

67. Shepperd M, MacDonell S (2012) Evaluating prediction systems in softwareproject estimation. IST 54(8):820–827

68. Shepperd M, Schofield C (1997) Estimating software project effort using analogies. TSE 23(11):736–743


69. Singh BK, Misra A (2012) Software effort estimation by genetic algorithm tuned parameters of modified constructive cost model for NASA software projects. International Journal of Computer Applications 59(9)
70. Sommerville I (2010) Software engineering. Addison-Wesley
71. Song L, Minku LL, Yao X (2013) The impact of parameter tuning on software effort estimation using learning machines. In: PROMISE'13, ACM, New York, NY, USA, pp 9:1–9:10, DOI 10.1145/2499393.2499394

72. Storn R, Price K (1997) Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. JoGO 11(4):341–359

73. Tan M, Tan L, Dara S (2015) Online defect prediction for imbalanced data. In:ICSE

74. Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016) Automatedparameter optimization of classification techniques for defect prediction models.In: 38th ICSE, DOI 10.1145/2884781.2884857

75. Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2018) The impact of automated parameter optimization on defect prediction models. IEEE Transactions on Software Engineering, pp 1–1, DOI 10.1109/TSE.2018.2794977

76. Trendowicz A, Jeffery R (2014) Software project effort estimation: Foundations and best practice guidelines for success. Constructive Cost Model – COCOMO, pp 277–293

77. VersionOne (2017) 11th annual state of agile survey. Technical report, VersionOne

78. Walkerden F, Jeffery R (1999) An empirical study of analogy-based softwareeffort estimation. ESE 4(2):135–158

79. Whigham PA, Owen CA, Macdonell SG (2015) A baseline model for softwareeffort estimation. TOSEM 24(3):20:1–20:11, DOI 10.1145/2738037

80. Zhang Q, Li H (2007) MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation 11(6):712–731, DOI 10.1109/TEVC.2007.892759