Predictive accuracy comparison between neural networks and statistical regression for development effort of software projects

Cuauhtémoc López-Martín
Information Systems Department, Universidad de Guadalajara, P.O. Box 45100, Jalisco, Mexico
Tel.: +52 3337703352; e-mail: [email protected]

Applied Soft Computing (2014), in press. http://dx.doi.org/10.1016/j.asoc.2014.10.033
Received 5 July 2013; received in revised form 6 August 2014; accepted 27 October 2014.

Keywords: Software development effort prediction; Radial Basis Function Neural Network; Feedforward multilayer perceptron; General regression neural network; Statistical regression; ISBSG data set

Abstract

To get a better prediction of the costs, schedule, and risks of a software project, it is necessary to have a more accurate prediction of its development effort. Among the main prediction techniques are those based on mathematical models, such as statistical regressions or machine learning (ML). Studies applying ML models to development effort prediction have commonly based their conclusions on work exhibiting the following weaknesses: (1) using an accuracy criterion which leads to asymmetry; (2) applying a validation method that causes conclusion instability by randomly selecting the samples for training and testing the models; (3) omitting the explanation of how the parameters for the neural networks were determined; (4) generating conclusions from models that were not trained and tested on mutually exclusive data sets; (5) omitting an analysis of the dependence, variance and normality of the data when selecting the suitable statistical test for comparing the accuracies among models; and (6) reporting results without showing a statistically significant difference. In this study, these six issues are addressed when comparing the prediction accuracy of a Radial Basis Function Neural Network (RBFNN) with that of a statistical regression (the model most frequently compared with ML models), of a feedforward multilayer perceptron (MLP, the neural network most commonly used in the effort prediction of software projects), and of a general regression neural network (GRNN, a RBFNN variant). The hypothesis tested is the following: the accuracy of effort prediction for the RBFNN is statistically better than the accuracy obtained from a simple linear regression (SLR), MLP and GRNN when adjusted function points data, obtained from software projects, are used as the independent variable. Samples obtained from the International Software Benchmarking Standards Group (ISBSG) Release 11 related to new and enhanced projects were used. The models were trained and tested using a leave-one-out cross-validation method. The criteria for evaluating the models were based on Absolute Residuals and on a Friedman statistical test. The results showed that there was a statistically significant difference in accuracy among the four models for new projects, but not for enhanced projects. Regarding new projects, the accuracy of the RBFNN was better than that of the SLR at the 99% confidence level, whereas the MLP and GRNN were better than the SLR at the 90% confidence level.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Since two of the three most important causes of Information Technology project failure have been related to poor resource prediction [1], the creation of accurate models for Software Development Effort Prediction (SDEP) represents a continuous activity of researchers and software managers [2]. On average, software developers expend from 30% to 40% more effort than is predicted [1]. Underpredicting software project effort causes schedule delays and cost overruns, which may lead to project failure and affect the reputation and competitiveness of a company; on the other hand, overpredicting software project effort may lead to an ineffective use of software development resources, which can result in missed opportunities to fund other projects and, therefore, the loss of project tenders [3,4]. These scenarios have motivated researchers to direct their efforts toward determining which technique is more accurate for effort prediction, or toward proposing new or combined techniques that could provide better predictions [5]. These techniques can be classified into the following two general categories:

1) Expert judgment, which implies a lack of analytical argumentation and aims to derive estimates based on the experience of experts on similar projects. This technique is based on a tacit (intuition-based) quantification step [6].



2) Model-based techniques, which are based on a deliberate (mechanical) quantification step. They can be divided into models based on statistics (whose general form is a statistical regression model [7]) and models based on machine learning (ML), such as case-based reasoning, artificial neural networks, decision trees, Bayesian networks, support vector regression, genetic algorithms, genetic programming, and association rules [5].

These techniques have been applied to SDEP of projects developed at the individual level in academic environments [8,9] or by teams [1,5]. This study involves projects developed by teams of practitioners in enterprise environments.

A systematic literature review of 84 primary studies specifically regarding the application of ML techniques to SDEP was structured from journals (70%), conferences (29%) and a book chapter [5]. This review revealed the following issues:

1) ML techniques have mainly been used in two forms between the years 1991 and 2010: either alone, as a combination of two or more ML techniques, or as a combination with non-ML techniques.

2) Neural networks represent the second most used technique with 26%, after the case-based reasoning technique with 37%.

3) Genetic algorithms have only been used in combination with other techniques such as case-based reasoning, artificial neural networks, and support vector regression. In the mentioned cases, genetic algorithms have been used for either feature weighting or feature selection.

4) As for weaknesses of techniques: decision trees are prone to overfitting on small training data sets; case-based reasoning and neural networks cannot deal with categorical features in their standard forms; case-based reasoning, neural networks and decision trees cannot deal with missing values in their standard forms; and finally, neural networks are not easy for practitioners to understand.

5) As to strengths of techniques: case-based reasoning and decision trees are intuitive and easy to understand, whereas neural networks have the ability to learn complex (non-linear) functions.

6) Strengths and weaknesses of Bayesian networks, support vector regression, genetic programming, and association rules were not listed, since none of them was mentioned by more than one study. That is, their application in software development prediction is still scarce.

The SDEP accuracy results of ML techniques have so far been inconclusive and inconsistent: accuracy varies under the same model when it is constructed with different historical project data sets or different experimental designs. This has been attributed to the limited number of primary studies [5]. Therefore, this research is also motivated by the suggestion from Ref. [5] to conduct more empirical studies in order to obtain stronger evidence on the performance of the techniques.

With regard to neural networks, several kinds have been applied to SDEP. Until the year 2008, the feedforward multilayer perceptron (MLP) with the back-propagation learning algorithm was the neural network most commonly used in SDEP [10], and this kind of neural network remained in use through 2009 [2], 2012 [11], and 2013 [12]. In this research, the application of an artificial neural network named Radial Basis Function Neural Network (RBFNN) is proposed. The RBFNN prediction accuracy is compared with that of a MLP and a GRNN (a variant of the RBFNN recently used for predicting the effort). The proposal of the RBFNN is based on the following arguments:

1) A survey of 96 studies comparing the performance between neural networks and statistical regression models in several fields showed that neural networks outperformed the regression models in about 58% of the cases, whereas in 24% of the cases the performance of the statistical models was equivalent to that of the neural networks [13].

2) Non-linear relationships between development effort and the independent variables are common in software projects [14].

3) Neural networks are the most accurate models among ML techniques when they have been applied to SDEP [5].

4) In spite of the fact that neural networks have been shown to have better prediction accuracy than regression models when applied to SDEP, the application of ML techniques in industry is still limited [5].

5) The procedure for training a RBFNN is faster than that used for training a MLP, due to the internal representation formed by the RBFNN hidden neurons, which leads to a two-stage training procedure [15]; moreover, a RBFNN requires substantially less computation than a GRNN to evaluate new points [38].

Ten studies have been identified where a RBFNN was applied to SDEP. Six of them compared a RBFNN with a statistical regression [22–27], whereas the other four did not compare the accuracy of the RBFNN with any other technique [28–31]. In this research, the SDEP accuracy of the RBFNN is compared with the accuracy of a statistical regression, for the following reasons:

1) The use of non-ML (such as statistical regression) and ML techniques in parallel at the early stage of development is recommended [5].

2) A regression analysis allows selecting the statistically significant independent variables [16] that explain the dependent variable (development effort).

3) Statistical regressions are the models most frequently compared with ML models [5,13].
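Reason 2 above rests on fitting a regression of effort on a candidate variable. A minimal sketch of the SLR used throughout the paper, effort = b0 + b1 * AFP, fitted by ordinary least squares in plain Python (the AFP/effort values are made up for illustration; they are not ISBSG data):

```python
def fit_slr(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals
    for the simple linear regression y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    return b0, b1

afp    = [100.0, 200.0, 300.0, 400.0]    # adjusted function points (made up)
effort = [500.0, 900.0, 1300.0, 1700.0]  # person-hours (made up)

b0, b1 = fit_slr(afp, effort)
print(b0, b1)  # these points lie exactly on effort = 100 + 4 * AFP
```

In practice the significance of b1 (reason 2) would then be checked with the usual t-test on the slope.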

Neural networks and statistical models have been applied in fields such as accounting, finance, health, medicine, engineering, manufacturing and marketing [13], and specifically to SDEP [5]. The following six weaknesses have been identified in studies where the accuracy of these two kinds of models was compared:

1. Although the Mean Magnitude of Relative Error (MMRE) is a biased predictor of the central tendency of the residuals, because it is an asymmetric measure [17], 89% of the studies have used the MMRE as the criterion for evaluating the prediction accuracy of ML models [5].

2. The three dominant validation methods used for evaluating the prediction accuracy of ML models are Holdout (38%), leave-one-out cross-validation or LOOCV (37%), and k-fold cross-validation (19%) [5]. The Holdout and k-fold cross-validation methods mainly use a random selection for training and testing the models, which introduces the problem of conclusion instability across different studies. Therefore, methods that use a random selection have been deprecated for assessing SDEP models [18].

3. The determination of parameters for the neural networks, such as the number of hidden layers or the number of neurons in each hidden layer, is usually omitted [13].

4. Results obtained from the trained models were not tested on a new data set that was not used for training the models. The reason for testing on a separate data set is that, otherwise, it would not have provided an unbiased prediction of the generalization error [15].

5. Most of the studies did not use statistical techniques for analyzing the dependence, normality and variance of the data; these were not considered, or at least not explicitly mentioned [13]. This analysis is useful for selecting the suitable statistical test for comparing the accuracies of models [19].

6. It was not clear whether statistically significant differences existed in the performance of the different models that were compared (the conclusions of only 32% of the studies were based on statistical tests performed to observe whether significant differences existed among the models [13]). This kind of analysis is important because it is an inherent part of empirical studies that employ significance testing; it is essential for the planning of studies, for the interpretation of results, and for the validity of study conclusions (the 90%, 95% and 99% confidence levels are usually reported [20]).
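The asymmetry criticized in weakness 1 above can be sketched numerically: because MRE = |actual − predicted| / actual is bounded by 1 for any underestimate but unbounded for overestimates, MMRE systematically favors models that underpredict. The values below are illustrative only:

```python
# Three projects with actual effort 100; one model underestimates by a
# factor of 10, the other overestimates by a factor of 10. Both are
# equally wrong multiplicatively, but MMRE rates them very differently.
actuals = [100.0, 100.0, 100.0]
under   = [10.0, 10.0, 10.0]       # predictions = actual / 10
over    = [1000.0, 1000.0, 1000.0]  # predictions = actual * 10

def mmre(actual, predicted):
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

mmre_under = mmre(actuals, under)
mmre_over  = mmre(actuals, over)
print(mmre_under, mmre_over)  # 0.9 vs 9.0

# The Absolute Residual (AR) of weakness-free criterion is simply
# |actual - predicted|, which treats over- and underestimates of the
# same magnitude identically.
```

This is why the study adopts AR rather than MMRE as the accuracy criterion.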

The contribution of this study is to investigate the application of a RBFNN to SDEP with the adjusted function points (AFP) as the independent variable. The accuracy of the RBFNN is compared to those of a Simple Linear Regression (SLR, a non-ML technique), a MLP and a GRNN. This contribution is achieved by taking into account the previous six weaknesses, as follows:

1. The use of the Absolute Residuals (AR) as the criterion for comparing the prediction accuracy among the four models. The AR is unbiased, since it is not based on ratios [17].

2. The application of LOOCV as the validation method. In terms of reproducibility, LOOCV removes one cause of conclusion instability: the random selection of train and test sets; that is, LOOCV avoids nondeterministic selection of train and test sets [18]. In addition, a recent study regarding effort prediction showed that for the majority of the 90 algorithms (one of them a MLP and four of them statistical regressions) used on 20 different data sets, the LOOCV had statistically indistinguishable bias and variance values compared with other validation methods such as 3-fold and 10-fold cross-validation; moreover, it was found that LOOCV and k-fold validation methods had similar run times [18].

3. A description of how the number of hidden layers and the number of neurons in each hidden layer for the MLP, as well as the optimal spread parameter for the RBFNN and GRNN, were determined.

4. Mutually exclusive data sets are used for training and testing the models when applying the LOOCV method.

5. The dependence of the data is determined. In addition, the Chi-squared, Shapiro–Wilk, skewness, and kurtosis statistical tests are used for testing the normality of the data. Finally, the Levene statistical test is used for testing the variances among samples [21].

6. A suitable statistical test is selected for comparing the prediction accuracy among the four models based on the results of the data dependence, normality and variance tests [21].

None of the ten studies identified as applying a RBFNN to SDEP [22–31] met the previous six issues.

The RBFNN, MLP, GRNN and SLR of this research are trained and tested by using an industrial data set. Enterprise projects were measured using adjusted function points, which were obtained from the International Software Benchmarking Standards Group (ISBSG) Release 11 [32]. The most common research topic when the ISBSG data set has been used is prediction methods, with 70.5%; this percentage was obtained from 129 studies published between the years 2000 and 2012 in 19 journals and 40 conferences [33].

The hypothesis to be investigated in this research is the following:

H1. The SDEP accuracy for RBFNN is statistically better than the accuracy obtained from a SLR, MLP and GRNN when adjusted function points data, obtained from software projects, are used as the independent variable.
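The validation protocol of points 1 and 2 above (LOOCV combined with AR) can be sketched as follows: each project is held out once, the model is fitted on the remaining projects, and the absolute residual of the held-out project is recorded. The data and the SLR model below are made up for illustration:

```python
def fit_slr(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    return my - b1 * mx, b1

def loocv_abs_residuals(x, y):
    """Deterministic leave-one-out cross-validation: no random split,
    so the resulting ARs are fully reproducible."""
    ars = []
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        b0, b1 = fit_slr(train_x, train_y)
        ars.append(abs(y[i] - (b0 + b1 * x[i])))
    return ars

afp    = [100.0, 150.0, 200.0, 250.0, 300.0]   # made-up AFP values
effort = [500.0, 700.0, 900.0, 1100.0, 1300.0]  # made-up effort values

ars = loocv_abs_residuals(afp, effort)
print(ars)  # all residuals are 0 here because the made-up data are exactly linear
```

The per-project ARs produced this way are what the Friedman test later compares across the four models.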


The rest of this study is organized as follows: Section 2 analyzes the studies identified where a RBFNN was applied for predicting the software development effort. In Section 3 the RBFNN used in this study is detailed, whereas the MLP and GRNN are briefly explained. Section 4 describes the dependent and independent variables used in the four models. Section 5 describes the criteria for selecting the data set used for the models. Section 6 analyzes the functional form of the data set and describes the accuracy criteria used for evaluating the performance of the models. Section 7 covers the criteria for selecting the validation method for the models, applies the selected method, and describes the theory for analyzing the dependence, normality and variance of the data. In Section 8, a suitable statistical test is selected and applied for comparing the accuracies of the four models. Finally, Section 9 presents the discussion, conclusions, limitations of this study and future research.

2. Related work

Table 1 shows the ten studies where a RBFNN was applied to SDEP. It provides the following attributes of each study: the size of the samples for training and testing the RBFNN, the prediction accuracy criterion, the best accuracy achieved in the testing phase, the size measure for projects, as well as the method used for validating the models (three studies used the full data set only for the training phase [28–30], whereas another one [24] used in its testing phase five projects selected from the same data set which was used for training). Eight of the ten studies used the MMRE as the aggregation evaluation criterion; in one study the authors only compared five cases without using any aggregation criterion [24], whereas the tenth one used the root mean square error (RMSE) [25], which is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\text{actual effort}_i - \text{predicted effort}_i\right)^2}$$
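The RMSE formula above translates directly into code; the effort values below are made up for illustration:

```python
import math

def rmse(actual, predicted):
    """Root mean square error of a list of predictions, as in the
    formula above: sqrt of the mean of the squared residuals."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

result = rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(result)  # sqrt(17/3) ~ 2.3805
```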

Regarding the sources of the data sets: five studies [24,27–30] used COCOMO 81 (comprising 63 projects reported in the year 1981 [34]); two studies [29,30] used the Tukutuku data set (53 projects); two studies [23,31] based their conclusions on the NASA data set (18 projects); one study [22] used three data sets obtained from the IBM DP Services Organization (24 projects), Kemerer (15 projects), and Hallmark (28 projects); one study [26] trained its models on the IBM DPS (24 projects) and on a Canadian financial organization (37 projects); and one [25] used a sample of 1538 projects from the ISBSG Release 10 data set.

As for the independent variables, three of the ten studies involved FP: one of them used unadjusted FP [22], a second study used adjusted FP [26], and a third one used the five ILF, EIF, EI, EO and EQ values (described in Section 4 of this study) [25].
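For readers unfamiliar with how the ILF, EIF, EI, EO and EQ counts relate to adjusted function points, a hedged sketch using the standard IFPUG average-complexity weights follows (real counting distinguishes low/average/high complexity per item; the project counts below are made up):

```python
# Standard IFPUG average-complexity weights (an assumption here; the
# paper itself takes AFP values directly from the ISBSG data set).
WEIGHTS = {"ILF": 10, "EIF": 7, "EI": 4, "EO": 5, "EQ": 4}

def unadjusted_fp(counts):
    """Unadjusted function points: weighted sum of the five counts."""
    return sum(WEIGHTS[k] * counts.get(k, 0) for k in WEIGHTS)

def adjusted_fp(counts, total_degree_of_influence):
    """AFP = UFP * VAF, where the value adjustment factor is
    VAF = 0.65 + 0.01 * TDI (TDI ranges from 0 to 70)."""
    vaf = 0.65 + 0.01 * total_degree_of_influence
    return unadjusted_fp(counts) * vaf

counts = {"ILF": 4, "EIF": 2, "EI": 10, "EO": 6, "EQ": 5}  # made-up project
afp = adjusted_fp(counts, 35)
print(afp)  # VAF works out to 1.0 here, so AFP equals the unadjusted count
```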

Table 2 includes an analysis of each study with respect to thefollowing questions related to the six weaknesses mentioned inSection 1 of this research:

Q1: Does it use a prediction accuracy criterion which leads to symmetry, such as AR?

Q2: Does it use a validation method that does not randomly select the samples for training and testing the models, such as the LOOCV method?

Q3: Does it include an explanation of how the number of neurons, hidden layers or spread parameters was obtained?

Q4: Are the models tested with a data set that was not used for training the models?

Q5: Is there any analysis of dependence, normality and variance for the data to select a suitable statistical test for comparing the prediction accuracy between models?

Q6: Are the accuracies of the models compared based upon a statistically significant difference?

Table 1. Analysis of data sets of studies related to RBFNN (En: nth experiment; LOC: lines of code; FP: function points; AFP: adjusted function points; ILF, EIF, EI, EO and EQ: function point base components; WM: web metrics; DM: development methodology).

Study | Size of samples | Sample size for training | Sample size for testing | Accuracy criterion | Best accuracy obtained in testing | Project measures | Validation method
[22] | Set 1: 15; Set 2: 24; Set 3: 28 | E1: 32; E2: 60 | E1: 7; E2: 7 | MMRE | 0.32 | LOC, FP | Holdout
[23] | 18 | E1: 18; E2: 18 | E1: 1; E2: 1 | MMRE | 0.13 | E1: LOC; E2: LOC and DM | Leave-one-out cross-validation
[24] | 63 | 63 | 5 | – | – | LOC | –
[25] | 1538 | 1380 | 158 | RMSE | 0.051 | ILF, EIF, EI, EO and EQ | 10-fold cross-validation
[26] | Set 1: 24; Set 2: 37 | E1: 19; E2: 30 | E1: 5; E2: 7 | MMRE | 0.29 | AFP | 3-fold cross-validation and Holdout (80–20%)
[27] | 63 | 53 | 10 | MMRE | 0.17 | LOC | Holdout
[28] | Set 1: 252; Set 2: 53 | E1: 252; E2: 53 | E1: None; E2: None | MMRE | – | LOC, WM | –
[29] | 252 | 252 | None | MMRE | – | LOC | –
[30] | Set 1: 252; Set 2: 53 | E1: 252; E2: 53 | E1: None; E2: None | MMRE | – | LOC, WM | –
[31] | 18 | E1: 18; E2: 18 | E1: 1; E2: 1 | MMRE | 0.18 | E1: LOC; E2: LOC and DM | Leave-one-out cross-validation

Among the ten studies described in Table 2, none used a prediction accuracy criterion leading to symmetry (such as AR). Just two of them used a validation method that ensures conclusion stability (such as LOOCV) [23,31]. Two of them reported that a spread parameter was used for training their RBFNN [24,27]; however, they did not report how that value was obtained. Three studies did not use mutually exclusive samples for training and testing the models [28–30]. None of the ten studies reported an analysis of dependence, normality and variance for the data such that a suitable statistical test could be selected, based on the results of that analysis, for comparing the accuracy between models. Finally, just one of the ten studies reported its results based on a statistically significant difference [25].

As for the MLP with the back-propagation learning algorithm, it was the neural network most commonly used in the effort prediction of software projects up to the year 2008 [10] and until 2013 [2,11,12].

Regarding the GRNN, it has already been applied to SDEP [37]; however, that research neither used an accuracy criterion which leads to symmetry nor a validation method that ensures conclusion stability.

The prediction accuracy of the GRNN in [37] was compared to that of a SLR; results showed that there was not a statistically significant difference in the prediction accuracy of the two models.

Table 2. Analysis of studies where a RBFNN has been applied.

Study | Q1 | Q2 | Q3 | Q4 | Q5 | Q6
[22] | No | No | Yes | Yes | No | No
[23] | No | Yes | Yes | Yes | No | No
[24] | No | No | No | Yes | No | No
[25] | No | No | No | Yes | No | Yes
[26] | No | No | Yes | Yes | No | No
[27] | No | No | No | Yes | No | No
[28] | No | No | Yes | No | No | No
[29] | No | No | Yes | No | No | No
[30] | No | No | Yes | No | No | No
[31] | No | Yes | Yes | Yes | No | No

3. Neural networks

An artificial neural network, or simply a neural network (NN), is a technique of computing and signal processing that is inspired by the processing done by a network of biological neurons [35]. The basis for the construction of a neural network is the artificial neuron. Feedforward neural network architectures can be classified into the following two categories: (1) a single layer with adjustable weights and bias, or (2) multilayer networks that contain an input layer, one or more hidden layers of neurons, and an output layer of neurons [35]. The neural networks applied to SDEP have been the MLP, RBFNN and GRNN, which correspond to multilayered feedforward neural networks.

The MLP is based on neurons which compute a nonlinear function of the scalar product of the input vector and a weight vector of that neuron, whereas in a RBFNN, the activation of a hidden neuron is determined by the (Euclidean) distance between the input vector and the weight vector representing the center of that neuron in the input space [35]. The GRNN (two hidden layers) is a variant of the RBFNN (one hidden layer). In the following subsections, the RBFNN used in this study is detailed, and the MLP and the GRNN are briefly described.

Page 5: Predictive accuracy comparison between neural networks and statistical regression for development effort of software projects

ARTICLE IN PRESSG ModelASOC-2592; No. of Pages 16

C. López-Martín / Applied Soft Computing xxx (2014) xxx–xxx 5

Fig. 1. RBFNN architecture.

3.1. Radial Basis Function Neural Network (RBFNN)

A RBFNN is a network whose architecture consists of one input layer, one middle layer (hidden layer) and one output layer [15,35]. Fig. 1 shows the architecture of a RBFNN including the dependent and independent variables of this study. Each software project in the training set T of Fig. 1 has a known development effort e ∈ ℝ and a known set of n attributes, represented by a vector →p ∈ ℝⁿ. The effort e is assumed to be an unknown target function t of the other n attributes: e = t(→p). The training set T = {(→pj, ej) | j = 1, 2, 3, . . ., J}, containing J projects, is such that there is no pair of parallel →pj vectors. The goal is to use the training set to train the RBFNN to approximate the unknown target function t. When a training input vector →pj is given as input to the RBFNN, the output of the RBFNN has to produce the corresponding effort ej of that project. When the attributes →x ∈ ℝⁿ of some new project are presented as an input, the output is the predicted effort y ∈ ℝ.

The input layer of the RBFNN consists of n neurons, such that each neuron receives one of the n components of the input vector →x. An input neuron does not have any parameters and does not do any processing of the received component. Each input neuron simply passes the received component to each of the neurons in the hidden layer, so that each neuron in the hidden layer receives the same and complete input vector →x.

The hidden layer of the RBFNN consists of J radial basis neurons, one neuron for each project in the training set. Each hidden neuron has two parameters: a weight vector →w ∈ ℝⁿ and a bias b ∈ ℝ. The processing in a hidden neuron consists of the calculation of an inner value v and of an output value f.

The inner value of the hidden neuron is calculated as the distance (the Euclidean norm of the difference) of the vectors →x and →w, multiplied by the bias parameter:

v = ‖→x − →w‖ · b

The output value f of the hidden neuron is calculated by using one of the radial basis functions. In this case, the radial basis function is the following Gaussian function:

f = f(v) = exp(−v²)

The domain of this function is the complete set of real numbers ℝ = (−∞, ∞) and its range is (0, 1], such that f(0) = 1 and f decreases non-linearly as v assumes values further away from 0 in either the negative or the positive direction.

Each of the hidden neurons passes its output f to each of the neurons in the output layer. In this case, since the effort is a scalar, there is only one neuron in the output layer.

The output neuron receives the output values f from the J hidden neurons as a vector →f ∈ ℝᴶ and produces the predicted effort y as the output of the RBFNN. The output neuron contains a weight vector →W ∈ ℝᴶ and a bias B ∈ ℝ as parameters. The output neuron is a linear function, therefore the output value is calculated as follows:

y = →f · →W + B

The parameters of the RBFNN (→w ∈ ℝⁿ and b ∈ ℝ for each hidden neuron, and →W ∈ ℝᴶ and B ∈ ℝ for the output neuron) are calculated from the training set T and a chosen spread constant.
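The forward pass described above can be summarized in a few lines. The following is an illustrative sketch (not the author's implementation), assuming NumPy and the Gaussian basis function defined in the text:

```python
import numpy as np

def rbfnn_predict(x, centers, b, W, B):
    """Forward pass of the RBFNN described above.

    x       : (n,) input vector (project attributes)
    centers : (J, n) hidden-neuron weight vectors (one per training project)
    b       : scalar bias shared by all hidden neurons
    W       : (J,) output-neuron weight vector
    B       : scalar output-neuron bias
    """
    # Inner value of each hidden neuron: Euclidean distance times the bias.
    v = np.linalg.norm(centers - x, axis=1) * b
    # Gaussian radial basis output of each hidden neuron.
    f = np.exp(-v ** 2)
    # Linear output neuron: dot product with the weights plus the bias.
    return f @ W + B
```

When the input coincides with a center, that hidden neuron outputs f = 1, its maximum, as derived in Section 3.1.1.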

3.1.1. The calculation of the hidden neurons' weight vectors →w

Each hidden neuron corresponds to one distinct software project in the training set. The weight vector of the j-th neuron is simply the vector of the attributes of its corresponding j-th project: →wj = →pj. This achieves that, when some training vector →pj is presented as an input, the inner value in the corresponding j-th hidden neuron is the following:

vj = ‖→pj − →wj‖ · b = ‖→pj − →pj‖ · b = 0 · b = 0

The output value of the neuron is fj = f(vj) = f(0) = exp(−0²) = 1, and it is the maximum output value that a hidden neuron can produce.

3.1.2. The calculation of the hidden neuron's bias scalar b

The output value of the hidden neuron has the range 0 < f ≤ 1. The output value is f = 1 when the inner value v = 0, and f decreases toward 0 as the value of v moves further away from 0. To control the sensitivity of the function f = f(v), that is, how fast it decreases, or over which part of its domain it has a value greater than 0.5, the inner value v-half (vH) that produces f = 0.5 is of interest.

Since f = f(v) = exp(−v²), with the domain of f restricted to v ∈ [0, ∞), the inverse function is v = v(f) = +√(−ln(f)). Therefore, for f = 0.5 we have: vH = v(0.5) = √(−ln(0.5)) ≈ 0.83255. The value of the Euclidean distance ‖→x − →w‖ corresponding to vH, for some fixed value of b, is called the spread. That is, the spread of the j-th hidden neuron is the Euclidean distance ‖→x − →wj‖ = ‖→x − →pj‖ in the input space of the RBFNN, for which the output of that neuron is fj = 0.5.

Based upon vj = ‖→x − →wj‖ · b, when the input →x is such that ‖→x − →wj‖ = spread, then vj = vH, and vH = spread · b. Thus, the parameter b in all hidden neurons may have the same value, calculated as follows:

b = vH / spread ≈ 0.83255 / spread

With such a value of b, the spread is equivalent to the Euclidean distance ‖→x − →w‖ in the input space of the RBFNN, for which v = vH and f = 0.5.

3.1.3. The calculation of the vector →W of the output neuron and the scalar B

A RBFNN performs a hard-point interpolation. Hard-point means that when a training input vector, containing the attributes →pj of a project, is given as input to the RBFNN, the output of the RBFNN is exactly the corresponding development effort ej of that project. The output neuron parameters →W ∈ ℝᴶ and B ∈ ℝ are calculated so that the hard-point requirement is fulfilled. It is done by using a simulation and by solving a linear system, as follows. The first input vector →p1 from the training set is given as input to the RBFNN, and the output vector of the hidden layer →f ∈ ℝᴶ, which goes into the output neuron, is memorized as →f1. The same is done for each of the J input vectors →pj from the training set. Thus, a list of vectors (→f1, →f2, . . ., →fj, . . ., →fJ) is created and memorized. Each vector →fj is transposed into a row vector and an extra element "1" is appended to each row vector. Thus, a matrix F with J rows by J + 1 columns is created, where each row corresponds to one project at the input. Also, a column vector →E of size J is created from the known efforts ej of the projects of the training set.

The parameter B of the output neuron is appended as the bottom element of the column vector →W = (W1, W2, . . ., Wj, . . ., WJ), making it a column vector →V of size J + 1. The vector →V now contains all the parameters of the output neuron as the variables Wj and B to be calculated.

To find the values of the unknown variables, the system of equations F · →V = →E is solved numerically. Since no two vectors →pj are a multiple of each other, the vectors →fj are linearly independent; therefore the system has a unique solution.

Thus, the parameters →W and B are calculated, which guarantees that the hard-point interpolation requirement is fulfilled by the RBFNN. The internal structure of the system of equations F · →V = →E is the following:

[ →f1ᵀ  1 ]   [ W1 ]   [ e1 ]
[ →f2ᵀ  1 ]   [ W2 ]   [ e2 ]
[  ...  ...] · [ ...] = [ ...]
[ →fJᵀ  1 ]   [ WJ ]   [ eJ ]
              [ B  ]

3.2. Feedforward Multilayer Perceptron Neural Network (MLP)

A MLP consists of layers of neurons. There is an input layer, an output layer and, optionally, one or more hidden layers [15,35]. After a network receives its input vector, layer after layer of neurons processes the signal, until the output layer emits an output vector as a response. The neurons in the same layer process the signal in parallel. In the MLP, the signals between neurons always flow from the input layer toward the output layer.

The MLP learns by adjusting its parameters. The parameters are the values of the bias and the weights in its neurons. The learning in a neuron is produced by adjusting the values of the bias b and the weights wj. A single neuron can be applied only to simple tasks. For more complex tasks, multiple neurons are connected into a network. During the training period, the network processes prepared inputs and adjusts its parameters, guided by some learning algorithm, in order to improve its performance. Once the performance is acceptably accurate, the training period is completed. The parameters of the network are then fixed to the learned values, and the network starts its period of application for the intended task.

In a MLP, the input vector into an artificial neuron in the first hidden layer is a vector of numeric values →x = {x1, x2, . . ., xj, . . ., xm}. The neuron receives the vector and perceives each value, or component of the vector, with a particular independent sensitivity called a weight: →w = {w1, w2, . . ., wj, . . ., wm}. Upon receiving the input vector, the neuron first calculates its internal state v, and then its output value y. The internal state v of the neuron is the inner product of the input vector and the weight vector, with an added numerical value called the bias b:

v = →x · →w + b = x1w1 + x2w2 + · · · + xmwm + b

This function is also known as the transfer function. The output of the neuron is a function of its internal state, y = Φ(v). This function is also known as the activation function. The principal task of the activation function is to scale all possible values of the internal state into a desired interval of output values. The most used activation functions are of three types: threshold, piecewise-linear, and non-linear.

In this study a MLP is applied for function approximation. The activation functions for the neurons of the hidden layers and for the output neuron were non-linear and piecewise-linear, respectively, which are defined as follows:

1. Non-linear: Φ(v) = 1 / (1 + e^(−av)) for the interval (0, 1), or Φ(v) = tanh(v) for the interval (−1, 1).
2. Piecewise-linear:
   Φ(v) = −1 for v ≤ v1
   Φ(v) = a1·v + a0 for v1 < v ≤ v2
   Φ(v) = 1 for v2 < v
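The two activation types can be sketched as below. This is an illustrative sketch; the breakpoints v1, v2 and the segment coefficients a1, a0 are free parameters in the definition above, so the values chosen here (a continuous ramp between −1 and +1) are only one possible instantiation:

```python
import math

def logistic(v, a=1.0):
    """Non-linear activation with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a * v))

def piecewise_linear(v, v1=-1.0, v2=1.0):
    """Piecewise-linear activation: -1 below v1, +1 above v2, linear between.
    a1 and a0 are fixed here so the three segments join continuously."""
    if v <= v1:
        return -1.0
    if v > v2:
        return 1.0
    a1 = 2.0 / (v2 - v1)      # rises from -1 at v1 to +1 at v2
    a0 = -1.0 - a1 * v1
    return a1 * v + a0
```

The tanh alternative for the (−1, 1) interval is available directly as `math.tanh`.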

The development effort is considered as a function of one variable related to project size: adjusted function points. The optimized Levenberg–Marquardt algorithm [36] was used to train the network.

3.3. General Regression Neural Network (GRNN)

A GRNN is a variant of a RBFNN. A GRNN consists of one input layer, two hidden layers, and an output layer. The principal advantages of a GRNN are fast learning and convergence to the optimal regression surface as the number of samples becomes very large [38]. The architecture of a GRNN is composed of input neurons that provide all the scaled measurements of the independent variables X to all neurons on the second layer, the pattern neurons. Each pattern neuron can represent one sample input or, when the number of inputs is large, a cluster center representing a subset of related inputs from the sample. When a new vector X is fed into the network, it is subtracted from the stored vector in the pattern neurons. The absolute values of the differences are summed and passed on to an exponential activation function. The pattern neuron outputs are fed into the summation units, which perform a dot product between a weight vector and a vector representing the outputs from the pattern neurons. The summation unit that generates an estimate of f(X)K (where K is a constant determined by the Parzen window used, but which does not need to be computed) sums the outputs of the pattern neurons weighted by the number of observations represented by each cluster center. The summation unit that estimates Y′·f(X)K multiplies each value from a pattern neuron by the sum of the samples Yj associated with the cluster center Xi. The output neuron merely divides Y′·f(X)K by f(X)K to produce the desired prediction of Y.

In the GRNN, a parameter named spread has to be adjusted until a suitable value is obtained. If the value of the spread is small, the GRNN function is very steep, so that the neuron with the weight vector closest to the input will have a much larger output than the other neurons; the GRNN tends to respond with the target vector associated with the nearest input vector. As the spread value becomes larger, the function slope of the GRNN becomes smoother and several neurons can respond to an input vector. The network then acts as if it is taking a weighted average between the target vectors whose input vectors are closest to the new input vector. As the spread value becomes larger, more and more neurons contribute to the average, with the result that the network function becomes smoother [37].
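The ratio of the two summation units described above reduces, for one pattern neuron per sample, to a distance-weighted average of the stored targets. The sketch below is illustrative only: it uses the summed absolute differences (city-block distance) and a simple exp(−d/spread) activation, whereas the exact kernel shaping in a given GRNN implementation may differ:

```python
import numpy as np

def grnn_predict(x, X_train, y_train, spread):
    """GRNN prediction with one pattern neuron per stored sample.

    The two summation units form the ratio Y'.f(X) / f(X), which here is a
    weighted average of the stored targets y_train.
    """
    d = np.abs(X_train - x).sum(axis=1)   # city-block distance to each pattern
    f = np.exp(-d / spread)               # exponential activation
    return (y_train * f).sum() / f.sum()  # numerator unit / denominator unit
```

A small spread makes the prediction snap to the nearest stored target; a very large spread drives it toward the mean of all targets, matching the smoothing behavior described in the text.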

4. Measurement of the size for software projects

Function points (FP) and lines of code (LOC) represent two common software measures for predicting software size. FP have an advantage over LOC in that they are programming-language independent; they are applicable throughout the life cycle; they are more normative (they employ a standard process for calculation that obtains estimates through a formalized set of procedures); more predictive (they are superior to LOC for predicting software size, schedule and development effort because they provide managers with accurate a priori estimates of software size); more prescriptive (they can be used to identify scope creep by repeating predictions at major milestones in the development process); and more timely (they can be computed earlier in the development life cycle) [39].

The data set of projects of this study was obtained from the ISBSG Release 11 [32]. Of the ISBSG projects, 96% reported FP values and only 10% reported LOC values. The FP counting process involves the following concepts [40]:

1. Record element type (RET). A subgroup of data elements within either an internal or an external file.
2. File type referenced (FTR). A file type referenced by a transaction (i.e., add, delete or modify data).
3. Data element type (DET). A unique, user-recognizable, non-recursive (non-repetitive) field.

The process for counting the adjusted function points (AFP) is based on the following five steps, involving nineteen independent variables: ILF, EIF, EI, EO, EQ and fourteen characteristics [40]:

1) Identify data functions and their complexity. There are two types of data: internal and external.
   a) Internal logical file (ILF). A user-identifiable group of logically related data that resides entirely within the application boundary and is maintained through external inputs. Even though it is not a rule, an ILF should have at least one external input.
   b) External interface file (EIF). A user-identifiable group of logically related data that is used for reference purposes only. The data resides entirely outside the application boundary and is maintained by the external inputs of another application. Each EIF must have at least one external output or external interface file. At least one transaction (external input, external output or external inquiry) should include the EIF as an FTR.

   The ILFs and EIFs are rated taking into account their number of RETs and DETs, according to Table A1 of Appendix A of this study.

2) Identify transactional functions and their complexity. There are three transactional function types:
   a) External input (EI). An elementary process of the application that processes data or control information that enters from outside the boundary of the application, such as adding, changing, and deleting transactions. The levels for external inputs are rated taking into account their FTRs and DETs, according to Table A2 of Appendix A of this study.
   b) External output (EO). An elementary process in which derived data passes across the boundary from inside to outside. Derived data occurs when one or more data elements are combined with a formula to generate or derive an additional data element.
   c) External inquiry (EQ). An elementary process with both input and output components that results in data retrieval from one or more ILFs and EIFs. The input process does not update or maintain any FTR (ILF or EIF), and the output side does not contain derived data.
   The levels for external outputs and for external inquiries are rated taking into account their FTRs and DETs, according to Table A3 of Appendix A of this study.
3) Determine the Unadjusted Function Point (UFP) count. The obtained levels for ILF, EIF, EI, EO and EQ are converted to a quantitative value according to the three possible levels (low, average or high) of Table A4 of Appendix A of this study. The final UFP count is obtained by adding all these values up.

4) Determine the value adjustment factor (VAF) from 14 system characteristics. The VAF is obtained from system characteristics that rate the general functionality of the application. The original counting for function points published in 1979 involved ten system characteristics that were selected because they showed strong evidence of their influence on software development productivity [41]. In 1983, those characteristics were expanded from ten to the following fourteen [42]: data communications, distributed data processing, performance, heavily used configuration, transaction rate, on-line data entry, end-user efficiency, on-line update, complex processing, reusability, installation ease, operational ease, multiple sites, and facilitate change. Each characteristic has an associated degree of influence whose range is from 0 to 5: not present, or no influence if present = 0, insignificant influence = 1, moderate influence = 2, average influence = 3, significant influence = 4, strong influence = 5. Each of these degrees of influence has a specific meaning per system characteristic, as shown in Appendix B of this study. The VAF can adjust the UFP by a maximum of ±35%. The VAF is obtained by using the following equation:

VAF = 0.65 + (C1 + C2 + · · · + C14) / 100

where Ci = degree of influence of the i-th characteristic.
5) Calculate the Adjusted Function Point (AFP) count by multiplying the UFP by the VAF.

5. Description of data sets

Data sets of software projects with different characteristics, from different domains and organizations, and with different features, are hard to compare, and results are difficult to generalize [43]. To reduce the variability of projects, the sample used in this study was obtained from an analysis focused on projects having a higher data quality, similar functional sizing methods, the same kind of development platform, the same kind of resource level, and projects whose code was written using programming languages of the same generation. The data set was obtained from the International Software Benchmarking Standards Group (ISBSG) Release 11, which involves data of 5052 projects developed between the years 1989 and 2009. The size of the software projects is measured in adjusted function points (AFP). The effort by project was measured in man-months (one man-month is equal to 152 h [34,44]). Projects whose effort was greater than one man-month were selected.

Table 3
Criteria for ISBSG data set sample (attribute: selected value(s); projects remaining).

Data quality rating: A = the data submitted was assessed as being sound, with nothing identified that might affect its integrity; B = the submission appears fundamentally sound, but there are some factors which could affect the integrity of the submitted data. (4744 projects)
Unadjusted function point rating: A = the unadjusted function point count was assessed as being sound, with nothing identified that might affect its integrity; B = the unadjusted function point count appears sound, but integrity cannot be assured as a single figure was provided. (3184 projects)
Functional sizing methods: IFPUG V4+ and NESMA. (1600 projects)
Development platform: Mainframe. (698 projects)
Resource level: Development team effort. (506 projects)
Language type: 3GL. (355 projects)

The selection criteria were based on the suggested attributes presented in Table 3, taken from the "Selecting a Suitable Data Subset" section of the "Guidelines for use of the ISBSG data" document [32]. A data set of 355 projects was obtained after the criteria were applied to the 5052 projects. In Table 3, IFPUG V4+ and NESMA were selected because pre-IFPUG V4 and post-V4 projects should not be mixed (the sizing changes with that release), and because projects sized using the NESMA standard can be included with IFPUG V4+ projects [32]. Projects developed on Mainframe were selected because this kind of platform had the highest percentage, 39%, of the 4109 projects (only 4109 of the 5052 had the value filled), followed by Multiplatform (33%), PC (18%) and Mid Range (10%). Of the 698 projects counted with the IFPUG V4+ and NESMA methods and developed on Mainframes, the development team effort was selected as the resource level, because the effort of 72% (506 of 698) was reported involving the effort of the project team, project management, and project administration. With regard to the language type criterion, 50 of the 698 mainframe projects had empty values. The 3GL was selected because it had the highest percentage, 74%, of the 648 projects developed on Mainframes, followed by 4GL (17%), Application Generator (8%) and 2GL (1%).

In Table 4, the 355 projects are classified by their type of development: new development, enhancement, and re-development. Only the first two types were selected for this study, because there were only six re-development projects. Projects whose effort was greater than or equal to one man-month were selected. All new projects met this criterion, whereas in the enhanced data set seven projects had an effort lower than one man-month; hence, data sets containing 75 and 267 projects were used for training and testing the four models of this study.

Table 4
Sample classification by type of development from Table 3.

New development: 75
Enhancement: 274
Re-development: 6
Total: 355

In addition to the nineteen independent variables considered in the AFP, other independent variables were also contemplated. However, none of them could be selected, because their fields were not filled for all 75 or all 267 projects of the data sets.
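The sequential filtering in Table 3 can be sketched as below. This is purely illustrative: the field names are hypothetical stand-ins, not the actual ISBSG Release 11 field names:

```python
def select_sample(projects):
    """Apply the Table 3 criteria in sequence.

    `projects` is a list of dicts; all field names below are hypothetical."""
    keep = []
    for p in projects:
        if (p["data_quality_rating"] in ("A", "B")
                and p["ufp_rating"] in ("A", "B")
                and p["sizing_method"] in ("IFPUG V4+", "NESMA")
                and p["development_platform"] == "Mainframe"
                and p["resource_level"] == "Development team effort"
                and p["language_type"] == "3GL"
                and p["effort_man_months"] > 1.0):   # above one man-month
            keep.append(p)
    return keep
```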

The RBFNN, SLR, MLP and GRNN were trained and tested for each data set. Hence, the hypothesis formulated in Section 1 is derived as follows:

H1. The SDEP accuracy for RBFNN is statistically better than the accuracy obtained from a SLR, MLP and GRNN when adjusted function points data, obtained from new software projects, is used as the independent variable.

H2. The SDEP accuracy for RBFNN is statistically better than the accuracy obtained from a SLR, MLP and GRNN when adjusted function points data, obtained from enhanced software projects, is used as the independent variable.

6. Data functional form and accuracy criterion

6.1. Functional form

Based upon the theoretically justifiable functional form, as a major evaluation criterion rather than any accuracy statistic [43], it is first necessary to analyze whether the predicted effort makes sense with the data of its independent variable ("A model chosen for empirical analysis should be consistent with theory" [43]), that is, whether the relationship between effort and adjusted function points (AFP) is in accordance with software development theory. Unfortunately, in most of the literature it is not clear whether the functional form had any influence on the performance of prediction models [13]. A functional form analysis contributes to reducing conclusion instability. The scatter plots of Figs. 2 and 3 show the relationship between AFP and development effort for new (75) and enhanced (267) projects, respectively.

The data in Figs. 2 and 3 show the following characteristics:

1. There are fewer large projects than small projects (skewness).
2. The variability of effort increases with project size (heteroscedasticity).
3. There are extremely large data values (outliers).

When these three characteristics are present, the use of a simple linear regression is inappropriate, unless the data sets are normalized by applying a functional transformation of the data values to make the distribution closer to a normal distribution. Hence, each data set is normalized as suggested in Refs. [12,45]: by taking the natural log (ln). Figs. 4 and 5 show these data transformations.

The values of the coefficients of correlation (r) were equal to 0.82 and 0.75 for new and enhanced projects, respectively, indicating a moderately strong relationship between the independent and dependent variables for the two data sets. Table 5 shows the characteristics of the simple linear regression model (SLR) by data set shown in Figs. 4 and 5. Since the two ANOVA p-values were


Fig. 2. Effort vs. AFP for new development projects. N = 75.

Fig. 3. Effort vs. AFP for enhanced projects. N = 267.

Fig. 4. Normalization of effort and AFP for new development projects. N = 75.

Table 5
Simple linear regression equations by data set (r: coefficient of correlation, r²: coefficient of determination).

Data set                   Sample size   SLR                                  r      r²     ANOVA p-value
New development projects   75            ln(effort) = −1.98 + 0.93 · ln(AFP)  0.82   0.67   0.0000
Enhanced projects          267           ln(effort) = −0.95 + 0.73 · ln(AFP)  0.75   0.56   0.0000
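Because both equations are fitted in log space, a point prediction in man-months requires back-transforming with exp. A sketch using the Table 5 coefficients (note that a naive back-transform of a log-space mean is a common simplification and carries a known retransformation bias):

```python
import math

def predict_effort_new(afp):
    """Table 5 equation for new projects: ln(effort) = -1.98 + 0.93 * ln(AFP),
    back-transformed to man-months."""
    return math.exp(-1.98 + 0.93 * math.log(afp))

def predict_effort_enhanced(afp):
    """Table 5 equation for enhanced projects."""
    return math.exp(-0.95 + 0.73 * math.log(afp))
```

Both slopes are positive, so predicted effort grows with AFP, consistent with the assumption discussed below Table 5.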

Fig. 5. Normalization of effort and AFP for enhanced projects. N = 267.

equal to 0.000, the two equations had a statistically significant relationship between effort and AFP at the 99.0% confidence level. The values of the coefficients of determination (r²) for the two equations show that 67% and 56% of the variation of the effort is explained by the models. The positive sign of the two coefficients of correlation and of the slopes of the two equations of Table 5 meets the following assumption related to software development: the higher the value of adjusted function points, the higher is the development effort.

6.2. Accuracy criteria

Several prediction accuracy statistics have been used, such as the mean magnitude of relative error (MMRE), mean magnitude of error relative to the estimate (MMER), mean of the balanced relative error (MBRE), and mean inverted balanced relative error (MIBRE). Unfortunately, it is not possible to widely recommend a specific accuracy statistic for selecting a specific prediction model, because MMRE, MMER, MBRE and MIBRE select inferior models [43]. In addition, there has not been any study providing evidence that any specific mix of these accuracy statistics is a valid measurement instrument in the context of ranking prediction models; on the contrary, composite accuracy statistics have contributed to increased confusion [43]. Specifically regarding MMRE, in spite of the fact that it leads to asymmetry [17], the MMRE has been used in 89% of the studies where ML techniques have been applied to SDEP [5], and it was used in eight of the ten RBFNN studies identified in this research. On the contrary, the accuracy criterion for evaluating the models of this study is the Absolute Residual (AR). The AR is unbiased, since it is not based on ratios [17]. The AR is defined as follows:

ARi = |actual efforti − predicted efforti|

It is calculated for each observation (software project) i, the effort of which is predicted. The aggregation of the AR over multiple observations (N) can be obtained by its mean (MAR) as follows:

MAR = (1/N) · (AR1 + AR2 + · · · + ARN)

The median of the ARs (MdAR) can also be calculated. The accuracy of a prediction model is inversely proportional to its MAR or MdAR.
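The AR, MAR and MdAR criteria above can be sketched directly (an illustrative sketch in plain Python):

```python
def absolute_residuals(actual, predicted):
    """AR_i = |actual_i - predicted_i| for each project i."""
    return [abs(a, ) if False else abs(a - p) for a, p in zip(actual, predicted)]

def mar(actual, predicted):
    """Mean of the Absolute Residuals (lower means more accurate)."""
    ars = absolute_residuals(actual, predicted)
    return sum(ars) / len(ars)

def mdar(actual, predicted):
    """Median of the Absolute Residuals."""
    ars = sorted(absolute_residuals(actual, predicted))
    n, mid = len(ars), len(ars) // 2
    return ars[mid] if n % 2 else (ars[mid - 1] + ars[mid]) / 2.0
```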

7. Validation method and its application in models

7.1. Validation method

Among the methods most used to evaluate the generalization level of SDEP models are holdout, leave-one-out cross-validation, and k-fold cross-validation (k > 1) [5]. Holdout is done by partitioning the sample into two mutually exclusive subsamples, termed the training and test samples. In the k-fold cross-validation (k > 1) method, the sample is divided into k mutually exclusive subsamples; k − 1 subsamples are used for training, and the k-th is used for testing. This process is repeated k times, each time using a different subsample for testing. When k is equal to the sample size, it is called leave-one-out cross-validation (LOOCV). The accuracy statistic on the test sample is used to evaluate the generalization level of the model.

The potential over-fitting problem of a neural network should be dealt with by applying methods such as cross-validation [5,15,35]. In this study, a LOOCV is used for training and testing the four models. It has specifically been suggested for use in SDEP after being compared with k-fold cross-validation methods: the bias and variance trade-off and the run times for LOOCV did not show a statistically significant difference in relation to 3-fold and 10-fold cross-validation [18].
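The LOOCV procedure described above can be sketched generically (illustrative only; `fit` and `predict` are placeholders for training and applying any of the four models):

```python
def loocv_mar(x, y, fit, predict):
    """Leave-one-out cross-validation: train on all projects but one,
    predict the held-out project, and return the mean absolute residual."""
    residuals = []
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]   # every project except the i-th
        train_y = y[:i] + y[i + 1:]
        model = fit(train_x, train_y)
        residuals.append(abs(y[i] - predict(model, x[i])))
    return sum(residuals) / len(residuals)
```

For a sample of 75 new projects this yields 75 trained models, matching the 75 regression equations reported in Section 7.2.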

7.2. Training and testing models

In accordance with a LOOCV, a total of 75 and 267 simple regression equations were generated for new and enhanced projects, respectively. All of them had a value of the coefficient of determination between 0.54 and 0.68. Moreover, the independent variable (AFP) was statistically significant at the 99% confidence level in all equations for the two kinds of samples.

The MLP, RBFNN and GRNN were trained and tested with their parameters empirically optimized until the best prediction accuracy was obtained. This means that a value lower or higher than the selected number of hidden layers or of neurons per hidden layer for the MLP, or a lower or higher spread value for the RBFNN or GRNN, generated a higher (worse) value of the MAR.

Regarding the MLP, a range of 1–5 hidden layers and a range of 2–40 neurons per hidden layer were explored. A LOOCV was performed for each combination of number of layers and number of neurons per layer. In relation to new projects, the best accuracy result (lowest MAR) was obtained when the number of hidden neurons was equal to 15 in one hidden layer, whereas for enhanced projects, 35 neurons in two hidden layers generated the best accuracy.


Table 6
Best prediction accuracy by model obtained from LOOCV.

Data set   Accuracy criterion   SLR    MLP    GRNN   RBFNN
New        MAR                  0.45   0.32   0.33   0.31
New        MdAR                 0.39   0.29   0.30   0.27
Enhanced   MAR                  0.50   0.42   0.49   0.46
Enhanced   MdAR                 0.42   0.34   0.41   0.37

Table 7. p-Values by model from normality tests.

Data set of projects   Normality test   SLR      MLP      GRNN     RBFNN
New                    Chi-Squared      0.0874   0.0001   0.1387   0.0024
                       Shapiro–Wilk     0.0006   0.0002   0.0120   0.0002
                       Skewness         0.1163   0.1197   0.1761   0.0902
                       Kurtosis         0.6153   0.7231   0.6518   0.8622
Enhanced               Chi-Squared      0.0002   0.0007   0.0097   0.0006
                       Shapiro–Wilk     0.0000   0.0000   0.0000   0.0000
                       Skewness         0.058    0.0406   0.0406   0.0281
                       Kurtosis         0.415    0.5234   0.6626   0.8126

In accordance with the RBFNN, a range of 0.1–10 (in increments of 0.1) for the spread value was explored, and a LOOCV per spread value was executed. The best accuracy was obtained when the spread value was equal to 1.1 and 5 for new and enhanced projects, respectively, whereas for the GRNN the best accuracy was obtained when the spread value was equal to 0.15 and 0.4 for new and enhanced projects, respectively.

The statistical test for comparing the accuracy of the four sets of ARs is selected taking into account the assumptions of dependence, normality and variance of the data [21]:

(a) Dependence: Software project data can be described by n sets of four dimensions (Wi, Xi, Yi, Zi), i = 1, ..., n, where i is the ith project, n is the number of projects, and Wi, Xi, Yi, and Zi are the ARs obtained from the SLR, MLP, GRNN and RBFNN models, respectively.
(b) Normality: The tests for normality by data set of ARs are Chi-Squared, Shapiro–Wilk, skewness, and kurtosis.
(c) Variance: The equality of dispersion or variability in the distributions of the four sets of ARs is tested by using the Levene test.

8. Results

Table 6 shows the best prediction accuracy by model once LOOCV was applied. In this study, each of the four sets of ARs Wi, Xi, Yi, and Zi is obtained from the corresponding project i, and therefore W1, ..., Wn, X1, ..., Xn, Y1, ..., Yn, and Z1, ..., Zn are dependent samples. In relation to normality, Table 7 shows the results of the four normality tests on the four data sets. Since the smallest p-value among the tests performed is less than 0.01 for all samples, it can be rejected with 99% confidence that the ARs come from a normal distribution for any of the four data sets.

As for the variance of the samples shown in Figs. 6 and 7, the p-value of the Levene test was equal to 0.13 and 0.78 for the new and enhanced data sets, respectively; that is, for these two kinds of projects there was not a statistically significant difference among the standard deviations of the four data sets at the 99.0% confidence level.

Since the data sets of ARs in this study number more than two and are dependent, not normally distributed, and of equal variance, the statistical test to be used for comparing the accuracies of the four models should be the Friedman test [19], which tests the null hypothesis that the medians of the models are the same. The p-values obtained after applying the Friedman test to the prediction accuracies of the SLR, MLP, GRNN and RBFNN for new projects are shown in Table 8.

Table 8 shows that the MLP, GRNN and RBFNN models each had a statistically significant difference with the SLR model, the RBFNN having the highest confidence level at 99.0%. Regarding the accuracy comparison among the MLP, GRNN and RBFNN, there was not a statistically significant difference among the medians of the three neural networks at the 99.0% confidence level. This comparison is graphically depicted in the box and whisker plot of Fig. 8.

Fig. 6. Plot of equal standard deviations for new projects.
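The Friedman procedure applied here can be sketched as follows: the four models are ranked within each project and the chi-square statistic is computed from the rank sums. The AR values below are synthetic (not the paper's data), ties are assumed absent for simplicity, and the 11.345 critical value is the chi-square quantile for df = 3 at the 99% level.

```python
import numpy as np

def friedman_statistic(ar_matrix):
    """Friedman chi-square statistic for n projects (rows) and k models
    (columns) of absolute residuals. Assumes no ties within a row."""
    n, k = ar_matrix.shape
    # Rank the k models within each project (1 = smallest AR).
    ranks = ar_matrix.argsort(axis=1).argsort(axis=1) + 1
    rank_sums = ranks.sum(axis=0)
    q = 12.0 / (n * k * (k + 1)) * np.sum(rank_sums ** 2) - 3 * n * (k + 1)
    return float(q)

# Synthetic ARs for four models; the columns are dependent because each
# row comes from the same project. The first (SLR-like) column is shifted
# so that it is clearly worse than the other three.
rng = np.random.default_rng(2)
base = rng.exponential(0.4, size=(75, 1))
ars = np.abs(base + rng.normal(0, 0.05, size=(75, 4)))
ars[:, 0] += 0.15
q = friedman_statistic(ars)

# df = k - 1 = 3; the chi-square critical value at the 99% level is
# 11.345, so q > 11.345 rejects the null hypothesis of equal medians.
significant = bool(q > 11.345)
```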


Fig. 7. Plot of equal standard deviations for enhanced projects.

Table 8. Accuracy comparison between models for new projects.

Contrast     Friedman test p-value   Confidence level (%)
SLR–MLP      0.061                   90
SLR–GRNN     0.082                   90
SLR–RBFNN    0.001                   99

Regarding enhanced projects, there was not a statistically significant difference among the medians of the four models at the 99% confidence level.

9. Discussion, conclusions and future research

Prediction is required in the planning of a software project. It is possible to predict the size, effort, schedule, costs and risks. The development effort of projects has mainly been predicted using techniques such as expert judgment, statistical regressions, and ML techniques. In this research, the application of a ML technique was proposed: a RBFNN, whose prediction accuracy was compared with that of a SLR, a MLP and a GRNN. Two kinds of data sets were used: new and enhanced projects developed in enterprise environments. The training and testing of these four models were done taking into account the six weaknesses identified when ML techniques have been applied to SDEP, as follows:

1. An unbiased accuracy criterion based on absolute residuals (AR) was used for comparing the performance of the four models.

2. A validation method leading to conclusion stability, by removing the random selection of train and test data sets, was used: LOOCV. A study concluded that the LOOCV had a statistically indistinguishable bias and variance, as well as similar run times, with respect to k-fold cross-validation [18].

3. The RBFNN and GRNN contain a parameter named spread, and the MLP involves hidden layers and a number of neurons per hidden layer, all of which influence the accuracy of the result. The processes for empirically finding the optimal spread values, as well as the number of hidden layers and number of neurons per hidden layer, were described.

4. SDEP bases its predictions on historical data, which means that the validation method should divide the data into the following two samples: (1) training data that the prediction model can learn from, and (2) test data that is used to assess predictive accuracy [18]. The RBFNN, SLR, MLP and GRNN were trained and tested on mutually exclusive data sets by using the LOOCV.

5. A validity analysis for assumptions such as dependence of samples, equal standard deviations, and normality was done. This analysis allowed selecting the suitable statistical test for comparing the accuracies of the four models. In this study, a Friedman test was used because (1) there were more than two samples to be compared, (2) the prediction accuracies of the four models were obtained from the same projects, that is, they were dependent, (3) the data resulted not normally distributed, and (4) the four samples had equal standard deviations. If a different combination of these four characteristics had been presented, another statistical test should have been selected: when more than two samples are compared, an ANOVA (normal independent samples) [49] or Kruskal–Wallis (not normally distributed independent samples) [49]; whereas when two samples are compared, a Student's t-test (normal independent samples) [21], a paired t-test (normal dependent samples), a Welch test (normal independent samples with non-equal variances) [46], a Mann–Whitney U test (not normally distributed independent samples) [21], a Yuen test (dependent samples with non-equal variances) [47], or a Wilcoxon test (not normally distributed dependent samples) [48] should be selected.

Fig. 8. Box and whisker plot for new projects.
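The test-selection logic enumerated in point 5 can be sketched as a decision function. This is a simplified rendering of the text's taxonomy (for example, the two-sample unequal-variance cases are folded into one branch), and the function name is illustrative.

```python
def select_test(num_samples, dependent, normal, equal_variances):
    """Pick a statistical test for comparing prediction accuracies,
    following the combinations named in the text (simplified)."""
    if num_samples > 2:
        if dependent and not normal and equal_variances:
            return "Friedman"                    # this study's combination
        if not dependent:
            return "ANOVA" if normal else "Kruskal-Wallis"
    elif num_samples == 2:
        if not equal_variances:
            return "Yuen" if dependent else "Welch"
        if normal:
            return "Paired t-test" if dependent else "Student's t-test"
        return "Wilcoxon" if dependent else "Mann-Whitney U"
    return "unhandled combination"

# Four dependent, non-normal, equal-variance samples -> Friedman.
chosen = select_test(num_samples=4, dependent=True, normal=False,
                     equal_variances=True)
```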

6. The purpose of using statistical techniques was threefold. First, to analyze the data sets for their skewness, heteroscedasticity, and outliers. Second, to confirm the significant independent variable (AFP) for the four models; for this selection a regression and correlation analysis was used, as suggested in Refs. [13,50]. The selection was based on a global analysis of the model by means of its coefficient of determination, as well as on an individual statistical analysis of its parameter. This analysis produced the SLR to which the RBFNN, MLP and GRNN were compared. Third, based on an analysis of the dependence, normality, and variance of the data, a suitable statistical test (Friedman) was selected for concluding that there was a statistically significant difference among the prediction accuracies of the four models.

An additional comparison of the ten studies regarding validation methods for models and size of samples analyzed in Tables 1 and 2 is done as follows:

1. Validation methods: three studies [28–30] used the full data set just for training their model, that is, they implemented neither a validation method nor a testing phase. Two studies [23,31] used the LOOCV method; three of them [22,26,27] used the holdout method. Two studies [25,26] used a k-fold cross-validation (with k = 3 and k = 10), whereas the tenth study [24] used five projects from the same data set that was used for the training (one study [26] used two validation methods). In this research, to mitigate the potential over-fitting problem of the models, the data set of projects was split into samples as suggested in Refs. [15,35]: training and testing data sets, with LOOCV as the validation method.

2. Size of samples: the number of actual projects (excluding the three samples artificially generated [28–30]) used either for training or for testing the models was between 18 and 63 projects. It stands out that in the four studies where a testing phase was implemented [22,24,26,27], and excluding the two which used the LOOCV method [23,31], the number of projects used was only from 5 to 10. This suggests that the generalization of their results is uncertain, although their best MMRE values (Table 1) were between 0.13 and 0.32. In one study, the authors selected a sample of 1538 projects from ISBSG Release 10, using as criterion that if more than those five variables had been selected, then only a few projects would have been obtained, which would not have been sufficient for applying the ML techniques used in that study [25]; that is, their sample did not observe the guidelines of the ISBSG, since it considered neither data quality, the kind of functional sizing method, the development platform, the resource level, nor the programming language type. Based upon the fact that the performance of a neural network is unfavorable when it is trained from a small data set [5], in this research the data sets involved 75 and 267 new and enhanced projects, respectively, from ISBSG Release 11.

Once the RBFNN was trained and tested according to the AR criterion, the following hypotheses were accepted (in all cases when adjusted function points data, obtained from software projects, was used as the independent variable):

New projects:

H1. The SDEP accuracy for RBFNN is statistically better than that of a SLR at the 99% confidence level.

H2. The SDEP accuracy for MLP and GRNN is statistically better than that of a SLR at the 90% confidence level.

H3. The SDEP accuracy for RBFNN is statistically equal to that of a MLP and GRNN at the 99% confidence level.

Enhanced projects:

H4. The SDEP accuracy for RBFNN is statistically equal to that of a SLR, MLP and GRNN at the 99% confidence level.

The results of this research suggest that a RBFNN can be used for the SDEP of new software projects when they have been developed on mainframes and using third-generation programming languages. In comparing the prediction accuracy among the RBFNN, MLP and GRNN, the RBFNN had the highest confidence level with respect to the SLR. A RBFNN has the advantage over a MLP and a GRNN that the procedure for training a RBFNN is faster than the one used for training a MLP [15], whereas a GRNN requires more substantial computation than a RBFNN to evaluate new points [38].
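As a rough illustration of why the spread matters and why a GRNN pays a per-prediction cost, the following is a minimal Specht-style GRNN predictor: a Gaussian-kernel-weighted average of the training outputs, in which every training point is evaluated for each new prediction. The data and spread values are illustrative, not the paper's samples or its tuned spreads.

```python
import numpy as np

def grnn_predict(x_train, y_train, x_new, spread):
    """Minimal GRNN sketch: kernel-weighted average of training outputs.
    The whole training set is visited per prediction, which is the
    computational cost mentioned for evaluating new points."""
    d2 = (x_train - x_new) ** 2
    w = np.exp(-d2 / (2.0 * spread ** 2))     # Gaussian kernel weights
    return float(np.sum(w * y_train) / np.sum(w))

rng = np.random.default_rng(3)
x = rng.uniform(0, 500, 60)                   # AFP-like sizes (synthetic)
y = 30 + 8 * x + rng.normal(0, 40, 60)        # effort-like values

# A small spread localizes the estimate around nearby projects; a very
# large spread flattens the weights toward the global training mean.
pred_small = grnn_predict(x, y, 250.0, spread=0.15 * x.std())
pred_large = grnn_predict(x, y, 250.0, spread=5.0 * x.std())
```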

A limitation of this study could be related to the use of one kind of sample (described in Table 3). Unfortunately, the number of empty fields in the ISBSG data set limited the use of a still more specific sample including characteristics described in fields such as organization type (e.g., banking or public administration), business area type (e.g., banking or insurance), application type (e.g., production system or management information system), architecture (e.g., stand-alone or client-server), development techniques (e.g., data modeling or object-oriented analysis), primary programming language (e.g., COBOL, PL/I, Java or RPG), operating system (e.g., MVS or UNIX), or database system (e.g., DB2, IMS or IDMS). It stands out that the data set for new projects had more fields filled than that of enhanced projects.

Regarding the accuracy comparison between neural networks and other models, the following practices are recommended: (1) use a prediction accuracy criterion which leads to symmetry, (2) use data sets of recently developed projects, or justify the use of old ones, (3) report the criteria for selecting the data set used for training and testing the models, (4) use mutually exclusive subsets for training and testing the models, (5) analyze the skewness, heteroscedasticity, outliers, and functional form of the data set for model generation, (6) report the values of the trained neural network parameters and how these values were obtained (such as the number of hidden layers, number of nodes per hidden layer, or spread values), (7) select a suitable validation method for the model generation, (8) use statistical techniques for selecting the independent variables, for comparing the accuracies of the models (by means of a suitable statistical test selected from an analysis of the dependence, normality and variance of the data), and for reporting the confidence level of the test, and (9) warn about the risk of generalization when small data sets have been used for testing the models.

The following weaknesses related to neural networks should be considered for future studies [5,13]: (1) interpretability of weights or spread values, (2) sensitivity of network architecture, (3) parameter setting, (4) time spent on the process for finding the optimal configuration, (5) propensity to over-fit to the training data, and (6) requirement of plentiful data for training and testing the models.

Future research involves ML models having prediction accuracy statistically better than that of a statistical regression model for enhanced projects.

Acknowledgments

I would like to thank CUCEA of the Universidad de Guadalajara, Jalisco, México, the Programa de Mejoramiento del Profesorado (PROMEP), as well as the Consejo Nacional de Ciencia y Tecnología (Conacyt). In addition, I appreciate the valuable comments of PhD student Ivica Kalichanin-Balich, as well as the support obtained from PhD student Rosa Leonor Ulloa Cazarez.


Appendix A. Unadjusted function points tables

Tables A1–A4


Table A1. Complexity matrix for ILF and EIF.

RET          DET 1–19   DET 20–50   DET greater than 50
1            Low        Low         Average
2–5          Low        Average     High
6 or more    Average    High        High

Table A2. Functional complexity matrix for EI.

FTR             DET 1–4   DET 5–15   DET greater than 15
Less than 2     Low       Low        Average
2               Low       Average    High
Greater than 2  Average   High       High

Table A3. Functional complexity matrix for EO and EQ.

FTR             DET 1–5   DET 6–19   DET greater than 19
Less than 2     Low       Low        Average
2–3             Low       Average    High
Greater than 3  Average   High       High

Table A4. Unadjusted function points.

Components                      Low   Average   High
Internal logical file (ILF)     7     10        15
External interface file (EIF)   5     7         10
External input (EI)             3     4         6
External output (EO)            4     5         7
External inquiry (EQ)           3     4         6
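The weights in Table A4 feed directly into the unadjusted function point count: each component instance is classified Low/Average/High via Tables A1–A3, then weighted and summed. A minimal sketch (the component counts below are hypothetical):

```python
# Weights from Table A4: (component type) -> (complexity level) -> weight.
UFP_WEIGHTS = {
    "ILF": {"Low": 7, "Average": 10, "High": 15},
    "EIF": {"Low": 5, "Average": 7, "High": 10},
    "EI":  {"Low": 3, "Average": 4, "High": 6},
    "EO":  {"Low": 4, "Average": 5, "High": 7},
    "EQ":  {"Low": 3, "Average": 4, "High": 6},
}

def unadjusted_fp(counts):
    """counts maps (component, complexity) -> number of instances,
    e.g. {("EI", "Low"): 4}. Returns the weighted sum (UFP)."""
    return sum(UFP_WEIGHTS[c][lvl] * n for (c, lvl), n in counts.items())

# Hypothetical application: 2*7 + 5*4 + 3*7 + 4*3 = 67 UFP.
ufp = unadjusted_fp({("ILF", "Low"): 2, ("EI", "Average"): 5,
                     ("EO", "High"): 3, ("EQ", "Low"): 4})
```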

Appendix B. Descriptions to determine the influence degree by characteristic

1. Data communications
0 Application is pure batch processing or a standalone PC
1 Application is batch but has remote data entry or remote printing
2 Application is batch but has remote data entry and remote printing
3 Application includes online data collection or TP (teleprocessing) front end to a batch process or query system
4 Application is more than a front-end, but supports only one type of TP communications protocol
5 Application is more than a front-end, and supports more than one type of TP communications protocol

2. Distributed data processing
0 Application does not aid the transfer of data or processing function between components of the system
1 Application prepares data for end user processing on another component of the system, such as PC spreadsheets and PC DBMS
2 Data is prepared for transfer, then is transferred and processed on another component of the system (not for end-user processing)
3 Distributed processing and data transfer are online and in one direction only
4 Distributed processing and data transfer are online and in both directions
5 Processing functions are dynamically performed on the most appropriate component of the system


3. Performance
0 No special performance requirements were stated by the user
1 Performance and design requirements were stated and reviewed, but no special actions were required
2 Response time or throughput is critical during peak hours. No special design for CPU utilization was required. Processing deadline is for the next business day
3 Response time or throughput is critical during all business hours. No special design for CPU utilization was required. Processing deadline requirements with interfacing systems are constraining
4 In addition, stated user performance requirements are stringent enough to require performance analysis tasks in the design phase
5 In addition, performance analysis tools were used in the design, development, and/or implementation phases to meet the stated user performance requirements

4. Heavily used configuration
0 No explicit or implicit operational restrictions are included
1 Operational restrictions do exist, but are less restrictive than a typical application. No special effort is needed to meet the restrictions
2 Some security or timing considerations are included
3 Specific processor requirements for a specific piece of the application are included
4 Stated operation restrictions require special constraints on the application in the central processor or a dedicated processor
5 In addition, there are special constraints on the application in the distributed components of the system

5. Transaction rate
0 No peak transaction period is anticipated
1 Peak transaction period (e.g., monthly, quarterly, seasonally, annually) is anticipated
2 Weekly peak transaction period is anticipated
3 Daily peak transaction period is anticipated
4 High transaction rate(s) stated by the user in the application requirements or service level agreements are high enough to require performance analysis tasks in the design phase
5 High transaction rate(s) stated by the user in the application requirements or service level agreements are high enough to require performance analysis tasks and, in addition, require the use of performance analysis tools in the design, development, and/or installation phases

6. On-line data entry
0 All transactions are processed in batch mode
1 1–7% of transactions are interactive data entry
2 8–15% of transactions are interactive data entry
3 16–23% of transactions are interactive data entry
4 24–30% of transactions are interactive data entry
5 More than 30% of transactions are interactive data entry

7. End-user efficiency
The online functions provided emphasize a design for end-user efficiency. The design includes:

• Navigational aids (for example, function keys, jumps, dynamically generated menus)
• Menus
• Online help and documents
• Automated cursor movement
• Scrolling
• Remote printing (via online transactions)
• Preassigned function keys
• Batch jobs submitted from online transactions
• Cursor selection of screen data
• Heavy use of reverse video, highlighting, colors, underlining, and other indicators
• Hard copy user documentation of online transactions
• Mouse interface
• Pop-up windows
• As few screens as possible to accomplish a business function
• Bilingual support (supports two languages; count as four items)
• Multilingual support (supports more than two languages; count as six items)


0 None of the above
1 One to three of the above
2 Four to five of the above
3 Six or more of the above, but there are no specific user requirements related to efficiency
4 Six or more of the above, and stated requirements for end-user efficiency are strong enough to require design tasks for human factors to be included (for example, minimize key strokes, maximize defaults, use of templates)
5 Six or more of the above, and stated requirements for end-user efficiency are strong enough to require use of special tools and processes to demonstrate that the objectives have been achieved

8. On-line update
0 None
1 Online update of one to three control files is included. Volume of updating is low and recovery is easy
2 Online update of four or more control files is included. Volume of updating is low and recovery is easy
3 Online update of major internal logical files is included
4 In addition, protection against data loss is essential and has been specially designed and programmed in the system
5 In addition, high volumes bring cost considerations into the recovery process. Highly automated recovery procedures with minimum operator intervention are included

9. Complex processing
Complex processing is a characteristic of the application. The following components are present:

• Sensitive control (for example, special audit processing) and/or application specific security processing
• Extensive logical processing
• Extensive mathematical processing
• Much exception processing resulting in incomplete transactions that must be processed again; for example, incomplete ATM transactions caused by TP interruption, missing data values, or failed edits
• Complex processing to handle multiple input/output possibilities; for example, multimedia, or device independence

0 None of the above
1 Any one of the above
2 Any two of the above
3 Any three of the above
4 Any four of the above
5 All five of the above

10. Code reusability
0 No reusable code
1 Reusable code is used within the application
2 Less than 10% of the application considered more than one user's needs
3 Ten percent (10%) or more of the application considered more than one user's needs
4 The application was specifically packaged and/or documented to ease re-use, and the application is customized by the user at source code level
5 The application was specifically packaged and/or documented to ease re-use, and the application is customized for use by means of user parameter maintenance

11. Installation ease
0 No special considerations were stated by the user, and no special setup is required for installation
1 No special considerations were stated by the user, but special setup is required for installation
2 Conversion and installation requirements were stated by the user, and conversion and installation guides were provided and tested. The impact of conversion on the project is not considered to be important
3 Conversion and installation requirements were stated by the user, and conversion and installation guides were provided and tested. The impact of conversion on the project is considered to be important
4 In addition to 2 above, automated conversion and installation tools were provided and tested
5 In addition to 3 above, automated conversion and installation tools were provided and tested


12. Operational ease
0 No special operational considerations other than the normal back-up procedures were stated by the user
1–4 One, some, or all of the following items apply to the application. Select all that apply. Each item has a point value of one, except as noted otherwise:
  Effective start-up, back-up, and recovery processes were provided, but operator intervention is required.
  Effective start-up, back-up, and recovery processes were provided, but no operator intervention is required (count as two items).
  The application minimizes the need for tape mounts.
  The application minimizes the need for paper handling.
5 The application is designed for unattended operation. Unattended operation means no operator intervention is required to operate the system other than to start up or shut down the application. Automatic error recovery is a feature of the application

13. Multiple sites
0 User requirements do not require considering the needs of more than one user/installation site
1 Needs of multiple sites were considered in the design, and the application is designed to operate only under identical hardware and software environments
2 Needs of multiple sites were considered in the design, and the application is designed to operate only under similar hardware and/or software environments
3 Needs of multiple sites were considered in the design, and the application is designed to operate under different hardware and/or software environments
4 Documentation and support plan are provided and tested to support the application at multiple sites, and the application is as described by 1 or 2
5 Documentation and support plan are provided and tested to support the application at multiple sites, and the application is as described by 3

14. Facilitate change
The application has been specifically designed, developed, and supported to facilitate change. The following characteristics can apply to the application:

• Flexible query and report facility is provided that can handle simple requests; for example, and/or logic applied to only one internal logical file (count as one item).
• Flexible query and report facility is provided that can handle requests of average complexity; for example, and/or logic applied to more than one internal logical file (count as two items).
• Flexible query and report facility is provided that can handle complex requests; for example, and/or logic combinations on one or more internal logical files (count as three items).
• Business control data is kept in tables that are maintained by the user with online interactive processes, but changes take effect only on the next business day.
• Business control data is kept in tables that are maintained by the user with online interactive processes, and the changes take effect immediately (count as two items).

0 None of the above
1 Any one of the above
2 Any two of the above
3 Any three of the above
4 Any four of the above
5 All five of the above
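In the standard function point procedure, the fourteen influence degrees above (each rated 0–5) are summed into a total degree of influence (TDI) that scales the unadjusted count: AFP = UFP × (0.65 + 0.01 × TDI). A sketch of this adjustment (the function name and inputs are illustrative):

```python
def adjusted_fp(ufp, degrees_of_influence):
    """Standard function-point value adjustment: the 14 general system
    characteristics, each rated 0-5, give a total degree of influence
    (TDI), and AFP = UFP * (0.65 + 0.01 * TDI)."""
    assert len(degrees_of_influence) == 14
    assert all(0 <= d <= 5 for d in degrees_of_influence)
    tdi = sum(degrees_of_influence)
    return ufp * (0.65 + 0.01 * tdi)

# All characteristics rated 0 gives the minimum factor (0.65), all
# rated 5 the maximum (1.35): AFP stays within +/-35% of UFP.
low = adjusted_fp(100, [0] * 14)
high = adjusted_fp(100, [5] * 14)
```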

References

[1] M. Jørgensen, T. Halkjelsvik, The effects of request formats on judgment-based

comparison between neural networks and statistical regression4), http://dx.doi.org/10.1016/j.asoc.2014.10.033

effort estimation, J. Syst. Softw. 83 (2010) 29–36, http://dx.doi.org/10.1016/j.jss.2009.03.076.

[2] S.S. Berlin, T. Raz, Ch. Glezer, M. Zviran, Comparison of estimation methods ofcost and duration in IT projects, J. Inf. Softw. Technol. 51 (9) (2009) 738–748,http://dx.doi.org/10.1016/j.infsof.2008.09.007.

Page 16: Predictive accuracy comparison between neural networks and statistical regression for development effort of software projects

ING ModelA

1 ft Com

[

[

[

[

[

[

[

[

[

[[

[

[

[

[

[

[

[

[

[

[

[

[

[

[[

[

[

[

[

[

[

[

[

[

[

[

[

[

[49] D. Montgomery, Applied Statistics and Probability for Engineers, third ed., John

ARTICLESOC-2592; No. of Pages 16

6 C. López-Martín / Applied So

[3] S. Yeong-Seok, B. Doo-Hwan, J. Ross, AREION: software effort estimationbased on multiple regressions with adaptive recursive data partitioning,Inf. Softw. Technol. 55 (10) (2013) 1710–1725, http://dx.doi.org/10.1016/j.infsof.2013.03.007.

[4] M.A. Ahmed, I. Ahmad, J.S. AlGhamdi, Probabilistic size proxy for softwareeffort prediction: a framework, Inf. Softw. Technol. 55 (2) (2013) 241–251,http://dx.doi.org/10.1016/j.infsof.2012.08.001.

[5] J. Wen, S. Li, Z. Lin, Y. Hu, Ch. Huang, Systematic literature review ofmachine learning based software development effort estimation models,Inf. Softw. Technol. 54 (2012) (2012) 41–59, http://dx.doi.org/10.1016/j.infsof.2011.09.002.

[6] T. Halkjelsvik, M. Jørgensen, From origami to software development: a reviewof studies on judgment-based predictions of performance time, Psychol. Bull.138 (2) (2012) 238–271, http://dx.doi.org/10.1037/a0025996.

[7] Y. Yang, Z. He, K. Mao, Q. Li, V. Nguyen, B. Boehm, R. Valerdi, Analyzingand handling local bias for calibrating parametric cost estimation mod-els, Inf. Softw. Technol. 55 (8) (2013) 1496–1511, http://dx.doi.org/10.1016/j.infsof.2013.03.002.

[8] C. Lopez-Martin, A. Abran, Applying expert judgment to improve an indi-vidual’s ability to predict software development effort, Int. J. Softw. Eng.Knowl. Eng. (IJSEKE) 22 (4) (2012) 467–483, http://dx.doi.org/10.1142/S0218194012500118.

[9] A. Chavoya, C. Lopez-Martin, R. Andalon, M.E. Meda, Genetic program-ming as alternative for predicting development effort of individual softwareprojects, PLOS ONE 7 (11) (2012) e50531, http://dx.doi.org/10.1371/journal.pone.0050531.

10] H. Park, S. Baek, An empirical validation of a neural network model forsoftware effort estimation, J. Expert Syst. Appl. 35 (3) (2008) 929–937,http://dx.doi.org/10.1016/j.eswa.2007.08.001.

11] I. González-Carrasco, R. Colomo-Palacios, J.L. López-Cuadrado, F.J. García-Penalvo, SEffEst: effort estimation in software projects using fuzzy logicand neural networks, Int. J. Comput. Intell. Syst. 5 (4) (2012) 679–699,http://dx.doi.org/10.1080/18756891.2012.718118.

12] A.B. Nassif, D. Ho, L.F. Capretz, Towards an early software estimation using log-linear regression and a multilayer perceptron model, J. Syst. Softw. 86 (2013)144–160, http://dx.doi.org/10.1016/j.jss.2012.07.050.

13] M. Paliwal, U.A. Kumar, Neural networks and statistical techniques: a reviewof applications, J. Expert Syst. Appl. 36 (2009) 2–17, http://dx.doi.org/10.1016/j.eswa.2007.10.005.

14] H. Chao-Jung, H. Chin-Yu, Comparison of weighted grey relational analy-sis for software effort estimation, Softw. Qual. J. 19 (1) (2011) 165–200,http://dx.doi.org/10.1007/s11219-010-9110-y.

15] M.Ch. Bishop, Neural Networks for Pattern Recognition, Oxford UniversityPress, 1995.

16] D. Montgomery, E. Peck, Introduction to Linear Regression Analysis, John Wiley,2001.

17] M. Shepperd, S. MacDonell, Evaluating prediction systems in software projectestimation, Inf. Softw. Technol. 54 (2012) 820–827, http://dx.doi.org/10.1016/j.infsof.2011.12.008.

18] E. Kocaguneli, T. Menzies, Software effort models should be assessed via leave-one-out validation, J. Syst. Softw. 86 (2013) 1879–1890, http://dx.doi.org/10.1016/j.jss.2013.02.053.

19] W.J. Conover, Practical Nonparametric Statistics, third ed., Wiley, 1999.20] T. Dybå, V.B. Kampenes, D.I.K. Sjøberg, A systematic review of statistical power

in software engineering experiments, J. Inf. Softw. Technol. 48 (8) (2006)745–755, http://dx.doi.org/10.1016/j.infsof.2005.08.009.

21] S.M. Ross, Introduction to Probability and Statistics for Engineers and Scientists,third ed., Elsevier Press, 2004.

[22] A. Heiat, Comparison of artificial neural network and regression models for estimating software development effort, Inf. Softw. Technol. 44 (15) (2002) 911–922, http://dx.doi.org/10.1016/S0950-5849(02)00128-3.

[23] A.L.I. Oliveira, Estimation of software project effort with support vector regression, Neurocomputing 69 (2006) 1749–1753, http://dx.doi.org/10.1016/j.neucom.2005.12.119.

[24] Ch.S. Reddy, P.S. Rao, K. Raju, V.V. Kumari, A new approach for estimating software effort using RBFN network, Int. J. Comput. Sci. Netw. Secur. 8 (7) (2008) 237–241.

[25] J.S. Pahariya, V. Ravi, M. Carr, Software cost estimation using computational intelligence techniques, in: Proceedings of the World Congress on Nature and Biologically Inspired Computing, 2009, pp. 849–854, http://dx.doi.org/10.1109/NABIC.2009.5393534.

[26] K.V. Kumar, V. Ravi, M. Carr, N.R. Kiran, Software development cost estimation using wavelet neural networks, J. Syst. Softw. 81 (2008) 1853–1867, http://dx.doi.org/10.1016/j.jss.2007.12.793.

[27] P.V.G.D. Prasad-Reddy, K.R. Sudha, P. Rama-Sree, S.N.S.V.S.C. Ramesh, Software effort estimation using radial basis and generalized regression neural networks, J. Comput. 2 (5) (2010) 87–92.

[28] A. Idri, A. Abran, S. Mbarki, An experiment on the design of radial basis function neural networks for software cost estimation, in: Proceedings of the 2nd Conference on Information and Communication Technologies, 2006, pp. 1612–1617, http://dx.doi.org/10.1109/ICTTA.2006.1684625.

[29] A. Idri, A. Zahi, E. Mendes, A. Zakrani, Software cost estimation models using radial basis function neural networks, Proc. IWSM-Mensura, LNCS 4895 (2007) 21–31, http://dx.doi.org/10.1007/978-3-540-85553-8_2.

[30] A. Idri, A. Zakrani, A. Zahi, Design of radial basis function neural networks for software effort estimation, Int. J. Comput. Sci. Issues 7 (4) (2010) 11–17.

[31] M. Shin, A.L. Goel, Empirical data modeling in software engineering using radial basis functions, IEEE Trans. Softw. Eng. 26 (6) (2000) 567–576, http://dx.doi.org/10.1109/32.852743.

[32] ISBSG, Guidelines for Use of the ISBSG Data, Release 11, International Software Benchmarking Standards Group, 2009.

[33] M. Fernández-Diego, F. González-Ladrón-de-Guevara, Potential and limitations of the ISBSG dataset in enhancing software engineering research: a mapping review, Inf. Softw. Technol. 56 (6) (2014) 527–544, http://dx.doi.org/10.1016/j.infsof.2014.01.003.

[34] B. Boehm, Software Engineering Economics, Prentice Hall, 1981.

[35] S. Haykin, Neural Networks, a Comprehensive Foundation, second ed., Pearson, 1999.

[36] I. Finschi, An Implementation of the Levenberg-Marquardt Algorithm, Eidgenössische Technische Hochschule Zürich, 1996.

[37] C. López-Martín, C. Isaza, A. Chavoya, Software development effort prediction of industrial projects applying a general regression neural network, Empir. Softw. Eng. 17 (6) (2012) 738–756, http://dx.doi.org/10.1007/s10664-011-9192-6.

[38] D.F. Specht, A general regression neural network, IEEE Trans. Neural Netw. 2 (6) (1991) 568–576, http://dx.doi.org/10.1109/72.97934.

[39] S.D. Sheetz, D. Henderson, L. Wallace, Understanding developer and manager perceptions of function points and source lines of code, J. Syst. Softw. 82 (2009) 1540–1549, http://dx.doi.org/10.1016/j.jss.2009.04.038.

[40] D. Garmus, D. Herron, Measuring the Software Process, a Practical Guide to Functional Measurements, Prentice Hall, 1996.

[41] A. Albrecht, Measuring application development productivity, in: IBM Applications Development Symposium, Monterey, CA, October 14–17, 1979.

[42] A. Albrecht, J.E. Gaffney, Software function, source lines of code, and development effort prediction: a software science validation, IEEE Trans. Softw. Eng. 9 (6) (1983) 639–648, http://dx.doi.org/10.1109/TSE.1983.235271.

[43] I. Myrtveit, E. Stensrud, Validity and reliability of evaluation procedures in comparative studies of effort prediction models, Empir. Softw. Eng. 17 (1–2) (2012) 23–33, http://dx.doi.org/10.1007/s10664-011-9183-7.

[44] B. Boehm, Ch. Abts, A.W. Brown, S. Chulani, B.K. Clark, E. Horowitz, R. Madachy, D. Reifer, B. Steece, COCOMO II, Prentice Hall, 2000.

[45] B. Kitchenham, E. Mendes, Why comparative effort prediction studies may be invalid, in: Proceedings of the 5th International Conference on Predictor Models in Software Engineering, PROMISE, 2009, http://dx.doi.org/10.1145/1540438.1540444.

[46] J.C.F. de Winter, Using the Student's t-test with extremely small sample sizes, Pract. Assess. Res. Eval. 18 (10) (2013) 1–12.

[47] K.K. Yuen, The two-sample trimmed t for unequal population variances, Biometrika 61 (1) (1974) 165–170, http://dx.doi.org/10.2307/2334299.

[48] D.S. Moore, G.P. McCabe, B.A. Craig, Introduction to the Practice of Statistics, sixth ed., W.H. Freeman and Company, 2009.

[49] Wiley, 2003.

[50] B.A. Kitchenham, E. Mendes, G.H. Travassos, Cross versus within-company cost estimation studies: a systematic review, IEEE Trans. Softw. Eng. 33 (5) (2007) 316–329, http://dx.doi.org/10.1109/TSE.2007.1001.