Statistical and Heuristic Model Selection in Regularized Least-Squares
Igor Braga and Maria Carolina Monard
Institute of Mathematics and Computer Science
University of Sao Paulo
Sao Carlos, SP 13566–590
{igorab,mcmonard}@icmc.usp.br
Abstract—The Regularized Least-Squares (RLS) method uses the kernel trick to perform non-linear regression estimation. Its performance depends on the proper selection of a regularization parameter. This model selection task has traditionally been carried out using cross-validation. However, when training data is scarce or noisy, cross-validation may lead to poor model selection performance. In this paper we investigate alternative statistical and heuristic procedures for model selection in RLS that were shown to perform well for other regression methods. Experiments conducted on real datasets show that these alternative model selection procedures are not able to improve performance when cross-validation fails.
Keywords—regularized least-squares, model selection, cross-validation, penalization methods, metric-based methods
I. INTRODUCTION
The widespread adoption of kernel methods [1] in machine learning made its way to the least-squares "world", resulting in the so-called Regularized Least-Squares (RLS) method [2], [3]. RLS is a state-of-the-art method for non-linear regression estimation. Moreover, RLS is nowadays also considered an alternative to Support Vector Machines [4] in some classification tasks [3].
In a nutshell, the regularized least-squares method works as follows. Given a sequence of training data (x_1, y_1), ..., (x_n, y_n) drawn i.i.d. from a fixed but unknown probability distribution P(x, y), where x ∈ R^d and y ∈ R (regression) or y ∈ {+1, −1} (classification), this method yields a function f_γ(x) such that

f_γ = argmin_{f∈H} [ (1/n) Σ_{i=1}^{n} (y_i − f(x_i))² + γ ‖f‖²_H ],   (1)
where γ > 0 is a real-valued regularization parameter and H is a Reproducing Kernel Hilbert Space (RKHS) with norm ‖·‖_H. A function f for which ‖f‖_H is bounded satisfies some smoothness properties, hence the use of the term "regularized" to name the method.
In order to successfully apply RLS, that is, to use f_γ(x) to predict the output y of an unseen x, we must find an f_γ that 1) fits the training sequence well (i.e., minimizes the squared loss) and 2) is a reasonably smooth function (i.e., minimizes the norm ‖·‖_H). As one can always minimize the former at the expense of the latter, and vice versa, proper selection of the parameter γ of RLS is indispensable for generalization.
Formally, the best value of γ is the value γ* that yields in (1) a function f_{γ*} minimizing the risk of prediction error as measured by the expected squared loss, that is:

γ* = argmin_{γ>0} ∫ (y − f_γ(x))² dP(x, y).   (2)
The estimation of γ* belongs to the category of statistical problems known as model selection. In contrast with the related category of model assessment, model selection does not require estimating the value of the prediction error. It suffices to indicate the function with the smallest prediction error among a set of candidate functions f_{γ_1}, f_{γ_2}, ..., f_{γ_N}.
In practice, the selection of the regularization parameter γ has traditionally been carried out using cross-validation, which does not necessarily lead to good results when the data is scarce or noisy. Thus, finding model selection procedures that are safer than cross-validation is an important research goal. In this paper we investigate statistical and heuristic procedures for model selection in RLS which were shown to perform well for other regression methods and, to the best of our knowledge, have not been applied in this context before.
The remainder of this paper is organized as follows. Section II shows how the minimization problem in (1) is solved for a fixed value of the parameter γ. Sections III and IV describe the statistical and heuristic procedures used in this work to perform model selection. Section V reports the experimental evaluation of the model selection procedures considered. Conclusions and indications of future work are presented in Section VI.
II. SOLUTION OF THE REGULARIZED LEAST-SQUARES MINIMIZATION PROBLEM
This section reviews standard material and introduces some notation used afterwards. To start with, note that RLS requires the choice of a symmetric, positive definite kernel function k : R^d × R^d → R that spans the set of functions H under consideration. An example of such a function is the well-known Gaussian RBF kernel. Selecting a proper kernel function for a specific task is considered a model selection problem on its own, and the reader is encouraged to consult the related literature. From now on, we assume that a kernel function k(x', x) has already been chosen, including its potential parameters.
2013 Brazilian Conference on Intelligent Systems
978-0-7695-5092-3/13 $26.00 © 2013 IEEE
DOI 10.1109/BRACIS.2013.46
By the representer theorem [5], the minimizer in (1) has the form

f(x) = Σ_{i=1}^{n} α_i k(x_i, x),  α_i ∈ R.   (3)

Hereafter, we denote by y the n-dimensional vector [y_1, ..., y_n]^T, by K the n × n matrix with entries k_ij = k(x_i, x_j), and by α the n-dimensional vector [α_1, ..., α_n]^T.
Plugging (3) into (1) yields the following expression for calculating the squared loss:

(1/n) Σ_{i=1}^{n} (y_i − f(x_i))² = (1/n) α^T K K α − (2/n) α^T K y + const.   (4)
Moreover, by considering the special properties of the norm in an RKHS, we have that ‖f‖²_H = α^T K α. Ignoring the constant term in (4), we arrive at the following quadratic minimization problem for (1):
α_γ = argmin_{α∈R^n} [ (1/n) α^T K K α − (2/n) α^T K y + γ α^T K α ].   (5)
A necessary and sufficient condition on α_γ is obtained by taking the derivative of the expression minimized in (5) with respect to each α_i and equating it to zero. By doing that, we arrive at the following system of linear equations:

(2/n) K K α_γ − (2/n) K y + 2γ K α_γ = 0.   (6)

Denoting by I the n × n identity matrix, dividing (6) by 2/n, absorbing the resulting factor n into γ, and solving for α_γ, we arrive at the solution of the minimization problem in (5):

α_γ = (K + γI)^{-1} y.
Most model selection procedures require the calculation of α_γ for several different values of γ. In order to avoid solving one system of linear equations for each new γ, one can take advantage of the spectral decomposition of the kernel matrix, K = U Σ U^T, where U is the n × n matrix formed by the eigenvectors of K and Σ is the n × n diagonal matrix containing the eigenvalues σ_i of K. Denoting by Λ_γ the n × n diagonal matrix with entries λ_ii = 1/(σ_i + γ), α_γ can be calculated by performing only matrix multiplications:

α_γ = U Λ_γ U^T y.   (7)
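As an illustration, the reuse of a single eigendecomposition across many values of γ can be sketched in Python with NumPy (the function name rls_alphas is ours, not from the paper):

```python
import numpy as np

def rls_alphas(K, y, gammas):
    """Solve alpha_gamma = (K + gamma I)^{-1} y for many values of gamma,
    reusing one eigendecomposition of the (symmetric PSD) kernel matrix K."""
    sigma, U = np.linalg.eigh(K)          # K = U diag(sigma) U^T
    Uty = U.T @ y                         # computed once, shared by all gammas
    # For each gamma: alpha = U diag(1/(sigma + gamma)) U^T y
    return [U @ (Uty / (sigma + g)) for g in gammas]
```

After the O(n³) decomposition, each additional γ costs only matrix-vector products, which is what makes trying the 50 candidate values of Section V cheap.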
III. STATISTICAL MODEL SELECTION
In this section we focus on statistical procedures for model selection. These methods can be used to bound or estimate the prediction error of a function f(x):

R(f) = ∫ (y − f(x))² dP(x, y).   (8)
Even though these statistical procedures can be used to perform model assessment, recall that only their model selection capabilities are of interest in this paper.
A. Leave-One-Out Cross-Validation
Cross-validation (CV) is one of the earliest model selection/assessment techniques to appear in machine learning (cf. [6]), and it is still one of the most frequently used. The idea is to allow the same dataset to be used for both learning and evaluation.
Here we focus on the special case of leave-one-out cross-validation, which in its j-th iteration uses all training examples except the j-th one to learn a function f^j(x), and then evaluates the prediction error of f^j(x) on the j-th example. After iterating over all n training examples, the following estimate of the prediction error of the learning algorithm is calculated:

R_loo(f) = (1/n) Σ_{j=1}^{n} (y_j − f^j(x_j))².   (9)
Under the assumption that the functions f^j(x) are not very different from the function f(x) learned using all training data (i.e., the learning algorithm is stable), R_loo(f) is an "almost" unbiased estimate of R(f) [7]. This way, among a set of candidate functions, the leave-one-out procedure selects the function which minimizes R_loo(f).
In general, the major drawback of leave-one-out CV is the requirement to call the learning algorithm as many times as the number of training examples. Fortunately, the special structure of the regularized least-squares method allows the calculation of R_loo(f_γ) practically for free once the solution α_γ has been found using all training data with the expression in (7).
The following derivations are carried out when the j-th example is held out. Let f^j_γ be the function obtained by solving the expression in (7) using the n − 1 remaining examples:

f^j_γ(x) = Σ_{i=1}^{n} α^j_{γ,i} k(x_i, x).   (10)
The following proposition regarding f^j_γ is true [3].

Proposition. Let f^j_γ be the minimizer of the RLS problem

min_{f∈H} [ (1/n) Σ_{i=1, i≠j}^{n} (y_i − f(x_i))² + γ ‖f‖²_H ].

Then, f^j_γ is also the minimizer of the RLS problem

min_{f∈H} [ (1/n) Σ_{i=1, i≠j}^{n} (y_i − f(x_i))² + (1/n) (f^j_γ(x_j) − f(x_j))² + γ ‖f‖²_H ].   (11)
Using (11) and denoting by y^j the n-dimensional vector [y^j_1, ..., y^j_n]^T in which

y^j_i = y_i if i ≠ j,  and  y^j_j = f^j_γ(x_j),

the coefficients α^j_γ in (10) would be calculated by solving

α^j_γ = (K + γI)^{-1} y^j.
However, y^j is not fully specified, because it depends on knowledge of f^j_γ (and thereby of α^j_γ). To escape this circular reasoning, let f_γ be the function obtained from (7) using all training examples. Denoting G = (K + γI), the following chain of equalities holds:

f^j_γ(x_j) − f_γ(x_j) = Σ_{i=1}^{n} (K G^{-1})_{ji} (y^j_i − y_i) = (K G^{-1})_{jj} (f^j_γ(x_j) − y_j).   (12)
Solving for f^j_γ(x_j) in (12) and letting h_jj = (K G^{-1})_{jj}, the following expression for f^j_γ(x_j) is obtained:

f^j_γ(x_j) = (f_γ(x_j) − h_jj y_j) / (1 − h_jj).   (13)
Denoting by g^{-1}_{ij} the entries of G^{-1}, and after some algebra in (13), we arrive at the following expression for the difference between the target value y_j and the value predicted by the leave-one-out function f^j_γ(x_j):

y_j − f^j_γ(x_j) = α_{γ,j} / g^{-1}_{jj}.
Plugging this last expression into (9), we arrive at the leave-one-out estimator for the parameter γ in RLS:

γ*_loo = argmin_{γ>0} (1/n) Σ_{j=1}^{n} ( α_{γ,j} / g^{-1}_{jj} )².
Note that when α_γ is computed via the spectral decomposition in (7), G^{-1} = U Λ_γ U^T is readily available, so the diagonal values g^{-1}_{jj} come at practically no extra cost.
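A minimal sketch of this shortcut in Python, using the smoother-based identity that follows from (13) (the leave-one-out residual is the full-data residual divided by 1 − h_jj); the function name rls_loo_error is ours:

```python
import numpy as np

def rls_loo_error(K, y, gamma):
    """Leave-one-out squared error of RLS at regularization gamma,
    computed in closed form from the full-data solution (no refitting).
    By (13), the LOO residual is (y_j - f_gamma(x_j)) / (1 - h_jj),
    where h_jj are the diagonal entries of K (K + gamma I)^{-1}."""
    n = len(y)
    G_inv = np.linalg.inv(K + gamma * np.eye(n))
    alpha = G_inv @ y                     # full-data coefficients
    h = np.diag(K @ G_inv)                # "hat" values h_jj
    residuals = (y - K @ alpha) / (1.0 - h)
    return np.mean(residuals ** 2)
```

This reproduces exactly what n separate refits would give, at the cost of a single solve.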
B. Complexity Penalization
Another direction of research in statistical model selection is based on the idea of bounding the prediction error in (8). Suppose that a learning algorithm selects from a set of functions F a function f which minimizes the empirical prediction error over F:

R_emp(f) = (1/n) Σ_{i=1}^{n} (y_i − f(x_i))².
Moreover, suppose there exists a quantity P(n, p) which measures the degree of complexity (or capacity) of the set of functions F. This function P(n, p) depends on the size of the training data n and on a complexity parameter p. For instance, in polynomial regression, p would be set to the degree of the polynomial used to fit the data.
In what follows, we consider bounds of the form

R(f) ≤ R_emp(f) P(n, p).

This expression captures the idea that the more complex the set of functions F is, the less we can trust the empirical prediction error R_emp(f).
In the 1970s, several attempts were made at defining P(n, p), resulting in model selection procedures that were proven to work asymptotically (n → ∞). In Table I, the most

Table I. CLASSICAL COMPLEXITY PENALIZATION

Name                                     P(n, p)
Final Prediction Error (FPE) [8]         (1 + p/n) / (1 − p/n)
Generalized Cross-Validation (GCV) [9]   1 / (1 − p/n)²
Schwartz Criterion (SC) [10]             1 + (ln n / 2) (p/n) / (1 − p/n)
Shibata's Model Selector (SMS) [11]      1 + 2 p/n
common definitions are listed¹. The parameter p, present in all of the definitions, is meant to quantify the degrees of freedom of the set of functions F from which f is selected. We will come back to p at the end of this section.
In the 1990s, a non-asymptotic expression for P(n, p) was derived based on results of VC-theory [12]. The general form of this expression is²:

P(n, p) = [ 1 − c √( (p (1 + ln(an/p)) − ln η) / n ) ]_+^{−1},   (14)
where p corresponds to the VC-dimension of the set of functions F and c, a, η are constants that need to be set to reflect the particularities of the learning method being used. In [13], the assignment c = a = 1 and η = 1/√n was shown to perform well in a polynomial regression task.
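A small Python sketch of the VC penalization factor, assuming it enters the bound as a multiplicative factor of at least 1 (the inverted (·)_+ form of (14)) and defaulting to the assignment c = a = 1, η = 1/√n from [13]; the function name vc_penalty is ours:

```python
import numpy as np

def vc_penalty(n, p, c=1.0, a=1.0, eta=None):
    """VC-based penalization factor of eq. (14).  Returns np.inf when the
    (.)_+ term reaches zero, i.e. when the bound becomes vacuous."""
    if eta is None:
        eta = 1.0 / np.sqrt(n)            # default from [13]
    inner = (p * (1.0 + np.log(a * n / p)) - np.log(eta)) / n
    denom = 1.0 - c * np.sqrt(inner)
    return 1.0 / denom if denom > 0 else np.inf
```

Note how the factor blows up as p approaches n, which is what penalizes overly complex function sets.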
The Effective Number of Parameters in RLS: One of the tenets of VC-theory is that the VC-dimension is the characterization of the complexity (capacity) of a set of functions F. However, in the case of the sets of functions in RLS, it is not known how to efficiently calculate their VC-dimension. When F is a set of functions linear in their parameters, which is the case of RLS, the heuristic method of calculating the effective number of parameters [14] can be used to approximate the VC-dimension.
Let σ_1 > σ_2 > ... > σ_n be the eigenvalues of the kernel matrix K used for solving the RLS problem in (7). The value

p_γ = Σ_{i=1}^{n} σ_i / (σ_i + γ),  0 < p_γ ≤ n,

quantifies the effective degrees of freedom of the set of functions from which the solution f_γ is selected [12]. Using p = p_γ in any of the expressions for P(n, p) shown in this section, we arrive at a complexity penalization estimator for the parameter γ in RLS:

γ*_cp = argmin_{γ>0} R_emp(f_γ) P(n, p_γ).
¹Despite its name, the GCV criterion is not a cross-validation procedure, but rather an approximation to the leave-one-out estimator presented in (9).
²(a)_+ = max(0, a)
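As a sketch of how the pieces fit together, the following Python fragment selects γ by minimizing R_emp(f_γ) P(n, p_γ) with the FPE penalty from Table I and the effective degrees of freedom p_γ; the function name select_gamma_fpe is ours:

```python
import numpy as np

def select_gamma_fpe(K, y, gammas):
    """Complexity-penalization model selection for RLS with the FPE
    penalty P(n, p) = (1 + p/n) / (1 - p/n) and effective degrees of
    freedom p_gamma = sum_i sigma_i / (sigma_i + gamma)."""
    n = len(y)
    sigma, U = np.linalg.eigh(K)          # one decomposition for all gammas
    Uty = U.T @ y
    best_gamma, best_score = None, np.inf
    for g in gammas:
        alpha = U @ (Uty / (sigma + g))   # alpha_gamma, as in (7)
        r_emp = np.mean((y - K @ alpha) ** 2)
        p = np.sum(sigma / (sigma + g))   # effective dof, 0 < p < n
        penalty = (1 + p / n) / (1 - p / n)
        if r_emp * penalty < best_score:
            best_gamma, best_score = g, r_emp * penalty
    return best_gamma
```

Swapping in any other row of Table I, or the VC expression, only changes the `penalty` line.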
IV. HEURISTIC MODEL SELECTION
In this section we present two metric-based heuristic procedures that have been found to perform well in certain model selection tasks. It is not clear how these procedures relate to our goal in (2). Hence, their performance needs to be evaluated experimentally to get an idea of how promising they are for RLS model selection.
Both procedures, called TRI and ADJ, use geometric ideas and unlabeled data to perform model selection [15], [16]. In particular, the ADJ heuristic was found to be a state-of-the-art model selection procedure in polynomial regression [16], [17].
A. TRI
Suppose the true regression function f_reg(x) is known and belongs to a metric space. Then, given any two functions f_1(x) and f_2(x) in the same metric space and a distance (metric) d(·,·), the triangle inequality applies:

d(f_1, f_reg) + d(f_2, f_reg) ≥ d(f_1, f_2).
Given a training sequence (x_1, y_1), ..., (x_n, y_n) and a large set of unlabeled data (x*_1, ..., x*_m) which comes from the same distribution as the training sequence, we can approximately verify the validity of the triangle inequality using

d_tr(f_1, y) + d_tr(f_2, y) ≥ d_un(f_1, f_2),

where d_tr(f, y) = √(R_emp(f)) is the training-set distance of the function f(x) and d_un(f_1, f_2) is the disagreement of the functions f_1 and f_2 on the unlabeled data:

d_un(f_1, f_2) = √( (1/m) Σ_{i=1}^{m} (f_1(x*_i) − f_2(x*_i))² ).
Now, let us use RLS to obtain two functions f_{γ1}(x) and f_{γ2}(x) such that γ_1 > γ_2. While f_{γ2}(x) will potentially have a smaller empirical error than f_{γ1}(x), there is a risk that its empirical error is a worse estimate of d(f_2, f_reg), and it may happen that

d_tr(f_{γ1}, y) + d_tr(f_{γ2}, y) < d_un(f_{γ1}, f_{γ2}).
In this situation, the TRI heuristic blames the function f_{γ2} for breaking the triangle inequality. This gives rise to the TRI procedure for model selection: given a sequence of functions f_{γ1}, f_{γ2}, ... such that γ_1 > γ_2 > ..., choose the last function in the sequence that does not violate the triangle inequality with any preceding function.
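The TRI rule above can be sketched as follows, given precomputed training-set distances d_tr(f_i, y) and pairwise unlabeled disagreements; the function name tri_select is ours:

```python
def tri_select(d_tr, D_un):
    """TRI model selection.  Candidates are ordered by decreasing gamma;
    d_tr[i] = sqrt(Remp(f_i)) and D_un[i][j] = d_un(f_i, f_j).
    Returns the index of the *last* function satisfying the triangle
    test d_tr[i] + d_tr[j] >= D_un[i][j] against every preceding one."""
    chosen = 0
    for j in range(1, len(d_tr)):
        if all(d_tr[i] + d_tr[j] >= D_un[i][j] for i in range(j)):
            chosen = j          # keep scanning: TRI picks the last survivor
    return chosen
```

Because the candidates are ordered from most to least regularized, "last survivor" means the most flexible function that the unlabeled data does not yet flag as overfitting.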
B. ADJ
This other heuristic is based on the assumption that, since the training and unlabeled sets come from the same distribution, the following relation should be observed:

d_un(f_1, f_reg) / d_un(f_1, f_2) ≈ d_tr(f_1, f_reg) / d_tr(f_1, f_2).   (15)
If the functions f_1 and f_2 are given along with the training and unlabeled sets, the values d_tr(f_1, f_2) and d_un(f_1, f_2) can be readily calculated. The value d_tr(f_1, f_reg) can be estimated using d_tr(f_1, y). The only unknown in (15) is then d_un(f_1, f_reg), which can be estimated as

d_un(f_1, f_reg) = d_tr(f_1, y) · d_un(f_1, f_2) / d_tr(f_1, f_2).
This means that, if the assumption in (15) is a good one, the prediction error of function f_1 on unseen data, which is a good approximation to the true prediction error, can be estimated from its empirical error on training data and the ratio d_un(f_1, f_2)/d_tr(f_1, f_2). This gives rise to the ADJ procedure, which for model selection in RLS reads: given a sequence of functions f_{γ1}, f_{γ2}, ... such that γ_1 > γ_2 > ..., first multiply d_tr(f_{γi}, y) by the largest observed ratio d_un(f_{γi}, f_{γj})/d_tr(f_{γi}, f_{γj}), i < j, and then choose the function f_{γi} in the sequence that has the smallest value of this product.
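A sketch of the ADJ rule under one reading of the index convention (each candidate is adjusted by the largest ratio against the functions preceding it in the sequence, as in Schuurmans's formulation [15], [16]); the function name adj_select is ours:

```python
def adj_select(d_tr_y, D_tr, D_un):
    """ADJ model selection.  Candidates are ordered by decreasing gamma;
    d_tr_y[k] = d_tr(f_k, y), D_tr[i][j] = d_tr(f_i, f_j) and
    D_un[i][j] = d_un(f_i, f_j).  Each candidate's training error is
    inflated by the largest unlabeled/training distance ratio against
    the preceding candidates; the smallest adjusted value wins."""
    adjusted = [d_tr_y[0]]                  # nothing precedes the first one
    for k in range(1, len(d_tr_y)):
        ratio = max(D_un[i][k] / D_tr[i][k] for i in range(k))
        adjusted.append(d_tr_y[k] * ratio)
    return min(range(len(adjusted)), key=adjusted.__getitem__)
```

Intuitively, a complex function that disagrees with simpler ones far more on unlabeled data than on training data gets its training error inflated accordingly.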
V. EXPERIMENTS
The experiments reported in this section aim at answering the following questions regarding model selection in regularized least-squares:
Q1: The complexity penalization methods described in Section III-B were evaluated mainly for polynomial regression on artificial datasets, where they were found to be comparable to cross-validation [13]. Does this result hold for RLS on real datasets?
Q2: Can the constants of the VC expression for P(n, p) in (14) given in [13] be improved for model selection in RLS?
Q3: The heuristic approach ADJ was demonstrated to give state-of-the-art results in polynomial regression, even outperforming cross-validation, while TRI has been less successful than ADJ [16], [17]. Do these results hold for RLS on real datasets?
A. Experimental Setup
In all experiments, we use the linear splines kernel with an infinite number of knots [12]. Denoting by x^k the k-th coordinate of vector x, and by ∧ the minimum operator, it is calculated as

k(x_i, x_j) = Π_{k=1}^{d} [ 1 + x^k_i x^k_j + (1/2) |x^k_i − x^k_j| (x^k_i ∧ x^k_j)² + (1/3) (x^k_i ∧ x^k_j)³ ].

This kernel can be used to perform non-linear regression and has the advantage of having no free parameters.
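A direct transcription of this kernel in Python, vectorized over the d coordinates (spline_kernel is our name):

```python
import numpy as np

def spline_kernel(xi, xj):
    """Linear splines kernel with an infinite number of knots [12]:
    the product over coordinates of
    1 + x_i x_j + (1/2)|x_i - x_j| min(x_i, x_j)^2 + (1/3) min(x_i, x_j)^3."""
    m = np.minimum(xi, xj)                       # coordinate-wise x_i ^ x_j
    terms = (1.0 + xi * xj
             + 0.5 * np.abs(xi - xj) * m ** 2
             + m ** 3 / 3.0)
    return np.prod(terms)
```

Both `xi` and `xj` are length-d NumPy arrays; the function is symmetric in its arguments, as a kernel must be.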
We use 10 regression datasets from two repositories: UCI³ and Keel⁴ [18]. A detailed description of the datasets can be found on their repository websites. A brief description can be found in Table II.
In what follows we describe a single trial of our experiments for one dataset. Two-thirds of the examples were randomly held out to form a test set (x'_1, y'_1), ..., (x'_ℓ, y'_ℓ). From the remaining one-third, n examples are randomly selected to compose the training set (x_1, y_1), ..., (x_n, y_n). The examples in the one-third not selected for the training set are used, without their labels, as the unlabeled set (x*_1, ..., x*_m) for both the TRI and ADJ methods.
³http://archive.ics.uci.edu/ml/
⁴http://sci2s.ugr.es/keel/datasets.php
Table II. DATASETS USED IN THE EXPERIMENTS
# DATASET INSTANCES FEATURES
1 Abalone 4177 8
2 Compactiv 8192 21
3 Concrete-str 1030 8
4 Friedman 1200 5
5 Mortgage 1049 15
6 Stock 950 9
7 Treasury 1049 15
8 Wankara 321 9
9 Wine-red 1599 11
10 Wine-white 4898 11
The training set thus created is used along with a fixed value of γ to obtain a function f_γ(x) in accordance with (7). The root mean squared error (RMSE) of f_γ(x) on the test set,

rmse(f_γ) = √( (1/ℓ) Σ_{i=1}^{ℓ} (f_γ(x'_i) − y'_i)² ),

is then recorded for evaluation purposes.
In order to try reasonable values of γ, the largest eigenvalue σ_1 of the kernel matrix K is taken as a reference. Using the 50 values v_1, ..., v_50 equally spaced on a logarithmic scale in the range [10⁻⁶, 10¹], we try the following 50 values: γ_1 = v_1 σ_1, ..., γ_50 = v_50 σ_1.
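The construction of the candidate values of γ can be sketched as (gamma_grid is our name):

```python
import numpy as np

def gamma_grid(K, num=50):
    """Candidate regularization values: num points log-spaced in
    [1e-6, 1e1], scaled by the largest eigenvalue of the (PSD) kernel
    matrix K so that the grid adapts to the scale of the problem."""
    sigma1 = np.linalg.eigvalsh(K)[-1]    # eigvalsh returns ascending order
    return sigma1 * np.logspace(-6, 1, num)
```

Scaling by σ_1 keeps the grid meaningful across datasets whose kernel matrices differ widely in magnitude.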
Thus, after obtaining f_{γ1}, ..., f_{γ50} and calculating their respective RMSE on the test set, we find the function f_{γ*} with the minimum RMSE, which serves as the gold standard for the model selection experiments described ahead.
We implemented and ran the following model selection procedures. Cross-validation: leave-one-out CV (LOO) and 5-fold CV (5CV); complexity penalization: Final Prediction Error (FPE), Schwarz Criterion (SC), Generalized Cross-Validation (GCV), Shibata's Model Selector (SMS), and the VC expression (VC1); and metric-based: TRI and ADJ. Each model selection procedure had access only to the training set, with the exception of TRI and ADJ, which had access to the unlabeled set as well. According to its method, each model selection procedure picked one function f_chosen among f_{γ1}, ..., f_{γ50}.
We evaluate the quality of each procedure by the ratio

r = rmse(f_chosen) / rmse(f_{γ*}).

That is, the closer r is to 1, the better the model selection performance.
So far, this description refers to a single trial. For a fixed size n of the training set, we conducted 20 such trials and report the mean and standard deviation for each model selection procedure. This experiment was carried out for n = 20 and n = 100 on each dataset. The results can be found in Table III.
In order to address question Q2, the constants c and a of the VC expression in (14) were experimentally optimized using only one of the datasets (randomly selected). The resulting constants, c = 0.74 and a = 1.35, were used as another model selection procedure with the VC expression (VC2).
B. Results and Discussion
In Table III, the best result for each dataset and n is highlighted in bold, and the second best result (provided its ratio is less than 1.15) is underlined. Taking into account all the results, the following distribution of (<best>, <2nd best>) results can be observed: LOO (12,5); 5CV (14,4); FPE (2,6); SC (1,9); GCV (1,3); SMS (4,2); VC1 (0,1); VC2 (4,2); TRI (4,2); and ADJ (4,2). According to this criterion, it is clear that LOO and 5CV are the best model selection procedures overall. However, observe that for the Abalone dataset the cross-validation procedures did not perform as well as they did for the other datasets.
In what follows we answer the three questions posed at the beginning of this section.
Q1: Considering the complexity penalization procedures (except VC2, which will be treated later), it can be observed that FPE and SC show similar results, which are in general better than those of GCV, SMS and VC1. Bearing in mind the overall results of these five complexity penalization methods on the 10 real-world datasets, they do not seem to be comparable with cross-validation, as was found in [13] for polynomial regression on artificial datasets.
Q2: Comparing the results obtained by VC1 and VC2 (the latter with c = 0.74 and a = 1.35), better results are obtained by VC2 in 17 of the 20 cases (85%) considered. Moreover, VC1 shows three catastrophic performances (i.e., ratio > 2.0), on the Concrete-str, Stock and Wankara datasets with n = 20. Thus, we can conclude that there is room for improving the constants of the VC expression for P(n, p) given in [13].
Q3: Comparing the results obtained by the two heuristic metric-based approaches, TRI and ADJ, there is no pattern of one heuristic outperforming the other. Thus, these results do not show, as claimed for polynomial regression [16], [17], that ADJ outperforms TRI. Comparing TRI and ADJ with LOO, worse results are obtained by TRI in 15 (75%) and by ADJ in 17 (85%) of the 20 cases considered. Furthermore, when compared with 5CV, both TRI and ADJ obtained worse results in 16 (80%) of the 20 cases considered. These results show that the heuristic metric-based approaches outperform neither the LOO nor the 5CV cross-validation procedure.
VI. CONCLUSION
In this paper we investigated alternative statistical and heuristic model selection procedures for selecting the parameter γ in the regularized least-squares method. These procedures, six of them based on complexity penalization (FPE, SC, GCV, SMS, VC1 and VC2) and two of them based on the geometry of metric spaces (TRI and ADJ), were experimentally evaluated on real datasets and compared to traditional cross-validation procedures (LOO and 5CV). The results show
Table III. EXPERIMENTAL RESULTS FOR STATISTICAL AND HEURISTIC MODEL SELECTION
# n CROSS-VALIDATION COMPLEXITY PENALIZATION METRIC-BASED
LOO 5CV FPE SC GCV SMS VC1 VC2 TRI ADJ
1    20  1.20 (0.39) 1.15 (0.27) 1.32 (0.74) 1.34 (0.77) 1.09 (0.09) 1.86 (0.75) 1.18 (0.14) 1.86 (0.75) 1.30 (0.21) 1.82 (0.74)
    100  1.01 (0.01) 1.01 (0.01) 1.02 (0.01) 1.03 (0.02) 1.02 (0.02) 1.24 (0.16) 1.22 (0.04) 1.13 (0.08) 1.34 (0.11) 1.15 (0.19)
2    20  1.00 (0.00) 1.00 (0.00) 1.14 (0.01) 1.14 (0.01) 1.14 (0.01) 1.00 (0.00) 1.14 (0.01) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
    100  1.00 (0.00) 1.00 (0.00) 1.54 (0.03) 1.54 (0.03) 1.54 (0.03) 1.00 (0.00) 1.54 (0.03) 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
3    20  1.01 (0.01) 1.00 (0.01) 1.03 (0.04) 1.02 (0.03) 1.37 (0.34) 1.04 (0.13) 2.43 (0.37) 1.04 (0.13) 1.01 (0.02) 1.04 (0.13)
    100  1.02 (0.02) 1.02 (0.02) 1.11 (0.12) 1.12 (0.11) 1.16 (0.08) 1.20 (0.22) 1.47 (0.18) 1.20 (0.22) 1.06 (0.07) 1.14 (0.12)
4    20  1.11 (0.11) 1.11 (0.13) 1.13 (0.15) 1.11 (0.14) 1.21 (0.18) 1.12 (0.12) 1.58 (0.27) 1.12 (0.12) 1.05 (0.05) 1.12 (0.12)
    100  1.01 (0.02) 1.02 (0.03) 1.08 (0.04) 1.10 (0.04) 1.16 (0.06) 1.21 (0.10) 1.80 (0.14) 1.21 (0.10) 1.07 (0.06) 1.17 (0.10)
5    20  1.04 (0.07) 1.02 (0.03) 1.02 (0.03) 1.02 (0.02) 1.17 (0.17) 1.01 (0.02) 1.53 (0.37) 1.01 (0.02) 1.08 (0.18) 1.01 (0.02)
    100  1.03 (0.04) 1.07 (0.13) 1.02 (0.04) 1.02 (0.04) 1.08 (0.09) 1.02 (0.04) 1.20 (0.15) 1.02 (0.04) 1.73 (1.03) 1.02 (0.04)
6    20  1.04 (0.06) 1.03 (0.05) 1.06 (0.13) 1.06 (0.13) 1.32 (0.32) 1.10 (0.16) 2.09 (0.40) 1.10 (0.16) 1.08 (0.09) 1.10 (0.16)
    100  1.07 (0.14) 1.06 (0.14) 1.06 (0.11) 1.06 (0.10) 1.07 (0.06) 1.13 (0.14) 1.27 (0.12) 1.13 (0.14) 1.05 (0.06) 1.11 (0.10)
7    20  1.05 (0.06) 1.05 (0.06) 1.07 (0.08) 1.06 (0.07) 1.13 (0.12) 1.06 (0.11) 1.58 (0.43) 1.06 (0.11) 1.18 (0.26) 1.06 (0.11)
    100  1.03 (0.06) 1.03 (0.03) 1.06 (0.05) 1.06 (0.06) 1.11 (0.10) 1.10 (0.07) 1.23 (0.14) 1.10 (0.07) 1.32 (0.38) 1.09 (0.08)
8    20  1.01 (0.02) 1.01 (0.02) 1.04 (0.04) 1.03 (0.04) 1.30 (0.17) 1.02 (0.03) 2.04 (0.33) 1.02 (0.03) 1.10 (0.13) 1.02 (0.03)
    100  1.03 (0.03) 1.03 (0.03) 1.03 (0.03) 1.04 (0.04) 1.09 (0.06) 1.14 (0.09) 1.23 (0.11) 1.14 (0.09) 1.48 (0.61) 1.14 (0.09)
9    20  1.04 (0.05) 1.04 (0.06) 1.08 (0.07) 1.06 (0.06) 1.13 (0.08) 1.16 (0.11) 1.33 (0.14) 1.16 (0.11) 1.09 (0.08) 1.16 (0.11)
    100  1.01 (0.01) 1.01 (0.01) 1.18 (0.18) 1.18 (0.18) 1.06 (0.02) 1.42 (0.13) 1.20 (0.06) 1.42 (0.13) 1.26 (0.07) 1.42 (0.13)
10   20  1.04 (0.09) 1.02 (0.02) 1.05 (0.03) 1.04 (0.03) 1.08 (0.04) 1.17 (0.14) 1.21 (0.10) 1.17 (0.14) 1.11 (0.09) 1.17 (0.14)
    100  1.02 (0.04) 1.02 (0.04) 1.04 (0.03) 1.05 (0.03) 1.05 (0.03) 1.54 (0.14) 1.16 (0.05) 1.54 (0.14) 1.35 (0.08) 1.54 (0.14)
AVG  20  1.05 (0.09) 1.05 (0.06) 1.09 (0.13) 1.09 (0.13) 1.19 (0.15) 1.15 (0.16) 1.61 (0.26) 1.15 (0.16) 1.10 (0.11) 1.15 (0.16)
    100  1.02 (0.04) 1.03 (0.04) 1.11 (0.06) 1.12 (0.06) 1.13 (0.05) 1.20 (0.11) 1.33 (0.10) 1.19 (0.10) 1.26 (0.25) 1.18 (0.10)
that the considered procedures often perform worse than cross-validation. This outcome is different from previous findings inmodel selection for other regression methods.
Our results confirmed that cross-validation may perform poorly when data is scarce. Unfortunately, the best performing alternative methods (FPE and SC) do not lead to safer model selection when cross-validation fails. Among the investigated procedures, the only one that seems amenable to tweaking is the VC expression. In future work, we intend to investigate whether there is an assignment of constants that may render the VC expression practical. We also plan to investigate combinations of leave-one-out CV and the alternative procedures considered in this paper.
ACKNOWLEDGMENTS
This work is supported by grant #2009/17773-7, Sao Paulo Research Foundation (FAPESP). The authors would like to thank the anonymous reviewers for their helpful comments.
REFERENCES
[1] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[2] G. Wahba, Spline Models for Observational Data, ser. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), 1990.
[3] R. M. Rifkin, "Everything old is new again: a fresh look at historical approaches in machine learning," Ph.D. dissertation, MIT Sloan School of Management, 2006. [Online]. Available: http://dspace.mit.edu/bitstream/handle/1721.1/17549/51896466.pdf?sequence=1
[4] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273–297, 1995.
[5] G. Kimeldorf and G. Wahba, "A correspondence between Bayesian estimation on stochastic processes and smoothing by splines," Annals of Mathematical Statistics, vol. 41, no. 2, pp. 495–502, 1970.
[6] A. Luntz and V. Brailovsky, "On estimation of characters obtained in statistical procedure of recognition," Technicheskaya Kibernetica, vol. 3, 1969.
[7] A. Elisseeff and M. Pontil, "Leave-one-out error and stability of learning algorithms with applications," in Learning Theory and Practice. IOS Press, 2002.
[8] H. Akaike, "Statistical predictor identification," Annals of the Institute of Statistical Mathematics, vol. 22, no. 1, pp. 203–217, 1970.
[9] G. Wahba and P. Craven, "Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation," Numerische Mathematik, vol. 31, pp. 377–404, 1979.
[10] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[11] R. Shibata, "An optimal selection of regression variables," Biometrika, vol. 68, no. 1, pp. 45–54, 1981.
[12] V. Vapnik, Statistical Learning Theory. Wiley-Interscience, 1998.
[13] V. Cherkassky, X. Shao, F. Mulier, and V. Vapnik, "Model complexity control for regression using VC generalization bounds," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1075–1089, 1999.
[14] T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, ser. CRC Monographs on Statistics and Applied Probability. Chapman and Hall, 1990.
[15] D. Schuurmans, "A new metric-based approach to model selection," in AAAI '97: Proceedings of the Fourteenth National Conference on Artificial Intelligence, 1997, pp. 552–558.
[16] D. Schuurmans, F. Southey, D. Wilkinson, and Y. Guo, Metric-Based Approaches for Semi-Supervised Regression and Classification, ser. Adaptive Computation and Machine Learning. MIT Press, 2006, pp. 421–451.
[17] O. Chapelle, V. Vapnik, and Y. Bengio, "Model selection for small sample regression," Machine Learning, vol. 48, no. 1–3, pp. 9–23, 2002.
[18] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, "KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2–3, pp. 255–287, 2011.