

Statistical and Heuristic Model Selection in Regularized Least-Squares

Igor Braga and Maria Carolina Monard
Institute of Mathematics and Computer Science

University of Sao Paulo

Sao Carlos, SP 13566–590

{igorab,mcmonard}@icmc.usp.br

Abstract—The Regularized Least-Squares (RLS) method uses the kernel trick to perform non-linear regression estimation. Its performance depends on the proper selection of a regularization parameter. This model selection task has been traditionally carried out using cross-validation. However, when training data is scarce or noisy, cross-validation may lead to poor model selection performance. In this paper we investigate alternative statistical and heuristic procedures for model selection in RLS that were shown to perform well for other regression methods. Experiments conducted on real datasets show that these alternative model selection procedures are not able to improve performance when cross-validation fails.

Keywords—regularized least-squares, model selection, cross-validation, penalization methods, metric-based methods

I. INTRODUCTION

The widespread adoption of kernel methods [1] in machine learning made its way to the least-squares "world", resulting in the so-called Regularized Least-Squares (RLS) method [2], [3]. RLS is a state-of-the-art method for non-linear regression estimation. Moreover, RLS is also considered nowadays as an alternative to Support Vector Machines [4] in some classification tasks [3].

In a nutshell, the regularized least-squares method works as follows. Given a sequence of training data (x_1, y_1), ..., (x_n, y_n) drawn i.i.d. from a fixed but unknown probability distribution P(x, y), where x ∈ R^d, y ∈ R (regression), or y ∈ {+1, −1} (classification), this method yields a function f_γ(x) so that

f_\gamma = \arg\min_{f \in \mathcal{H}} \left[ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 + \gamma \, \| f \|_{\mathcal{H}}^2 \right], \qquad (1)

where γ > 0 is a real-valued regularization parameter and H is a Reproducing Kernel Hilbert Space (RKHS) with norm ‖·‖_H. A function f for which ‖f‖_H is bounded satisfies some smoothness properties, hence the use of the term "regularized" to name the method.

In order to successfully apply RLS, that is, to use f_γ(x) to predict the output y of unseen x, we must find an f_γ that 1) fits the training sequence well (i.e., minimizes the squared loss) and 2) is a reasonably smooth function (i.e., minimizes the norm ‖·‖_H). As one can always minimize the former at the expense of the latter, and vice-versa, proper selection of the parameter γ of RLS is indispensable for generalization.

Formally, the best value of γ is the value γ* that yields in (1) a function f_{γ*} that minimizes the risk of the prediction error as measured by the expected squared loss, that is:

\gamma^{*} = \arg\min_{\gamma > 0} \int \left( y - f_\gamma(x) \right)^2 dP(x, y). \qquad (2)

The estimation of γ* belongs to the category of statistical problems known as model selection. In contrast with the related category of model assessment, model selection does not require the estimation of the value of the prediction error. It suffices to indicate the function with the smallest prediction error among a set of candidate functions f_{γ_1}, f_{γ_2}, ..., f_{γ_N}.

In practice, the selection of the regularization parameter γ has traditionally been carried out using cross-validation, which does not necessarily lead to good results when the data is scarce or noisy. Thus, finding model selection procedures that are safer than cross-validation is an important research goal. In this paper we investigate statistical and heuristic procedures for model selection in RLS which were shown to perform well for other regression methods and, to the best of our knowledge, have not been applied in this context before.

The remainder of this paper is organized as follows. Section II shows how the minimization problem in (1) is solved for a fixed value of the parameter γ. Sections III and IV describe the statistical and heuristic procedures used in this work to perform model selection. Section V reports the experimental evaluation of the model selection procedures considered. Conclusions and indications of future work are presented in Section VI.

II. SOLUTION OF THE REGULARIZED LEAST-SQUARES MINIMIZATION PROBLEM

The content in this section is informational and also introduces some notation used afterwards. To start with, note that RLS requires the choice of a symmetric, positive definite kernel function k : (R^d × R^d) → R that spans the set of functions H under consideration. An example of such a function is the well-known Gaussian RBF kernel. Selecting a proper kernel function for a specific task is considered a model selection problem on its own, and the reader is encouraged to consult the related literature. From now on, we assume that a kernel function k(x′, x) has already been chosen, including its potential parameters.


By the representer theorem [5], the minimizer in (1) has the form

f(x) = \sum_{i=1}^{n} \alpha_i \, k(x_i, x), \quad \alpha_i \in \mathbb{R}. \qquad (3)

Hereafter, we denote by y the n-dimensional vector [y_1, ..., y_n]⊤ and by K the n × n matrix with entries k_ij = k(x_i, x_j). We also denote by α the n-dimensional vector [α_1, ..., α_n]⊤.

Plugging (3) into (1) yields the following expression for calculating the squared loss:

\frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 = \frac{1}{n} \alpha^{\top} K K \alpha - \frac{2}{n} \alpha^{\top} K y + \text{const}. \qquad (4)

Moreover, by considering the special properties of the norm in an RKHS, we have that ‖f‖²_H = α⊤Kα. Ignoring the constant term in (4), we arrive at the following quadratic minimization problem for (1):

\alpha_\gamma = \arg\min_{\alpha \in \mathbb{R}^n} \left[ \frac{1}{n} \alpha^{\top} K K \alpha - \frac{2}{n} \alpha^{\top} K y + \gamma \, \alpha^{\top} K \alpha \right]. \qquad (5)

A necessary and sufficient condition on α_γ is obtained by taking the derivative of the expression to be minimized in (5) with respect to each α_i and equating it to zero. By doing that, we arrive at the following system of linear equations:

\frac{2}{n} K K \alpha_\gamma - \frac{2}{n} K y + \gamma K \alpha_\gamma = 0. \qquad (6)

Denoting by I the n × n identity matrix, embedding the factor 2/n into γ, and solving for α_γ in (6), we arrive at the solution of the minimization problem in (5):

\alpha_\gamma = (K + \gamma I)^{-1} y.

Most model selection procedures require the calculation of α_γ for several different values of γ. In order to avoid solving one system of linear equations for each new γ, one can take advantage of the spectral decomposition of the kernel matrix: K = UΣU⊤, where U is the n × n matrix formed by the eigenvectors of K and Σ is the n × n diagonal matrix containing the eigenvalues σ_i of K. Denoting by Λ_γ the n × n diagonal matrix with entries λ_ii = 1/(σ_i + γ), α_γ can be calculated by performing only matrix multiplications:

\alpha_\gamma = U \Lambda_\gamma U^{\top} y. \qquad (7)
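To make the reuse of the spectral decomposition concrete, the following sketch (ours, not from the paper) computes α_γ for a whole grid of candidate values after a single eigendecomposition of K; the inputs K, y and gammas are assumed to be given.

import numpy as np

# K: n x n symmetric positive semi-definite kernel matrix, y: n-vector of targets,
# gammas: iterable of candidate regularization values.
def rls_solutions(K, y, gammas):
    sigma, U = np.linalg.eigh(K)        # K = U diag(sigma) U^T
    Uty = U.T @ y                       # computed once and reused for every gamma
    # alpha_gamma = U Lambda_gamma U^T y, with (Lambda_gamma)_ii = 1 / (sigma_i + gamma)
    return [U @ (Uty / (sigma + gamma)) for gamma in gammas]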

III. STATISTICAL MODEL SELECTION

In this section we focus on statistical procedures for model selection. These methods can be used to bound or estimate the value of the prediction error of a function f(x):

R(f) = \int \left( y - f(x) \right)^2 dP(x, y). \qquad (8)

Even though these statistical procedures can be used to perform model assessment, recall that only their model selection capabilities are of interest in this paper.

A. Leave-One-Out Cross-Validation

Cross-validation (CV) is one of the earliest model selection/assessment techniques that appeared in machine learning (cf. [6]), and it is still one of the most frequently used. The idea is to allow the same dataset to be used for learning and evaluation.

Here we focus on the special case of leave-one-out cross-validation, which in its j-th iteration uses all training examples except the j-th one to learn a function f^j(x), and after that evaluates the prediction error of f^j(x) on the j-th example. After iterating over all n training examples, the following estimate of the prediction error of the learning algorithm is calculated:

R_{\text{loo}}(f) = \frac{1}{n} \sum_{j=1}^{n} \left( y_j - f^{j}(x_j) \right)^2. \qquad (9)

Under the assumption that the functions f^j(x) are not very different from the function f(x) learned using all training data (i.e., the learning algorithm is stable), R_loo(f) is an "almost" unbiased estimate of R(f) [7]. This way, among a set of candidate functions, the leave-one-out procedure selects the function which minimizes R_loo(f).

In general, the major drawback of leave-one-out CV is the requirement to call the learning algorithm as many times as the number of training examples. Fortunately, the special structure of the regularized least-squares method allows the calculation of R_loo(f_γ) practically for free after finding the solution α_γ using all training data with the expression in (7).

The following derivations are carried out when the j-th example is held out. Let f^j_γ be the function obtained by solving the expression in (7) using the n − 1 remaining examples:

f^{j}_{\gamma}(x) = \sum_{i=1}^{n} \alpha^{j}_{\gamma,i} \, k(x_i, x). \qquad (10)

The following proposition regarding f^j_γ is true [3].

Proposition. Let f^j_γ be the minimizer of the RLS problem

\min_{f \in \mathcal{H}} \left[ \frac{1}{n} \sum_{i=1, i \neq j}^{n} \left( y_i - f(x_i) \right)^2 + \gamma \, \| f \|_{\mathcal{H}}^2 \right].

Then, f^j_γ is also the minimizer of the RLS problem

\min_{f \in \mathcal{H}} \left[ \frac{1}{n} \sum_{i=1, i \neq j}^{n} \left( y_i - f(x_i) \right)^2 + \frac{1}{n} \left( f^{j}_{\gamma}(x_j) - f(x_j) \right)^2 + \gamma \, \| f \|_{\mathcal{H}}^2 \right]. \qquad (11)

Using (11) and denoting by y^j the n-dimensional vector [y^j_1, ..., y^j_n]⊤ in which

y^{j}_{i} = \begin{cases} y_i & i \neq j \\ f^{j}_{\gamma}(x_j) & i = j \end{cases},

the coefficients α^j_γ in (10) would be calculated by solving

\alpha^{j}_{\gamma} = (K + \gamma I)^{-1} y^{j}.


However, y^j is not fully specified because it depends on the knowledge of f^j_γ (and thereby of α^j_γ). To escape from this circular reasoning, let f_γ(x_j) be the function obtained from (7) using all training examples. Denoting G = (K + γI), the following chain of equalities is true:

f^{j}_{\gamma}(x_j) - f_{\gamma}(x_j) = \sum_{i=1}^{n} (K G^{-1})_{ji} \left( y^{j}_{i} - y_i \right) = (K G^{-1})_{jj} \left( f^{j}_{\gamma}(x_j) - y_j \right). \qquad (12)

Solving for f^j_γ(x_j) in (12) and letting h_jj = (KG^{-1})_{jj}, the following expression for f^j_γ(x_j) is obtained:

f^{j}_{\gamma}(x_j) = \frac{f_{\gamma}(x_j) - h_{jj}\, y_j}{1 - h_{jj}}. \qquad (13)

Denoting by g^{-1}_{ij} the entries of G^{-1}, and after some algebra in (13), we arrive at the following expression for the difference between the target value y_j and the value predicted by the leave-one-out function f^j_γ(x_j):

y_j - f^{j}_{\gamma}(x_j) = \frac{\alpha_{j,\gamma}}{g^{-1}_{jj}}.

Plugging this last expression into (9), we arrive at the leave-one-out estimator for the parameter γ in RLS:

\gamma^{*}_{\text{loo}} = \arg\min_{\gamma > 0} \; \frac{1}{n} \sum_{j=1}^{n} \left( \frac{\alpha_{j,\gamma}}{g^{-1}_{jj}} \right)^{2}.

Note that, besides solving for α_γ in (7) using all training data, the values g^{-1}_{jj} need to be computed in advance.
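A minimal sketch (ours) of this closed-form selection, assuming K, y and a grid of candidate γ values are given; it reuses the eigendecomposition of K to obtain both α_γ and the diagonal entries g^{-1}_{jj} for each γ:

import numpy as np

def loo_select_gamma(K, y, gammas):
    sigma, U = np.linalg.eigh(K)
    Uty = U.T @ y
    best_gamma, best_score = None, np.inf
    for gamma in gammas:
        inv_eig = 1.0 / (sigma + gamma)                      # eigenvalues of G^{-1} = (K + gamma I)^{-1}
        alpha = U @ (inv_eig * Uty)                          # expression (7)
        g_inv_diag = np.einsum('ji,i,ji->j', U, inv_eig, U)  # diagonal entries g^{-1}_{jj}
        score = np.mean((alpha / g_inv_diag) ** 2)           # R_loo(f_gamma) as in (9)
        if score < best_score:
            best_gamma, best_score = gamma, score
    return best_gamma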

B. Complexity Penalization

Another direction of research in statistical model selection is based on the idea of bounding the prediction error in (8). Let us suppose that a learning algorithm selects from a set of functions F a function f which minimizes the empirical prediction error over F:

R_{\text{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2.

Moreover, suppose there exists a quantity P(n, p) which measures the degree of complexity (or capacity) of the set of functions F. This function P(n, p) depends on the size of the training data n and on a parameter of complexity p. For instance, in polynomial regression, p would be set to the degree of the polynomial used to fit a set of data.

In what follows, we consider bounds of the form

R(f) \leq R_{\text{emp}}(f)\, P(n, p).

This expression captures the idea that the more complex a set of functions F is, the less we can trust the empirical prediction error R_emp(f).

In the 1970s, several attempts were made at defining P(n, p), resulting in model selection procedures that were proven to work asymptotically (n → ∞). The most common definitions are listed in Table I.¹

Table I. CLASSICAL COMPLEXITY PENALIZATION

Name                                      P(n, p)
Final Prediction Error (FPE) [8]          (1 + p/n) / (1 − p/n)
Generalized Cross-Validation (GCV) [9]    1 / (1 − p/n)^2
Schwarz Criterion (SC) [10]               1 + (ln n / 2) (p/n) / (1 − p/n)
Shibata's Model Selector (SMS) [11]       1 + 2 p/n

The parameter p present in all of the definitions is meant to quantify the degrees of freedom of the set of functions F from which f is selected. We will come back to p at the end of this section.

In the 1990s, a non-asymptotic expression for P(n, p) was derived based on the results of VC-Theory [12]. The general form of this expression is²:

P(n, p) = \left( 1 - c \sqrt{ \frac{p \left( 1 + \ln \frac{an}{p} \right) - \ln \eta}{n} } \right)_{+}^{-1}, \qquad (14)

where p corresponds to the VC-dimension of the set of functions F and c, a, η are constants that need to be set to reflect the particularities of the learning method being used. In [13], the assignment c = a = 1 and η = 1/√n was shown to perform well in a polynomial regression task.

The Effective Number of Parameters in RLS: One of the tenets of VC-Theory is that the VC-dimension is the characterization of the complexity (capacity) of a set of functions F. However, in the case of the sets of functions in RLS, it is not known how to efficiently calculate their VC-dimension. When F is a set of functions linear in their parameters, which is the case of RLS, the heuristic method of calculating the effective number of parameters [14] can be used to approximate the VC-dimension.

Let σ_1 > σ_2 > ... > σ_n be the eigenvalues of the kernel matrix K used for solving the RLS problem in (7). The value

p_\gamma = \sum_{i=1}^{n} \frac{\sigma_i}{\sigma_i + \gamma}, \quad 0 < p_\gamma \leq n,

quantifies the effective degrees of freedom of the set of functions from which the solution f_γ is selected [12]. Using p = p_γ in any of the expressions for P(n, p) shown in this section, we arrive at a complexity penalization estimator for the parameter γ in RLS:

\gamma^{*}_{\text{cp}} = \arg\min_{\gamma > 0} \; R_{\text{emp}}(f_\gamma)\, P(n, p_\gamma).

¹ Unlike its name suggests, the GCV criterion is not a cross-validation procedure, but actually an approximation to the leave-one-out estimator presented in (9).

² (a)_+ = max(0, a).
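The sketch below (ours, under the same assumptions on K, y and the γ grid as before) illustrates the resulting selection rule with the FPE penalty from Table I; any of the other P(n, p) expressions could be plugged in instead:

import numpy as np

def fpe_penalty(n, p):
    # Final Prediction Error penalty from Table I
    return (1.0 + p / n) / (1.0 - p / n)

def cp_select_gamma(K, y, gammas, penalty=fpe_penalty):
    n = len(y)
    sigma, U = np.linalg.eigh(K)
    Uty = U.T @ y
    best_gamma, best_score = None, np.inf
    for gamma in gammas:
        alpha = U @ (Uty / (sigma + gamma))      # expression (7)
        r_emp = np.mean((y - K @ alpha) ** 2)    # empirical squared loss R_emp(f_gamma)
        p_eff = np.sum(sigma / (sigma + gamma))  # effective degrees of freedom p_gamma
        score = r_emp * penalty(n, p_eff)        # R_emp(f_gamma) * P(n, p_gamma)
        if score < best_score:
            best_gamma, best_score = gamma, score
    return best_gamma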


IV. HEURISTIC MODEL SELECTION

In this section we present two metric-based heuristic procedures that have been found to perform well in certain model selection tasks. It is not clear how these procedures relate to our goal in (2). Hence, their performance needs to be experimentally evaluated to get an idea of how promising they are for RLS model selection.

Both procedures, called TRI and ADJ, use geometric ideas and unlabeled data to perform model selection [15], [16]. In particular, the ADJ heuristic was found to be a state-of-the-art model selection procedure in polynomial regression [16], [17].

A. TRI

Suppose the real regression function f_reg(x) is known and belongs to a metric space. Then, given any two functions f_1(x) and f_2(x) in the same metric space and the distance (metric) d(·, ·), the triangle inequality applies:

d(f_1, f_{\text{reg}}) + d(f_2, f_{\text{reg}}) \geq d(f_1, f_2).

Given a training sequence (x_1, y_1), ..., (x_n, y_n) and a large set of unlabeled data (x*_1, ..., x*_m) which comes from the same distribution as the training sequence, we can approximately verify the validity of the triangle inequality using

d_{\text{tr}}(f_1, \mathbf{y}) + d_{\text{tr}}(f_2, \mathbf{y}) \geq d_{\text{un}}(f_1, f_2),

where d_tr(f, y) = √R_emp(f) is the training distance between the predictions of f(x) and the targets y, and d_un(f_1, f_2) is the disagreement on unlabeled data of the functions f_1 and f_2:

d_{\text{un}}(f_1, f_2) = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left( f_1(x^{*}_{i}) - f_2(x^{*}_{i}) \right)^2 }.

Now, let us use RLS to obtain two functions f_{γ1}(x) and f_{γ2}(x) such that γ_1 > γ_2. Consequently, while f_{γ2}(x) will potentially have a smaller empirical error than f_{γ1}(x), there is a risk that its empirical error is a worse estimate of its true distance to f_reg, and it may happen that:

d_{\text{tr}}(f_{\gamma_1}, \mathbf{y}) + d_{\text{tr}}(f_{\gamma_2}, \mathbf{y}) < d_{\text{un}}(f_{\gamma_1}, f_{\gamma_2}).

In this situation, the TRI heuristic blames the function f_{γ2} for breaking the triangle inequality. This gives rise to the TRI procedure for model selection: given a sequence of functions f_{γ1}, f_{γ2}, ... such that γ_1 > γ_2 > ..., choose the last function in the sequence that does not violate the triangle inequality with any preceding function.
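A sketch of our reading of the TRI procedure, assuming the candidate functions are already ordered by decreasing γ and that their predictions on the training and unlabeled points are available in the hypothetical arrays preds_train and preds_unlab:

import numpy as np

def tri_select(y_train, preds_train, preds_unlab):
    # d_tr(f_i, y): root mean squared training residual of each candidate
    d_tr = [np.sqrt(np.mean((y_train - p) ** 2)) for p in preds_train]
    chosen = 0
    for i in range(1, len(preds_train)):
        ok = all(
            d_tr[i] + d_tr[j] >=
            np.sqrt(np.mean((preds_unlab[i] - preds_unlab[j]) ** 2))  # d_un(f_i, f_j)
            for j in range(i)
        )
        if ok:
            chosen = i   # last candidate so far that satisfies the inequality with all predecessors
    return chosen        # index of the selected gamma in the sequence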

B. ADJ

This other heuristic is based on the assumption that, since the training and unlabeled sets come from the same distribution, the following relation should be observed:

\frac{d_{\text{un}}(f_1, f_{\text{reg}})}{d_{\text{un}}(f_1, f_2)} \approx \frac{d_{\text{tr}}(f_1, f_{\text{reg}})}{d_{\text{tr}}(f_1, f_2)}. \qquad (15)

If the functions f_1 and f_2 are given along with the training and unlabeled sets, the values d_tr(f_1, f_2) and d_un(f_1, f_2) can be readily calculated. The value d_tr(f_1, f_reg) can be estimated using d_tr(f_1, y). The only unknown in (15) is d_un(f_1, f_reg); thus it can be estimated using

\hat{d}_{\text{un}}(f_1, f_{\text{reg}}) = d_{\text{tr}}(f_1, \mathbf{y}) \, \frac{d_{\text{un}}(f_1, f_2)}{d_{\text{tr}}(f_1, f_2)}.

This means that, if the assumption in (15) is a good one, the prediction error of the function f_1 on unseen data, which is a good approximation to the true prediction error, can be estimated from its empirical error on the training data and the ratio d_un(f_1, f_2)/d_tr(f_1, f_2). This gives rise to the ADJ procedure, which for model selection in RLS reads: given a sequence of functions f_{γ1}, f_{γ2}, ... such that γ_1 > γ_2 > ..., first multiply d_tr(f_{γi}, y) by the largest observed ratio d_un(f_{γi}, f_{γj})/d_tr(f_{γi}, f_{γj}), i < j, and then choose the function f_{γi} in the sequence that has the smallest value of this product.
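A sketch of our reading of the ADJ procedure as stated above, under the same assumptions and hypothetical array names as the TRI sketch; initializing the ratio at 1.0 and skipping zero training distances are our own safeguards:

import numpy as np

def adj_select(y_train, preds_train, preds_unlab):
    n_cand = len(preds_train)
    d_tr_y = [np.sqrt(np.mean((y_train - p) ** 2)) for p in preds_train]  # d_tr(f_i, y)
    adjusted = []
    for i in range(n_cand):
        ratio = 1.0                                   # never deflate the training distance
        for j in range(i + 1, n_cand):                # pairs with i < j, as in the text
            d_tr = np.sqrt(np.mean((preds_train[i] - preds_train[j]) ** 2))
            d_un = np.sqrt(np.mean((preds_unlab[i] - preds_unlab[j]) ** 2))
            if d_tr > 0:
                ratio = max(ratio, d_un / d_tr)
        adjusted.append(d_tr_y[i] * ratio)            # adjusted training distance
    return int(np.argmin(adjusted))                   # index of the selected gamma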

V. EXPERIMENTS

The experiments reported in this section aim at answering the following questions regarding model selection in regularized least-squares:

Q1: The complexity penalization methods described in Section III-B were evaluated mainly for polynomial regression on artificial datasets, where they were found to be comparable to cross-validation [13]. Does this result hold for RLS on real datasets?

Q2: Can the constants of the VC expression in (14) for P(n, p) given in [13] be improved for model selection in RLS?

Q3: The heuristic approach ADJ was demonstrated to give state-of-the-art results in polynomial regression, even outperforming cross-validation, while TRI has been less successful than ADJ [16], [17]. Do these results hold for RLS on real datasets?

A. Experimental Setup

In all experiments, we use the linear splines kernel with an infinite number of knots [12]. Denoting by x^k the k-th coordinate of the vector x, it is calculated as:

k(x_i, x_j) = \prod_{k=1}^{d} \left[ 1 + x^{k}_{i} x^{k}_{j} + \frac{1}{2} \left| x^{k}_{i} - x^{k}_{j} \right| \left( x^{k}_{i} \wedge x^{k}_{j} \right)^{2} + \frac{1}{3} \left( x^{k}_{i} \wedge x^{k}_{j} \right)^{3} \right],

where x^k_i ∧ x^k_j denotes min(x^k_i, x^k_j).

This kernel can be used to perform non-linear regression andhas the advantage of having no free parameters.
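A small sketch (ours) of this kernel, computing the product over coordinates as in the expression above:

import numpy as np

def spline_kernel(xi, xj):
    # xi, xj: 1-D arrays of length d (one example each)
    m = np.minimum(xi, xj)   # coordinate-wise minimum, the "wedge" in the formula above
    terms = 1.0 + xi * xj + 0.5 * np.abs(xi - xj) * m ** 2 + m ** 3 / 3.0
    return float(np.prod(terms))

def kernel_matrix(X):
    # X: n x d array of training examples
    n = X.shape[0]
    K = np.empty((n, n))
    for a in range(n):
        for b in range(a, n):
            K[a, b] = K[b, a] = spline_kernel(X[a], X[b])
    return K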

We use 10 regression datasets from two repositories: UCI³ and Keel⁴ [18]. A detailed description of the datasets can be found on their repository websites. A brief description can be found in Table II.

In what follows we describe a single trial of our experiments for one dataset. Two-thirds of the available examples were randomly held out to form a test set (x′_1, y′_1), ..., (x′_ℓ, y′_ℓ). From the remaining one-third, n examples are randomly selected to compose the training set (x_1, y_1), ..., (x_n, y_n). The examples of the one-third not selected for the training set are used, without their labels, as the unlabeled set (x*_1, ..., x*_m) for both the TRI and ADJ methods.

³ http://archive.ics.uci.edu/ml/
⁴ http://sci2s.ugr.es/keel/datasets.php


Table II. DATASETS USED IN THE EXPERIMENTS

#   DATASET        INSTANCES   FEATURES
1   Abalone        4177        8
2   Compactiv      8192        21
3   Concrete-str   1030        8
4   Friedman       1200        5
5   Mortgage       1049        15
6   Stock          950         9
7   Treasury       1049        15
8   Wankara        321         9
9   Wine-red       1599        11
10  Wine-white     4898        11

The training set thus created is used along with a fixed value of γ to obtain a function f_γ(x) in accordance with (7). The root mean squared error (RMSE) of f_γ(x) on the test set,

\text{rmse}(f_\gamma) = \sqrt{ \frac{1}{\ell} \sum_{i=1}^{\ell} \left( f_\gamma(x'_i) - y'_i \right)^2 },

is then recorded for evaluation purposes.

In order to try reasonable values of γ, the first (largest) eigenvalue σ_1 of the kernel matrix K is taken as a reference. Using the 50 values v_1, ..., v_50 equally spaced according to the logarithmic scale in the range [10^{-6}, 10^{1}], we try the following 50 values: γ_1 = v_1σ_1, ..., γ_50 = v_50σ_1.
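For illustration, the γ grid just described can be generated as follows (our sketch; K is the kernel matrix of the training set):

import numpy as np

def gamma_grid(K, num=50):
    sigma_max = np.linalg.eigvalsh(K)[-1]   # largest eigenvalue of K (eigvalsh sorts ascending)
    return np.logspace(-6, 1, num) * sigma_max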

Thus, after obtaining f_{γ1}, ..., f_{γ50} and calculating their respective RMSE on the test set, we find the function f_{γ*} with the minimum value of RMSE, which is the gold standard for our model selection experiments described ahead.

We implemented and ran the following model selection procedures. Cross-validation: leave-one-out CV (LOO), 5-fold CV (5CV); complexity penalization: Final Prediction Error (FPE), Schwarz Criterion (SC), Generalized Cross-Validation (GCV), Shibata's Model Selector (SMS), VC expression (VC1); and metric-based: TRI and ADJ. Each model selection procedure had access only to the training set, with the exception of TRI and ADJ, which had access to the unlabeled set as well. According to their methods, each model selection procedure picked one function f_chosen among f_{γ1}, ..., f_{γ50}.

We evaluate the quality of each procedure by the ratio

r = \frac{\text{rmse}(f_{\text{chosen}})}{\text{rmse}(f_{\gamma^*})}.

That is, the closer r is to 1, the better the model selection performance.

So far, this description refers to a single trial. For a fixed size of the training set n, we conducted 20 such trials and report the mean and the standard deviation of each model selection procedure. This experiment was carried out for n = 20 and n = 100 on each dataset. The results can be found in Table III.

In order to address question Q2, the constants c and a of the VC expression in (14) were experimentally optimized using only one of the datasets (randomly selected). The resulting constants c = 0.74 and a = 1.35 were used in another model selection procedure with the VC expression (VC2).

B. Results and Discussion

In Table III, for each dataset and value of n, we identify the best result and, provided it is less than 1.15, the second best result. Taking into account all the results, the following distribution of (<best>, <2nd best>) results can be observed: LOO (12,5); 5CV (14,4); FPE (2,6); SC (1,9); GCV (1,3); SMS (4,2); VC1 (0,1); VC2 (4,2); TRI (4,2); and ADJ (4,2). According to this criterion, it is clear that LOO and 5CV are the best model selection procedures overall. However, observe that for the Abalone dataset the cross-validation procedures did not perform as well as they did for the other datasets.

In what follows we answer the three questions posed at the beginning of this section.

Q1: Considering the complexity penalization procedures (except VC2, which will be treated later), it can be observed that FPE and SC show similar results, which are in general better than those of GCV, SMS and VC1. Bearing in mind the overall results of these five complexity penalization methods on the 10 real-world datasets, they do not seem to be comparable with cross-validation, as was found in [13] for polynomial regression on artificial datasets.

Q2: Comparing the results obtained by VC1 and VC2 (the latter with c = 0.74 and a = 1.35), better results are obtained by VC2 in 17 of the 20 (85%) cases considered. Moreover, VC1 shows three catastrophic performances (i.e., r > 2.0) on the datasets Concrete-str, Stock and Wankara with n = 20. Thus, we can conclude that there is room for improving the constants of the VC expression for P(n, p) given in [13].

Q3: Comparing the results obtained by the two heuristic metric-based approaches, TRI and ADJ, there is no pattern of one heuristic outperforming the other. Thus, these results do not show, as claimed for polynomial regression [16], [17], that ADJ outperforms TRI. Comparing TRI and ADJ with LOO, worse results are obtained by TRI in 15 (75%) and by ADJ in 17 (85%) of the 20 cases considered. Furthermore, when compared with 5CV, both TRI and ADJ obtained worse results in 16 (80%) of the 20 cases considered. These results show that neither heuristic metric-based approach outperforms the LOO or 5CV cross-validation procedures.

Table III. EXPERIMENTAL RESULTS FOR STATISTICAL AND HEURISTIC MODEL SELECTION (ratio r, mean (std) over 20 trials)

#    n    LOO          5CV          FPE          SC           GCV          SMS          VC1          VC2          TRI          ADJ
1    20   1.20 (0.39)  1.15 (0.27)  1.32 (0.74)  1.34 (0.77)  1.09 (0.09)  1.86 (0.75)  1.18 (0.14)  1.86 (0.75)  1.30 (0.21)  1.82 (0.74)
1    100  1.01 (0.01)  1.01 (0.01)  1.02 (0.01)  1.03 (0.02)  1.02 (0.02)  1.24 (0.16)  1.22 (0.04)  1.13 (0.08)  1.34 (0.11)  1.15 (0.19)
2    20   1.00 (0.00)  1.00 (0.00)  1.14 (0.01)  1.14 (0.01)  1.14 (0.01)  1.00 (0.00)  1.14 (0.01)  1.00 (0.00)  1.00 (0.00)  1.00 (0.00)
2    100  1.00 (0.00)  1.00 (0.00)  1.54 (0.03)  1.54 (0.03)  1.54 (0.03)  1.00 (0.00)  1.54 (0.03)  1.00 (0.00)  1.00 (0.00)  1.00 (0.00)
3    20   1.01 (0.01)  1.00 (0.01)  1.03 (0.04)  1.02 (0.03)  1.37 (0.34)  1.04 (0.13)  2.43 (0.37)  1.04 (0.13)  1.01 (0.02)  1.04 (0.13)
3    100  1.02 (0.02)  1.02 (0.02)  1.11 (0.12)  1.12 (0.11)  1.16 (0.08)  1.20 (0.22)  1.47 (0.18)  1.20 (0.22)  1.06 (0.07)  1.14 (0.12)
4    20   1.11 (0.11)  1.11 (0.13)  1.13 (0.15)  1.11 (0.14)  1.21 (0.18)  1.12 (0.12)  1.58 (0.27)  1.12 (0.12)  1.05 (0.05)  1.12 (0.12)
4    100  1.01 (0.02)  1.02 (0.03)  1.08 (0.04)  1.10 (0.04)  1.16 (0.06)  1.21 (0.10)  1.80 (0.14)  1.21 (0.10)  1.07 (0.06)  1.17 (0.10)
5    20   1.04 (0.07)  1.02 (0.03)  1.02 (0.03)  1.02 (0.02)  1.17 (0.17)  1.01 (0.02)  1.53 (0.37)  1.01 (0.02)  1.08 (0.18)  1.01 (0.02)
5    100  1.03 (0.04)  1.07 (0.13)  1.02 (0.04)  1.02 (0.04)  1.08 (0.09)  1.02 (0.04)  1.20 (0.15)  1.02 (0.04)  1.73 (1.03)  1.02 (0.04)
6    20   1.04 (0.06)  1.03 (0.05)  1.06 (0.13)  1.06 (0.13)  1.32 (0.32)  1.10 (0.16)  2.09 (0.40)  1.10 (0.16)  1.08 (0.09)  1.10 (0.16)
6    100  1.07 (0.14)  1.06 (0.14)  1.06 (0.11)  1.06 (0.10)  1.07 (0.06)  1.13 (0.14)  1.27 (0.12)  1.13 (0.14)  1.05 (0.06)  1.11 (0.10)
7    20   1.05 (0.06)  1.05 (0.06)  1.07 (0.08)  1.06 (0.07)  1.13 (0.12)  1.06 (0.11)  1.58 (0.43)  1.06 (0.11)  1.18 (0.26)  1.06 (0.11)
7    100  1.03 (0.06)  1.03 (0.03)  1.06 (0.05)  1.06 (0.06)  1.11 (0.10)  1.10 (0.07)  1.23 (0.14)  1.10 (0.07)  1.32 (0.38)  1.09 (0.08)
8    20   1.01 (0.02)  1.01 (0.02)  1.04 (0.04)  1.03 (0.04)  1.30 (0.17)  1.02 (0.03)  2.04 (0.33)  1.02 (0.03)  1.10 (0.13)  1.02 (0.03)
8    100  1.03 (0.03)  1.03 (0.03)  1.03 (0.03)  1.04 (0.04)  1.09 (0.06)  1.14 (0.09)  1.23 (0.11)  1.14 (0.09)  1.48 (0.61)  1.14 (0.09)
9    20   1.04 (0.05)  1.04 (0.06)  1.08 (0.07)  1.06 (0.06)  1.13 (0.08)  1.16 (0.11)  1.33 (0.14)  1.16 (0.11)  1.09 (0.08)  1.16 (0.11)
9    100  1.01 (0.01)  1.01 (0.01)  1.18 (0.18)  1.18 (0.18)  1.06 (0.02)  1.42 (0.13)  1.20 (0.06)  1.42 (0.13)  1.26 (0.07)  1.42 (0.13)
10   20   1.04 (0.09)  1.02 (0.02)  1.05 (0.03)  1.04 (0.03)  1.08 (0.04)  1.17 (0.14)  1.21 (0.10)  1.17 (0.14)  1.11 (0.09)  1.17 (0.14)
10   100  1.02 (0.04)  1.02 (0.04)  1.04 (0.03)  1.05 (0.03)  1.05 (0.03)  1.54 (0.14)  1.16 (0.05)  1.54 (0.14)  1.35 (0.08)  1.54 (0.14)
AVG  20   1.05 (0.09)  1.05 (0.06)  1.09 (0.13)  1.09 (0.13)  1.19 (0.15)  1.15 (0.16)  1.61 (0.26)  1.15 (0.16)  1.10 (0.11)  1.15 (0.16)
AVG  100  1.02 (0.04)  1.03 (0.04)  1.11 (0.06)  1.12 (0.06)  1.13 (0.05)  1.20 (0.11)  1.33 (0.10)  1.19 (0.10)  1.26 (0.25)  1.18 (0.10)

VI. CONCLUSION

In this paper we investigated alternative statistical and heuristic model selection procedures for the selection of the parameter γ in the regularized least-squares method. These procedures, six of them based on complexity penalization (FPE, SC, GCV, SMS, VC1 and VC2) and two of them based on the geometry of metric spaces (TRI and ADJ), were experimentally evaluated on real datasets and compared to traditional cross-validation procedures (LOO and 5CV). The results show that the considered procedures often perform worse than cross-validation. This outcome is different from previous findings in model selection for other regression methods.

Our results confirmed that cross-validation may perform poorly when data is scarce. Unfortunately, the best performing alternative methods (FPE and SC) do not lead to safer model selection when cross-validation fails. Among the investigated procedures, the only one that can be tweaked seems to be the VC expression. In future work, we intend to investigate whether there is an assignment of constants that may render the VC expression practical. We also plan to investigate combinations of leave-one-out CV and the alternative procedures considered in this paper.

ACKNOWLEDGMENTS

This work is supported by grant #2009/17773-7, Sao Paulo Research Foundation (FAPESP). The authors would like to thank the anonymous reviewers for their helpful comments.

REFERENCES

[1] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.

[2] G. Wahba, Spline Models for Observational Data, ser. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), 1990.

[3] R. M. Rifkin, "Everything old is new again: a fresh look at historical approaches in machine learning," Ph.D. dissertation, MIT-Sloan School of Management, 2006. [Online]. Available: http://dspace.mit.edu/bitstream/handle/1721.1/17549/51896466.pdf?sequence=1

[4] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273–297, 1995.

[5] G. Kimeldorf and G. Wahba, "A correspondence between Bayesian estimation on stochastic processes and smoothing by splines," Annals of Mathematical Statistics, vol. 41, no. 2, pp. 495–502, 1970.

[6] A. Luntz and V. Brailovsky, "On estimation of characters obtained in statistical procedure of recognition," Technicheskaya Kibernetica, vol. 3, 1969.

[7] A. Elisseeff and M. Pontil, "Leave-one-out error and stability of learning algorithms with applications," in Learning Theory and Practice. IOS Press, 2002.

[8] H. Akaike, "Statistical predictor identification," Annals of the Institute of Statistical Mathematics, vol. 22, no. 1, pp. 203–217, 1970.

[9] G. Wahba and P. Craven, "Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation," Numerische Mathematik, vol. 31, pp. 377–404, 1979.

[10] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[11] R. Shibata, "An optimal selection of regression variables," Biometrika, vol. 68, no. 1, pp. 45–54, 1981.

[12] V. Vapnik, Statistical Learning Theory. Wiley-Interscience, 1998.

[13] V. Cherkassky, X. Shao, F. Mulier, and V. Vapnik, "Model complexity control for regression using VC generalization bounds," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1075–1089, 1999.

[14] T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, ser. CRC Monographs on Statistics and Applied Probability. Chapman and Hall, 1990.

[15] D. Schuurmans, "A new metric-based approach to model selection," in AAAI '97: Proceedings of the Fourteenth National Conference on Artificial Intelligence, 1997, pp. 552–558.

[16] D. Schuurmans, F. Southey, D. Wilkinson, and Y. Guo, Metric-based Approaches for Semi-supervised Regression and Classification, ser. Adaptive Computation and Machine Learning. MIT Press, 2006, pp. 421–451.

[17] O. Chapelle, V. Vapnik, and Y. Bengio, "Model selection for small sample regression," Machine Learning, vol. 48, no. 1-3, pp. 9–23, 2002.

[18] J. Alcala-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. Garcia, L. Sanchez, and F. Herrera, "Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.
