
Feature Selection using Partial Least Squares Regression and Optimal Experiment Design

Varun K. Nagaraja
Dept. of Computer Science, University of Maryland, College Park
varun@umiacs.umd.edu

Wael Abd-Almageed
Information Sciences Institute, University of Southern California
wamageed@isi.edu

Abstract—We propose a supervised feature selection technique, called Optimal Loadings, that is based on applying the theory of Optimal Experiment Design (OED) to Partial Least Squares (PLS) regression. We apply the OED criteria to PLS with the goal of selecting an optimal feature subset that minimizes the variance of the regression model and hence minimizes its prediction error. We show that the variance of the PLS model can be minimized by applying the OED criteria to the loadings covariance matrix obtained from PLS. We also provide an intuitive viewpoint on the technique by deriving the A-optimality version of the Optimal Loadings criterion from the properties of maximum relevance and minimum redundancy for PLS models. In our experiments we use the D-optimality version of the criterion, which maximizes the determinant of the loadings covariance matrix. To overcome the computational challenges of this criterion, we propose an approximate D-optimality criterion along with its theoretical justification.

I. INTRODUCTION

Datasets with a large number of features are prevalent in many fields such as Computer Vision, Bioinformatics and Chemometrics. These large datasets pose analytical and computational challenges, and the problem is even worse in high dimensional cases where the number of features is much greater than the number of samples. A feature selection process reduces the dimensionality of the data by identifying a subset of the original features that captures the maximum amount of information in the data. The advantages of feature selection include improved model generalization, reduced computation time and a better understanding of the interactions among features [1].

Among supervised feature selection techniques, ranking by regression coefficients is one of the simplest ways to select features. Partial Least Squares (PLS) [2], [3] is a widely used regression technique for high dimensional datasets. It is used extensively for wavelength selection in Chemometrics and for gene selection in Computational Biology [4], [5], since these fields typically involve high dimensional datasets. Features are usually selected by ranking them according to the value of their PLS regression coefficients or other relevance measures. The caveat of this procedure is that it does not evaluate the features jointly and is therefore susceptible to selecting redundant features. Similar to ℓ1 and ℓ2 norm penalized regression, penalized variants of PLS [6], [7] are one approach to performing feature selection with PLS. Penalized regression techniques enforce sparsity in the regression coefficients along with the minimization of the model variance.

Another approach to minimizing the variance of the regression model is to apply the theory of Optimal Experiment Design (OED) [8] and its optimality criteria to PLS regression. The three most commonly used criteria are A-optimality, D-optimality and E-optimality, which respectively minimize the trace, the determinant and the maximum eigenvalue of the covariance matrix of the regression coefficients. Optimal Experiment Design has been used for sample selection problems such as sensor selection and Active Learning [9]. The optimality criteria are not specific to sample selection, however, and can also be used to measure the optimality of models built with different sets of features. Hence we use these criteria with PLS to develop a supervised feature selection technique. We show that an optimal feature subset can be selected by applying the criteria to the loadings covariance matrix obtained from PLS.

We first decompose the prediction error of PLS regression into its bias, variance and noise components. We then apply the OED criteria to the covariance matrix of the regression coefficients to derive the A-optimality and D-optimality versions of the Optimal Loadings criterion. We also show that the A-Optimal Loadings criterion can be obtained by explicitly incorporating the property of maximum relevance, as maximizing the energy content of the loadings matrix, and the property of minimum redundancy, as minimizing the condition number of the loadings matrix. However, solving the Optimal Loadings criteria is computationally challenging, since evaluating different feature subsets requires different PLS models. We therefore propose an approximate D-Optimal Loadings criterion that is based on a single loadings covariance matrix obtained with the entire set of features. We also derive a mathematical relationship between the approximate and the original D-Optimal Loadings criteria and use it to justify the approximation qualitatively.

The advantage of the Optimal Loadings criteria is that features are evaluated as subsets rather than individually, so redundancy can be measured simultaneously with relevance. This advantage is clearly evident in our experiments when the number of selected features is small. In our experiments we implement the D-Optimal Loadings criterion, which maximizes the determinant of the loadings covariance matrix. Experiments on four datasets indicate that the D-Optimal Loadings criterion consistently outperforms standard feature selection techniques in terms of the classification accuracies obtained with the selected feature subsets.


II. RELATED WORK

Feature selection techniques can be classified [1] into individual feature ranking methods and feature subset evaluation methods. Individual feature ranking methods use relevance measures to sort the features in rank order. Fisher Score [10] and ReliefF [11] are two techniques that belong to the ranking methods. Features can also be ranked based on regression coefficients and other informative vectors such as the Variable Influence on Projection (VIP) [12]. Although these methods have a computational advantage, they fail in the presence of redundant features, since the minimum redundancy property must be measured by looking at the features jointly. A popular technique that incorporates both the relevance and the redundancy properties is the minimum redundancy and maximum relevance (mRMR) framework [13], [14]. It involves an objective function based on information theoretic measures and uses incremental search to find the feature subsets. The computational challenge in the original mRMR framework is the estimation of mutual information when the number of samples is small and when the data is continuous. However, a kernel based dependency measure such as the Hilbert-Schmidt Independence Criterion (HSIC) can be used in place of the mutual information measure; the HSIC has been used as a measure of feature dependence by L. Song et al. [15].

In the presence of high dimensionality, ordinary least squares regression fails because of the singularity of the feature covariance matrix. Hence regularized linear regression, usually with ℓ1 and/or ℓ2 penalization [16], [17], is employed to obtain a biased model with smaller variance. Partial Least Squares (PLS) regression [2], [3] is a commonly used technique for handling high dimensional datasets. It provides two viewpoints on the modeling process: as a regression technique and as a feature extraction technique. While it can extract information in a latent space of few dimensions, sparsity of the features must be incorporated explicitly into the PLS formulation for feature selection. In the Sparse PLS of K.-A. Le Cao et al. [6], ℓ1 penalization is applied to the loading vectors of the PLS-SVD formulation to integrate feature selection into the modeling process. The Sparse PLS of H. Chun and S. Keles [7] uses both ℓ1 and ℓ2 penalization, as in the Elastic Net, within the PLS formulation.

In Ordinary Least Squares regression, under the uniform noise assumption, the covariance matrix of the regression coefficients is independent of the response variable. This property has been used to apply Optimal Experiment Design [8] to unsupervised feature selection. The Laplacian Score technique [18] is a ranking based algorithm for unsupervised feature selection that has been extended [19] with OED and shown to perform better than the original ranking based algorithm. While both the penalization and the OED approaches have been studied for ordinary least squares regression, only penalization methods have been tried with PLS. Our work explores the application of the OED criteria to PLS regression.

III. PRELIMINARIES

A. Partial Least Squares

Partial Least Squares is a simultaneous feature extraction and regression technique, well suited for high dimensional problems where the number of samples is much smaller than the number of features ($n \ll p$). The linear PLS model can be expressed as

$X = T P^\top + X_{res}$    (1)
$Y = U Q^\top + Y_{res}$    (2)

where $X_{n \times p}$ is the feature matrix, $Y_{n \times q}$ is the matrix of response variables or class labels, $T_{n \times d}$ is called the X-scores, $P_{p \times d}$ is the X-loadings, $U_{n \times d}$ is the Y-scores, $Q_{q \times d}$ is the Y-loadings, and $X_{res}$ and $Y_{res}$ are the residuals. The data in $X$ and $Y$ are assumed to be mean-centered. The X-scores and Y-scores are the projections of the $n$ samples onto a $d$-dimensional orthogonal subspace. The X-scores are obtained as a linear combination of the variables in $X$ with the weights $W^*$, as shown in Eqn. (3).

$T = X W^*$    (3)

The inner relation between the X-scores and the Y-scores is a linear regression model [2], and hence the X-scores are called predictors of the Y-scores. If $B$ is the regression coefficient matrix of the inner relation between the scores, we can write

$U = T B$    (4)

Substituting Eqn. (4) in Eqn. (2) we get

$Y = T B Q^\top + Y_{res}$    (5)
$\;\;= T \tilde{B} + Y_{res}$    (6)

where $\tilde{B} = B Q^\top$. The least squares estimate of $\tilde{B}$ is then given by

$\hat{B} = (T^\top T)^{-1} T^\top Y$    (7)

Hence PLS can be expressed in a linear regression form as

$\hat{Y} = T \hat{B} = T (T^\top T)^{-1} T^\top Y$    (8)

For a detailed explanation of the PLS technique, we refer the reader to [2], [3].

The two most popular algorithms for obtaining the PLS model are NIPALS [3] and SIMPLS [20]. SIMPLS provides weights $W^*$ that can be applied directly to $X$, whereas NIPALS provides weights $W$ that act on the residuals $Z_a$ obtained by deflating $X$ at every component $a$. The relationship between the two is given by [3]

$W^* = W (P^\top W)^{-1}$    (9)

Here we consider the case of a single response variable $Y_{n \times 1}$ and use the equations of the NIPALS algorithm to obtain the PLS model. However, we consider a small variation, in which we normalize the scores instead of the loadings. At every iteration for component $a$, we have

$w_a = \frac{Z_a^\top Y}{\sqrt{Y^\top Z_a Z_a^\top Y}}$    (10)
$t_a = \frac{Z_a w_a}{\sqrt{w_a^\top Z_a^\top Z_a w_a}}$    (11)
$p_a = Z_a^\top t_a$    (12)
$Z_{a+1} = Z_a - t_a p_a^\top$    (13)

where $Z_1 = X$. The weights and scores form an orthonormal set, i.e., $w_i^\top w_j = 0$ and $t_i^\top t_j = 0$ for $i \neq j$.
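As a concrete aid to the description above, the following NumPy sketch (ours, not the authors' code) implements the score-normalized NIPALS iteration of Eqns. (10)-(13) for a single, mean-centered response; the function name nipals_pls is an assumption of this sketch.

    import numpy as np

    def nipals_pls(X, Y, d):
        """NIPALS PLS with normalized scores (Eqns. 10-13) for a single response Y (n x 1).

        Returns the weights W (p x d), loadings P (p x d) and scores T (n x d).
        X and Y are assumed to be mean-centered, as in the paper.
        """
        n, p = X.shape
        Z = X.copy()
        W = np.zeros((p, d))
        P = np.zeros((p, d))
        T = np.zeros((n, d))
        for a in range(d):
            w = Z.T @ Y                                   # Eqn. (10): weight direction
            w = w / np.sqrt(Y.T @ Z @ Z.T @ Y)            # normalize the weight vector
            t = Z @ w                                     # Eqn. (11): score direction
            t = t / np.linalg.norm(t)                     # normalize the score vector
            p_a = Z.T @ t                                 # Eqn. (12): loadings
            Z = Z - t @ p_a.T                             # Eqn. (13): deflation
            W[:, a:a+1], P[:, a:a+1], T[:, a:a+1] = w, p_a, t
        return W, P, T

The SIMPLS-style weights of Eqn. (9) can then be recovered as W_star = W @ np.linalg.inv(P.T @ W).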


B. Notation

Let π denote a subset of feature indices from the set $\{1, 2, 3, \ldots, p\}$ containing exactly $k$ elements. The feature subset matrix $X_\pi$ is expressed as

$X_\pi = X_{(n \times p)} \Pi_{(p \times k)}$    (14)

where $\Pi$ is a column selection matrix that selects $k$ out of $p$ features. Each of the $k$ columns of $\Pi$ contains a single entry of one, at the row indexed by an element of π, and zeros elsewhere. Any parameter of a model built with a subset of features is denoted by a subscript π.
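As an illustration of this notation (ours, not part of the paper), the selection matrix of Eqn. (14) can be built explicitly, and $X_\pi = X \Pi$ is simply column indexing:

    import numpy as np

    def selection_matrix(pi, p):
        """Build the p x k column-selection matrix Pi for the index subset pi."""
        k = len(pi)
        Pi = np.zeros((p, k))
        Pi[pi, np.arange(k)] = 1.0  # one entry of one per column, at row pi[j]
        return Pi

    # X_pi = X @ Pi coincides with plain column indexing X[:, pi].
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 8))
    pi = [1, 4, 6]
    assert np.allclose(X @ selection_matrix(pi, X.shape[1]), X[:, pi])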

IV. OPTIMAL LOADINGS TECHNIQUE

A. Optimal Experiment Design for PLS

Consider a linear regression model

$Y = X \beta + \epsilon$    (15)

where $Y_{n \times 1}$ is the response vector, $X_{n \times p}$ is the feature matrix, $\beta_{p \times 1}$ is the regression coefficient vector and $\epsilon_{n \times 1}$ is the noise vector with mean zero and covariance $\sigma^2 I_n$. The noise terms of different observations are assumed to be independent of each other.

The Partial Least Squares estimate of the regression coefficients can be obtained by substituting for $T_\pi$ from Eqn. (3) in Eqn. (8):

$\hat{\beta}_\pi = \Pi W^*_\pi (T_\pi^\top T_\pi)^{-1} T_\pi^\top Y = \Pi W^*_\pi T_\pi^\top Y$    (16)

where the simplification uses $T_\pi^\top T_\pi = I$, which holds because the scores are normalized. By substituting for $Y$ from Eqn. (15) in Eqn. (16), we find that the mean of the PLS estimate is given by

$E[\hat{\beta}_\pi] = \Pi W^*_\pi T_\pi^\top X \beta + \Pi W^*_\pi T_\pi^\top E[\epsilon]$    (17)
$\;\;= \Pi W^*_\pi T_\pi^\top X \beta$    (18)

where in Eqn. (17) we have assumed that $\Pi W^*_\pi T_\pi^\top$ and $\epsilon$ are negligibly correlated. This holds when the Signal to Noise Ratio is high, so that the deviation of the PLS model with respect to the noise is negligible. The covariance of $\hat{\beta}_\pi$ is given by

$\mathrm{cov}(\hat{\beta}_\pi) = E\big[(\hat{\beta}_\pi - E[\hat{\beta}_\pi])(\hat{\beta}_\pi - E[\hat{\beta}_\pi])^\top\big]$    (19)
$\;\;= E[\hat{\beta}_\pi \hat{\beta}_\pi^\top] - E[\hat{\beta}_\pi] E[\hat{\beta}_\pi^\top]$    (20)
$\;\;= \sigma^2 \Pi W^*_\pi W^{*\top}_\pi \Pi^\top$    (21)

For a new sample $(x, y)$ such that $y = x^\top \beta + e$ and $\hat{y} = x^\top \hat{\beta}_\pi$, the mean squared prediction error of PLS can be decomposed into its bias, variance and noise components:

$E[(y - \hat{y})^2]$    (22)
$\;\;= x^\top E[(\beta - \hat{\beta}_\pi)(\beta - \hat{\beta}_\pi)^\top]\, x + \sigma^2$    (23)
$\;\;= \mathrm{Bias}^2 + x^\top \big(\sigma^2 \Pi W^*_\pi W^{*\top}_\pi \Pi^\top\big)\, x + \sigma^2$    (24)

where

$\mathrm{Bias}^2 = x^\top (I_p - \Pi W^*_\pi T_\pi^\top X)\, \beta \beta^\top (I_p - X^\top T_\pi W^{*\top}_\pi \Pi^\top)\, x$    (25)

Since the squared prediction error is directly proportional to $\mathrm{cov}(\hat{\beta}_\pi)$, the prediction error can be minimized by minimizing the covariance of the PLS regression coefficients. Moreover, in high dimensional datasets, reducing the model variance helps avoid overfitting to the data. The theory of Optimal Experiment Design proposes to minimize this covariance by optimizing the eigenvalues of $\Pi W^*_\pi W^{*\top}_\pi \Pi^\top$ through various criteria.

Lemma 1. The matrices $W^*_\pi W^{*\top}_\pi$ and $(P_\pi P_\pi^\top)^\dagger$ have the same non-zero eigenvalues, where $\dagger$ denotes the Moore-Penrose inverse.

Proof: By substituting for $W^*_\pi$ from Eqn. (9), we get

$\mathrm{eigval}(W^*_\pi W^{*\top}_\pi)$    (26)
$\;\;= \mathrm{eigval}\big[ W_\pi (P_\pi^\top W_\pi)^{-1} \big((P_\pi^\top W_\pi)^{-1}\big)^\top W_\pi^\top \big]$    (27)
$\;\;= \mathrm{eigval}\big[ W_\pi W_\pi^\top (P_\pi P_\pi^\top)^\dagger W_\pi W_\pi^\top \big]$    (28)
$\;\;= \mathrm{eigval}\big[ (P_\pi P_\pi^\top)^\dagger \big]$    (29)

where $\mathrm{eigval}(\cdot)$ denotes the eigenvalues of a matrix. Eqn. (28) can be regarded as a similarity transformation since $W_\pi$ is orthonormal. The rank of these matrices equals the number of latent components ($d$) extracted.

Using Lemma 1 and applying the A-optimality criterion to the covariance matrix of the PLS coefficients in Eqn. (21), we get

$\arg\min_\Pi \; \mathrm{trace}\big[ \Pi (P_\pi P_\pi^\top)^\dagger \Pi^\top \big]$    (30)

We can drop the pre- and post-multiplication by $\Pi$, since it only pads zeros to change the size of the matrix $(P_\pi P_\pi^\top)^\dagger$ from $k \times k$ to $p \times p$. For a fixed number of selected features $k$, the A-optimality criterion can therefore be rewritten as follows.

Definition 1 (A-Optimal Loadings criterion). The A-optimality version of the Optimal Loadings criterion is given by

$\arg\min_\Pi \; \mathrm{trace}\big[ (P_\pi P_\pi^\top)^\dagger \big]$    (31)

We could also apply the D-optimality or E-optimality criterion, which minimize the determinant or the maximum eigenvalue respectively, instead of the trace in Eqn. (31). Among these optimality criteria, the D-optimality criterion is the most popular, owing to the availability of off-the-shelf algorithms in convex optimization toolboxes and of row exchange algorithms. It also turns the minimization of the determinant of an inverse into the maximization of the determinant of the matrix itself. The D-optimality version of criterion (31) is given by

$\arg\min_\Pi \; {\det}^\dagger\big[ (P_\pi P_\pi^\top)^\dagger \big]$    (32)

which is equivalent to the following.

Definition 2 (D-Optimal Loadings criterion). The D-optimality version of the Optimal Loadings criterion is given by

$\arg\max_\Pi \; {\det}^\dagger\big( P_\pi P_\pi^\top \big)$    (33)

where ${\det}^\dagger(\cdot)$ denotes the pseudo-determinant, the product of the non-zero eigenvalues of the matrix.

The actual determinant is replaced by a pseudo-determinant because the criterion involves a rank deficient matrix.
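As a small illustration of Definitions 1 and 2 (our own sketch, not the authors' implementation), both scores can be evaluated from the non-zero eigenvalues of $P_\pi P_\pi^\top$, i.e., the squared non-zero singular values of a given loadings matrix $P_\pi$:

    import numpy as np

    def optimal_loadings_scores(P_pi, tol=1e-10):
        """A- and D-Optimal Loadings scores for a k x d loadings matrix P_pi.

        A-score: trace[(P_pi P_pi^T)^dagger], to be minimized (Eqn. 31).
        D-score: pseudo-determinant of P_pi P_pi^T, to be maximized (Eqn. 33).
        """
        s = np.linalg.svd(P_pi, compute_uv=False)
        ev = s[s > tol] ** 2          # non-zero eigenvalues of P_pi P_pi^T
        a_score = np.sum(1.0 / ev)    # trace of the Moore-Penrose pseudo-inverse
        d_score = np.prod(ev)         # product of the non-zero eigenvalues
        return a_score, d_score

In terms of these eigenvalues, the relevance/condition-number criterion derived later in Section IV-B, $\mathrm{trace}[P_\pi P_\pi^\top] / \kappa_F(P_\pi^\top)^2$, equals 1/a_score, so maximizing it is the same as minimizing the A-score above.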


B. PLS models with Maximum Relevance and Minimum Redundancy

The A-Optimal Loadings criterion (31) can also be obtained by imposing the requirements of maximum relevance and minimum redundancy on feature subsets. The following derivation provides an intuitive viewpoint on the same criterion that was obtained from the theory of Optimal Experiment Design.

The reconstruction error in a feature extraction technique measures the difference between the original energy content of all the features and the amount captured by the latent components. While our goal is to obtain features that best explain a response variable, the structure in the data should also be preserved. By substituting for $p_a$ from Eqn. (12) in Eqn. (13), we get

$Z_{a+1} = [I - t_a t_a^\top]\, Z_a = \big[I - \textstyle\sum_{i=1}^{a} t_i t_i^\top\big]\, X$    (34)

The reconstruction error can also be viewed as the residual that cannot be explained by the PLS model. Hence we can use Eqn. (34) to express the error in a form similar to that of the reconstruction error of PCA:

$\mathrm{error}^2 = \|X_{res}\|_F^2 = \|X - T T^\top X\|_F^2$    (35)
$\;\;= \mathrm{trace}\big[ X^\top X - X^\top T T^\top X \big]$    (36)
$\;\;= \mathrm{trace}\big[ X^\top X \big] - \mathrm{trace}\big[ P P^\top \big]$    (37)

where Eqn. (37) is obtained by substituting for $X$ from Eqn. (1) in the second term and using the fact that the scores $T$ are orthogonal to the residuals $X_{res}$. The reconstruction error decreases as the number of extracted components increases. For a fixed number of components $d$, however, the error is minimum when $\mathrm{trace}[P P^\top]$ is maximum. Therefore we start by defining the feature selection criterion as

$\arg\max_\Pi \; \mathrm{trace}\big[ P_\pi P_\pi^\top \big]$    (38)

It should be noted that the reconstruction error itself does not appear in criterion (38). This criterion tries to select the feature subset that retains the maximum energy content (measured by the Frobenius norm) in the PLS model after feature selection.

The criterion (38) is also directly proportional to the covariance between the features $X$ and the response variable $Y$. This can be seen by substituting for $p_{\pi a}$ from Eqn. (12) in criterion (38) and then expanding up to $w_{\pi a}$ of Eqn. (10). We get

$\mathrm{trace}[P_\pi P_\pi^\top] = \sum_{a=1}^{d} \frac{Y^\top (Z_{\pi a} Z_{\pi a}^\top)^3\, Y}{Y^\top (Z_{\pi a} Z_{\pi a}^\top)^2\, Y}$    (39)

Since PLS extracts components such that the covariance between the features and the response variable and the covariance among the features themselves are simultaneously maximized, criterion (38) simultaneously captures the relevance to the response variable and the latent information in the features.

However, the trace criterion (38) does not measure the redundancy property, and hence we incorporate the condition number of $P_\pi^\top$ to measure the linear dependence of the columns/features. Since we want to minimize the condition number, criterion (38) can be rewritten as

$\arg\max_\Pi \; \frac{\mathrm{trace}\big[ P_\pi P_\pi^\top \big]}{\big(\kappa(P_\pi^\top)\big)^2}$    (40)

The condition number in the Frobenius norm is defined by

$\kappa_F(P_\pi^\top)^2 = \mathrm{trace}\big[ P_\pi P_\pi^\top \big] \cdot \mathrm{trace}\big[ (P_\pi P_\pi^\top)^\dagger \big]$    (41)

Substituting Eqn. (41) into criterion (40), we obtain

$\arg\min_\Pi \; \mathrm{trace}\big[ (P_\pi P_\pi^\top)^\dagger \big]$    (42)

This is the same A-Optimal Loadings criterion (31) obtained earlier by applying Optimal Experiment Design to Partial Least Squares regression.

C. Approximation for the D-Optimal Loadings criterion

In our experiments we choose to implement the D-optimality version of the Optimal Loadings criterion, as it turns the minimization of the determinant of an inverse matrix into the maximization of the determinant itself. The availability of off-the-shelf algorithms for determinant maximization is another advantage of using the D-optimality criterion.

The loadings in criterion (33) depend on π, and it is infeasible to build a PLS model every time a subset of features is to be evaluated; this would defeat the purpose of a feature selection technique. Hence we try to express the criterion in terms of the loadings obtained with all the features. From Eqn. (1), we have

$X_\pi^\top X_\pi = P_\pi P_\pi^\top + X_{res\,\pi}^\top X_{res\,\pi}$    (43)
$\Pi^\top X^\top X \Pi = \Pi^\top P P^\top \Pi + \Pi^\top X_{res}^\top X_{res} \Pi$    (44)

The left hand sides of Eqns. (43) and (44) are equal by Eqn. (14), so equating the right hand sides gives

$P_\pi P_\pi^\top = \Pi^\top P P^\top \Pi + \Delta_\pi$    (45)

where $\Delta_\pi$ is a symmetric matrix given by

$\Delta_\pi = \Pi^\top X_{res}^\top X_{res} \Pi - X_{res\,\pi}^\top X_{res\,\pi}$    (46)

Since we use the D-optimality criterion for feature selection, we examine the relationship between the determinants of $P_\pi P_\pi^\top$ and $\Pi^\top P P^\top \Pi$. The singularity of these matrices makes their behavior difficult to quantify, so we instead relate the determinants of the regularized matrices $(P_\pi P_\pi^\top + I)$ and $(\Pi^\top P P^\top \Pi + I)$.

Theorem 1. The relationship between the determinants of $(\Pi^\top P P^\top \Pi + I)$ and $(P_\pi P_\pi^\top + I)$ is given by

$\det(P_\pi P_\pi^\top + I) = \det(M + \Lambda M \Sigma^{-1}) \, \det(\Pi^\top P P^\top \Pi + I)$    (47)

where $M$ is a unitary matrix, $\Lambda$ is a diagonal matrix of the real eigenvalues of $\Delta_\pi$ and $\Sigma$ is a diagonal matrix of the positive eigenvalues of $(\Pi^\top P P^\top \Pi + I)$.

Proof: Let the two symmetric, positive semi-definite matrices $P_\pi P_\pi^\top$ and $\Pi^\top P P^\top \Pi$, each of rank $d$ and size $k \times k$, be related by

$P_\pi P_\pi^\top = \Pi^\top P P^\top \Pi + \Delta_\pi$    (48)


where $\Delta_\pi$ is a symmetric matrix given by

$\Delta_\pi = \Pi^\top X_{res}^\top X_{res} \Pi - X_{res\,\pi}^\top X_{res\,\pi}$    (49)

We use the Sherman-Morrison-Woodbury formula [21] to express the determinant of a sum of matrices:

$\det(P_\pi P_\pi^\top + I) = \det(\Pi^\top P P^\top \Pi + I + \Delta_\pi)$    (50)
$\;\;= \det(\Pi^\top P P^\top \Pi + I + U \Lambda U^\top)$    (51)
$\;\;= \det\big(I + \Lambda U^\top (\Pi^\top P P^\top \Pi + I)^{-1} U\big) \, \det(\Pi^\top P P^\top \Pi + I)$    (52)
$\;\;= \det(I + \Lambda U^\top V \Sigma^{-1} V^\top U) \, \det(\Pi^\top P P^\top \Pi + I)$    (53)
$\;\;= \det(I + \Lambda M \Sigma^{-1} M^\top) \, \det(\Pi^\top P P^\top \Pi + I)$    (54)
$\;\;= \det(M + \Lambda M \Sigma^{-1}) \, \det(\Pi^\top P P^\top \Pi + I)$    (55)

where we have applied the eigendecomposition to $\Delta_\pi$ and to $(\Pi^\top P P^\top \Pi + I)$. $\Sigma$ and $\Lambda$ are diagonal matrices containing the non-negative eigenvalues ($\sigma$) of $(\Pi^\top P P^\top \Pi + I)$ and the real eigenvalues ($\lambda$) of $\Delta_\pi$, respectively, and $M = U^\top V$ is a unitary matrix obtained as a product of the two unitary matrices $U$ and $V$.

The two determinants are highly correlated when the condition number of $(M + \Lambda M \Sigma^{-1})$ is small. The condition number of a matrix measures the asymptotic worst case of the amount of perturbation the matrix can produce when multiplied with other matrices. The eigenvalues in $\Sigma$ and $\Lambda$ indicate the energy content in the structured data and in the noise, respectively, where noise is any structure that cannot be explained by the first $d$ components of the PLS model. The theoretical and empirical observations (found in the supplementary material) suggest that the condition number is small when the variance of the noise is low and the noise levels are well separated from those of the structure in the data. Therefore, under the assumption of a high Signal to Noise Ratio, we can ignore $\Delta_\pi$ and substitute for $P_\pi P_\pi^\top$ from Eqn. (45) in criterion (33). The approximate feature selection criterion is given by

$\arg\max_\Pi \; {\det}^\dagger(\Pi^\top P P^\top \Pi)$    (56)

The number of components in $\Pi^\top P P^\top \Pi$ and $P_\pi P_\pi^\top$ must be equal in order to compare the information in the two matrices. The number of components in PLS regression determines the bias and variance of the model; it is usually chosen so that the cross-validation error of PLS regression is minimum.

The experiments and discussion in the following sections use the D-optimality criterion for feature selection. D-optimal designs are usually generated with row exchange algorithms [22], [23]. Starting from a non-singular set, these algorithms add or delete rows so as to increase the determinant, and iterate until the increase in the determinant falls below a fixed threshold or a maximum number of iterations is reached. The iterations are not guaranteed to converge to the global maximum. One of the first exchange algorithms was developed by V. V. Fedorov, and several modifications have been proposed to improve its computational performance [23]. The traditional D-optimal experiment design differs from the feature selection problem in that it allows duplicate samples, so the standard exchange algorithms need to be tweaked to avoid duplicates for feature selection.
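The paper's implementation tweaks the row exchange algorithm of the MATLAB Statistics Toolbox; purely as an illustration (our own simplified variant, not the authors' code), the sketch below performs a greedy exchange over the rows of the full loadings matrix P, maximizing the log pseudo-determinant of $\Pi^\top P P^\top \Pi$ from the approximate criterion (56) while disallowing duplicates.

    import numpy as np

    def log_pdet(P_rows, tol=1e-10):
        """Log pseudo-determinant of Pi^T P P^T Pi built from the selected rows of P."""
        s = np.linalg.svd(P_rows, compute_uv=False)
        return 2.0 * np.sum(np.log(s[s > tol]))

    def greedy_exchange(P, k, n_sweeps=5, seed=0):
        """Greedy row exchange for criterion (56); no guarantee of a global maximum.

        P : (p, d) loadings matrix of the full-feature PLS model.
        Returns the indices of k selected features, without duplicates.
        """
        p = P.shape[0]
        rng = np.random.default_rng(seed)
        subset = list(rng.choice(p, size=k, replace=False))
        best = log_pdet(P[subset])
        for _ in range(n_sweeps):
            improved = False
            for pos in range(k):
                for cand in range(p):
                    if cand in subset:
                        continue
                    trial = subset.copy()
                    trial[pos] = cand
                    val = log_pdet(P[trial])
                    if val > best + 1e-12:   # exchange only if the determinant grows
                        subset, best, improved = trial, val, True
            if not improved:
                break
        return np.sort(subset)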

Since the D-optimality criterion involves the maximization of a determinant, it can also be treated as a convex optimization problem [24]. The integer constraints $\pi_i \in \{0, 1\}$ are relaxed to $\pi_i \in [0, 1]$:

minimize $\; -\log\det\Big[ \sum_{i=1}^{p} \pi_i P_i P_i^\top \Big]$    (57)
subject to $\; \sum_{i=1}^{p} \pi_i = k$    (58)
$\;\; 0 \leq \pi_i \leq 1, \quad i = 1, \ldots, p$    (59)

where $P_i$ denotes the $i$-th row of $P$ written as a column vector. The solution of the original problem in criterion (56) is a feasible point of the relaxed problem above. A discrete solution is usually obtained by taking the $k$ largest values of $\pi_i$, which can lead to a sub-optimal solution of the original problem. The log-det objective is available in popular SDP solvers [25]. One disadvantage of the convex optimization methods is that they store the entire convex hull of the features, which is difficult to handle for large loadings matrices due to memory restrictions.

V. ANALYSIS OF THE RELATIONSHIP BETWEEN $P_\pi P_\pi^\top$ AND $\Pi^\top P P^\top \Pi$

We can obtain an upper bound for the relationship (47) by bounding the largest singular value of the matrix $(M + \Lambda M \Sigma^{-1})$; the spectral norm measures the largest singular value of a matrix. Using standard properties of norms, we can write

$\|M + \Lambda M \Sigma^{-1}\|_2 \leq \|M\|_2 + \|\Lambda\|_2 \|M\|_2 \|\Sigma^{-1}\|_2$    (60)
$\;\;= 1 + \frac{\lambda_{max}}{\sigma_{min}}$    (61)

where $\lambda_{max} = \max_i |\lambda_i|$ and $\sigma_{min} = \min_i \sigma_i$. Therefore the upper bound for the relationship (47) is given by

$\det(P_\pi P_\pi^\top + I) \leq \Big(1 + \frac{\lambda_{max}}{\sigma_{min}}\Big)^k \det(\Pi^\top P P^\top \Pi + I)$    (62)

To determine a lower bound, we would need the smallest singular value of $(M + \Lambda M \Sigma^{-1})$. The only safe bound available is that the smallest singular value is greater than zero, since the determinants on both sides of the relationship must be positive. Nevertheless, a qualitative discussion can be given by estimating the smallest singular value through a lower bound on the column norms. We first write the matrix entrywise as

$M + \Lambda M \Sigma^{-1} =$    (63)
$\Big[\, \big(1 + \tfrac{\lambda_i}{\sigma_j}\big)\, m_{ij} \,\Big]_{i,j = 1,\ldots,k}$    (64)

so that the norm of column $i$ is given by

$\eta_i = \sqrt{\sum_{j=1}^{k} \Big(1 + \frac{\lambda_j}{\sigma_i}\Big)^2 m_{ji}^2}$    (65)


[Figure 1: scatter plots of $\log\det[P_\pi P_\pi^\top]$ versus $\log\det[\Pi^\top P P^\top \Pi]$ for randomly chosen subsets of various sizes $k$, with the line y = x for reference, on (a) random data, (b) the MNIST dataset, (c) the ORL dataset and (d) the CMU PIE dataset.]

Fig. 1. Relationship between the original criterion $\log\det[P_\pi P_\pi^\top]$ and the approximate criterion $\log\det[\Pi^\top P P^\top \Pi]$, obtained by applying PLS for varying numbers of features, $k$, in a subset π. The approximate and original criteria are positively correlated for the real datasets. Hence, by maximizing the approximate criterion we are not too far from the maximum of the original criterion.

Since $\lambda_j$ can be negative, the lower bound depends on the ratio between $\lambda_j$ and $\sigma_i$. We therefore let $l$ be the column that minimizes the column norms and define $\lambda_{min,l}$ such that

$\lambda_{min,l} = \arg\min_{\lambda_j} \Big| 1 + \frac{\lambda_j}{\sigma_l} \Big|$    (66)

Then a lower bound on the column norms is given by

$\eta_i \geq \Big| 1 + \frac{\lambda_{min,l}}{\sigma_l} \Big|$    (67)

and an approximate lower bound on the determinant can be written as

$\det(P_\pi P_\pi^\top + I) \gtrsim \Big| 1 + \frac{\lambda_{min,l}}{\sigma_l} \Big|^k \det(\Pi^\top P P^\top \Pi + I)$    (68)

Combining the two bounds (62) and (68), we get

$\Big| 1 + \frac{\lambda_{min,l}}{\sigma_l} \Big|^k \lesssim \frac{\det(P_\pi P_\pi^\top + I)}{\det(\Pi^\top P P^\top \Pi + I)} \leq \Big(1 + \frac{\lambda_{max}}{\sigma_{min}}\Big)^k$    (69)

In most practical situations the bounds in inequality (69) are much tighter. The number of non-zero eigenvalues of $\Delta_\pi$ is usually small, and hence the exponential factor of $k$ is also low.

Figure 1 shows the quantitative relationship between the log determinants of $P_\pi P_\pi^\top$ and $\Pi^\top P P^\top \Pi$ for random data and for three of the datasets (ORL, MNIST, CMU PIE) used in our experiments. The random data is of size 100 × 1000. The point clouds are generated by observing the determinant values for randomly selected subsets of size $k$. For the MNIST (Figure 1(b)), ORL (Figure 1(c)) and CMU PIE (Figure 1(d)) datasets, the point clouds are very narrow and the two determinants are strongly positively correlated.

We can use the bounds in inequality (69) to describe qualitatively the situations in which the point clouds of Figure 1 are narrow, so that the determinants are positively correlated. The point clouds are narrower when the ratio between the two bounds is close to one. Minimizing this ratio is equivalent to minimizing the condition number of the matrix $(M + \Lambda M \Sigma^{-1})$, which is the ratio of its largest singular value to its smallest singular value. The largest singular value of $(M + \Lambda M \Sigma^{-1})$ is $1 + \lambda_{max}$, since the minimum of $\sigma_i$ is one. The condition number is then given by

$\kappa \simeq \frac{1 + \lambda_{max}}{\big| 1 + \frac{\lambda_{min,l}}{\sigma_l} \big|}$    (70)

The eigenvalues ($\lambda$) of $\Delta_\pi$ are usually a few large positive values, coming mostly from $\Pi^\top X_{res}^\top X_{res} \Pi$, and a few negative values coming from $-(X_{res\,\pi}^\top X_{res\,\pi})$. These eigenvalues indicate the information content of the noise, where noise is any structure that cannot be explained by the first $d$ components of the PLS model. The eigenvalues in $\Sigma$ indicate the information content of the structured data.

When all the $\lambda_j$ are positive, the approximate lower bound in inequality (67) is $\big(1 + \frac{\lambda_{min}}{\sigma_{max}}\big)$, where $\lambda_{min} = \min_j \lambda_j$. In this case the condition number is low when the ratio between $\lambda_{max}$ and $\lambda_{min}$ is low, i.e., when the variance of the noise is low. When there are negative eigenvalues, $\lambda_{min,l}$ is attained by an eigenvalue whose absolute value is close to $\sigma_l$; in this situation the condition number is low when the magnitudes of the $\lambda_i$ are far from the $\sigma_i$, i.e., when the levels of noise and structured data are well separated. Therefore the approximation improves as the variance of the noise decreases and as the noise levels move away from those of the structured data. In many real datasets linear regression provides good models, and in such situations, by maximizing $\det(\Pi^\top P P^\top \Pi + I)$ we are not too far from the maximum of $\det(P_\pi P_\pi^\top + I)$.
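As an illustration of how point clouds like those in Figure 1 can be produced, the following sketch (ours, on synthetic data; it assumes the nipals_pls helper sketched in Section III-A) compares the approximate and original criteria over random subsets:

    import numpy as np

    def log_pdet(P_rows, tol=1e-10):
        """Log pseudo-determinant of the Gram matrix of the given loadings rows."""
        s = np.linalg.svd(P_rows, compute_uv=False)
        return 2.0 * np.sum(np.log(s[s > tol]))

    rng = np.random.default_rng(0)
    n, p, d, k, n_subsets = 100, 1000, 10, 50, 200
    X = rng.standard_normal((n, p))
    X -= X.mean(axis=0)
    Y = rng.standard_normal((n, 1))
    Y -= Y.mean(axis=0)

    _, P_full, _ = nipals_pls(X, Y, d)           # loadings of the full-feature model

    approx, original = [], []
    for _ in range(n_subsets):
        pi = rng.choice(p, size=k, replace=False)
        approx.append(log_pdet(P_full[pi]))      # log det†(Pi^T P P^T Pi)
        _, P_pi, _ = nipals_pls(X[:, pi], Y, d)  # refit PLS on the subset
        original.append(log_pdet(P_pi))          # log det†(P_pi P_pi^T)

    print(np.corrcoef(approx, original)[0, 1])   # correlation of the two criteria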

VI. EXPERIMENTS AND RESULTS

To evaluate the performance of our feature selection criterion, we test it in a classification framework in which feature selection is treated as a preprocessing filter that produces the indices, π, of the selected feature subset. The feature subset is then used to obtain a low dimensional subspace using PLS, and classification is performed with a Linear Discriminant Classifier in this low dimensional projection subspace. In the cross-validation setting, the test data is kept separate from the training data for both feature selection and classifier training. This experimental setup avoids the overoptimistic performance estimates that are obtained when feature selection is evaluated on the entire dataset, as reported in [26].

A. Datasets

The experiments are performed on four datasets: two face image datasets, one handwritten digit dataset and one mass-spectrometric dataset of cancerous and normal tissues. For the three image datasets, pixel values are used as features and no further feature extraction is performed.


[Figure 2: classification accuracy versus the number of selected features on (a) the MNIST, (b) the ORL, (c) the CMU PIE and (d) the Arcene datasets, comparing Optimal Loadings, Regression Coefficients, RReliefF, Fisher Score and mRMRq.]

Fig. 2. Classification performance with feature subsets: the D-Optimal Loadings criterion performs better than the others on the MNIST and the CMU PIE datasets and performs on par with the mRMR technique on the ORL and the Arcene datasets. It also shows consistent performance, especially when the number of selected features is small.

The first dataset is a subset [27] of the MNIST handwritten digits (available at http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html), containing 200 images for each of 10 digit classes and producing a dataset of size 2000 × 784. The second is a subset of the AT&T ORL face image database, consisting of 10 subjects with 10 images each under pose variations, which produces a dataset of size 100 × 10304. The third is a subset of the CMU PIE database containing face images of 10 people in a fixed frontal pose (Pose 27) under lighting and illumination changes; with 49 images per person, this produces a dataset of size 490 × 4096. The fourth is the Arcene dataset from the NIPS Feature Selection Challenge, which contains a training set and a validation set, each of size 100 × 10000, and has two classes.

B. Comparison with other Feature Selection Techniques

We evaluate the performance of the D-Optimal Loadings criterion alongside other supervised feature selection techniques: ranking by regression coefficients, Fisher Score [10], RRelief-F [11] and mRMR [14]. For the D-Optimal Loadings criterion, the number of components is chosen by minimizing the cross-validation error of PLS regression, and the determinant maximization is performed using a tweaked version of the row exchange algorithm available in the MATLAB Statistics Toolbox. The same PLS model is used to obtain the regression coefficients, and the top features are selected according to the absolute values of their coefficients. We use the regression version of Relief-F, as it showed better performance than the classification version; in RRelief-F, the neighborhood size and the number of samples for quality estimation are set to 10 and 100 respectively. Finally, for the mRMR technique we use the Mutual Information Quotient scheme, since it is shown to perform better than the MI Difference scheme. We do not discretize the data any further.

We compare performance using the classification accuracies obtained with a Linear Discriminant Classifier. We prefer a simple linear classifier in order to avoid tuning the additional parameters introduced by nonlinear classifiers. Since the number of selected features can be greater than the number of samples, the classifier is trained in a PLS subspace to avoid over-fitting. The feature subset is used to construct a subspace whose dimensionality is again selected based on the least cross-validation error for PLS regression; this happens to be the same as that used for the D-Optimal Loadings criterion. Given the number of components $d$, the experiments are conducted for varying sizes of the feature subset. During the test phase, we select the feature subset from the test data, compute projections using the weights from the training phase, and then classify using the trained model.
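The experiments in the paper are implemented in MATLAB; purely as an illustration of the protocol just described, the following scikit-learn sketch (ours; the function and variable names are assumptions) trains a PLS subspace of d components on a selected feature subset and a Linear Discriminant Classifier on the resulting scores:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.preprocessing import LabelBinarizer

    def evaluate_subset(X_train, y_train, X_test, y_test, pi, d):
        """Accuracy of an LDA classifier trained in the PLS subspace of the columns pi."""
        Y_train = LabelBinarizer().fit_transform(y_train)  # one-hot response for PLS
        pls = PLSRegression(n_components=d, scale=False)   # mean-centering only
        pls.fit(X_train[:, pi], Y_train)
        lda = LinearDiscriminantAnalysis()
        lda.fit(pls.transform(X_train[:, pi]), y_train)
        return lda.score(pls.transform(X_test[:, pi]), y_test)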

We found that the cross-validation error of PLS regression stabilizes at around 10, 15, 30 and 20 components for the ORL, MNIST, CMU PIE and Arcene datasets respectively. Using these numbers of components, we perform a 20-fold cross-validation experiment for the ORL dataset and 10-fold cross-validation for the MNIST and CMU PIE datasets; a larger number of folds is used for ORL because of its smaller number of samples. For the Arcene dataset, the validation set is used as the test set and the entire training set is used for training. Figure 2 shows the classification accuracies obtained with D-Optimal Loadings, Regression coefficients, Fisher score, Relief-F and mRMR on the four datasets. The D-Optimal Loadings criterion outperforms the other techniques on the MNIST and the CMU PIE datasets and performs on par with the mRMR technique on the ORL and the Arcene datasets. The D-Optimal Loadings technique handles particularly well the regime in which the number of selected features is small.


[Figure 3: locations of the features selected by each technique, for k = 50, 100, 150 and 200, overlaid on (a) a sample ORL image and (b) a sample CMU PIE image.]

Fig. 3. Feature points selected by the D-Optimal Loadings, Regression Coefficients, Relief-F, Fisher Score and mRMR techniques. The features selected by D-Optimal Loadings are well distributed across the significant regions of the image, unlike the others, which tend to cluster or to lie in noisy regions.

We see that Fisher score and Relief-F generally perform worse for small numbers of selected features, since they do not account for redundancy among features. Figure 3 shows the feature points selected by the five techniques overlaid on sample images from two of the datasets. The features selected by D-Optimal Loadings are well distributed across the significant regions of the image, unlike the others, which tend to cluster or to lie in noisy regions.

VII. CONCLUSION

Our work explores the application of the theory of Optimal Experiment Design (OED) to Partial Least Squares (PLS) regression. We use OED to derive the A-Optimal Loadings and D-Optimal Loadings feature selection criteria, with the goal of minimizing the variance of the PLS regression model. We specifically use an approximation of the D-Optimal Loadings criterion, which maximizes the determinant of the loadings covariance matrix, to select an optimal feature subset. The availability of off-the-shelf row exchange algorithms and convex optimization methods for determinant maximization speeds up the feature selection stage of a pattern analysis problem. One important characteristic of the Optimal Loadings criteria is that they are based on the optimization of eigenvalues, which is necessarily evaluated at the subset level. We also provide insight into the technique by deriving the A-Optimal Loadings criterion using only the properties of maximum relevance and minimum redundancy for feature subsets. The results of our experiments on four datasets indicate that the D-Optimal Loadings criterion selects better feature subsets than techniques such as mRMR and Relief-F. Apart from the classification accuracies, the locations of the selected feature points on the images also indicate that it selects non-redundant features from the significant regions of the image.

REFERENCES

[1] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1157-1182, Oct. 2003.
[2] P. Geladi, "Partial least-squares regression: a tutorial," Analytica Chimica Acta, vol. 185, no. 1, pp. 1-17, 1986.
[3] S. Wold, "PLS-regression: a basic tool of chemometrics," Chemometrics and Intelligent Laboratory Systems, vol. 58, no. 2, pp. 109-130, Oct. 2001.
[4] D. Nguyen and D. Rocke, "Tumor classification by partial least squares using microarray gene expression data," Bioinformatics, vol. 18, no. 1, pp. 39-50, 2002.
[5] A.-L. Boulesteix and K. Strimmer, "Partial least squares: a versatile tool for the analysis of high-dimensional genomic data," Briefings in Bioinformatics, vol. 8, no. 1, pp. 32-44, Jan. 2007.
[6] K.-A. Le Cao, D. Rossouw, C. Robert-Granie, and P. Besse, "A sparse PLS for variable selection when integrating omics data," Statistical Applications in Genetics and Molecular Biology, vol. 7, no. 1, Article 35, Jan. 2008.
[7] H. Chun and S. Keles, "Sparse partial least squares regression for simultaneous dimension reduction and variable selection," Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 72, no. 1, pp. 3-25, Jan. 2010.
[8] F. Pukelsheim, Optimal Design of Experiments. Society for Industrial and Applied Mathematics, 2006, vol. 50.
[9] X. He, "Laplacian Regularized D-optimal Design for active learning and its application to image retrieval," IEEE Transactions on Image Processing, vol. 19, no. 1, pp. 254-263, Jan. 2010.
[10] R. Duda, P. Hart, and D. Stork, Pattern Classification and Scene Analysis, 2nd ed., 1995.
[11] M. Robnik-Sikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, no. 1, pp. 23-69, 2003.
[12] R. F. Teofilo, J. P. A. Martins, and M. M. C. Ferreira, "Sorting variables by using informative vectors as a strategy for feature selection in multivariate regression," Journal of Chemometrics, vol. 23, no. 1, pp. 32-48, Jan. 2009.
[13] C. Ding and H. Peng, "Minimum redundancy feature selection from microarray gene expression data," Journal of Bioinformatics and Computational Biology, pp. 1-8, 2005.
[14] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, Aug. 2005.
[15] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt, "Feature selection via dependence maximization," The Journal of Machine Learning Research, vol. 13, no. 1, pp. 1393-1434, 2012.
[16] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B (Methodological), pp. 267-288, 1996.
[17] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 67, no. 2, pp. 301-320, Apr. 2005.
[18] X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," in Advances in Neural Information Processing Systems, vol. 18, 2006, p. 507.
[19] X. He, M. Ji, C. Zhang, and H. Bao, "A Variance Minimization Criterion to Feature Selection using Laplacian Regularization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 10, pp. 2013-2025, Mar. 2011.
[20] S. De Jong, "SIMPLS: An alternative approach to partial least squares regression," Chemometrics and Intelligent Laboratory Systems, vol. 18, no. 3, pp. 251-263, 1993.
[21] F. Giannessi, P. Pardalos, and T. Rapcsak, Optimization Theory: Recent Developments from Matrahaza, pp. 124-125, 2002.
[22] R. C. St. John and N. R. Draper, "D-Optimality for Regression Designs: A Review," Technometrics, vol. 17, no. 1, p. 15, Feb. 1975.
[23] R. Cook, "A comparison of algorithms for constructing exact D-optimal designs," Technometrics, vol. 22, no. 3, pp. 315-324, 1980.
[24] L. Vandenberghe, S. Boyd, and S.-P. Wu, "Determinant Maximization with Linear Matrix Inequality Constraints," SIAM Journal on Matrix Analysis and Applications, vol. 19, no. 2, p. 499, 1998.
[25] R. Tutuncu, K. Toh, and M. Todd, "Solving semidefinite-quadratic-linear programs using SDPT3," Mathematical Programming, vol. 95, no. 2, pp. 189-217, 2003.
[26] P. Smialowski, D. Frishman, and S. Kramer, "Pitfalls of supervised feature selection," Bioinformatics, vol. 26, no. 3, pp. 440-443, Feb. 2010.
[27] D. Cai, X. He, and Y. Hu, "Learning a spatially smooth subspace for face recognition," in IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2007.