
Feature selection with Neural Networks

Dmitrij Lagutin, [email protected]

T-61.6040 - Variable Selection for Regression

24.10.2006

Contents

• Introduction

• Model independent feature selection

• Feature selection with neural networks

• Experimental comparison between different methods

Introduction

• Feature selection usually consists of the following:
  – A feature evaluation criterion to evaluate and select variable subsets
  – A search procedure to explore a subspace of possible variable combinations
  – A stop criterion or a model selection strategy

Model independent feature selection

• These methods are not neural network oriented; they are mentioned here because they are used in the experimental comparison

• These methods do not take the classification or regression model into account during variable selection


• The Bonnlander method utilizes mutual information
  – The mutual information of variables a and b that have probability densities P(a) and P(b) is

$$MI(a,b) = \sum_{a,b} P(a,b) \log\left(\frac{P(a,b)}{P(a)\,P(b)}\right)$$

  – It is a forward search and it selects the variable x_p that maximises

$$MI(\{SV_{p-1}, x_p\}, y)$$

  – SV_{p-1} is the set of the p-1 already selected variables
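The forward search on mutual information can be sketched as follows. This is a minimal illustration, assuming discrete-valued variables so that the probabilities can be estimated from counts; the function names and the fixed `n_select` budget (used here in place of a real stop criterion) are assumptions made for this example, not part of the Bonnlander method as presented.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Empirical MI(a, b) = sum of P(a,b) * log(P(a,b) / (P(a) P(b)))."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a = Counter(a)
    count_b = Counter(b)
    mi = 0.0
    for (va, vb), c in joint.items():
        # (c/n) / ((count_a/n) * (count_b/n)) simplifies to c*n / (count_a*count_b)
        mi += (c / n) * math.log(c * n / (count_a[va] * count_b[vb]))
    return mi

def forward_select(columns, y, n_select):
    """Greedy forward search. columns maps a variable name to its list of values."""
    selected = []
    while len(selected) < n_select:
        best_name, best_mi = None, -1.0
        for name in columns:
            if name in selected:
                continue
            # Treat {SV_{p-1}, x} as one joint variable: a tuple of values per sample.
            subset = selected + [name]
            joint_values = list(zip(*(columns[s] for s in subset)))
            mi = mutual_information(joint_values, y)
            if mi > best_mi:
                best_name, best_mi = name, mi
        selected.append(best_name)
    return selected

# Toy data: x1 determines y, x2 is independent of y.
y = [0, 0, 1, 1, 0, 0, 1, 1]
columns = {"x1": list(y), "x2": [0, 1, 0, 1, 0, 1, 0, 1]}
first_choice = forward_select(columns, y, 1)
```

With continuous inputs the densities would have to be estimated (for example by binning), which this sketch does not attempt.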


• Stepdisc is a stepwise feature selection method for classification

Feature selection with neural networks

• Feature selection with neural networks mostly uses backward search: in the beginning all variables are present, and unnecessary variables are then eliminated

• Neural networks are usually nonlinear models. Methods that assume a linear dependency between input and output variables are therefore not suited to neural networks
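A small illustration of why a linear criterion can miss a relevant input (made-up data, not from the slides): below, y is completely determined by x through y = x², yet the Pearson correlation between them is zero. The helper name `pearson` is chosen for this example only.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

x = [-2, -1, 0, 1, 2]
y_quadratic = [v * v for v in x]     # nonlinear dependency on x
y_linear = [2 * v + 1 for v in x]    # linear dependency on x

print(pearson(x, y_quadratic))  # 0.0
```

A selection method that ranks inputs by linear correlation would discard x here, although a nonlinear model could predict y from it exactly.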

Feature selection with neural networks

• Feature selection algorithms that use neural networks can be classified using the following criteria:
  – Zero order methods, which use only the network parameter values
  – First order methods, which use first derivatives of network parameters
  – Second order methods, which use second derivatives of network parameters

Zero order methods

• Yacoub and Bennani have proposed a method whose evaluation criterion uses both the weights and the network structure (I, H, O denote the input, hidden and output layers)

$$S_i = \sum_{j \in H} \left( \frac{|w_{ji}|}{\sum_{i' \in I} |w_{ji'}|} \sum_{k \in O} \frac{|w_{kj}|}{\sum_{j' \in H} |w_{kj'}|} \right)$$
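A minimal sketch of this zero order criterion for a network with one hidden layer: each input's relevance is its share of the absolute weights entering every hidden unit, multiplied by that hidden unit's share of the absolute weights entering every output. The argument names `w_hidden` and `w_output` and the nested-list layout are assumptions made for this example, and bias weights are left out.

```python
def yacoub_bennani_relevance(w_hidden, w_output):
    """Relevance S_i of each input.

    w_hidden[j][i]: weight from input i to hidden unit j.
    w_output[k][j]: weight from hidden unit j to output k.
    """
    n_inputs = len(w_hidden[0])
    n_hidden = len(w_hidden)
    scores = []
    for i in range(n_inputs):
        s = 0.0
        for j in range(n_hidden):
            # Share of input i among all weights entering hidden unit j.
            in_share = abs(w_hidden[j][i]) / sum(abs(w) for w in w_hidden[j])
            # Share of hidden unit j among the weights entering each output, summed over outputs.
            out_share = sum(abs(row[j]) / sum(abs(w) for w in row) for row in w_output)
            s += in_share * out_share
        scores.append(s)
    return scores

# Toy network: 2 inputs, 2 hidden units, 1 output.
w_hidden = [[1.0, -1.0], [3.0, 1.0]]
w_output = [[2.0, 2.0]]
scores = yacoub_bennani_relevance(w_hidden, w_output)
```

Because every share is normalised, the scores sum to the number of output units, so they can be read as relative contributions. The sketch assumes no unit has all-zero incoming weights.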

Zero order methods

• This method uses a backward search, and the neural network is retrained after each variable deletion

• The stop criterion is based on the evolution of performance on a validation set: as soon as performance decreases, the elimination is stopped
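The search and stop rule described here can be sketched generically. In this illustration `train_and_evaluate` is a hypothetical callback that retrains a network on the given variables and returns its validation score together with a relevance value per variable; the toy version below only mimics that behaviour so the loop can be run.

```python
def backward_elimination(features, train_and_evaluate):
    """Remove the least relevant variable until validation performance decreases."""
    current = list(features)
    best_score, relevance = train_and_evaluate(current)
    while len(current) > 1:
        weakest = min(current, key=lambda f: relevance[f])
        candidate = [f for f in current if f != weakest]
        # Retrain after each deletion, as the method prescribes.
        score, candidate_relevance = train_and_evaluate(candidate)
        if score < best_score:
            break  # performance on the validation set decreased: stop
        current, best_score, relevance = candidate, score, candidate_relevance
    return current, best_score

def toy_train_and_evaluate(features):
    """Stand-in for real training: 'a' and 'b' are useful, everything else is noise."""
    useful = {"a", "b"}
    score = 0.5 + 0.2 * sum(1 for f in features if f in useful)
    relevance = {f: (1.0 if f in useful else 0.0) for f in features}
    return score, relevance

kept, final_score = backward_elimination(["a", "n1", "b", "n2"], toy_train_and_evaluate)
```

In practice the relevance values would come from the chosen criterion and the score from a held-out validation set.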

First order methods

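• First order methods evaluate the relevance of a variable by computing the derivative of the error or of the output with respect to that variable

• The method proposed by Moody and Utans uses the variation of the learning error as its evaluation criterion:

$$S_i = MSE(\bar{x}_i) - MSE, \qquad MSE(\bar{x}_i) = \frac{1}{N} \sum_{l=1}^{N} \left\| f(x_1^l, \ldots, \bar{x}_i, \ldots, x_k^l) - y^l \right\|^2$$

  – Here \bar{x}_i is the mean of variable x_i over the N training examples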
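A minimal sketch of this criterion, assuming a model with a scalar output that is exposed as a plain Python callable: each variable in turn is replaced by its mean over the data set, and the resulting increase in mean squared error is taken as its relevance. The function names and the list-of-lists data layout are assumptions made for this example.

```python
def mse(model, X, Y):
    """Mean squared error of a scalar-output model over the data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(X, Y)) / len(X)

def moody_utans_sensitivity(model, X, Y):
    """S_i = MSE with variable i clamped to its mean, minus the unmodified MSE."""
    base = mse(model, X, Y)
    scores = []
    for i in range(len(X[0])):
        mean_i = sum(x[i] for x in X) / len(X)
        # Copy every example with variable i replaced by its mean.
        X_clamped = [x[:i] + [mean_i] + x[i + 1:] for x in X]
        scores.append(mse(model, X_clamped, Y) - base)
    return scores

# Toy model that uses only the first input.
model = lambda x: 2.0 * x[0]
X = [[0.0, 5.0], [1.0, -3.0], [2.0, 7.0], [3.0, 1.0]]
Y = [model(x) for x in X]
scores = moody_utans_sensitivity(model, X, Y)
```

The second input is ignored by the toy model, so clamping it changes nothing and its score is zero, while clamping the first input raises the error.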

First order methods

• Because the computation of S_i is difficult for large values of N, S_i can be approximated:

$$S_i \approx -\frac{1}{N} \sum_{l=1}^{N} (x_i^l - \bar{x}_i) \, \frac{\partial \left\| f(x^l) - y^l \right\|^2}{\partial x_i^l}$$
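One way to motivate this approximation is a first order Taylor expansion of the per-example error around the observed value of the variable. The sketch below assumes f is differentiable in x_i, takes \bar{x}_i to be the mean of x_i over the N examples, and writes E^l(x_i) for the squared error on example l as a function of the i-th input alone; E^l is notation introduced here, not taken from the slides.

```latex
\begin{align*}
E^l(x_i) &= \left\| f(x_1^l, \ldots, x_i, \ldots, x_k^l) - y^l \right\|^2 \\
E^l(\bar{x}_i) &\approx E^l(x_i^l)
    + (\bar{x}_i - x_i^l) \left. \frac{\partial E^l}{\partial x_i} \right|_{x_i^l} \\
S_i = \frac{1}{N} \sum_{l=1}^{N} \left( E^l(\bar{x}_i) - E^l(x_i^l) \right)
    &\approx -\frac{1}{N} \sum_{l=1}^{N} (x_i^l - \bar{x}_i)
    \left. \frac{\partial E^l}{\partial x_i} \right|_{x_i^l}
\end{align*}
```

The approximation is only as good as the linearisation, so it is most trustworthy when the values of x_i stay close to their mean.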

Comparison of first order methods

• There are several first order methods that use output derivatives; they differ mainly in which derivative is used

• On the next slide there is a comparison of these methods
  – C/R describes the tasks on which the method can be used: C = classification, R = regression

Experiments

Second order methods

• Second order methods use second derivatives of network parameters

• The Optimal Cell Damage method was proposed by Cibas; its evaluation criterion is

$$S_i = \mathrm{Saliency}(x_i) = \sum_{j \in \text{fan-out}(i)} \frac{1}{2} \frac{\partial^2 MSE}{\partial w_j^2} w_j^2$$

• where fan-out(i) is the set of weights of input i
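A minimal sketch of this saliency, assuming the error is available as a callable `loss(w)` over a flat list of weights and that `fan_out[i]` lists the positions of the weights leaving input i. Both names are hypothetical, and the second derivatives are estimated by central finite differences purely for brevity; a real implementation would more likely obtain them analytically.

```python
def second_derivative(loss, w, j, eps=1e-3):
    """Central finite-difference estimate of d^2 loss / d w_j^2."""
    up = list(w)
    up[j] += eps
    down = list(w)
    down[j] -= eps
    return (loss(up) - 2.0 * loss(w) + loss(down)) / eps ** 2

def ocd_saliency(loss, w, fan_out):
    """Saliency of each input: sum over its fan-out of 1/2 * d2MSE/dw_j^2 * w_j^2."""
    return [
        sum(0.5 * second_derivative(loss, w, j) * w[j] ** 2 for j in indices)
        for indices in fan_out
    ]

# Toy quadratic error: weights 0 and 1 matter, weight 2 does not.
loss = lambda w: w[0] ** 2 + 3.0 * w[1] ** 2 + 0.0 * w[2]
w = [1.0, 2.0, 5.0]
fan_out = [[0, 1], [2]]   # input 0 owns weights 0 and 1, input 1 owns weight 2
saliencies = ocd_saliency(loss, w, fan_out)
```

In the toy case the second input has a large weight but zero curvature, so its saliency is zero: the criterion measures the effect on the error, not weight magnitude alone.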

Second order methods

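• The Early Cell Damage method is somewhat similar. Leray has proposed the following evaluation criterion:

$$S_i = \mathrm{Saliency}(x_i) = \sum_{j \in \text{fan-out}(i)} \left( \frac{1}{2} \frac{\partial^2 MSE}{\partial w_j^2} w_j^2 - \frac{\partial MSE}{\partial w_j} w_j + \frac{1}{2} \frac{\left( \partial MSE / \partial w_j \right)^2}{\partial^2 MSE / \partial w_j^2} \right)$$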
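A sketch of an early-stopping-aware saliency, under the assumption that the criterion has the Early Brain Damage form: the second order term, minus the first order term, plus a correction for the distance to the local minimum. The gradient and diagonal Hessian entries are taken as given inputs; all function and argument names are hypothetical.

```python
def ecd_saliency(grad, hess_diag, w, fan_out):
    """Per-input saliency: sum over fan-out of 1/2*h*w^2 - g*w + 1/2*g^2/h.

    grad[j] and hess_diag[j] are dMSE/dw_j and d2MSE/dw_j^2 at the current weights.
    Assumes every hess_diag[j] used is non-zero.
    """
    def weight_saliency(j):
        g, h, wj = grad[j], hess_diag[j], w[j]
        return 0.5 * h * wj ** 2 - g * wj + 0.5 * g ** 2 / h

    return [sum(weight_saliency(j) for j in indices) for indices in fan_out]

# Toy error 3 * (w - 2)^2 for a single weight, evaluated before convergence at w = 0.5.
w = [0.5]
grad = [2 * 3.0 * (w[0] - 2.0)]   # -9.0
hess_diag = [2 * 3.0]             # 6.0
saliency = ecd_saliency(grad, hess_diag, w, [[0]])
```

For a quadratic error the result equals the error increase from deleting the weight relative to its optimum (here 3 * 2² = 12), whatever the current value of w, which is the point of using the extra terms when training has been stopped early.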

Experimental comparison between different methods

• The neural networks used in the comparison are multilayer perceptrons with one hidden layer containing 10 neurons

• The first problem is a three-class waveform classification problem with 21 noisy, dependent features (Breiman et al. 1984)

• In the first example, 19 pure noise variables were added to the 21 initial variables, giving 40 input variables in total


Method          p    Selected variables                           Perf.
None            40   111111111111111111111 1111111111111111111   82.51%
Stepdisc        14   000110111111111011100 0000000000000000000   85.35%
Bonnlander      12   000011101111111110000 0000000000000000000   85.12%
Yacoub          16   000111111111111111100 0000000000000000000   85.16%
Moody           16   000111111111111111100 0000000000000000000   85.19%
Ruck, Dorizzi   18   011111111111111111100 0000000000000000000   85.51%
Czernichow      17   010111111111111111100 0000000000000000000   85.67%
Cibas           9    000001111110111000000 0000000000000000000   82.26%
Leray           11   000001111111111100000 0000000000000000000   84.56%
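The bit strings in the table can be read mechanically: the first group covers the 21 original variables, the second the 19 added noise variables, and a 1 marks a variable that was kept. The snippet below is only a reading aid; it copies three rows from the table for brevity, and the helper name `count_selected` is made up for this example.

```python
def count_selected(mask):
    """Number of variables kept in a selection mask such as '0110'."""
    return mask.count("1")

# (original-variable mask, noise-variable mask), copied from the table above.
rows = {
    "Stepdisc": ("000110111111111011100", "0000000000000000000"),
    "Cibas": ("000001111110111000000", "0000000000000000000"),
    "Leray": ("000001111111111100000", "0000000000000000000"),
}

for method, (original_mask, noise_mask) in rows.items():
    kept_original = count_selected(original_mask)
    kept_noise = count_selected(noise_mask)
    print(method, "kept", kept_original, "original and", kept_noise, "noise variables")
```

The sum of the two counts should match the p column for each method.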


• In the first example all methods removed the pure noise variables

• The Bonnlander and Stepdisc methods performed quite well

• The Ruck, Dorizzi and Czernichow methods did not remove enough variables, while the Cibas method removed too many variables


• In the second example, the problem is the same, but now only the original 21 variables are present

• The Leray method performed very well; the Yacoub method removed too few variables, while the Bonnlander and Czernichow methods removed too many variables and performed poorly


• The second problem is a two-class problem in a 20-dimensional space. The classes are distributed according to two Gaussians.

• Again, the Bonnlander method removed too many variables and performance suffered, while the Yacoub method removed too few variables

• In this example, the Dorizzi and Ruck methods performed quite well: they removed many variables and still achieved good performance

Conclusions

• Methods using neural networks can be divided into three categories: zero order methods, first order methods and second order methods

• The best method depends on the task. For example, in the first problem second order methods performed poorly when noise was added, but without the additional noise their performance was very good
  – Non neural network methods (Stepdisc and Bonnlander) performed well in the original example, but quite poorly in the other examples

References

• Leray, P. and Gallinari, P. (1998). Feature selection with neural networks. Behaviormetrika.

• Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth International Group.
