CENTER FOR MACHINE PERCEPTION
CZECH TECHNICAL UNIVERSITY

RESEARCH REPORT
ISSN 1213-2365

Feature selection based on the training set manipulation
PhD thesis proposal

Pavel Křížek
[email protected]

CTU–CMP–2005–07
February 28, 2005

Available at ftp://cmp.felk.cvut.cz/pub/cmp/articles/krizek/Krizek-TR-2005-07.pdf

Supervisor: Václav Hlaváč, Josef Kittler

The author was supported by The Czech Science Foundation under project GACR 102/03/0440.

Research Reports of CMP, Czech Technical University in Prague, No. 7, 2005

Published by
Center for Machine Perception, Department of Cybernetics
Faculty of Electrical Engineering, Czech Technical University
Technická 2, 166 27 Prague 6, Czech Republic
fax +420 2 2435 7385, phone +420 2 2435 7637, www: http://cmp.felk.cvut.cz


Abstract

A novel feature selection technique for classification problems is proposed in this PhD thesis proposal. The method is based on the training set manipulation. A weight is associated with each training sample, similarly as in the AdaBoost algorithm. The weights form a distribution. Any change of the distribution of weights influences the behaviour of particular features in a different manner. This brings new information to the selection process, in contrast to other feature selection techniques. The main idea is to modify the weights in each selection step so that the currently selected feature appears, with respect to the distribution, like an irrelevant observation. We show in experiments that such a change of the weights distribution allows hidden relationships between features to be revealed. Although the feature selection algorithm is not completely developed yet, preliminary results achieved on several artificial problems look promising.


Contents

1 Introduction
  1.1 Feature selection: The traditional approach
  1.2 AdaBoost

2 Problem formulation

3 State of the art
  3.1 Traditional deterministic methods
  3.2 Stochastic methods
  3.3 Feature selection in machine learning
  3.4 AdaBoost and feature selection
  3.5 Toolboxes and data benchmarks
  3.6 Summary

4 Proposed method
  4.1 The training set manipulation
  4.2 Selecting the features
  4.3 The distribution change by AdaBoost

5 Implementation, AdaBoost
  5.1 Discrete AdaBoost
  5.2 Real AdaBoost
  5.3 Decision area

6 Experiments
  6.1 Zero correlation and irrelevant observations
  6.2 Redundant features
  6.3 Nested subsets of features
  6.4 Analysis of results

7 Summary and thesis proposal


1 Introduction

The performance and generalization abilities of a classifier depend on the number of training samples, the dimensionality of the feature space, and the classifier complexity. Determination of an arbitrarily complicated decision boundary requires a large number of training samples which, in addition, grows exponentially with the feature space dimensionality [5]. Consequently, huge data storage and computation resources are required. This fact is termed the curse of dimensionality and leads to a paradoxical behaviour called the peaking phenomenon [41]. The phenomenon states that adding features may actually degrade the performance of a designed classifier in practice if the feature space dimensionality is too large in comparison with the number of training samples. The fundamental reason is that increasing the problem dimensionality induces a corresponding increase in the number of unknown parameters of the classifier. The reliability of the parameter estimates decreases for a fixed sample size and, consequently, the performance of the resulting classifier may degrade. Poor generalization capability of a classifier may also result from optimizing the classifier too intensively on the training set. The classifier becomes overtrained [56], which is analogous to overfitting in regression [35].

The curse of dimensionality and its consequences are the main reason for restricting the data dimension. There are two approaches to dimensionality reduction. In the literature, they are referred to as feature selection, sometimes also feature subset selection, and feature extraction. It is important to distinguish between the two notions. While feature selection methods select potentially the best subset of the input features, feature extraction algorithms transform the original features into another space. Which of the methods to choose is a matter of the application and the character of the training data.


Figure 1: a) Feature selection, b) feature extraction.

The feature selection process finds a reduced subset of features that is necessary and sufficient to describe the class membership of the examples without significantly degrading the performance of a subsequent classifier. This is reached by eliminating as much redundant and irrelevant information from the data as possible. As a result, the influence of noise is reduced and savings in measurement cost and data storage requirements are achieved. The features also keep their original physical meaning, because no transformation of the data is made, which may be important for a better understanding of the problem.

In contrast, all input features are necessary for constructing a new observation space in feature extraction. Either linear or nonlinear transformations or combinations of the original features are computed to create a certain mapping. The mapping is determined so that an appropriately chosen criterion function is optimized. Only a subspace of the most discriminatory newly created features is selected as the result, considering some kind of threshold. Although feature extraction is more general and the transformation mapping may provide better discriminatory ability than the best subset of input features, the new observations may not have a clear physical interpretation. There are no savings in data storage requirements or measurement cost.

The scope of both approaches is too broad and thus the rest of the report is focused on the feature selection topic. A new feature selection technique inspired by the AdaBoost algorithm [44] has been developed and studied. Our concept of selecting features gains from an effective training set manipulation. The work concentrates on the question of how to select features considering a one-dimensional space only. Although Cover and Van Campenhout [8] claim that it is not possible without a significant information loss, we will show that the training set manipulation may achieve quite interesting results.

Let us mention the main difference between the traditional approach and the proposed method. Traditional techniques use the same training set of examples to evaluate subsets of features during the selection process, which does not bring any new information. In the new algorithm, the training set may be modified to provide new useful information for subsequent selection steps. Better performance is expected compared to the traditional methods, because redundant and irrelevant features are identified more easily using the training set manipulation. Our hypothesis is also that the selection process should advance faster, because the search is constrained to a one-dimensional space only in each iteration, whereas traditional feature selection techniques evaluate a specific criterion for subsets of features.

There is a wide field of applications in which the use of feature selection techniques may be beneficial. A typical example is the gene selection problem, where samples are available only from a small number of patients but the dimensionality is enormous [33]. Acquiring measurements may also be expensive, as in medicine, genetic engineering and other fields. Creating meaningful rules preserving original measurement units and minimizing the number of control parameters is important in control theory applications. In text classification problems, there is the issue of non-numeric features represented by a vocabulary of words [42]. Recovering the hidden relationships among a large number of features is addressed in data mining and knowledge discovery applications [29], and in many other fields.

The report is organized as follows. The rest of Section 1 introduces the traditional feature selection techniques and the AdaBoost algorithm. In Section 2, we formulate the problem and explain what redundant and irrelevant features are. The state of the art is given in Section 3. Section 4 introduces the proposed method for the training set manipulation. Section 5 describes several implementation issues concerning AdaBoost. Experimental results are shown and analyzed in Section 6. Section 7 summarizes the report and specifies the future work.

1.1 Feature selection: The traditional approach

A suitably chosen criterion function assesses the effectiveness of a feature subset in traditional feature selection techniques. The best subset of candidate features with respect to the selected measure may be found by solving an optimization problem. The optimization may be transformed into a state space exploration problem, in which each state represents a different set of features. The size of the space corresponds to the number of all possible combinations of features. The task is to find a subset of d features in the original set of n features that optimizes (usually maximizes) the criterion function. This is also why the traditional approach is often criticized, since the size of the optimal subset is not known beforehand. The search strategy is independent of the criterion function used [10].

An exhaustive search for the optimal subset of features requires examining $\binom{n}{d}$ possible combinations of features. This number grows combinatorially with the dimension and makes such a method computationally prohibitive even for problems of small dimensionality. Thus the main research effort was directed to the development of suboptimal sequential search algorithms.

The heuristic basis of most suboptimal sequential methods is the monotonicity of the criterion. It means that any change in the feature set size is positively correlated with the criterion function value. In other words, the criterion value does not decrease while adding features to the current set nor increase while removing them (under the assumption that the higher the criterion value, the better the corresponding subset of features). However, most commonly used criteria do not satisfy the monotonicity property, or some features may become redundant at a later selection stage due to statistical dependence between measurements. The search then results in nested subsets of features. This phenomenon is known as the nesting effect [10] and may be partly suppressed by backtracking [53, 36] in the search process.

The criterion function guiding the search, sometimes also called the objective function, is usually some kind of separability measure between classes. There are two general approaches to its choice, called filters and wrappers [21], see Figure 2.


Figure 2: a) A general filter method and b) a traditional filter technique, where the features are filtered independently of the learning algorithm; c) the wrapper approach, where the learning algorithm is involved in the selection.

Wrappers are generally more precise and achieve better predictive accuracy than filters. The recognition rate of a preselected learning algorithm is used as a criterion function to guide the search. However, the solution lacks generality, since the selected subset of features is optimized for the classifier used in the evaluation function. When the sample size is too small compared to the number of features (peaking phenomenon), the parameters of the learning algorithm cannot be reliably estimated. Cross-validation measures are typically used to prevent this effect and also to suppress overfitting [24]. Wrappers are brute-force methods. A massive amount of computation is required, since the wrapper must train a classifier (or several classifiers if cross-validation is used) for each subset of features, which may become infeasible especially for computationally intensive methods. A speed-up may be achieved by using efficient space search strategies.

Filters may be seen as a preprocessing step for a subsequent learning algorithm. Irrelevant and redundant information is filtered out before the learning process starts. This method provides a more general approach than wrappers, since the solution relies on intrinsic properties of the training data rather than on their interactions with a particular classifier. The criterion function evaluates feature subsets by their information content. Measures like the inter-class distance, statistical dependence or other information-theoretic measures [10] are typically used. Because of the independence of the learning algorithm, the solution is suitable for a larger family of classifiers. Filters usually execute quite fast.

The initial point in the feature space specifies the search direction. Starting with an empty set and subsequently adding features is referred to as a bottom-up approach or forward selection. To the contrary, using the full set of features at the beginning and subsequently removing features is called the top-down approach or backward elimination. Randomized methods such as genetic algorithms start somewhere in the middle of the feature space and proceed from this point.
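To make the bottom-up strategy concrete, the following Python sketch outlines plain sequential forward selection. It is an illustration only, not the toolbox implementation mentioned later in the report; the callable `criterion(X, y, subset)` is a hypothetical placeholder for any separability measure or classifier accuracy.

```python
import numpy as np

def sequential_forward_selection(X, y, criterion, d):
    """Greedy bottom-up (forward) feature selection.

    X         : (m, n) data matrix
    y         : (m,) class labels
    criterion : callable(X, y, subset) -> float, higher is better
    d         : desired number of features
    """
    n = X.shape[1]
    selected, remaining = [], list(range(n))
    while len(selected) < d:
        # evaluate the criterion for each candidate extension of the subset
        scores = [criterion(X, y, selected + [j]) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward elimination follows the same pattern, starting from the full set and removing the feature whose deletion degrades the criterion least.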

1.2 AdaBoost

Boosting is a general approach for improving the performance of weak learning algorithms. It effectively transforms (or "boosts") weak methods into strong ones. The term weak learning algorithm means any learning algorithm which is better than a random one (i.e., the prediction error probability is less than 1/2). Although the expression "weak" is used, in practice boosting might be combined with quite strong learning algorithms, such as the C4.5 decision tree algorithm [40], for example.

AdaBoost (from Adaptive Boosting) is a quite recent and powerful machine learning algorithm [43]. A training set of labeled examples is taken as the input of the learning algorithm. The goal is to find a final classifier with a low prediction error rate. AdaBoost may be used in many settings to improve the performance of learning algorithms.

The main idea of AdaBoost is to maintain a distribution of weights over the training set so that each example is associated with a weight. AdaBoost works by repeatedly running a given weak learning algorithm on the training set attached to various distributions of weights. In each such iteration, the weights of incorrectly classified examples are increased. In this way the weak learning algorithm is forced to generate a new classifier according to the new distribution, and the learning process concentrates on the difficult examples (i.e., examples in the training set that are difficult to discriminate). The final strong classifier combines all generated weak classifiers into a single composite decision rule using a weighted majority vote. AdaBoost uses a greedy search strategy to find the best weak classifier.


2 Problem formulation

Let X ⊂ R^n be an n-dimensional observation space and Y the set of labels. Consider a labeled training set of examples {x_i, y_i}, where x_i ∈ X are single samples, y_i ∈ Y are the corresponding class labels, and i = 1, 2, ..., m enumerates the examples. For simplicity, the further work is restricted to a two-class problem (dichotomy). The set of admissible labels is assumed to be Y = {−1, +1} in the rest of the report.

Examples x_i consist of n observations (also called features or variables) of different discriminatory power. However, some of the observations may be redundant, i.e., they do not bring new information, or irrelevant, i.e., they do not bring any information at all. The occurrence of these features usually deteriorates the performance of a subsequent classifier and is more costly in computational and memory resources.

The aim of my PhD research is to find a suitable feature selection technique which utilizes the idea of manipulating the weights distribution, similarly as in the AdaBoost algorithm. The outcome should be a sufficiently reduced subset of features of size d ≤ n, where the redundant and irrelevant information is minimized.

Redundancy means that features are strongly correlated so that they do not contain much additional information, see Figure 3a. Nevertheless, correlation does not always imply absence of information. Consider the example in Figure 3b. Better class separation and noise reduction may be obtained by selecting both variables, even though the observations are correlated, because noise is present. Another example is when the first principal direction of the covariance matrices of the class conditional densities is the same, but the class centers are shifted, similarly as in Figure 3c. Although such features are correlated, they are not redundant, because selecting both provides more discriminatory information.

Irrelevant observations have completely overlapping class conditional probability density functions and therefore do not bring any useful information, see Figure 3d. However, not all features that seem to be irrelevant are completely useless. Figure 3c shows a situation in which the probability density functions are class conditionally dependent. Even though one variable is useless by itself, it may provide a significant performance improvement when taken together with others. Such variables are not irrelevant.

3 State of the art

The earliest approaches emphasized filter methods and used probabilistic separability measures [10] as a criterion. Pudil et al. [36] claim that wrappers are the only promising and correct way of evaluating subsets of features, because the classifier accuracy used as a criterion function captures the probabilistic class separability and the structural errors imposed by the classifier. However, using different classifiers leads to different subsets of features, which makes the feature selection dependent on the specific classifier and the training data. The terms filter and wrapper were introduced by John et al. [21] in 1992.


Figure 3: Two-dimensional scatter plots. Examples of a) redundant, strongly correlated features, b) presumably redundant features, c) class conditionally dependent features, d) irrelevant features.

3.1 Traditional deterministic methods

Pudil et al. [37] describe the historic development of traditional feature selection algorithms. In 1963, Marill and Green [31] introduced a technique nowadays known as Sequential Backward Selection (SBS) that used the divergence distance as a criterion function. Its bottom-up counterpart, termed Sequential Forward Selection (SFS), was proposed by Whitney [58] in 1971. Both methods are optimal within particular steps and thus suffer from the nesting effect caused by their greedy approach. The first attempt dealing with nested subsets of features was presented by Michael and Lin [32] in 1973. Stearns [53] redeveloped this idea into the Plus-l-Minus-r search algorithm (l, r) in 1976. However, this is also a suboptimal method and there is no way to predict the values l and r, corresponding to the numbers of forward and backward steps, that lead to the optimal subset of features. In 1978, Kittler [23] generalized the SBS, SFS and (l, r) search techniques and gave a comparative study of traditional feature selection algorithms. He showed that the generalized methods perform better than the ordinary ones, but only at the expense of computational time.


The Max-Min algorithm (MM), a computationally promising method proposed by Backer and Schipper [2] in 1977, was experimentally shown to be rather unsatisfactory [23]. This technique compares only individual and pairwise merits of features. Cover and Van Campenhout [8] assert that selecting subsets of features in a high-dimensional space based on two-dimensional information measures is not possible without a significant information loss. They also claim that sequential feature selection methods are not guaranteed to find the optimal subset of features without performing an exhaustive search.

In 1977, Narendra and Fukunaga [34] put forward the Branch and Bound algorithm (B&B). This method finds the optimal solution only if the monotonicity of the criterion is guaranteed. Jain [20] claims that techniques based on the Branch and Bound algorithm, when restricted to use with a monotonic criterion, are the only optimal search methods that avoid the exhaustive search. Nevertheless, most commonly used criteria violate monotonicity and the computational time is still enormous for high-dimensional problems. Other graph search procedures may be found, for example, in Ichino and Sklansky [18].

Devijver and Kittler [10] summarized the feature selection issues and gave an overview of traditional search strategies and their properties. Interclass distance measures and probabilistic separability measures are also discussed. Basic information related to feature selection and several improvements of the Branch and Bound algorithm may be found in Fukunaga's book [17].

Probably the most effective sequential suboptimal techniques in terms of computational time and solution optimality are currently the Floating Search methods introduced by Pudil et al. [36, 37] in 1994. The nesting effect is efficiently counteracted by applying a number of backtracking steps. The floating methods can also cope with non-monotonic criterion functions. The idea originates from the (l, r) algorithm. The numbers of forward steps l and backward steps r are controlled dynamically within the search, so no parameter setting is necessary. Depending on the direction of the search, the algorithms are called Sequential Forward Floating Search (SFFS) and Sequential Backward Floating Search (SBFS). The floating methods may be useful in situations in which the B&B algorithm cannot be used, due to either criterion non-monotonicity or computational reasons.

Zongker and Jain [60, 19] contrast a wide range of feature selection techniques on an artificial, normally distributed two-class problem [37]. The Mahalanobis distance was used as a criterion function. Their experiments show that the sequential floating methods perform, in terms of classification error, almost as well as the Branch and Bound algorithm and with much less running time. Further, they achieved a significant improvement in the recognition rate of SAR (Synthetic Aperture Radar) satellite images by applying the wrapper feature selection approach combining the SFFS method and a 3-NN classifier.

An adaptive version of the floating search methods was proposed by Somol et al. [52] in 1999. The Adaptive Floating Search methods incorporate generalized strategies into the floating methods. The algorithm performs even better than the floating one, but the running time becomes enormous from a certain stage. Further, in 2000, Somol and Pudil [50] introduced the Oscillating Search algorithms (OS). Assuming an initial subset of features of given cardinality, these methods explore the nearby region in order to improve the criterion value with a subset of the same cardinality. Oscillating methods are independent of the search direction. They are very efficient if used in combination with other subset search methods. The solution is usually better than that found by the floating methods.

3.2 Stochastic methods

In order to escape nested subsets of features and the criterion non-monotonicity inherent in sequential algorithms, Siedlecki and Sklansky [47, 48], in 1989, explored stochastic approaches like genetic algorithms (GA) and Monte Carlo based methods. The main inconvenience of these techniques is that there is no unique solution. Moreover, the methods need a certain amount of parameter tuning, which is not always easy.

Vafaie and De Jong [55] compared GA to SBS in an experiment based on texture images, however, with contradictory results. In 1994, the floating methods were compared to GA by Ferri et al. [13]. The experiments cover a diagnostic problem and a document recognition task. The conclusion is that the traditional approach of selecting features is more successful, especially for high-dimensional problems. The advantage of GA is an efficient search in the near-optimal region. Thus embedding floating techniques into GA, for example to initialize the population, may be very effective.

3.3 Feature selection in machine learning

In the machine learning community, the most discussed filter methods are the Focus algorithm introduced by Almuallim and Dietterich [1] in 1991 and the Relief algorithm presented by Kira and Rendell [22] in 1992. Focus is an exhaustive search method originally defined for binary noise-free data with quasi-polynomial computational time complexity. Relief is a randomized technique imposing an ordering of features. A weight is assigned to each feature to indicate its relevance; thus, redundant information cannot be removed using this kind of algorithm. In order to avoid the randomized character of Relief, John et al. [21] proposed a deterministic version called ReliefD in 1994. There are other variants of Relief improving the performance, the speed, or both [25, 59, 30].

Wrapper techniques are the most recent research direction in machine learning, see Langley [27]. Wrappers often combine simple greedy search strategies with induction algorithms like the k-NN or Naive Bayes classifiers [10, 11] or decision trees like CART, ID3, and C4.5 [7, 39, 40]. Kohavi and John [24] give a comparative study of wrapper techniques using different types of induction algorithms as criterion functions. The experiments are discussed on the UCI Benchmark Repository [4].

3.4 AdaBoost and feature selection

In most papers, AdaBoost is used both for selecting a small number of important features from a very high number of potential candidates and for building the final strong classifier. A weak classifier is usually designed to work in a one-dimensional space. From the viewpoint of feature selection, the conventional AdaBoost algorithm may be interpreted as a wrapper using a forward greedy selection strategy. Because of the greedy heuristic, AdaBoost suffers from the nesting effect. A brief introduction to boosting is given in [43].


The idea of the AdaBoost algorithm was presented by Freund and Schapire in the original paper [15] in 1995. The error bounds and the generalization error are analyzed for the discrete version of the algorithm and its extensions to multi-class and regression problems. In 1996, Freund and Schapire [16] showed the first experiments with the algorithm. They compared boosting with bagging [6] and C4.5 [40] and concluded that boosting performs significantly better than bagging and comparably to C4.5.

Schapire and Singer [44] presented an analysis of AdaBoost in 1999 and derived an improved and more general version of the algorithm. They extended AdaBoost to multi-class and multi-label classification problems and proposed discrete and real versions of AdaBoost.MH (based on the Hamming loss), AdaBoost.MO (based on output codes) and AdaBoost.MR (based on the ranking loss). Their experiments show that the real version of AdaBoost.MH dramatically outperforms the other methods.

The FloatBoost algorithm, presented by Li et al. [28] in 2002, is probably the only boosting method that deals with the nesting effect and criterion non-monotonicity. It combines the floating search methods [36] with AdaBoost. A backtracking mechanism is efficiently used to remove weak hypotheses that cause higher error rates after each iteration of AdaBoost. The resulting classifier consists of fewer weak hypotheses than AdaBoost and achieves similar or better performance, such as lower error rates on both the training and testing data sets. Although this strategy provides an effective feature selection and classifier design, it may cause some problems due to terminating in a local minimum.

In 2004, Sochman and Matas [49] proposed the Totally Corrective Update Step for the discrete AdaBoost. This method also performs a kind of backtracking, because the strength of each single weak classifier in the final decision rule is changed within the learning process. The complexity of the resulting classifier is lower than in ordinary AdaBoost and it achieves similar results.

Boosting has been used in many applications, such as image retrieval and efficient on-line learning [54], robust real-time face detection [57], multi-view face detection [28], accurate location of vessel borders in intra-vascular ultrasound images [38] and many others. An attempt to use AdaBoost purely as a filter method for feature selection was presented by Dash [9] in 2001.

3.5 Toolboxes and data benchmarks

A software package for feature selection in statistical pattern recognition was developed by Somol and Pudil [51]. It contains a number of known optimal and suboptimal traditional feature selection search strategies. The software has the form of an executable 32-bit Windows application. The kernel is written in ANSI C and is connected to a graphical user interface.

Since we have developed the training set manipulation technique in Matlab, we are interested mainly in feature selection methods and learning algorithms implemented for use with Matlab.


Related toolboxes widely employed in pattern recognition are for example:

• STPRtool, a Statistical Pattern Recognition Toolbox for Matlab [14]

The core of the toolbox was developed by Franc and Hlavac at the Czech Technical University in Prague. It contains statistical pattern recognition algorithms described in the monograph by Schlesinger and Hlavac [45]. Algorithms for the analysis of linear discriminant functions, feature extraction, probability distribution estimation and clustering, and Support Vector and other Kernel Machines are included.

The toolbox is downloadable at:

http://cmp.felk.cvut.cz/~xfrancv/stprtool/index.html

• PRTools, a Matlab based toolbox for pattern recognition [12]

This is an object-oriented toolbox for Matlab developed at the Delft University of Technology. It contains a wide range of pattern recognition algorithms for the analysis of linear and non-linear classifiers, feature selection and extraction, combining classifiers, and clustering.

The toolbox is downloadable at:

http://www.prtools.org

It is very difficult to compare different feature selection methods, because there are no standardized benchmark data sets and rules for the evaluation of results. It is not always clear how large the training and testing sets were, it is not known how the model selection was performed, and there is no information about the reliability of given results and how they were achieved.

Problems with most data repositories are that multi-class problems often have to be handled, the data sometimes need preprocessing, or there are binary and real-valued observations or completely missing values. A very clean repository called IDA [3] has been created for this purpose. Nevertheless, the most widely used data sets are Kittler's artificial data [23] and some artificial or real-world data sets from the UCI Benchmark Repository [4].

3.6 Summary

The only optimal traditional feature selection methods that avoid the exhaustive search are techniques based on the Branch and Bound algorithm [17]. However, they are constrained to use with monotonic criteria only and their computational complexity is exponential. Methods based on the Floating Search strategy [37] are probably the most effective sequential suboptimal algorithms dealing with the nesting effect and criterion non-monotonicity.

There are no standardized data benchmarks and rules for comparing miscellaneous feature selection techniques. In most cases, certain data sets from the UCI Benchmark Repository [4] or an artificial problem as used by Kittler [23] are employed to contrast particular algorithms. However, in the UCI Repository, there are few if any redundant and irrelevant observations, which leaves very limited space for improvement.


4 Proposed method

The training set manipulation is a heuristic feature selection technique based on the filter strategy. The method is inspired by the idea of the AdaBoost algorithm, which is to maintain a distribution of weights over the examples in the training set. Our hypothesis is that we can reveal hidden relationships between features and indicate irrelevant and redundant features by an effective modification of the distribution within the selection process.

Consider a labeled training set of examples S = {x_i, y_i}, where each x_i belongs to an n-dimensional observation space X ⊂ R^n, y_i ∈ {−1, +1} are the labels, and i = 1, 2, ..., m enumerates the examples. The algorithm works in a series of iterations denoted by t ∈ N. Let us assign a weight D_t(i) ∈ R^+ to every example x_i and iteration t. The weights have to satisfy $\sum_{i=1}^{m} D_t(i) = 1$ to form a distribution.

4.1 The training set manipulation

The training set manipulation works as follows. The observations from the training set S are projected into a one-dimensional space given by the feature under consideration. The single features are indexed by j ∈ {1, 2, ..., n}. A simple threshold function is employed to assess the discriminatory power of the feature j, but with respect to the distribution of weights D_t. The best achievable classification error for a threshold θ ∈ R is given by the function

$$\varepsilon(\theta, j, t) = \frac{1}{2} - \left| \frac{1}{2} - \sum_{l \in L(\theta, j)} D_t(l) \right| , \qquad (1)$$

where the index set L(θ, j) = {i = 1, 2, ..., m | (y_i = +1 ∧ x_{i,j} < θ) ∨ (y_i = −1 ∧ x_{i,j} > θ)} collects the indexes of the misclassified examples and x_{i,j} denotes the single observation corresponding to the j-th feature of an example x_i.
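The following Python sketch shows one way to evaluate Equation (1) on a weighted training set. It only illustrates the formula, it is not the authors' Matlab code, and the function and variable names are ours.

```python
import numpy as np

def weighted_threshold_error(x_j, y, D, theta):
    """Best achievable error of a threshold classifier on feature j, Eq. (1).

    x_j   : (m,) values of the j-th feature
    y     : (m,) labels in {-1, +1}
    D     : (m,) weights summing to one
    theta : scalar threshold
    """
    # indexes misclassified by the rule "predict +1 when x >= theta"
    misclassified = ((y == +1) & (x_j < theta)) | ((y == -1) & (x_j > theta))
    w = D[misclassified].sum()
    # the absolute value accounts for the reversed parity of the rule
    return 0.5 - abs(0.5 - w)

def best_error(x_j, y, D):
    """Minimum of Eq. (1) over candidate thresholds placed at sample values."""
    thetas = np.unique(x_j)
    return min(weighted_threshold_error(x_j, y, D, t) for t in thetas)
```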

Suppose that no distribution of weights was used. Features with overlapping probability density functions would then be identified as irrelevant by any one-dimensional classifier. However, some of these features may be class conditionally dependent on others and may provide crucial discriminatory information.

The situation is different if the distribution of weights D_t over the training examples is exploited. The values of the weights distribution influence the classification error in Equation (1) that is measured in the distinct feature space dimensions. Only for truly irrelevant features is the error close to 1/2 for all thresholds θ and almost independent of the distribution, whereas for the other features the error changes significantly. Thus, this condition may be used to identify irrelevant features.

Some other hidden relationships among the features may also be exposed. An effective manipulation of the distribution D_t brings new information to the selection process. Let us take the feature k, for example, as the best feature to select in the sense of the highest information content. The idea is to force the feature k to look like an irrelevant observation with respect to the distribution D_{t+1}. Thus, a new weight distribution has to be found so that the classification error given by Equation (1) satisfies

$$\varepsilon(\theta, k, t+1) \approx \frac{1}{2} , \qquad \text{for all } \theta \in \mathbb{R} . \qquad (2)$$

This circumstance may consequently be used to identify redundant information, such as highly correlated features or identical copies up to scale. The error function (1) for these observations appears very similar or identical to that of the feature k, i.e., like an irrelevant observation with respect to the distribution D_{t+1}.

Consider the Artificial problem A, where the feature x1 brings some information, the feature x2 has overlapping class conditional probability density functions, and the feature x3 is an irrelevant observation. The combination {x1, x2} has the highest discriminatory information. Figure 4 shows how the classification error given by Equation (1) may appear as a function of the weights distribution and the threshold. It can be seen that there is a quite significant change for the feature x2 but almost no change for the feature x3.


Figure 4: Artificial problem A, a) scatter plots of two-dimensional data projections, b) the error function (1) displayed for the distribution D_t (gray color) and for the distribution D_{t+1} (black color).

4.2 Selecting the features

The complete algorithm for selecting the features has not been developed yet and remains an open question for further research. From our observations we may state:

• The error function ε given by Equation (1) appears similar for noisy observations and its value is approximately 1/2 for all possible thresholds θ ∈ R. For irrelevant features, the error function is not significantly affected by any change of the weights distribution.

• The currently selected feature looks like an irrelevant observation with respect to the new distribution of weights. Highly correlated features or identical copies up to scale appear the same as the selected one. However, any subsequent modification of the weights distribution affects all features. Thus, a simple criterion considering only similarity to a noisy observation cannot be used throughout the whole selection process.

• For observations that are not correlated with the currently selected feature, the error function (1) is not significantly influenced by the change of the weights distribution. A parallel rather than a sequential search probably has to be employed. By a parallel search, we mean a comparison of the error functions for several features and weights distributions simultaneously.

• One step of the training set manipulation technique allows dependencies between pairs of features to be detected. Dependencies between triplets or larger families of features might be examined sequentially, following the highest variation in the error function (1).

The classification error according to Equation (1) is determined for all features j = 1, ..., n with respect to the distribution D_t. At this stage, the best feature to select in the sense of the highest information content is the one with the smallest error. The feature is indexed by

$$k = \arg\min_{j} \min_{\theta \in \mathbb{R}} \varepsilon(\theta, j, t) . \qquad (3)$$

It is highly probable that the least discriminatory information is contained in the features whose classification error is very close to the 1/2 level for all possible thresholds θ.

To compare the weights distribution change between two steps of the training set manipulation technique, similarity measures like the Kullback-Leibler distance or the Jensen-Shannon divergence [26] may be employed. So far, only the difference

$$\delta_j = \min_{\theta} \varepsilon(\theta, j, t) - \min_{\theta} \varepsilon(\theta, j, t+1) \qquad (4)$$

has been used, where j = 1, ..., n indexes the features and θ ∈ R is a threshold.

In our experiments, the distribution of weights is initially set to be uniform, i.e., D_1(i) = 1/m for the training examples i = 1, ..., m. For a sequential search, the feature with the smallest error is selected first according to Equation (3). Afterwards, the distribution of weights is modified so that Equation (2) is approximately satisfied. The subsequent question is which feature should be selected in the next step. The highest difference in Equation (4) together with the smallest classification error, Equation (3), seem to be the sought clue. It is evident that the weights distribution and the history of the error function (1) should also be considered.
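A minimal sketch of the sequential variant described above is given below. It assumes the `best_error` helper from the earlier sketch and a hypothetical routine `make_irrelevant(X, y, D, k)` that re-weights the examples until Equation (2) approximately holds (for instance by the AdaBoost-based update of Section 4.3); it is meant only to fix the ideas, not as the final algorithm, which is still under development.

```python
import numpy as np

def select_features(X, y, make_irrelevant, n_select):
    """Sequential selection driven by the training set manipulation.

    X               : (m, n) data matrix
    y               : (m,) labels in {-1, +1}
    make_irrelevant : callable(X, y, D, k) -> new weights D_{t+1}
    n_select        : number of features to select
    """
    m, n = X.shape
    D = np.full(m, 1.0 / m)           # D_1(i) = 1/m, uniform start
    selected = []
    for _ in range(n_select):
        # Eq. (3): feature with the smallest weighted threshold error
        errors = np.array([best_error(X[:, j], y, D) for j in range(n)])
        errors[selected] = np.inf      # do not re-select a feature
        k = int(np.argmin(errors))
        selected.append(k)
        # re-weight so that feature k looks irrelevant, Eq. (2)
        D = make_irrelevant(X, y, D, k)
    return selected
```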

4.3 The distribution change by AdaBoost

The training set manipulation technique changes the distribution of weights in such a way that the currently selected feature k appears like an irrelevant observation. We have not found an analytical solution of Equation (2) so far. Nevertheless, the distribution of weights may be modified iteratively. The following mechanism benefits from the properties of the re-weighting scheme employed in the AdaBoost algorithm.

Basically, there are two main variants of the AdaBoost algorithm, the discrete and the real version. Both are described in more detail in Section 5. Note that the notation is very similar to that of the training set manipulation technique.

Consider a weak classifier which generates predictions about observations x with respect to the distribution of weights D_t. Let h_t be the best weak hypothesis about the observations x_i chosen by AdaBoost. Next, the weights distribution changes. The classification error $\varepsilon_t^{t+1}$ of the latest weak hypothesis on the new distribution of weights D_{t+1} is given by the equation

$$\varepsilon_t^{t+1} = \frac{1}{2} \left( 1 - \sum_{i=1}^{m} D_{t+1}(i)\, y_i\, h_t(x_i) \right) = \frac{1}{2} . \qquad (5)$$

This means that a weak classifier generated by AdaBoost at time t is maximally independent of the mistakes made by the weak classifier induced at time t−1. The high error value also assures that the same weak classifier cannot be selected in the next boosting step.

Let us exploit the fact that the classification error of the latest weak hypothesis h_t generated by AdaBoost is 1/2 with respect to the distribution of weights D_{t+1}, see Equation (5). The modification of the AdaBoost algorithm for our purpose consists in generating several new weak classifiers in the same dimension in which the previously selected feature was chosen. Thereby, the classification error (5) will be driven to 1/2 for several different weak classifiers on the current feature.

However, the original discrete version of AdaBoost [15, 43] cannot guarantee that the classification error of all preceding weak classifiers will keep the 1/2 value while the distribution of weights is being changed by a new classifier. The discrete AdaBoost with the Totally Corrective Step [49] or the real version of AdaBoost [44] has much better properties. The behaviour of the error function (1) as a function of the weights distribution is displayed in Figure 5 for all mentioned approaches.

The Totally Corrective Step [49] forces the classification error (5) of all generated weak classifiers h_q to be $\varepsilon_q^{t+1} = 1/2$ for q = 1, ..., t on the latest distribution of weights D_{t+1}. In contrast to the simple discrete version of AdaBoost, the prediction error is straightened up to the 1/2 level for all thresholds θ very quickly when the Totally Corrective Step is applied. The solution is also more stable. Finally, once the distribution of weights has been changed by the series of previous steps, all generated weak classifiers have to be discarded. The reason is that any change of the distribution of weights also affects the prediction error measured on the other features. Hence, keeping the error at the noise level is infeasible for all recently selected weak classifiers together. Because the benefit is taken only from the distribution change, it does not matter that all these weak classifiers are deleted.

As for the real version of AdaBoost [44], the results indicate that the error function (1) approaches the noise level more smoothly and faster compared to the discrete AdaBoost with the Totally Corrective Step. Thus, the real AdaBoost is employed in the experiments.

It is disputable what stopping condition should be used for terminating the weights distribution updating process in one such cycle. Observing a single value, for example the minimal classification error, is not very reliable. On the other hand, taking into account the whole error function (1) may be time consuming. So far, the mean square error

$$e = \sum_{\theta_i} \left( \frac{1}{2} - \varepsilon(\theta_i, k, t+1) \right)^2 \qquad (6)$$

has been used, where θ_i are the discretized threshold values. The updating process terminates after a certain threshold ϑ is reached.
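As an illustration of this cycle, the sketch below repeatedly fits a confidence-rated weak hypothesis restricted to the selected feature k and re-weights the examples until the mean square error (6) drops below a threshold. It reuses `weighted_threshold_error` from the sketch in Section 4.1, and the helper `fit_weak_on_feature` stands for any one-dimensional real-valued hypothesis (e.g., the histogram stump of Section 5.2); both names are ours, and the function is only a concrete candidate for the `make_irrelevant` routine assumed in the earlier selection sketch (the extra arguments can be fixed, e.g., via functools.partial).

```python
import numpy as np

def make_irrelevant(X, y, D, k, fit_weak_on_feature, vartheta=1e-3, max_iter=50):
    """Re-weight the training set until feature k approximately satisfies Eq. (2).

    fit_weak_on_feature : callable(x_k, y, D) -> h, real-valued weak hypothesis
                          evaluated on the training values of feature k
    vartheta            : threshold on the mean square error, Eq. (6)
    """
    x_k = X[:, k]
    thetas = np.unique(x_k)                        # discretized thresholds
    for _ in range(max_iter):
        eps = np.array([weighted_threshold_error(x_k, y, D, t) for t in thetas])
        e = np.sum((0.5 - eps) ** 2)               # Eq. (6)
        if e < vartheta:
            break
        h = fit_weak_on_feature(x_k, y, D)         # predictions h(x_i) on feature k
        D = D * np.exp(-y * h)                     # real AdaBoost style update
        D = D / D.sum()                            # renormalize to a distribution
    return D
```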


Figure 5: Results comparing the error function (1) for different distributions of weights. Artificial problem A, a) scatter plots of two-dimensional data projections, and plots of the error function, where the distribution of weights is changed by means of b) the discrete AdaBoost, c) the discrete AdaBoost with the Totally Corrective Step, and d) the real AdaBoost.

5 Implementation, AdaBoost

The feature selection technique based on the training set manipulation has been developed in Matlab. From the pattern recognition toolboxes, the STPRtool [14] is employed for classifier design and visualizations. Since the toolbox does not contain the feature selection techniques necessary for comparison with the proposed algorithm, we implemented traditional sequential methods like SFS, SBS, (l, r), GSFS, GSBS, SFFS, and SBFS. These methods may be used with any criterion function to guide the search.

The AdaBoost algorithm is also not included in the toolbox. We implemented the discrete and the real version of AdaBoost and the Totally Corrective Step. Since AdaBoost is the key algorithm in the proposed feature selection method, we give some implementation details in the following sections. The discrete and the real version of AdaBoost are discussed and the construction of one-dimensional weak classifiers is shown. The notation is similar to that of the training set manipulation algorithm.

AdaBoost works in a series of iterations denoted by t = 1, 2, ..., T. Let {x_i, y_i} be a labeled training set of examples, where x_i ∈ X are single samples with labels y_i ∈ {−1, +1}, and i = 1, 2, ..., m enumerates the samples. A weight D_t(i) ∈ R^+ is assigned to every example from the training set. These weights have to satisfy $\sum_{i=1}^{m} D_t(i) = 1$ to produce a distribution.

5.1 Discrete AdaBoost

In the discrete version of AdaBoost [15, 43], a weak hypothesis h about a predicted observation x has the form h : X → {−1, +1}. Hence, the resulting predictions are restricted to values matching the labels only. The algorithm is described in Figure 6. For completeness, the Totally Corrective Step [49] is shown in Figure 7.

The weak classifier may be built as follows. Observations x are projected into a specific dimension j by a function f. The one-dimensional weak classifier has the form

$$h_t(x) = \begin{cases} +1 & \text{if } \big(f(x, j) - \theta\big)\, p \ge 0 , \\ -1 & \text{otherwise} , \end{cases} \qquad (7)$$

where θ is a threshold with a certain parity p indicating the direction of the inequality, and t is the AdaBoost iteration number.
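The following sketch implements the decision stump (7) and the search for its best threshold and parity under the current weights; the class and function names are ours, chosen for the illustration.

```python
import numpy as np

class DecisionStump:
    """One-dimensional weak classifier of the form (7)."""

    def __init__(self, feature, theta, parity):
        self.feature, self.theta, self.parity = feature, theta, parity

    def predict(self, X):
        # +1 where (f(x, j) - theta) * p >= 0, -1 otherwise
        return np.where((X[:, self.feature] - self.theta) * self.parity >= 0, 1, -1)

def best_stump(X, y, D):
    """Exhaustively search features, thresholds, and parities for the stump
    with the smallest weighted classification error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for parity in (+1, -1):
                stump = DecisionStump(j, theta, parity)
                err = np.sum(D * (stump.predict(X) != y))
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err
```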

Given: {x_i, y_i}; x_i ∈ X, y_i ∈ {−1, +1}; T ∈ N.
Initialize the weights D_1(i) = 1/m.
For t = 1, ..., T:

• Find $h_t = \arg\min_{h_j \in H} \varepsilon_j$, where $\varepsilon_j = \frac{1}{2}\left[ 1 - \sum_{i=1}^{m} D_t(i)\, y_i h_j(x_i) \right]$.

• If ε_t ≥ 1/2, then stop.

• Set $\alpha_t = \frac{1}{2} \log\left( \frac{1 - \varepsilon_t}{\varepsilon_t} \right)$.

• Update:
  $$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t} ,$$
  where Z_t is a normalization factor.

• (Apply the Totally Corrective Step, see Figure 7.)

Output the strong classifier:
$$H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) .$$

Figure 6: The discrete AdaBoost algorithm.
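For reference, here is a compact Python transcription of Figure 6, reusing the `best_stump` helper from the previous sketch. It follows the update above but is only an illustrative re-implementation, not the Matlab code used in the experiments.

```python
import numpy as np

def discrete_adaboost(X, y, T):
    """Discrete AdaBoost with decision stumps (Figure 6)."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m
    stumps, alphas = [], []
    for _ in range(T):
        h, err = best_stump(X, y, D)         # weak hypothesis with minimal weighted error
        if err >= 0.5:                       # no better than random: stop
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        pred = h.predict(X)
        D = D * np.exp(-alpha * y * pred)    # increase the weights of mistakes
        D = D / D.sum()                      # normalization factor Z_t
        stumps.append(h)
        alphas.append(alpha)

    def strong_classifier(Xq):
        votes = sum(a * h.predict(Xq) for a, h in zip(alphas, stumps))
        return np.sign(votes)

    return strong_classifier
```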


Initialize: D_1 = D_t.
For j = 1, ..., J_max:

• $q_j = \arg\min_{q = 1, \dots, t} \varepsilon_q$.

• If $|\varepsilon_{q_j} - 1/2| < \Delta_{\min}$, then exit the loop.

• Let $\alpha_j = \frac{1}{2} \ln\big( (1 - \varepsilon_{q_j}) / \varepsilon_{q_j} \big)$.

• Update:
  $$D_{j+1}(i) = \frac{D_j(i) \exp(-\alpha_j y_i h_{q_j}(x_i))}{Z_j} .$$

• $\alpha_{q_j} = \alpha_{q_j} + \alpha_j$.

Assign: D_{t+1} = D_j.

Figure 7: The Totally Corrective Step.
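A sketch of the Totally Corrective Step of Figure 7, applied after each boosting round; `stumps` and `alphas` are the lists built by the discrete AdaBoost sketch above, and Δ_min, J_max are the tolerances named in the figure.

```python
import numpy as np

def totally_corrective_step(X, y, D, stumps, alphas, delta_min=1e-3, j_max=100):
    """Re-adjust the coefficients of all previous weak classifiers so that each
    of them has weighted error close to 1/2 on the returned distribution."""
    preds = np.array([h.predict(X) for h in stumps])     # cached predictions
    for _ in range(j_max):
        errors = np.array([np.sum(D * (p != y)) for p in preds])
        q = int(np.argmin(errors))                       # classifier deviating most from 1/2
        if abs(errors[q] - 0.5) < delta_min:             # all errors near 1/2: done
            break
        alpha = 0.5 * np.log((1.0 - errors[q]) / max(errors[q], 1e-12))
        D = D * np.exp(-alpha * y * preds[q])            # re-weight by classifier q
        D = D / D.sum()
        alphas[q] += alpha                               # accumulate its strength
    return D, alphas
```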

5.2 Real AdaBoost

In general, a weak hypothesis may have the form h : X → R, which is employed in the real version of the AdaBoost algorithm [44], see Figure 8. The sign(h(x)) is taken as the predicted label and the magnitude |h(x)| as the confidence of this prediction. The weak hypotheses are constructed in one dimension as the half log-likelihood ratio of the class conditional a posteriori probabilities.

Let D_t be the actual distribution of weights and t the number of the current iteration. A weak hypothesis about an example x is determined as

$$h_t(x) = \frac{1}{2} \ln \frac{P(y = +1 \mid x, D_{t-1})}{P(y = -1 \mid x, D_{t-1})} . \qquad (8)$$

The a posteriori probabilities are estimated using histograms resulting from the weighted voting of the training examples. These histograms are constructed as follows (without loss of generality, the subscripts t are omitted for simpler notation). Suppose the observation space X is partitioned into disjoint bins X_1, ..., X_K for which h(x) = h(x') for all x, x' ∈ X_k, where $\bigcup_{k} X_k = X$ and K is the number of bins. Let c_k = h(x) be the estimated prediction for x ∈ X_k. For each bin k and for each label b ∈ {−1, +1}, let

$$W_b^k = \sum_{i \,:\, x_i \in X_k \wedge y_i = b} D(i) = \Pr_{i \sim D}\left[ x_i \in X_k \wedge y_i = b \right] \qquad (9)$$

be the weight fraction of examples, where $\Pr_{i \sim D}$ means the probability with respect to the distribution D. The optimal prediction value [44] for bin k is given by

$$c_k = \frac{1}{2} \ln \frac{W_{+1}^k}{W_{-1}^k} . \qquad (10)$$

To limit the magnitudes of the predictions, Schapire and Singer [44] suggest using the smoothed values

$$c_k = \frac{1}{2} \ln \frac{W_{+1}^k + \xi}{W_{-1}^k + \xi} \qquad (11)$$

for an appropriately small positive value ξ, for instance ξ = 1/m, where m is the number of training examples.

Depending on the character of the data, the number of bins or the width of individual bins may differ for particular dimensions of the observation space. For a fixed bin width, the number of bins in the histograms may be determined using Sturges' rule [46] for the number of bins,

$$K = 1 + \log_2 m , \qquad (12)$$

where m is the number of training samples. For an n-dimensional observation space, we suggest using a different number of bins in different space dimensions. A suitable number of bins then corresponds to the number of unique values contained in each dimension.
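The following sketch shows how such a confidence-rated, histogram-based weak hypothesis for a single feature might be built from Equations (9)-(11), with the bin count taken from Sturges' rule (12); the function name and interface are ours, not the authors' implementation.

```python
import numpy as np

def histogram_weak_hypothesis(x_j, y, D):
    """Real AdaBoost weak hypothesis on one feature via weighted histograms.

    Returns h(x_i) evaluated on the training values x_j, using the smoothed
    log-ratio (11) of the class weight fractions (9) per bin.
    """
    m = len(x_j)
    K = int(np.ceil(1 + np.log2(m)))                  # Sturges' rule (12)
    edges = np.linspace(x_j.min(), x_j.max(), K + 1)
    bins = np.clip(np.digitize(x_j, edges[1:-1]), 0, K - 1)

    xi = 1.0 / m                                      # smoothing constant
    W_pos = np.zeros(K)
    W_neg = np.zeros(K)
    for k in range(K):
        in_bin = bins == k
        W_pos[k] = D[in_bin & (y == +1)].sum()        # W^k_{+1}, Eq. (9)
        W_neg[k] = D[in_bin & (y == -1)].sum()        # W^k_{-1}, Eq. (9)
    c = 0.5 * np.log((W_pos + xi) / (W_neg + xi))     # smoothed prediction, Eq. (11)
    return c[bins]                                    # h(x_i) for every training sample
```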

Given: {x_i, y_i}; x_i ∈ X, y_i ∈ {−1, +1}; T ∈ N.
Initialize the weights D_1(i) = 1/m.
For t = 1, ..., T:

• Find $h_t = \arg\min_{h^{\dagger}} J\big( H_{t-1}(x) + h^{\dagger}(x) \big)$, where $J(H_t) = \sum_i \exp(-y_i H_t(x_i))$.

• Update: $D_{t+1}(i) = \exp(-y_i H_t(x_i))$, and normalize so that $\sum_{i=1}^{m} D_{t+1}(i) = 1$.

Output the strong classifier:
$$H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} h_t(x) \right) .$$

Figure 8: The real AdaBoost algorithm.
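A sketch of the loop in Figure 8 restricted to histogram stumps, reusing `histogram_weak_hypothesis` from above. In this simplified version, a weak hypothesis is fitted on each feature under the current weights and the one minimizing the exponential loss J is taken; this per-feature organization of the search is our assumption for the illustration, not a statement of the authors' implementation, and the sketch only scores the training set.

```python
import numpy as np

def real_adaboost(X, y, T):
    """Real AdaBoost with histogram-based weak hypotheses (Figure 8, sketch)."""
    m, n = X.shape
    D = np.full(m, 1.0 / m)
    H = np.zeros(m)                                   # strong score H_t(x_i) so far
    for _ in range(T):
        best_h, best_J = None, np.inf
        for j in range(n):                            # greedy choice of the feature
            h = histogram_weak_hypothesis(X[:, j], y, D)
            J = np.sum(np.exp(-y * (H + h)))          # exponential loss of H_{t-1} + h
            if J < best_J:
                best_h, best_J = h, J
        H = H + best_h                                # add the chosen weak hypothesis
        D = np.exp(-y * H)                            # D_{t+1}(i) = exp(-y_i H_t(x_i))
        D = D / D.sum()                               # normalize to a distribution
    return np.sign(H)                                 # training-set predictions
```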

5.3 Decision area

The final strong hypothesis H about an example x ∈ X is determined as the sum of all weak predictions about x in the real AdaBoost, or as a weighted sum of the weak predictions in the discrete AdaBoost. To get a better notion of how the decision regions of the discussed algorithms may look, see Figure 9, where two-dimensional normally distributed data of a two-class decision problem are classified by AdaBoost. It can be seen from the results that there are quite significant discontinuities in the error change for the discrete AdaBoost.


Figure 9: Artificial problem B, decision regions after six boosting steps created by means of a) the discrete AdaBoost, c) the discrete AdaBoost with the Totally Corrective Step, and e) the real AdaBoost; b), d), f) the corresponding classification errors of the weak classifiers as a function of the weights distribution and threshold.


6 Experiments

The following experiments show the properties of the proposed feature selection technique on various types of artificial data. For simplicity, only two-class classification problems in low dimensions are considered. The data samples are normally distributed with the same covariance matrices for both classes. The parameters of the distributions and the relationships between the classes and features are known, which is important for a better understanding of the problem. Having this ground truth, it is possible to say beforehand which observations are irrelevant and redundant, etc.

For comparison, results achieved using the traditional feature selection approach are also provided. Features are selected by the SFFS method with the Mahalanobis distance [11] as a criterion function. The predictive accuracy of the selected subsets of features is measured by the theoretical error, which is determined by means of the Bayes classifier [11].
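For completeness, here is a sketch of a Mahalanobis-distance criterion between the two class means that can be plugged into the forward-selection sketch of Section 1.1. It assumes, as in these experiments, that both classes share a covariance matrix estimated from the data; this is our reading of the setup rather than the exact code used.

```python
import numpy as np

def mahalanobis_criterion(X, y, subset):
    """Squared Mahalanobis distance between class means on a feature subset.

    The squared form is monotone in the distance itself, so it is
    interchangeable as a selection criterion.
    """
    Xs = X[:, subset]
    X_pos, X_neg = Xs[y == +1], Xs[y == -1]
    mu_diff = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    # pooled within-class covariance (equal-covariance assumption)
    cov = 0.5 * (np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False))
    cov = np.atleast_2d(cov)
    return float(mu_diff @ np.linalg.solve(cov, mu_diff))
```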

6.1 Zero correlation and irrelevant observations

This experiment shows the variations in the error function (1) if there is no correlation between the attributes. The data for the Artificial problem C are three-dimensional. Features x1 and x2 are the only informative observations; the feature x3 is irrelevant.

Figure 10a illustrates the situation in which the feature x1 is selected first. Next, the distribution of weights is modified in order to satisfy Equation (2). The feature x2 is selected as the second one and the distribution of weights is modified again. An analogous situation is depicted in Figure 10b, but the selection process is inverted: the feature x2 is selected first and x1 second. It can be seen in both cases that the error function is not significantly influenced by the change of the weights distribution for those features that are not correlated.


Figure 10: Artificial problem C. The error function is displayed for two selection steps, a) the feature x1 is selected first, x2 second, b) the feature x2 is selected first, x1 second.


Subset size   Selected subset   Mahalanobis distance   Prediction accuracy [%]
1             {x1}              0.640                  87.1
2             {x1, x2}          1.203                  93.8
3             {x1, x2, x3}      1.203                  93.8

Table 1: Artificial problem C. Comparison to the SFFS method.

6.2 Redundant features

The effect of the weights distribution change is evaluated for redundant, highly correlated observations in this experiment. Five-dimensional data are considered in the Artificial problem D. Features x1 and x2 are identical copies. Features x3, x4, and x5 are also copies of the feature x1, but with extra noise of different variances added.

Let us select the feature x1. The error functions (1) appear very similar after the weights distribution change for all five observations, see Figure 11. The values of the classification error ε are close to 1/2 for all thresholds θ. The lower the variance of the additional noise, the closer to 1/2 the error is. Some noisy observations may have a bit more discriminatory power when taken with others; nevertheless, very little improvement in the prediction accuracy is achieved by selecting these features, see Table 2.


Figure 11: Artificial problem D. The error function change after selecting the feature x1.


Subset size   Selected subset             Mahalanobis distance   Prediction accuracy [%]
1             {x1}                        0.500                  84.5
2             {x1, x5}                    0.514                  84.7
3             {x1, x4, x5}                0.520                  84.7
4             {x2, x3, x4, x5}            0.551                  85.1
5             {x1, x2, x3, x4, x5}        0.553                  85.1

Table 2: Artificial problem D. Comparison with the SFFS method.

An interesting situation is created in the Artificial problem E. The problem has three dimensions. Features x1 and x2 are not correlated and the observation x3 is built as x3 = x1 + x2. Although the feature x3 is dependent on the other two features, it is not redundant. It may provide crucial discriminatory information, see Table 3. An experiment with such data is shown in Figure 12. Selecting either the feature x1 or x2 also affects the error function (1) of the feature x3.
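For concreteness, a sketch of this dependency structure is given below. The class-conditional parameters of x1 and x2 are assumptions (the report only specifies x3 = x1 + x2), so the exact figures in Table 3 are not reproduced by this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

# Two uncorrelated observations with an assumed class-dependent mean shift.
x1 = rng.normal(0.0, 1.0, n) + 0.7 * (y > 0)
x2 = rng.normal(0.0, 1.0, n) + 0.7 * (y > 0)

# The third observation is built as the sum of the first two, as in
# Artificial problem E; it is dependent on x1 and x2 but, according to
# the report, not redundant.
x3 = x1 + x2

X = np.column_stack([x1, x2, x3])
```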

(Plots of the error function ε(θ, j, t) against the threshold θ for the features x1, x2, and x3; curves for t = 1, 2 in panel a) and for t = 1, 2, 3 in panel b).)

Figure 12: Artificial problem E. The change of the error function after selecting a) the feature x1, b) the features x2 and x1.

Subset size   Selected subset        Mahalanobis distance   Prediction accuracy [%]
1             {x1}                   0.500                  83.8
2             {x1, x2}               1.053                  92.7
3             {x1, x2, x3}           6.053                  100.0

Table 3: Artificial problem E. Comparison with the SFFS method.


6.3 Nested subsets of features

Nested subsets of features are created in the Artificial problem F. A three-dimensional problem is considered. The best subset of two features, which has the highest discriminatory power, is the combination {x1, x3}.

The traditional SFFS technique selects the feature x2 first. The correct solution is found after four iteration steps of the SFFS algorithm. The results of the SFFS feature selection method are shown in Table 4.
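For reference, a minimal sketch of the SFFS search loop (an inclusion step followed by conditional exclusions) is given below. The criterion in the usage line is a toy additive score used only to exercise the search logic; in the report the criterion is the Mahalanobis distance of the candidate subset.

```python
def sffs(features, criterion, target_size):
    """Minimal Sequential Forward Floating Search sketch.
    `criterion(subset)` returns a score to maximise; larger is better."""
    best = {}                        # best (subset, score) found per subset size
    current = frozenset()
    while len(current) < target_size:
        # Inclusion: add the most significant feature.
        f = max(features - current, key=lambda f: criterion(current | {f}))
        current = current | {f}
        if criterion(current) > best.get(len(current), (None, float("-inf")))[1]:
            best[len(current)] = (current, criterion(current))
        # Conditional exclusion: drop the least significant feature while the
        # reduced subset beats the best subset of that size found so far.
        while len(current) > 2:
            g = max(current, key=lambda g: criterion(current - {g}))
            reduced = current - {g}
            if criterion(reduced) > best[len(reduced)][1]:
                current = reduced
                best[len(current)] = (current, criterion(current))
            else:
                break
    return best

# Toy usage: per-feature scores are simply summed, so the floating step never
# triggers here; the call only illustrates how the routine is used.
scores = {0: 0.9, 1: 0.5, 2: 0.7}
print(sffs(set(scores), lambda s: sum(scores[i] for i in s), target_size=3))
```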

If the feature x2 is selected first by the training set manipulation technique, then there is almost no change in the error function (1) for the features x1 and x3, see Figure 13b. However, if the feature x1 is selected first, then the change is quite significant for the feature x3, while there is no change for the feature x2, see Figure 13a.

(Plots of the error function ε(θ, j, t) against the threshold θ for the features x1, x2, and x3; curves for t = 1, 2, 3 in panel a) and for t = 1, 2 in panel b).)

Figure 13: Artificial problem F. The error function is displayed for a) two selection steps – the feature x1 is selected first and x3 second, and b) one selection step – only the feature x2 is selected.

Subset size   Selected subset        Mahalanobis distance   Prediction accuracy [%]
1             {x2}                   0.835                  90.3
2             {x1, x3}               4.343                  99.8
3             {x1, x2, x3}           4.685                  99.9

Table 4: Artificial problem F. Comparison with the SFFS method.

6.4 Analysis of results

The preceding experiments demonstrate that the training set manipulation technique may bring useful information to the feature selection process. Irrelevant and redundant observations, nested subsets of features, or observations with overlapping class conditional probability density functions are identified easily. For comparison, results achieved using the traditional Sequential Floating Forward Search [36] algorithm have also been reported. The performance of the selected subsets of features is expressed by comparison with the theoretical error.

7 Summary and thesis proposal

We have proposed in this report a novel feature selection technique which is based on the training set manipulation. The method maintains a distribution of weights linked to the examples from the training set, similarly as in the AdaBoost algorithm. The main idea is to modify the distribution of weights so that the currently selected feature appears like an irrelevant observation. We have shown in experiments that such a change of the weights distribution makes it possible to reveal hidden relationships between features, because the change of the weights distribution influences all features.

Since we have not found any analytical formula for updating the distribution of weights, the AdaBoost algorithm has been employed. For this purpose, we have given the necessary details about the discrete and the real AdaBoost algorithms and have shown the construction of one-dimensional weak classifiers.
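As a reminder of the mechanism, a minimal sketch of one round of the discrete AdaBoost weight update with a one-dimensional threshold weak classifier follows, in the spirit of Freund and Schapire [16]. It does not reproduce the totally corrective variant or the real AdaBoost discussed earlier in the report.

```python
import numpy as np

def best_stump(x_j, y, w):
    """Threshold and polarity minimising the weighted error of the stump
    h(x) = polarity * sign(x_j - theta)."""
    best = (np.inf, 0.0, 1.0)
    for theta in np.unique(x_j):
        for polarity in (1.0, -1.0):
            pred = polarity * np.where(x_j > theta, 1.0, -1.0)
            err = np.sum(w[pred != y])
            if err < best[0]:
                best = (err, theta, polarity)
    return best

def adaboost_round(x_j, y, w):
    """One discrete AdaBoost round on the feature x_j: fit the stump, then
    up-weight the misclassified samples and renormalise. Afterwards the
    fitted stump has weighted error 1/2, i.e. with respect to the new
    weights the feature looks like an irrelevant observation."""
    err, theta, polarity = best_stump(x_j, y, w)
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
    pred = polarity * np.where(x_j > theta, 1.0, -1.0)
    w_new = w * np.exp(-alpha * y * pred)
    return w_new / w_new.sum(), theta, alpha

rng = np.random.default_rng(1)
x = np.hstack([rng.normal(0.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
y = np.hstack([-np.ones(50), np.ones(50)])
w = np.full(100, 1.0 / 100)
w, theta, alpha = adaboost_round(x, y, w)
```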

The data used in the experiments have been created artificially and contain irrelevant and redundant features, nested subsets of features, correlated and uncorrelated observations, etc. Although the entire feature selection algorithm is not developed yet, the preliminary results achieved on these artificial problems look promising. The proposed method has also been compared with the Sequential Forward Floating Search algorithm, which is commonly understood as the state of the art.

In future research, we would like to study the related issues more theoretically. There are still many open questions that have to be answered as well. In the time given for the PhD study, we would like to investigate the following problems:

• Improvements of the proposed algorithm and potential further investigations
The main open question is the design of the selection algorithm itself. There are several other closely related problems. A proper algorithm for the weights distribution manipulation has to be found. We would like to investigate the use of more accurate similarity measures to compare the progress of the error function (1) change. A suitable condition for terminating the selection process has to be found, and possible backtracking abilities of the algorithm should be explored.

• Experiments on real datasets
We have experimented with low-dimensional, normally distributed artificial data so far. In further research, we would like to explore the performance of the proposed feature selection technique on real-world data sets, ideally with available ground truth, and compare it with other feature selection algorithms.


References

[1] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the 9th National Conference on Artificial Intelligence, volume 2, pages 547–552. AAAI Press, Menlo Park, CA, USA, July 1991.

[2] E. Backer and J. A. D. Schipper. On the max-min approach for feature ordering and selection. In The seminar on pattern recognition, Liege University, Sart-Tilman, Belgium, page 2.4.1, 1977.

[3] IDA benchmark repository used in several boosting, KFD and SVM papers. http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm, June 2004.

[4] UCI Benchmark Repository – a huge collection of artificial and real-world datasets. University of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn, June 2004.

[5] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1996.

[6] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, August 1996.

[7] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, 1994.

[8] T. M. Cover and J. M. V. Campenhout. On the possible orderings in the measurement selection problem. IEEE Transactions on Systems, Man and Cybernetics, 7:657–661, September 1977.

[9] S. Dash. Filters, wrappers and a boosting-based hybrid for feature selection. In Proceedings of the 18th International Conference on Machine Learning, pages 74–81. Morgan Kaufmann Publishers Inc., June 2001.

[10] P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice Hall, 1982.

[11] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, New York, USA, 2001.

[12] R. P. W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, and D. M. J. Tax. PRTools4, A Matlab Toolbox for Pattern Recognition. Delft University of Technology, 2004.

[13] F. Ferri, P. Pudil, M. Hatef, and J. Kittler. Comparative study of techniques for large-scale feature selection. Pattern Recognition in Practice IV, pages 403–413, 1994.

[14] V. Franc and V. Hlavac. Statistical pattern recognition toolbox for Matlab. Research Report CTU–CMP–2004–08, Center for Machine Perception, K13133 FEE Czech Technical University, Prague, Czech Republic, June 2004.


[15] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Unpublished manuscript available electronically at http://www.cs.huji.ac.il/~yoavf/papers/adaboost.ps. An extended abstract appeared in Computational Learning Theory: Second European Conference, EuroCOLT'95, pages 23–37, Springer-Verlag, 1995.

[16] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, pages 148–156, Bari, Italy, January 1996. Morgan Kaufmann.

[17] K. Fukunaga. Introduction to Statistical Pattern Recognition. Computer Science and Scientific Computing. Academic Press, San Diego, California, USA, 2nd edition, 1990.

[18] M. Ichino and J. Sklansky. Optimum feature selection by zero-one programming. IEEE Transactions on Systems, Man and Cybernetics, 14(5):373–746, September 1984.

[19] A. Jain and D. Zongker. Feature selection: Evaluation, application and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2):153–158, February 1997.

[20] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, January 2000.

[21] G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In W. W. Cohen and H. Hirsh, editors, Proceedings of the 11th International Conference on Machine Learning, pages 121–129, San Francisco, CA, 1994. Morgan Kaufmann Publishers.

[22] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the 10th National Conference on Artificial Intelligence, pages 129–134, San Jose, CA, September 1992. MIT Press, Cambridge, MA.

[23] J. Kittler. Feature set search algorithms. Pattern Recognition and Signal Processing, pages 41–60, 1978.

[24] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

[25] D. Koller and M. Sahami. Toward optimal feature selection. In Proceedings of the 13th International Conference on Machine Learning, pages 284–292, Bari, Italy, July 1996. Morgan Kaufmann.

[26] A. Korhonen and Y. Krymolowski. On the robustness of entropy-based similarity measures in evaluation of subcategorization acquisition systems. In Proceedings of the 6th Conference on Natural Language Learning, pages 91–97, Taipei, Taiwan, 2002.


[27] P. Langley. Selection of relevant features in machine learning. In Proceedings of the AAAI Fall Symposium on Relevance, New Orleans, 1994. AAAI Press.

[28] S. Z. Li, L. Zhu, Z.-Q. Zhang, A. Blake, H.-J. Zhang, and H. Shum. Statistical learning of multi-view face detection. In Proceedings of the 7th European Conference on Computer Vision, pages 1–15, Copenhagen, Denmark, May 2002. Springer.

[29] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Press, 1998.

[30] H. Liu, H. Motoda, and L. Yu. Feature selection with selective sampling. In Proceedings of the 19th International Conference on Machine Learning, pages 395–402, San Francisco, CA, USA, July 2002. Morgan Kaufmann Publishers Inc.

[31] T. Marill and D. M. Green. On the effectiveness of receptors in recognition systems. IEEE Transactions on Information Theory, pages 11–17, September 1963.

[32] M. Michael and W. C. Lin. Experimental study of information and inter-intra class distance ratios and feature selection and orderings. IEEE Transactions on Systems, Man and Cybernetics, 3(2):172–181, March 1973.

[33] S. Mukherjee and S. J. Roberts. A theoretical analysis of gene selection. In Proceedings of the IEEE Computer Society Bioinformatics Conference, pages 131–141, Stanford, CA, USA, August 2004. IEEE Computer Society, Los Alamitos, CA, USA.

[34] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, 26:917–922, September 1977.

[35] A. Papoulis. Probability and Statistics. Prentice-Hall, 1990.

[36] P. Pudil, F. J. Ferri, J. Novovicova, and J. Kittler. Floating search methods for feature selection with nonmonotonic criterion functions. In Proceedings of the 12th International Conference on Pattern Recognition, pages 279–283, Jerusalem, Israel, May 1994. IEEE Computer Society Press, Los Alamitos.

[37] P. Pudil, J. Novovicova, and J. Kittler. Floating search methods in feature selection. Pattern Recognition Letters, 15:1119–1125, November 1994.

[38] O. Pujol, M. Rosales, P. Radeva, and E. Nofrerias-Fernandez. Intravascular ultrasound images vessel characterization using AdaBoost. In Proceedings of the 2nd International Workshop on Functional Imaging and Modeling of the Heart, pages 242–251, Lyon, France, June 2003. Springer.

[39] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[40] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, California, 1992.


[41] S. J. Raudys and A. K. Jain. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3):252–264, March 1991.

[42] M. Rogati and Y. Yang. High-performing feature selection for text classification. In Proceedings of the 11th International Conference on Information and Knowledge Management, pages 659–661. ACM, New York, NY, USA, November 2002.

[43] R. E. Schapire. A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, volume 2, pages 1–6, Stockholm, Sweden, July 1999. Morgan Kaufmann, San Francisco, CA.

[44] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 2(37):297–337, 1999.

[45] M. I. Schlesinger and V. Hlavac. Ten Lectures on Statistical and Structural Pattern Recognition. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.

[46] D. W. Scott, editor. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, 1992.

[47] W. Siedlecki and J. Sklansky. On automatic feature selection. International Journal of Pattern Recognition and Artificial Intelligence, 2(2):197–220, June 1988.

[48] W. Siedlecki and J. Sklansky. A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters, 10(5):335–347, November 1989.

[49] J. Sochman and J. Matas. AdaBoost with totally corrective updates for fast face detection. In Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, pages 445–450. IEEE Computer Society, May 2004.

[50] P. Somol and P. Pudil. Oscillating search algorithms for feature selection. In Proceedings of the 15th International Conference on Pattern Recognition, volume 2, pages 2406–2409, Barcelona, Spain, September 2000. IEEE Computer Society.

[51] P. Somol and P. Pudil. Feature selection toolbox. Pattern Recognition, 12(35):2749–2759, December 2002.

[52] P. Somol, P. Pudil, J. Novovicova, and P. Paclik. Adaptive floating search methods in feature selection. Pattern Recognition Letters, 20(11-13):1157–1163, November 1999.

[53] S. D. Stearns. On selecting features for pattern classifiers. In Proceedings of the 3rd International Conference on Pattern Recognition, pages 71–75. IEEE, New York, NY, USA, November 1976.


[54] K. Tieu and P. Viola. Boosting image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 1228–1236, Hilton Head Island, South Carolina, USA, June 2000. IEEE Computer Society.

[55] H. Vafaie and K. D. Jong. Robust feature selection algorithms. In Proceedings of the 5th IEEE International Conference on Tools for Artificial Intelligence, pages 356–363, Boston, MA, 1993. IEEE Press.

[56] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, USA, 1998.

[57] P. Viola and M. Jones. Robust real-time object detection. In Proceedings of the 2nd IEEE Workshop on Statistical and Computational Theories of Vision, pages 1–25, Vancouver, Canada, July 2001. IEEE Computer Society.

[58] A. W. Whitney. A direct method of nonparametric measurement selection. IEEE Transactions on Computers, 20(9):1100–1103, September 1971.

[59] I. H. Witten and E. Frank. Data mining – practical machine learning tools and techniques with JAVA implementations. Morgan Kaufmann, 1999.

[60] D. Zongker and A. Jain. Algorithms for feature selection: An evaluation. In Proceedings of the 13th International Conference on Pattern Recognition, volume 2, pages 18–22. IEEE Computer Society Press, Los Alamitos, CA, USA, August 1996.
