Near-Optimal Scalable Feature Selection
Siggi Olafsson and Jaekyung Yang, Iowa State University
INFORMS Annual Conference
October 24, 2004
Feature Selection
Eliminate redundant/irrelevant features
Reduced dimensionality
Potential benefits:
Simpler models
Faster induction
More accurate prediction/classification
Knowledge obtained from knowing which features are important
Measuring Feature Quality
Find the subset F of features that maximizes some objective, e.g.:
Correlation measures (filter)
Accuracy of a classification model (wrapper)
Information gain, gain ratio, etc. (filter)
No single measure always works best
$$f(F) = \frac{\mathrm{correlation}_{ca}(F)}{1 + \mathrm{correlation}_{aa}(F,F)} \quad \text{(filter)}, \qquad f(F) = \mathrm{accuracy}(F) \quad \text{(wrapper)}$$
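To make the wrapper objective concrete, here is a minimal sketch (illustrative, not from the talk): a feature subset F is encoded as a bit vector over the m features, and f(F) is estimated by the cross-validated accuracy of a classifier trained only on the selected columns. The decision tree and 5-fold setup are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_objective(X, y, subset, cv=5):
    """Estimate f(F): cross-validated accuracy using only the features in F.

    subset: boolean mask of length m over the columns of X.
    """
    subset = np.asarray(subset, dtype=bool)
    if not subset.any():            # treat the empty subset as worthless
        return 0.0
    model = DecisionTreeClassifier(random_state=0)
    # The returned accuracy is a noisy estimate of f(F), which is why
    # the optimization is treated as stochastic later in the talk.
    return cross_val_score(model, X[:, subset], y, cv=cv).mean()
```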
Optimization Approach
Combinatorial optimization problem
Feasible region is {0,1}^m, where m is the number of features
NP-hard
Previous optimization methods applied:
Branch-and-bound
Genetic algorithms & evolutionary search
Single-pass heuristics (a sketch follows below)
Has also been formulated as a mathematical programming problem
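As one illustration of a single-pass heuristic (an illustrative baseline, not the method of this talk): sweep over the features once, keeping each one only if adding it improves the objective. This reuses wrapper_objective from the sketch above.

```python
import numpy as np

def greedy_forward_pass(X, y, objective):
    """Single pass over the m features; no backtracking."""
    m = X.shape[1]
    subset = np.zeros(m, dtype=bool)
    best = 0.0
    for j in range(m):
        subset[j] = True                 # tentatively add feature j
        score = objective(X, y, subset)
        if score > best:
            best = score                 # keep it
        else:
            subset[j] = False            # it did not help; drop it
    return subset, best
```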
New Approach: NP Method
Nested Partitions (NP) method:
Developed for simulation optimization
Particularly effective for large-scale combinatorial-type optimization problems
Accounts for noisy performance measures
NP Method
Maintains a subset of solutions called the most promising region
Partitioning: the most promising region is partitioned into subsets, and the remaining feasible solutions are aggregated
Random sampling: a random sample of solutions is drawn from each subset and used to select the next most promising region
Partitioning Tree
[Figure: a tree over all feature subsets. The root is the set of all subsets; the first level branches on "feature a1 included" vs. "a1 not included", the next level on feature a2, then a3, and so on. From the current most promising region the search either moves to the best subregion or backtracks to the previous region.]
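A compact sketch of how this iteration might look in code (illustrative: the sampling scheme, promising index, and surrounding-region handling are simplified relative to the papers). Features are fixed one per level in a given order; the surrounding region is approximated here by unconstrained random subsets.

```python
import numpy as np

def sample_region(m, fixed, rng):
    """Uniform random subset consistent with the fixed include/exclude decisions."""
    s = rng.random(m) < 0.5
    for i, v in fixed.items():
        s[i] = v
    return s

def np_feature_selection(X, y, objective, order, n=10, seed=0):
    """Sketch of the NP iteration over the feature-inclusion tree."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    fixed = {}                                 # current most promising region
    for _ in range(10 * m):                    # cap; backtracking revisits levels
        if len(fixed) == m:                    # maximum depth reached
            break
        j = order[len(fixed)]                  # feature decided at this depth
        score = {}
        for val in (True, False):              # the two subregions
            score[val] = max(objective(X, y, sample_region(m, {**fixed, j: val}, rng))
                             for _ in range(n))
        # Surrounding region (simplified: unconstrained random subsets).
        surround = max(objective(X, y, rng.random(m) < 0.5) for _ in range(n))
        best_val = max(score, key=score.get)
        if surround > score[best_val] and fixed:
            fixed.popitem()                    # backtrack to previous region
        else:
            fixed[j] = best_val                # move to best subregion
    return np.array([fixed.get(i, False) for i in range(m)])
```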
Intelligent Partitioning
For NP in general:
Partitioning imposes a structure on the search space
Done well, the algorithm converges quickly
For NP for feature selection:
Partitioning is defined by the order of the features
Select the most important feature first, etc.
E.g., rank according to the information gain of the features (entropy partitioning; a sketch follows below)
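A minimal sketch of that ranking, assuming discrete-valued features (for a discrete feature column, mutual_info_score against the class labels equals the feature's information gain):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy_partitioning_order(X, y):
    """Rank features by information gain with the class, most informative first.

    The resulting order defines the NP partitioning: the top-ranked feature
    is fixed at the first level of the tree, the next at the second, etc.
    """
    gains = [mutual_info_score(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(gains)[::-1]
```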
Test Data Sets
Test data sets from UCI Repository
Data Set Instances Features
lymph 148 18
vote 435 16
audiology 226 69
cancer 286 9
kr-vs-kp 3196 36
How Well Does it Work?
Comparison between NP and another well-known heuristic, namely a genetic algorithm (GA):

Data Set    NP Accuracy (%)    GA Accuracy (%)
lymph       85.7               84.3
vote        99.0               94.3
audiology   70.5               69.6
cancer      73.7               73.6
kr-vs-kp    90.7               89.8
How Close to Optimal?
So far, this is a heuristic random search with no performance guarantee.
However, the Two-Stage Nested Partitions (TSNP) method can be shown to obtain near-optimal solutions with high probability:
Ensure that the 'correct choice' is made with probability at least ψ each time
A correct choice means one within an indifference zone of optimal performance
Two-Stage Sampling
Instead of taking a fixed number of samples from each subregion, use statistical selection, e.g., Rinott's procedure:

$$N_j(k) = \max\left\{ n_0 + 1,\; \left\lceil \frac{h^2 S_j^2(k)}{\epsilon^2} \right\rceil \right\}$$

where N_j(k) is the number of samples needed from the j-th region in iteration k, S_j^2(k) is the sample variance estimated from the first phase, h is a constant determined by the desired probability ψ of selecting the correct region, n_0 is the number of sample points in the first phase, and ε is the indifference zone.
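A sketch of the second-stage computation for one region, assuming the Rinott constant h has already been obtained (it is tabulated in the ranking-and-selection literature for a given ψ and number of regions):

```python
import math
import numpy as np

def second_stage_size(first_stage_scores, h, epsilon):
    """Rinott-style total sample size N_j(k) for one region.

    first_stage_scores: the n0 noisy objective values from the first stage.
    h: Rinott constant for the desired correct-selection probability psi.
    epsilon: indifference zone.
    """
    n0 = len(first_stage_scores)
    s2 = np.var(first_stage_scores, ddof=1)   # first-stage sample variance
    return max(n0 + 1, math.ceil(h * h * s2 / (epsilon * epsilon)))
```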
Performance Guarantee
When the maximum depth is reached,

$$\Pr\left( f(T(k)) > f^* - \epsilon \right) \;\ge\; \frac{\psi^m}{\psi^m + (1-\psi)^m}$$

where f* is the optimal performance and T(k) is a random stopping time.
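For intuition, the bound above (as reconstructed here) is easy to evaluate, and even for a modest per-step correct-selection probability it is very close to 1 once m is moderately large. For example, with ψ = 0.9 and m = 18 features (the lymph data):

```python
def success_bound(psi, m):
    """Lower bound on stopping at a near-optimal subset, per the display above."""
    return psi**m / (psi**m + (1 - psi)**m)

print(success_bound(0.9, 18))   # prints 1.0 (the gap is ~7e-18, below double precision)
```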
Scalability
The NP and TSNP were originally conceived for simulation optimization:
Can handle noisy performance measures
More samples are prescribed in noisy regions
Incorrect moves are corrected through the backtracking element (both NP and TSNP)
Can we use a (small) subset of instances instead of all instances?
This is a common approach to increasing the scalability of data mining algorithms, but is it worthwhile here?
Numerical Results: Original NP

Data Set    Fraction    Accuracy    Speed (ms)    Backtracks
vote
100% 93.5±0.4 2820 0.0
80% 92.8±0.6 2766 0.0
40% 92.2±0.5 1694 0.0
20% 92.6±1.3 1065 0.6
10% 92.4±1.0 816 1.6
5% 91.9±1.7 947 13.2
2% 92.6±1.1 1314 90.4
kr-vs-kp
100% 87.9±5.7 107467 0.0
80% 87.3±7.3 87687 0.0
40% 89.8±3.1 47741 0.0
20% 86.1±4.4 19384 0.4
10% 91.1±3.6 11482 0.2
5% 89.0±1.2 7246 1.8
2% 88.6±2.3 7742 25.8
Observations
Using a random sample can improve performance considerably
Evaluation of each sampled feature subset becomes faster
A very small sample degrades performance
There is then too much noise and the method backtracks excessively → more steps
The TSNP would prescribe more samples! The expected number of steps is constant
What is the best fraction R of instances to use in the TSNP?
Optimal Sample for TSNP
If we decrease the sample size, then the computation for each sample point decreases.
However, the sample variance increases and more sample points will be needed.
To find an approximate R*, we thus minimize

$$\lambda\, E[N_k] + (1-\lambda)\, E[T_k \mid N_k]$$

where N_k is the number of sample points needed in each step, E[T_k | N_k] is the computation time given the number of sample points needed, and λ weights the two terms.
Approximating the Variance
Now it can be shown that, approximately,

$$E[N_k] \approx \frac{h^2\, E[S_k^2]}{\epsilon^2}, \qquad E[S_k^2] \approx c_2\, e^{-c_1 R}, \qquad E[T_k \mid N_k] \approx c_0\, m\, R\, N_k$$

where R is the fraction of instances used and c_0, c_1, c_2 are constants.
Optimal Sampling Ratio
Now obtain the optimal sampling ratio:

$$R^* = \frac{1}{c_1}\, \ln\!\left( \frac{\lambda\, c_1\, h^2\, c_2}{(1-\lambda)\, c_0\, m\, \epsilon^2} \right)$$

The constants c_0, c_1, c_2 are estimated from the data, and h, ε, and λ are determined by user preferences.
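A sketch of putting the closed form (as reconstructed above) to work; the numeric constants below are placeholders, since in practice c0, c1, c2 are fit to timing and variance measurements on the data at hand:

```python
import math

def optimal_sampling_ratio(c0, c1, c2, h, epsilon, lam, m):
    """Optimal instance fraction R*, clipped to a usable range."""
    r = (1.0 / c1) * math.log((lam * c1 * h**2 * c2)
                              / ((1.0 - lam) * c0 * m * epsilon**2))
    return min(max(r, 0.0), 1.0)

# Placeholder constants, for illustration only:
print(optimal_sampling_ratio(c0=0.05, c1=8.0, c2=2.0, h=2.8,
                             epsilon=1.0, lam=0.5, m=36))   # ~0.53
```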
Numerical Results
Data Set    Approach    Sample Rate (%)    Accuracy    Speed (ms)    Backtracks
vote
TSNP 16 *93.2 786 0.2
NP 10 92.4 816 1.6
NP 100 93.5 2820 0.0
cancer
TSNP 24 *73.5 *418 2.4
NP 10 72.6 486 7.4
NP 100 73.2 795 0.0
kr-vs-kp
TSNP 3 89.0 *5189 0.0
NP 5 89.0 7246 1.8
NP    100    87.9    107467    1.8

* Statistically better than NP with sampling
Conclusions
Feature selection is integral in data mining
Inherently a combinatorial optimization problem
From a scalability standpoint it is desirable to be able to deal with noisy data
Nested partitions method:
Flexible performance guarantees
Allows for effective use of random sampling
Very good performance on test problems
References
Full papers available:
S. Ólafsson and J. Yang (2004). "Intelligent Partitioning for Feature Selection," INFORMS Journal on Computing, in print.
S. Ólafsson (2004). "Two-Stage Nested Partitions Method for Stochastic Optimization," Methodology and Computing in Applied Probability, 6, 5-27.
J. Yang and S. Ólafsson (2004). "Optimization-Based Feature Selection with Adaptive Instance Sampling," Computers and Operations Research, to appear.