
Page 1: Near-Optimal Scalable Feature Selection

Near-Optimal Scalable Feature Selection

Siggi Olafsson and Jaekyung Yang
Iowa State University

INFORMS Annual Conference

October 24, 2004

Page 2: Near-Optimal Scalable Feature Selection


Feature Selection

- Eliminate redundant/irrelevant features → reduced dimensionality

Potential benefits:
- Simpler models
- Faster induction
- More accurate prediction/classification
- Knowledge obtained from knowing which features are important

Page 3: Near-Optimal Scalable Feature Selection


Measuring Feature Quality

Find the subset F of features that maximizes some objective f(F), e.g.,
- Correlation measures (filter)
- Accuracy of a classification model (wrapper)
- Information gain, gain ratio, etc.

No single measure always works best.

Example objectives:

f(F) = correlation(F, a) / (1 + correlation(F, F))    (filter, where a is the class attribute)

f(F) = accuracy(F)    (wrapper)
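To make these two kinds of objective concrete, here is a minimal Python sketch (not the authors' implementation): the filter score uses a standard CFS-style merit as a stand-in for the correlation measure above, and scikit-learn is assumed to be available for the wrapper.

```python
# Illustrative sketch only: two ways to score a feature subset F (a list of
# column indices).  The filter merit is the standard CFS-style measure, used
# here as a stand-in for the correlation objective; the wrapper objective is
# the cross-validated accuracy of an induced classifier.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def filter_merit(X, y, F):
    """Correlation-based (filter) merit of subset F; y must be numeric labels."""
    k = len(F)
    if k == 0:
        return 0.0
    # mean absolute feature-class correlation
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in F])
    # mean absolute feature-feature correlation
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                        for i in F for j in F if i < j])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def wrapper_accuracy(X, y, F, cv=5):
    """Accuracy (wrapper) objective: cross-validated decision-tree accuracy on F."""
    return cross_val_score(DecisionTreeClassifier(random_state=0), X[:, F], y, cv=cv).mean()
```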

Page 4: Near-Optimal Scalable Feature Selection


Optimization Approach

Combinatorial optimization problem:
- Feasible region is {0,1}^m, where m is the number of features
- NP-hard

Previous optimization methods applied:
- Branch-and-bound
- Genetic algorithms & evolutionary search
- Single-pass heuristics
- Also been formulated as a mathematical programming problem
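For intuition about the size of this feasible region, a tiny illustration (not from the talk): each candidate subset is a 0/1 vector of length m, and the number of such vectors explodes quickly.

```python
# A feature subset is a point in {0,1}^m; with m = 36 (as in kr-vs-kp),
# exhaustive enumeration of all 2^m subsets is already out of reach.
import numpy as np

m = 36
subset = np.zeros(m, dtype=int)
subset[[0, 5, 17]] = 1        # hypothetical subset: include features 1, 6, and 18
print(2 ** m)                 # 68719476736 candidate subsets
```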

Page 5: Near-Optimal Scalable Feature Selection


New Approach: NP Method

Nested Partitions (NP) method:

Developed for simulation optimization

Particularly effective for large-scale combinatorial-type optimization problems

Accounts for noisy performance measures

Page 6: Near-Optimal Scalable Feature Selection


NP Method

Maintains a subset called the most promising region

Partitioning:
- Most promising region partitioned into subsets
- Remaining feasible solutions aggregated

Random sampling:
- Random sample of solutions from each subset
- Used to select the next most promising region

Page 7: Near-Optimal Scalable Feature Selection


Partitioning Tree

[Figure: binary partitioning tree over all feature subsets. At each level one feature is fixed: feature a1 included / a1 not included, then feature a2 included / a2 not included, then feature a3 included / a3 not included, and so on. The current most promising region is one node of the tree; the algorithm either moves to its best subregion or backtracks to the previous region.]
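Putting slides 6 and 7 together, here is a minimal sketch of how the NP loop could look when specialized to feature selection. Everything here is an illustrative assumption rather than the authors' implementation: regions are represented by the set of already-fixed feature decisions, the promising index of a region is taken to be its best sampled objective value, and backtracking returns to the parent region. The `evaluate` callback can be any of the objectives from slide 3.

```python
# Sketch of the Nested Partitions loop specialized to feature selection.
# A region is a dict {feature_index: 0 or 1} of decisions fixed so far;
# partitioning fixes the next feature in `order` to 0 or 1, and the rest of
# the space forms the surrounding region.
import random

def np_feature_selection(m, order, evaluate, n_samples=10, max_iters=200, seed=0):
    """m: number of features; order: feature indices, most important first;
    evaluate(bits) -> objective value of a 0/1 feature vector (higher is better)."""
    rng = random.Random(seed)
    fixed = {}                                    # current most promising region
    stack = []                                    # parent regions, for backtracking
    best_bits, best_val = None, float("-inf")

    def sample(region, surrounding=False):
        """Draw one random feature subset from the region (or from its complement)."""
        while True:
            bits = [rng.randint(0, 1) for _ in range(m)]
            if not surrounding:
                for j, v in region.items():
                    bits[j] = v
                return bits
            if any(bits[j] != v for j, v in region.items()):
                return bits                       # violates at least one fixed decision

    for _ in range(max_iters):
        depth = len(fixed)
        if depth == m:                            # maximum depth: every feature decided
            break
        nxt = order[depth]
        subregions = [{**fixed, nxt: 0}, {**fixed, nxt: 1}]
        candidates = subregions + (["surrounding"] if fixed else [])
        scores = []
        for cand in candidates:
            surround = cand == "surrounding"
            vals = []
            for _ in range(n_samples):
                bits = sample(fixed if surround else cand, surrounding=surround)
                val = evaluate(bits)
                vals.append(val)
                if val > best_val:
                    best_val, best_bits = val, bits
            scores.append(max(vals))              # promising index of the region
        winner = scores.index(max(scores))
        if winner < 2:
            stack.append(fixed)                   # move to the best subregion
            fixed = subregions[winner]
        else:
            fixed = stack.pop() if stack else {}  # backtrack to the previous region
    return best_bits, best_val
```

With `order` taken from an information-gain ranking (next slide) and `evaluate` set to a wrapper or filter score, this captures the move/backtrack behaviour pictured in the tree above.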

Page 8: Near-Optimal Scalable Feature Selection


Intelligent Partitioning

For NP in general:
- Partitioning imposes a structure on the search space
- Done well, the algorithm converges quickly

For NP applied to feature selection:
- Partitioning defined by the order of the features
- Select the most important feature first, etc.
- E.g., rank according to the information gain of the features (entropy partitioning)
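A sketch of how the entropy-based partitioning order might be computed, assuming scikit-learn is available; mutual information is used here as the information-gain-style score and is a stand-in, not necessarily the exact measure used by the authors.

```python
# Rank features so that the most informative one is fixed first in the
# partitioning tree (entropy partitioning).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def entropy_partitioning_order(X, y):
    gain = mutual_info_classif(X, y, random_state=0)   # information-gain-style score
    return list(np.argsort(gain)[::-1])                # highest score first
```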

Page 9: Near-Optimal Scalable Feature Selection


Test Data Sets

Test data sets from UCI Repository

Data Set    Instances  Features
lymph             148        18
vote              435        16
audiology         226        69
cancer            286         9
kr-vs-kp         3196        36

Page 10: Near-Optimal Scalable Feature Selection


How Well Does it Work?

Comparison between NP and another well-known heuristic, namely a genetic algorithm (GA)

Data Set    NP Accuracy  GA Accuracy
lymph              85.7         84.3
vote               99.0         94.3
audiology          70.5         69.6
cancer             73.7         73.6
kr-vs-kp           90.7         89.8

Page 11: Near-Optimal Scalable Feature Selection


How Close to Optimal?

So far, this is a heuristic random search with no performance guarantee

However, the Two-Stage Nested Partitions (TSNP) method can be shown to obtain near-optimal solutions with high probability:
- Ensure that the 'correct choice' is made with probability at least ψ each time
- A 'correct choice' means performance within an indifference zone ε of the optimum

Page 12: Near-Optimal Scalable Feature Selection


Two-Stage Sampling

Instead of taking a fixed number of samples from each subregion, use statistical selection, e.g., Rinott's two-stage procedure:

N_j(k) = max{ n_0 + 1, ⌈ h² S_j²(k) / ε² ⌉ }

where
- N_j(k) = number of samples needed from the j-th region in iteration k
- S_j²(k) = sample variance estimated from the first-stage sample
- h = constant determined by the desired probability ψ of selecting the correct region
- n_0 = number of sample points in the first stage
- ε = the indifference zone
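A small sketch of the second-stage sample-size rule as reconstructed above. The constant h (Rinott's constant for the chosen ψ and n_0) and the indifference zone ε are taken as inputs here; computing h itself is outside the sketch.

```python
# Two-stage sampling sketch: after n0 first-stage evaluations of a region,
# decide how many evaluations that region needs in total.
import math
import statistics

def second_stage_size(first_stage_values, h, eps):
    """first_stage_values: at least two objective values from the first-stage sample of region j."""
    n0 = len(first_stage_values)
    s2 = statistics.variance(first_stage_values)            # sample variance S_j^2(k)
    return max(n0 + 1, math.ceil(h * h * s2 / (eps * eps)))
```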

Page 13: Near-Optimal Scalable Feature Selection


Performance Guarantee

When the maximum depth is reached,

Pr( f(T(k)) ≥ f* − ε ) ≥ ψ^m / ( ψ^m + (1 − ψ)^m )

where
- T(k) = random stopping time (the iteration at which the maximum depth is reached)
- f* = optimal performance

Page 14: Near-Optimal Scalable Feature Selection


Scalability

The NP and TSNP methods were originally conceived for simulation optimization and can handle noisy performance:
- More samples are prescribed in noisy regions
- Incorrect moves are corrected through the backtracking element (both NP and TSNP)

Can we use a (small) subset of instances instead of all instances?
- This is a common approach for increasing the scalability of data mining algorithms, but is it worthwhile here?
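The instance-sampling idea in code form, as a minimal sketch: score each candidate feature subset on a random fraction of the instances rather than on the full data set. The `evaluate` callback stands for any subset-scoring function (e.g. a wrapper accuracy) and is an assumption of this sketch.

```python
# Evaluate a feature subset F on a random fraction of the instances.
import numpy as np

def subsampled_evaluate(X, y, F, fraction, evaluate, seed=0):
    """evaluate(X_sub, y_sub, F) -> objective value computed on the subsample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=max(1, int(fraction * n)), replace=False)
    return evaluate(X[idx], y[idx], F)
```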

Page 15: Near-Optimal Scalable Feature Selection


Numerical Results: Original NP

Data Set   Fraction  Accuracy    Speed (ms)  Backtracks
vote          100%   93.5 ± 0.4        2820         0.0
               80%   92.8 ± 0.6        2766         0.0
               40%   92.2 ± 0.5        1694         0.0
               20%   92.6 ± 1.3        1065         0.6
               10%   92.4 ± 1.0         816         1.6
                5%   91.9 ± 1.7         947        13.2
                2%   92.6 ± 1.1        1314        90.4
kr-vs-kp      100%   87.9 ± 5.7      107467         0.0
               80%   87.3 ± 7.3       87687         0.0
               40%   89.8 ± 3.1       47741         0.0
               20%   86.1 ± 4.4       19384         0.4
               10%   91.1 ± 3.6       11482         0.2
                5%   89.0 ± 1.2        7246         1.8
                2%   88.6 ± 2.3        7742        25.8

Page 16: Near-Optimal Scalable Feature Selection


Observations

Using a random sample can improve performance considerably:
- Evaluation of each sampled feature subset becomes faster

A very small sample degrades performance:
- There is now too much noise and the method backtracks excessively → more steps

The TSNP would prescribe more samples!
- The expected number of steps is constant

What is the best fraction R of instances to use in the TSNP?

Page 17: Near-Optimal Scalable Feature Selection


Optimal Sample for TSNP

If we decrease the sample size, the computation for each sample point decreases. However, the sample variance increases, and more sample points will be needed. To find the approximate R*, we thus minimize

E[N_k] · E[T_k | N_k]

where
- N_k = number of sample points needed in each step
- T_k = computation time given the number of sample points needed
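A purely illustrative sketch of this minimization over the instance-sampling ratio R. The variance and per-step time models below are hypothetical placeholders (in practice they would be fitted from pilot runs), not the expressions derived in the talk; only the overall structure, minimizing the product of the expected sample count and the expected cost, follows the slide.

```python
# Pick the sampling ratio R that minimizes E[N_k] * E[T_k | N_k] on a grid.
import numpy as np

def approx_optimal_ratio(h, eps, var_model, time_model, grid=None):
    """var_model(R) ~ E[S_k^2] and time_model(R) ~ E[T_k | N_k] are user-supplied fits."""
    R = np.linspace(0.01, 1.0, 100) if grid is None else grid
    expected_cost = (h**2 * var_model(R) / eps**2) * time_model(R)   # E[N_k] * E[T_k|N_k]
    return R[np.argmin(expected_cost)]

# Hypothetical models: variance shrinks and per-point cost grows with R.
R_star = approx_optimal_ratio(h=2.5, eps=0.01,
                              var_model=lambda R: 0.02 / R,
                              time_model=lambda R: 5.0 + 40.0 * R)
```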

Page 18: Near-Optimal Scalable Feature Selection


Approximating the Variance

Now it can be shown that

E[N_k] ≈ h² E[S_k²] / ε²

and that the expected sample variance E[S_k²] and the expected computation time per sample point, E[T_k | N_k], can both be approximated as functions of the instance-sampling ratio using constants c_0, c_1, c_2 estimated from the data.

Page 19: Near-Optimal Scalable Feature Selection


Optimal Sampling Ratio

Now obtain the optimal sampling ratio R* in closed form by minimizing E[N_k] · E[T_k | N_k]; the resulting expression involves the constants c_0, c_1, c_2, the quantities h, ε, ψ, and the number of features m.

The constants c_0, c_1, c_2 are estimated from the data, and h, ε, and ψ are determined by user preferences.

Page 20: Near-Optimal Scalable Feature Selection


Numerical Results

Data Set   Approach  Sample Rate (%)  Accuracy  Speed (ms)  Backtracks
vote       TSNP                   16     *93.2         786         0.2
           NP                     10      92.4         816         1.6
           NP                    100      93.5        2820         0.0
cancer     TSNP                   24     *73.5        *418         2.4
           NP                     10      72.6         486         7.4
           NP                    100      73.2         795         0.0
kr-vs-kp   TSNP                    3      89.0       *5189         0.0
           NP                      5      89.0        7246         1.8
           NP                    100      87.9      107467         1.8

* Statistically better than NP with instance sampling

Page 21: Near-Optimal Scalable Feature Selection


Conclusions

- Feature selection is integral to data mining
- Inherently a combinatorial optimization problem
- From a scalability standpoint it is desirable to be able to deal with noisy data
- Nested partitions method:
  - Flexible performance guarantees
  - Allows for effective use of random sampling
  - Very good performance on test problems

Page 22: Near-Optimal Scalable Feature Selection


References

Full papers available:

S. Ólafsson and J. Yang (2004). "Intelligent Partitioning for Feature Selection," INFORMS Journal on Computing, in print.

S. Ólafsson (2004). "Two-Stage Nested Partitions Method for Stochastic Optimization," Methodology and Computing in Applied Probability, 6, 5–27.

J. Yang and S. Ólafsson (2004). "Optimization-Based Feature Selection with Adaptive Instance Sampling," Computers and Operations Research, to appear.