Near-Optimal Scalable Feature Selection
Siggi Olafsson and Jaekyung Yang, Iowa State University
INFORMS Annual Conference
October 24, 2004
Feature Selection
Eliminate redundant/irrelevant features
Reduced dimensionality
Potential benefits:
Simpler models
Faster induction
More accurate prediction/classification
Knowledge obtained from knowing which features are important
Measuring Feature Quality
Find the subset F of features that maximizes some objective, e.g.:
Correlation measures (filter)
Accuracy of a classification model (wrapper)
Information gain, gain ratio, etc. (filter)
No single measure always works best
$$f(F) = \frac{\mathrm{correlation}_{ca}(F)}{1 + \mathrm{correlation}_{aa}(F,F)} \quad \text{(filter)}, \qquad f(F) = \mathrm{accuracy}(F) \quad \text{(wrapper)}$$
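To make the wrapper objective concrete, here is a minimal sketch (illustrative, not from the talk): a feature subset F is encoded as a bit vector over the m features, and f(F) is estimated by the cross-validated accuracy of a classifier trained only on the selected columns. The decision tree and 5-fold setup are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_objective(X, y, subset, cv=5):
    """Estimate f(F): cross-validated accuracy using only the features in F.

    subset: boolean mask of length m over the columns of X.
    """
    subset = np.asarray(subset, dtype=bool)
    if not subset.any():            # treat the empty subset as worthless
        return 0.0
    model = DecisionTreeClassifier(random_state=0)
    # The returned accuracy is a noisy estimate of f(F), which is why
    # the optimization is treated as stochastic later in the talk.
    return cross_val_score(model, X[:, subset], y, cv=cv).mean()
```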
Optimization Approach
Combinatorial optimization problem
Feasible region is {0,1}^m, where m is the number of features
NP-hard
Previous optimization methods applied:
Branch-and-bound
Genetic algorithms & evolutionary search
Single-pass heuristics (a sketch follows below)
Has also been formulated as a mathematical programming problem
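As one illustration of a single-pass heuristic (an illustrative baseline, not the method of this talk): sweep over the features once, keeping each one only if adding it improves the objective. This reuses wrapper_objective from the sketch above.

```python
import numpy as np

def greedy_forward_pass(X, y, objective):
    """Single pass over the m features; no backtracking."""
    m = X.shape[1]
    subset = np.zeros(m, dtype=bool)
    best = 0.0
    for j in range(m):
        subset[j] = True                 # tentatively add feature j
        score = objective(X, y, subset)
        if score > best:
            best = score                 # keep it
        else:
            subset[j] = False            # it did not help; drop it
    return subset, best
```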
New Approach: NP Method
Nested Partitions (NP) method:
Developed for simulation optimization
Particularly effective for large-scale combinatorial-type optimization problems
Accounts for noisy performance measures
NP Method
Maintains a subset of solutions called the most promising region
Partitioning: the most promising region is partitioned into subsets, and the remaining feasible solutions are aggregated
Random sampling: a random sample of solutions is drawn from each subset and used to select the next most promising region
Partitioning Tree
[Figure: a tree over all feature subsets. The root is the set of all subsets; the first level branches on "feature a1 included" vs. "a1 not included", the next level on feature a2, then a3, and so on. From the current most promising region the search either moves to the best subregion or backtracks to the previous region.]
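A compact sketch of how this iteration might look in code (illustrative: the sampling scheme, promising index, and surrounding-region handling are simplified relative to the papers). Features are fixed one per level in a given order; the surrounding region is approximated here by unconstrained random subsets.

```python
import numpy as np

def sample_region(m, fixed, rng):
    """Uniform random subset consistent with the fixed include/exclude decisions."""
    s = rng.random(m) < 0.5
    for i, v in fixed.items():
        s[i] = v
    return s

def np_feature_selection(X, y, objective, order, n=10, seed=0):
    """Sketch of the NP iteration over the feature-inclusion tree."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    fixed = {}                                 # current most promising region
    for _ in range(10 * m):                    # cap; backtracking revisits levels
        if len(fixed) == m:                    # maximum depth reached
            break
        j = order[len(fixed)]                  # feature decided at this depth
        score = {}
        for val in (True, False):              # the two subregions
            score[val] = max(objective(X, y, sample_region(m, {**fixed, j: val}, rng))
                             for _ in range(n))
        # Surrounding region (simplified: unconstrained random subsets).
        surround = max(objective(X, y, rng.random(m) < 0.5) for _ in range(n))
        best_val = max(score, key=score.get)
        if surround > score[best_val] and fixed:
            fixed.popitem()                    # backtrack to previous region
        else:
            fixed[j] = best_val                # move to best subregion
    return np.array([fixed.get(i, False) for i in range(m)])
```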
Intelligent Partitioning
For NP in general:
Partitioning imposes a structure on the search space
Done well, the algorithm converges quickly
For NP for feature selection:
Partitioning is defined by the order of the features
Select the most important feature first, etc.
E.g., rank according to the information gain of the features (entropy partitioning; a sketch follows below)
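A minimal sketch of that ranking, assuming discrete-valued features (for a discrete feature column, mutual_info_score against the class labels equals the feature's information gain):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy_partitioning_order(X, y):
    """Rank features by information gain with the class, most informative first.

    The resulting order defines the NP partitioning: the top-ranked feature
    is fixed at the first level of the tree, the next at the second, etc.
    """
    gains = [mutual_info_score(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(gains)[::-1]
```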
Test Data Sets
Test data sets from UCI Repository
Data Set Instances Features
lymph 148 18
vote 435 16
audiology 226 69
cancer 286 9
kr-vs-kp 3196 36
How Well Does it Work?
Comparison between NP and another well-known heuristic, namely a genetic algorithm (GA):

Data Set    NP Accuracy (%)    GA Accuracy (%)
lymph       85.7               84.3
vote        99.0               94.3
audiology   70.5               69.6
cancer      73.7               73.6
kr-vs-kp    90.7               89.8
How Close to Optimal?
So far, this is a heuristic random search with no performance guarantee.
However, the Two-Stage Nested Partitions (TSNP) method can be shown to obtain near-optimal solutions with high probability:
Ensure that the 'correct choice' is made with probability at least ψ each time
A correct choice means one within an indifference zone of optimal performance
Two-Stage Sampling
Instead of taking a fixed number of samples from each subregion, use statistical selection, e.g., Rinott's procedure:

$$N_j(k) = \max\left\{ n_0 + 1,\; \left\lceil \frac{h^2 S_j^2(k)}{\epsilon^2} \right\rceil \right\}$$

where N_j(k) is the number of samples needed from the j-th region in iteration k, S_j^2(k) is the sample variance estimated from the first phase, h is a constant determined by the desired probability ψ of selecting the correct region, n_0 is the number of sample points in the first phase, and ε is the indifference zone.
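A sketch of the second-stage computation for one region, assuming the Rinott constant h has already been obtained (it is tabulated in the ranking-and-selection literature for a given ψ and number of regions):

```python
import math
import numpy as np

def second_stage_size(first_stage_scores, h, epsilon):
    """Rinott-style total sample size N_j(k) for one region.

    first_stage_scores: the n0 noisy objective values from the first stage.
    h: Rinott constant for the desired correct-selection probability psi.
    epsilon: indifference zone.
    """
    n0 = len(first_stage_scores)
    s2 = np.var(first_stage_scores, ddof=1)   # first-stage sample variance
    return max(n0 + 1, math.ceil(h * h * s2 / (epsilon * epsilon)))
```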
Performance Guarantee
When the maximum depth is reached,

$$\Pr\left( f(T(k)) > f^* - \epsilon \right) \;\ge\; \frac{\psi^m}{\psi^m + (1-\psi)^m}$$

where f* is the optimal performance and T(k) is a random stopping time.
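For intuition, the bound above (as reconstructed here) is easy to evaluate, and even for a modest per-step correct-selection probability it is very close to 1 once m is moderately large. For example, with ψ = 0.9 and m = 18 features (the lymph data):

```python
def success_bound(psi, m):
    """Lower bound on stopping at a near-optimal subset, per the display above."""
    return psi**m / (psi**m + (1 - psi)**m)

print(success_bound(0.9, 18))   # prints 1.0 (the gap is ~7e-18, below double precision)
```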
Scalability
The NP and TSNP were originally conceived for simulation optimization:
Can handle noisy performance measures
More samples are prescribed in noisy regions
Incorrect moves are corrected through the backtracking element (both NP and TSNP)
Can we use a (small) subset of instances instead of all instances?
This is a common approach to increasing the scalability of data mining algorithms, but is it worthwhile here?
Numerical Results: Original NP

Data Set    Fraction    Accuracy    Speed (ms)    Backtracks
vote
100% 93.5±0.4 2820 0.0
80% 92.8±0.6 2766 0.0
40% 92.2±0.5 1694 0.0
20% 92.6±1.3 1065 0.6
10% 92.4±1.0 816 1.6
5% 91.9±1.7 947 13.2
2% 92.6±1.1 1314 90.4
kr-vs-kp
100% 87.9±5.7 107467 0.0
80% 87.3±7.3 87687 0.0
40% 89.8±3.1 47741 0.0
20% 86.1±4.4 19384 0.4
10% 91.1±3.6 11482 0.2
5% 89.0±1.2 7246 1.8
2% 88.6±2.3 7742 25.8
Observations
Using a random sample can improve performance considerably
Evaluation of each sampled feature subset becomes faster
A very small sample degrades performance
There is then too much noise and the method backtracks excessively → more steps
The TSNP would prescribe more samples! The expected number of steps is constant
What is the best fraction R of instances to use in the TSNP?
Optimal Sample for TSNP
If we decrease the sample size, then the computation for each sample point decreases.
However, the sample variance increases and more sample points will be needed.
To find an approximate R*, we thus minimize

$$\lambda\, E[N_k] + (1-\lambda)\, E[T_k \mid N_k]$$

where N_k is the number of sample points needed in each step, E[T_k | N_k] is the computation time given the number of sample points needed, and λ weights the two terms.
Approximating the Variance
Now it can be shown that, approximately,

$$E[N_k] \approx \frac{h^2\, E[S_k^2]}{\epsilon^2}, \qquad E[S_k^2] \approx c_2\, e^{-c_1 R}, \qquad E[T_k \mid N_k] \approx c_0\, m\, R\, N_k$$

where R is the fraction of instances used and c_0, c_1, c_2 are constants.
Optimal Sampling Ratio
Now obtain the optimal sampling ratio:

$$R^* = \frac{1}{c_1}\, \ln\!\left( \frac{\lambda\, c_1\, h^2\, c_2}{(1-\lambda)\, c_0\, m\, \epsilon^2} \right)$$

The constants c_0, c_1, c_2 are estimated from the data, and h, ε, and λ are determined by user preferences.
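A sketch of putting the closed form (as reconstructed above) to work; the numeric constants below are placeholders, since in practice c0, c1, c2 are fit to timing and variance measurements on the data at hand:

```python
import math

def optimal_sampling_ratio(c0, c1, c2, h, epsilon, lam, m):
    """Optimal instance fraction R*, clipped to a usable range."""
    r = (1.0 / c1) * math.log((lam * c1 * h**2 * c2)
                              / ((1.0 - lam) * c0 * m * epsilon**2))
    return min(max(r, 0.0), 1.0)

# Placeholder constants, for illustration only:
print(optimal_sampling_ratio(c0=0.05, c1=8.0, c2=2.0, h=2.8,
                             epsilon=1.0, lam=0.5, m=36))   # ~0.53
```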
Numerical Results
Data Set    Approach    Sample Rate (%)    Accuracy    Speed (ms)    Backtracks
vote
TSNP 16 *93.2 786 0.2
NP 10 92.4 816 1.6
NP 100 93.5 2820 0.0
cancer
TSNP 24 *73.5 *418 2.4
NP 10 72.6 486 7.4
NP 100 73.2 795 0.0
kr-vs-kp
TSNP 3 89.0 *5189 0.0
NP 5 89.0 7246 1.8
NP    100    87.9    107467    1.8

* Statistically better than NP with sampling
Conclusions
Feature selection is integral in data mining
Inherently a combinatorial optimization problem
From a scalability standpoint it is desirable to be able to deal with noisy data
Nested partitions method:
Flexible performance guarantees
Allows for effective use of random sampling
Very good performance on test problems
References
Full papers available:
S. Ólafsson and J. Yang (2004). "Intelligent Partitioning for Feature Selection," INFORMS Journal on Computing, in print.
S. Ólafsson (2004). "Two-Stage Nested Partitions Method for Stochastic Optimization," Methodology and Computing in Applied Probability, 6, 5-27.
J. Yang and S. Ólafsson (2004). "Optimization-Based Feature Selection with Adaptive Instance Sampling," Computers and Operations Research, to appear.