AN INTRODUCTION TO VARIABLE AND FEATURE SELECTION
Meoni Marco – UNIPI – March 30th 2016
Isabelle Guyon Clopinet
André Elisseeff Max Planck Institute for Biological Cybernetics
PhD course in Optimization for Machine Learning
Definition and Goal • Variable/Attribute/Dimension/Feature Selection/Reduction
• “variables”: the raw input variables • “features”: variables constructed from the input variables
• Select a subset of features relevant to the learning algorithm • Given a set of features $F = \{f_1, \ldots, f_i, \ldots, f_n\}$, find a subset $F' \subseteq F$ that “maximizes the learner's ability to classify patterns”
• Model simplification, to make the model easier for users to interpret • Shorter training time, to improve the learning algorithm's performance • Enhanced generalization, to limit overfitting
Feature Selection in Biology • Monkey performing classification task
• Diagnostic features: eye separation and height • Non-Diagnostic features: mouth height, nose length
Feature Selection in Machine Learning • Information about the target class is intrinsic in the variables • More info does not mean more discrimination power • Dimensionality and Performance
- Required #samples grows exponentially with #variables - Classifier’s performance degrades for a large number of features
Variable Ranking - Scoring • Order a set of features F by the value of a scoring function
S(fi) computed from the training data
• Select the k highest ranked features according to S • Computationally efficient: only calculation and sorting of n scores • Statistically robust against overfitting, low variance
$$F' = \{f_{i_1}, \ldots, f_{i_j}, \ldots, f_{i_n}\}, \qquad S(f_{i_j}) \ge S(f_{i_{j+1}}), \; j = 1, \ldots, n-1$$
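A minimal sketch of this ranking procedure in Python (NumPy assumed); `rank_features` and `score_fn` are hypothetical names, not part of the original slides:

```python
import numpy as np

def rank_features(X, y, score_fn, k):
    """Rank the n features of X by a scoring function and keep the top k.

    X: (m, n) data matrix; y: (m,) target; score_fn(column, y) -> float.
    Returns indices i_1, ..., i_k with S(f_i1) >= ... >= S(f_ik).
    """
    scores = np.array([score_fn(X[:, i], y) for i in range(X.shape[1])])
    order = np.argsort(scores)[::-1]  # sort the n scores in descending order
    return order[:k]
```

The whole cost is n score evaluations plus one sort, which is what makes ranking computationally efficient.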
Variable Ranking - Correlation • Criterion to detect linear dependency between a feature and the target
• Pearson correlation coefficient:
$$R(f_i, y) = \frac{\operatorname{cov}(f_i, y)}{\sqrt{\operatorname{var}(f_i)\,\operatorname{var}(y)}}, \qquad R(X_i, Y) \in [-1, 1]$$
• Estimate for m samples:
$$R(f_i, y) = \frac{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)^2 \, \sum_{k=1}^{m} (y_k - \bar{y})^2}}$$
• Higher correlation means higher score • Mostly used as R(xi,y)² or |R(xi,y)|
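A minimal sketch of this score, assuming NumPy; `pearson_score` is a hypothetical helper that could plug into a ranking routine such as the one sketched earlier:

```python
import numpy as np

def pearson_score(x, y):
    """Ranking score R(f_i, y)^2 for one feature column x and target y."""
    xc, yc = x - x.mean(), y - y.mean()
    r = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return r ** 2  # squared correlation, in [0, 1]
```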
Variable Ranking – Single Var Classifier • Select variables according to individual predictive power • Performance of a classifier built with 1 variable
• e.g. use the value of the variable itself as the prediction (set a threshold on its values) • Performance usually measured in terms of error rate (or criteria based on false positive/false negative rates, …)
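A minimal sketch of such a single-variable threshold classifier, assuming a binary target in {0, 1}; `threshold_error` is a hypothetical name:

```python
import numpy as np

def threshold_error(x, y):
    """Error rate of the best threshold classifier built on one variable x.

    Every observed value of x is tried as a threshold, in both orientations.
    """
    best = 1.0
    for t in np.unique(x):
        pred = (x > t).astype(int)
        # error of pred, and of the flipped classifier 1 - pred
        err = min(np.mean(pred != y), np.mean(pred == y))
        best = min(best, err)
    return best  # rank features by ascending error rate
```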
Variable Ranking – Mutual Information • Empirical estimates of the mutual information between a feature and the target:
$$I(x_i, y) = \int_{x_i} \int_{y} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\,p(y)} \, dx \, dy$$
• If the variables are discrete (probabilities estimated from frequency counts):
$$I(x_i, y) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \log \frac{P(X = x_i, Y = y)}{P(X = x_i)\,P(Y = y)}$$
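A minimal sketch of the discrete estimate above, assuming NumPy; `mutual_information` is a hypothetical helper:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information of two discrete variables,
    with probabilities estimated from frequency counts."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))  # joint P(X = xv, Y = yv)
            px, py = np.mean(x == xv), np.mean(y == yv)  # marginals
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi  # in nats; higher means more shared information
```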
Questions • Correlation between a variable and the target is not enough to assess relevance • Do not discard variables with small (seemingly redundant) scores • Variables with low individual scores can be useful in combination with others
Feature Subset Selection • Requirements:
• Scoring function to assess the optimal feature subset • Strategy to search the space of possible feature subsets
• finding the optimal feature subset for an arbitrary target is NP-hard
• Methods: • Filters • Wrappers • Embedded
Feature Subset Selection - Filters • Select subsets of variables as a pre-processing step,
independently of the chosen classifier • Variable ranking with a score function is a filter method
• Fast • Generic selection of features, not optimized for the classifier in use • Sometimes used as a pre-processing step for other methods
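One common way to realize a filter as a pre-processing step, assuming scikit-learn is available (the slides do not prescribe any library); the ANOVA F-test score, k=10, and the SVM are illustrative choices:

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# The filter scores each feature independently (ANOVA F-test here)
# and keeps the 10 best before any classifier is involved.
model = make_pipeline(SelectKBest(f_classif, k=10), SVC())
# model.fit(X_train, y_train)  # the selection is generic, not tuned to the SVM
```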
Feature Subset Selection - Wrappers • Score feature subsets based on learner predictive power • Heuristic search strategies:
• Forward selection: start with empty set and add features at each step • Backward elimination: start with full set and discard features at each step
• Predictive power measured on a validation set or by cross-validation • Pro: treating the learner as a black box keeps wrappers simple and universal • Cons: requires a large amount of computation and risks overfitting
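A minimal sketch of forward selection as described above, assuming scikit-learn for cross-validation; `forward_selection` and the fixed 5-fold split are illustrative assumptions:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def forward_selection(estimator, X, y, k):
    """Greedy forward selection: at each step, add the feature whose
    inclusion most improves the learner's cross-validated score."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        cv = [cross_val_score(clone(estimator), X[:, selected + [j]], y,
                              cv=5).mean() for j in remaining]
        best = remaining[int(np.argmax(cv))]
        selected.append(best)
        remaining.remove(best)
    return selected  # the learner is used as a black box throughout
```

Backward elimination is the mirror image: start from the full set and, at each step, drop the feature whose removal hurts the score least.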
Feature Subset Selection - Embedded • Performs feature selection during training • Nested Subset Methods
• Guide the search by predicting the changes in the objective function value when moving in the space of variable subsets:
1. Finite difference method: differences calculated without retraining new models for each candidate variable
2. Quadratic approximation of the cost function: used for backward elimination of variables
3. Sensitivity of the objective function: used to devise a forward selection procedure
• Direct Objective Optimization • Formalize the objective function of variable selection and jointly optimize:
1. the goodness-of-fit (to be maximized)
2. the number of variables (to be minimized)
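The slides do not single out one algorithm for direct objective optimization; an l1-penalized linear model (Lasso) is one standard instance, sketched here under the assumption that scikit-learn is available:

```python
import numpy as np
from sklearn.linear_model import Lasso

# The l1 penalty puts both goals into a single objective:
# minimize squared error (goodness-of-fit) + alpha * ||w||_1 (sparsity).
lasso = Lasso(alpha=0.1)  # alpha trades fit against the number of variables kept
# lasso.fit(X, y)
# selected = np.flatnonzero(lasso.coef_)  # features with non-zero weight survive
```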
Feature Selection - Summary • Feature selection can increase the performance of learning algorithms
• Both accuracy and computation time, but it is not easy • Ranking criteria for features
• Don't automatically discard variables with small scores • Filters, Wrappers, Embedded Methods
• How to search the space of all feature subsets? • How to assess the performance of a learner that uses a given feature subset?
Feature Extraction • Feature Selection:
$$\{f_1, \ldots, f_i, \ldots, f_n\} \xrightarrow{\text{f. selection}} \{f_{i_1}, \ldots, f_{i_j}, \ldots, f_{i_m}\}$$
• Feature Construction:
$$\{f_1, \ldots, f_i, \ldots, f_n\} \xrightarrow{\text{f. extraction}} \{g_1(f_1, \ldots, f_n), \ldots, g_j(f_1, \ldots, f_n), \ldots, g_m(f_1, \ldots, f_n)\}$$
Feature Construction • Goal: reduce data dimensionality • Methods
• Clustering: replace a group of “similar” variables by a cluster centroid (K-means, Hierarchical clustering)
• Linear transform of input variables (PCA/SVD, LDA) • Matrix factorization of variable subsets
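A brief sketch of both construction routes, assuming scikit-learn; the component and cluster counts are illustrative:

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Linear transform: each constructed feature g_j is a linear
# combination of the original variables (a principal component).
pca = PCA(n_components=10)
# Z = pca.fit_transform(X)

# Clustering of variables: group similar columns of X with K-means
# and replace each group by its centroid profile.
# km = KMeans(n_clusters=10).fit(X.T)
# Z = km.cluster_centers_.T  # shape (m, 10): one constructed feature per cluster
```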
Validation Methods • Issues in Generalization Prediction and Model Selection
• Determine the number of variables that are “significant” • Guide and halt the search for good variable subsets • Choose hyper-parameters • Evaluate the final performance of the system
• Model Selection • Compare training errors with statistical tests (Rivals & Personnaz 2003) • Estimate confidence intervals of the generalization error (Bengio & Chapados 2003) • Choose what fraction of the data to split off (leave-one-out cross-validation, Monari & Dreyfus 2000)
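A minimal sketch of one such step, choosing the number of “significant” top-ranked variables by cross-validation; the function name and the fixed 5-fold split are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def choose_num_features(estimator, X, y, ranking, candidates):
    """Pick how many top-ranked variables to keep by cross-validated score.

    ranking: feature indices sorted best-first; candidates: subset sizes to try.
    """
    scores = [cross_val_score(estimator, X[:, ranking[:k]], y, cv=5).mean()
              for k in candidates]
    return candidates[int(np.argmax(scores))]
```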
Advanced Topics & Open Problems • Variance of Variable Subset Selection
• Methods sensitive to perturbations of the experimental conditions • Variance is often the symptom of a model that does not generalize
• Variable Ranking in the Context of Others • Ranking a subset may infer different criteria than a single variable
• Forward vs Backward • Depending on applications
Advanced Topics & Open Problems • Multi-class Problem
• Some variable selection methods handle the multi-class setting directly rather than decomposing it into several two-class problems
• Methods based on mutual information criteria extend to this case • Inverse Problems
• Reverse engineering: find the reasons from results of a predictor • E.g. identify factors that triggered a disease
• Key issue: distinguishing correlation from causality • Method: use variables discarded by variable selection as additional outputs of a neural network
THANK YOU!