Feature Selection
CE-725: Statistical Pattern Recognition
Sharif University of Technology, Spring 2013
Soleymani
Outline
- Dimensionality reduction
- Feature selection vs. feature extraction
- Filter univariate methods
- Multivariate filter and wrapper methods
- Evaluation criteria
- Search strategies
Feature Selection
Data may contain many irrelevant and redundant variables, and often comparably few training examples.
Applications for which datasets with tens or hundreds of thousands of variables are available: text or document processing, gene expression array analysis.
[Figure: data matrix with N samples as rows and d features as columns]
Feature Selection
A way to select the most relevant features, to obtain more accurate, faster, and easier-to-understand classifiers.
- Performance: enhancing generalization ability by alleviating the effect of the curse of dimensionality; the higher the ratio of the number of training patterns to the number of free classifier parameters, the better the generalization of the learned classifier.
- Efficiency: speeding up the learning process.
- Interpretability: resulting in a model that is easier to understand.
Peaking Phenomenon
If the class-conditional pdfs are unknown: for a finite number of training samples $N$, as the number of features $d$ increases, the probability of error initially decreases, but after a critical point it increases.
The minimum occurs around $d = N/\alpha$, where usually $\alpha \in [2, 10]$.
[Figure: error probability vs. number of features for different values of N]
[Theodoridis and Koutroumbas, 2008]
Noise (or Irrelevant) Features
Eliminating irrelevant features can decrease the classification error on test data.
[Figure: SVM decision boundary with and without a noise feature]
Feature Selection
Supervised feature selection: given a labeled set of data points, select a subset of features for data representation. If class information is not available, unsupervised feature selection approaches are used instead [we will discuss supervised ones in this lecture].

$$X = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$

$x_{i_1}, x_{i_2}, \dots, x_{i_{d'}}$: the selected features ($d' < d$).
Dimensionality Reduction: Feature Selection vs. Feature Extraction
- Feature selection: select a subset of a given feature set.
- Feature extraction (e.g., PCA, LDA): a linear or non-linear transform on the original feature space.

Feature selection: $(x_1, \dots, x_d)^T \rightarrow (x_{i_1}, \dots, x_{i_{d'}})^T$, with $d' < d$.
Feature extraction: $(x_1, \dots, x_d)^T \rightarrow (y_1, \dots, y_{d'})^T = f\big((x_1, \dots, x_d)^T\big)$.
Gene Expression
Usually a small number of examples (patients), but the number of features (genes) ranges from 6,000 to 60,000.
[Golub et al., Science, 1999]
[Figure: leukemia diagnosis with labels +1/-1 and $d'$ selected features]
Face Male/Female Classification
$N = 1000$ training images; $d = 60 \times 85 = 5100$ features.
[Figure: pixels selected by Relief and Simba when choosing 100, 500, and 1000 features]
[Navot, Bachrach, and Tishby, ICML, 2004]
Good Features
- Discriminative: different values for samples of different classes, and almost the same values for very similar samples.
- Interpretable: easier for human experts to understand (corresponding to object characteristics).
- Representative: provide a concise description.
Some Definitions
One categorization of feature selection methods:
- Univariate method: considers one variable (feature) at a time.
- Multivariate method: considers subsets of features together.

Another categorization:
- Filter method: ranks features or feature subsets independently of the classifier, as a preprocessing step.
- Wrapper method: uses a classifier to evaluate the score of features or feature subsets.
- Embedded method: feature selection is done during the training of a classifier, e.g., by adding a regularization term to the cost function of linear classifiers.
Filter – Univariate Method
- The score of each feature is found based on the corresponding column of the data matrix and the label vector.
- Rank features according to their score values and select the ones with the highest scores.

Advantage: computational and statistical scalability.
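The rank-and-select step can be sketched as follows. This is a minimal illustration, not from the slides; `score_fn` stands for any univariate criterion (such as the correlation, mutual information, or t-test scores discussed in this lecture):

```python
def rank_features(X, y, score_fn, k):
    """Score each feature (column of X) with score_fn and return the
    indices of the k highest-scoring features.

    X: list of samples, each a list of d feature values.
    y: list of labels (passed through to score_fn).
    score_fn: callable(column, y) -> float, higher meaning more relevant.
    """
    d = len(X[0])
    # score the j-th feature using only its column and the labels
    scores = [score_fn([row[j] for row in X], y) for j in range(d)]
    # sort feature indices by descending score and keep the top k
    return sorted(range(d), key=lambda j: scores[j], reverse=True)[:k]
```

A scoring function is plugged in per criterion; the ranking itself never looks at feature interactions, which is exactly the limitation the multivariate methods later in the lecture address.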
Filter – Univariate Method
Filter methods are usually used for the univariate case: when a feature is considered irrespective of the others, it is not very meaningful to train a classifier just to evaluate it.

Criteria for feature selection:
- Relevance of the feature for predicting labels: can the feature discriminate the patterns of different classes?
- Reducing redundancy: do not select features that are redundant with the previously selected variables.
Irrelevance
- $X_j$: random variable corresponding to the $j$-th component of the input feature vectors.
- $Y$: random variable corresponding to the labels.

The feature $X_j$ is irrelevant for predicting $Y$ (two-class case, $C = 2$) if
$$P(X_j \mid Y = 1) = P(X_j \mid Y = -1)$$

Using the KL divergence to find a distance between $P(X_j \mid Y = 1)$ and $P(X_j \mid Y = -1)$:
$$D_j = KL\big(P(X_j \mid Y = 1) \,\|\, P(X_j \mid Y = -1)\big) + KL\big(P(X_j \mid Y = -1) \,\|\, P(X_j \mid Y = 1)\big)$$

[Figure: class-conditional densities for Y = 1 and Y = -1]
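The symmetrized KL score above can be sketched for discrete (e.g., histogram-estimated) class-conditional distributions. This is an illustrative sketch, not the slides' code; the small `eps` is an assumption added to avoid division by zero on empty bins:

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence D(p||q) + D(q||p) between two discrete
    distributions given as equal-length lists of probabilities.
    A feature is irrelevant when the two class-conditionals coincide,
    which drives this score to zero."""
    def kl(a, b):
        # D(a||b) = sum_i a_i * log(a_i / b_i), with eps guarding zeros
        return sum(ai * math.log((ai + eps) / (bi + eps))
                   for ai, bi in zip(a, b))
    return kl(p, q) + kl(q, p)
```

Identical conditionals give a score of zero (irrelevant feature); the further apart the two densities, the larger the score.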
Pearson Correlation Criteria

$$R(j) = \frac{\mathrm{cov}(X_j, Y)}{\sqrt{\mathrm{var}(X_j)\,\mathrm{var}(Y)}} \approx \frac{\sum_{i=1}^{N} \big(x_j^{(i)} - \bar{x}_j\big)\big(y^{(i)} - \bar{y}\big)}{\sqrt{\sum_{i=1}^{N} \big(x_j^{(i)} - \bar{x}_j\big)^2 \sum_{i=1}^{N} \big(y^{(i)} - \bar{y}\big)^2}}$$

[Figure: two features with $|R(1)| \gg |R(2)|$]
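The empirical estimate above translates directly into code. A minimal sketch (features are usually ranked by $|R(j)|$, so the absolute value is taken here):

```python
import math

def pearson_score(x, y):
    """Absolute empirical Pearson correlation |R(j)| between one feature
    column x and the label vector y, following the sample estimate of
    cov(X_j, Y) / sqrt(var(X_j) var(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    return abs(cov / math.sqrt(vx * vy))
```

A perfectly (anti-)correlated feature scores 1; a feature whose values carry no linear relation to the labels scores near 0.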
Univariate Dependence
Independence: $P(X_j, Y) = P(X_j)P(Y)$.

Mutual information as a measure of dependence:
$$I(X_j; Y) = \sum_{x_j} \sum_{y} P(x_j, y) \log \frac{P(x_j, y)}{P(x_j)P(y)}$$

Score of $x_j$ based on its MI with $Y$: $S(j) = I(X_j; Y)$.
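For a discrete-valued feature, the MI score can be estimated from empirical frequencies. A minimal sketch (in nats; real pipelines would first discretize continuous features, a step not shown here):

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) = sum_{x,y} p(x,y) log(p(x,y)/(p(x)p(y)))
    for a discrete feature column x and label vector y."""
    n = len(x)
    pxy = Counter(zip(x, y))          # joint counts
    px, py = Counter(x), Counter(y)   # marginal counts
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())
```

Independence drives the score to 0; a feature that determines the label scores the label entropy (log 2 for balanced binary labels).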
ROC Curve
[Figure: class-conditional densities of two features, x1 and x2, for Y = 1 and Y = -1, with the corresponding ROC curves plotting Sensitivity (TPR) against 1 - Specificity (FPR); the Area Under the Curve (AUC) serves as a univariate feature score]
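The AUC of a single thresholded feature can be computed without tracing the curve, via its equivalence to the Mann-Whitney U statistic. A minimal sketch (labels assumed to be +1/-1 as in the slides; ties count one half):

```python
def auc_score(x, y):
    """Area under the ROC curve obtained by thresholding feature x alone
    to separate y = +1 from y = -1.  Equals the probability that a random
    positive sample has a larger feature value than a random negative one."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == -1]
    # count pairwise "wins" of positives over negatives (ties = 1/2)
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separating feature scores 1.0 (or 0.0 with the sign flipped); an uninformative one scores around 0.5.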
Statistical T-test
T statistic: based on the experimental data, we reject or accept the null hypothesis H0 at a certain significance level.
- H0: the mean values of the feature for the different classes do not differ significantly ($\mu_1 = \mu_2$).
- Select the features for which the means are significantly different.
[Figure: class-conditional densities with well-separated vs. overlapping class means]
Statistical T-test
Assumptions: normally distributed classes with equal but unknown variance; $s^2$ is the variance estimated from the data.
Null hypothesis H0: $\mu_1 = \mu_2$.
Under H0, $t$ has a Student's t distribution with $\nu = N_1 + N_2 - 2$ degrees of freedom. If $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$, reject the hypothesis.
Statistical T-test
Assumptions: normally distributed classes with unknown variance; the class variances $s_1^2$ and $s_2^2$ are estimated from the data. The statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{\frac{1}{N_1} + \frac{1}{N_2}}}, \qquad s^2 = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}$$

has a Student's t distribution with $N_1 + N_2 - 2$ degrees of freedom.
Statistical T-test: Example ($\alpha = 0.05$)

Feature $x_1$: $\bar{x}_1 = 1$, $\bar{x}_2 = 0$, $s_1 = s_2 = 1$ $\Rightarrow$ $t = 1.224$, $\nu = 4$ $\Rightarrow$ $t_{0.025,4} = 2.78 > 1.224$.
Then $x_1$ is not a proper feature (H0 is not rejected).

Feature $x_2$: $\bar{x}_1 = 2$, $\bar{x}_2 = -1$, $s_1 = s_2 = 1$ $\Rightarrow$ $t = 3.674$, $\nu = 4$ $\Rightarrow$ $t_{0.025,4} = 2.78 < 3.674$.
Then $x_2$ is a proper feature (H0 is rejected).
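The pooled-variance t statistic used in this example can be sketched directly; with three samples per class ($\nu = N_1 + N_2 - 2 = 4$, matching the example), the same values 1.224 and 3.674 come out:

```python
import math

def t_statistic(x, y):
    """Pooled two-sample t statistic for feature column x between classes
    y = +1 and y = -1 (equal-variance assumption, df = N1 + N2 - 2)."""
    a = [xi for xi, yi in zip(x, y) if yi == 1]
    b = [xi for xi, yi in zip(x, y) if yi == -1]
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    # pooled variance: sum of within-class squared deviations over df
    ss = sum((v - m1) ** 2 for v in a) + sum((v - m2) ** 2 for v in b)
    s2 = ss / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(s2 * (1 / n1 + 1 / n2))
```

The sample data below are hypothetical, chosen so that each class has mean and variance matching the slide's example; comparing $|t|$ against $t_{0.025,4} = 2.78$ reproduces the accept/reject decisions above.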
Filter – Univariate: Disadvantage
Redundant subset: the same performance could possibly be achieved with a smaller subset of complementary variables that does not contain redundant features.
What is the relation between redundancy and correlation?
- Are highly correlated features necessarily redundant?
- What about completely correlated ones?
Univariate Methods: Failure
Examples where univariate feature analysis and scoring fails:
[Figure: features that are individually weak but jointly discriminative]
[Guyon and Elisseeff, JMLR 2003; Springer 2006]
Multivariate Methods: General Procedure
1. Subset generation: select a candidate feature subset for evaluation.
2. Subset evaluation: compute the score (relevancy value) of the subset.
3. Stopping criterion: decide when to stop the search in the space of feature subsets.
4. Validation: verify that the selected subset is valid.

[Flow diagram: original feature set -> subset generation -> subset evaluation -> stopping criterion (No: loop back to generation; Yes: continue) -> validation]
Search Space for Feature Selection (d = 4)

0,0,0,0
1,0,0,0  0,1,0,0  0,0,1,0  0,0,0,1
1,1,0,0  1,0,1,0  0,1,1,0  1,0,0,1  0,1,0,1  0,0,1,1
1,1,1,0  1,1,0,1  1,0,1,1  0,1,1,1
1,1,1,1

[Kohavi and John, 1997]
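The lattice above is the full search space of subsets, encoded as binary indicator vectors. A minimal sketch of enumerating it, which also makes the $2^d$ size concrete:

```python
from itertools import product

def all_subsets(d):
    """Enumerate every feature subset of d features as a binary indicator
    vector (1 = feature included).  There are 2**d of them, which is why
    exhaustive search becomes infeasible for large d."""
    return list(product([0, 1], repeat=d))
```

For d = 4 this yields the 16 nodes of the lattice shown above; for d = 5100 (the face classification example) exhaustive enumeration is hopeless, motivating the heuristic searches that follow.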
Stopping Criteria
- A predefined number of features is selected.
- A predefined number of iterations is reached.
- Addition (or deletion) of any feature does not result in a better subset.
- An optimal subset (according to the evaluation criterion) is obtained.
Multivariate Feature Selection
Search in the space of all possible combinations of features:
- All feature subsets: for $d$ features, there are $2^d$ possible subsets, giving high computational and statistical complexity.
- Wrappers use the classifier performance to evaluate the feature subset utilized in the classifier. Training $2^d$ classifiers is infeasible for large $d$, so most wrapper algorithms use a heuristic search.
- Filters use an evaluation function that is cheaper to compute than the performance of the classifier, e.g., a correlation coefficient.
Filters vs. Wrappers
- Filter: original feature set -> filter -> feature subset -> classifier; ranks subsets of useful features without involving the classifier.
- Wrapper: original feature set -> multiple feature subsets, each evaluated with the classifier; takes the classifier into account to rank feature subsets.
Filter Methods: Evaluation Criteria
- Distance (e.g., Euclidean distance): class separability; features supporting instances of the same class being closer in terms of distance than those from different classes.
- Information (e.g., information gain): select S1 if IG(S1, Y) > IG(S2, Y).
- Dependency (e.g., correlation coefficient): good feature subsets contain features highly correlated with the class, yet uncorrelated with each other.
- Consistency (min-features bias): select features that guarantee no inconsistency in the data, where inconsistent instances have the same feature vector but different class labels; prefers the smaller subset that is consistent (min-features).

Example of inconsistency:
instance 1: f1 = a, f2 = b, class = c1
instance 2: f1 = a, f2 = b, class = c2 (inconsistent)
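The consistency criterion reduces to a simple check: after projecting the data onto a candidate subset, do any two samples collide on features but disagree on labels? A minimal sketch:

```python
def is_consistent(X, y):
    """Return False iff two samples share the same (projected) feature
    vector but carry different class labels.  Under the min-features
    bias, the smallest subset for which this holds is preferred."""
    seen = {}
    for row, label in zip(X, y):
        key = tuple(row)
        # setdefault records the first label seen for this feature vector;
        # a later mismatch is exactly the inconsistency in the example above
        if seen.setdefault(key, label) != label:
            return False
    return True
```

Running it on the two-instance example above returns False, flagging the subset {f1, f2} as insufficient.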
Subset Selection or Generation (Search)
Search direction:
- Forward
- Backward
- Random

Search strategies:
- Exhaustive (complete): branch and bound, best first
- Heuristic: sequential forward selection, sequential backward elimination, plus-l minus-r selection, bidirectional search, sequential floating selection
- Non-deterministic: simulated annealing, genetic algorithms
Search Strategies
- Complete: examines all combinations of feature subsets; the optimal subset is achievable, but it is too expensive if $d$ is large.
- Heuristic: selection is directed under certain guidelines, with incremental generation of subsets; smaller search space and thus faster search, but it may miss feature sets of high importance.
- Non-deterministic or random: no predefined way to select feature candidates (i.e., a probabilistic approach); the returned subset depends on the number of trials, and more user-defined parameters are needed.
Sequential Forward Selection
Start from the empty subset (0,0,0,0) and, at each step, add the single feature whose addition yields the best-scoring subset:

Start: 0,0,0,0
1,0,0,0  0,1,0,0  0,0,1,0  0,0,0,1
1,1,0,0  1,0,1,0  0,1,1,0  1,0,0,1  0,1,0,1  0,0,1,1
1,1,1,0  1,1,0,1  1,0,1,1  0,1,1,1
1,1,1,1
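The greedy walk down this lattice can be sketched as follows. This is an illustrative skeleton, not from the slides; `score_fn` is any subset evaluator (cross-validated accuracy for a wrapper, or a filter criterion):

```python
def sfs(d, score_fn, k):
    """Sequential forward selection: start from the empty subset and
    repeatedly add the single feature that most improves score_fn,
    until k features are selected.

    d: total number of features.
    score_fn: callable(frozenset of feature indices) -> float (higher is better).
    """
    selected = set()
    while len(selected) < k:
        # evaluate every one-feature extension of the current subset
        best = max((j for j in range(d) if j not in selected),
                   key=lambda j: score_fn(frozenset(selected | {j})))
        selected.add(best)
    return selected
```

Note the greedy nature: each step evaluates only $d - |S|$ candidates instead of all $2^d$ subsets, at the cost of possibly missing feature combinations that only work together (the failure mode of the Guyon-Elisseeff examples).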
Filters vs. Wrappers
Filters:
- Fast execution: the evaluation function is faster to compute than training a classifier.
- Generality: they evaluate intrinsic properties of the data, rather than its interactions with a particular classifier ("good" for a larger family of classifiers).
- Tendency to select large subsets: their objective functions are generally monotonic (so they tend to select the full feature set); a cutoff on the number of features is required.

Wrappers:
- Slow execution: must train a classifier for each feature subset (or several trainings if cross-validation is used).
- Lack of generality: the solution is tied to the bias of the classifier used in the evaluation function.
- Ability to generalize: since they typically use cross-validation measures to evaluate classification accuracy, they have a mechanism to avoid overfitting.
- Accuracy: generally achieve better recognition rates than filters, since they find a feature set suited to the intended classifier.
[Summary table of feature selection methods from H. Liu and L. Yu, Feature Selection for Data Mining, 2002]
Feature Selection: Summary
- Most univariate methods are filters, and most wrappers are multivariate.
- No feature selection method is universally better than the others: there is a wide variety of variable types, data distributions, and classifiers.
- Match the method complexity to the ratio d/N: univariate feature selection may work better than multivariate selection, and weaker classifiers (e.g., linear) may be better than complex ones.
- Feature selection is not always necessary to achieve good performance.
References
- I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," JMLR, vol. 3, pp. 1157-1182, 2003.
- S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th edition, 2008. [Chapter 5]
- H. Liu and L. Yu, Feature Selection for Data Mining, 2002.