
Minimum Redundancy and Maximum Relevance Feature Selection

Hang Xiao

Background

• Feature

– A feature is an individual measurable heuristic property of a phenomenon being observed

– In character recognition: horizontal and vertical profiles, number of internal holes, stroke detection

– In speech recognition: noise ratios, length of sounds, relative power, filter matches

– In microarray data: gene expression levels

Background

• Relevance between features

– Correlation

– F-statistic

– Mutual information

Independent: p(x,y) = p(x)p(y), so I(x,y) = 0

p(x,y): joint distribution function of X and Y

p(x), p(y): marginal probability distribution functions
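
As a minimal sketch of the mutual information measure named above, the following estimates I(x,y) = ∑ p(x,y) log[p(x,y) / (p(x)p(y))] from empirical counts of two discrete sequences. The function name and the toy data are illustrative assumptions, and the plug-in (count-based) estimator is only one of several possible choices.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of I(x, y) in nats for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log(p(x,y) / (p(x) p(y))), written in terms of counts
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi


# Toy data (hypothetical): every (x, y) combination occurs equally often,
# so p(x,y) = p(x)p(y) and the estimate is 0.
x_ind = [0, 0, 1, 1]
y_ind = [0, 1, 0, 1]
print(mutual_information(x_ind, y_ind))

# A variable compared with itself gives its entropy (log 2 for a fair bit).
print(mutual_information(x_ind, x_ind))
```

The first call illustrates the independence case stated on the slide; the second shows the opposite extreme, where one variable fully determines the other.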

Feature Selection Problem

• Maximal relevance

– Selects the features with the highest relevance to the target class c, based on mutual information, F-test, etc., without considering relationships among features

• Minimal redundancy: needed because, under relevance-only selection,

– Selected features are correlated

– Selected features cover narrow regions in space

mRMR: Discrete Variables

• Maximize Relevance:

• Minimal Redundancy:

S is the set of features

I(i,j) is the mutual information between features i and j
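
For reference, the two criteria on this slide are commonly written as follows in the mRMR literature. This is a sketch in standard notation, assuming h denotes the target class and V_I, W_I are the usual names for the relevance and redundancy terms.

```latex
\begin{align*}
\max V_I, \quad & V_I = \frac{1}{|S|} \sum_{i \in S} I(h, i) \\
\min W_I, \quad & W_I = \frac{1}{|S|^2} \sum_{i, j \in S} I(i, j)
\end{align*}
```

V_I averages the mutual information between each selected feature and the class; W_I averages the pairwise mutual information among the selected features.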

mRMR: Continuous Variables

• Maximum relevance: F-statistic F(i,h)

• Minimum redundancy: correlation cor(i,j)
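
The continuous-variable criteria can be sketched in the same form as the discrete ones. The names V_F and W_c, and the use of the absolute correlation (so that strong negative correlation also counts as redundancy), follow the usual presentation in the mRMR literature and are assumptions here rather than text from the slide.

```latex
\begin{align*}
\max V_F, \quad & V_F = \frac{1}{|S|} \sum_{i \in S} F(i, h) \\
\min W_c, \quad & W_c = \frac{1}{|S|^2} \sum_{i, j \in S} \left| \mathrm{cor}(i, j) \right|
\end{align*}
```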

Combine Relevance and Redundancy

• Additive combination

• Multiplicative combination
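
A sketch of the two combinations, assuming V denotes the relevance term and W the redundancy term (V_I and W_I for discrete variables, V_F and W_c for continuous ones). In the mRMR literature the additive form is usually a difference and the multiplicative form a quotient:

```latex
\begin{align*}
\text{Additive (difference):} \quad & \max \, (V - W) \\
\text{Multiplicative (quotient):} \quad & \max \, (V / W)
\end{align*}
```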

Most Related Methods

• Most widely used feature selection methods: pick the top-ranking features without considering relationships among features

• Yu & Liu, 2003/2004: information gain; an essentially similar approach

• Wrappers: not a filter approach; the classifier is involved in selection, so the features do not generalize well

• PCA and ICA: features are orthogonal or independent, but not in the original feature space

Class Prediction Methods

• Naive Bayes (NB) classifier

– {g1, g2, …, gm}: gene expression levels

– p(gi|hk) is the conditional table (density)

• Support Vector Machine (SVM)

– Draws an optimal hyperplane in the feature vector space
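
For the Naive Bayes classifier above, the decision rule is usually written as below. This is the textbook form, assuming the genes g_1, …, g_m are treated as conditionally independent given the class h_k; whether the class prior p(h_k) is included or taken as uniform is an assumption of this sketch.

```latex
\begin{align*}
p(h_k \mid g_1, \dots, g_m) &\propto p(h_k) \prod_{i=1}^{m} p(g_i \mid h_k) \\
\hat{h} &= \arg\max_{k} \; p(h_k) \prod_{i=1}^{m} p(g_i \mid h_k)
\end{align*}
```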

Class Prediction Methods

• Logistic Regression (LR)

– A linear combination of the feature variables

– Transformed into probabilities by a logistic function

• Linear Discriminant Analysis (LDA)

– Finds a linear combination of features

– Closely related to ANOVA and regression analysis

Microarray Gene Expression Data Sets for Cancer Classification

LOOCV: Leave-One-Out Cross Validation

Baseline features: based solely on maximum relevance
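
The LOOCV procedure can be sketched as follows: each sample is held out once, a classifier is fit on the remaining samples, and the held-out sample is predicted. The 1-nearest-neighbor classifier, the function names, and the toy data are hypothetical stand-ins chosen to keep the example self-contained; the experiments in this deck use NB, SVM, LR and LDA.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Hypothetical stand-in classifier: label of the closest training sample."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def loocv_errors(X, y, predict=nearest_neighbor_predict):
    """Count misclassifications when each sample is held out exactly once."""
    errors = 0
    for i in range(len(X)):
        # Train on everything except sample i.
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) != y[i]:
            errors += 1
    return errors


# Toy one-feature data set (hypothetical): two well-separated classes.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = ["a", "a", "a", "b", "b", "b"]
print(loocv_errors(X, y))
```

Dividing the returned count by the number of samples gives the LOOCV error rate.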

The role of redundancy reduction

[Figure: (a) relevance V_I and (b) redundancy for mRMR features on the discretized NCI dataset; (c) the respective LOOCV errors obtained using the Naive Bayes classifier]

Do mRMR Features Generalize Well on Unseen Data?

Testing errors on the Child Leukemia data (7 classes, 215 training samples, 112 testing samples). M is the number of features used in classification.

What is the Relationship of mRMR Features and Various Data Discretization Schemes?

LOOCV testing results (number of errors) for binarized NCI and Lymphoma data using the SVM classifier.

Comparison with other work

Theoretical basis of mRMR

• Maximum Dependency Criterion

– Statistical association

– Definition: mutual information I(Sm,h)

• Mutual Information

– For two variables x and y

– For multivariate variable Sm and the target h

High-Dimensional Mutual Information

• For multivariate variable Sm and the target h

• Estimating the high-dimensional I(Sm,h) is difficult

– Finding the inverse of a large covariance matrix is an ill-posed problem

– Insufficient number of samples

– Combinatorial time complexity O(C(|Ω|,|S|))
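
For reference, the multivariate mutual information discussed on this slide is usually defined as below. This is a sketch in standard notation, assuming S_m = {x_1, …, x_m} is the selected feature set and the integrals are replaced by sums for discrete variables.

```latex
\begin{equation*}
I(S_m, h) = \iint p(S_m, h) \, \log \frac{p(S_m, h)}{p(S_m) \, p(h)} \; dS_m \, dh
\end{equation*}
```

The joint density p(S_m, h) has m + 1 arguments, which is why the sample and computation requirements listed above grow so quickly with m.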

Factorize the Mutual Information

• Mutual information for multivariate variable Sm and the target h

Define:

It can be proved:
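
A sketch of the definition and the identity referred to above, in the notation commonly used in the mRMR analysis. The symbol J and its exact form are assumptions of this reconstruction.

```latex
\begin{align*}
J(x_1, \dots, x_n) &= \int \!\cdots\! \int p(x_1, \dots, x_n) \,
  \log \frac{p(x_1, \dots, x_n)}{p(x_1) \cdots p(x_n)} \; dx_1 \cdots dx_n \\
I(S_m, h) &= J(S_m, h) - J(S_m)
\end{align*}
```

The identity follows because the product of marginals p(x_1) ⋯ p(x_m) cancels in the difference, leaving log[p(S_m, h) / (p(S_m) p(h))] under the expectation.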

Factorize I(Sm,h)

• Relevance of S={x1,x2, …} and h, or RL(S,h)

• Redundancy among variables {x1,x2,...}, or RD(S)

• For incremental search, max I(S,h) is “equivalent” to max [RL(S,h) – RD(S)], the so-called min-Redundancy-Max-Relevance (mRMR) criterion

Advantages of mRMR

• Both relevance and redundancy estimation are low-dimensional problems (i.e. involving only 2 variables). This is much easier than directly estimating multivariate density or mutual information in the high-dimensional space!

• Fast speed

• More reliable estimation

• mRMR is an optimal first-order approximation of I(.) maximization

• Relevance-only ranking only maximizes J(.)!

Search Algorithm of mRMR

• Greedy search algorithm

– In the pool Ω, find the variable x1 that has the largest I(x1,h). Exclude x1 from Ω and add it to the selected set S

– Search for x2 in Ω so that it maximizes I(x2,h) - ∑I(xi,x2)/|S|, where the sum runs over the variables xi already in S

– Iterate this process until an expected number of variables has been obtained, or other constraints are satisfied

• Complexity O(|S|*|Ω|)
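
The greedy search can be sketched as below for discrete features, using the additive (difference) combination. It assumes the standard incremental form in which the redundancy of a candidate is its average mutual information with the features already selected. The function names, the dict-of-columns input format, and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of I(x, y) in nats for two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def mrmr_select(features, target, k):
    """Greedily pick k feature names from `features` (dict: name -> column)."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    pool = set(features)   # candidate pool (Omega)
    selected = []          # selected set (S), in selection order

    def score(name):
        if not selected:
            return relevance[name]  # first pick: relevance only
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while pool and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(pool), key=score)
        selected.append(best)
        pool.remove(best)
    return selected


# Toy data (hypothetical): "b" duplicates "a"; "c" is weaker but less redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 0, 0, 1, 1, 1, 1, 1],
    "c": [0, 0, 1, 0, 1, 0, 1, 1],
}
print(mrmr_select(features, target, 2))  # picks "a", then "c" rather than the duplicate "b"
```

A relevance-only ranking would return "a" and "b" here, since they tie for the highest mutual information with the target; the redundancy term is what steers the second pick to "c". Caching the pairwise mutual information values would avoid recomputing them in each iteration.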

Comparing Max-Dep and mRMR: Complexity of Feature Selection

Comparing Max-Dep and mRMR: Accuracy of Feature Selected in Classification

• Leave-One-Out cross validation of feature classification accuracies of mRMR and MaxDep

Use Wrappers to Refine Features

• mRMR is a filter approach

– Fast

– Features might be redundant

– Independent of the classifier

• Wrappers seek to minimize the number of errors directly

– Slow

– Features are less robust

– Dependent on classifier

– Better prediction accuracy

• Use mRMR first to generate a short feature pool, then use wrappers to get a least-redundant feature set with better accuracy

Use Wrappers to Refine Features

Forward wrappers (incremental selection)

Backward wrappers (decremental selection)

NCI Data
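
A forward wrapper over an mRMR-generated candidate pool might look like the sketch below. The error function is supplied by the caller (for example, a LOOCV error count for a chosen classifier); the stopping rule used here (stop when no remaining feature lowers the error), the function names, and the toy error function are assumptions for illustration, not the procedure used to produce the NCI results.

```python
def forward_wrapper(candidates, error_of):
    """Greedy forward selection: repeatedly add the feature that lowers
    error_of(subset) the most; stop when no addition improves it."""
    selected = []
    best_error = error_of(selected)
    remaining = list(candidates)
    while remaining:
        # Evaluate the classifier error for each one-feature extension.
        trial_errors = {f: error_of(selected + [f]) for f in remaining}
        best = min(remaining, key=lambda f: trial_errors[f])
        if trial_errors[best] >= best_error:
            break  # no remaining feature reduces the error
        selected.append(best)
        remaining.remove(best)
        best_error = trial_errors[best]
    return selected, best_error


def toy_error(subset):
    """Hypothetical error count: only g1 and g4 remove any errors."""
    useful = {"g1": 3, "g4": 2}
    return 6 - sum(useful.get(f, 0) for f in set(subset))


# Candidate pool as it might come out of mRMR (names are made up).
print(forward_wrapper(["g1", "g2", "g3", "g4"], toy_error))
```

A backward wrapper would run the same loop in reverse: start from the full pool and repeatedly drop the feature whose removal does not increase the error.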

Conclusions

• The Max-Dependency feature selection can be efficiently implemented as the mRMR algorithm

• mRMR significantly outperforms the widely used max-relevance selection method: mRMR features cover a broader feature space with fewer features

• mRMR is very efficient and useful for gene selection and many other applications. The programs are ready!