Feature Selection, Dimensionality Reduction, and Clustering
Feature Selection, Dimensionality Reduction, and Clustering
Presenter: Georgi Nalbantov
Summer Course: Data Mining
August 2009
Structure
Feature Selection
  Filtering approach
  Wrapper approach
  Embedded methods
Clustering
  Density estimation and clustering
  K-means clustering
  Hierarchical clustering
  Clustering with Support Vector Machines (SVMs)
Dimensionality Reduction
  Principal Components Analysis (PCA)
  Nonlinear PCA (Kernel PCA, CatPCA)
  Multi-Dimensional Scaling (MDS)
  Homogeneity Analysis
Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
Feature Selection
In the presence of millions of features/attributes/inputs/variables, select the most relevant ones.
Advantages: build better, faster, and easier to understand learning machines.
[Figure: data matrix of n samples by m features X1…Xm, reduced to a subset of m' selected features]
Feature Selection
Goal: select the two best features individually
Any reasonable objective J will rank the features
J(x1) > J(x2) = J(x3) > J(x4). Thus, the chosen pair would be {x1, x2} or {x1, x3}.
However, x4 is the only feature that provides information complementary to x1.
Feature Selection
Filtering approach: ranks features or feature subsets independently of the predictor (classifier)
  …using univariate methods: consider one variable at a time
  …using multivariate methods: consider more than one variable at a time
Wrapper approach: uses a classifier to assess (many) features or feature subsets.
Embedded approach: uses a classifier to build a (single) model with a subset of features that are internally selected.
Feature Selection: univariate filtering approach
[Figure: class-conditional densities P(Xi | Y) plotted over the values of feature xi]
Issue: determine the relevance of a given single feature.
Feature Selection: univariate filtering approach
Issue: determine the relevance of a given single feature.
Under independence: P(X, Y) = P(X) P(Y)
Measure of dependence (Mutual Information):
MI(X, Y) = \int\!\!\int P(X, Y) \log \frac{P(X, Y)}{P(X)\,P(Y)} \, dX \, dY = KL( P(X, Y) \,\|\, P(X) P(Y) )
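A minimal sketch of this univariate MI-based filter, assuming scikit-learn; the toy dataset from make_classification and the top-10 cut-off are illustrative stand-ins, not part of the original slides:

```python
# Univariate filtering: rank features by estimated mutual information with the label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Toy labelled data standing in for a real dataset (illustrative only).
X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)   # one MI estimate per feature
ranking = np.argsort(mi)[::-1]                   # best features first

top_k = 10                                       # arbitrary cut-off for illustration
selected = ranking[:top_k]
print("Top features by mutual information:", selected)
```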
Feature Selection: univariate filtering approach
Correlation and MI. Note: correlation is a measure of linear dependence.
Feature Selection: univariate filtering approach
Correlation and MI under the Gaussian distribution
Feature Selection: univariate filtering approach. Criteria for measuring dependence.
Feature Selection: univariate filtering approach
[Figure: class-conditional densities P(Xi | Y = 1) and P(Xi | Y = -1) plotted over feature xi; legend: Y = 1, Y = -1]
Feature Selection: univariate filtering approach
[Figure: class-conditional densities of feature xi; legend: Y = 1, Y = -1]
Two cases: P(Xi | Y = 1) = P(Xi | Y = -1) versus P(Xi | Y = 1) ≠ P(Xi | Y = -1)
Feature Selection: univariate filtering approach
[Figure: two class-conditional densities of feature xi with means \mu_- and \mu_+; is the distance between them significant?]
T-test
• Normally distributed classes with equal but unknown variance \sigma^2, estimated from the data as \sigma^2_{within}.
• Null hypothesis H0: \mu_+ = \mu_-
• T statistic: if H0 is true, then
  t = \frac{\mu_+ - \mu_-}{\sigma_{within} \sqrt{1/m_+ + 1/m_-}} \sim Student t with m_+ + m_- - 2 d.f.
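A minimal sketch of t-test feature ranking, assuming SciPy and scikit-learn; the toy two-class data and the ranking by p-value are illustrative choices:

```python
# Univariate filtering: two-sample t-test (equal variances) per feature.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification

# Toy two-class data standing in for a real dataset (illustrative only).
X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)

# t statistic and p-value for every feature at once (axis=0 tests column-wise).
t_stats, p_values = ttest_ind(X[y == 1], X[y == 0], axis=0, equal_var=True)

ranking = np.argsort(p_values)          # smallest p-value = most significant difference
print("Features ranked by t-test p-value:", ranking)
```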
Feature Selection: multivariate filtering approach
Guyon-Elisseeff, JMLR 2004; Springer 2006
Feature Selection: search strategies
N features, 2^N possible feature subsets!
Kohavi-John, 1997
Feature Selection: search strategies
Forward selection or backward elimination (a sketch of sequential forward selection is given below).
Beam search: keep the k best paths at each step.
GSFS: generalized sequential forward selection – when (n − k) features are left, try all subsets of g features. More training runs at each step, but fewer steps.
PTA(l, r): plus l, take away r – at each step, run SFS l times, then SBS r times.
Floating search: one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far.
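A minimal sketch of sequential forward selection used as a wrapper, assuming scikit-learn; the k-NN classifier, 5-fold cross-validated accuracy, and the stopping size of 4 are illustrative assumptions:

```python
# Sequential forward selection (SFS): greedily add the feature that most improves
# cross-validated accuracy of a chosen classifier (wrapper approach).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)        # any classifier could be wrapped

selected, remaining = [], list(range(X.shape[1]))
target_size = 4                                  # illustrative stopping point
while len(selected) < target_size:
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)           # feature giving the best CV score
    selected.append(best)
    remaining.remove(best)
    print(f"added feature {best}, CV accuracy {scores[best]:.3f}")
```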
Feature Selection: filters vs. wrappers vs. embedding
Main goal: rank subsets of useful features
Feature Selection: feature subset assessment (wrapper)
1) For each feature subset, train the predictor on the training data.
2) Select the feature subset that performs best on the validation data. Repeat and average if you want to reduce variance (cross-validation).
3) Test on the test data.
[Figure: data matrix of M samples by N variables/features, split row-wise into three blocks of sizes m1, m2, m3]
Split the data into 3 sets: training, validation, and test.
Danger of over-fitting with intensive search!
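A minimal sketch of this protocol, assuming scikit-learn; the logistic regression model and the three hand-picked candidate subsets are hypothetical placeholders for the subsets produced by a real search:

```python
# Wrapper assessment with a three-way split: pick the subset that is best on
# validation data, then report performance once on the held-out test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

candidate_subsets = [[0, 1, 2], [3, 4, 5, 6], [0, 3, 7, 11, 15]]   # illustrative subsets

def val_score(subset):
    model = LogisticRegression(max_iter=1000).fit(X_train[:, subset], y_train)
    return model.score(X_val[:, subset], y_val)

best = max(candidate_subsets, key=val_score)
final = LogisticRegression(max_iter=1000).fit(X_train[:, best], y_train)
print("best subset:", best, "test accuracy:", final.score(X_test[:, best], y_test))
```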
Feature Selection via Embedded Methods: L1-regularization
[Figure: coefficient paths plotted against sum(|beta|)]
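A minimal sketch of embedded selection through an L1 penalty, assuming scikit-learn; the regularization strength C = 0.1 is an illustrative value:

```python
# Embedded feature selection: an L1-penalized (lasso-style) logistic regression
# drives some coefficients exactly to zero, selecting features while fitting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=30, n_informative=5, random_state=0)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)  # smaller C = stronger penalty
model.fit(X, y)

selected = np.flatnonzero(model.coef_[0])        # features with non-zero weights
print("sum(|beta|) =", np.abs(model.coef_).sum())
print("selected features:", selected)
```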
Feature Selection: summary
Univariate, linear: T-test, AUC, feature ranking
Univariate, non-linear: mutual information feature ranking
Multivariate, linear: RFE with linear SVM or LDA
Multivariate, non-linear: Nearest Neighbors, Neural Nets, Trees, SVM
Dimensionality Reduction
In the presence of many features, select the most relevant subset of (weighted) combinations of features.
Feature Selection: X_1, ..., X_m  →  X_{k_1}, ..., X_{k_p}
Dimensionality Reduction: X_1, ..., X_m  →  f_1(X_1, ..., X_m), ..., f_p(X_1, ..., X_m)
Dimensionality Reduction: (Linear) Principal Components Analysis
PCA finds a linear mapping of dataset X to a dataset X’ of lower dimensionality, such that the variance of X that is retained in X’ is maximal.
Dataset X is mapped to dataset X’, here of the same dimensionality. The first dimension in X’ (= the first principal component) is the direction of maximal variance. The second principal component is orthogonal to the first.
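A minimal sketch of linear PCA, assuming scikit-learn; the Iris data and the choice of two retained components are illustrative:

```python
# Linear PCA: project the data onto the directions of maximal variance.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # any numeric data matrix would do

pca = PCA(n_components=2)                 # keep the first two principal components
X_reduced = pca.fit_transform(X)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("shape before/after:", X.shape, X_reduced.shape)
```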
Dimensionality Reduction: Nonlinear (Kernel) Principal Components Analysis
Map the original dataset X to a HIGHER-dimensional space, and carry out LINEAR PCA in that space.
(If necessary,) map the resulting principal components back to the original space.
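A minimal sketch of kernel PCA, assuming scikit-learn; the RBF kernel, gamma = 10, and the two-circles toy data are illustrative choices:

```python
# Kernel PCA: implicit mapping to a higher-dimensional feature space via a kernel,
# followed by linear PCA in that space.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0,
                 fit_inverse_transform=True)      # allows mapping back to input space
X_kpca = kpca.fit_transform(X)                    # nonlinear principal components
X_back = kpca.inverse_transform(X_kpca)           # approximate pre-image in the original space
```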
Dimensionality Reduction: Multi-Dimensional Scaling
MDS is a mathematical dimension-reduction technique that maps the distances between observations from the original (high-dimensional) space into a lower (for example, two-dimensional) space.
MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space.
Error on the fit is measured using a so-called “stress” function
Different choices for a stress function are possible
Dimensionality Reduction: Multi-Dimensional Scaling
Raw stress function (identical to PCA):
Sammon cost function:
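The two cost functions themselves were not legible in the source; a standard reconstruction (with d_{ij} the original pairwise distances and \hat{d}_{ij} = \lVert y_i - y_j \rVert the distances between the mapped points) is:

```latex
% Raw stress (metric MDS) and Sammon's cost, reconstructed from standard definitions.
\text{Raw stress:}\quad
  \phi_{\text{raw}} = \sum_{i<j} \bigl( d_{ij} - \hat{d}_{ij} \bigr)^2
\qquad
\text{Sammon cost:}\quad
  \phi_{\text{Sammon}} = \frac{1}{\sum_{i<j} d_{ij}}
    \sum_{i<j} \frac{\bigl( d_{ij} - \hat{d}_{ij} \bigr)^2}{d_{ij}}
```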
Dimensionality Reduction: Multi-Dimensional Scaling (Example)
[Figure: example input data and the resulting low-dimensional MDS output]
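A minimal sketch of metric MDS standing in for the example figure, assuming scikit-learn; the Iris data, precomputed Euclidean distances, and two output dimensions are illustrative:

```python
# Metric MDS: embed points in 2-D so that pairwise distances are preserved
# as well as possible (the residual misfit is the "stress").
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X = load_iris().data
D = pairwise_distances(X)                         # original pairwise Euclidean distances

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X_2d = mds.fit_transform(D)                       # low-dimensional configuration

print("final stress:", mds.stress_)
```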
Dimensionality Reduction: Homogeneity Analysis
Homals finds a lower-dimensional representation of a categorical data matrix X. It may be considered a type of nonlinear extension of PCA.
Clustering: Similarity measures for hierarchical clustering
Clustering: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA
Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS
Regression: Classical Linear Regression, Ridge Regression, NN, CART
[Figure: example scatter plots in the (X1, X2) plane with +/− labelled points]
Clustering
Clustering is an unsupervised learning technique.
Task: organize objects into groups whose members are similar in some way
Clustering finds structures in a collection of unlabeled data
A cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters
Density estimation and clustering
Bayesian separation curve (optimal)
Clustering: K-means clustering
Minimizes the sum of the squared distances to the cluster centers (reconstruction error)
Iterative process:
Estimate current assignments (construct Voronoi partition)
Given the new cluster assignments, set cluster center to center-of-mass
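A minimal sketch of K-means, assuming scikit-learn; the toy blob data and the choice of three clusters are illustrative:

```python
# K-means: alternate between assigning points to the nearest center (Voronoi
# partition) and moving each center to the mean of its assigned points.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("cluster centers:\n", kmeans.cluster_centers_)
print("reconstruction error (sum of squared distances):", kmeans.inertia_)
```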
Clustering: K-means clustering
[Figure: four snapshots (Step 1 to Step 4) of the K-means iterations]
Clustering: Hierarchical clustering
Dendrogram
Clustering based on (dis)similarities. Multilevel clustering: level 1 has n clusters, level n has one cluster
Agglomerative HC: starts with N clusters and combines clusters iteratively
Divisive HC: starts with one cluster and divides iteratively
Disadvantage: wrong division cannot be undone
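A minimal sketch of agglomerative hierarchical clustering, assuming SciPy; the toy data, single linkage, and the cut into three clusters are illustrative choices:

```python
# Agglomerative hierarchical clustering: start from singleton clusters and merge
# the closest pair at every level; the merge history can be drawn as a dendrogram.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

Z = linkage(X, method="single")                   # "complete", "average", ... are alternatives
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print("cluster labels:", labels)

# dendrogram(Z)  # with matplotlib available, this plots the full merge tree
```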
Clustering: Nearest Neighbor algorithm for hierarchical clustering
1. Nearest Neighbor, Level 2, k = 7 clusters.
2. Nearest Neighbor, Level 3, k = 6 clusters.
3. Nearest Neighbor, Level 4, k = 5 clusters.
Clustering: Nearest Neighbor algorithm for hierarchical clustering
4. Nearest Neighbor, Level 5, k = 4 clusters.
5. Nearest Neighbor, Level 6, k = 3 clusters.
6. Nearest Neighbor, Level 7, k = 2 clusters.
Clustering: Nearest Neighbor algorithm for hierarchical clustering
7. Nearest Neighbor, Level 8, k = 1 cluster.
Clustering: Similarity measures for hierarchical clustering
Clustering: Similarity measures for hierarchical clustering
Pearson Correlation: Trend Similarity
[Figure: three profiles a, b and c that follow the same trend]
C_pearson(a, c) = 1
C_pearson(a, b) = 1
C_pearson(b, c) = 1
Clustering: Similarity measures for hierarchical clustering
Euclidean Distance
d(x, y) = \sqrt{ \sum_{n=1}^{N} (x_n - y_n)^2 },  where x = (x_1, ..., x_N) and y = (y_1, ..., y_N)
Clustering: Similarity measures for hierarchical clustering
Cosine Correlation
C_cosine(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\lVert x \rVert \, \lVert y \rVert},  where x = (x_1, ..., x_N) and y = (y_1, ..., y_N)
-1 \le C_cosine(x, y) \le +1
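A minimal sketch that computes the three measures discussed here (Euclidean distance, Pearson correlation, cosine correlation) with NumPy; the two example profiles are made up for illustration:

```python
# Euclidean distance, Pearson correlation and cosine correlation between two profiles.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])       # illustrative profiles
y = np.array([2.0, 4.1, 5.9, 8.2])

euclidean = np.linalg.norm(x - y)
pearson = np.corrcoef(x, y)[0, 1]                          # trend similarity
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))   # trend + mean similarity

print(f"d(x,y)={euclidean:.3f}  C_pearson={pearson:.3f}  C_cosine={cosine:.3f}")
```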
Clustering: Similarity measures for hierarchical clustering
Cosine Correlation: Trend + Mean Distance
[Figure: three profiles a, b and c that follow the same trend]
C_cosine(a, b) = 1
C_cosine(a, c) = 0.9622
C_cosine(b, c) = 0.9622
Clustering: Similarity measures for hierarchical clustering
[Figure: three profiles a, b and c that follow the same trend]
C_cosine(a, b) = 1,  C_cosine(a, c) = 0.9622,  C_cosine(b, c) = 0.9622
d(a, c) = 1.5875,  d(a, b) = 2.8025,  d(b, c) = 3.2211
C_pearson(a, c) = 1,  C_pearson(a, b) = 1,  C_pearson(b, c) = 1
Clustering: Similarity measures for hierarchical clustering
C_cosine(a, b) = 0.7544,  C_cosine(a, c) = 0.8092,  C_cosine(b, c) = 0.844
d(a, c) = 0.0255,  d(a, b) = 0.0279,  d(b, c) = 0.0236
C_pearson(a, c) = 0.1244,  C_pearson(a, b) = 0.1175,  C_pearson(b, c) = 0.1779
Similar?
Clustering: Grouping strategies for hierarchical clustering
[Figure: three clusters C1, C2, C3]
Merge which pair of clusters?
Clustering: Grouping strategies for hierarchical clustering
Single Linkage
[Figure: two clusters C1 and C2]
Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.
Tends to generate “long chains”.
Clustering: Grouping strategies for hierarchical clustering
Complete Linkage
[Figure: two clusters C1 and C2]
Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.
Tends to generate “clumps”.
Clustering: Grouping strategies for hierarchical clustering
Average Linkage
[Figure: two clusters C1 and C2]
Dissimilarity between two clusters = average of the distances over all pairs of objects (one from each cluster).
Clustering: Grouping strategies for hierarchical clustering
Average Group Linkage
[Figure: two clusters C1 and C2]
Dissimilarity between two clusters = distance between the two cluster means.
Clustering: Support Vector Machines for clustering
The not-noisy case
Objective function:
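The objective itself is not legible in the transcript; the standard hard (not-noisy) formulation from Ben-Hur, Horn, Siegelmann and Vapnik (2001), which looks for the smallest sphere of radius R around a center a that encloses all mapped points Φ(x_j), is:

```latex
% Support vector clustering, hard (not-noisy) case: smallest enclosing sphere
% in the kernel-induced feature space.
\min_{R,\, a} \; R^2
\quad \text{subject to} \quad
\lVert \Phi(x_j) - a \rVert^2 \le R^2 \quad \forall j
```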
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Clustering: Support Vector Machines for clustering
The noisy case
Objective function:
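Likewise a reconstruction of the standard soft (noisy) formulation, where slack variables ξ_j let some points fall outside the sphere at a cost controlled by the constant C:

```latex
% Support vector clustering, noisy case: slack variables allow outliers
% at a penalty controlled by C.
\min_{R,\, a,\, \xi} \; R^2 + C \sum_j \xi_j
\quad \text{subject to} \quad
\lVert \Phi(x_j) - a \rVert^2 \le R^2 + \xi_j, \qquad \xi_j \ge 0 \quad \forall j
```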
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Clustering: Support Vector Machines for clustering
The noisy case (II)
Objective function:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Clustering: Support Vector Machines for clustering
The noisy case (III)
Objective function:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Conclusion / Summary / References
Feature Selection
  Filtering approach
  Wrapper approach
  Embedded methods
Clustering
  Density estimation and clustering
  K-means clustering
  Hierarchical clustering
  Clustering with Support Vector Machines (SVMs)
Dimensionality Reduction
  Principal Components Analysis (PCA)
  Nonlinear PCA (Kernel PCA, CatPCA)
  Multi-Dimensional Scaling (MDS)
  Homogeneity Analysis
References:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Borg and Groenen, 2005
Gifi, 1990
Guyon et al., 2006
Hastie et al., 2001
Kohavi and John, 1997
MacQueen, 1967
Schoelkopf et al., 2001
http://www.autonlab.org/tutorials/kmeans11.pdf
http://www.cs.otago.ac.nz/cosc453/student_tutorials/...principal_components.pdf