Feature Selection, Dimensionality Reduction, and Clustering
Feature Selection, Dimensionality Reduction, and Clustering
Presenter: Georgi Nalbantov
Summer Course: Data Mining
August 2009
Structure
Feature Selection
  Filtering approach
  Wrapper approach
  Embedded methods
Clustering
  Density estimation and clustering
  K-means clustering
  Hierarchical clustering
  Clustering with Support Vector Machines (SVMs)
Dimensionality Reduction
  Principal Components Analysis (PCA)
  Nonlinear PCA (Kernel PCA, CatPCA)
  Multi-Dimensional Scaling (MDS)
  Homogeneity Analysis
Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
Feature Selection
In the presence of millions of features/attributes/inputs/variables, select the most relevant ones.
Advantages: build better, faster, and easier to understand learning machines.
[Figure: data matrix of n samples by m features X1…Xm, reduced to a subset of m' selected features]
Feature Selection
Goal: select the two best features individually
Any reasonable objective J will rank the features
J(x1) > J(x2) = J(x3) > J(x4). Thus, the chosen pair would be {x1, x2} or {x1, x3}.
However, x4 is the only feature that provides information complementary to x1.
Feature Selection
Filtering approach: ranks features or feature subsets independently of the predictor (classifier)
  …using univariate methods: consider one variable at a time
  …using multivariate methods: consider more than one variable at a time
Wrapper approach: uses a classifier to assess (many) features or feature subsets.
Embedded approach: uses a classifier to build a (single) model with a subset of features that are internally selected.
Feature Selection: univariate filtering approach
[Figure: class-conditional densities P(Xi | Y) plotted over the values of feature xi]
Issue: determine the relevance of a given single feature.
Feature Selection: univariate filtering approach
Issue: determine the relevance of a given single feature.
Under independence: P(X, Y) = P(X) P(Y)
Measure of dependence (Mutual Information):
MI(X, Y) = \int\!\!\int P(X, Y) \log \frac{P(X, Y)}{P(X)\,P(Y)} \, dX \, dY = KL( P(X, Y) \,\|\, P(X) P(Y) )
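A minimal sketch of this univariate MI-based filter, assuming scikit-learn; the toy dataset from make_classification and the top-10 cut-off are illustrative stand-ins, not part of the original slides:

```python
# Univariate filtering: rank features by estimated mutual information with the label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Toy labelled data standing in for a real dataset (illustrative only).
X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)   # one MI estimate per feature
ranking = np.argsort(mi)[::-1]                   # best features first

top_k = 10                                       # arbitrary cut-off for illustration
selected = ranking[:top_k]
print("Top features by mutual information:", selected)
```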
Feature Selection: univariate filtering approach
Correlation and MI. Note: correlation is a measure of linear dependence.
Feature Selection: univariate filtering approach
Correlation and MI under the Gaussian distribution
Feature Selection: univariate filtering approach. Criteria for measuring dependence.
Feature Selection: univariate filtering approach
[Figure: class-conditional densities P(Xi | Y = 1) and P(Xi | Y = -1) plotted over feature xi; legend: Y = 1, Y = -1]
Feature Selection: univariate filtering approach
[Figure: class-conditional densities of feature xi; legend: Y = 1, Y = -1]
Two cases: P(Xi | Y = 1) = P(Xi | Y = -1) versus P(Xi | Y = 1) ≠ P(Xi | Y = -1)
Feature Selection: univariate filtering approach
[Figure: two class-conditional densities of feature xi with means \mu_- and \mu_+; is the distance between them significant?]
T-test
• Normally distributed classes with equal but unknown variance \sigma^2, estimated from the data as \sigma^2_{within}.
• Null hypothesis H0: \mu_+ = \mu_-
• T statistic: if H0 is true, then
  t = \frac{\mu_+ - \mu_-}{\sigma_{within} \sqrt{1/m_+ + 1/m_-}} \sim Student t with m_+ + m_- - 2 d.f.
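A minimal sketch of t-test feature ranking, assuming SciPy and scikit-learn; the toy two-class data and the ranking by p-value are illustrative choices:

```python
# Univariate filtering: two-sample t-test (equal variances) per feature.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification

# Toy two-class data standing in for a real dataset (illustrative only).
X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)

# t statistic and p-value for every feature at once (axis=0 tests column-wise).
t_stats, p_values = ttest_ind(X[y == 1], X[y == 0], axis=0, equal_var=True)

ranking = np.argsort(p_values)          # smallest p-value = most significant difference
print("Features ranked by t-test p-value:", ranking)
```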
Feature Selection: multivariate filtering approach
Guyon-Elisseeff, JMLR 2004; Springer 2006
Feature Selection: search strategies
N features, 2^N possible feature subsets!
Kohavi-John, 1997
Feature Selection: search strategies
Forward selection or backward elimination (a sketch of sequential forward selection is given below).
Beam search: keep the k best paths at each step.
GSFS: generalized sequential forward selection – when (n − k) features are left, try all subsets of g features. More training runs at each step, but fewer steps.
PTA(l, r): plus l, take away r – at each step, run SFS l times, then SBS r times.
Floating search: one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far.
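A minimal sketch of sequential forward selection used as a wrapper, assuming scikit-learn; the k-NN classifier, 5-fold cross-validated accuracy, and the stopping size of 4 are illustrative assumptions:

```python
# Sequential forward selection (SFS): greedily add the feature that most improves
# cross-validated accuracy of a chosen classifier (wrapper approach).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)        # any classifier could be wrapped

selected, remaining = [], list(range(X.shape[1]))
target_size = 4                                  # illustrative stopping point
while len(selected) < target_size:
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)           # feature giving the best CV score
    selected.append(best)
    remaining.remove(best)
    print(f"added feature {best}, CV accuracy {scores[best]:.3f}")
```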
Feature Selection: filters vs. wrappers vs. embedding
Main goal: rank subsets of useful features
Feature Selection: feature subset assessment (wrapper)
1) For each feature subset, train the predictor on the training data.
2) Select the feature subset that performs best on the validation data. Repeat and average if you want to reduce variance (cross-validation).
3) Test on the test data.
[Figure: data matrix of M samples by N variables/features, split row-wise into three blocks of sizes m1, m2, m3]
Split the data into 3 sets: training, validation, and test.
Danger of over-fitting with intensive search!
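A minimal sketch of this protocol, assuming scikit-learn; the logistic regression model and the three hand-picked candidate subsets are hypothetical placeholders for the subsets produced by a real search:

```python
# Wrapper assessment with a three-way split: pick the subset that is best on
# validation data, then report performance once on the held-out test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

candidate_subsets = [[0, 1, 2], [3, 4, 5, 6], [0, 3, 7, 11, 15]]   # illustrative subsets

def val_score(subset):
    model = LogisticRegression(max_iter=1000).fit(X_train[:, subset], y_train)
    return model.score(X_val[:, subset], y_val)

best = max(candidate_subsets, key=val_score)
final = LogisticRegression(max_iter=1000).fit(X_train[:, best], y_train)
print("best subset:", best, "test accuracy:", final.score(X_test[:, best], y_test))
```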
Feature Selection via Embedded Methods: L1-regularization
[Figure: coefficient paths plotted against sum(|beta|)]
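A minimal sketch of embedded selection through an L1 penalty, assuming scikit-learn; the regularization strength C = 0.1 is an illustrative value:

```python
# Embedded feature selection: an L1-penalized (lasso-style) logistic regression
# drives some coefficients exactly to zero, selecting features while fitting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=30, n_informative=5, random_state=0)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)  # smaller C = stronger penalty
model.fit(X, y)

selected = np.flatnonzero(model.coef_[0])        # features with non-zero weights
print("sum(|beta|) =", np.abs(model.coef_).sum())
print("selected features:", selected)
```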
Feature Selection: summary
Univariate, linear: T-test, AUC, feature ranking
Univariate, non-linear: mutual information feature ranking
Multivariate, linear: RFE with linear SVM or LDA
Multivariate, non-linear: Nearest Neighbors, Neural Nets, Trees, SVM
Dimensionality Reduction
In the presence of many features, select the most relevant subset of (weighted) combinations of features.
Feature Selection: X_1, ..., X_m  →  X_{k_1}, ..., X_{k_p}
Dimensionality Reduction: X_1, ..., X_m  →  f_1(X_1, ..., X_m), ..., f_p(X_1, ..., X_m)
Dimensionality Reduction: (Linear) Principal Components Analysis
PCA finds a linear mapping of dataset X to a dataset X’ of lower dimensionality, such that the variance of X that is retained in X’ is maximal.
Dataset X is mapped to dataset X’, here of the same dimensionality. The first dimension in X’ (= the first principal component) is the direction of maximal variance. The second principal component is orthogonal to the first.
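A minimal sketch of linear PCA, assuming scikit-learn; the Iris data and the choice of two retained components are illustrative:

```python
# Linear PCA: project the data onto the directions of maximal variance.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # any numeric data matrix would do

pca = PCA(n_components=2)                 # keep the first two principal components
X_reduced = pca.fit_transform(X)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("shape before/after:", X.shape, X_reduced.shape)
```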
Dimensionality Reduction: Nonlinear (Kernel) Principal Components Analysis
Map the original dataset X to a HIGHER-dimensional space, and carry out LINEAR PCA in that space.
(If necessary,) map the resulting principal components back to the original space.
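A minimal sketch of kernel PCA, assuming scikit-learn; the RBF kernel, gamma = 10, and the two-circles toy data are illustrative choices:

```python
# Kernel PCA: implicit mapping to a higher-dimensional feature space via a kernel,
# followed by linear PCA in that space.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0,
                 fit_inverse_transform=True)      # allows mapping back to input space
X_kpca = kpca.fit_transform(X)                    # nonlinear principal components
X_back = kpca.inverse_transform(X_kpca)           # approximate pre-image in the original space
```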
Dimensionality Reduction: Multi-Dimensional Scaling
MDS is a mathematical dimension-reduction technique that maps the distances between observations from the original (high-dimensional) space into a lower (for example, two-dimensional) space.
MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space.
Error on the fit is measured using a so-called “stress” function
Different choices for a stress function are possible
Dimensionality Reduction: Multi-Dimensional Scaling
Raw stress function (identical to PCA):
Sammon cost function:
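The two cost functions themselves were not legible in the source; a standard reconstruction (with d_{ij} the original pairwise distances and \hat{d}_{ij} = \lVert y_i - y_j \rVert the distances between the mapped points) is:

```latex
% Raw stress (metric MDS) and Sammon's cost, reconstructed from standard definitions.
\text{Raw stress:}\quad
  \phi_{\text{raw}} = \sum_{i<j} \bigl( d_{ij} - \hat{d}_{ij} \bigr)^2
\qquad
\text{Sammon cost:}\quad
  \phi_{\text{Sammon}} = \frac{1}{\sum_{i<j} d_{ij}}
    \sum_{i<j} \frac{\bigl( d_{ij} - \hat{d}_{ij} \bigr)^2}{d_{ij}}
```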
Dimensionality Reduction: Multi-Dimensional Scaling (Example)
[Figure: example input data and the resulting low-dimensional MDS output]
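A minimal sketch of metric MDS standing in for the example figure, assuming scikit-learn; the Iris data, precomputed Euclidean distances, and two output dimensions are illustrative:

```python
# Metric MDS: embed points in 2-D so that pairwise distances are preserved
# as well as possible (the residual misfit is the "stress").
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X = load_iris().data
D = pairwise_distances(X)                         # original pairwise Euclidean distances

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X_2d = mds.fit_transform(D)                       # low-dimensional configuration

print("final stress:", mds.stress_)
```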
Dimensionality Reduction: Homogeneity Analysis
Homals finds a lower-dimensional representation of a categorical data matrix X. It may be considered a type of nonlinear extension of PCA.
Clustering: Similarity measures for hierarchical clustering
Clustering: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA
Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS
Regression: Classical Linear Regression, Ridge Regression, NN, CART
[Figure: example scatter plots in the (X1, X2) plane with +/− labelled points]
Clustering
Clustering is an unsupervised learning technique.
Task: organize objects into groups whose members are similar in some way
Clustering finds structures in a collection of unlabeled data
A cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters
Density estimation and clustering
Bayesian separation curve (optimal)
Clustering: K-means clustering
Minimizes the sum of the squared distances to the cluster centers (reconstruction error)
Iterative process:
Estimate current assignments (construct Voronoi partition)
Given the new cluster assignments, set cluster center to center-of-mass
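A minimal sketch of K-means, assuming scikit-learn; the toy blob data and the choice of three clusters are illustrative:

```python
# K-means: alternate between assigning points to the nearest center (Voronoi
# partition) and moving each center to the mean of its assigned points.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("cluster centers:\n", kmeans.cluster_centers_)
print("reconstruction error (sum of squared distances):", kmeans.inertia_)
```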
Clustering: K-means clustering
[Figure: four snapshots (Step 1 to Step 4) of the K-means iterations]
Clustering: Hierarchical clustering
Dendrogram
Clustering based on (dis)similarities. Multilevel clustering: level 1 has n clusters, level n has one cluster
Agglomerative HC: starts with N clusters and combines clusters iteratively
Divisive HC: starts with one cluster and divides iteratively
Disadvantage: wrong division cannot be undone
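A minimal sketch of agglomerative hierarchical clustering, assuming SciPy; the toy data, single linkage, and the cut into three clusters are illustrative choices:

```python
# Agglomerative hierarchical clustering: start from singleton clusters and merge
# the closest pair at every level; the merge history can be drawn as a dendrogram.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

Z = linkage(X, method="single")                   # "complete", "average", ... are alternatives
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print("cluster labels:", labels)

# dendrogram(Z)  # with matplotlib available, this plots the full merge tree
```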
Clustering: Nearest Neighbor algorithm for hierarchical clustering
1. Nearest Neighbor, Level 2, k = 7 clusters.
2. Nearest Neighbor, Level 3, k = 6 clusters.
3. Nearest Neighbor, Level 4, k = 5 clusters.
Clustering: Nearest Neighbor algorithm for hierarchical clustering
4. Nearest Neighbor, Level 5, k = 4 clusters.
5. Nearest Neighbor, Level 6, k = 3 clusters.
6. Nearest Neighbor, Level 7, k = 2 clusters.
Clustering: Nearest Neighbor algorithm for hierarchical clustering
7. Nearest Neighbor, Level 8, k = 1 cluster.
Clustering: Similarity measures for hierarchical clustering
Clustering: Similarity measures for hierarchical clustering
Pearson Correlation: Trend Similarity
[Figure: three profiles a, b and c that follow the same trend]
C_pearson(a, c) = 1
C_pearson(a, b) = 1
C_pearson(b, c) = 1
Clustering: Similarity measures for hierarchical clustering
Euclidean Distance
d(x, y) = \sqrt{ \sum_{n=1}^{N} (x_n - y_n)^2 },  where x = (x_1, ..., x_N) and y = (y_1, ..., y_N)
Clustering: Similarity measures for hierarchical clustering
Cosine Correlation
C_cosine(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\lVert x \rVert \, \lVert y \rVert},  where x = (x_1, ..., x_N) and y = (y_1, ..., y_N)
-1 \le C_cosine(x, y) \le +1
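A minimal sketch that computes the three measures discussed here (Euclidean distance, Pearson correlation, cosine correlation) with NumPy; the two example profiles are made up for illustration:

```python
# Euclidean distance, Pearson correlation and cosine correlation between two profiles.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])       # illustrative profiles
y = np.array([2.0, 4.1, 5.9, 8.2])

euclidean = np.linalg.norm(x - y)
pearson = np.corrcoef(x, y)[0, 1]                          # trend similarity
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))   # trend + mean similarity

print(f"d(x,y)={euclidean:.3f}  C_pearson={pearson:.3f}  C_cosine={cosine:.3f}")
```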
Clustering: Similarity measures for hierarchical clustering
Cosine Correlation: Trend + Mean Distance
[Figure: three profiles a, b and c that follow the same trend]
C_cosine(a, b) = 1
C_cosine(a, c) = 0.9622
C_cosine(b, c) = 0.9622
Clustering: Similarity measures for hierarchical clustering
[Figure: three profiles a, b and c that follow the same trend]
C_cosine(a, b) = 1,  C_cosine(a, c) = 0.9622,  C_cosine(b, c) = 0.9622
d(a, c) = 1.5875,  d(a, b) = 2.8025,  d(b, c) = 3.2211
C_pearson(a, c) = 1,  C_pearson(a, b) = 1,  C_pearson(b, c) = 1
Clustering: Similarity measures for hierarchical clustering
C_cosine(a, b) = 0.7544,  C_cosine(a, c) = 0.8092,  C_cosine(b, c) = 0.844
d(a, c) = 0.0255,  d(a, b) = 0.0279,  d(b, c) = 0.0236
C_pearson(a, c) = 0.1244,  C_pearson(a, b) = 0.1175,  C_pearson(b, c) = 0.1779
Similar?
Clustering: Grouping strategies for hierarchical clustering
[Figure: three clusters C1, C2, C3]
Merge which pair of clusters?
Clustering: Grouping strategies for hierarchical clustering
Single Linkage
[Figure: two clusters C1 and C2]
Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.
Tends to generate “long chains”.
Clustering: Grouping strategies for hierarchical clustering
Complete Linkage
[Figure: two clusters C1 and C2]
Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.
Tends to generate “clumps”.
Clustering: Grouping strategies for hierarchical clustering
Average Linkage
[Figure: two clusters C1 and C2]
Dissimilarity between two clusters = average of the distances over all pairs of objects (one from each cluster).
Clustering: Grouping strategies for hierarchical clustering
Average Group Linkage
[Figure: two clusters C1 and C2]
Dissimilarity between two clusters = distance between the two cluster means.
Clustering: Support Vector Machines for clustering
The not-noisy case
Objective function:
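The objective itself is not legible in the transcript; the standard hard (not-noisy) formulation from Ben-Hur, Horn, Siegelmann and Vapnik (2001), which looks for the smallest sphere of radius R around a center a that encloses all mapped points Φ(x_j), is:

```latex
% Support vector clustering, hard (not-noisy) case: smallest enclosing sphere
% in the kernel-induced feature space.
\min_{R,\, a} \; R^2
\quad \text{subject to} \quad
\lVert \Phi(x_j) - a \rVert^2 \le R^2 \quad \forall j
```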
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Clustering: Support Vector Machines for clustering
The noisy case
Objective function:
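Likewise a reconstruction of the standard soft (noisy) formulation, where slack variables ξ_j let some points fall outside the sphere at a cost controlled by the constant C:

```latex
% Support vector clustering, noisy case: slack variables allow outliers
% at a penalty controlled by C.
\min_{R,\, a,\, \xi} \; R^2 + C \sum_j \xi_j
\quad \text{subject to} \quad
\lVert \Phi(x_j) - a \rVert^2 \le R^2 + \xi_j, \qquad \xi_j \ge 0 \quad \forall j
```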
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Clustering: Support Vector Machines for clustering
The noisy case (II)
Objective function:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Clustering: Support Vector Machines for clustering
The noisy case (III)
Objective function:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Conclusion / Summary / References
Feature Selection
  Filtering approach
  Wrapper approach
  Embedded methods
Clustering
  Density estimation and clustering
  K-means clustering
  Hierarchical clustering
  Clustering with Support Vector Machines (SVMs)
Dimensionality Reduction
  Principal Components Analysis (PCA)
  Nonlinear PCA (Kernel PCA, CatPCA)
  Multi-Dimensional Scaling (MDS)
  Homogeneity Analysis
References:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Borg and Groenen, 2005
Gifi, 1990
Guyon et al., 2006
Hastie et al., 2001
Kohavi and John, 1997
MacQueen, 1967
Schoelkopf et al., 2001
http://www.autonlab.org/tutorials/kmeans11.pdf
http://www.cs.otago.ac.nz/cosc453/student_tutorials/...principal_components.pdf