Data Mining Course 2007
Eric Postma
Clustering
Overview
Three approaches to clustering
1. Minimization of reconstruction error
   • PCA, nlPCA, k-means clustering
2. Distance preservation
   • Sammon mapping, Isomap, SPE
3. Maximum likelihood density estimation
   • Gaussian Mixtures
• These datasets have identical statistics up to 2nd order
1. Minimization of reconstruction error
Illustration of PCA (1)
• Face dataset (Rice database)
Illustration of PCA (2)
• Average face
Illustration of PCA (3)
• Top 10 Eigenfaces
Each 39-dimensional data item describes different aspects of the welfare and poverty of one country.
2D PCA projection
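Such a 2D PCA projection can be sketched in a few lines of NumPy. The data below is a random stand-in for the 39-dimensional country records, purely for illustration:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project data onto its top principal components."""
    Xc = X - X.mean(axis=0)                    # centre the data
    cov = np.cov(Xc, rowvar=False)             # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
    W = eigvecs[:, order[:n_components]]       # top principal directions
    return Xc @ W

# toy stand-in for the 39-dimensional welfare/poverty data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 39))
Y = pca_project(X, n_components=2)
```

The first projected coordinate carries at least as much variance as the second, by construction.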
Non-linear PCA
• Using neural networks (to be discussed tomorrow)
2. Distance preservation
Sammon mapping
• Given a data set X, the distance between any two samples i and j is defined as D_ij
• We consider the projection onto a two-dimensional plane in which the projected points are separated by distances d_ij
• Define an Error function
$$E = \frac{1}{\sum_{i<j} D_{ij}} \sum_{i<j} \frac{(D_{ij} - d_{ij})^2}{D_{ij}}$$
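The error E can be minimized by plain gradient descent, as a minimal sketch (Sammon's original procedure uses a more elaborate second-order step rule; the data, step size, and iteration count here are illustrative assumptions):

```python
import numpy as np

def pairwise(X):
    """Matrix of Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(D, d):
    """Sammon error E for input distances D and output distances d."""
    iu = np.triu_indices_from(D, k=1)
    Dij, dij = D[iu], d[iu]
    return np.sum((Dij - dij) ** 2 / Dij) / Dij.sum()

def sammon(X, n_iter=100, lr=0.01, seed=1):
    """Plain fixed-step gradient descent on the Sammon error."""
    rng = np.random.default_rng(seed)
    D = pairwise(X)
    np.fill_diagonal(D, 1.0)          # dummy value; the diagonal is masked out below
    Y = rng.normal(size=(X.shape[0], 2))
    for _ in range(n_iter):
        d = pairwise(Y)
        np.fill_diagonal(d, 1.0)
        W = (D - d) / (D * d)         # per-pair weight in the gradient of E
        np.fill_diagonal(W, 0.0)      # no self-term
        grad = -2.0 * (W[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
        Y -= lr * grad
    return Y

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # toy 3-D data
Y = sammon(X)
```

Because the descent starts from a random configuration, different runs end in different local minima, which is the limitation discussed on the next slide.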
Sammon mapping
Main limitations of Sammon
• The Sammon mapping procedure is a gradient descent method
• Main limitation: local minima
• Classical MDS may be preferred because it finds a global optimum (like PCA, its solution follows from an eigendecomposition)
• Both methods have difficulty with “curved or curly subspaces”
Isomap
• Tenenbaum et al.
• Build a graph in which each node represents a data point
• Compute shortest distances along the graph (e.g., Dijkstra's algorithm)
• Store all distances in a matrix D
• Perform MDS on the matrix D
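The steps above can be sketched as follows (Floyd–Warshall replaces Dijkstra here for brevity, and the toy data cloud is an assumption):

```python
import numpy as np

def floyd_warshall(G):
    """All-pairs shortest paths on a dense weight matrix (np.inf = no edge)."""
    D = G.copy()
    for k in range(len(D)):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

def isomap(X, n_neighbors=7, n_components=2):
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    E = np.sqrt((diff ** 2).sum(axis=-1))        # Euclidean distances
    # build the graph: each node keeps edges to its k nearest neighbours
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    idx = np.argsort(E, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        G[i, idx[i]] = E[i, idx[i]]
    G = np.minimum(G, G.T)                       # make the edge set symmetric
    # compute shortest distances along the graph, stored in matrix D
    D = floyd_warshall(G)
    # perform classical MDS on D: double-centre the squared distances, eigendecompose
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:n_components]
    return V[:, order] * np.sqrt(np.maximum(w[order], 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))                     # toy data cloud
Y = isomap(X, n_neighbors=7)
```

Note that the embedding only makes sense when the neighbourhood graph is connected; otherwise some geodesic distances remain infinite.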
Illustration of Isomap (1)
• For two arbitrary points on the manifold, Euclidean distance does not always reflect similarity (cf. dashed blue line)
Illustration of Isomap (2)
• Isomap finds the appropriate shortest path along the graph (red curve, for K=7, N=1000)
Illustration of Isomap (3)
• Two-dimensional embedding (red line is the shortest path along the graph; blue line is the true distance in the embedding)
Illustration of Isomap (4)
• Isomap's (●) ability to find the intrinsic dimensionality, compared to PCA and MDS (∆ and ○)
Illustration of Isomap (5)
Illustration of Isomap (6)
Illustration of Isomap (7)
• Interpolation along a straight line
Stochastic Proximity Embedding
• SPE algorithm
• Agrafiotis, D.K. and Xu, H. (2002). A self-organizing principle for learning nonlinear manifolds. Proceedings of the National Academy of Sciences U.S.A.
Stress function
$$f(d_{ij}, r_{ij}) = (d_{ij} - r_{ij})^2$$

where d_ij is the output proximity between points i and j, and r_ij the input proximity between points i and j.
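A minimal sketch of the SPE update rule behind this stress term: repeatedly pick a random pair of points, move both so their output proximity d_ij shifts toward the input proximity r_ij, and anneal the learning rate λ. The toy data and parameter values are illustrative assumptions:

```python
import numpy as np

def spe(R, dim=2, n_steps=30000, lam=1.0, seed=0):
    """SPE sketch: nudge random pairs toward their input proximities r_ij."""
    rng = np.random.default_rng(seed)
    n = len(R)
    Y = rng.uniform(size=(n, dim))               # random initial coordinates
    decay = lam / n_steps                        # anneal learning rate to ~0
    eps = 1e-10
    for _ in range(n_steps):
        i, j = rng.choice(n, size=2, replace=False)
        delta = Y[i] - Y[j]
        d = np.linalg.norm(delta)                # current output proximity d_ij
        c = 0.5 * lam * (R[i, j] - d) / (d + eps)
        Y[i] += c * delta                        # move the pair so d_ij -> r_ij
        Y[j] -= c * delta
        lam -= decay
    return Y

# intrinsically 2-D toy data: SPE should recover it with little residual stress
rng = np.random.default_rng(1)
X = rng.uniform(size=(30, 2))
R = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
Y = spe(R, dim=2)
```

Because each step touches only one pair, the cost per step is constant, which is what gives SPE its linear scaling in the number of points.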
Swiss roll data set
Original 3D set, and the 2D embedding obtained by SPE
Stress as a function of embedding dimension (averaged over 30 runs)
Scalability (# steps for four set sizes): linear scaling
Conformations of methyl propyl ether (atoms C1, C2, C3, O4, C5)
Diamine combinatorial library
Clustering
• Minimize the total within-cluster variance (reconstruction error)
$$E = \sum_{c=1}^{C} \sum_{i=1}^{N} k_{ic} \, \lVert x_i - w_c \rVert^2$$
• k_ic = 1 if data point i belongs to cluster c (0 otherwise)

K-means clustering:
1. Random selection of C cluster centres
2. Partition the data by assigning each point to its nearest cluster centre
3. The mean of each partition is the new cluster centre

A distance threshold may be used…
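The k-means steps can be sketched as follows (the three-blob toy data is an assumption for illustration):

```python
import numpy as np

def kmeans(X, C=3, n_iter=20, seed=0):
    """K-means clustering, iterated a fixed number of times."""
    rng = np.random.default_rng(seed)
    # 1. random selection of C cluster centres (drawn from the data)
    centres = X[rng.choice(len(X), size=C, replace=False)].copy()
    for _ in range(n_iter):
        # 2. partition the data: k_ic = 1 for the nearest centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 3. the mean of each partition is the new cluster centre
        for c in range(C):
            if np.any(labels == c):
                centres[c] = X[labels == c].mean(axis=0)
    return centres, labels

# three well-separated toy blobs
rng = np.random.default_rng(0)
blobs = [rng.normal(loc=m, scale=0.3, size=(20, 2)) for m in ((0, 0), (5, 0), (0, 5))]
X = np.vstack(blobs)
centres, labels = kmeans(X, C=3)
```

Rerunning with a different seed can change the final partitioning, which is exactly the initialization sensitivity discussed below.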
• Effect of distance threshold on the number of clusters
Main limitation of k-means clustering
• Final partitioning and cluster centres depend on initial configuration
• Discrete partitioning may introduce errors
• Instead of minimizing the reconstruction error, we may maximize the likelihood of the data (given some probabilistic model)
Neural algorithms related to k-means
• Kohonen self-organizing feature maps
• Competitive learning networks
3. Maximum likelihood
Gaussian Mixtures
• Model the pdf of the data using a mixture of distributions
• K is the number of kernels (K << # data points)
• Common choice for the component densities p(x|i):
$$p(x) = \sum_{i=1}^{K} p(x \mid i) \, P(i)$$

$$p(x \mid i) = \frac{1}{(2\pi \sigma_i^2)^{d/2}} \exp\!\left( -\frac{\lVert x - \mu_i \rVert^2}{2 \sigma_i^2} \right)$$
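Fitting this mixture by maximum likelihood is usually done with the EM algorithm. A minimal 1-D sketch with K = 2 kernels (the two-mode toy data and the initialization details are assumptions, loosely mimicking the illustration that follows):

```python
import numpy as np

def em_gmm_1d(x, K=2, n_iter=50, seed=0):
    """EM for a 1-D mixture of K Gaussian kernels (sketch)."""
    rng = np.random.default_rng(seed)
    P = np.full(K, 1.0 / K)                      # priors P(i)
    mu = rng.choice(x, size=K, replace=False)    # init means at random data points
    sig2 = np.full(K, x.var())                   # init variances at the data variance
    for _ in range(n_iter):
        # E-step: responsibilities p(i | x_n) via Bayes' rule
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sig2)) / np.sqrt(2 * np.pi * sig2)
        r = dens * P
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means and variances from the responsibilities
        Nk = r.sum(axis=0)
        P = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        sig2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return P, mu, sig2

# two well-separated 1-D modes, as in the two-kernel illustration
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.2, 200), rng.normal(1.0, 0.2, 200)])
P, mu, sig2 = em_gmm_1d(x, K=2)
```

Each EM iteration is guaranteed not to decrease the likelihood of the data, which is why a handful of steps typically suffices, as the next slides show.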
Illustration of EM applied to GM model
The solid line gives the initialization of the EM algorithm: two kernels, P(1) = P(2) = 0.5, μ1 = 0.0752, μ2 = 1.0176, σ1 = σ2 = 0.2356
Convergence after 10 EM steps.
Relevant literature
• L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik (submitted). Dimensionality Reduction: A Comparative Review.
• http://www.cs.unimaas.nl/l.vandermaaten