clustering methods course code: 175314 pasi fränti 10.3.2014 speech & image processing unit...
Clustering methods
Course code: 175314
Pasi Fränti
10.3.2014
Speech & Image Processing Unit
School of Computing
University of Eastern Finland
Joensuu, FINLAND
Part 1: Introduction
Sample data
Sources of RGB vectors
Red-Green plot of the vectors
Sample data
Employment statistics:
Application example 1: Color reconstruction
Image with compression artifacts
Image with original colors
Application example 2: Speaker modeling for voice biometrics
Training data (Matti, Mikko, Tomi) → feature extraction and clustering → speaker models (Matti, Mikko, Tomi)
Unknown speaker (?) → feature extraction → best match: Matti!
Speaker modeling
Speech data
Result of clustering
Application example 3: Image segmentation
Normalized color plots according to red and green components.
Image with 4 color clusters
Plot axes: red, green
Application example 4: Quantization
Quantized signal
Original signal
Approximation of continuous range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values
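As an illustration of this definition, here is a minimal sketch of a uniform scalar quantizer. The function names, the value range `[lo, hi]` and the choice of equal-width intervals are assumptions made for the example; the slide does not prescribe a particular quantizer.

```python
def uniform_quantize(values, num_levels, lo, hi):
    """Map each value in [lo, hi] to an integer symbol in 0..num_levels-1.

    Assumption: the range is split into num_levels equal-width intervals.
    """
    step = (hi - lo) / num_levels
    symbols = []
    for v in values:
        k = int((v - lo) / step)
        # Clamp so that v == hi (or values outside the range) stay valid.
        k = min(max(k, 0), num_levels - 1)
        symbols.append(k)
    return symbols


def reconstruct(symbols, num_levels, lo, hi):
    """Approximate the original values by the midpoint of each interval."""
    step = (hi - lo) / num_levels
    return [lo + (k + 0.5) * step for k in symbols]


signal = [0.0, 0.26, 0.5, 0.99, 1.0]
symbols = uniform_quantize(signal, 4, 0.0, 1.0)
approx = reconstruct(symbols, 4, 0.0, 1.0)
```

A clustering-based quantizer would instead place the reconstruction values where the data is dense, which is what the color quantization example on the next slide does.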
Color quantization of images
Color image RGB samples
Clustering
Application example 5: Clustering of spatial data
Clustered locations of users
Clustered locations of users
Clustering of photos
Timeline clustering
Clustering GPS trajectories
Mobile users, taxi routes, fleet management
Conclusions from clusters
Cluster 1: Office
Cluster 2: Home
Part I: Clustering problem
Subproblems of clustering
1. Where are the clusters? (Algorithmic problem)
2. How many clusters? (Methodological problem: which criterion?)
3. Selection of attributes (Application-related problem)
4. Preprocessing the data (Practical problems: normalization, outliers)
Clustering result as partition
Illustrated by Voronoi diagram
Illustrated by Convex hulls
Cluster prototypes
Partition of data
Duality of partition and centroids
Cluster prototypes: centroids as prototypes
Partition of data: partition by nearest prototype mapping
Challenges in clustering
Incorrect cluster allocation
Incorrect number of clusters
Cluster missing
Clusters missing
Too many clusters
How to solve?
Solve the clustering (algorithmic problem): Given input data (X) of N data vectors and the number of clusters (M), find the clusters. The result is given as a set of prototypes or as a partition.
Solve the number of clusters (mathematical problem): Define an appropriate cluster validity function f. Repeat the clustering algorithm for several M. Select the best result according to f.
Solve the problem efficiently (computer science problem).
Taxonomy of clustering
[Jain, Murty, Flynn, Data clustering: A review, ACM Computing Surveys, 1999.]
• One possible classification based on cost function.
• MSE is well defined and most popular.
Definitions and data
Set of N data points: X={x1, x2, …, xN}
Set of M cluster prototypes (centroids): C={c1, c2, …, cM}
Partition of the data: P={p1, p2, …, pN}, where pi is the index of the cluster to which xi belongs
Distance and cost function
Euclidean distance of data vectors:
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{K} \left(x_i^k - x_j^k\right)^2}$$
Mean square error:
$$MSE(C, P) = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - c_{p_i} \right\|^2$$
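The Euclidean distance and the mean square error of this slide can be sketched in a few lines of Python. The function names are illustrative, vectors are assumed to be tuples of equal length, and the partition uses 0-based cluster indices (the slides count from 1).

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Euclidean distance d(x, y)."""
    return math.sqrt(sq_dist(x, y))


def mse(X, C, P):
    """Mean square error of data X for centroids C and partition P.

    P[i] is the (0-based) index of the centroid that X[i] is mapped to.
    """
    return sum(sq_dist(x, C[p]) for x, p in zip(X, P)) / len(X)


X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
C = [(1.0, 0.0), (11.0, 0.0)]
P = [0, 0, 1, 1]
error = mse(X, C, P)  # every vector is at distance 1 from its centroid
```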
Dependency of data structures
Centroid condition: for a given partition (P), the optimal cluster centroids (C) for minimizing MSE are the average vectors of the clusters:
$$c_j = \frac{\sum_{p_i = j} x_i}{\sum_{p_i = j} 1}, \quad \forall j \in [1, M]$$
Optimal partition: for given centroids (C), the optimal partition is the one that maps each data vector to its nearest centroid:
$$p_i = \arg\min_{1 \le j \le M} d(x_i, c_j)^2, \quad \forall i \in [1, N]$$
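The two conditions of this slide translate directly into two small functions. This is a sketch under assumptions: the function names are made up for the example, cluster indices are 0-based, and an empty cluster (for which the average is undefined) is reported as `None`.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def nearest_centroid_partition(X, C):
    """Optimal partition: map each vector to the index of its nearest centroid."""
    return [min(range(len(C)), key=lambda j: sq_dist(x, C[j])) for x in X]


def centroids_of_partition(X, P, M):
    """Centroid condition: each centroid is the average vector of its cluster."""
    centroids = []
    for j in range(M):
        members = [x for x, p in zip(X, P) if p == j]
        if not members:
            centroids.append(None)  # assumption: empty cluster has no centroid
            continue
        # zip(*members) iterates over the attributes (columns) of the cluster.
        centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centroids


X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
P = nearest_centroid_partition(X, [(1.0, 1.0), (9.0, 0.0)])
C = centroids_of_partition(X, P, 2)
```

Each structure can be derived from the other, which is why a clustering result can be stored either as a partition or as a set of centroids.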
Complexity of clustering
• The clustering problem is NP-complete [Garey et al., 1982]
• Optimal solution by branch-and-bound in exponential time.
• Practical solutions by heuristic algorithms.
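• Number of possible clusterings (Stirling number of the second kind):
$$\left\{ {N \atop M} \right\} = \frac{1}{M!} \sum_{j=1}^{M} (-1)^{M-j} \binom{M}{j} j^{N}$$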
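To get a feeling for how quickly the number of possible clusterings grows, the count of ways to split N data points into M non-empty clusters (the Stirling number of the second kind) can be evaluated directly. The function name below is an assumption for this sketch; `math.comb` requires Python 3.8 or later.

```python
import math


def num_clusterings(N, M):
    """Number of ways to partition N points into M non-empty clusters."""
    total = sum(
        (-1) ** (M - j) * math.comb(M, j) * j ** N
        for j in range(1, M + 1)
    )
    # The alternating sum is always divisible by M!, so integer division is exact.
    return total // math.factorial(M)


small = num_clusterings(4, 2)     # 7
medium = num_clusterings(10, 3)   # 9330
large = num_clusterings(100, 5)   # a 68-digit number
```

Already 10 points and 3 clusters give 9330 alternatives, which is why exhaustive search is not practical and heuristic algorithms are used instead.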
Cluster software
• Main area: working space for data
• Input area: inputs to be processed
• Output area: obtained results
• Menu Process: selection of operation
http://cs.joensuu.fi/sipu/soft/cluster2009.exe
Procedure to simulate k-means
Screenshot labels: Clustering image, Data set, Codebook, Partition
Open data set (file *.ts), move it into Input area
Process – Random codebook, select number of clusters
REPEAT
Move obtained codebook from Output area into Input area
Process – Optimal partition, select Error function
Move codebook into Main area, partition into Input area
Process – Optimal codebook
UNTIL DESIRED CLUSTERING
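The same loop can be written out in code. The following is a minimal sketch, not the implementation used by the Cluster software: the function names are illustrative, the random codebook is assumed to be drawn from the data vectors, an empty cluster is assumed to keep its previous codebook vector, and "until desired clustering" is replaced by the assumed stopping rule "until the partition no longer changes".

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def optimal_partition(X, C):
    """Map every data vector to the index of its nearest codebook vector."""
    return [min(range(len(C)), key=lambda j: sq_dist(x, C[j])) for x in X]


def optimal_codebook(X, P, C):
    """Replace every codebook vector by the mean of the vectors mapped to it."""
    new_C = []
    for j, old in enumerate(C):
        members = [x for x, p in zip(X, P) if p == j]
        if not members:
            new_C.append(old)  # assumption: an empty cluster keeps its old vector
            continue
        new_C.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_C


def kmeans(X, M, max_iter=100, seed=0):
    """Alternate optimal partition and optimal codebook until nothing changes."""
    rng = random.Random(seed)
    C = rng.sample(X, M)               # random codebook: M data vectors
    P = None
    for _ in range(max_iter):          # REPEAT
        new_P = optimal_partition(X, C)
        if new_P == P:                 # assumed stopping rule: partition is stable
            break
        P = new_P
        C = optimal_codebook(X, P, C)
    return C, P


X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
codebook, partition = kmeans(X, 2)
```

On this toy data set every choice of initial codebook ends at the centroids (1, 0) and (11, 0); on real data the result depends on the random initialization.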
XLMiner software
http://www.resample.com/xlminer/help/HClst/HClst_ex.htm
Example of data in XLMiner
Distance matrix & dendrogram
Conclusions
Clustering is a fundamental tool needed in speech and image processing.
Failing to do clustering properly may distort the application analysis.
A good clustering tool is needed so that researchers can focus on application requirements.
Literature
1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 3rd edition, 2006.
2. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
3. A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review, ACM Computing Surveys, 31(3): 264-323, September 1999.
4. M.R. Garey, D.S. Johnson and H.S. Witsenhausen, The complexity of the generalized Lloyd-Max problem, IEEE Transactions on Information Theory, 28(2): 255-256, March 1982.
5. F. Aurenhammer, Voronoi diagrams: a survey of a fundamental geometric data structure, ACM Computing Surveys, 23(3): 345-405, September 1991.