clustering methods course code: 175314

Pasi Fränti 9.2.2017 Machine Learning School of Computing University of Eastern Finland Joensuu, FINLAND Part 1: Introduction Clustering methods


Page 1: Clustering methods Course code: 175314

Pasi Fränti

9.2.2017

Machine Learning

School of Computing

University of Eastern Finland

Joensuu, FINLAND

Part 1: Introduction

Clustering methods

Page 2: Clustering methods Course code: 175314

Sample data

Sources of RGB vectors

Red-Green plot of the vectors

Page 3: Clustering methods Course code: 175314

Sample data

Employment statistics:

Page 4: Clustering methods Course code: 175314

Application examples

Page 5: Clustering methods Course code: 175314

Color reconstruction

Image with compression artifacts

Image with original colors

Page 6: Clustering methods Course code: 175314

Speaker modeling for voice biometrics

[Figure labels: Training data; Feature extraction and clustering; Speaker models (Matti, Mikko, Tomi); Feature extraction; unknown speaker (?); Best match: Matti!]

Page 7: Clustering methods Course code: 175314

Speaker modeling

Speech data

Result of clustering

Page 8: Clustering methods Course code: 175314

Image segmentation

Normalized color plots according to red and green components.

Image with 4 color clusters

[Plot axes: red, green]

Page 9: Clustering methods Course code: 175314

Signal quantization

[Figure labels: Original signal, Quantized signal]

Approximation of continuous range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values.
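The definition above can be illustrated with a minimal sketch in which each sample is replaced by the nearest value from a small set of allowed levels. The function name `quantize` and the parameter `levels` are assumptions made for this example; the slides do not specify how the levels are chosen.

```python
def quantize(signal, levels):
    """Replace each sample by the nearest allowed level (hypothetical helper)."""
    return [min(levels, key=lambda v: abs(v - s)) for s in signal]


# Four continuous-valued samples approximated by three discrete levels.
samples = [0.1, 0.4, 0.8, 1.9]
quantized = quantize(samples, levels=[0.0, 1.0, 2.0])
```

In a clustering setting the levels would typically be the cluster prototypes, so choosing good levels is itself a clustering problem.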

Page 10: Clustering methods Course code: 175314

Color quantization of images

Color image RGB samples

Clustering

Page 11: Clustering methods Course code: 175314

Users on map

Page 12: Clustering methods Course code: 175314

Clustering the users

Page 13: Clustering methods Course code: 175314

Clustering of photos in two ways

Clustering of photos

Clustering timeline

Page 14: Clustering methods Course code: 175314

Photo clusters on map

User and date

Number of photos

Clusters

Last known location of the user

Page 15: Clustering methods Course code: 175314
Page 16: Clustering methods Course code: 175314

Clusters

Number of photos

Functions:

Open cluster

Start slideshow

Clusters in the timeline view

Page 17: Clustering methods Course code: 175314

Clustering GPS tracks

Mobile users, taxi routes, fleet management

Page 18: Clustering methods Course code: 175314

Conclusions from clusters

Cluster 1: Office

Cluster 2: Home

Page 19: Clustering methods Course code: 175314

Clustering keywords

hostel

auberge

lodge

hostelry

film

cinema

movie

luncheon

meal

lunch

arena

stadium

gym

gymnasium

cafe

eatery

cafeteria

restaurant

coffeehouse

snack

collation

saloon

bar

barroom

ginmill

store

shop

cubicle

market

stall

pharmacy

kiosk

booth

outlet

drugstore

[Figure: numeric values attached to the keyword groups, ranging from 0 to 0.92]

Page 20: Clustering methods Course code: 175314

Clustering text descriptions

Page 21: Clustering methods Course code: 175314

Home take care services

Page 22: Clustering methods Course code: 175314

Clustering user preferences

Page 23: Clustering methods Course code: 175314

Part I: Clustering problem

Page 24: Clustering methods Course code: 175314

Subproblems of clustering

1. Where are the clusters? (Algorithmic problem)

2. How many clusters? (Methodological problem: which criterion?)

3. Selection of attributes (Application-related problem)

4. Preprocessing the data (Practical problems: normalization, outliers)

Page 25: Clustering methods Course code: 175314

Definitions and data

Set of N data points:

X = {x1, x2, …, xN}

Set of M cluster prototypes (centroids):

C = {c1, c2, …, cM}

Partition of the data:

P = {p1, p2, …, pN}, where pi is the index of the cluster to which xi belongs

Page 26: Clustering methods Course code: 175314

Distance and cost function

Euclidean distance of data vectors:

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{K} \left( x_i^k - x_j^k \right)^2}$$

Total square error:

$$TSE(C, P) = \sum_{i=1}^{N} \left\| x_i - c_{p_i} \right\|^2$$
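The two formulas can be written out directly. This is a minimal sketch in plain Python, assuming data vectors are tuples of numbers and cluster indices are 0-based (the slides count from 1). The function names are chosen for this example only.

```python
import math


def squared_distance(xi, xj):
    """Sum over the K attributes of the squared differences."""
    return sum((a - b) ** 2 for a, b in zip(xi, xj))


def euclidean_distance(xi, xj):
    """d(xi, xj): square root of the squared distance."""
    return math.sqrt(squared_distance(xi, xj))


def total_square_error(X, C, P):
    """TSE(C, P): squared distance from every vector to its own centroid."""
    return sum(squared_distance(x, C[p]) for x, p in zip(X, P))


X = [(0, 0), (0, 2), (10, 0), (10, 2)]   # N = 4 data vectors, K = 2
C = [(0, 1), (10, 1)]                    # M = 2 centroids
P = [0, 0, 1, 1]                         # cluster index of each vector
tse = total_square_error(X, C, P)
```

Every vector here lies at distance 1 from its centroid, so the total square error is 4.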

Page 27: Clustering methods Course code: 175314

Clustering result as partition

Illustrated by Voronoi diagram

Illustrated by Convex hulls

Cluster prototypes

Partition of data

Page 28: Clustering methods Course code: 175314

Duality of partition and centroids

Cluster prototypes: centroids as prototypes

Partition of data: partition by nearest-prototype mapping

Page 29: Clustering methods Course code: 175314

Dependency of data structures

Centroid condition: for a given partition (P), the optimal cluster centroids (C) for minimizing MSE are the average vectors of the clusters:

$$c_j = \frac{\sum_{p_i = j} x_i}{\sum_{p_i = j} 1}, \quad \forall j \in [1, M]$$

Optimal partition: for given centroids (C), the optimal partition is the one that maps each data vector to its nearest centroid:

$$p_i = \arg\min_{1 \le j \le M} d(x_i, c_j)^2, \quad \forall i \in [1, N]$$
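The centroid condition can be checked with a short derivation. This is a sketch that assumes squared Euclidean distance as the error measure; the symbols f and n_j (the number of vectors in cluster j) are notation introduced here, not taken from the slides.

```latex
\begin{aligned}
f(c_j) &= \sum_{p_i = j} \left\| x_i - c_j \right\|^2 \\
\nabla_{c_j} f &= -2 \sum_{p_i = j} \left( x_i - c_j \right) = 0 \\
\Rightarrow \quad c_j &= \frac{1}{n_j} \sum_{p_i = j} x_i, \qquad n_j = \sum_{p_i = j} 1
\end{aligned}
```

Because f is convex in c_j, this stationary point is the minimum, so the average vector is the optimal centroid for a fixed partition.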

Page 30: Clustering methods Course code: 175314

K-means algorithm

Page 31: Clustering methods Course code: 175314

K-means algorithm

X = Data set
C = Cluster centroids
P = Partition

K-Means(X, C) → (C, P)

REPEAT
    Cprev ← C;
    FOR all i ∈ [1, N] DO
        pi ← FindNearest(xi, C);            (optimal partition)
    FOR all j ∈ [1, M] DO
        cj ← Average of xi for which pi = j;  (optimal centroids)
UNTIL C = Cprev
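The pseudocode can be turned into a small runnable sketch. This is one possible reading in plain Python, not the course's reference implementation. It assumes 0-based cluster indices, lets an empty cluster keep its previous centroid, and adds a `max_iter` safety cap; none of these three choices is specified on the slide.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b))


def find_nearest(x, centroids):
    """Index of the centroid nearest to x (ties go to the lowest index)."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(x, centroids[j]))


def k_means(data, centroids, max_iter=100):
    """Alternate the partition and centroid steps until C stops changing."""
    centroids = [tuple(float(v) for v in c) for c in centroids]
    partition = []
    for _ in range(max_iter):  # safety cap, an assumption of this sketch
        previous = centroids
        # Optimal partition: map every vector to its nearest centroid.
        partition = [find_nearest(x, centroids) for x in data]
        # Optimal centroids: average of the vectors mapped to each cluster.
        centroids = []
        for j, old in enumerate(previous):
            members = [x for x, p in zip(data, partition) if p == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                centroids.append(mean)
            else:
                centroids.append(old)  # assumption: empty cluster is kept as-is
        if centroids == previous:
            break
    return centroids, partition


data = [(0, 0), (0, 2), (10, 0), (10, 2)]
final_centroids, final_partition = k_means(data, [(0, 0), (10, 0)])
```

On this toy data set the centroids move to the middle of each pair of points after one iteration, and the second iteration confirms that nothing changes.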

Page 32: Clustering methods Course code: 175314

Summary

Page 33: Clustering methods Course code: 175314

How to solve?

Solve the clustering (algorithmic problem):
- Given input data (X) of N data vectors and the number of clusters (M), find the clusters.
- The result is given as a set of prototypes or as a partition.

Solve the number of clusters (mathematical problem):
- Define an appropriate cluster validity function f.
- Repeat the clustering algorithm for several values of M.
- Select the best result according to f.

Solve the problem efficiently (computer science problem).
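The search over the number of clusters can be sketched as a simple loop. The slides do not name a specific validity function, so `cluster` and `validity` are hypothetical callables supplied by the caller, and the sketch assumes that a higher value of f is better.

```python
def select_number_of_clusters(data, candidate_ms, cluster, validity):
    """Cluster the data for each M and keep the result with the best f value."""
    best = None
    for m in candidate_ms:
        result = cluster(data, m)       # e.g. centroids and partition for this M
        score = validity(data, result)  # the validity function f (assumed: higher is better)
        if best is None or score > best[0]:
            best = (score, m, result)
    if best is None:
        raise ValueError("candidate_ms must not be empty")
    return best[1], best[2]
```

A criterion for which lower values are better can be used by negating it inside `validity`.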

Page 34: Clustering methods Course code: 175314

Challenges in clustering

Incorrect cluster allocation

Incorrect number of clusters

[Figure labels: Cluster missing, Clusters missing, Too many clusters]

Page 35: Clustering methods Course code: 175314

Taxonomy of clustering

[Jain, Murty, Flynn, Data clustering: A review, ACM Computing Surveys, 1999.]

• One possible classification based on cost function.

• MSE is well defined and most popular.

Page 36: Clustering methods Course code: 175314

Clustering method

Clustering method = defines the problem

Clustering algorithm = solves the problem

Problem defined as cost function:
- Goodness of one cluster
- Similarity vs. distance
- Global vs. local (“merge cost”, “cut”)

Solution: algorithm to solve the problem

Page 37: Clustering methods Course code: 175314

Complexity of clustering

• Clustering problem is NP-complete [Garey et al., 1982]

• Optimal solution by branch-and-bound in exponential time.

• Practical solutions by heuristic algorithms.

• Number of possible clusterings (the Stirling number of the second kind):

$$\left\{ {N \atop M} \right\} = \frac{1}{M!} \sum_{j=1}^{M} (-1)^{M-j} \binom{M}{j} j^{N}$$
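To get a feeling for how fast this number grows, the sum can be evaluated exactly with integer arithmetic. This is a minimal sketch; the function name `count_clusterings` is chosen for this example.

```python
from math import comb, factorial


def count_clusterings(n, m):
    """Number of ways to partition n data points into m non-empty clusters."""
    total = sum((-1) ** (m - j) * comb(m, j) * j ** n for j in range(1, m + 1))
    return total // factorial(m)  # the sum is always divisible by m!


small = count_clusterings(4, 2)    # 4 points into 2 clusters
larger = count_clusterings(10, 3)  # 10 points into 3 clusters
```

Four points can be split into two clusters in 7 ways, while ten points and three clusters already give 9330 possibilities, which is why exhaustive search is impractical for realistic N.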

Page 38: Clustering methods Course code: 175314

Software

Page 39: Clustering methods Course code: 175314

Animator

http://cs.uef.fi/sipu/clustering/animator/

Page 40: Clustering methods Course code: 175314

Clusterator

http://cs.uef.fi/paikka/Radu/clusterator/

Page 41: Clustering methods Course code: 175314

Cluster software

• Main area: working space for data

• Input area: inputs to be processed

• Output area: obtained results

• Menu Process: selection of operation

http://cs.uef.fi/sipu/soft/cluster2009.exe

Page 42: Clustering methods Course code: 175314

Procedure to simulate k-means

[Figure labels: Clustering image, Data set, Codebook, Partition]

Open data set (file *.ts), move it into Input area

Process – Random codebook, select number of clusters

REPEAT
    Move obtained codebook from Output area into Input area
    Process – Optimal partition, select Error function
    Move codebook into Main area, partition into Input area
    Process – Optimal codebook
UNTIL DESIRED CLUSTERING

Page 43: Clustering methods Course code: 175314

Conclusions

Clustering is a fundamental tool needed everywhere in computer science and beyond.

Failing to do clustering properly may compromise the application analysis.

A good clustering tool is needed so that researchers can focus on application requirements.

Page 44: Clustering methods Course code: 175314

Literature

1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 3rd edition, 2006.

2. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

3. A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review, ACM Computing Surveys, 31(3): 264-323, September 1999.

4. M.R. Garey, D.S. Johnson and H.S. Witsenhausen, The complexity of the generalized Lloyd-Max problem, IEEE Transactions on Information Theory, 28(2): 255-256, March 1982.

5. F. Aurenhammer, Voronoi diagrams: a survey of a fundamental geometric data structure, ACM Computing Surveys, 23(3): 345-405, September 1991.