K-means algorithm


Page 1: K-means algorithm

K-means algorithm

Speech and Image Processing Unit, School of Computing

University of Eastern Finland

Pasi Fränti

Clustering Methods: Part 2a

Page 2: K-means algorithm

K-means overview

• Well-known clustering algorithm

• The number of clusters k must be chosen in advance

• Strengths:
  1. Vectors can flexibly change clusters during the process.
  2. Always converges to a local optimum.
  3. Quite fast for most applications.

• Weaknesses:
  1. Quality of the output depends on the initial codebook.
  2. Global optimum solution not guaranteed.
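
For a quick hands-on impression, here is a minimal sketch using scikit-learn's KMeans (scikit-learn assumed available; the data points are the six vectors from the worked example later in this deck):

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D data vectors (A-F from the example slides).
X = np.array([[1, 1], [2, 1], [4, 5], [5, 5], [5, 6], [8, 5]], dtype=float)

# k must be chosen in advance; several restarts (n_init) mitigate
# the dependence on the initial codebook.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster label p(i) for each vector
print(km.cluster_centers_)  # final centroids
```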

Page 3: K-means algorithm

K-means pseudo code

X: a set of N data vectors (the data set)
CI: k initialized cluster centroids (number of clusters, random initial centroids)
C: the cluster centroids of the k-clustering
P = {p(i) | i = 1, …, N}: the cluster labels of X

KMEANS(X, CI) → (C, P)

C ← CI;
REPEAT
    Cprevious ← C;
    FOR all i ∈ [1, N] DO            (generate new optimal partitions)
        p(i) ← arg min d(xi, cj), 1 ≤ j ≤ k;
    FOR all j ∈ [1, k] DO            (generate optimal centroids)
        cj ← average of xi whose p(i) = j;
UNTIL C = Cprevious
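
As a sketch, the pseudo code translates almost line for line into NumPy (the empty-cluster guard is an added assumption; the slides' example never hits that case):

```python
import numpy as np

def kmeans(X, CI):
    """X: (N, d) array of data vectors; CI: (k, d) array of initial centroids.
    Returns the final centroids C and the cluster labels P."""
    C = CI.copy()
    while True:
        C_previous = C.copy()
        # Generate new optimal partitions: label of the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        P = dists.argmin(axis=1)
        # Generate optimal centroids: average of the vectors in each cluster
        # (an empty cluster keeps its old centroid).
        C = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C[j]
                      for j in range(len(C))])
        if np.allclose(C, C_previous):   # UNTIL C = Cprevious
            return C, P
```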

Page 4: K-means algorithm

K-means example (1/4)

Data set X: a set of N data vectors, N = 6:
A = (1, 1), B = (2, 1), C = (4, 5), D = (5, 5), E = (5, 6), F = (8, 5)

Number of clusters k = 3, with random initial centroids.
Initial codebook CI: c1 = C, c2 = D, c3 = E

[Figure: the six data vectors on a 2-D grid; the initial centroids c1, c2, c3 coincide with C, D, E.]

Page 5: K-means algorithm

K-means example (2/4)

Generate optimal partitions: distance matrix (Euclidean distance) from the initial centroids c1 = (4, 5), c2 = (5, 5), c3 = (5, 6):

        A     B     C     D     E     F
c1     5.0   4.5   0.0   1.0   1.4   4.0
c2     5.7   5.0   1.0   0.0   1.0   3.0
c3     6.4   5.8   1.4   1.0   0.0   3.2

Each vector joins its nearest centroid: A, B, C → c1; D, F → c2; E → c3.

Generate optimal centroids:
c1 = ((1+2+4)/3, (1+1+5)/3) = (2.3, 2.3)
c2 = ((5+8)/2, (5+5)/2) = (6.5, 5)
c3 = (5, 6)

[Figure: the new partition and the updated centroid locations.]

After 1st iteration: MSE = 9.0
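
The distance matrix above can be reproduced in a few lines of NumPy (a sketch using the coordinates reconstructed above, rounded to one decimal as on the slide):

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 5], [5, 5], [5, 6], [8, 5]], dtype=float)
C = np.array([[4, 5], [5, 5], [5, 6]], dtype=float)   # c1, c2, c3

# Euclidean distance from every centroid to every data vector: a 3 x 6 matrix.
D = np.linalg.norm(C[:, None, :] - X[None, :, :], axis=2)
print(D.round(1))   # rows c1..c3, columns A..F
```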

Page 6: K-means algorithm

K-means example (3/4)

Generate optimal partitions: distance matrix (Euclidean distance) from c1 = (2.3, 2.3), c2 = (6.5, 5), c3 = (5, 6):

        A     B     C     D     E     F
c1     1.9   1.4   3.1   3.8   4.5   6.3
c2     6.8   6.0   2.5   1.5   1.8   1.5
c3     6.4   5.8   1.4   1.0   0.0   3.2

Each vector joins its nearest centroid: A, B → c1; F → c2; C, D, E → c3.

Generate optimal centroids:
c1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
c2 = (8, 5)
c3 = ((4+5+5)/3, (5+5+6)/3) = (4.7, 5.3)

[Figure: the new partition and the updated centroid locations.]

After 2nd iteration: MSE = 1.78

Page 7: K-means algorithm

K-means example (4/4)

Generate optimal partitions: distance matrix (Euclidean distance) from c1 = (1.5, 1), c2 = (8, 5), c3 = (4.7, 5.3):

        A     B     C     D     E     F
c1     0.5   0.5   4.7   5.3   6.1   7.6
c2     8.1   7.2   4.0   3.0   3.2   0.0
c3     5.7   5.1   0.7   0.5   0.7   3.3

No object moves between clusters → stop.

[Figure: the final partition {A, B} → c1, {C, D, E} → c3, {F} → c2.]

After 3rd iteration: MSE = 0.31
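
The MSE reported on these slides is the squared error of the current partition divided by N. A sketch that reproduces the whole trace 9.0 → 1.78 → 0.31:

```python
import numpy as np

# Reconstructed data vectors A-F; initial codebook c1 = C, c2 = D, c3 = E.
X = np.array([[1, 1], [2, 1], [4, 5], [5, 5], [5, 6], [8, 5]], dtype=float)
C = X[[2, 3, 4]].copy()

iteration = 0
while True:
    iteration += 1
    # Optimal partition against the current centroids.
    P = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2).argmin(axis=1)
    mse = ((X - C[P]) ** 2).sum() / len(X)      # MSE of this partition
    print(f"After iteration {iteration}: MSE = {mse:.2f}")
    # Optimal centroids for the new partition.
    C_new = np.array([X[P == j].mean(axis=0) for j in range(len(C))])
    if np.allclose(C_new, C):                   # no centroid moved: stop
        break
    C = C_new
```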

Page 8: K-means algorithm

Counter example

Same data set, but with initial codebook: c1 = A, c2 = B, c3 = C.

K-means now converges to a poor local optimum: A and B remain singleton clusters, while C, D, E and F all collapse into cluster c3 (final MSE ≈ 1.6, versus 0.31 for the good solution above).

[Figure: the converged partition {A}, {B}, {C, D, E, F}.]
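
Reusing X and the kmeans() function from the sketches above, the counter example is easy to reproduce:

```python
CI = X[[0, 1, 2]]                 # c1 = A, c2 = B, c3 = C
C, P = kmeans(X, CI)
print(P)                          # [0 1 2 2 2 2]: C, D, E, F share one cluster
print(((X - C[P]) ** 2).sum() / len(X))   # 1.625: much worse than 0.31
```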

Page 9: K-means algorithm

Two ways to improve k-means

• Repeated k-means
  – Try several random initializations and take the best result (see the sketch after this list).
  – Multiplies the processing time.
  – Works for easier data sets.

• Better initialization
  – Use a better heuristic to allocate the initial distribution of code vectors.
  – Designing a good initialization is no easier than designing a good clustering algorithm in the first place!
  – K-means can (and should) anyway be applied to fine-tune the result of another method.
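
A minimal sketch of repeated k-means, reusing the kmeans() function sketched earlier (the restart count and the seeding are arbitrary choices, not part of the slides):

```python
import numpy as np

def repeated_kmeans(X, k, repeats=10, seed=0):
    """Run k-means `repeats` times from random initial codebooks
    and keep the solution with the lowest MSE."""
    rng = np.random.default_rng(seed)
    best_mse, best_C, best_P = np.inf, None, None
    for _ in range(repeats):
        # Pick k distinct data vectors as the random initial codebook.
        CI = X[rng.choice(len(X), size=k, replace=False)]
        C, P = kmeans(X, CI)
        mse = ((X - C[P]) ** 2).sum() / len(X)
        if mse < best_mse:
            best_mse, best_C, best_P = mse, C, P
    return best_C, best_P, best_mse
```

On the six-point example this reliably recovers the MSE = 0.31 solution, at the cost of running the base algorithm `repeats` times.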

Page 10: K-means algorithm

References

1. Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21, 768–769.

2. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (eds L. M. Le Cam & J. Neyman), Vol. 1, pp. 281–297. Berkeley, CA: University of California Press.

3. Hartigan, J. A. and Wong, M. A. (1979). A k-means clustering algorithm. Applied Statistics 28, 100–108.

4. Xu, M. (2005). K-Means Based Clustering and Context Quantization. Academic dissertation, Department of Computer Science, University of Joensuu.