K-means algorithm
Speech and Image Processing Unit, School of Computing
University of Eastern Finland
Pasi Fränti
Clustering Methods: Part 2a
K-means overview
• Well-known clustering algorithm
• Number of clusters must be chosen in advance
• Strengths:
  1. Vectors can flexibly change clusters during the process.
  2. Always converges to a local optimum.
  3. Quite fast for most applications.
• Weaknesses:
  1. Quality of the output depends on the initial codebook.
  2. Global optimum solution not guaranteed.
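These properties surface directly as parameters in library implementations. For instance, a scikit-learn call (shown purely as an illustration; not part of the slides) makes the pre-chosen cluster count and the dependence on initialization explicit:

from sklearn.cluster import KMeans
import numpy as np

X = np.random.default_rng(0).random((100, 2))   # toy 2-D data

# n_clusters must be chosen in advance; n_init restarts the
# algorithm from several random initial codebooks and keeps
# the run with the lowest total squared error ("inertia").
km = KMeans(n_clusters=3, init="random", n_init=10).fit(X)
print(km.cluster_centers_)
print(km.inertia_)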
K-means pseudo code
X: a set of N data vectors (the data set)
CI: k initialized cluster centroids (number of clusters k, random initial centroids)
C: the cluster centroids of the k-clustering
P = {p(i) | i = 1, …, N}: the cluster labels of X

KMEANS(X, CI) → (C, P)
  C ← CI;
  REPEAT
    Cprevious ← C;
    FOR all i ∈ [1, N] DO                    // generate new optimal partitions
      p(i) ← arg min_{1 ≤ j ≤ k} d(xi, cj);
    FOR all j ∈ [1, k] DO                    // generate optimal centroids
      cj ← average of the xi whose p(i) = j;
  UNTIL C = Cprevious
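As a concrete rendering of this pseudocode, here is a minimal Python sketch (numpy only; the function name kmeans and the empty-cluster handling are additions for illustration, not part of the slides):

import numpy as np

def kmeans(X, C_init):
    """Plain k-means following the pseudocode above.
    X: (N, d) array of data vectors; C_init: (k, d) initial centroids.
    Returns (C, P): final centroids and cluster labels."""
    C = C_init.copy()
    while True:
        C_previous = C.copy()
        # Generate new optimal partitions: nearest centroid per vector.
        dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        P = dists.argmin(axis=1)
        # Generate optimal centroids: mean of each cluster's vectors
        # (an empty cluster keeps its previous centroid).
        C = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C[j]
                      for j in range(len(C))])
        if np.allclose(C, C_previous):   # UNTIL C = Cprevious
            return C, P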
K-means example (1/4)
Data set X: a set of N data vectors, N = 6 (points A–F)
CI: initialized k cluster centroids, k = 3 (random initial centroids)
Initial codebook: c1 = C, c2 = D, c3 = E
[Figure: the six data points A–F on a 2-D grid, with the initial centroids c1, c2, c3 placed on points C, D, E.]
Generate optimal partitions:
[Table: Euclidean distance matrix from each of A–F to the centroids c1, c2, c3; each point is assigned to its nearest centroid.]
Generate optimal centroids:
[Each centroid cj is recomputed as the average of the points assigned to it.]
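The partition step of the example can be written in a few lines; the coordinates below are hypothetical stand-ins for A–F (the slide's exact values are not reproduced here):

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1., 5.], [2., 6.], [4., 5.], [5., 6.], [5., 1.], [8., 1.]])
C = X[[2, 3, 4]]             # initial codebook: c1 = C, c2 = D, c3 = E

D = cdist(C, X)              # 3x6 Euclidean distance matrix (rows c1..c3)
P = D.argmin(axis=0)         # optimal partition: nearest centroid per point
C_new = np.array([X[P == j].mean(axis=0) for j in range(3)])  # optimal centroids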
[Figure: points A–F with the centroids c1, c2, c3 after the first partition and centroid update.]
K-means example (2/4)
After 1st iteration: MSE = 9.0 (MSE: the mean squared error, i.e. the average squared distance from each vector to its assigned centroid).
Generate optimal partitions:
[Table: Euclidean distance matrix from points A–F to the current centroids c1, c2, c3.]
Generate optimal centroids:
[Centroids recomputed as the averages of the newly assigned points.]
[Figure: points A–F with the centroids c1, c2, c3 after the second iteration.]
K-means example (3/4)
After 2nd iteration: MSE = 1.78
Generate optimal partitions:
[Table: Euclidean distance matrix from points A–F to the centroids c1, c2, c3.]
No object moves to another cluster → stop.
[Figure: the final clustering of points A–F with the converged centroids c1, c2, c3.]
K-means example (4/4)
After 3rd iteration: MSE = 0.31
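The per-iteration MSE trace can be reproduced by instrumenting the two steps of the sketch above; the six points below are hypothetical stand-ins for A–F (the slide's exact coordinates are not reproduced here), so the printed values will differ from 9.0, 1.78 and 0.31:

import numpy as np

def mse(X, C, P):
    # Mean squared error: average squared distance from each
    # vector to its assigned centroid.
    return ((X - C[P]) ** 2).sum(axis=1).mean()

X = np.array([[1., 5.], [2., 6.], [4., 5.], [5., 6.], [5., 1.], [8., 1.]])
C = X[[2, 3, 4]].copy()              # initial codebook: c1=C, c2=D, c3=E

for it in range(1, 10):
    # Generate optimal partitions.
    P = ((X[:, None] - C[None]) ** 2).sum(axis=2).argmin(axis=1)
    # Generate optimal centroids.
    C_new = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C[j]
                      for j in range(3)])
    print(f"iteration {it}: MSE = {mse(X, C_new, P):.2f}")
    if np.allclose(C_new, C):        # no centroid moved: converged
        break
    C = C_new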
Counter example
Initial codebook: c1 = A, c2 = B, c3 = C
[Figure: starting from this initial codebook, k-means converges to a poor local clustering of A–F, illustrating that the quality of the output depends on the initial codebook.]
Two ways to improve k-means
• Repeated k-means
  – Try several random initializations and take the best (sketched after this list).
  – Multiplies processing time.
  – Works for easier data sets.
• Better initialization
  – Use some better heuristic to allocate the initial distribution of code vectors.
  – Designing a good initialization is no easier than designing a good clustering algorithm in the first place!
  – K-means can (and should) nevertheless be applied to fine-tune the result of another method.
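A minimal sketch of repeated k-means, reusing the kmeans function from the pseudocode section (the initialization by sampling k distinct data vectors is an assumption; the slides do not prescribe one):

import numpy as np

def repeated_kmeans(X, k, repeats=10, seed=0):
    """Run k-means from several random initial codebooks and
    keep the run with the lowest MSE."""
    rng = np.random.default_rng(seed)
    best_err, best_C, best_P = np.inf, None, None
    for _ in range(repeats):
        C0 = X[rng.choice(len(X), size=k, replace=False)]  # random codebook
        C, P = kmeans(X, C0)                               # sketch from above
        err = ((X - C[P]) ** 2).sum(axis=1).mean()         # MSE of this run
        if err < best_err:
            best_err, best_C, best_P = err, C, P
    return best_C, best_P, best_err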