K-means algorithm
Speech and Image Processing Unit, School of Computing
University of Eastern Finland
Pasi Fränti
Clustering Methods: Part 2a
K-means overview
• Well-known clustering algorithm
• Number of clusters must be chosen in advance
• Strengths:
  1. Vectors can flexibly change clusters during the process.
  2. Always converges to a local optimum.
  3. Quite fast for most applications.
• Weaknesses:
  1. Quality of the output depends on the initial codebook.
  2. Global optimum solution not guaranteed.
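These properties surface directly as parameters in library implementations. For instance, a scikit-learn call (shown purely as an illustration; not part of the slides) makes the pre-chosen cluster count and the dependence on initialization explicit:

from sklearn.cluster import KMeans
import numpy as np

X = np.random.default_rng(0).random((100, 2))   # toy 2-D data

# n_clusters must be chosen in advance; n_init restarts the
# algorithm from several random initial codebooks and keeps
# the run with the lowest total squared error ("inertia").
km = KMeans(n_clusters=3, init="random", n_init=10).fit(X)
print(km.cluster_centers_)
print(km.inertia_)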
K-means pseudo code
X: a set of N data vectors (the data set)
CI: k initialized cluster centroids (number of clusters k, random initial centroids)
C: the cluster centroids of the k-clustering
P = {p(i) | i = 1, …, N}: the cluster labels of X

KMEANS(X, CI) → (C, P)
  C ← CI;
  REPEAT
    Cprevious ← C;
    FOR all i ∈ [1, N] DO                    // generate new optimal partitions
      p(i) ← arg min_{1 ≤ j ≤ k} d(xi, cj);
    FOR all j ∈ [1, k] DO                    // generate optimal centroids
      cj ← average of the xi whose p(i) = j;
  UNTIL C = Cprevious
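As a concrete rendering of this pseudocode, here is a minimal Python sketch (numpy only; the function name kmeans and the empty-cluster handling are additions for illustration, not part of the slides):

import numpy as np

def kmeans(X, C_init):
    """Plain k-means following the pseudocode above.
    X: (N, d) array of data vectors; C_init: (k, d) initial centroids.
    Returns (C, P): final centroids and cluster labels."""
    C = C_init.copy()
    while True:
        C_previous = C.copy()
        # Generate new optimal partitions: nearest centroid per vector.
        dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        P = dists.argmin(axis=1)
        # Generate optimal centroids: mean of each cluster's vectors
        # (an empty cluster keeps its previous centroid).
        C = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C[j]
                      for j in range(len(C))])
        if np.allclose(C, C_previous):   # UNTIL C = Cprevious
            return C, P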
K-means example (1/4)
Data set X: a set of N data vectors, N = 6 (points A–F)
CI: initialized k cluster centroids, k = 3 (random initial centroids)
Initial codebook: c1 = C, c2 = D, c3 = E
[Figure: the six data points A–F on a 2-D grid, with the initial centroids c1, c2, c3 placed on points C, D, E.]
Generate optimal partitions:
[Table: Euclidean distance matrix from each of A–F to the centroids c1, c2, c3; each point is assigned to its nearest centroid.]
Generate optimal centroids:
[Each centroid cj is recomputed as the average of the points assigned to it.]
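The partition step of the example can be written in a few lines; the coordinates below are hypothetical stand-ins for A–F (the slide's exact values are not reproduced here):

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1., 5.], [2., 6.], [4., 5.], [5., 6.], [5., 1.], [8., 1.]])
C = X[[2, 3, 4]]             # initial codebook: c1 = C, c2 = D, c3 = E

D = cdist(C, X)              # 3x6 Euclidean distance matrix (rows c1..c3)
P = D.argmin(axis=0)         # optimal partition: nearest centroid per point
C_new = np.array([X[P == j].mean(axis=0) for j in range(3)])  # optimal centroids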
[Figure: points A–F with the centroids c1, c2, c3 after the first partition and centroid update.]
K-means example (2/4)
After 1st iteration: MSE = 9.0 (MSE: the mean squared error, i.e. the average squared distance from each vector to its assigned centroid).
Generate optimal partitions:
[Table: Euclidean distance matrix from points A–F to the current centroids c1, c2, c3.]
Generate optimal centroids:
[Centroids recomputed as the averages of the newly assigned points.]
[Figure: points A–F with the centroids c1, c2, c3 after the second iteration.]
K-means example (3/4)
After 2nd iteration: MSE = 1.78
Generate optimal partitions:
[Table: Euclidean distance matrix from points A–F to the centroids c1, c2, c3.]
No object moves to another cluster → stop.
[Figure: the final clustering of points A–F with the converged centroids c1, c2, c3.]
K-means example (4/4)
After 3rd iteration: MSE = 0.31
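The per-iteration MSE trace can be reproduced by instrumenting the two steps of the sketch above; the six points below are hypothetical stand-ins for A–F (the slide's exact coordinates are not reproduced here), so the printed values will differ from 9.0, 1.78 and 0.31:

import numpy as np

def mse(X, C, P):
    # Mean squared error: average squared distance from each
    # vector to its assigned centroid.
    return ((X - C[P]) ** 2).sum(axis=1).mean()

X = np.array([[1., 5.], [2., 6.], [4., 5.], [5., 6.], [5., 1.], [8., 1.]])
C = X[[2, 3, 4]].copy()              # initial codebook: c1=C, c2=D, c3=E

for it in range(1, 10):
    # Generate optimal partitions.
    P = ((X[:, None] - C[None]) ** 2).sum(axis=2).argmin(axis=1)
    # Generate optimal centroids.
    C_new = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C[j]
                      for j in range(3)])
    print(f"iteration {it}: MSE = {mse(X, C_new, P):.2f}")
    if np.allclose(C_new, C):        # no centroid moved: converged
        break
    C = C_new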
Counter example
Initial codebook: c1 = A, c2 = B, c3 = C
[Figure: starting from this initial codebook, k-means converges to a poor local clustering of A–F, illustrating that the quality of the output depends on the initial codebook.]
Two ways to improve k-means
• Repeated k-means
  – Try several random initializations and take the best (sketched after this list).
  – Multiplies processing time.
  – Works for easier data sets.
• Better initialization
  – Use some better heuristic to allocate the initial distribution of code vectors.
  – Designing a good initialization is no easier than designing a good clustering algorithm in the first place!
  – K-means can (and should) nevertheless be applied to fine-tune the result of another method.
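A minimal sketch of repeated k-means, reusing the kmeans function from the pseudocode section (the initialization by sampling k distinct data vectors is an assumption; the slides do not prescribe one):

import numpy as np

def repeated_kmeans(X, k, repeats=10, seed=0):
    """Run k-means from several random initial codebooks and
    keep the run with the lowest MSE."""
    rng = np.random.default_rng(seed)
    best_err, best_C, best_P = np.inf, None, None
    for _ in range(repeats):
        C0 = X[rng.choice(len(X), size=k, replace=False)]  # random codebook
        C, P = kmeans(X, C0)                               # sketch from above
        err = ((X - C[P]) ** 2).sum(axis=1).mean()         # MSE of this run
        if err < best_err:
            best_err, best_C, best_P = err, C, P
    return best_C, best_P, best_err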