Clustering
(Part 3)
1
Introduction
This set of slides covers two other important center-based clustering problems, which were defined earlier. Namely,
• k-means clustering
• k-median clustering
We will
• Review classical, well-established algorithms, and point out their weaknesses when applied in a big-data scenario.
• Show that in both cases, a coreset-based strategy yields effective performance-accuracy tradeoffs.
Observation: Unlike k-center clustering, which aims at a worst-case guarantee on the distance of a point from its cluster's center, both k-means and k-median clustering aim at optimizing average distances.
2
k-means clustering
3
Definition (review) and some observations
Given a pointset P from a metric space (M, d) and an integer k > 0, the k-means clustering problem requires determining a k-clustering C = (C1, C2, ..., Ck; c1, c2, ..., ck) which minimizes:

Φkmeans(C) = Σ_{i=1}^{k} Σ_{a∈Ci} (d(a, ci))².
Our aim:
• Focus on Euclidean spaces (i.e., ℝ^D with Euclidean distance).
• Focus on the unrestricted version of the problem, where centers need not belong to the input P.
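As a small illustration (not part of the original slides), the objective Φkmeans defined above can be computed in a few lines of Python; the function name and the NumPy representation are our own choices:

```python
import numpy as np

def kmeans_objective(points, labels, centers):
    """Sum of squared Euclidean distances of each point to its cluster center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diffs = points - centers[labels]   # vector from each point to its assigned center
    return float(np.sum(diffs ** 2))
```

For example, two points at distance 1 from a common center contribute a total cost of 2.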
4
Observations
• k-means clustering aims at minimizing cluster variance, and works well for discovering ball-shaped clusters.
• Because of the quadratic dependence on distances, k-means clustering is rather sensitive to outliers (though less than k-center).
• Although some well-established algorithms for k-means clustering have been known for decades, only recently has a rigorous assessment of their performance-accuracy tradeoff been carried out.
5
Properties of Euclidean spaces
Let X = (X1, X2, ..., XD) and Y = (Y1, Y2, ..., YD) be two points in ℝ^D. Recall that their Euclidean distance is

d(X, Y) = √( Σ_{i=1}^{D} (Xi − Yi)² ) ≝ ||X − Y||.
Definition
The centroid of a set P of N points in ℝ^D is

c(P) = (1/N) Σ_{X∈P} X,

where the sum is component-wise.

Observe that c(P) does not necessarily belong to P.
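The lemma on the next slide shows that the centroid minimizes the sum of squared distances. As a quick numerical sanity check (a sketch; the helper names are ours), one can compare the centroid's cost against arbitrary candidate points:

```python
import numpy as np

def centroid(P):
    """c(P): component-wise mean of the points (need not belong to P)."""
    return np.asarray(P, dtype=float).mean(axis=0)

def sum_sq_dist(P, Y):
    """Sum of squared Euclidean distances from the points of P to Y."""
    P = np.asarray(P, dtype=float)
    return float(((P - np.asarray(Y, dtype=float)) ** 2).sum())

# Per the lemma, sum_sq_dist(P, centroid(P)) <= sum_sq_dist(P, Y) for every Y.
```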
6
Properties of Euclidean spaces (cont’d)
Lemma
The centroid c(P) of a set P ⊂ ℝ^D is the point of ℝ^D which minimizes the sum of the square distances to all points of P.
Proof
Consider an arbitrary point Y ∈ ℝ^D. We have that

Σ_{X∈P} (d(X, Y))² = Σ_{X∈P} ||X − Y||²
                   = Σ_{X∈P} ||X − cP + cP − Y||²
                   = Σ_{X∈P} (X − cP + cP − Y) · (X − cP + cP − Y),

where "·" denotes the inner product.
7
Properties of Euclidean spaces (cont’d)
Proof. (cont’d)
Σ_{X∈P} (d(X, Y))² = Σ_{X∈P} (X − cP + cP − Y) · (X − cP + cP − Y)
 = Σ_{X∈P} [(X − cP) · (X − cP) + (X − cP) · (cP − Y) + (cP − Y) · (X − cP) + (cP − Y) · (cP − Y)]
 = Σ_{X∈P} ||X − cP||² + N||cP − Y||² + 2(cP − Y) · ( Σ_{X∈P} (X − cP) ),
8
Properties of Euclidean spaces (cont’d)
Proof. (cont’d).
By definition of cP we have that Σ_{X∈P} (X − cP) = (Σ_{X∈P} X) − N·cP is the all-0's vector, hence

Σ_{X∈P} (d(X, Y))² = Σ_{X∈P} ||X − cP||² + N||cP − Y||²
                   ≥ Σ_{X∈P} ||X − cP||²
                   = Σ_{X∈P} (d(X, cP))².
Observation: The lemma implies that when seeking a k-clustering for points in ℝ^D which minimizes the k-means objective, the best center to select for each cluster is its centroid (assuming that centers need not necessarily belong to the input pointset).
9
Lloyd’s algorithm
• Also known as the k-means algorithm, it was developed by Stuart Lloyd in 1957 and became one of the most popular algorithms for unsupervised learning.
• In 2006 it was listed as one of the top-10 most influential data mining algorithms.
• It relates to the Expectation-Maximization scheme, since it iterates the following two steps:
• Expectation step: the best clustering is computed around the given centroids.
• Maximization step: new centroids are computed to minimize the average square distance.
10
Lloyd’s algorithm (cont’d)
Input: Set P of N points from ℝ^D, integer k > 1
Output: k-clustering C = (C1, C2, ..., Ck; c1, c2, ..., ck) of P with small Φkmeans(C). Centers need not be in P.

S ← arbitrary set of k points in ℝ^D
Φ ← ∞; stopping-condition ← false
while (!stopping-condition) do
    (C1, C2, ..., Ck; S) ← Partition(P, S)
    for i ← 1 to k do c′i ← centroid of Ci
    C ← (C1, C2, ..., Ck; c′1, c′2, ..., c′k)
    S ← {c′1, c′2, ..., c′k}
    if Φkmeans(C) < Φ then Φ ← Φkmeans(C)
    else stopping-condition ← true
return C
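A minimal in-memory Python sketch of the pseudocode above (the function names and the random initialization from P are our own choices; the slides allow arbitrary initial points of ℝ^D):

```python
import numpy as np

def partition(P, centers):
    """Assign each point of P to its closest center (squared Euclidean distance)."""
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def lloyd(P, k, seed=0, max_iter=100):
    """Lloyd's iterations: partition around the current centers, replace each
    center by the centroid of its cluster, and stop as soon as the objective
    stops decreasing (the stopping rule of the pseudocode above)."""
    P = np.asarray(P, dtype=float)
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), size=k, replace=False)]  # arbitrary k points
    phi = np.inf
    for _ in range(max_iter):
        labels = partition(P, centers)
        new_centers = np.array([P[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        new_phi = float(((P - new_centers[labels]) ** 2).sum())
        if new_phi < phi:
            phi, centers = new_phi, new_centers
        else:
            break
    labels = partition(P, centers)  # final clustering around the kept centers
    return labels, centers, float(((P - centers[labels]) ** 2).sum())
```

Note that, as discussed on the next slides, the result can be a poor local optimum depending on the initial centers.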
11
Example
12
Lloyd’s algorithm: analysis
Theorem
Lloyd's algorithm always terminates.
Proof.
Let Ct be the k-clustering computed by the algorithm in Iteration t ≥ 1 of the while loop (line 3 of the loop). We have:
• By construction, the objective function values Φkmeans(Ct), for every t except the last one, form a strictly decreasing sequence. Also, in the last iteration (say Iteration t*) Φkmeans(Ct*) = Φkmeans(Ct*−1), since the invocation of Partition and the subsequent recomputation of the centroids in this iteration cannot increase the value of the objective function.
• Each Ct is uniquely determined by the partition of P into k clusters computed at the beginning of Iteration t.
• By the above two points we conclude that all Ct's, except possibly the last one, are determined by distinct partitions of P.
The theorem follows immediately since there are k^N possible partitions of P into ≤ k clusters (counting also permutations of the same partition).
13
Observations on Lloyd’s algorithm
• Lloyd's algorithm may be trapped in a local optimum whose value of the objective function can be much larger than the optimal value. Hence, no guarantee can be proved on its accuracy. Consider the following example and suppose that k = 3.
If initially one center falls among the points on the left side, and two centers fall among the points on the right side, it is impossible for a center to move from right to left, hence the two obvious clusters on the left will be considered as just one cluster.
14
Observations on Lloyd’s algorithm (cont’d)
• While the algorithm surely terminates, the number of iterations can be exponential in the input size.
• Besides the trivial k^N upper bound on the number of iterations, more sophisticated studies proved an O(N^{kD}) upper bound, which improves upon the trivial one in scenarios where k and D are small.
• Some recent studies also proved a 2^{Ω(√N)} lower bound on the number of iterations in the worst case.
• Despite the not so promising theoretical results, empirical studies show that, in practice, the algorithm requires far fewer than N iterations and, if properly seeded (see next slides), it features guaranteed accuracy.
• To improve performance, with a limited quality penalty, one could stop the algorithm earlier, e.g., when the value of the objective function decreases by less than a small additive amount.
15
Effective initialization
• The quality of the solution and the speed of convergence of Lloyd's algorithm depend considerably on the choice of the initial set of centers (e.g., consider the previous example).
• Typically, initial centers are chosen at random, but there are more effective ways of selecting them.
• In [AV07], D. Arthur and S. Vassilvitskii developed k-means++, a careful center initialization strategy that yields clusterings not too far from the optimal ones.
16
k-means++
Let P be a set of N points from ℝ^D, and let k > 1 be an integer.
Algorithm k-means++ computes an (initial) set S of k centers for P using the following procedure (note that S ⊆ P):

c1 ← random point chosen from P with uniform probability
S ← {c1}
for i ← 2 to k do
    ci ← random point chosen from P − S, where a point p is chosen with probability (d(p, S))² / Σ_{q∈P−S} (d(q, S))²
    S ← S ∪ {ci}
return S
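A compact Python sketch of the seeding procedure above (names are ours; note that already-chosen centers have distance 0 and hence selection probability 0, which enforces ci ∈ P − S):

```python
import numpy as np

def kmeans_pp(P, k, rng=None):
    """k-means++ seeding: the first center is uniform over P; each subsequent
    center is drawn with probability proportional to (d(p, S))^2."""
    P = np.asarray(P, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    centers = [P[int(rng.integers(len(P)))]]
    d2 = ((P - centers[0]) ** 2).sum(axis=1)    # squared distance to nearest center
    for _ in range(1, k):
        probs = d2 / d2.sum()                   # chosen centers have d2 = 0
        idx = int(rng.choice(len(P), p=probs))
        centers.append(P[idx])
        d2 = np.minimum(d2, ((P - P[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```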
Observation: Although k-means++ already provides a good set of centers for the k-means problem (see next theorem), it can be used to provide the initialization for Lloyd's algorithm, whose iterations can only improve the initial solution.
17
k-means++ (cont’d)
Details on the selection of ci in k-means++

Consider Iteration i of the for loop of k-means++ (i ≥ 2), and let S be the current set of i − 1 already discovered centers.
• Let P − S = {pj : 1 ≤ j ≤ |P − S|} (arbitrary order), and define πj = (d(pj, S))² / (Σ_{q∈P−S} (d(q, S))²). Note that the πj's define a probability distribution, that is, Σ_{j≥1} πj = 1.
• Draw a random number x ∈ [0, 1] and set ci = pr, where r is the unique index between 1 and |P − S| such that

Σ_{j=1}^{r−1} πj < x ≤ Σ_{j=1}^{r} πj.
18
k-means++ (cont’d)
19
k-means++ (cont’d)
Theorem (Arthur-Vassilvitskii’07)
Let Φoptkmeans(k) be the minimum value of Φkmeans(C) over all possible k-clusterings C of P, and let Calg = Partition(P, S) be the k-clustering of P induced by the set S of centers returned by k-means++. Note that Calg is a random variable which depends on the random choices made by the algorithm. Then:

E[Φkmeans(Calg)] ≤ 8(ln k + 2) · Φoptkmeans(k),

where the expectation is over all possible sets S returned by k-means++, which depend on the random choices made by the algorithm.
N.B. We skip the proof which can be found in the original paper.
Remark: In recent years, many k-means algorithms have been devised which attain constant approximation ratios in polynomial time. (References in [C+19].)
20
k-means clustering for big data
How can we compute a "good" k-means clustering of a pointset P that is too large for a single machine?
Main alternatives:
• Distributed (e.g., MapReduce) implementation of Lloyd's algorithm and k-means++.
(The next two exercises explore this alternative.)
• Coreset-based approach, similar in spirit to what was done for k-center. This approach lends itself to efficient distributed (e.g., MapReduce) as well as streaming implementations.
21
Lloyd’s and k-means++ in MapReduce
Do the following two exercises assuming that the algorithms are applied to a set P of N points from ℝ^D, with D constant, and that the target number of clusters is k ∈ (1, √N].
Exercise
Show how to implement one iteration of Lloyd's algorithm in MapReduce using a constant number of rounds, local space ML = O(√N), and aggregate space MA = O(N). Assume that at the beginning of the iteration a set S of k centers and a current estimate Φ of the objective function are available. The iteration must compute the clusters induced by S, the new centers (i.e., the centroids of the clusters), and the value of the objective function for the clusters with the new centers.
Exercise
Show how to implement algorithm k-means++ in MapReduce using O(k) rounds, local space ML = O(√N), and aggregate space MA = O(N).
22
Lloyd’s and k-means++ in MapReduce (cont’d)
Observations:
• In order to achieve good quality solutions, several MapReduce rounds may be required (especially if Lloyd's algorithm is run).
• A parallel variant of k-means++ (dubbed k-means||) has been proposed in [B+12]. In this variant, a set of > k candidate centers is selected in O(log N) parallel rounds, and then k centers are extracted from this set using a weighted version of k-means++.
• Lloyd's algorithm, with centers initialized using k-means||, is implemented in the Spark MLlib.
23
Coreset-based approach for k-means
We use a coreset-based approach similar to the one used for k-center.
Differences with respect to the k-center case:
• The local coresets Ti are computed using a sequential algorithm for k-means.
• Each point of T is given a weight equal to the number of points in P it represents, so as to account for their cumulative contribution to the objective function.
24
Weighted k-means clustering
One can define the following, more general, weighted variant of the k-means clustering problem.
Let P be a set of N points from a metric space (M, d), and let k < N be the target granularity. Suppose that for each point p ∈ P, an integer weight w(p) > 0 is also given. We want to compute the k-clustering C = (C1, C2, ..., Ck; c1, c2, ..., ck) of P which minimizes the following objective function

Φwkmeans(C) = Σ_{i=1}^{k} Σ_{p∈Ci} w(p) · (d(p, ci))²,

that is, the weighted sum of the squared distances of the points from their closest centers.
Observation: the unweighted k-means clustering problem is a special case of the weighted variant where all weights are equal to 1.
25
Weighted k-means clustering (cont’d)
Most of the algorithms known for the standard k-means clustering problem (unit weights) can be adapted to solve the case with general weights.
It is sufficient to regard each point p as representing w(p) identical copies of itself. The same approximation guarantees are maintained.
For example:
Lloyd's algorithm: remains virtually identical except that:
• The centroid of a cluster Ci becomes (Σ_{p∈Ci} w(p) · p) / (Σ_{p∈Ci} w(p))
• The objective function Φwkmeans(C) is used.
k-means++: remains virtually identical except that in the selection of the i-th center ci, the probability for a point p ∈ P − S to be selected becomes

w(p) · (d(p, S))² / Σ_{q∈P−S} (w(q) · (d(q, S))²)
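The two weighted adaptations can be sketched as follows (our own function names; the probability computation scores every point of P, with points already in S getting score 0 because their distance to S is 0):

```python
import numpy as np

def weighted_centroid(points, weights):
    """Weighted Lloyd center of a cluster: (sum_p w(p)*p) / (sum_p w(p))."""
    P = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * P).sum(axis=0) / w.sum()

def weighted_pp_probs(points, S, weights):
    """Weighted k-means++ selection probabilities w(p)*(d(p,S))^2 / normalizer."""
    P = np.asarray(points, dtype=float)
    S = np.asarray(S, dtype=float)
    d2 = ((P[:, None, :] - S[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    scores = np.asarray(weights, dtype=float) * d2
    return scores / scores.sum()
```

For instance, two points at coordinates 0 and 3 with weights 1 and 2 have weighted centroid 2.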
26
Example of weighted centroid
27
Coreset-based MapReduce algorithm for k-means
The MapReduce algorithm for k-means, described in the next slide, uses, as a parameter, a sequential algorithm A for k-means. We call this algorithm MR-kmeans(A).
In principle, one can instantiate A with any sequential algorithm, as long as the following characteristics are met:
• A solves the more general weighted variant, maintaining the same approximation ratio.
• A requires space proportional to the input size (if larger space is needed, the analysis of ML for the MapReduce algorithm must be amended).
28
MR-kmeans(A)
Input Set P of N points from a metric space (M, d), integer k > 1
Output k-clustering C = (C1, C2, ..., Ck; c1, c2, ..., ck) of P which is a good approximation to the k-means problem for P.
Algorithm:
• Round 1: Partition P arbitrarily into ℓ subsets of equal size P1, P2, ..., Pℓ and execute A independently on each Pi (with unit weights). For 1 ≤ i ≤ ℓ, let Ti be the set of k centers computed by A on Pi. For each p ∈ Pi, define its proxy π(p) as its closest center in Ti, and for each q ∈ Ti, define its weight w(q) as the number of points of Pi whose proxy is q.
• Round 2: Gather the coreset T = ∪_{i=1}^{ℓ} Ti of ℓ · k points, together with their weights. Using a single reducer, run A on T, using the given weights, to identify a set S = {c1, c2, ..., ck} of k centers.
• Round 3: Run Partition(P,S) to return the final clustering.
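A sequential simulation of the three rounds may clarify the data flow (a sketch; `seq_kmeans` stands for the generic algorithm A, and the toy farthest-first `toy_A` below is only a stand-in for testing, not an actual α-approximation for weighted k-means):

```python
import numpy as np

def toy_A(points, weights, k):
    """Toy stand-in for the sequential algorithm A: farthest-first traversal,
    ignoring weights. A real A would solve weighted k-means approximately."""
    pts = np.asarray(points, dtype=float)
    cs = [pts[0]]
    for _ in range(1, k):
        d2 = ((pts[:, None, :] - np.asarray(cs)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        cs.append(pts[int(d2.argmax())])
    return np.asarray(cs)

def mr_kmeans_sketch(P, k, ell, seq_kmeans, rng):
    """Sequential simulation of the 3 rounds of MR-kmeans(A)."""
    P = np.asarray(P, dtype=float)
    parts = np.array_split(rng.permutation(P), ell)      # Round 1: arbitrary partition
    coreset, weights = [], []
    for Pi in parts:
        Ti = seq_kmeans(Pi, np.ones(len(Pi)), k)         # local centers
        proxy = ((Pi[:, None, :] - Ti[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j, t in enumerate(Ti):
            coreset.append(t)
            weights.append(int((proxy == j).sum()))      # points represented by t
    T, w = np.asarray(coreset), np.asarray(weights)
    S = seq_kmeans(T, w, k)                              # Round 2: weighted run on T
    labels = ((P[:, None, :] - S[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return S, labels                                     # Round 3: Partition(P, S)
```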
29
Example for Round 1
30
Analysis of MR-kmeans(A)
Assume k ≤ √N. By setting ℓ = √(N/k), it is easy to see (as for MR-FFT) that the 3-round MR-kmeans(A) algorithm can be implemented using
• Local space ML = O(max{N/ℓ, ℓ · k, √N}) = O(√(N · k)) = o(N)
• Aggregate space MA = O(N)
The quality of the solution returned by the algorithm is stated in thefollowing theorem, which is proved in the next few slides.
Theorem
Suppose that A is an α-approximation algorithm for weighted k-means clustering, for some α > 1, and let Calg be the k-clustering returned by MR-kmeans(A) with input P. Then:

Φkmeans(Calg) = O(α² · Φoptkmeans(P, k)).

That is, MR-kmeans(A) is an O(α²)-approximation algorithm.
31
Analysis of MR-kmeans(A) (cont’d)
Definition
Given a pointset P, a coreset T ⊆ P, and a proxy function π : P → T, we say that T is a γ-coreset for P, k, and the k-means objective if

Σ_{p∈P} (d(p, π(p)))² ≤ γ · Φoptkmeans(P, k).

Remark: intuitively, the smaller γ, the better T (together with the proxy function) represents P with respect to the k-means clustering problem, and the more likely it is that a good solution for the problem can be extracted from T, by weighing each point of T with the number of points for which it acts as proxy.
32
33
Analysis of MR-kmeans(A) (cont’d)
The following lemma relates the quality of the solution returned by MR-kmeans(A) to the quality of the coreset T and to the approximation guarantee of A.
Lemma 1
Suppose that the coreset T computed in Round 1 of MR-kmeans(A) is a γ-coreset and that A is an α-approximation algorithm for weighted k-means clustering, for some α > 1. Let Calg be the k-clustering of P returned by MR-kmeans(A). Then:

Φkmeans(Calg) = O(α · (1 + γ) · Φoptkmeans(P, k)).

(We skip the proof. If interested, I can give you the details.)
Remark: The lemma implies that by choosing a γ-coreset with small γ, the penalty in accuracy incurred by the MapReduce algorithm with respect to the sequential algorithm is small. This shows that accurate solutions can be computed even in the big-data scenario.
34
Analysis of MR-kmeans(A) (cont’d)
Lemma 2
If A is an α-approximation algorithm for k-means clustering, then the coreset T computed in Round 1 of MR-kmeans(A) is a γ-coreset with γ = α.
Proof.
Let Sopt be the set of optimal centers for P (i.e., those of the clustering with value of the objective function Φoptkmeans(P, k)). Let also Sopt_i be the set of optimal centers for Pi.
Recall that for every 1 ≤ i ≤ ℓ and for each p ∈ Pi, its proxy π(p) is the point of Ti closest to p, thus d(p, π(p)) = d(p, Ti).
35
Analysis of MR-kmeans(A) (cont’d)
Proof.
Hence, we have:

Σ_{p∈P} (d(p, π(p)))² = Σ_{i=1}^{ℓ} Σ_{p∈Pi} (d(p, Ti))²
 ≤ Σ_{i=1}^{ℓ} α Σ_{p∈Pi} (d(p, Sopt_i))²    (*)
 ≤ α Σ_{i=1}^{ℓ} Σ_{p∈Pi} (d(p, Sopt))²      (**)
 = α Σ_{p∈P} (d(p, Sopt))² = α · Φoptkmeans(P, k).

(*) follows since A is an α-approximation algorithm;
(**) follows since Sopt_i is optimal for Pi, hence its cost on Pi is at most that of Sopt (which is not necessarily optimal for Pi).
36
Analysis of MR-kmeans(A) (cont’d)
Proof of theorem.
The theorem is an immediate consequence of Lemmas 1 and 2.
Observations:
• The constants hidden in the order of magnitude for the approximation factor are not large.
• Two different k-means sequential algorithms can be used in Rounds 1 and 2. For example, one could use k-means++ in Round 1, and k-means++ followed by Lloyd's in Round 2.
• In practice, the algorithm is fast and quite accurate.
• If k′ > k centers are selected from each Pi in Round 1 (e.g., through k-means++), the quality of T, and hence the quality of the final clustering, improves. In fact, for low-dimensional spaces, a γ-coreset T with γ ≪ α can be built in this fashion using a not too large k′.
37
k-median clustering
38
Definition (review) and some observations
Given a pointset P from a metric space (M, d) and an integer k > 0, the k-median clustering problem requires determining a k-clustering C = (C1, C2, ..., Ck; c1, c2, ..., ck) which minimizes:

Φkmedian(C) = Σ_{i=1}^{k} Σ_{a∈Ci} d(a, ci).
Our aim:
• Unlike the case of k-means clustering, for k-median we will focus on arbitrary metric spaces and require that cluster centers belong to the input pointset.
• Traditionally, when the centers of the clusters belong to the input, they are referred to as medoids.
• We will first study the popular PAM sequential algorithm, devised by L. Kaufman and P.J. Rousseeuw in 1987 and based on the local search optimization strategy. Then, we will discuss how to cope with very large inputs.
39
Partitioning Around Medoids (PAM) algorithm
Input Set P of N points from (M, d), integer k > 1
Output: k-clustering C = (C1, C2, ..., Ck; c1, c2, ..., ck) of P with small Φkmedian(C). Centers must belong to P.

S ← {c1, c2, ..., ck} (arbitrary set of k points of P)
C ← Partition(P, S)
stopping-condition ← false
while (!stopping-condition) do
    stopping-condition ← true
    for each (c ∈ S and p ∈ P − S) do
        S′ ← (S − {c}) ∪ {p}
        C′ ← Partition(P, S′)
        if Φkmedian(C′) < Φkmedian(C) then
            stopping-condition ← false
            C ← C′; S ← S′
            exit the for-each loop
return C
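A direct Python sketch of PAM on a precomputed distance matrix (our own representation; medoids are point indices, and the first improving swap is applied, as in the pseudocode above):

```python
import numpy as np
from itertools import product

def kmedian_cost(D, S):
    """Sum over all points of the distance to the closest medoid in S (indices)."""
    return float(D[:, sorted(S)].min(axis=1).sum())

def pam(D, k):
    """PAM local search on an N x N distance matrix D; medoids are row indices."""
    n = len(D)
    S = set(range(k))                        # arbitrary initial medoids
    cost = kmedian_cost(D, S)
    improved = True
    while improved:
        improved = False
        for c, p in product(sorted(S), [q for q in range(n) if q not in S]):
            S2 = (S - {c}) | {p}
            new_cost = kmedian_cost(D, S2)
            if new_cost < cost:              # first improving swap: apply and restart
                S, cost, improved = S2, new_cost, True
                break
    return sorted(S), cost
```

Since the objective strictly decreases at every accepted swap, the loop terminates at a local optimum, illustrating the exercise on the next slide.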
40
PAM algorithm: example
41
PAM algorithm: analysis
Exercise
Show that the PAM algorithm always terminates.
The following theorem, which we do not prove, states the approximation guarantee of the PAM algorithm.
Theorem (Arya+04)
Let Φoptkmedian(P, k) be the minimum value of Φkmedian(C) over all possible k-clusterings C of P, and let Calg be the k-clustering of P returned by the PAM algorithm. Then:

Φkmedian(Calg) ≤ 5 · Φoptkmedian(P, k).
49
Remarks on the PAM algorithm
• The PAM algorithm is very slow in practice, especially for large inputs, since: (1) the local search may require a large number of iterations to converge; and (2) in each iteration up to (N − k) · k swaps may need to be checked, and for each swap a new clustering must be computed.
• The first issue can be tackled by stopping the iterations when the objective function does not decrease significantly. It can be shown that an approximation guarantee similar to the one stated in the theorem can be achieved with a reasonable number of iterations.
• A faster alternative to PAM is an adaptation of Lloyd's algorithm where the center of a cluster C is defined to be the point of C that minimizes the sum of the distances to all other points of C. This center is called a medoid rather than a centroid. This adaptation appears faster than PAM and still very accurate in practice (see [PJ09]).
50
k-median clustering for big data
The solution of the k-median clustering problem on very large pointsets P can be attained using the same coreset-based approach illustrated for the case of k-means.
• Define the weighted variant of the problem, where each point p ∈ P has a weight w(p) and the objective function for a k-clustering C = (C1, C2, ..., Ck; c1, c2, ..., ck) is

Φwkmedian(C) = Σ_{i=1}^{k} Σ_{p∈Ci} w(p) · d(p, ci).

• Let B be a sequential algorithm which solves the weighted variant of k-median using space linear in the input size. (EXERCISE: show how to adapt PAM to solve the weighted variant.)
• We can devise a 3-round MapReduce algorithm for k-median, which uses B as a subroutine. Call this algorithm MR-kmedian(B).
• The description of MR-kmedian(B) is identical to the one of MR-kmeans(A) (see Slide 29) except that A is now replaced by B.
51
Analysis of MR-kmedian(B) algorithm
Assume k ≤ √N. Then, MR-kmedian(B) requires 3 rounds, local space ML = O(√(N · k)), and aggregate space MA = O(N), where N is the input size.
The quality guarantee featured by MR-kmedian(B) is stated in the following theorem (we skip the proof).
Theorem
Suppose that B is an α-approximation algorithm for weighted k-median clustering, for some α > 1, and let Calg be the k-clustering of P returned by MR-kmedian(B). Then:

Φkmedian(Calg) = O(α² · Φoptkmedian(P, k)).

That is, MR-kmedian(B) is an O(α²)-approximation algorithm.
52
Observations
• The same observations made for MR-kmeans (see Slide 37) apply to MR-kmedian.
• In general, the coreset-based strategy which we adopted for the three clustering problems (k-center, k-means, k-median) features various strengths:
• Ability to handle huge inputs.
• Tunable performance-accuracy tradeoffs.
• Very good behaviour in practice.
• In the analysis, we assumed k ≤ √N. By adapting the analysis of Partition(P, S), one can show that the same bounds on the key performance indicators hold for larger values of k.
53
k-center vs k-means vs k-median
Some considerations that may help in selecting the most appropriate clustering problem:
• All 3 problems tend to return spherically-shaped clusters.
• They differ in the center selection strategy, since, given the centers, the best clustering around them is the same for the 3 problems.
• k-center is useful when we want to optimize the distance of every point from its center, hence it minimizes the radii of the cluster spheres. It is also useful for coreset selection.
• k-median is useful for facility-location applications when targeting minimum average distance from the centers. However, it is the most challenging problem from a computational standpoint.
• k-means is useful when centers need not belong to the pointset and minimum average distance is targeted. Optimized implementations are available in many software packages (unlike for k-median).
54
k-center vs k-means vs k-median
The following example shows the different impact that outliers can have on the 3 problems.
For k ≥ 2, the relevance of the outlier x in the 3 problems is different:
• k-center: x must be always selected as center!
• k-means: x must be selected as center if D ≥ 1000!
• k-median: x must be selected as center if D ≥ 1000000!
55
Exercises
Exercise
Let Calg = Partition(P, S) be the k-clustering of P induced by the set S of centers returned by k-means++. Using Markov's inequality, determine a value α > 1 such that

Pr(Φkmeans(Calg) ≤ α · Φoptkmeans(P, k)) ≥ 1/2.

Recall that for a real-valued nonnegative random variable X with expectation E[X] and a value a > 0, Markov's inequality states that

Pr(X ≥ a) ≤ E[X]/a.
56
Exercises
Exercise
(This exercise shows how to make the approximation bound of algorithm k-means++ hold with high probability rather than in expectation.)
Let Calg = Partition(P, S) be the k-clustering of P induced by the set S of centers returned by an execution of k-means++. Suppose that we know that for some value α > 1 the following probability bound holds:

Pr(Φkmeans(Calg) ≤ α · Φoptkmeans(P, k)) ≥ 1/2.

Show that the same approximation ratio, but with probability at least 1 − 1/N, can be ensured by running several independent instances of k-means++, computing for each instance the clustering around the returned centers, and returning at the end the best clustering found (call it Cbest), i.e., the one with the smallest value of the objective function.
57
References
LRU14 J. Leskovec, A. Rajaraman, and J. Ullman. Mining Massive Datasets. Cambridge University Press, 2014. Sections 3.1.1 and 3.5, and Chapter 7.
BHK18 A. Blum, J. Hopcroft, and R. Kannan. Foundations of Data Science. Manuscript, June 2018. Chapter 7.
AV07 D. Arthur, S. Vassilvitskii. k-means++: the advantages of careful seeding. Proc. of ACM-SIAM SODA 2007: 1027-1035.
B+12 B. Bahmani et al. Scalable K-Means++. PVLDB 5(7):622-633, 2012.
C+19 V. Cohen-Addad et al. Tight FPT Approximations for k-Median and k-Means. Proc. of ICALP 2019: 42:1-42:14.
Arya+04 V. Arya et al. Local Search Heuristics for k-Median and Facility Location Problems. SIAM J. Comput. 33(3):544-562, 2004.
PJ09 H.S. Park, C.H. Jun. A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 36(2):3336-3341, 2009.
58
Errata
Changes w.r.t. first draft:
• Slide 21: ”Known alternatives” → ”Main alternatives”.
• Slide 23: modified the reference to Spark (bottom line).
• Slide 25: corrected the sentence after the formula for the objective function.
• Slide 26: p → q in the last formula.
• Slide 27: new slide with an example of weighted coreset.
• Slide 30: new slide with an example for Round 1 of MR-kmeans.
• Slide 31: added some details to analysis of local and aggregate space.
• Slide 33: new slide with clarification of proof idea.
• Slide 34: γ → (1 + γ) in the centered formula.
• Slide 35: removed Aw .
• Slide 40: ”both for each loops” → ”the for each loop”
• Slide 51: added exercise.
• Slides 54,55: new slides.
59