
Clustering (Part 3)

1


Introduction

This set of slides covers two other important center-based clustering problems which were defined earlier. Namely,

• k-means clustering

• k-median clustering

We will

• Review classical, well-established algorithms, and point out their weaknesses when applied in a big-data scenario.

• Show that in both cases, a coreset-based strategy yields effective performance-accuracy tradeoffs.

Observation: Unlike k-center clustering, which aims at a worst-case guarantee on the distance of a point from its cluster's center, both k-means and k-median clustering aim at optimizing average distances.

2


k-means clustering

3


Definition (review) and some observations

Given a pointset P from a metric space (M, d) and an integer k > 0, the k-means clustering problem requires determining a k-clustering C = (C1, C2, . . . , Ck; c1, c2, . . . , ck) which minimizes:

$$\Phi_{kmeans}(\mathcal{C}) = \sum_{i=1}^{k} \sum_{a \in C_i} (d(a, c_i))^2.$$

Our aim:

• Focus on Euclidean spaces (i.e., R^D with the Euclidean distance).

• Focus on the unrestricted version of the problem, where centers need not belong to the input P.

4


Observations

• k-means clustering aims at minimizing cluster variance, and works well for discovering ball-shaped clusters.

• Because of the quadratic dependence on distances, k-means clustering is rather sensitive to outliers (though less than k-center).

• Although some well-established algorithms for k-means clustering have been known for decades, only recently has a rigorous assessment of their performance-accuracy tradeoff been carried out.

5


Properties of Euclidean spaces

Let X = (X1, X2, . . . , XD) and Y = (Y1, Y2, . . . , YD) be two points in R^D. Recall that their Euclidean distance is

$$d(X, Y) = \sqrt{\sum_{i=1}^{D} (X_i - Y_i)^2} \triangleq \|X - Y\|.$$

Definition

The centroid of a set P of N points in R^D is

$$c(P) = \frac{1}{N} \sum_{X \in P} X,$$

where the sum is component-wise.

Observe that c(P) does not necessarily belong to P.

6


Properties of Euclidean spaces (cont’d)

Lemma

The centroid c(P) of a set P ⊂ R^D is the point of R^D which minimizes the sum of the square distances to all points of P.

Proof

Consider an arbitrary point Y ∈ R^D. We have that
$$\sum_{X \in P} (d(X, Y))^2 = \sum_{X \in P} \|X - Y\|^2 = \sum_{X \in P} \|X - c_P + c_P - Y\|^2 = \sum_{X \in P} (X - c_P + c_P - Y) \cdot (X - c_P + c_P - Y),$$
where "·" denotes the inner product.

7


Properties of Euclidean spaces (cont’d)

Proof. (cont’d)

$$\begin{aligned}
\sum_{X \in P} (d(X, Y))^2 &= \sum_{X \in P} (X - c_P + c_P - Y) \cdot (X - c_P + c_P - Y)\\
&= \sum_{X \in P} \big[(X - c_P) \cdot (X - c_P) + (X - c_P) \cdot (c_P - Y) + (c_P - Y) \cdot (X - c_P) + (c_P - Y) \cdot (c_P - Y)\big]\\
&= \sum_{X \in P} \|X - c_P\|^2 + N \|c_P - Y\|^2 + 2 (c_P - Y) \cdot \Big(\sum_{X \in P} (X - c_P)\Big),
\end{aligned}$$

8


Properties of Euclidean spaces (cont’d)

Proof. (cont’d).

By definition of c_P we have that $\sum_{X \in P} (X - c_P) = (\sum_{X \in P} X) - N c_P$ is the all-0's vector, hence
$$\sum_{X \in P} (d(X, Y))^2 = \sum_{X \in P} \|X - c_P\|^2 + N \|c_P - Y\|^2 \ge \sum_{X \in P} \|X - c_P\|^2 = \sum_{X \in P} (d(X, c_P))^2.$$

Observation: The lemma implies that when seeking a k-clustering for points in R^D which minimizes the k-means objective, the best center to select for each cluster is its centroid (assuming that centers need not necessarily belong to the input pointset).
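To make the lemma concrete, here is a small numerical sanity check (a sketch using NumPy; the point set and the alternative center Y are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(100, 3))   # 100 made-up points in R^3
c = P.mean(axis=0)              # centroid: component-wise average

def sum_sq_dist(P, y):
    # sum over X in P of ||X - y||^2
    return np.sum(np.linalg.norm(P - y, axis=1) ** 2)

# Any other point Y should give a sum of squared distances >= that of the centroid.
Y = c + rng.normal(size=3)
assert sum_sq_dist(P, c) <= sum_sq_dist(P, Y)
print(sum_sq_dist(P, c), sum_sq_dist(P, Y))
```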

9


Lloyd’s algorithm

• Also known as the k-means algorithm, it was developed by Stuart Lloyd in 1957 and became one of the most popular algorithms for unsupervised learning.

• In 2006 it was listed as one of the top-10 most influential data mining algorithms.

• It relates to the Expectation-Maximization scheme, since it iterates the following two steps:

• Expectation step: best clustering computed around the given centroids.

• Maximization step: new centroids computed to minimize the average square distance.

10


Lloyd’s algorithm (cont’d)

Input Set P of N points from R^D, integer k > 1

Output k-clustering C = (C1, C2, . . . , Ck; c1, c2, . . . , ck) of P with small Φkmeans(C). Centers need not be in P.

S ← arbitrary set of k points in R^D
Φ ← ∞; stopping-condition ← false
while (!stopping-condition) do
    (C1, C2, . . . , Ck; S) ← Partition(P, S)
    for i ← 1 to k do c′_i ← centroid of C_i
    C ← (C1, C2, . . . , Ck; c′_1, c′_2, . . . , c′_k)
    S ← {c′_1, c′_2, . . . , c′_k}
    if Φkmeans(C) < Φ then Φ ← Φkmeans(C)
    else stopping-condition ← true
return C
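The alternation of the two steps can be seen in the following Python/NumPy sketch (a minimal illustration under stated assumptions, not the exact pseudocode above: random initial centers, unit weights, and a simple "objective no longer decreases" stopping rule):

```python
import numpy as np

def partition(P, centers):
    # Assign each point to its closest center (the Partition primitive).
    dist = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    phi = (dist.min(axis=1) ** 2).sum()   # k-means objective
    return labels, phi

def lloyd(P, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), k, replace=False)]   # arbitrary initial centers
    phi = np.inf
    for _ in range(max_iter):
        labels, new_phi = partition(P, centers)         # "Expectation" step
        if new_phi >= phi:                              # objective no longer decreases: stop
            break
        phi = new_phi
        # "Maximization" step: each new center is the centroid of its cluster.
        centers = np.array([P[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
                            for i in range(k)])
    labels, phi = partition(P, centers)                 # final assignment with the last centers
    return labels, centers, phi
```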

11


Example

12


Lloyd’s algorithm: analysis

Theorem

Lloyd's algorithm always terminates.

Proof.

Let Ct be the k-clustering computed by the algorithm in Iteration t ≥ 1 of the while loop (line 3 of the loop). We have:

• By construction, the objective function values Φkmeans(Ct), for every t except for the last one, form a strictly decreasing sequence. Also, in the last iteration (say Iteration t*), Φkmeans(Ct*) = Φkmeans(Ct*−1), since the invocation of Partition and the subsequent recomputation of the centroids in this iteration cannot increase the value of the objective function.

• Each Ct is uniquely determined by the partition of P into k clusters computed at the beginning of Iteration t.

• By the above two points we conclude that all Ct's, except possibly the last one, are determined by distinct partitions of P.

The theorem follows immediately since there are at most k^N possible partitions of P into ≤ k clusters (counting also permutations of the same partition).

13


Observations on Lloyd’s algorithm

• Lloyd's algorithm may be trapped in a local optimum whose value of the objective function can be much larger than the optimal value. Hence, no guarantee can be proved on its accuracy. Consider the following example and suppose that k = 3.

If initially one center falls among the points on the left side, and two centers fall among the points on the right side, it is impossible for one center to move from right to left, hence the two obvious clusters on the left will be considered as just one cluster.

14


Observations on Lloyd’s algorithm (cont’d)

• While the algorithm surely terminates, the number of iterations can be exponential in the input size.

• Besides the trivial k^N upper bound on the number of iterations, more sophisticated studies proved an O(N^{kD}) upper bound, which improves upon the trivial one in scenarios where k and D are small.

• Some recent studies also proved a 2^{Ω(√N)} lower bound on the number of iterations in the worst case.

• Despite these not so promising theoretical results, empirical studies show that, in practice, the algorithm requires much fewer than N iterations and, if properly seeded (see next slides), it features guaranteed accuracy.

• To improve performance, with a limited quality penalty, one could stop the algorithm earlier, e.g., as soon as the objective function decreases by less than a small additive threshold.

15


Effective initialization

• The quality of the solution and the speed of convergence of Lloyd's algorithm depend considerably on the choice of the initial set of centers (e.g., consider the previous example).

• Typically, initial centers are chosen at random, but there are more effective ways of selecting them.

• In [AV07], D. Arthur and S. Vassilvitskii developed k-means++, a careful center initialization strategy that yields clusterings not too far from the optimal ones.

16


k-means++

Let P be a set of N points from R^D, and let k > 1 be an integer.

Algorithm k-means++ computes an (initial) set S of k centers for P using the following procedure (note that S ⊆ P):

c1 ← random point chosen from P with uniform probability
S ← {c1}
for i ← 2 to k do
    ci ← random point chosen from P − S, where a point p is chosen
         with probability (d(p, S))^2 / Σ_{q∈P−S} (d(q, S))^2
    S ← S ∪ {ci}
return S

Observation: Although k-means++ already provides a good set of centers for the k-means problem (see next theorem), it can be used to provide the initialization for Lloyd's algorithm, whose iterations can only improve the initial solution.

17


k-means++ (cont’d)

Details on the selection of ci in k-means++

Consider Iteration i of the for loop of k-means++ (i ≥ 2).

Let S be the current set of i − 1 already discovered centers.

• Let P − S = {p_j : 1 ≤ j ≤ |P − S|} (arbitrary order), and define π_j = (d(p_j, S))^2 / Σ_{q∈P−S} (d(q, S))^2. Note that the π_j's define a probability distribution, that is, Σ_{j≥1} π_j = 1.

• Draw a random number x ∈ [0, 1] and set c_i = p_r, where r is the unique index between 1 and |P − S| such that
$$\sum_{j=1}^{r-1} \pi_j < x \le \sum_{j=1}^{r} \pi_j.$$
(A code sketch of this selection step follows.)
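The selection step translates almost directly into code. The following sketch (illustrative names; points stored in a NumPy array, centers identified by their indices) draws the next center via the cumulative probabilities π_j:

```python
import numpy as np

def next_center(P, S, rng):
    # P: (N, D) array of points; S: list of indices of the already chosen centers.
    candidates = [j for j in range(len(P)) if j not in S]
    # d(p, S)^2 for every candidate p: squared distance to the closest current center.
    d2 = np.array([min(np.sum((P[j] - P[s]) ** 2) for s in S) for j in candidates])
    pi = d2 / d2.sum()                       # the probabilities pi_j
    x = rng.random()                         # x uniform in [0, 1]
    r = np.searchsorted(np.cumsum(pi), x)    # unique r with sum_{j<r} pi_j < x <= sum_{j<=r} pi_j
    r = min(int(r), len(candidates) - 1)     # guard against floating-point round-off
    return candidates[r]
```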

18


k-means++ (cont’d)

19


k-means++ (cont’d)

Theorem (Arthur-Vassilvitskii’07)

Let Φopt_kmeans(k) be the minimum value of Φkmeans(C) over all possible k-clusterings C of P, and let Calg = Partition(P, S) be the k-clustering of P induced by the set S of centers returned by k-means++. Note that Calg is a random variable which depends on the random choices made by the algorithm. Then:
$$E[\Phi_{kmeans}(\mathcal{C}_{alg})] \le 8 (\ln k + 2) \cdot \Phi^{opt}_{kmeans}(k),$$
where the expectation is over all possible sets S returned by k-means++, which depend on the random choices made by the algorithm.

N.B. We skip the proof which can be found in the original paper.

Remark: In recent years, many k-means algorithms have been devised which attain constant approximation in polynomial time. (References in [C+19].)

20


k-means clustering for big data

How can we compute a "good" k-means clustering of a pointset P that is too large for a single machine?

Main alternatives:

• Distributed (e.g., MapReduce) implementation of Lloyd's algorithm and k-means++.

(The next two exercises explore this alternative.)

• Coreset-based approach, similar in spirit to what was done for k-center. This approach lends itself to efficient distributed (e.g., MapReduce) as well as streaming implementations.

21


Lloyd’s and k-means++ in MapReduce

Do the following two exercises assuming that the algorithms are applied to a set P of N points from R^D, with D constant, and that the target number of clusters is k ∈ (1, √N].

Exercise

Show how to implement one iteration of Lloyd's algorithm in MapReduce using a constant number of rounds, local space ML = O(√N), and aggregate space MA = O(N). Assume that at the beginning of the iteration a set S of k centers and a current estimate Φ of the objective function are available. The iteration must compute the clusters induced by S, the new centers (i.e., the centroids of the clusters), and the value of the objective function for the clusters with the new centers.

Exercise

Show how to implement algorithm k-means++ in MapReduce using O(k) rounds, local space ML = O(√N), and aggregate space MA = O(N).

22


Lloyd’s and k-means++ in MapReduce (cont’d)

Observations:

• In order to achieve good quality solutions, several MapReduce rounds may be required (especially if Lloyd's algorithm is run).

• A parallel variant of k-means++ (dubbed k-means||) has been proposed in [B+12]. In this variant, a set of > k candidate centers is selected in O(log N) parallel rounds, and then k centers are extracted from this set using a weighted version of k-means++.

• Lloyd's algorithm, with centers initialized using k-means||, is implemented in Spark MLlib (see the sketch below).
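As an illustration of the last point, a minimal PySpark usage sketch might look as follows (the toy dataset and parameter values are made up; an active Spark installation is assumed):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-example").getOrCreate()

# Toy dataset: each row carries a feature vector.
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

# initMode="k-means||" selects the parallel k-means++ variant to seed Lloyd's iterations.
kmeans = KMeans(k=2, initMode="k-means||", seed=1)
model = kmeans.fit(df)
print(model.clusterCenters())
```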

23


Coreset-based approach for k-means

We use a coreset-based approach similar to the one used for k-center.

Differences with respect to k-center case:

• The local coresets Ti's are computed using a sequential algorithm for k-means.

• Each point of T is given a weight equal to the number of points in P it represents, so as to account for their cumulative contribution to the objective function.

24


Weighted k-means clustering

One can define the following, more general, weighted variant of the k-means clustering problem.

Let P be a set of N points from a metric space (M, d), and let k < N be the target granularity. Suppose that for each point p ∈ P, an integer weight w(p) > 0 is also given. We want to compute the k-clustering C = (C1, C2, . . . , Ck; c1, c2, . . . , ck) of P which minimizes the following objective function
$$\Phi^{w}_{kmeans}(\mathcal{C}) = \sum_{i=1}^{k} \sum_{p \in C_i} w(p) \cdot (d(p, c_i))^2,$$
that is, the weighted sum of the squared distances of the points from their closest centers.

Observation: the unweighted k-means clustering problem is a special case of the weighted variant where all weights are equal to 1.

25


Weighted k-means clustering (cont’d)

Most of the algorithms known for the standard k-means clustering problem (unit weights) can be adapted to solve the case with general weights.

It is sufficient to regard each point p as representing w(p) identical copies of itself. The same approximation guarantees are maintained.

For example:

Lloyd's algorithm: remains virtually identical except that:

• The centroid of a cluster Ci becomes
$$\frac{1}{\sum_{p \in C_i} w(p)} \sum_{p \in C_i} w(p) \cdot p$$

• The objective function Φ^w_kmeans(C) is used.

k-means++: remains virtually identical except that in the selection of the i-th center ci, the probability for a point p ∈ P − S to be selected becomes
$$\frac{w(p) \cdot (d(p, S))^2}{\sum_{q \in P-S} w(q) \cdot (d(q, S))^2}.$$
(Both adaptations are sketched in code below.)
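A sketch of the two weighted adaptations in Python/NumPy (illustrative helper names; points is an (N, D) array and w an array of positive weights):

```python
import numpy as np

def weighted_centroid(points, w):
    # Weighted average: (1 / sum_p w(p)) * sum_p w(p) * p
    return (w[:, None] * points).sum(axis=0) / w.sum()

def weighted_d2_sampling_probs(points, w, centers):
    # Probability of picking p as the next k-means++ center:
    # w(p) * d(p, S)^2 / sum_q w(q) * d(q, S)^2
    d2 = np.min(np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2) ** 2, axis=1)
    scores = w * d2
    return scores / scores.sum()
```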

26


Example of weighted centroid

27


Coreset-based MapReduce algorithm for k-means

The MapReduce algorithm for k-means, described in the next slide, uses, as a parameter, a sequential algorithm A for k-means. We call this algorithm MR-kmeans(A).

In principle, one can instantiate A with any sequential algorithm, as long as the following characteristics are met:

• A solves the more general weighted variant maintaining the same approximation ratio.

• A requires space proportional to the input size (if larger space is needed, the analysis of ML for the MapReduce algorithm must be amended).

28


MR-kmeans(A)

Input Set P of N points from a metric space (M, d), integer k > 1

Output k-clustering C = (C1, C2, . . . , Ck; c1, c2, . . . , ck) of P which is a good approximation to the k-means problem for P.

Algorithm:

• Round 1: Partition P arbitrarily into ℓ subsets of equal size P1, P2, . . . , Pℓ and execute A independently on each Pi (with unit weights). For 1 ≤ i ≤ ℓ, let Ti be the set of k centers computed by A on Pi. For each p ∈ Pi define its proxy π(p) as its closest center in Ti, and for each q ∈ Ti, define its weight w(q) as the number of points of Pi whose proxy is q.

• Round 2: Gather the coreset T = ∪_{i=1}^{ℓ} Ti of ℓ · k points, together with their weights. Using a single reducer, run A on T, using the given weights, to identify a set S = {c1, c2, . . . , ck} of k centers.

• Round 3: Run Partition(P, S) to return the final clustering (a plain-Python sketch of the three rounds follows).
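To fix ideas, here is a plain-Python sketch of the three rounds (no actual MapReduce framework; seq_kmeans stands for the sequential weighted algorithm A and is an assumed input, not provided here):

```python
import numpy as np

def mr_kmeans(P, k, ell, seq_kmeans, seed=0):
    """Sketch of MR-kmeans(A). P: (N, D) array; seq_kmeans(points, weights, k) -> (k, D) centers."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(P), ell)       # Round 1: arbitrary equal-size partition

    coreset_pts, coreset_w = [], []
    for Pi in parts:
        Ti = seq_kmeans(Pi, np.ones(len(Pi)), k)           # run A on Pi with unit weights
        closest = np.linalg.norm(Pi[:, None] - Ti[None], axis=2).argmin(axis=1)
        for j, center in enumerate(Ti):
            cnt = int(np.sum(closest == j))                # weight = number of points proxied by the center
            if cnt > 0:                                    # keep only centers that proxy at least one point
                coreset_pts.append(center)
                coreset_w.append(cnt)

    T, w = np.array(coreset_pts), np.array(coreset_w, dtype=float)
    S = seq_kmeans(T, w, k)                                # Round 2: weighted run of A on the coreset

    labels = np.linalg.norm(P[:, None] - S[None], axis=2).argmin(axis=1)   # Round 3: Partition(P, S)
    return labels, S
```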

29


Example for Round 1

30


Analysis of MR-kmeans(A)

Assume k ≤ √N. By setting ℓ = √(N/k), it is easy to see (as for MR-FFT) that the 3-round MR-kmeans(A) algorithm can be implemented using

• Local space ML = O(max{N/ℓ, ℓ · k, √N}) = O(√(N · k)) = o(N)

• Aggregate space MA = O(N)

The quality of the solution returned by the algorithm is stated in the following theorem, which is proved in the next few slides.

Theorem

Suppose that A is an α-approximation algorithm for weighted k-means clustering, for some α > 1, and let Calg be the k-clustering returned by MR-kmeans(A) with input P. Then:
$$\Phi_{kmeans}(\mathcal{C}_{alg}) = O\big(\alpha^2 \cdot \Phi^{opt}_{kmeans}(P, k)\big).$$
That is, MR-kmeans(A) is an O(α²)-approximation algorithm.

31


Analysis of MR-kmeans(A) (cont’d)

Definition

Given a pointset P, a coreset T ⊆ P, and a proxy function π : P → T, we say that T is a γ-coreset for P, k, and the k-means objective if
$$\sum_{p \in P} (d(p, \pi(p)))^2 \le \gamma \cdot \Phi^{opt}_{kmeans}(P, k).$$

Remark: intuitively, the smaller γ, the better T (together with the proxy function) represents P with respect to the k-means clustering problem, and the more likely it is that a good solution for the problem can be extracted from T, by weighting each point of T with the number of points for which it acts as proxy.

32


33


Analysis of MR-kmeans(A) (cont’d)

The following lemma relates the quality of the solution returned by MR-kmeans(A) to the quality of the coreset T and to the approximation guarantee of A.

Lemma 1

Suppose that the coreset T computed in Round 1 of MR-kmeans(A) is a γ-coreset and that A is an α-approximation algorithm for weighted k-means clustering, for some α > 1. Let Calg be the k-clustering of P returned by MR-kmeans(A). Then:
$$\Phi_{kmeans}(\mathcal{C}_{alg}) = O\big(\alpha \cdot (1 + \gamma) \cdot \Phi^{opt}_{kmeans}(P, k)\big).$$

(We skip the proof. If interested, I can give you the details.)

Remark: The lemma implies that by choosing a γ-coreset with small γ, the penalty in accuracy incurred by the MapReduce algorithm with respect to the sequential algorithm is small. This shows that accurate solutions can be computed even in the big data scenario.

34


Analysis of MR-kmeans(A) (cont’d)

Lemma 2

If A is an α-approximation algorithm for k-means clustering, then the coreset T computed in Round 1 of MR-kmeans(A) is a γ-coreset with γ = α.

Proof.

Let S^opt be the set of optimal centers for P (i.e., those of the clustering with value of the objective function Φ^opt_kmeans(P, k)). Let also S^opt_i be the set of optimal centers for Pi.

Recall that for every 1 ≤ i ≤ ℓ and for each p ∈ Pi, its proxy π(p) is the point of Ti closest to p, thus d(p, π(p)) = d(p, Ti).

35


Analysis of MR-kmeans(A) (cont’d)

Proof.

Hence, we have:
$$\begin{aligned}
\sum_{p \in P} (d(p, \pi(p)))^2 &= \sum_{i=1}^{\ell} \sum_{p \in P_i} (d(p, T_i))^2\\
&\le \sum_{i=1}^{\ell} \alpha \sum_{p \in P_i} (d(p, S^{opt}_i))^2 \qquad (*)\\
&\le \alpha \sum_{i=1}^{\ell} \sum_{p \in P_i} (d(p, S^{opt}))^2 \qquad (**)\\
&= \alpha \sum_{p \in P} (d(p, S^{opt}))^2 = \alpha \, \Phi^{opt}_{kmeans}(P, k).
\end{aligned}$$

(*) follows since A is an α-approximation algorithm;

(**) follows since S^opt_i is an optimal set of k centers for Pi, hence its cost on Pi is no larger than that of S^opt (which is not necessarily optimal for Pi).

36


Analysis of MR-kmeans(A) (cont’d)

Proof of theorem.

The theorem is an immediate consequence of Lemmas 1 and 2.

Observations:

• The constants hidden in the order of magnitude for the approximation factor are not large.

• Two different sequential k-means algorithms can be used in Rounds 1 and 2. For example, one could use k-means++ in Round 1, and k-means++ followed by Lloyd's in Round 2.

• In practice, the algorithm is fast and quite accurate.

• If k′ > k centers are selected from each Pi in Round 1 (e.g., through k-means++), the quality of T, hence the quality of the final clustering, improves. In fact, for low-dimensional spaces, a γ-coreset T with γ ≪ α can be built in this fashion using a not too large k′.

37


k-median clustering

38


Definition (review) and some observations

Given a pointset P from a metric space (M, d) and an integer k > 0, the k-median clustering problem requires determining a k-clustering C = (C1, C2, . . . , Ck; c1, c2, . . . , ck) which minimizes:

$$\Phi_{kmedian}(\mathcal{C}) = \sum_{i=1}^{k} \sum_{a \in C_i} d(a, c_i).$$

Our aim:

• Unlike the case of k-means clustering, for k-median we will focus on arbitrary metric spaces and require that cluster centers belong to the input pointset.

• Traditionally, when the centers of the clusters belong to the input, they are referred to as medoids.

• We will first study the popular PAM sequential algorithm, devised by L. Kaufman and P.J. Rousseeuw in 1987 and based on the local search optimization strategy. Then, we will discuss how to cope with very large inputs.

39


Partitioning Around Medoids (PAM) algorithm

Input Set P of N points from (M, d), integer k > 1

Output k-clustering C = (C1, C2, . . . , Ck; c1, c2, . . . , ck) of P with small Φkmedian(C). Centers must belong to P.

S ← {c1, c2, . . . , ck} (arbitrary set of k points of P)
C ← Partition(P, S)
stopping-condition ← false
while (!stopping-condition) do
    stopping-condition ← true
    for each (c ∈ S and p ∈ P − S) do
        S′ ← (S − {c}) ∪ {p}
        C′ ← Partition(P, S′)
        if Φkmedian(C′) < Φkmedian(C) then
            stopping-condition ← false
            C ← C′; S ← S′
            exit the for-each loop
return C
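A minimal Python sketch of this local-search scheme (illustrative only; it recomputes the full objective for every candidate swap, exactly as in the pseudocode, so it is slow on large inputs):

```python
import numpy as np

def kmedian_cost(P, center_idx):
    # Sum over all points of the distance to the closest center (k-median objective).
    dist = np.linalg.norm(P[:, None, :] - P[center_idx][None, :, :], axis=2)
    return dist.min(axis=1).sum()

def pam(P, k, seed=0):
    rng = np.random.default_rng(seed)
    S = list(rng.choice(len(P), k, replace=False))   # arbitrary initial medoids (indices into P)
    cost = kmedian_cost(P, S)
    improved = True
    while improved:
        improved = False
        for ci, c in enumerate(S):                   # try every swap (c in S, p not in S)
            for p in range(len(P)):
                if p in S:
                    continue
                S_new = S[:ci] + [p] + S[ci + 1:]
                new_cost = kmedian_cost(P, S_new)
                if new_cost < cost:                  # accept the first improving swap and restart
                    S, cost, improved = S_new, new_cost, True
                    break
            if improved:
                break
    return S, cost
```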

40


PAM algorithm: example

41


PAM algorithm: example

42


PAM algorithm: example

43


PAM algorithm: example

44


PAM algorithm: example

45


PAM algorithm: example

46


PAM algorithm: example

47


PAM algorithm: example

48


PAM algorithm: analysis

Exercise

Show that the PAM algorithm always terminates.

The following theorem, which we do not prove, states the approximation guarantee of the PAM algorithm.

Theorem (Arya+04)

Let Φ^opt_kmedian(P, k) be the minimum value of Φkmedian(C) over all possible k-clusterings C of P, and let Calg be the k-clustering of P returned by the PAM algorithm. Then:
$$\Phi_{kmedian}(\mathcal{C}_{alg}) \le 5 \cdot \Phi^{opt}_{kmedian}(P, k).$$

49


Remarks on the PAM algorithm

• The PAM algorithm is very slow in practice, especially for large inputs, since: (1) the local search may require a large number of iterations to converge; and (2) in each iteration up to (N − k) · k swaps may need to be checked, and for each swap a new clustering must be computed.

• The first issue can be tackled by stopping the iterations when the objective function does not decrease significantly. It can be shown that an approximation guarantee similar to the one stated in the theorem can be achieved with a reasonable number of iterations.

• A faster alternative to PAM is an adaptation of Lloyd's algorithm where the center of a cluster C is defined to be the point of C that minimizes the sum of the distances to all other points of C. This center is called a medoid rather than a centroid. This adaptation appears faster than PAM and still very accurate in practice (see [PJ09]); a sketch of the medoid computation follows.
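A sketch of the medoid update used by that adaptation (illustrative; pairwise distances are recomputed from scratch for each cluster):

```python
import numpy as np

def medoid(cluster):
    # The medoid is the cluster point minimizing the sum of distances to all other cluster points.
    dist = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
    return cluster[dist.sum(axis=1).argmin()]
```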

50


k-median clustering for big data

The solution of the k-median clustering problem on very large pointsets P can be attained using the same coreset-based approach illustrated for the case of k-means.

• Define the weighted variant of the problem, where each point p ∈ P has a weight w(p) and the objective function for a k-clustering C = (C1, C2, . . . , Ck; c1, c2, . . . , ck) is
$$\Phi^{w}_{kmedian}(\mathcal{C}) = \sum_{i=1}^{k} \sum_{p \in C_i} w(p) \cdot d(p, c_i).$$

• Let B be a sequential algorithm which solves the weighted variant of k-median using space linear in the input size. (EXERCISE: show how to adapt PAM to solve the weighted variant.)

• We can devise a 3-round MapReduce algorithm for k-median which uses B as a subroutine. Call this algorithm MR-kmedian(B).

• The description of MR-kmedian(B) is identical to the one of MR-kmeans(A) (see Slide 29), except that A is now replaced by B.

51


Analysis of MR-kmedian(B) algorithm

Assume k ≤ √N. Then, MR-kmedian(B) requires 3 rounds, local space ML = O(√(N · k)), and aggregate space MA = O(N), where N is the input size.

The quality guarantee featured by MR-kmedian(B) is stated in the following theorem (we skip the proof).

Theorem

Suppose that B is an α-approximation algorithm for weighted k-median clustering, for some α > 1, and let Calg be the k-clustering of P returned by MR-kmedian(B). Then:
$$\Phi_{kmedian}(\mathcal{C}_{alg}) = O\big(\alpha^2 \cdot \Phi^{opt}_{kmedian}(P, k)\big).$$
That is, MR-kmedian(B) is an O(α²)-approximation algorithm.

52


Observations

• The same observations made for MR-kmeans (see Slide 37) apply to MR-kmedian.

• In general, the coreset-based strategy which we adopted for the three clustering problems (k-center, k-means, k-median) features various strengths:

• Ability to handle huge inputs.

• Tunable performance-accuracy tradeoffs.

• Very good behaviour in practice.

• In the analysis, we assumed k ≤ √N. By adapting the analysis of Partition(P, S), one can show that the same bounds on the key performance indicators hold for larger values of k.

53


k-center vs k-means vs k-median

Some considerations that may help in selecting the most appropriate clustering problem:

• All 3 problems tend to return spherically-shaped clusters.

• They differ in the center selection strategy since, given the centers, the best clustering around them is the same for the 3 problems.

• k-center is useful when we want to optimize the distance of every point from its center, hence it minimizes the radii of the cluster spheres. It is also useful for coreset selection.

• k-median is useful for facility-location applications when targeting minimum average distance from the centers. However, it is the most challenging problem from a computational standpoint.

• k-means is useful when centers need not belong to the pointset and minimum average (squared) distance is targeted. Optimized implementations are available in many software packages (unlike for k-median).

54


k-center vs k-means vs k-median

The following example shows the different impact that outliers can have on the 3 problems.

For k ≥ 2, the relevance of the outlier x in the 3 problems is different:

• k-center: x must always be selected as a center!

• k-means: x must be selected as a center if D ≥ 1000!

• k-median: x must be selected as a center if D ≥ 1000000!

55


Exercises

Exercise

Let Calg = Partition(P, S) be the k-clustering of P induced by the set S of centers returned by k-means++. Using Markov's inequality, determine a value α > 1 such that
$$\Pr\big(\Phi_{kmeans}(\mathcal{C}_{alg}) \le \alpha \cdot \Phi^{opt}_{kmeans}(P, k)\big) \ge 1/2.$$

Recall that for a real-valued nonnegative random variable X with expectation E[X] and a value a > 0, Markov's inequality states that
$$\Pr(X \ge a) \le \frac{E[X]}{a}.$$

56


Exercises

Exercise

(This exercise shows how to make the approximation bound of algorithm k-means++ hold with high probability rather than only in expectation.)

Let Calg = Partition(P, S) be the k-clustering of P induced by the set S of centers returned by an execution of k-means++. Suppose that we know that for some value α > 1 the following probability bound holds:
$$\Pr\big(\Phi_{kmeans}(\mathcal{C}_{alg}) \le \alpha \cdot \Phi^{opt}_{kmeans}(P, k)\big) \ge 1/2.$$

Show that the same approximation ratio, but with probability at least 1 − 1/N, can be ensured by running several independent instances of k-means++, computing for each instance the clustering around the returned centers, and returning at the end the best clustering found (call it Cbest), i.e., the one with the smallest value of the objective function.

57


References

LRU14 J. Leskovec, A. Rajaraman, and J. Ullman. Mining Massive Datasets. Cambridge University Press, 2014. Sections 3.1.1 and 3.5, and Chapter 7.

BHK18 A. Blum, J. Hopcroft, and R. Kannan. Foundations of Data Science. Manuscript, June 2018. Chapter 7.

AV07 D. Arthur, S. Vassilvitskii. k-means++: the advantages of careful seeding. Proc. of ACM-SIAM SODA 2007: 1027-1035.

B+12 B. Bahmani et al. Scalable K-Means++. PVLDB 5(7):622-633, 2012.

C+19 V. Cohen-Addad et al. Tight FPT Approximations for k-Median and k-Means. Proc. of ICALP 2019: 42:1-42:14.

Arya+04 V. Arya et al. Local Search Heuristics for k-Median and Facility Location Problems. SIAM J. Comput. 33(3):544-562, 2004.

PJ09 H.S. Park, C.H. Jun. A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 36(2):3336-3341, 2009.

58


Errata

Changes w.r.t. first draft:

• Slide 21: ”Known alternatives” → ”Main alternatives”.

• Slide 23: modified the reference to Spark (bottom line).

• Slide 25: corrected the sentence after the formula for the objective function.

• Slide 26: p → q in the last formula.

• Slide 27: new slide with an example of weighted coreset.

• Slide 30: new slide with an example for Round 1 of MR-kmeans.

• Slide 31: added some details to analysis of local and aggregate space.

• Slide 33: new slide with clarification of proof idea.

• Slide 34: γ → (1 + γ) in the centered formula.

• Slide 35: removed Aw .

• Slide 40: ”both for each loops” → ”the for each loop”

• Slide 51: added exercise.

• Slides 54,55: new slides.

59