The Effectiveness of Lloyd-type Methods for the k-means Problem
Chaitanya Swamy, University of Waterloo
Joint work with Rafi Ostrovsky (UCLA), Yuval Rabani (Technion), Leonard Schulman (Caltech)
The k-means Problem
Given: n points in d-dimensional space; X ⊆ ℝd is the point set, with |X| = n.
•partition X into k clusters X1,…, Xk
•assign each point in Xi to a common center ci ∈ ℝd
Goal: Minimize ∑i ∑x∈Xi d(x,ci)², where d is the L2 distance.
[Figure: a point set partitioned into clusters X1, X2, X3 with centers c1, c2, c3.]
k-means (contd.)
•Given the ci's, the best clustering is to assign each point to its nearest center: Xi = {x∈X: ci is the center nearest to x}.
•Given the Xi's, the best choice of centers is to set ci = center of mass of Xi = ctr(Xi) = ∑x∈Xi x / |Xi|.
An optimal solution satisfies both properties. The problem is NP-hard even for k=2 (when n, d are not fixed).
Related Work
The k-means problem dates back to Steinhaus (1956).
a) Approximation algorithms: algorithms with provable guarantees
•PTASs with varying runtime dependence on n, d, k: poly/linear in n, possibly exponential in d and/or k
– Matousek: poly(n), exp(d, k)
– Kumar, Sabharwal & Sen (KSS04): lin(n, d), exp(k)
•O(1)-approximation algorithms for k-median: any point set with any metric, runtime poly(n, d, k); the guarantees also translate to k-means
– Charikar, Guha, Tardos & Shmoys
– Arya et al. + Kanungo et al.: (9+ε)-approximation
b) Heuristics: Lloyd's method, invented in 1957, remains an extremely popular heuristic even today.
1) Start with k initial / "seed" centers c1,…, ck.
2) Iterate the following Lloyd step:
a) Assign each point to its nearest center ci to obtain a clustering X1,…, Xk.
b) Update ci ← ctr(Xi) = ∑x∈Xi x / |Xi|.
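As a concrete illustration, the two-step Lloyd iteration can be sketched in Python (a minimal sketch written for this transcript, not the authors' code; the names `lloyd` and `dist2` are chosen here):

```python
def dist2(x, y):
    """Squared L2 distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def lloyd(points, centers, iters=20):
    """Plain Lloyd iterations: (a) assign each point to its nearest
    center, (b) move each center to the mean of its cluster."""
    k = len(centers)
    for _ in range(iters):
        # Lloyd step (a): assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k), key=lambda j: dist2(x, centers[j]))
            clusters[i].append(x)
        # Lloyd step (b): update each center to its cluster's center of mass.
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if its cluster went empty
                centers[i] = tuple(sum(coord) / len(cl) for coord in zip(*cl))
    return centers
```

Seeds can be chosen e.g. via `random.sample(points, k)`; as the following slides stress, the quality of the final solution depends heavily on that choice.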
Lloyd's method: What's known?
•Some bounds on the number of iterations of Lloyd-type methods: Inaba–Katoh–Imai; Har-Peled–Sadri; Arthur–Vassilvitskii ('06).
•Performance is very sensitive to the choice of seed centers; there is a large literature on finding "good" seeding methods for Lloyd.
•But there is almost no analysis proving performance guarantees on the quality of the final solution for arbitrary k and dimension.
Our Goal: analyze Lloyd and prove rigorous performance guarantees for Lloyd-type methods.
Our Results
•Introduce a clusterability or separation condition.
•Give a novel, efficient sampling process for seeding Lloyd's method with initial centers.
•Show that if the data satisfies our clusterability condition:
– seeding + 1 Lloyd step yields a constant-factor approximation in time linear in n and d, polynomial in k: potentially faster than Lloyd variants that require multiple reseedings;
– seeding + KSS04-sampling gives a PTAS; the algorithm is faster and simpler than the PTAS in KSS04.
Main Theorem: If the data has a "meaningful k-clustering", then there is a simple, efficient seeding method such that Lloyd-type methods return a near-optimal solution.
“Meaningful k-Clustering”
Settings where one would NOT consider the data to possess a meaningful k-clustering:
1) If near-optimum cost can be achieved by two very distinct k-partitions of the data, then the identity of an optimal k-partition carries little meaning; it provides an ambiguous classification.
2) If the cost of the best k-clustering ≈ the cost of the best (k−1)-clustering, then a k-clustering yields only a marginal benefit over the best (k−1)-clustering; one should use a smaller value of k here. (Example: k=3.)
We formalize 2). Let Δk²(X) = cost of the best k-clustering of X.
X is ε-separated for k-means iff Δk²(X) / Δk−1²(X) ≤ ε².
•Simple condition. The drop in k-clustering cost is already used by practitioners to choose the right k.
•Can show that (roughly), X is ε-separated for k-means ⇔ any two low-cost k-clusterings disagree on only a small fraction of the data.
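To make the condition concrete, here is a brute-force check of the ratio Δk²(X)/Δk−1²(X) on a toy input (an illustration written for this transcript; `delta2` enumerates all assignments, so it is only viable for tiny n):

```python
from itertools import product

def dist2(x, y):
    """Squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def cluster_cost(cluster):
    """Cost of one cluster: squared distances to its center of mass."""
    c = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
    return sum(dist2(x, c) for x in cluster)

def delta2(points, k):
    """Delta_k^2(X): cost of the best k-clustering, by brute force
    over all k^n label assignments (exponential; toy inputs only)."""
    best = float("inf")
    for labels in product(range(k), repeat=len(points)):
        clusters = [[x for x, l in zip(points, labels) if l == j]
                    for j in range(k)]
        if any(not cl for cl in clusters):
            continue  # require all k clusters nonempty
        best = min(best, sum(cluster_cost(cl) for cl in clusters))
    return best
```

For two tight, far-apart groups, e.g. `[(0,0), (0.1,0), (10,0), (10.1,0)]`, the ratio `delta2(pts, 2) / delta2(pts, 1)` is tiny, so such data is ε-separated for 2-means with small ε.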
Some basic facts
Fact: ∀p∈ℝd, ∑x∈X d(x, p)² = Δ1²(X) + n·d(p,c)², where n = |X| and c = ctr(X).
Also, ∑{x,y}⊆X d(x,y)² = n·Δ1²(X).
[Write d(x, p)² = (x−c + c−p)ᵀ(x−c + c−p) and expand.]
Lemma: Let X = X1 ∪ X2 be a partition of X with ci = ctr(Xi). Then
Δ1²(X) = Δ1²(X1) + Δ1²(X2) + (|X1|·|X2| / n)·d(c1,c2)².
Proof: With n = |X| and c = ctr(X) = (|X1|·c1 + |X2|·c2) / n,
Δ1²(X) = ∑x∈X1 d(x,c)² + ∑x∈X2 d(x,c)²
= (Δ1²(X1) + |X1|·d(c1,c)²) + (Δ1²(X2) + |X2|·d(c2,c)²)
= Δ1²(X1) + Δ1²(X2) + (|X1|·|X2| / n)·d(c1,c2)².
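Both facts and the partition lemma are easy to sanity-check numerically; the following sketch (helper names `ctr` and `delta1` chosen here) verifies them on random data:

```python
import random

def dist2(x, y):
    """Squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def ctr(S):
    """Center of mass of a point set."""
    return tuple(sum(coord) / len(S) for coord in zip(*S))

def delta1(S):
    """Delta_1^2(S): sum of squared distances to the center of mass."""
    c = ctr(S)
    return sum(dist2(x, c) for x in S)

random.seed(0)
X = [tuple(random.random() for _ in range(3)) for _ in range(20)]
n, c = len(X), ctr(X)

# Fact: sum_x d(x,p)^2 = Delta_1^2(X) + n * d(p,c)^2, for any point p.
p = (2.0, -1.0, 0.5)
lhs = sum(dist2(x, p) for x in X)
assert abs(lhs - (delta1(X) + n * dist2(p, c))) < 1e-9

# Fact: sum over unordered pairs of d(x,y)^2 = n * Delta_1^2(X).
pairs = sum(dist2(X[i], X[j]) for i in range(n) for j in range(i + 1, n))
assert abs(pairs - n * delta1(X)) < 1e-9

# Lemma: splitting X into X1, X2 with centers c1, c2.
X1, X2 = X[:7], X[7:]
c1, c2 = ctr(X1), ctr(X2)
rhs = delta1(X1) + delta1(X2) + len(X1) * len(X2) / n * dist2(c1, c2)
assert abs(delta1(X) - rhs) < 1e-9
```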
The 2-means problem (k=2)
X*1, X*2: optimal clusters; c*i = ctr(X*i), D* = d(c*1, c*2), ni = |X*i|,
(r*i)² = ∑x∈X*i d(x, c*i)² / ni = Δ1²(X*i) / ni = avg. squared distance in cluster X*i.
Suppose X is ε-separated for 2-means.
Lemma: For i = 1, 2, (r*i)² ≤ (ε²/(1−ε²))·D*².
Proof: Δ2²(X) / ε² ≤ Δ1²(X) = Δ2²(X) + (n1·n2 / n)·D*².
[Figure: clusters X*1, X*2 with centers c*1, c*2 at distance D* and radii r*1, r*2.]
The 2-means algorithm
1) Sampling-based seeding procedure:
– Pick two seed centers c1, c2 by randomly picking the pair x, y ∈ X with probability ∝ d(x,y)².
2) Lloyd step, or the simpler "ball k-means step":
– For each ci, let Bi = {x∈X: d(x,ci) ≤ d(c1,c2)/3}.
– Update ci ← ctr(Bi); return these as the final centers.
Sampling can be implemented in O(nd) time, so the entire algorithm runs in O(nd) time.
[Figure: clusters X*1, X*2 with centers c*1, c*2 at distance D*.]
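A direct, if naive, rendering of the two steps (the pair sampling is done explicitly in O(n²) for clarity, not with the O(nd) implementation mentioned on the slide; `two_means` is a name chosen here):

```python
import random

def dist2(x, y):
    """Squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def ctr(S):
    """Center of mass of a point set."""
    return tuple(sum(coord) / len(S) for coord in zip(*S))

def two_means(X, rng=random):
    """Seed by picking a pair (x, y) with probability proportional to
    d(x,y)^2, then do one ball k-means step."""
    pairs = [(x, y) for i, x in enumerate(X) for y in X[i + 1:]]
    weights = [dist2(x, y) for x, y in pairs]
    c1, c2 = rng.choices(pairs, weights=weights)[0]
    # Ball k-means step: recenter on the points within d(c1,c2)/3
    # of each seed (compare squared distances, so divide by 9).
    r2 = dist2(c1, c2) / 9.0
    B1 = [x for x in X if dist2(x, c1) <= r2]
    B2 = [x for x in X if dist2(x, c2) <= r2]
    return ctr(B1), ctr(B2)
```

The `rng` parameter defaults to the `random` module; passing a seeded `random.Random` makes runs reproducible.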
2-means: Analysis
Let core(X*i) = {x∈X*i : d(x, c*i)² ≤ (r*i)² / ρ}, where ρ = Θ(ε²) < 1.
Seeding lemma: With probability 1−O(ρ), c1, c2 lie in the cores of X*1, X*2.
Proof: |core(X*i)| ≥ (1−ρ)·ni for i = 1, 2 (by Markov's inequality).
Let A = ∑x∈core(X*1), y∈core(X*2) d(x,y)² ≈ (1−ρ)²·n1·n2·D*².
B = ∑{x,y}⊆X d(x,y)² = n·Δ1²(X) ≈ n1·n2·D*².
Probability = A / B ≈ (1−ρ)² = 1−O(ρ).
[Figure: cores of X*1, X*2 containing the seeds c1, c2.]
2-means analysis (contd.)
Recall that Bi = {x∈X: d(x,ci) ≤ d(c1,c2)/3}.
Ball-k-means lemma: For i = 1, 2, core(X*i) ⊆ Bi ⊆ X*i. Therefore d(ctr(Bi), c*i)² ≤ ρ·(r*i)² / (1−ρ).
Intuitively, since Bi ⊆ X*i and Bi contains almost all of the mass of X*i, ctr(Bi) must be close to ctr(X*i) = c*i.
[Figure: balls B1, B2 around the seeds c1, c2 inside clusters X*1, X*2.]
2-means analysis (contd.)
Recall that Bi = {x∈X: d(x,ci) ≤ d(c1,c2)/3}.
Ball-k-means lemma: For i = 1, 2, core(X*i) ⊆ Bi ⊆ X*i. Therefore d(ctr(Bi), c*i)² ≤ ρ·(r*i)² / (1−ρ).
Proof: Δ1²(X*i) ≥ (|Bi|·|X*i \ Bi| / ni)·d(ctr(Bi), ctr(X*i \ Bi))².
Also d(ctr(Bi), c*i) = (|X*i \ Bi| / ni)·d(ctr(Bi), ctr(X*i \ Bi)), so
ni·(r*i)² ≥ (ni·|Bi| / |X*i \ Bi|)·d(ctr(Bi), c*i)².
[Figure: balls B1, B2 inside clusters X*1, X*2.]
2-means analysis (contd.)
Theorem: With probability 1−O(ρ), the cost of the final clustering is at most Δ2²(X) / (1−ρ), i.e., we get a (1/(1−ρ))-approximation algorithm.
Since ρ = O(ε²), we have:
•approximation ratio → 1 as ε → 0;
•probability of success → 1 as ε → 0.
Arbitrary k
The algorithm and analysis follow the same outline as for 2-means. If X is ε-separated for k-means, one can again show that all clusters are well separated, that is, cluster radius << inter-cluster distance: r*i = O(ε)·d(c*i, c*j) ∀ i, j.
1) Seeding stage: choose k initial centers and ensure that they lie in the "cores" of the k optimal clusters.
– exploits the fact that the clusters are well separated
– after the seeding stage, each optimal center has a distinct seed center very "near" it
2) Now one can run either a Lloyd step or a ball-k-means step.
Theorem: If X is ε-separated for k-means, then one can obtain an α(ε)-approximation algorithm, where α(ε) → 1 as ε → 0.
Schematic of entire algorithm
Simple sampling: Pick k centers as follows.
– first pick 2 centers c1, c2 as in 2-means
– to pick center ci+1, pick x∈X with probability ∝ minj≤i d(x,cj)²
Simple sampling alone: success probability (1−O(ρ))^k, i.e., exponentially small in k.
Greedy deletion: Start with n centers and keep deleting the center that causes the least cost increase until k centers remain; runs in O(n³d) time.
Oversampling + deletion: sample O(k) centers, then greedily delete until k remain; constant success probability, O(nkd + k³d) time.
From k well-placed seeds:
– Ball k-means or Lloyd step: gives an O(1)-approximation.
– KSS04-sampling: gives a PTAS.
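The simple-sampling stage can be sketched as follows (a naive illustration; both the pair sampling and the min-distance sampling are done in quadratic time here, not the optimized versions, and `simple_sampling` is a name chosen for this sketch):

```python
import random

def dist2(x, y):
    """Squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def simple_sampling(X, k, rng=random):
    """Pick k seeds: the first two as a pair with probability
    proportional to d(x,y)^2, then each subsequent point with
    probability proportional to its squared distance to the
    nearest seed chosen so far."""
    pairs = [(x, y) for i, x in enumerate(X) for y in X[i + 1:]]
    c1, c2 = rng.choices(pairs, weights=[dist2(x, y) for x, y in pairs])[0]
    centers = [c1, c2]
    while len(centers) < k:
        # Weight each point by min_j d(x, c_j)^2 over current seeds.
        w = [min(dist2(x, c) for c in centers) for x in X]
        centers.append(rng.choices(X, weights=w)[0])
    return centers
```

On well-separated data this tends to place one seed per optimal cluster, which is exactly what the analysis on the next slides quantifies.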
Simple sampling: analysis sketch
X*1,…, X*k: optimal clusters; c*i = ctr(X*i), ni = |X*i|, (r*i)² = ∑x∈X*i d(x,c*i)² / ni = Δ1²(X*i) / ni.
core(X*i) = {x∈X*i : d(x,c*i)² ≤ (r*i)² / ρ}, where ρ = Θ(ε^(2/3)).
Lemma: With probability (1−O(ρ))^k, all sampled centers lie in the cores of distinct optimal clusters.
Proof: We show inductively that if c1,…, ci lie in distinct cores, then with probability 1−O(ρ), so does center ci+1.
Base case: X is ε-separated for k-means ⇒ X*i ∪ X*j is ε-separated for 2-means for every i ≠ j (because merging two clusters causes a huge increase in cost). So by the 2-means analysis, the first two centers c1, c2 lie in distinct cores.
Simple sampling: analysis (contd.)
Inductive step: Assume c1,…, ci lie in cores of X*1,
…, X*i
Let C = {c1,…, ci}.A = ∑j ≥ i+1 ∑xcore(X*j) d(x,C)2 ≈ ∑j ≥ i+1 (1-)nj d(c*
j,C)2
B = ∑j ≤ k, xX*j d(x,C)2 ≈ ∑j ≤ i 1
2(X*j) + ∑j ≥ i+1(1
2(X*j) + nj d(c*
j,C)2)
X*2X*
1
core(X*1
)
c*1
c1
X*i
core(X*i)
c*i
ci
c*i+1
X*i+
1c*
k
X*k
Simple sampling: analysis (contd.)
Inductive step: Assume c1,…, ci lie in the cores of X*1,…, X*i. Let C = {c1,…, ci}.
A = ∑j≥i+1 ∑x∈core(X*j) d(x,C)² ≈ ∑j≥i+1 (1−ρ)·nj·d(c*j,C)²
B = ∑j≤k ∑x∈X*j d(x,C)² ≈ ∑j≤i Δ1²(X*j) + ∑j≥i+1 (Δ1²(X*j) + nj·d(c*j,C)²) ≈ ∑j≥i+1 nj·d(c*j,C)²
Probability = A / B = 1−O(ρ).
[Figure: clusters X*1,…, X*k with cores; seeds c1,…, ci placed, cluster X*i+1 not yet hit.]
Open Questions
•Deeper analysis of Lloyd: are there weaker conditions under which one can prove performance guarantees for Lloyd-type methods?
•A PTAS for k-means with polynomial runtime dependence on n, k and d? Is the problem APX-hard in the geometric setting?
•A PTAS for k-means under our separation condition?
•Other applications of the separation condition?
Thank You.