The Effectiveness of Lloyd-type Methods for the k-means Problem
Chaitanya Swamy, University of Waterloo
Joint work with Rafi Ostrovsky (UCLA), Yuval Rabani (Technion), Leonard Schulman (Caltech)
The k-means Problem
Given: n points in d-dimensional space; X ⊆ ℝd is the point set, with |X| = n.
•partition X into k clusters X1,…, Xk
•assign each point in Xi to a common center ci ∈ ℝd
Goal: Minimize ∑i ∑x∈Xi d(x,ci)², where d is the L2 distance.
[Figure: a point set partitioned into clusters X1, X2, X3 with centers c1, c2, c3.]
k-means (contd.)
•Given the ci's, the best clustering is to assign each point to its nearest center: Xi = {x∈X: ci is the center nearest to x}.
•Given the Xi's, the best choice of centers is to set ci = center of mass of Xi = ctr(Xi) = ∑x∈Xi x / |Xi|.
An optimal solution satisfies both properties. The problem is NP-hard even for k=2 (when n, d are not fixed).
Related Work
The k-means problem dates back to Steinhaus (1956).
a) Approximation algorithms: algorithms with provable guarantees
•PTASs with varying runtime dependence on n, d, k: poly/linear in n, possibly exponential in d and/or k
– Matousek: poly(n), exp(d, k)
– Kumar, Sabharwal & Sen (KSS04): lin(n, d), exp(k)
•O(1)-approximation algorithms for k-median: any point set with any metric, runtime poly(n, d, k); the guarantees also translate to k-means
– Charikar, Guha, Tardos & Shmoys
– Arya et al. + Kanungo et al.: (9+ε)-approximation
b) Heuristics: Lloyd's method, invented in 1957, remains an extremely popular heuristic even today.
1) Start with k initial / "seed" centers c1,…, ck.
2) Iterate the following Lloyd step:
a) Assign each point to its nearest center ci to obtain a clustering X1,…, Xk.
b) Update ci ← ctr(Xi) = ∑x∈Xi x / |Xi|.
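As a concrete illustration, the two-step Lloyd iteration can be sketched in Python (a minimal sketch written for this transcript, not the authors' code; the names `lloyd` and `dist2` are chosen here):

```python
def dist2(x, y):
    """Squared L2 distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def lloyd(points, centers, iters=20):
    """Plain Lloyd iterations: (a) assign each point to its nearest
    center, (b) move each center to the mean of its cluster."""
    k = len(centers)
    for _ in range(iters):
        # Lloyd step (a): assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k), key=lambda j: dist2(x, centers[j]))
            clusters[i].append(x)
        # Lloyd step (b): update each center to its cluster's center of mass.
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if its cluster went empty
                centers[i] = tuple(sum(coord) / len(cl) for coord in zip(*cl))
    return centers
```

Seeds can be chosen e.g. via `random.sample(points, k)`; as the following slides stress, the quality of the final solution depends heavily on that choice.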
Lloyd's method: What's known?
•Some bounds on the number of iterations of Lloyd-type methods: Inaba–Katoh–Imai; Har-Peled–Sadri; Arthur–Vassilvitskii ('06).
•Performance is very sensitive to the choice of seed centers; there is a large literature on finding "good" seeding methods for Lloyd.
•But there is almost no analysis proving performance guarantees on the quality of the final solution for arbitrary k and dimension.
Our Goal: analyze Lloyd and prove rigorous performance guarantees for Lloyd-type methods.
Our Results
•Introduce a clusterability or separation condition.
•Give a novel, efficient sampling process for seeding Lloyd's method with initial centers.
•Show that if the data satisfies our clusterability condition:
– seeding + 1 Lloyd step yields a constant-factor approximation in time linear in n and d, polynomial in k: potentially faster than Lloyd variants that require multiple reseedings;
– seeding + KSS04-sampling gives a PTAS; the algorithm is faster and simpler than the PTAS in KSS04.
Main Theorem: If the data has a "meaningful k-clustering", then there is a simple, efficient seeding method such that Lloyd-type methods return a near-optimal solution.
“Meaningful k-Clustering”
Settings where one would NOT consider the data to possess a meaningful k-clustering:
1) If near-optimum cost can be achieved by two very distinct k-partitions of the data, then the identity of an optimal k-partition carries little meaning; it provides an ambiguous classification.
2) If the cost of the best k-clustering ≈ the cost of the best (k−1)-clustering, then a k-clustering yields only a marginal benefit over the best (k−1)-clustering; one should use a smaller value of k here. (Example: k=3.)
We formalize 2). Let Δk²(X) = cost of the best k-clustering of X.
X is ε-separated for k-means iff Δk²(X) / Δk−1²(X) ≤ ε².
•Simple condition. The drop in k-clustering cost is already used by practitioners to choose the right k.
•Can show that (roughly), X is ε-separated for k-means ⇔ any two low-cost k-clusterings disagree on only a small fraction of the data.
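To make the condition concrete, here is a brute-force check of the ratio Δk²(X)/Δk−1²(X) on a toy input (an illustration written for this transcript; `delta2` enumerates all assignments, so it is only viable for tiny n):

```python
from itertools import product

def dist2(x, y):
    """Squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def cluster_cost(cluster):
    """Cost of one cluster: squared distances to its center of mass."""
    c = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
    return sum(dist2(x, c) for x in cluster)

def delta2(points, k):
    """Delta_k^2(X): cost of the best k-clustering, by brute force
    over all k^n label assignments (exponential; toy inputs only)."""
    best = float("inf")
    for labels in product(range(k), repeat=len(points)):
        clusters = [[x for x, l in zip(points, labels) if l == j]
                    for j in range(k)]
        if any(not cl for cl in clusters):
            continue  # require all k clusters nonempty
        best = min(best, sum(cluster_cost(cl) for cl in clusters))
    return best
```

For two tight, far-apart groups, e.g. `[(0,0), (0.1,0), (10,0), (10.1,0)]`, the ratio `delta2(pts, 2) / delta2(pts, 1)` is tiny, so such data is ε-separated for 2-means with small ε.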
Some basic facts
Fact: ∀p∈ℝd, ∑x∈X d(x, p)² = Δ1²(X) + n·d(p,c)², where n = |X| and c = ctr(X).
Also, ∑{x,y}⊆X d(x,y)² = n·Δ1²(X).
[Write d(x, p)² = (x−c + c−p)ᵀ(x−c + c−p) and expand.]
Lemma: Let X = X1 ∪ X2 be a partition of X with ci = ctr(Xi). Then
Δ1²(X) = Δ1²(X1) + Δ1²(X2) + (|X1|·|X2| / n)·d(c1,c2)².
Proof: With n = |X| and c = ctr(X) = (|X1|·c1 + |X2|·c2) / n,
Δ1²(X) = ∑x∈X1 d(x,c)² + ∑x∈X2 d(x,c)²
= (Δ1²(X1) + |X1|·d(c1,c)²) + (Δ1²(X2) + |X2|·d(c2,c)²)
= Δ1²(X1) + Δ1²(X2) + (|X1|·|X2| / n)·d(c1,c2)².
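Both facts and the partition lemma are easy to sanity-check numerically; the following sketch (helper names `ctr` and `delta1` chosen here) verifies them on random data:

```python
import random

def dist2(x, y):
    """Squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def ctr(S):
    """Center of mass of a point set."""
    return tuple(sum(coord) / len(S) for coord in zip(*S))

def delta1(S):
    """Delta_1^2(S): sum of squared distances to the center of mass."""
    c = ctr(S)
    return sum(dist2(x, c) for x in S)

random.seed(0)
X = [tuple(random.random() for _ in range(3)) for _ in range(20)]
n, c = len(X), ctr(X)

# Fact: sum_x d(x,p)^2 = Delta_1^2(X) + n * d(p,c)^2, for any point p.
p = (2.0, -1.0, 0.5)
lhs = sum(dist2(x, p) for x in X)
assert abs(lhs - (delta1(X) + n * dist2(p, c))) < 1e-9

# Fact: sum over unordered pairs of d(x,y)^2 = n * Delta_1^2(X).
pairs = sum(dist2(X[i], X[j]) for i in range(n) for j in range(i + 1, n))
assert abs(pairs - n * delta1(X)) < 1e-9

# Lemma: splitting X into X1, X2 with centers c1, c2.
X1, X2 = X[:7], X[7:]
c1, c2 = ctr(X1), ctr(X2)
rhs = delta1(X1) + delta1(X2) + len(X1) * len(X2) / n * dist2(c1, c2)
assert abs(delta1(X) - rhs) < 1e-9
```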
The 2-means problem (k=2)
X*1, X*2: optimal clusters; c*i = ctr(X*i), D* = d(c*1, c*2), ni = |X*i|,
(r*i)² = ∑x∈X*i d(x, c*i)² / ni = Δ1²(X*i) / ni = avg. squared distance in cluster X*i.
Suppose X is ε-separated for 2-means.
Lemma: For i = 1, 2, (r*i)² ≤ (ε²/(1−ε²))·D*².
Proof: Δ2²(X) / ε² ≤ Δ1²(X) = Δ2²(X) + (n1·n2 / n)·D*².
[Figure: clusters X*1, X*2 with centers c*1, c*2 at distance D* and radii r*1, r*2.]
The 2-means algorithm
1) Sampling-based seeding procedure:
– Pick two seed centers c1, c2 by randomly picking the pair x, y ∈ X with probability ∝ d(x,y)².
2) Lloyd step, or the simpler "ball k-means step":
– For each ci, let Bi = {x∈X: d(x,ci) ≤ d(c1,c2)/3}.
– Update ci ← ctr(Bi); return these as the final centers.
Sampling can be implemented in O(nd) time, so the entire algorithm runs in O(nd) time.
[Figure: clusters X*1, X*2 with centers c*1, c*2 at distance D*.]
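A direct, if naive, rendering of the two steps (the pair sampling is done explicitly in O(n²) for clarity, not with the O(nd) implementation mentioned on the slide; `two_means` is a name chosen here):

```python
import random

def dist2(x, y):
    """Squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def ctr(S):
    """Center of mass of a point set."""
    return tuple(sum(coord) / len(S) for coord in zip(*S))

def two_means(X, rng=random):
    """Seed by picking a pair (x, y) with probability proportional to
    d(x,y)^2, then do one ball k-means step."""
    pairs = [(x, y) for i, x in enumerate(X) for y in X[i + 1:]]
    weights = [dist2(x, y) for x, y in pairs]
    c1, c2 = rng.choices(pairs, weights=weights)[0]
    # Ball k-means step: recenter on the points within d(c1,c2)/3
    # of each seed (compare squared distances, so divide by 9).
    r2 = dist2(c1, c2) / 9.0
    B1 = [x for x in X if dist2(x, c1) <= r2]
    B2 = [x for x in X if dist2(x, c2) <= r2]
    return ctr(B1), ctr(B2)
```

The `rng` parameter defaults to the `random` module; passing a seeded `random.Random` makes runs reproducible.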
2-means: Analysis
Let core(X*i) = {x∈X*i : d(x, c*i)² ≤ (r*i)² / ρ}, where ρ = Θ(ε²) < 1.
Seeding lemma: With probability 1−O(ρ), c1, c2 lie in the cores of X*1, X*2.
Proof: |core(X*i)| ≥ (1−ρ)·ni for i = 1, 2 (by Markov's inequality).
Let A = ∑x∈core(X*1), y∈core(X*2) d(x,y)² ≈ (1−ρ)²·n1·n2·D*².
B = ∑{x,y}⊆X d(x,y)² = n·Δ1²(X) ≈ n1·n2·D*².
Probability = A / B ≈ (1−ρ)² = 1−O(ρ).
[Figure: cores of X*1, X*2 containing the seeds c1, c2.]
2-means analysis (contd.)
Recall that Bi = {x∈X: d(x,ci) ≤ d(c1,c2)/3}.
Ball-k-means lemma: For i = 1, 2, core(X*i) ⊆ Bi ⊆ X*i. Therefore d(ctr(Bi), c*i)² ≤ ρ·(r*i)² / (1−ρ).
Intuitively, since Bi ⊆ X*i and Bi contains almost all of the mass of X*i, ctr(Bi) must be close to ctr(X*i) = c*i.
[Figure: balls B1, B2 around the seeds c1, c2 inside clusters X*1, X*2.]
2-means analysis (contd.)
Recall that Bi = {x∈X: d(x,ci) ≤ d(c1,c2)/3}.
Ball-k-means lemma: For i = 1, 2, core(X*i) ⊆ Bi ⊆ X*i. Therefore d(ctr(Bi), c*i)² ≤ ρ·(r*i)² / (1−ρ).
Proof: Δ1²(X*i) ≥ (|Bi|·|X*i \ Bi| / ni)·d(ctr(Bi), ctr(X*i \ Bi))².
Also d(ctr(Bi), c*i) = (|X*i \ Bi| / ni)·d(ctr(Bi), ctr(X*i \ Bi)), so
ni·(r*i)² ≥ (ni·|Bi| / |X*i \ Bi|)·d(ctr(Bi), c*i)².
[Figure: balls B1, B2 inside clusters X*1, X*2.]
2-means analysis (contd.)
Theorem: With probability 1−O(ρ), the cost of the final clustering is at most Δ2²(X) / (1−ρ), i.e., we get a (1/(1−ρ))-approximation algorithm.
Since ρ = O(ε²), we have:
•approximation ratio → 1 as ε → 0;
•probability of success → 1 as ε → 0.
Arbitrary k
The algorithm and analysis follow the same outline as for 2-means. If X is ε-separated for k-means, one can again show that all clusters are well separated, that is, cluster radius << inter-cluster distance: r*i = O(ε)·d(c*i, c*j) ∀ i, j.
1) Seeding stage: choose k initial centers and ensure that they lie in the "cores" of the k optimal clusters.
– exploits the fact that the clusters are well separated
– after the seeding stage, each optimal center has a distinct seed center very "near" it
2) Now one can run either a Lloyd step or a ball-k-means step.
Theorem: If X is ε-separated for k-means, then one can obtain an α(ε)-approximation algorithm, where α(ε) → 1 as ε → 0.
Schematic of entire algorithm
Simple sampling: Pick k centers as follows.
– first pick 2 centers c1, c2 as in 2-means
– to pick center ci+1, pick x∈X with probability ∝ minj≤i d(x,cj)²
Simple sampling alone: success probability (1−O(ρ))^k, i.e., exponentially small in k.
Greedy deletion: Start with n centers and keep deleting the center that causes the least cost increase until k centers remain; runs in O(n³d) time.
Oversampling + deletion: sample O(k) centers, then greedily delete until k remain; constant success probability, O(nkd + k³d) time.
From k well-placed seeds:
– Ball k-means or Lloyd step: gives an O(1)-approximation.
– KSS04-sampling: gives a PTAS.
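The simple-sampling stage can be sketched as follows (a naive illustration; both the pair sampling and the min-distance sampling are done in quadratic time here, not the optimized versions, and `simple_sampling` is a name chosen for this sketch):

```python
import random

def dist2(x, y):
    """Squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def simple_sampling(X, k, rng=random):
    """Pick k seeds: the first two as a pair with probability
    proportional to d(x,y)^2, then each subsequent point with
    probability proportional to its squared distance to the
    nearest seed chosen so far."""
    pairs = [(x, y) for i, x in enumerate(X) for y in X[i + 1:]]
    c1, c2 = rng.choices(pairs, weights=[dist2(x, y) for x, y in pairs])[0]
    centers = [c1, c2]
    while len(centers) < k:
        # Weight each point by min_j d(x, c_j)^2 over current seeds.
        w = [min(dist2(x, c) for c in centers) for x in X]
        centers.append(rng.choices(X, weights=w)[0])
    return centers
```

On well-separated data this tends to place one seed per optimal cluster, which is exactly what the analysis on the next slides quantifies.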
Simple sampling: analysis sketch
X*1,…, X*k: optimal clusters; c*i = ctr(X*i), ni = |X*i|, (r*i)² = ∑x∈X*i d(x,c*i)² / ni = Δ1²(X*i) / ni.
core(X*i) = {x∈X*i : d(x,c*i)² ≤ (r*i)² / ρ}, where ρ = Θ(ε^(2/3)).
Lemma: With probability (1−O(ρ))^k, all sampled centers lie in the cores of distinct optimal clusters.
Proof: We show inductively that if c1,…, ci lie in distinct cores, then with probability 1−O(ρ), so does center ci+1.
Base case: X is ε-separated for k-means ⇒ X*i ∪ X*j is ε-separated for 2-means for every i ≠ j (because merging two clusters causes a huge increase in cost). So by the 2-means analysis, the first two centers c1, c2 lie in distinct cores.
Simple sampling: analysis (contd.)
Inductive step: Assume c1,…, ci lie in cores of X*1,
…, X*i
Let C = {c1,…, ci}.A = ∑j ≥ i+1 ∑xcore(X*j) d(x,C)2 ≈ ∑j ≥ i+1 (1-)nj d(c*
j,C)2
B = ∑j ≤ k, xX*j d(x,C)2 ≈ ∑j ≤ i 1
2(X*j) + ∑j ≥ i+1(1
2(X*j) + nj d(c*
j,C)2)
X*2X*
1
core(X*1
)
c*1
c1
X*i
core(X*i)
c*i
ci
c*i+1
X*i+
1c*
k
X*k
Simple sampling: analysis (contd.)
Inductive step: Assume c1,…, ci lie in the cores of X*1,…, X*i. Let C = {c1,…, ci}.
A = ∑j≥i+1 ∑x∈core(X*j) d(x,C)² ≈ ∑j≥i+1 (1−ρ)·nj·d(c*j,C)²
B = ∑j≤k ∑x∈X*j d(x,C)² ≈ ∑j≤i Δ1²(X*j) + ∑j≥i+1 (Δ1²(X*j) + nj·d(c*j,C)²) ≈ ∑j≥i+1 nj·d(c*j,C)²
Probability = A / B = 1−O(ρ).
[Figure: clusters X*1,…, X*k with cores; seeds c1,…, ci placed, cluster X*i+1 not yet hit.]
Open Questions
•Deeper analysis of Lloyd: are there weaker conditions under which one can prove performance guarantees for Lloyd-type methods?
•A PTAS for k-means with polynomial runtime dependence on n, k and d? Is the problem APX-hard in the geometric setting?
•A PTAS for k-means under our separation condition?
•Other applications of the separation condition?
Thank You.