k-variates++: more pluses in the k-means++
TRANSCRIPT
www.data61.csiro.au
Richard Nock, Raphaël Canyasse, Roksana Boreli, Frank Nielsen
Data61 | ANU | Technion | École Polytechnique | UNSW | Sony CS Labs, Inc.
(formerly NICTA)
ICML 2016, Poster #29, Mon. 3-7pm
In this talk
❖ A generalization of the popular k-means++ seeding
❖ Two theorems on k-variates++:
❖ guarantees on the approximation of the global optimum
❖ a likelihood-ratio bound between neighbouring instances
❖ Applications: "reductions" between clustering algorithms, approximation bounds for new clustering algorithms, privacy
❖ And more! (see the poster and the paper)
Motivation
❖ k-means++ seeding is a gold standard in clustering:
❖ utterly simple to implement (iteratively pick centers with probability proportional to the squared distance to the previous centers)
❖ assumption-free (expected) approximation guarantee with respect to the k-means global optimum (Arthur & Vassilvitskii, SODA 2007):
\[
\mathbb{E}_{C}[\mathrm{potential}] \le (2 + \log k) \cdot 8\,\varphi_{\mathrm{opt}}
\]
❖ Inspired many variants: tensor clustering, distributed, data-stream, on-line, parallel clustering, clustering without centroids in closed form, etc.

[Figure: k-means++ at the center of its spawns: distributed, on-line, streamed, no closed-form centroid, tensors, more potentials.]
Motivation (continued)
❖ These approaches are spawns of k-means++:
❖ they modify the algorithm, or
❖ use it as a building block
❖ Our objective: put them all in the same "bag": a generalisation of k-means++ of which such approaches would be just "instantiations" (reductions)
❖ Because general ⇒ new applications

[Figure: k-variates++ subsuming k-means++ and its spawns (distributed, on-line, streamed, no closed-form centroid, more potentials), opening more applications.]
k-means++ (Arthur & Vassilvitskii, SODA'07)

Input: data $A \subset \mathbb{R}^d$ with $|A| = m$, $k \in \mathbb{N}_*$;
Step 1: initialise centers $C \leftarrow \emptyset$;
Step 2: for $t = 1, 2, \ldots, k$:
  2.1: randomly sample $a \sim_{q_t} A$, with $q_1 := u_m$ and, for $t > 1$, $q_t(a) := D_t(a) \cdot \big(\sum_{a' \in A} D_t(a')\big)^{-1}$, where $D_t(a) := \min_{x \in C} \|a - x\|_2^2$;
  2.2: $x \leftarrow a$;
  2.3: $C \leftarrow C \cup \{x\}$;
Output: $C$;
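For concreteness, here is a minimal Python sketch (ours, not from the paper) of the seeding above; it implements the $D^2$ sampling of step 2.1 directly.

# Minimal sketch of k-means++ seeding (D^2 sampling); A is an (m, d) numpy array.
import numpy as np

def k_means_pp(A, k, rng=np.random.default_rng()):
    # q_1 is uniform: pick the first center uniformly at random from A.
    C = [A[rng.integers(len(A))]]
    for _ in range(2, k + 1):
        # D_t(a) = min squared distance from a to the current centers C.
        D = np.min(((A[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(-1), axis=1)
        # q_t(a) is proportional to D_t(a); sample the next center accordingly.
        C.append(A[rng.choice(len(A), p=D / D.sum())])
    return np.array(C)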
k-variates++

Input: data $A \subset \mathbb{R}^d$ with $|A| = m$, $k \in \mathbb{N}_*$, random variables $\{X_a, a \in A\}$, probe functions $\wp_t : A \to \mathbb{R}^d$ $(t \ge 1)$;
Step 1: initialise centers $C \leftarrow \emptyset$;
Step 2: for $t = 1, 2, \ldots, k$:
  2.1: randomly sample $a \sim_{q_t} A$, with $q_1 := u_m$ and, for $t > 1$, $q_t(a) := D_t(a) \cdot \big(\sum_{a' \in A} D_t(a')\big)^{-1}$, where $D_t(a) := \min_{x \in C} \|\wp_t(a) - x\|_2^2$;
  2.2: randomly sample $x \sim X_a$;
  2.3: $C \leftarrow C \cup \{x\}$;
Output: $C$;
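A matching Python sketch of the generalised template (again ours): probe plays $\wp_t$ and sample_variate draws from $X_a$; the defaults (identity probe, Dirac variates) recover the k-means++ sketch above.

# Minimal sketch of k-variates++; probe and sample_variate are the two new knobs.
import numpy as np

def k_variates_pp(A, k, probe=lambda t, A: A,
                  sample_variate=lambda a, rng: a,
                  rng=np.random.default_rng()):
    C = []
    for t in range(1, k + 1):
        P = probe(t, A)  # the probed data wp_t(A)
        if t == 1:
            q = np.full(len(A), 1.0 / len(A))  # q_1 = u_m, uniform
        else:
            # D_t(a) = min squared distance from wp_t(a) to the current centers.
            D = np.min(((P[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(-1), axis=1)
            q = D / D.sum()
        a = A[rng.choice(len(A), p=q)]      # step 2.1: a ~ q_t
        C.append(sample_variate(a, rng))    # step 2.2: x ~ X_a (Dirac by default)
    return np.array(C)

For instance, hypothetical isotropic Gaussian variates centered at the data points would be k_variates_pp(A, k, sample_variate=lambda a, rng: rng.normal(a, 0.1)).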
Two theorems & applications
Theorem 1 (approximation of the global optimum)
❖ k-means potential for $C$: $\varphi(A; C) := \sum_{a \in A} \|a - c(a)\|_2^2$, with $c(a) := \mathrm{argmin}_{c \in C} \|a - c\|_2^2$
❖ Suppose $\wp_t$ is $\eta$-stretching ($\eta \ge 0$): for any optimal cluster $A$ with size $> 1$ and any $a_0 \in A$,
\[
\frac{\varphi(A; C)}{\varphi(A; \{a_0\})} \le (1 + \eta) \cdot \frac{\varphi(\wp_t(A); C)}{\varphi(\wp_t(A); \{\wp_t(a_0)\})}, \quad \forall t
\]
❖ Then $\mathbb{E}_{C \sim \text{k-variates++}}[\varphi(A; C)] \le (2 + \log k) \cdot \varphi$, with
\[
\varphi := (6 + 4\eta)\,\varphi_{\mathrm{opt}} + 2\,\varphi_{\mathrm{bias}} + 2\,\varphi_{\mathrm{var}},
\]
where $\varphi_{\mathrm{opt}} := \sum_{a \in A} \|a - c_{\mathrm{opt}}(a)\|_2^2$, $\varphi_{\mathrm{bias}} := \sum_{a \in A} \|\mathbb{E}[X_a] - c_{\mathrm{opt}}(a)\|_2^2$ and $\varphi_{\mathrm{var}} := \sum_{a \in A} \mathrm{tr}(\mathrm{cov}[X_a])$
❖ k-means++ is the instantiation with probes $\wp_t = \mathrm{Id}$ and Diracs for the $X_a$: then $\varphi_{\mathrm{bias}} = \varphi_{\mathrm{opt}}$, $\varphi_{\mathrm{var}} = 0$ and $\eta = 0$, so $\varphi = 8\,\varphi_{\mathrm{opt}}$, recovering the Arthur-Vassilvitskii bound
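As a numeric illustration (our own construction, not from the paper): with identity probes ($\eta = 0$) and hypothetical isotropic Gaussian variates $X_a = \mathcal{N}(a, \sigma^2 I)$, $\mathbb{E}[X_a] = a$ gives $\varphi_{\mathrm{bias}} = \varphi_{\mathrm{opt}}$ and $\varphi_{\mathrm{var}} = m\,d\,\sigma^2$, so the bound can be evaluated directly.

# Sketch: the three terms of Theorem 1 for Gaussian variates centered at the data.
import numpy as np

rng = np.random.default_rng(0)
m, d, k, sigma, eta = 500, 5, 3, 0.1, 0.0
A = rng.normal(size=(m, d))
C_opt = A[rng.choice(m, size=k, replace=False)]  # stand-in for the optimal centers

dist2 = ((A[:, None, :] - C_opt[None, :, :]) ** 2).sum(-1)
phi_opt = dist2.min(axis=1).sum()   # sum_a ||a - c_opt(a)||^2
phi_bias = phi_opt                  # E[X_a] = a here, so the bias term equals phi_opt
phi_var = m * d * sigma**2          # sum_a tr(cov[X_a]) for isotropic Gaussians

phi = (6 + 4 * eta) * phi_opt + 2 * phi_bias + 2 * phi_var
print("Theorem 1 bound on E[potential]:", (2 + np.log(k)) * phi)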
Remarks
❖ The guarantee approaches the statistical lower bound (Fréchet-Cramér-Rao-Darmois)
❖ It can be better than the Arthur-Vassilvitskii bound, in particular if $\varphi_{\mathrm{bias}} < \varphi_{\mathrm{opt}}$: $\varphi_{\mathrm{bias}}$ is a knob through which background / domain knowledge may improve the general bound
Applications
❖ Reductions from k-variates++ ⇒ approximability ratios:
❖ pick a clustering algorithm $L$,
❖ show that the expected output of $L$ = that of k-variates++ for particular choices of $\wp_t$ and $X_{\cdot}$ (note: no computational constraint, we just need existence),
❖ get an approximability ratio for $L$
Summary (poster, paper)

Setting       Algorithm        Probe functions $\wp_t$       Densities $X_{\cdot}$
Batch         k-means++        Identity                      Diracs
Distributed   d-k-means++      Identity                      Uniform, support = subsets
Distributed   p+d-k-means++    Identity                      Non-uniform, compact support
Streaming     s-k-means++      Synopses                      Diracs
On-line       ol-k-means++     Point (batch not hit) /       Diracs
                               closest center (batch hit)
Distributed clustering
❖ Setting: {data nodes = Forgy nodes} & a special sampling node $N_*$ (or a Forgy node), e.g. in hybrid, server-assisted P2P networks
❖ Forgy nodes hold the data ($\cup_i A_i = A$) and sample uniformly within their local data; the sampling node holds no data and samples Forgy nodes non-uniformly
❖ Only the $k$ sampled data points are communicated; no other data is communicated

[Figure: Forgy nodes $(F_1, A_1), \ldots, (F_5, A_5)$ around the sampling node $N_*$.]
Algorithm + Theorem (d-k-means++)
❖ Algorithm: iterate for $t = 1, 2, \ldots, k$:
❖ $N_*$ chooses (non-uniformly, $\sim D^*_t$) a Forgy node, say $F_i$
❖ $F_i$ samples (uniformly) a point $a_t \in A_i$, and sends $a_t$ to $F_j$, $\forall j$
❖ each $F_j$ computes & sends $d_j \in \mathbb{R}_+$ to $N_*$, which updates $D^*_t$
❖ Theorem: $\mathbb{E}[\varphi(A, C)] \le (2 + \log k) \cdot \varphi$, with $\varphi := 10\,\varphi_{\mathrm{opt}} + 6\,\varphi_{\mathrm{Fs}}$, where $\varphi_{\mathrm{Fs}} := \sum_{i \in [n]} \sum_{a \in A_i} \|c(A_i) - a\|_2^2$ is the spread of the Forgy nodes
❖ Remarks: $\varphi_{\mathrm{opt}}$ is the global optimum on the total data; the bound gets all the better as Forgy nodes aggregate "local" data. A simulation sketch follows below.
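A single-process Python simulation sketch of this protocol (our variable names; the network is mimicked by loops, and the uniform first-round node pick is our simplification):

# Sketch: simulating d-k-means++ with n Forgy nodes holding local data A_i.
import numpy as np

rng = np.random.default_rng(1)
forgy_data = [rng.normal(loc=3 * i, size=(100, 2)) for i in range(5)]  # A_1..A_n
k = 4

centers = []
node_weights = np.ones(len(forgy_data)) / len(forgy_data)  # initial node pick
for t in range(k):
    i = rng.choice(len(forgy_data), p=node_weights)        # N* picks Forgy node F_i
    a_t = forgy_data[i][rng.integers(len(forgy_data[i]))]  # F_i: uniform local point
    centers.append(a_t)                                    # a_t broadcast to all F_j
    # Each F_j computes d_j = sum over A_j of min squared distances to the centers;
    # N* normalises the received d_j into the next node distribution D*_{t+1}.
    d = np.array([np.min(((Aj[:, None, :] - np.array(centers)[None, :, :]) ** 2)
                         .sum(-1), axis=1).sum() for Aj in forgy_data])
    node_weights = d / d.sum()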
Theorem 2 (likelihood ratio bound for neighbour samples)
❖ Assumption: with $\Omega := \mathrm{Ball}(L_2, R)$, all $X_a$ satisfy (see e.g. differential privacy):
\[
(\mathrm{d}p_{X_{a'}} / \mathrm{d}p_{X_a})(x) \le \varrho(R), \quad \forall a, a' \in A, \forall x \in \Omega
\]
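One hypothetical way to satisfy this assumption (our example, not from the slides): isotropic Gaussian variates $X_a = \mathcal{N}(a, \sigma^2 I_d)$ with the data inside $\Omega$, since for $x, a, a' \in \mathrm{Ball}(L_2, R)$,
\[
\frac{\mathrm{d}p_{X_{a'}}}{\mathrm{d}p_{X_a}}(x)
  = \exp\!\Big(\frac{2\langle x, a' - a\rangle + \|a\|_2^2 - \|a'\|_2^2}{2\sigma^2}\Big)
  \le \exp\!\Big(\frac{4R \|a' - a\|_2}{2\sigma^2}\Big)
  \le \exp\!\big(2R\,\mathrm{diam}(A) / \sigma^2\big) =: \varrho(R).
\]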
Theorem 2 (continued)
❖ Fix $\wp_t = \mathrm{Id}(\cdot)$. For any neighbour $A' \approx A$ (the two differ on one point),
\[
\frac{P_{C \sim \text{k-variates++}}[C \mid A']}{P_{C \sim \text{k-variates++}}[C \mid A]} \le (1 + \tau_w)^{k-1} + f(k) \cdot \tau_w \cdot (1 + \tau_s)^{k-1} \cdot \varrho(R)
\]
❖ $\tau_w, \tau_s$ (with $0 < \tau_w, \tau_s \ll 1$) are spread and monotonicity parameters (formal definition in the poster / paper)
❖ They can be estimated / computed from the data
❖ In general, they $\to 0$ as $m$ grows
❖ Conditions under which the first term $\Rightarrow 1$ and the second term $\Rightarrow 0$?
❖ If the densities of all $X_a$ take values in $[\epsilon_m, \epsilon_M] \not\ni 0$, then with probability $\ge 1 - \delta$, as long as $k \le \frac{\epsilon_m \delta^2}{4 \epsilon_M} \cdot \sqrt{m}$,
\[
\frac{P[C \mid A']}{P[C \mid A]} \le 1 + \underbrace{\left(\frac{\epsilon_M}{\epsilon_m}\right)^k \cdot \frac{4}{m^{\frac{1}{4} + \frac{1}{d+1}}} + \left(\frac{64}{k^2 d}\right)^k \cdot \varrho(2R)}_{o(1)}
\]
❖ No $\tau_w, \tau_s$ in this bound (the proof exhibits small values w.h.p., and experiments display such values). Application in differential privacy (sublinear noise!)
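To see the correction terms vanish with $m$, a small numeric sketch (ours; eps_m, eps_M and rho_2R are hypothetical placeholders for the density bounds and $\varrho(2R)$):

# Sketch: the instantiated Theorem 2 bound above, as a function of the sample size m.
def theorem2_bound(m, k, d, eps_m, eps_M, rho_2R):
    term1 = (eps_M / eps_m) ** k * 4.0 / m ** (0.25 + 1.0 / (d + 1))
    term2 = (64.0 / (k ** 2 * d)) ** k * rho_2R
    return 1.0 + term1 + term2

for m in (10**3, 10**5, 10**7):
    print(m, theorem2_bound(m, k=5, d=50, eps_m=0.9, eps_M=1.1, rho_2R=4.0))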
Experiments
❖ k-variates++ (d-k-means++) vs k-means++ & k-means‖ (Bahmani et al., 2012), simulated data, $d = 50$; peers are sampled with $\mathbb{E}[|A_i|] = 500$ until $\sum_i |A_i| \approx 20000$. For each peer, (a) data is uniformly sampled in a hyper-rectangle, and (b) $p\%$ of its points are given to a random peer (this increases $\varphi_{\mathrm{Fs}}$ and makes the problem more "difficult")

[Plot: $\varphi_{\mathrm{Fs}}(p) / \varphi_{\mathrm{Fs}}(0)$ as a function of $p \in [0, 50]$.]
Experiments (continued)
❖ k-variates++ (d-k-means++) vs k-means++ & k-means‖ (Bahmani et al., 2012; used with their best parameters), comparing
\[
\rho_\varphi(H) := \frac{\varphi(\text{d-k-means++}) - \varphi(H)}{\varphi(H)} \cdot 100
\]

[Plots: $\rho_\varphi(\text{k-means++})$ and $\rho_\varphi(\text{k-means‖})$ over $k \in \{4, \ldots, 10\}$ and $p \in [0, 50]$; k-variates++ beats k-means++.]
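The comparison metric, as a one-line Python helper (the phi arguments are hypothetical potential values):

# rho_phi(H): relative potential of d-k-means++ vs algorithm H, in percent;
# negative values mean d-k-means++ achieved the smaller (better) potential.
def rho_phi(phi_dkmeanspp, phi_H):
    return (phi_dkmeanspp - phi_H) / phi_H * 100.0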
Conclusions
❖ We provide a generalisation of k-means++ with a guaranteed approximation of the global optimum
❖ k-variates++ can be used as is (e.g. privacy, k-means++) or to prove approximation properties of other algorithms via "reductions" between clustering algorithms
❖ Come see the poster for more examples!
❖ Future: use the theorems to address stability, generalisation and smoothed analysis