Big-data Clustering: K-means vs K-indicators
Yin Zhang
Dept. of Computational & Applied Math., Rice University, Houston, Texas, U.S.A.
Joint work with Feiyu Chen & Taiping Zhang (CQU), Liwei Xu (UESTC)
2016 International Workshop on Mathematical Issues in Information Sciences
Dec. 20, 2016. Shenzhen, China
Chen, Xu, Zhang, Zhang Big-data Clustering: K-means vs K-indicators Dec. 20, 2016 1 / 39
Outline
1 Introduction to Data Clustering
2 K-means: Model & Algorithm
3 K-indicators: Model & Algorithm
4 Numerical experiments
Partition data points into clusters
Figure: Group n data points into k = 4 clusters (n ≫ k)
Non-globular and globular clusters
Figure: Rows 1–2: non-globular. Row 3: Globular
Clustering Images
Figure: Group images into 2 clusters (cats and dogs)
Clustering Images II
Figure: Group faces by individuals
Formal Description:
Input:
data objects m_i ∈ R^d, i = 1, 2, …, n
(estimated) number of clusters k < n
a similarity measure (“distance”)
Output:
Assignments to k clusters:
(a) objects within the same cluster are more similar to each other
(b) objects in different clusters are less similar to each other
A fundamental technique in unsupervised learning
Two-Stage Strategy in practice
Stage 1: Pre-processing data
Transform data to “latent features”
dependent on definition of “similarity”
various data-specific techniques exist
e.g.: PCA clustering, spectral clustering [1]
Stage 2: Clustering “latent features”
Method of choice: K-means (model [2] + algorithm [3])
Our focus is on Stage 2
[1] See von Luxburg, A Tutorial on Spectral Clustering, 2007.
[2] MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, 1967.
[3] Lloyd, Least Squares Quantization in PCM, 1957.
Pre-processing
Spectral Clustering [4]: Given a similarity measure, do:
1 Construct a similarity graph (many options, e.g. KNN)
2 Compute k leading eigenvectors of the graph Laplacian
3 Cluster the rows via K-means (i.e., the Lloyd algorithm)
Dimension Reduction:
Principal Component Analysis (PCA)
Non-negative Matrix Factorization (NMF)
Random Projection (RP)
Kernel Tricks: nonlinear change of spaces
Afterwards, K-means is applied to do clustering
[4] See von Luxburg, A Tutorial on Spectral Clustering, 2007, for references.
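As a concrete sketch of the spectral pre-processing steps above, here is a small numpy implementation of the embedding stage; the Gaussian edge weights, the median-based bandwidth, and the symmetrization of the kNN graph are illustrative assumptions, not choices prescribed by the slide.

```python
import numpy as np

def spectral_embedding(M, k, n_neighbors=5):
    """Sketch: kNN similarity graph -> normalized graph Laplacian ->
    k eigenvectors for the smallest eigenvalues as latent features."""
    n = M.shape[0]
    sq = ((M[:, None, :] - M[None, :, :]) ** 2).sum(-1)   # pairwise squared dists
    sigma = np.median(sq) + 1e-12                          # bandwidth heuristic
    W = np.zeros((n, n))
    nbrs = np.argsort(sq, axis=1)[:, 1:n_neighbors + 1]    # skip self at column 0
    for i in range(n):
        W[i, nbrs[i]] = np.exp(-sq[i, nbrs[i]] / sigma)
    W = np.maximum(W, W.T)                                 # symmetrize the graph
    d = W.sum(axis=1)
    Dinv = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(n) - Dinv[:, None] * W * Dinv[None, :]      # normalized Laplacian
    _, vecs = np.linalg.eigh(L)                            # ascending eigenvalues
    return vecs[:, :k]                                     # rows go to Stage 2
```

The rows of the returned matrix are the "latent features" that Stage 2 (K-means, or later KindAP) clusters.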
K-means Model
Given {m_i}_{i=1}^n ⊂ R^d and k > 0, the default K-means model is

  min_{x_1, …, x_k}  Σ_{i=1}^{n}  min{ ‖m_i − x_j‖² | j = 1, …, k },

where the j-th centroid x_j = the mean of the points in the j-th cluster.

The K-means model
searches for k centroids in data space
is essentially discrete and generally NP-hard
is suitable for globular clusters (2-norm)
Lloyd Algorithm a.k.a. K-means Algorithm
Lloyd Algorithm [5] (or Lloyd-Forgy):
Select k points as the initial centroids;
while "not converged" do
  (1) Assign each point to the closest centroid;
  (2) Recompute the centroid for each cluster.
end while

The algorithm
converges to a local minimum at O(dkn) cost per iteration
is sensitive to initialization due to its greedy nature
requires (costly) multiple restarts for consistency
[5] Lloyd, Least Squares Quantization in PCM, 1957.
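The while-loop above translates almost directly into code. A minimal numpy sketch, with the assumptions that "closest" means Euclidean distance, convergence means unchanged assignments, and an empty cluster keeps its old centroid:

```python
import numpy as np

def lloyd(M, centroids, max_iter=100):
    """Minimal sketch of the Lloyd iteration described above."""
    labels = None
    for _ in range(max_iter):
        # (1) Assign each point to the closest centroid.
        d2 = ((M[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # assignments stabilized
        labels = new_labels
        # (2) Recompute the centroid of each cluster.
        for j in range(len(centroids)):
            pts = M[labels == j]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return labels, centroids
```

Each pass costs O(dkn), matching the per-iteration cost quoted above.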
(Non-uniform) Random Initialization:
K-means++ [6]:
1 Randomly select a data point in M as the first centroid.
2 Randomly select a new centroid in M so that it is probabilistically far away from the existing centroids.
3 Repeat step 2 until k centroids are selected.
Advantages:
The expected objective value is within an O(log k) factor of the optimal value
Much better than uniformly random seeding in practice
K-means Algorithm = K-means++/Lloyd: method of choice
[6] Arthur and Vassilvitskii, K-means++: The Advantages of Careful Seeding, 2007.
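Step 2 ("probabilistically far away") means drawing each new centroid with probability proportional to its squared distance to the nearest centroid chosen so far. A minimal sketch:

```python
import numpy as np

def kmeanspp_seed(M, k, seed=0):
    """K-means++ seeding: sample each new centroid with probability
    proportional to the squared distance to the nearest chosen centroid."""
    rng = np.random.default_rng(seed)
    centroids = [M[rng.integers(len(M))]]          # step 1: uniform first pick
    for _ in range(k - 1):                         # steps 2-3: D^2 sampling
        C = np.array(centroids)
        d2 = ((M[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        centroids.append(M[rng.choice(len(M), p=d2 / d2.sum())])
    return np.array(centroids)
```

The seeds can then be handed to the Lloyd iteration, which is the K-means++/Lloyd combination referred to above.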
Convex Relaxation Models
LP Relaxation [7] =⇒ O(n²) variables
SDP Relaxation [8] =⇒ O(n²) variables
Sum-of-Norms Relaxation [9] =⇒ O(n²) evaluation cost
  min_{x_1, …, x_n}  Σ_{i=1}^{n} ‖m_i − x_i‖²  +  γ Σ_{i<j} ‖x_i − x_j‖   ← sparsity promoting
Due to high computational cost, fully convexified models/methodshave not shaken the dominance of K-means
Need to keep the same O(n) per-iteration cost as K-means
[7] Awasthi et al., Relax, No Need to Round: Integrality of Clustering Formulations, 2015.
[8] Peng and Xia, Approximating k-means-type Clustering via Semidefinite Programming, 2007.
[9] Lindsten et al., Just Relax and Come Clustering! A Convexification of k-Means..., 2011.
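The sum-of-norms objective is easy to state, and even just evaluating it exposes the O(n²) pairwise cost mentioned above. A minimal sketch (the helper name is ours, for illustration):

```python
import numpy as np

def son_objective(M, X, gamma):
    """Sum-of-norms clustering objective: n fit terms plus an O(n^2)
    sparsity-promoting pairwise penalty, which drives the evaluation cost."""
    fit = ((M - X) ** 2).sum()                          # sum_i ||m_i - x_i||^2
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    reg = np.triu(dists, 1).sum()                       # sum_{i<j} ||x_i - x_j||
    return fit + gamma * reg
```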
Why another new model/algorithm?
A Revealing Example
Synthetic Data:
Generate k centers with mutual distance = 2
Create a cloud around each center within radius < 1
Globular, perfectly separated, ground truth known
Experiment:
Pre-processing: k principal components
Clustering: apply K-means and KindAP (new)
k increases from 10 to 150
Recovery Percentage
[Figure: Accuracy (%) vs. number of clusters k; size [n, d, k] = [40k, 300, k]. Curves: Lloyd at radius 0.33, 0.66, 0.99; KindAP at radius 0.99.]
Such behavior hinders K-means in big-data applications
Speed
[Figure: Time (s, log scale) vs. number of clusters k; size [n, d, k] = [40k, 300, k]. Curves: Lloyd at radius 0.33, 0.66, 0.99; KindAP at radius 0.99.]
KindAP runs faster than Lloyd (1 run) with GPU acceleration
Ideal Data
Ideal K-rays data contains n points lying on k rays in R^d.
After permutation:

  M = [ h_1 p_1^T ]     [ h_1             ] [ p_1^T ]
      [ h_2 p_2^T ]  =  [      h_2        ] [ p_2^T ]  ≜  H P^T
      [    ...    ]     [          ...    ] [  ...  ]
      [ h_k p_k^T ]     [             h_k ] [ p_k^T ]
       (n × d)               (n × k)         (k × d)

p_1, …, p_k ∈ R^d are ray vectors; h_i ∈ R^{n_i} are positive vectors.
Each data vector (M row) is a positive multiple of a ray vector.
Ideal K-points data are special cases of ideal K-rays data.
k Indicators
The indicator matrix H ≥ 0 contains k indicators (orthogonal columns):

  [H]_ij > 0 ⇐⇒ point i is on ray j.

The j-th column of H is an indicator for Cluster j.
E.g., H ∈ R^{6×2} consists of two indicators:

  He_1 = (3, 2, 1, 0, 0, 0)^T,   He_2 = (0, 0, 0, 2, 2, 2)^T

Points 1–3 are in Cluster 1; points 4–6 (identical) are in Cluster 2.
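An indicator matrix like the one above can be built from hard labels and per-point weights; indicator_matrix is a hypothetical helper for illustration, not part of the talk:

```python
import numpy as np

def indicator_matrix(labels, weights):
    """Build H >= 0 with one positive entry per row:
    [H]_ij > 0 iff point i is assigned to cluster j."""
    n, k = len(labels), labels.max() + 1
    H = np.zeros((n, k))
    H[np.arange(n), labels] = weights
    return H

# The 6-by-2 example above:
H = indicator_matrix(np.array([0, 0, 0, 1, 1, 1]),
                     np.array([3., 2., 1., 2., 2., 2.]))
```

Because the columns have disjoint supports, H^T H is diagonal, i.e. the k indicators are mutually orthogonal.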
Range Relationship
For ideal K-rays data, range spaces of M and H are the same:
R(M) = R(H)
Consider perturbed data M̃ = M + E with a small perturbation E; then

  R(U_[k]) ≈ R(M̃) ≈ R(M) = R(H),

where U_[k] ∈ R^{n×k} contains the k leading left singular vectors of M̃.
The approximations become exact as E vanishes.
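The identity R(M) = R(H) can be checked numerically on small ideal K-rays data; the sizes and the weight distribution below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, m = 3, 5, 10                       # k rays in R^d, m points per ray
P = rng.standard_normal((k, d))          # ray vectors p_1, ..., p_k
weights = [rng.uniform(0.5, 2.0, m) for _ in range(k)]
M = np.vstack([np.outer(weights[j], P[j]) for j in range(k)])  # M = H P^T
H = np.zeros((k * m, k))                 # indicator matrix, positive blocks
for j in range(k):
    H[j * m:(j + 1) * m, j] = weights[j]
U = np.linalg.svd(M, full_matrices=False)[0][:, :k]  # k leading left sing. vecs
PH = H @ np.linalg.pinv(H)               # orthogonal projector onto R(H)
err = np.linalg.norm(PH @ U - U)         # ~0, since R(M) = R(H)
```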
Two Choices of subspace distance
Minimizing the distance between 2 subspaces:
K-means’ choice:
  dist(R(U_[k]), R(H)) = ‖(I − P_R(H)) U_[k]‖_F

where P_R(H) is the orthogonal projection onto R(H).
Our Choice:
  dist(R(U_[k]), R(H)) = min_{U ∈ Uo} ‖U − H‖_F
where Uo is the set of all the orthonormal bases for R(U[k ]).
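The two distances can be computed side by side. In dist_kind, the inner minimization over orthonormal bases U = U_[k] Z is an orthogonal Procrustes problem solved by one k-by-k SVD; the helper names are ours, for illustration:

```python
import numpy as np

def dist_kmeans(Uk, H):
    """K-means' choice: ||(I - P_R(H)) U_[k]||_F."""
    PH = H @ np.linalg.pinv(H)            # projector onto R(H)
    return np.linalg.norm(Uk - PH @ Uk)

def dist_kind(Uk, H):
    """Our choice: min over orthonormal bases U of R(U_[k]) of ||U - H||_F,
    via the Procrustes solution Z = P Q^T from the SVD U_[k]^T H = P D Q^T."""
    P, _, Qt = np.linalg.svd(Uk.T @ H)
    return np.linalg.norm(Uk @ (P @ Qt) - H)
```

When H spans exactly R(U_[k]), both distances vanish, as the test below checks.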
Two Models
K-means model (a matrix version [10]):
  min_H ‖(I − HHᵀ) U_[k]‖²_F,  s.t. H ∈ H ∩ ...
where the objective is highly non-convex in H.
K-indicators model:
  min_{U,H} ‖U − H‖²_F,  s.t. U ∈ Uo, H ∈ H
where the objective is convex, and
  Uo = { U_[k]Z ∈ R^{n×k} : ZᵀZ = I },   H = { H ∈ R^{n×k}_+ : HᵀH = I }
K-indicators model looks more tractable
[10] Boutsidis et al., Unsupervised Feature Selection for k-means Clustering Problems, NIPS, 2009.
K-indicators Model is Geometric
  min_{U,H} ‖U − H‖²_F,  s.t. U ∈ Uo, H ∈ H   (1)
is to find the distance between 2 nonconvex sets Uo and H.
H is nasty, permitting no easy projection
Uo is milder, easily projectable
Projection onto Uo: given X, compute the k-by-k SVD

  U_[k]ᵀ X = PDQᵀ,

then P_Uo(X) = U_[k](PQᵀ).
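In code, this projection amounts to a single small SVD; a sketch:

```python
import numpy as np

def proj_Uo(Uk, X):
    """Projection onto Uo = {U_[k] Z : Z^T Z = I}: take the k-by-k SVD
    U_[k]^T X = P D Q^T and return U_[k] (P Q^T), as on the slide."""
    P, _, Qt = np.linalg.svd(Uk.T @ X)
    return Uk @ (P @ Qt)
```

The result always has orthonormal columns, and projecting a point already in Uo leaves it unchanged.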
Partial Convexification
An intermediate problem:
  min ‖U − N‖²_F,  s.t. U ∈ Uo, N ∈ N   (2)
where N = {N ∈ Rn×k | N ≥ 0} is a closed convex set.
Reasons:
N is convex, with P_N(X) = max(0, X).
The boundary of N contains H.
A local minimum of (2) can be found by alternating projection.
KindAP algorithm
2-level alternating “projections” algorithmic scheme:
Figure: (1) alternating projection (2) “proj-like” operator (3) projection
Uncertainty Information
The magnitude of N_ij reflects the likelihood that point i belongs to cluster j.

Let N̄ hold the elements of N sorted row-wise in descending order. Define the soft indicator vector:

  s_i = 1 − N̄_i2 / N̄_i1 ∈ [0, 1],  i = 1, …, n.

Intuition:
The closer s_i is to 0, the more uncertain the assignment of point i.
The closer s_i is to 1, the less uncertain the assignment.
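Computing the soft indicators from N is one line per row; a sketch (the small floor on N̄_i1 guards against all-zero rows, an implementation assumption):

```python
import numpy as np

def soft_indicators(N):
    """s_i = 1 - Nbar_i2 / Nbar_i1, where Nbar is N with each row sorted
    in descending order; values near 1 mean confident assignments."""
    Nbar = np.sort(N, axis=1)[:, ::-1]              # row-wise descending
    return 1.0 - Nbar[:, 1] / np.maximum(Nbar[:, 0], 1e-12)
```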
Algorithm: KindAP
Input: M ∈ R^{n×d} and integer k ∈ (0, min(d, n)]
Compute the k leading left singular vectors of M to form U_k
Set U = U_k ∈ Uo
while not converged do
  Starting from U, find a minimizing pair (U, N) ← argmin dist(Uo, N).
  Find an H ∈ H close to N.
  U ← P_Uo(H).
end while
Output: U ∈ Uo, N ∈ N, H ∈ H.
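Putting the pieces together, here is a rough sketch of the two-level scheme. The stopping rules, the rounding N → H (keep each row's largest entry), and the sign-fix on U_k are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def kindap(M, k, outer=20, inner=50):
    """Sketch of KindAP: inner loop alternates projections between Uo and
    N = {N >= 0}; outer loop rounds N to an indicator H and projects back."""
    Uk = np.linalg.svd(M, full_matrices=False)[0][:, :k]
    Uk = Uk * np.where(Uk.sum(axis=0) >= 0, 1.0, -1.0)  # sign-fix heuristic
    U, labels = Uk.copy(), None
    for _ in range(outer):
        for _ in range(inner):                  # alternate P_N and P_Uo
            N = np.maximum(U, 0.0)
            P, _, Qt = np.linalg.svd(Uk.T @ N)
            U = Uk @ (P @ Qt)
        N = np.maximum(U, 0.0)
        new_labels = N.argmax(axis=1)           # round N to H: one winner per row
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        H = np.zeros_like(N)
        H[np.arange(len(N)), new_labels] = N[np.arange(len(N)), new_labels]
        P, _, Qt = np.linalg.svd(Uk.T @ H)
        U = Uk @ (P @ Qt)                       # U <- P_Uo(H)
    return labels
```

Every step is an O(n)-sized projection plus a k-by-k SVD, keeping the per-iteration cost comparable to Lloyd's.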
A typical run on synthetic data
[d, n, k] = [500, 10000, 100]
***** radius 1.00 *****
Outer 1: 43  dUH: 5.08712480e+00  idxchg: 9968
Outer 2: 13  dUH: 3.02146591e+00  idxchg: 866
Outer 3:  2  dUH: 3.02144864e+00  idxchg: 0

KindAP 100.00%  Elapsed time is 1.923610 seconds
Kmeans  88.66%  Elapsed time is 4.920525 seconds
***** radius 2.00 *****
Outer 1: 42  dUH: 5.98082240e+00  idxchg: 9900
Outer 2:  3  dUH: 5.55739995e+00  idxchg: 250
Outer 3:  2  dUH: 5.55442009e+00  idxchg: 110
Outer 4:  2  dUH: 5.55442000e+00  idxchg: 0

KindAP 100.00%  Elapsed time is 1.797684 seconds
Kmeans  82.73%  Elapsed time is 7.909552 seconds
Example 1: Three tangential circular clusters in 2-D
[Plot A: Synthetic data [n, d, k] = [7500, 2, 3]; axes span roughly [−2, 2] × [−1, 2]; color bar from 0.2 to 1.]
Figure: Points colored according to soft indicator values
Example 2: Groups of 40 faces from 4 persons (ORL data)
[Plot B: ORL data [n, d, k] = [40, 4096, 4]; soft indicator values in [0, 1] plotted for 40 faces.]
Group 1: Mean(s) = 0.98; AC = 100.00%
Group 2: Mean(s) = 0.88; AC = 90.00%
Group 3: Mean(s) = 0.81; AC = 80.00%
Group 4: Mean(s) = 0.77; AC = 70.00%
Group 5: Mean(s) = 0.73; AC = 60.00%
Figure: Five soft indicators sorted in descending order
K-rays data Example
[Figure: three panels plotting First Dimension vs. Second Dimension on roughly [−20, 20] × [−15, 15]: Original Data, Result of Lloyd, Result of KindAP.]
Figure: Points are colored according to their clusters
Kernel data Example
[Figure: six 2-D kernel datasets with mean soft indicator values:
Two spirals, Mean(s) = 0.99961; Cluster in cluster, Mean(s) = 0.99586; Corners, Mean(s) = 0.99964; Half-kernel, Mean(s) = 0.99999; Crescent & Full Moon, Mean(s) = 1; Outlier, Mean(s) = 1.]
Figure: Non-globular clusters in 2-D
Real Examples: two popular datasets with “large” k
COIL image dataset
k = 100 objects, each having 72 different images
n = 7200 images
d = 1024 pixels (image size: 32 × 32)

TDT2 document dataset
k = 96 different usenet newsgroups
n = 10212 documents
d = 36771 words
Pre-processing:
KNN + Normalized Laplacian + Spectral Clustering
Algorithms Tested
3 Algorithms Compared
KindAP
KindAP+L: Run 1 Lloyd starting from KindAP centers
Lloyd10000: K-means with 10000 replications
K-means Code: kmeans in the Matlab Statistics and Machine Learning Toolbox (R2016a with GPU acceleration)

Computer: A desktop with an Intel Core i7-3770 CPU at 3.4GHz / 8GB RAM
K-means Objective Value
[Figure: K-means objective (square error) vs. number of clusters k for KindAP, KindAP+L, and Lloyd (10,000 runs). Panel A: TDT2, [n, d, k] = [10212, 36771, 96]. Panel B: COIL100, [n, d, k] = [72k, 1024, k].]
Figure: A: KindAP+L best B: KindAP+L best
Clustering Accuracy
[Figure: clustering accuracy (%) vs. number of clusters k for KindAP, KindAP+L, and Lloyd (10,000 runs). Panel C: TDT2, [n, d, k] = [10212, 36771, 96]. Panel D: COIL100, [n, d, k] = [72k, 1024, k].]
Figure: A: KindAP+L best B: KindAP best
K-means (K-indicators) model fits TDT2 (COIL100) better
Running Time (average time for Lloyd)
[Figure: running time (s) vs. number of clusters k for KindAP, KindAP+L, and Lloyd (average time per run). Panel E: TDT2, [n, d, k] = [10212, 36771, 96]. Panel F: COIL100, [n, d, k] = [72k, 1024, k].]
Figure: 1 KindAP run ≈ 1 Lloyd run
Visualization on Yale Face Dataset: n = 165, k = 16 (k∗ = 15)
M ≈ M_k = U_k Σ_k V_kᵀ;   U_k → KindAP → (U, N, H);   W = M_kᵀ U

Figure: A: V_k. B: W. C–E: face ≈ V_k c_1 = W c_2, where c_1 is a row of U_k Σ_k and c_2 is a row of U (→ face in cluster 16).
Summary
Property                   K-indicators   K-means
parameter-free             yes            yes
O(n) cost per iter.        yes            yes
non-greedy                 yes            no
no need for replications   yes            no
suitable for big-k data    yes            no
posterior info available   yes            no
Contribution:
enhanced infrastructure for unsupervised learning

Further Work:
global optimality under favorable conditions (already proven for ideal data)
Thank you!