

Fast Spectral Clustering with Landmark-based Subspace Iteration

Zejun Gan, Software School, Fudan University, [email protected]
Chaofeng Sha, School of Computer Science, Fudan University, [email protected]
Junyu Niu, Software School, Fudan University, [email protected]

Abstract—Spectral clustering has received a great deal of attention due to its flexibility on various types of geometry and its high-quality clustering results. However, with the rapid increase of data size, spectral clustering quickly becomes infeasible because of its cubic complexity. Various sampling methods with at best quadratic complexity in time and space have been proposed, but they cannot break out of this box because of their dependency on eigen-decomposition. In this paper, we propose a novel framework called Landmark-based Subspace Iteration Spectral Clustering (LSISC), which scales linearly in both time and space with the data size. Specifically, we use the similarity between points and landmarks as the new feature of the points, and exploit subspace iteration to perform spectral clustering in near-linear time, while no pairwise information is dropped. We show that the running time of LSISC is not sensitive to the number of landmarks, which allows more landmarks to be chosen for better accuracy. Experimental results show that our approach gives better clustering results in much less time than other state-of-the-art spectral methods.

I. INTRODUCTION

Clustering, one of the fundamental learning tasks in data mining, aims at dividing unsupervised data into disjoint sets that maximize the similarity between objects in the same cluster and minimize it between different clusters. In recent years, increasing attention has been paid to spectral clustering methods, which work on the spectrum of the pairwise similarity matrix of the data. Owing to its flexibility on various types of geometry, spectral clustering yields much more accurate results than methods such as k-means that rely only on Euclidean geometry.

However, the initial version of spectral clustering and most of its successors are stranded by the time and space complexity of the core steps of their algorithms. The general steps of spectral clustering are constructing an adjacency matrix and obtaining the eigenvectors corresponding to the smallest eigenvalues of the related dense Laplacian matrix. For a dataset with n data points, the former step takes O(n^2) time and space, and the latter step takes even O(n^3) time. Neither operation is practicable under the present massive growth of data size, even with the assistance of parallel computing.

Many improved methods have been proposed to reduce either the space or the time complexity. [1] [2] [3] propose sampling representative data beforehand to lower the computation cost. Another commonly used approach, the Nyström method [4] [5] [6] [7], focuses on accelerating the eigen-decomposition on a low-rank approximation matrix generated by decomposing an inner sub-matrix and projecting back. Graph compression [8] is another approach to save space, based on the premise that real-life affinity graphs closely follow power-law degree distributions. With the increasing popularity of parallel computing, it is also natural to adapt the algorithm to Map-Reduce [9].

Nevertheless, the approaches above are still confined to the original algorithm and do not break through the bottleneck of its two key steps: the generation and storage of the pairwise similarity matrix, and the eigen-decomposition. Sampling techniques only relieve the symptoms: as the data size becomes considerably large, the sampled matrix either keeps the existing deficiencies or causes severe accuracy loss, depending on the sampling rate. Parallel computing, which gains a linear speed-up at the cost of more computational resources and storage, cannot keep pace with the cubically growing demand as the data keep growing.

In this paper, we propose a scalable, fast and accurate spectral clustering framework called Landmark-based Subspace Iteration Spectral Clustering (LSISC). We are inspired by the clustering methods proposed in [10], which uses sparse coding, and in [11], which uses power iteration clustering, while we aim to find a better way to handle massive data sets accurately in linear time and space and thus break through the bottleneck. Specifically, we select p landmarks from the original data set and take the adjacency relation between data points and landmarks as the new feature of the data points. To avoid directly computing the eigenvectors of the Laplacian matrix or its sub-matrix, we adopt an improved version of power iteration called subspace iteration to find the top k eigenvectors of the Laplacian matrix in O(nk) time, which is almost linear since k is usually very small. Our contributions are as follows:

Speed: Our framework for spectral clustering runs in linear time and space without specific limitations on the input data. To reduce the space cost, we use a compressed form of the adjacency matrix to avoid building the full one. We also introduce a general linear-time subspace iteration algorithm to obtain the top k eigenvectors of the compressed adjacency matrix, which is widely needed in many data mining methods.

Accuracy: We show that our framework gives competitive or even better results than state-of-the-art methods, with only a few landmark points picked.

Scalability: LSISC is well suited to running in a distributed environment, since most of the computations involved are simply matrix multiplications.

The rest of the paper is organized as follows: in Section II, we give a short review of spectral clustering and several state-of-the-art methods for reducing its time and space cost. We describe our LSISC framework in Section III. Section IV gives experimental results on parameter choosing and compares the time and space cost of the algorithms. We conclude our work in Section V.

II. RELATED WORK

A. Spectral Clustering

Given a set of data points x_1, ..., x_n with x_i ∈ R^d and some measure of similarity s_{ij} ≥ 0 between pairs of data points, spectral clustering constructs a pairwise similarity undirected graph with vertices representing the data points and edge weights representing the similarity between vertices. Spectral clustering then converts the clustering problem into finding the minimum cut of the graph. Graph Laplacian matrices are the main tools for spectral clustering, but different authors define the Laplacian matrix differently. There are two normalized graph Laplacian matrices in previous literature. [12] defines L_sym as:

L_sym = I - D^{-1/2} S D^{-1/2}    (1)

while [13] defines L_rw as:

L_rw = I - D^{-1} S    (2)

where S = {s_{ij}}, i, j = 1, ..., n, is an n × n weighted pairwise similarity matrix which is non-negative and symmetric, and D is a diagonal matrix containing the row sums of S. Spectral clustering computes the first k eigenvectors u_1, ..., u_k corresponding to the k smallest eigenvalues of L and concatenates them to form a new matrix U ∈ R^{n×k}, where the i-th row of U becomes the new feature of point x_i. Finally, classic algorithms such as k-means are used to cluster the points using these new features.
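To make the pipeline concrete, the following is a minimal illustrative sketch in Python (NumPy and scikit-learn assumed available) of the procedure just described, using the normalization of Eq. (1); the function and parameter names are ours, and the sketch deliberately keeps the O(n^2) similarity matrix and the O(n^3) eigen-decomposition that the rest of the paper works to avoid.

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(S, k):
        # Degree vector and the symmetrically normalized Laplacian of Eq. (1).
        d = S.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L_sym = np.eye(S.shape[0]) - D_inv_sqrt @ S @ D_inv_sqrt
        # Eigenvectors of the k smallest eigenvalues (eigh returns ascending order).
        _, vecs = np.linalg.eigh(L_sym)
        U = vecs[:, :k]
        # Each row of U is the new feature of one data point; cluster with k-means.
        return KMeans(n_clusters=k, n_init=10).fit_predict(U)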

The original spectral clustering suffers from two major drawbacks: O(n^2) time and space are used to construct the pairwise similarity matrix, while O(n^3) time and O(n^2) space are used to compute the eigen-decomposition of the Laplacian matrix. These drawbacks forbid spectral clustering from handling large-scale data sets.

Many approaches have been proposed to reduce the data size before clustering. [1] developed a fast approximate spectral clustering in which a distortion-minimizing local transformation, such as mapping each point to its nearest centroid generated by k-means, is first applied to the data. [3] adopted a strategy that repeatedly generates supernodes and applies Dijkstra's algorithm to partition the nodes into disjoint subsets.

To speed up the eigen-decomposition, sampling techniques such as the Nyström method applied in [4] [5] [6] [7] are still the mainstream. Nyström methods generally decompose a randomly sampled matrix and use the correlation between sampled and unsampled data points to obtain a low-rank approximation of the original matrix. [14] proposed a randomized Singular Value Decomposition (SVD) algorithm that achieves an accurate approximation by compacting the original matrix into a smaller matrix in a subspace and performing the decomposition on it.

Sampling is widely adopted by the methods above. However, since sampling discards the majority of the information provided by the original matrix, it is hard to take care of both complexity and accuracy at the same time.

B. Landmark-Based Representation

To avoid this loss of information, [10] proposed Landmark-based Spectral Clustering (LSC) (Algorithm 1), which is closely related to the recent progress in sparse coding [15]. The goal of the algorithm is to find a flat matrix Z ∈ R^{p×n} such that S = Z^T Z, where p stands for the dimension of the sparse feature.

To achieve an efficient eigen-decomposition of D^{-1}S, they write the SVD of Z as Z = AΣB^T. Thus S = Z^T Z = BΣA^T AΣB^T = BΣΣB^T, and ZZ^T = AΣΣA^T, since A and B contain the left/right singular vectors and are both unitary matrices. By performing SVD on the p × p matrix ZZ^T, A and Σ can be computed in a short time. Notice that the columns of B are the eigenvectors of S and can be computed by B = Z^T AΣ^{-1}.

To get the matrix Z, LSC first selects p landmark points using k-means or random selection; then each entry z_{ji} of Z is given a weight related to the similarity between the j-th landmark point u_j and the i-th point x_i. To make the representation matrix Z sparser, z_{ji} is set to zero if the j-th landmark point is not among the r nearest landmark neighbors of the i-th point. Let U_i be the set of the r nearest landmark points of x_i; then z_{ji} is computed as

z_{ji} = K_σ(x_i, u_j) / Σ_{j'∈U_i} K_σ(x_i, u_{j'})   if j ∈ U_i;   z_{ji} = 0   if j ∉ U_i,    (3)

where K_σ(·) is a kernel similarity function commonly used in spectral clustering. The most well-known kernel is the Gaussian kernel K_σ(x_i, u_j) = exp(-||x_i - u_j||^2 / (2σ^2)).
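As an illustration only, the construction of Z described above could be sketched as follows in Python (NumPy and SciPy assumed), under the assumptions that X holds the n data points row-wise, U holds the p landmark points row-wise, and the Gaussian kernel is used; the function and variable names are ours.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.spatial.distance import cdist

    def build_landmark_matrix(X, U, r=6, sigma=1.0):
        # Squared Euclidean distances between the n points and the p landmarks.
        n = X.shape[0]
        dist2 = cdist(X, U, metric="sqeuclidean")
        # Keep only the r nearest landmarks of each point, as in Eq. (3).
        nearest = np.argsort(dist2, axis=1)[:, :r]
        rows = np.repeat(np.arange(n), r)
        cols = nearest.ravel()
        vals = np.exp(-dist2[rows, cols] / (2.0 * sigma ** 2))   # Gaussian kernel
        # Normalize the kept kernel weights of each point so they sum to one.
        vals = vals / np.repeat(vals.reshape(n, r).sum(axis=1), r)
        Zt = csr_matrix((vals, (rows, cols)), shape=(n, U.shape[0]))  # n x p
        return Zt.T.tocsr()   # Z is p x n, as in the paper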

The time complexity of LSC mainly depends on the building of Z (O(nprd)), the SVD of the p × p matrix ZZ^T (O(p^3)), and the matrix multiplication that produces B (O(np^2)). Since r is always a small constant, the overall time complexity is O(npd + p^3 + np^2), which becomes troublesome when the data are scattered or have varying distribution density. In that case, more landmark points are needed to provide representative features.

Algorithm 1 Landmark-based Spectral Clustering

1: Produce p landmark points using k-means or random selection;
2: Construct a sparse affinity matrix Z ∈ R^{p×n} between data points and landmark points, with the affinity calculated according to Eq. (3);
3: Let Z = D^{-1/2}Z, where D is the diagonal matrix of the row sums of Z;
4: Compute ZZ^T = AΣΣA^T (eigen-decomposition);
5: Compute B = Z^T AΣ^{-1};
6: Treat each row of B as a point in R^k and apply k-means to partition the n points into k clusters.
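The decomposition trick behind steps 4-5 of Algorithm 1 can be sketched as follows (Python with NumPy; Z is assumed to be a SciPy sparse p × n matrix after the D^{-1/2} scaling of step 3, the k largest singular values are assumed positive, and the function name is ours); it shows why only a p × p eigen-problem has to be solved.

    import numpy as np

    def lsc_spectral_embedding(Z, k):
        # Small p x p matrix: Z Z^T = A Sigma^2 A^T.
        ZZt = (Z @ Z.T).toarray()
        sig2, A = np.linalg.eigh(ZZt)                  # eigenvalues in ascending order
        A, sig2 = A[:, ::-1][:, :k], sig2[::-1][:k]    # keep the k largest
        # B = Z^T A Sigma^{-1} holds the top-k eigenvectors of S = Z^T Z.
        B = Z.T @ (A / np.sqrt(sig2))
        return np.asarray(B)                           # rows are the new features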

C. Power Iteration Clustering

It has been validated [1] that for a Laplacian matrix L, the eigenvector with the second-smallest eigenvalue represents a bipartite partition of the graph. More generally, the first k eigenvectors with the k smallest eigenvalues span a subspace in which the graph can be well separated into k components. Notice that the smallest eigenvalue of L is always 0, and the k smallest eigenvectors of L are the k largest eigenvectors of I - L. To obtain the largest eigenvector of a matrix, we can apply the power method, which iteratively updates a random vector v_0 and makes it converge to the largest eigenvector through repeated matrix-vector multiplications:

v_{t+1} = c (I - L) v_t    (4)

where c = 1 / ||(I - L) v_t||_1 is a normalizing constant. [11] observed that although the final result of the iteration is not valuable, the intermediate vectors obtained during the iteration contain well-segmented indicators of the clustering. Power iteration clustering (PIC) (Algorithm 2) thus exploits the intermediate state of v_t as the indicator vector before convergence. However, PIC gives good results only on data sets with few clusters. On data sets with many clusters, the indicator vector quickly becomes an identity-like vector even after the first round of iteration and does not contain enough piecewise-constant segments compared with the real number of clusters. In spite of this limitation, PIC gives us good inspiration for obtaining the largest eigenvectors of a matrix in reduced time using iterative methods.

Algorithm 2 Power Iteration Clustering

1: Form a row-normalized affinity matrix W using Eq. (1) or Eq. (2);
2: Pick an initial vector v_0;
3: repeat
4:   v_{t+1} = cWv_t, where c = 1/||Wv_t||_1;
5:   δ_{t+1} = |v_{t+1} - v_t|;
6:   increment t;
7: until |δ_t - δ_{t-1}| ≈ 0
8: Treat each row of v_t as a point in R^1 and apply k-means to partition the n points into k clusters.
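For comparison with our method later, a minimal sketch of Algorithm 2 in Python (NumPy and scikit-learn assumed; the dense matrix W, the iteration cap and the stopping constant are our own choices) is given below.

    import numpy as np
    from sklearn.cluster import KMeans

    def power_iteration_clustering(W, k, max_iter=1000, eps=1e-5):
        n = W.shape[0]
        v = np.random.rand(n)
        v /= np.abs(v).sum()
        delta_prev = np.full(n, np.inf)
        for _ in range(max_iter):
            v_new = W @ v
            v_new /= np.abs(v_new).sum()           # c = 1 / ||W v_t||_1
            delta = np.abs(v_new - v)              # delta_{t+1}
            if np.abs(delta - delta_prev).max() < eps:
                v = v_new
                break
            v, delta_prev = v_new, delta
        # Each entry of v is a one-dimensional embedding of a point.
        return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))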

III. LANDMARK-BASED SUBSPACE ITERATION SPECTRAL CLUSTERING

In this section we introduce our Landmark-based Subspace Iteration Spectral Clustering (LSISC). As mentioned in Section II, LSC provides a sparse and compressed form of the similarity matrix S, but it still needs to perform an eigen-decomposition on a p × p matrix, which takes O(p^3) time depending on the number of landmark points. PIC gives a very fast iterative way to get the top eigenvector of a matrix, but its clustering performance is hard to guarantee. PIC also needs O(n^2) time and space for the matrix-vector multiplication during the iteration.

LSISC takes advantage of both the sparse representation of LSC and the fast iterative eigen-solver of PIC so that they complement each other. To build a bridge between LSC and PIC, we need to answer the questions below:

Is there a way to exploit the sparse representation generated by LSC to decrease the computational and storage complexity of PIC?

Is there a way to get more top eigenvectors of the Laplacian matrix, so that PIC can handle data sets with many clusters more accurately?

A. Combining Landmark-based Representation and Power Iteration

Recall that in LSC, the pairwise similarity matrix S is represented by

S = Z^T Z    (5)

where Z ∈ R^{p×n} is the landmark-based sparse representation, a flat and sparse matrix. The previous analysis shows that the n × n similarity matrix is the major cause of the high time and space complexity. We therefore prefer computing the product Z^T Z only when necessary, rather than storing and operating on the n × n full matrix each time. The matrix to be decomposed thus has a compressed form:

I - L = D^{-1}S = D^{-1}Z^T Z    (6)

Notice that D can also be computed without S by

D = diag(Z^T Z 1) = diag(Z^T (Z 1))    (7)

where 1 is the all-ones vector, and D^{-1}(i, i) = 1/D(i, i) since D is a diagonal matrix.

In PIC, the main bottleneck is the multiplication of the n × n matrix I - L with the n × 1 vector v_t. With the compressed form of I - L, the iterative step of PIC can be converted to:

v_{t+1} = c (I - L) v_t = cD^{-1}Z^T Z v_t = cD^{-1}(Z^T (Z v_t))    (8)

It is easy to verify that with v_t ∈ R^{n×1}, Z ∈ R^{p×n} and D an n × n diagonal matrix, the computational complexity of one power iteration is greatly reduced from O(n^2) to O(nr), considering that Z is a very sparse matrix that contains only r < 10 non-zero elements per column. The space complexity is also reduced from O(n^2) to O(nr) by getting rid of the n × n similarity matrix S.
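Eqs. (7) and (8) translate directly into two small helpers (a sketch in Python with NumPy; Z is assumed to be a SciPy sparse p × n matrix and the names are ours) that never materialize the n × n matrix S; the second helper acts on a single vector v, and the subspace version in Section III-B applies the same product to an n × k block.

    import numpy as np

    def compressed_degrees(Z):
        # Eq. (7): row sums of S = Z^T Z, computed as Z^T (Z 1).
        ones = np.ones(Z.shape[1])
        return np.asarray(Z.T @ (Z @ ones)).ravel()

    def apply_normalized_affinity(Z, d, v):
        # Eq. (8) without the constant c: (I - L) v = D^{-1} (Z^T (Z v)).
        return (Z.T @ (Z @ v)) / d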

B. Subspace Iteration Clustering

The power method, despite its good speed, cannot help clustering much, since the eigenvector corresponding to the largest eigenvalue usually converges too fast to capture a useful intermediate state for clustering. Instead, we look for the k largest eigenvectors rather than only the first one, in order to make the power method compatible with data sets having more than 2 clusters. The k largest eigenvectors span a k-dimensional subspace, and the method used to obtain these k eigenvectors is called subspace iteration [16].

In subspace iteration, rather than iterating on an n × 1 vector v_0, we iterate on an n × k random subspace V_0. The term "subspace" indicates that the columns of V_0 should be orthonormal, and after each round of iteration the columns of V_t should also stay orthonormal. Thus the normalization of the vector is generalized to the orthogonalization of an n × k matrix, for which QR factorization is used:

Randomly choose an n × k orthogonal matrix V_0 such that V_0^T V_0 = I_k
for t = 1, 2, ... do
    T = W V_{t-1}
    V_t R_t = T (QR factorization)
end for

When k = 1, this degenerates to the power method. In our problem, the iteration steps are as follows:

Randomly choose an n × k orthogonal matrix V_0 such that V_0^T V_0 = I_k
for t = 1, 2, ... do
    T = D^{-1}(Z^T (Z V_{t-1}))
    V_t R_t = T (QR factorization)
end for

The order of the matrix multiplications is crucial in our algorithm. Since k is usually small enough to be regarded as a constant and p ≪ n, the chained matrix-matrix multiplications take near-linear time O(nrk) and space O(nr + nk), with Z a sparse matrix containing r non-zero elements per column. Our remaining concern is the time complexity of the QR factorization of the n × k matrix T in each iteration. Notice that only the first k columns of the orthogonal factor V_t are needed in our algorithm, so we do not perform a full-size QR factorization, which avoids O(n^3) time. [16] shows that the economic (or thin, reduced) QR factorization of a rectangular matrix T ∈ R^{n×k} requires about 2nk^2 floating-point operations using the modified Gram-Schmidt method, or 2nk^2 - 2k^3/3 floating-point operations using the fast Givens or Householder QR method. Hence the total time complexity of the iteration is O(n(rk + k^2)t) and the space complexity is O(nr + nk). In practice, k and t are usually very small (t < 100), as verified by the experiments in Section IV. Thus the space and time cost of the subspace iteration scales linearly with the number of data points. We summarize our LSISC method in Algorithm 3.

Algorithm 3 Landmark-based Subspace Iteration Spectral Clustering

1: Produce p landmark points using k-means or random selection;
2: Construct a sparse affinity matrix Z ∈ R^{p×n} between data points and landmark points, with the affinity calculated according to Eq. (3);
3: Randomly choose an n × k orthogonal matrix V_0 such that V_0^T V_0 = I_k;
4: Compute D and D^{-1} using Eq. (7);
5: while |δ_t - δ_{t-1}| > threshold do
6:   increase t;
7:   T = D^{-1}(Z^T (Z V_{t-1}));
8:   V_t R_t = T (economy QR factorization);
9:   δ_t = |V_t - V_{t-1}|;
10: end while
11: Treat each row of V_t as a point in R^k and apply k-means to partition the n points into k clusters.
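A compact sketch of Algorithm 3 in Python (NumPy and scikit-learn assumed; Z is the sparse p × n matrix of Eq. (3), and the function name, the iteration cap and the use of numpy's reduced QR are our own choices) is given below.

    import numpy as np
    from sklearn.cluster import KMeans

    def lsisc(Z, k, max_iter=100, threshold=1e-5):
        n = Z.shape[1]
        d = np.asarray(Z.T @ (Z @ np.ones(n))).ravel()   # Eq. (7): diagonal of D
        V, _ = np.linalg.qr(np.random.randn(n, k))       # random orthonormal V_0
        delta_prev = np.inf
        for _ in range(max_iter):
            T = (Z.T @ (Z @ V)) / d[:, None]             # D^{-1} Z^T (Z V_{t-1})
            V_new, _ = np.linalg.qr(T)                   # economy QR: T = V_t R_t
            delta = np.abs(V_new - V).sum()
            if abs(delta - delta_prev) < threshold:      # |delta_t - delta_{t-1}| small
                V = V_new
                break
            V, delta_prev = V_new, delta
        # Each row of V_t is the new k-dimensional feature of one data point.
        return KMeans(n_clusters=k, n_init=10).fit_predict(V)

A fixed cap on the number of iterations (t < 100, as reported in Section IV) is kept as a safeguard in case the convergence test fluctuates.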

C. Complexity Analysis

We now have n data points with dimensionality d, p landmark points, and a k-dimensional subspace for iteration and clustering. We pick the r nearest landmark points per data point to form the sparse representation matrix Z. In terms of time, LSISC takes O(npdr) to build the sparse representation matrix Z and O(n(rk + k^2)t) to get the top k eigenvectors of Z^T Z. LSC also takes O(npdr) to build Z, while in contrast O(p^3 + np^2) is needed to compute the eigenvectors. In terms of space, both LSC and LSISC take O(nr) to store the sparse matrix Z, but LSC needs to store an additional p × p matrix to perform the eigen-decomposition. Thus the space complexity of LSC is O(nr + nk + p^2), while LSISC takes only O(nr + nk). Notice that k and t do not necessarily increase with p and n, since the features embedded in a low-dimensional space have little relation to the data size. We thus achieve better scalability by eliminating the cubic and quadratic factors that are unavoidable during the decomposition in all the methods listed in [10]. Our method supports a much larger p to better represent the features of massive and scattered data points.

IV. EXPERIMENTS

In this section, we present several experiments to show the performance of LSISC compared with other algorithms.

A. Data Sets

To illustrate the performance of LSISC, we test several large data sets that have challenged spectral clustering algorithms before. All data sets are downloaded from the LibSVM data sets for multi-class classification.

MNIST is a widely tested handwritten digits database. The 70000 digits have been size-normalized and centered in a fixed-size image. We treat the grayscale values of all 784 pixels as a single column vector.

PenDigits is a database of 10992 pen-based digit recognition samples with 16 features, namely eight successive pen points in a two-dimensional coordinate system.

News20 is a collection of 20000 messages collected from 20 different netnews newsgroups. One thousand messages from each of the twenty newsgroups were chosen at random and partitioned by newsgroup name. News20 contains 62061 features denoting the occurrence of words in a message.

Covtype is a collection of 581012 forest cover type records.

Seismic is a data set of seismic sensor recordings used to classify vehicles.

Poker is a large data set with 1025010 records, each being an example of a hand consisting of five playing cards drawn from a standard deck of 52.

We use the pre-scaled data of MNIST, Covtype, Seismic and News20. Among the data sets tested, News20 is in sparse form due to its high-dimensional features. We merge the training set and the test set together. All data sets used in our experiments are described in detail in Table I.
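For reproducibility, the LibSVM files can be read and merged, for example, with scikit-learn's loader; the file names below are placeholders for the downloaded training and test parts of one data set.

    import numpy as np
    from scipy.sparse import vstack
    from sklearn.datasets import load_svmlight_files

    # Placeholder file names for one of the pre-scaled LibSVM data sets.
    X_tr, y_tr, X_te, y_te = load_svmlight_files(["mnist.scale", "mnist.scale.t"])
    X = vstack([X_tr, X_te])              # keep the sparse format (needed for News20)
    y = np.concatenate([y_tr, y_te])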

B. Algorithms

We compare LSISC with other spectral clustering methods that use sampling to avoid obtaining and storing the full similarity matrix. We also conduct experiments on methods that rely on the full similarity matrix, such as the original spectral clustering method and one-dimensional power iteration clustering. We use k-means as our baseline.


TABLE I: Data sets used in the experiments

Data Set    #Instances  #Features  #Classes
MNIST       70000       784        10
PenDigits   10992       16         10
News20      15935       62061      20
Covtype     581012      54         7
Seismic     98528       50         3
Poker       1025010     10         10


Nyström: we use the original version implemented by [17].

KASP: we implemented a MATLAB version of k-means-based approximate spectral clustering [1]. KASP also selects p representative points by k-means. Different from our method, KASP builds a many-to-one correspondence table to associate non-representative points with their nearest representative points. Spectral clustering is then run on the representative points, and the membership of the remaining points is determined by their corresponding representative points.

LSC-R and LSC-K: two variants of Landmark-based Spectral Clustering [10] using different pre-processing steps. LSC-R picks landmark points randomly, and LSC-K adopts the centroids provided by an iteration-limited k-means.

LSISC-R and LSISC-K: following the pre-processing steps introduced in [10], we also implemented two ways to pick the landmark points, by random selection (LSISC-R) or by using the centroids given by an iteration-limited k-means (LSISC-K).

We unify the code for all common steps during clustering, such as calculating the similarity matrices, building the adjacency matrices between landmark points and other points, and performing k-means clustering, by providing as-fast-as-possible MATLAB implementations to keep the comparison fair. We adopt a divide-and-conquer technique to calculate the Euclidean distance matrix between two matrices when needed, to prevent unfairness due to memory limitations and page swapping. We use the Gaussian kernel to compute the distance between points. All parameters are the same among the compared algorithms. Empirically, we select p = 1200 and r = 6 during the tests. To eliminate randomness, we run each test 20 times and report the average performance.

C. Evaluation Metrics

We measure both the clustering accuracy and the running time of the different algorithms. For time evaluation, since different methods require different pre-processing steps, we simply count all the time elapsed by each algorithm from receiving the data set to giving the clustering result. For accuracy evaluation, we calculate two indices, accuracy and NMI. The accuracy index is the ratio of the number of correctly labeled samples to the total number of samples; a permutation of the result labels that best matches the ground truth is needed and can be obtained by the Hungarian algorithm [18]. NMI is short for Normalized Mutual Information [19]. A larger value of NMI ∈ [0, 1] means a better clustering result. NMI also takes the cluster sizes into consideration to avoid too scattered clusters.
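The two indices can be computed, for example, as follows (a sketch in Python with SciPy and scikit-learn assumed; the helper name is ours, and linear_sum_assignment plays the role of the Hungarian algorithm of [18]).

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import normalized_mutual_info_score

    def clustering_accuracy(y_true, y_pred):
        # Cost matrix: negated overlap between each cluster and each true class.
        classes, clusters = np.unique(y_true), np.unique(y_pred)
        cost = np.zeros((clusters.size, classes.size))
        for i, c in enumerate(clusters):
            for j, t in enumerate(classes):
                cost[i, j] = -np.sum((y_pred == c) & (y_true == t))
        # Best one-to-one relabeling of clusters, then the fraction matched.
        rows, cols = linear_sum_assignment(cost)
        return -cost[rows, cols].sum() / y_true.size

    # NMI, as reported in the tables below:
    # nmi = normalized_mutual_info_score(y_true, y_pred)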

Our experiments are all implemented in MATLAB and run on a machine with a 3.1 GHz CPU and 4 GB of main memory.

D. Experimental Results

We report the experimental results on the six data sets in Tables II, III and IV.

Speed: We see that LSISC-R outperforms all its competitors on almost all the data sets, owing to the much smaller amount of time spent on computing the eigenvectors through subspace iteration. KASP encounters efficiency problems when performing spectral clustering on the sub-data set containing p samples. Nyström is also encumbered by a full eigen-decomposition of the top-left sampled matrix. LSC-R is a little slower than LSISC-R because of the partial SVD of the similarity matrix Z, which LSISC-R replaces with a small number of iterations.

Accuracy: The landmark-based methods greatly outperform the other sampling methods on all of the data sets. On News20 and Covtype, LSC-R and LSISC-R achieve the best accuracy and NMI within the least time. On PenDigits, the advantage is even more evident.

We can see that in most cases LSISC gives competitive or even more accurate results than LSC while spending the least time. Besides, since the convergence threshold on the acceleration factor |δ_t - δ_{t-1}| can be adjusted, we can balance running time and accuracy more flexibly than LSC under different application scenarios. Furthermore, there is a cubic factor in the landmark size p in the time complexity of LSC, which is a hidden trouble when more landmark points are needed to represent the features of sparser and more scattered high-dimensional data.

However, we find, surprisingly, that the k-means pre-processing step, the most time-expensive step, does not help accuracy much on most of the data sets. Sometimes the k-means pre-processing step may even introduce accuracy degradation, for example on News20. Randomly choosing landmarks leads to equal or better clustering results than choosing them by k-means on most of the data sets except MNIST. We therefore consider the random method the best choice in practice under most scenarios.

Methods such as PIC or the original spectral clustering, which require a full Laplacian matrix to be stored in memory, do not finish most of the experiments. Since the majority of our data sets contain more than 50k instances, these algorithms incur heavy page swapping, which leads to intolerably long waits.

The memory issue is also important when constructing the Laplacian matrix. Naive methods such as PIC and the original spectral clustering fail to give results on most of the data sets due to the rigid requirement of storing a full similarity matrix and a full Laplacian matrix. According to our observation, the most resource-consuming step in LSC and LSISC is finding the distances between normal data points and landmarks, which requires O(np) time and space. For a data set larger than 500k points with more than 1.2k landmarks, we need 4.47 GB of memory to store a dense distance matrix and up to 20 GB of memory (observed from the commit size reported by the task manager) to absorb the memory spikes while computing this distance matrix. Page swapping occurs intensively when the main memory is not sufficient. Nevertheless, since we only need the sparse matrix Z containing the adjacency between each point and its r nearest landmarks, we can easily develop a distributed version of the k-nearest-neighbour search to greatly reduce the pre-processing time. It is also quite straightforward to run the iteration steps of LSISC in a distributed environment, whereas distributed SVD algorithms exist but are not easy to implement.


TABLE II: Time usage of the algorithms (s)

Data set   KASP       Nyström    LSC-R      LSC-K      LSISC-R    LSISC-K
MNIST      30.7491    19.2134    11.2248    41.9796    10.8725    44.1537
PenDigits  1.4992     3.9536     1.7989     3.1539     1.3255     2.5912
News20     23.0182    7.6861     4.5851     23.0316    3.6873     23.4915
Covtype    322.1038   587.9697   83.8835    390.3427   76.7334    385.7878
Seismic    16.6623    12.9805    6.5185     23.3764    6.4987     24.2735
Poker      1824.029   5590.8217  1282.7319  3158.153   1306.8474  3165.6241

TABLE III: Accuracy of the clustering results (%)

Data set   KASP     Nyström  LSC-R    LSC-K    LSISC-R  LSISC-K
MNIST      56.5549  55.5526  63.8166  74.412   66.0404  74.4583
PenDigits  72.4836  72.8134  78.3333  80.0541  78.7232  76.0453
News20     38.0653  23.5061  39.8902  35.7672  40.9738  34.7305
Covtype    26.0726  25.8778  30.7835  35.4955  21.054   21.2283
Seismic    67.5026  66.3131  67.1636  67.3484  66.8348  68.7967
Poker      11.9585  11.8224  11.4027  13.0643  12.3768  12.2764

TABLE IV: NMI of the clustering results

Data set   KASP    Nyström  LSC-R   LSC-K   LSISC-R  LSISC-K
MNIST      0.5317  0.4785   0.6316  0.7376  0.6358   0.7401
PenDigits  0.6748  0.6614   0.7791  0.7919  0.7756   0.7755
News20     0.353   0.2171   0.37    0.3585  0.3794   0.3527
Covtype    0.1187  0.1397   0.1804  0.1751  0.0597   0.0586
Seismic    0.2818  0.2683   0.2954  0.2995  0.289    0.2918
Poker      0.0053  0.0061   0.0015  0.0075  0.0023   0.0076


E. Parameter Selection

Besides the number of landmarks p and the number of nearest landmarks r (which have been well analysed in [10]), we focus on the selection of the convergence threshold of the subspace iteration. A small threshold means that the variation of the largest k eigenvectors of the Laplacian matrix must be small before the iteration is considered to have reached a stable state. When the acceleration factor comes close to zero, the eigenvectors generated by the iteration are close to the exact eigenvectors. Fig. 1a and Fig. 1b show that LSISC converges very quickly on the PenDigits data set, reaching its accuracy upper limit after only 20 iterations, where the norm of the difference between the iterated eigenvectors and the exact eigenvectors is 0.000003 and the corresponding acceleration factor is 0.000029. We prudently use the threshold 0.00001 in all our experiments.

V. CONCLUSION

In this paper, we proposed a fast iterative method called Landmark-based Subspace Iteration Spectral Clustering (LSISC) aimed at large-scale spectral clustering. The advantages of LSC and PIC are combined so that they compensate each other. We pick p ≪ n landmark points and use the r nearest landmarks around each point as its new encoded feature. By using the flat adjacency matrix between landmark and normal points to represent the full similarity matrix, the time and space complexity is greatly reduced. Compared with traditional eigen-decomposition-dependent methods, we reduce the time complexity from O(n^3) to O(n) and the space complexity from O(n^2) to O(n). Experiments on various data sets illustrate that LSISC achieves better clustering results with linear time and space cost.

REFERENCES

[1] D. Yan, L. Huang, and M. I. Jordan, "Fast approximate spectral clustering," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 907–916.

[2] T. Sakai and A. Imiya, "Fast spectral clustering with random projection and sampling," in Machine Learning and Data Mining in Pattern Recognition. Springer, 2009, pp. 372–384.

[3] J. Liu, C. Wang, M. Danilevsky, and J. Han, "Large-scale spectral clustering on graphs."

[4] P. Drineas and M. W. Mahoney, "On the Nyström method for approximating a Gram matrix for improved kernel-based learning," The Journal of Machine Learning Research, vol. 6, pp. 2153–2175, 2005.

[Fig. 1: (a) the decrease of the acceleration factor; (b) the NMI of the clustering result, both plotted against the number of iterations on the PenDigits data set. The acceleration factor quickly converges to near zero, and the NMI reaches its upper limit after only 20 iterations, showing that the number of iterations is not a dominant factor in the time and space complexity.]

[5] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, "Spectral grouping using the Nyström method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 214–225, 2004.

[6] C. Williams and M. Seeger, "Using the Nyström method to speed up kernel machines," in Advances in Neural Information Processing Systems 13. Citeseer, 2001.

[7] M. Li and J. T.-Y. Kwok, "Making large-scale Nyström approximation possible," 2010.

[8] U. Kang and C. Faloutsos, "Beyond 'caveman communities': Hubs and spokes for graph compression and mining," in Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 2011, pp. 300–309.

[9] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, and E. Y. Chang, "Parallel spectral clustering in distributed systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 568–586, 2011.

[10] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.

[11] F. Lin and W. W. Cohen, "Power iteration clustering," in Proceedings of the International Conference on Machine Learning (ICML), vol. 10. Citeseer, 2010.

[12] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.

[13] M. Meilă and J. Shi, "Learning segmentation by random walks," 2001.

[14] N. Halko, P. Martinsson, and J. Tropp, "Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions," arXiv preprint arXiv:0909.4061, 2009.

[15] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," Advances in Neural Information Processing Systems, vol. 19, p. 801, 2007.

[16] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, 2012, vol. 3.

[17] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, and E. Y. Chang, "Parallel spectral clustering in distributed systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 568–586, 2011.

[18] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.

[19] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, Cambridge, 2008, vol. 1.
