

0162-8828 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2018.2889949, IEEE Transactions on Pattern Analysis and Machine Intelligence


Deep Self-Evolution Clustering

Jianlong Chang, Gaofeng Meng, Senior Member, IEEE, Lingfeng Wang∗, Member, IEEE, Shiming Xiang, Member, IEEE, and Chunhong Pan, Member, IEEE

Abstract—Clustering is a crucial but challenging task in pattern analysis and machine learning. Existing methods often ignore the combination between representation learning and clustering. To tackle this problem, we reconsider the clustering task from its definition to develop Deep Self-Evolution Clustering (DSEC) to jointly learn representations and cluster data. For this purpose, the clustering task is recast as a binary pairwise-classification problem to estimate whether pairwise patterns are similar. Specifically, similarities between pairwise patterns are defined by the dot product between indicator features which are generated by a deep neural network (DNN). To learn informative representations for clustering, clustering constraints are imposed on the indicator features to represent specific concepts with specific representations. Since the ground-truth similarities are unavailable in clustering, an alternating iterative algorithm called Self-Evolution Clustering Training (SECT) is presented to select similar and dissimilar pairwise patterns and to train the DNN alternately. Consequently, the indicator features tend to be one-hot vectors and the patterns can be clustered by locating the largest response of the learned indicator features. Extensive experiments strongly evidence that DSEC outperforms current models on twelve popular image, text and audio datasets consistently.

Index Terms—Clustering, deep self-evolution clustering, self-evolution clustering training, deep unsupervised learning.


1 INTRODUCTION

Clustering, an essential data analysis tool in pattern analysis and machine learning, is a data-driven task that attempts to explore knowledge purely from unlabeled data. In a multitude of science and engineering fields where the label information is unobservable or costly to obtain, many applications can be considered as typical instances of clustering, such as astronomical data analysis [1], medical analysis [2], gene sequencing [3], and information retrieval [4], [5], [6], [7].

In the literature, much research has been devoted to clustering data [8], [9], [10], [11], [12], [13], [14], [15]. By definition, clustering produces an effective organization of data based on similarities, namely patterns within the same cluster are similar to each other, while those in different clusters are dissimilar. Technically, two essential issues need to be handled in clustering, i.e., how to choose a proper distance metric to measure the similarities, and how to define a justified clustering pattern to describe the data. Such issues originate from the fact that the objective of clustering is not sufficiently well defined. Generally, traditional clustering methods handle these issues by introducing additional assumptions, i.e., measuring the similarities via predetermined distance metrics and clustering data based on predefined clustering patterns.

To estimate similarities, plentiful attention has been paid to predetermining proper distance metrics. Distance, generally, depends on the feature space of the data. That is, a crucial problem of estimating similarities is to acquire informative representations of data in an unsupervised way.

• Jianlong Chang, Gaofeng Meng, Shiming Xiang, Chunhong Pan and Lingfeng Wang are with the Department of National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. E-mail: {jianlong.chang, gfmeng, smxiang, chpan}@nlpr.ia.ac.cn, [email protected].

• Jianlong Chang and Shiming Xiang are also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China.

• ∗ Corresponding author.

Fig. 1. The motivation of DSEC. In the figure, the green and blue lines signify that pairwise patterns are similar or dissimilar, respectively. Intuitively, DSEC handles the clustering task by investigating similarities between pairwise patterns, i.e., grouping similar patterns into the same clusters and dissimilar patterns into different clusters. For this purpose, a group of neurons that represent a specific concept with a constant neuron are introduced in the DSEC model to indicate clustering labels of patterns directly.

Based on this insight, many attempts have been dedicated to developing unsupervised representation extraction techniques to measure similarities in latent spaces. Early methods usually adopt manually designed feature descriptors, including HOG [16] and SIFT [17]. While these feature descriptors may remove the impact of some nuisance variables (e.g., rotation, translation, etc.), they often suffer from appearance variations of scenes and objects. To overcome this limitation, many deep unsupervised learning methods have been proposed recently, such as the autoencoder [18], the denoising autoencoder [19], and auto-encoding variational Bayes [20]. Technically, these methods aim to recode patterns by using representations that are capable of reconstructing the inputs. Despite the observable advances, there is still much room for improvement, e.g., learning high-level representations with high interpretability.

To predefine clustering patterns, many efforts have been made. Traditional methods, in principle, are devoted to taking a comprehensive view of data from diverse aspects [21].


Among these methods, the Gaussian mixture distribution is a popular descriptor for describing clustering patterns. Based on this perspective, a series of clustering methods have been proposed, including K-means [11] and K-means++ [22]. Contrary to predefining a distribution, considerable methods prefer to estimate proper density functions to predefine clustering patterns, such as density-based spatial clustering [23] and the density-peaks-based model [24]. In spite of their impressive early successes, a serious issue related to these methods is that the performance severely relies on the predefined clustering patterns, which are difficult to define before clustering.

Although the achievements in the literature are brilliant, traditional methods still suffer from a multi-stage clustering paradigm, i.e., extracting representations of data with unsupervised methods and then clustering data based on these representations. From the viewpoint of applications, the multi-stage clustering paradigm is cumbersome in practice. From the viewpoint of performance, the generated representations are fixed after the unsupervised extraction. As a result, the representations cannot be further improved to achieve better performance during the clustering process. To eliminate this problem, early discriminative methods [25], [26] that alternate between clustering and feature learning have been proposed. However, these models [25], [26] are only able to learn low-level representations with shallow methods, such as linear functions [25] and kernel functions [26].

To learn high-level representations and cluster data simultaneously, in this paper, we develop Deep Self-Evolution Clustering (DSEC), which recasts the clustering task into a binary pairwise-classification problem to judge whether pairwise patterns belong to the same clusters. Specifically, similarities between pairwise patterns are calculated as the dot product between indicator features generated by a deep neural network (DNN). For clustering, clustering constraints are introduced into the DSEC model to learn specific representations for specific concepts. Since the ground-truth similarities are unobservable in clustering, Self-Evolution Clustering Training (SECT) is presented to select similar and dissimilar pairwise patterns and to train the DNN alternately. As a result, the indicator features tend to be one-hot vectors and the patterns can be clustered by locating the largest response of the generated indicator features. Visually, an intuitive description of the motivation of DSEC is illustrated in Fig. 1.

To sum up, the main contributions of this work are:

• Starting from the definition, a binary pairwise-classification framework is developed for clustering, which explores an effective scheme from a distinctive viewpoint. This successful attempt signifies that considering relations between patterns is important for clustering, which offers an alternative orientation for unsupervised learning.

• By introducing a constraint into DSEC, the learned indicator features theoretically tend to be one-hot vectors, where a specific one-hot vector corresponds to a specific cluster. Thus, DSEC can cluster data by locating the largest response of the learned indicator features purely, which dramatically simplifies the labeling process in clustering.

• A self-evolution training algorithm called SECT is proposed to optimize the DSEC model, which can gradually elevate the clustering performance by itself. Furthermore, the algorithm can effectively explore the supervised information to deal with clustering, which builds a bridge between supervised and unsupervised learning seamlessly.

It should be noted that a preliminary version of this work was published in ICCV 2017 as an oral presentation [27]. In this paper, we further extend our previous work from both theoretical and empirical aspects. Theoretically, a more general clustering constraint is developed. Under this general constraint, the learned indicator features are ideally one-hot vectors, which is theoretically proved in this paper. Note that the previous work can be regarded as a special case obtained when an additional constraint is added to this general constraint. Furthermore, clustering prior knowledge is introduced to guide DSEC in selecting similar and dissimilar pairwise patterns at the initial stage. Empirically, extensive experiments are conducted to demonstrate the effectiveness of DSEC as follows. First, we extend the preliminary version from the image clustering task to the text and audio clustering tasks, which indicates that our approach is effective for managing general clustering tasks in practice. Second, we comprehensively evaluate the performance of DSEC by making comparisons with more recent methods, presenting more visualization results, analysing results in various clustering scenarios, combining it with semi-supervised tasks, and so on.

The remainder of this paper is organized as follows: in Section 2, we make a brief review of the related work on clustering, deep learning and deep unsupervised learning. Sections 3 and 4 describe the details of the developed DSEC model and the DSEC algorithm, respectively. Comprehensive experimental results are reported and analyzed in Section 5. Finally, Section 6 concludes this paper.

2 RELATED WORK

In this section, we will make a brief review of the related work on the clustering, deep learning and deep unsupervised learning methods.

2.1 Clustering

From the aspect related to algorithmic structure and operation, clustering methods can be roughly divided into two categories: agglomerative and divisive methods [21].

The agglomerative approaches start with each pattern regarded as a separate cluster and merge clusters together gradually until stopping criteria are met. Proceeding from this definition, some essential approaches have been presented, e.g., the agglomerative methods [8], [28] and the hierarchical clustering method [29]. By combining the agglomerative approach [8] with deep learning, joint unsupervised learning has recently been presented for representation learning and clustering simultaneously [30].


Contrary to agglomerative approaches, divisive methods attempt to split all patterns concurrently, until stopping criteria are satisfied. The frequently used methods are K-means [8] and its variants, such as the constrained K-means [11] and K-means++ [22]. From the perspective of graph theory, spectral clustering models divide patterns into diverse clusters if they are poorly connected [31]. Furthermore, these insights also form the basis of a number of methods, e.g., the density-based clustering method [24].

Despite their competitive successes in clustering, these traditional methods are typically developed for simple and monotonic application scenarios, and may fail in more complex environments, such as clustering large-scale images, texts, and speech. This limitation stems from the fact that these methods perform clustering based on predefined representations, which implies that the representations cannot be further improved to achieve better performance dynamically.

2.2 Deep Learning

Deep learning, also termed deep neural networks (DNNs), is a significant technique in pattern analysis and machine learning, which has been developed to manage tasks as humans do [32]. Specifically, the pivotal advance of DNNs is that they allow machines to adaptively learn informative representations of data to tackle specific tasks via a unified training procedure, without any prior knowledge or intervention from humans.

Although the birth of deep learning can be traced back to the 1960s, much interesting research has been proposed in the past decade. This is because there were limitations in the past, including computing resources and data size. With the development of technology and the accumulation of knowledge, Krizhevsky et al., using two efficient GPUs, won the championship of the ImageNet Large Scale Visual Recognition Competition in 2012 (ILSVRC-2012) with a DNN named AlexNet [33], which triggered a wave of deep learning. Following that, various DNN architectures have been exploited, e.g., VGG [34], GoogLeNet [35] and ResNet [36]. Thanks to the extraordinary capability of DNNs in representation learning, significant successes have been achieved in various pattern analysis and machine learning problems, especially in computer vision, such as image classification [34], [35], [36], image segmentation [37], [38], [39], and object detection [40], [41].

Although these achievements are remarkable, deep learning methods suffer from a critical limitation in that a large amount of labeled data is required to learn the parameters. This limitation arises because the numerous parameters in deep models must be estimated from a large amount of labeled data. However, it is laborious and expensive to collect sufficient labeled data. In particular, labeled data may not be available in advance in many practical tasks, e.g., information retrieval.

2.3 Deep Unsupervised Learning

To overcome the limitations of supervised learning, deep unsupervised learning has drawn much attention for training DNNs in an unsupervised manner. Based purely on unlabeled data, DNNs can thus be trained to some justified status, and then be used to extract convincing representations for various unsupervised tasks, including clustering [42] and retrieval [43], [44].

Over the last decade, many deep unsupervised learning methods have been explored for learning informative representations or clustering. Primitively, most unsupervised methods aimed to pre-train DNNs by reconstructing inputs, such as the restricted Boltzmann machine [45], the autoencoder [18] and the denoising autoencoder [19]. These methods attempt to train DNNs to learn non-linear functions that are capable of reconstructing the inputs. In order to learn more robust representations, several popular regularization terms have been introduced, e.g., the sparsity constraint [46]. To simplify the learning process, the consistent-inference-of-latent-representations based learning method omits the decoding part to learn more interpretable representations directly [47]. Recently, deep generative models, including auto-encoding variational Bayes [20] and the generative adversarial network [48], have been proposed to encode data. Technically, these generative models are devoted to encoding data distributions with DNNs. By introducing prior distributions to learn hidden representations, several deep unsupervised learning methods have been developed, such as the adversarial autoencoder [49], categorical generative adversarial networks [50] and the Gaussian mixture variational autoencoder [51].

While much endeavor has been devoted to deep unsupervised learning, several intrinsic challenges still remain. First, how can a proper driving force be provided to train DNNs in an unsupervised manner? Second, clustering results cannot be obtained directly from the feature representations learned by these methods, which implies that the generated representations are unsuitable for clustering.

3 DSEC MODEL

Clustering, technically, is a process of grouping data into clusters such that patterns within the same clusters are similar to each other, while those in different clusters are dissimilar. Therefore, pairwise patterns only belong to either the same clusters or different clusters. Based on this perspective, DSEC recasts the clustering task into a binary pairwise-classification framework. Specifically, the flowchart of DSEC is illustrated in Fig. 2 and more details are given in the following subsections.

3.1 Binary Pairwise-Classification for Clustering

Given an unlabeled dataset X = {x_i}_{i=1}^n and the predefined number of clusters k, where x_i indicates the i-th pattern, DSEC manages the clustering task by investigating similarities between pairwise patterns. To consider the similarities, we redefine the dataset as D = {(x_i, x_j, r_{ij})}_{i,j=1}^n, where x_i, x_j ∈ X are the unlabeled patterns (which refer to the inputs) and r_{ij} ∈ Y is an unobservable binary variable (which refers to the output). Specifically, r_{ij} = 1 implies that x_i and x_j belong to the same cluster, and r_{ij} = 0 otherwise.


Fig. 2. The flowchart of DSEC. Specifically, DSEC consists of two essential parts, i.e., learning and clustering. In the learning part, DSEC measures similarities between pairwise patterns via the dot product between indicator features generated by a DNN. Since the ground-truth similarities are unknown in clustering, similar and dissimilar pairwise patterns with high likelihood are gradually selected to train the DNN. That is, as the learning procedure progresses, more pairwise patterns are appended for training until all pairwise patterns are included. In the clustering part, conclusively, DSEC clusters patterns by locating the largest activation of the indicator features generated by the trained DNN.

Formally, the objective function of DSEC is formulated as follows:

\min_{w} E(w) = \sum_{i=1}^{n} \sum_{j=1}^{n} L(r_{ij}, g(x_i, x_j; w)),    (1)

where L(r_{ij}, g(x_i, x_j; w)) measures the loss between the binary variable r_{ij} and the estimated similarity g(x_i, x_j; w), and w signifies the model parameters in the function g. Specifically, the squared error loss is employed as L(r_{ij}, g(x_i, x_j; w)), i.e.,

L(r_{ij}, g(x_i, x_j; w)) = \| r_{ij} - g(x_i, x_j; w) \|_2^2.    (2)
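For concreteness, the following minimal NumPy sketch evaluates the objective of Eqs. (1)-(2) for a small set of patterns, assuming the binary labels r_ij and the estimated similarities g(x_i, x_j; w) are already available as arrays; the random values only stand in for a real DNN's outputs and are purely illustrative.

```python
import numpy as np

def pairwise_squared_loss(r, g):
    """Squared-error loss of Eq. (2) summed over all pairs, as in Eq. (1).

    r : (n, n) array of binary similarity labels r_ij in {0, 1}
    g : (n, n) array of estimated similarities g(x_i, x_j; w)
    """
    return np.sum((r - g) ** 2)

# Toy usage with random labels and similarities (illustration only).
rng = np.random.default_rng(0)
r = rng.integers(0, 2, size=(4, 4)).astype(float)
g = rng.random((4, 4))
print(pairwise_squared_loss(r, g))
```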

Generally, two essential issues need to be tackled in DSEC. First, even assuming that Y is observable, the clustering labels of x_i and x_j are still unobtainable by only accessing the estimated similarity g(x_i, x_j; w). Second, the ground-truth values in Y cannot be observed in clustering. In the following, Sections 3.2 and 3.3 focus on tackling these two issues, respectively.

3.2 Indicator Features under Clustering Constraints

To learn more informative representations for clustering, indicator features are introduced into the DSEC model to indicate clustering labels, namely representing a concept with a specific neuron. Denoting the indicator features as I = {I_i}_{i=1}^n, where I_i is a k-dimensional representation of the pattern x_i, a clustering constraint C is imposed on the indicator features as follows:

C: \forall i,\ I_{ih} \ge 0, \quad h = 1, \cdots, k,    (3)

where I_{ih} represents the h-th element of the indicator feature I_i. For clarity, “C(I_i) = True” is utilized to describe the case that I_i satisfies the clustering constraint C. Because of their brilliant capability of function fitting, DNNs are employed to learn the indicator features, i.e.,

I_i = f(x_i; w),    (4)

where f( · ; w) indicates a DNN model that is uniquely determined by the parameters w.

DSEC measures the similarities via the dot product between the indicator features. Formally, the similarity between pairwise patterns is written as follows:

g(x_i, x_j; w) = f(x_i; w) \cdot f(x_j; w) = I_i \cdot I_j,    (5)

where the operator “·” represents the dot product between two vectors. By introducing the indicator features, the DSEC model is reformulated as:

\min_{w} E(w) = \sum_{i=1}^{n} \sum_{j=1}^{n} L(r_{ij}, I_i \cdot I_j),
\quad \text{s.t. } \forall i,\ I_i = f(x_i; w),\ C(I_i) = \text{True}.    (6)

The clustering constraint in Eq. (3) can bring a dramatically attractive property for clustering. Let E_k denote the standard basis of the k-dimensional Euclidean space; we have the following theorem (the proof of this theorem is reported in the supplementary material):

Theorem 1. If the optimal value of Eq. (6) is attained, then for \forall i,

    I_i \in E_k,

and for \forall i and \forall j,

    I_i \neq I_j \Leftrightarrow r_{ij} = 0 \quad \text{and} \quad I_i = I_j \Leftrightarrow r_{ij} = 1.

Theorem 1 signifies that the learned indicator features are ideally k-dimensional one-hot vectors. Specifically, the activated neurons are identical if pairwise patterns belong to the same cluster and are different otherwise, which is similar to the behaviour of grandmother cells¹ in the brain [52]. Benefiting from this property, DSEC is capable of clustering based on the learned indicator features only.

By introducing several additional constraints into the clustering constraint C, a series of clustering constraints can be derived.

1. The grandmother cell is a hypothetical neuron that represents a complex but specific concept or object.


In this paper, a norm constraint ‖I_i‖_p = 1 is appended to the constraint C to reduce the search space of the indicator features, i.e.,

C_p: \forall i,\ \|I_i\|_p = 1,\ I_{ih} \ge 0, \quad h = 1, \cdots, k,    (7)

where p ∈ (0, +∞) is satisfied. For clarity, we refer to this clustering constraint as C_p. Specifically, the search spaces of the 2-dimensional indicator features under diverse clustering constraints are depicted in Fig. 3. From C to C_p, visually, the search space of the indicator features is reduced from a plane (i.e., the first quadrant) to curves that still include the optimal indicator features (i.e., [1, 0] and [0, 1]). That is, for arbitrary p ∈ (0, +∞), Theorem 1 holds under the constraint C_p. Further, since the search space is significantly reduced, higher efficiency is acquired under C_p, which will be empirically demonstrated in the experimental section.
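To make the effect of the constraint concrete, the sketch below (a toy illustration, not the paper's implementation) projects arbitrary non-negative activations onto the set C_p of Eq. (7) and checks that, for ideal one-hot indicator features, the dot-product similarity of Eq. (5) is 1 for patterns of the same cluster and 0 otherwise, exactly the behaviour described by Theorem 1.

```python
import numpy as np

def project_to_Cp(raw, p=2.0):
    """Map non-negative activations onto the constraint set C_p of Eq. (7):
    element-wise non-negative entries with unit l_p norm."""
    raw = np.maximum(raw, 0.0)                       # I_ih >= 0
    norm = np.sum(raw ** p, axis=-1, keepdims=True) ** (1.0 / p)
    return raw / np.maximum(norm, 1e-12)             # ||I_i||_p = 1

# Ideal (one-hot) indicator features for k = 3 clusters.
I_a = project_to_Cp(np.array([5.0, 0.0, 0.0]))       # cluster 1
I_b = project_to_Cp(np.array([0.0, 3.0, 0.0]))       # cluster 2
I_c = project_to_Cp(np.array([9.0, 0.0, 0.0]))       # cluster 1 again

print(np.dot(I_a, I_c))   # 1.0 -> same cluster (r_ij = 1)
print(np.dot(I_a, I_b))   # 0.0 -> different clusters (r_ij = 0)
```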

3.3 Labeled Pairwise Patterns Selection

Since Y is unobservable in clustering, we develop a self-evolution mechanism that allows DSEC to gradually select similar and dissimilar pairwise patterns to train the DNNs in DSEC. For this purpose, DSEC gradually selects labeled pairwise patterns with high likelihood as the learning progresses, which is inspired by curriculum learning [53] in supervised learning. That is, “easy” pairwise patterns with high likelihood are first selected as training data to find rough clusters. Then, as the learning progresses, the trained DNNs can be utilized to measure more accurate similarities, and more pairwise patterns are gradually selected to find more refined clusters.

In DSEC, labeled pairwise patterns are selected based on similarities between indicator features, which is inspired by the following two observations. First, if the DNNs have already been trained to some justified status with “easy” pairwise patterns, effective indicator features can be learned to estimate similarities between pairwise patterns. Second, even randomly initialized DNNs can capture some available information about patterns [54], although it is quite limited. According to these observations, similarities between patterns can be approximately reflected by similarities between indicator features. Therefore, DSEC selects labeled pairwise patterns as follows:

r_{ij} :=
\begin{cases}
1, & \text{if } S(I_i, I_j) > u, \\
0, & \text{if } S(I_i, I_j) \le l, \\
\text{None}, & \text{otherwise},
\end{cases}
\quad i, j = 1, \cdots, n,    (8)

where S( · , · ) indicates the cosine distance function used to measure similarities between indicator features. The two parameters u and l are the thresholds for selecting similar and dissimilar labeled pairwise patterns with high likelihood, respectively, and “None” implies that the similarity between x_i and x_j is ambiguous. Thus, the sample (x_i, x_j, r_{ij}) should be omitted during training. In addition, u = l is satisfied if and only if all pairwise patterns are labeled and appended during training.
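A hedged sketch of the selection rule in Eq. (8): cosine similarities between indicator features are thresholded by u and l, and ambiguous pairs (the “None” case) are encoded here as -1; the toy indicator features are invented purely for illustration.

```python
import numpy as np

def select_pairs(I, u=0.99, l=0.8):
    """Selection rule of Eq. (8): label pairs whose indicator features are
    confidently similar (cosine > u) as 1, confidently dissimilar
    (cosine <= l) as 0, and leave the rest unlabeled (encoded as -1).

    I : (n, k) array of indicator features.
    """
    norms = np.linalg.norm(I, axis=1, keepdims=True)
    In = I / np.maximum(norms, 1e-12)
    S = In @ In.T                       # cosine similarities S(I_i, I_j)
    r = np.full(S.shape, -1, dtype=int)
    r[S > u] = 1
    r[S <= l] = 0
    return r

# Example: three nearly one-hot indicator features with k = 2.
I = np.array([[0.98, 0.02],
              [0.97, 0.03],
              [0.05, 0.95]])
print(select_pairs(I))
```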

There is a piece of prior knowledge in clustering, namely that most pairwise patterns are dissimilar. In practice, it can be naturally employed to select more accurate labeled data.

Fig. 3. The search spaces of the indicator features in the 2-dimensional space under different clustering constraints. (a) Clustering constraint C. (b) Clustering constraints C_{0.5}, C_{1.0} and C_{2.0}.

Formally, given a dataset X = X_1 ∪ · · · ∪ X_k, where X_h indicates the h-th cluster, and denoting by N_h the number of patterns in the cluster X_h, the probability P that two arbitrary patterns x_i, x_j ∈ X belong to the same cluster can be calculated as follows:

Theorem 2. Randomly sampling two patterns x_i and x_j from X, we have

    P = \frac{\sum_{h=1}^{k} N_h^2}{\left( \sum_{h=1}^{k} N_h \right)^2}.

From Theorem 2, we have the following corollary,

Corollary 1. For the balanced dataset X, P = 1/k.

From the aforementioned theorem and corollary, we can acquire two important insights for clustering. Theorem 2 signifies that the prior probability of pairwise patterns belonging to different clusters is generally higher than that of belonging to the same cluster. On balanced datasets, as shown in Corollary 1, pairwise patterns that belong to the same clusters account for only 1/k, where k is the number of clusters. From these insights, if most pairwise patterns are labeled as dissimilar, most of them will receive the correct ground-truth similarities. As a result, informative gradients can be yielded to train randomly initialized DNNs to a considerable status, which guarantees the availability of our method in practice.
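The quantity in Theorem 2 is easy to check numerically; the following sketch computes P for an illustrative cluster-size vector and confirms Corollary 1 on a balanced dataset (the sizes are made up for the example).

```python
import numpy as np

def same_cluster_prob(cluster_sizes):
    """Theorem 2: probability that two independently drawn patterns fall
    into the same cluster, P = sum_h N_h^2 / (sum_h N_h)^2."""
    N = np.asarray(cluster_sizes, dtype=float)
    return np.sum(N ** 2) / np.sum(N) ** 2

print(same_cluster_prob([7000] * 10))      # balanced, k = 10 -> 0.1 (Corollary 1)
print(same_cluster_prob([100, 400, 500]))  # imbalanced example -> 0.42
```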

So far, the two essential issues described in Section 3.1 have been tackled. The DSEC model is rewritten as:

\min_{w, u, l} E(w) = \sum_{i=1}^{n} \sum_{j=1}^{n} v_{ij} L(r_{ij}, I_i \cdot I_j) + s(l, u),
\quad \text{s.t. } s(l, u) = u - l,\ l \le u,
\quad \forall i \text{ and } \forall j,\ v_{ij} \in \{0, 1\},
\quad \forall i,\ I_i = f(x_i; w),\ C(I_i) = \text{True},    (9)

where v is an indicator coefficient matrix, i.e.,

v_{ij} :=
\begin{cases}
1, & \text{if } r_{ij} \in \{0, 1\}, \\
0, & \text{otherwise},
\end{cases}
\quad i, j = 1, \cdots, n,    (10)

where v_{ij} = 1 means that the sample (x_i, x_j, r_{ij}) is selected for training, and v_{ij} = 0 otherwise. The second term s(l, u) = u − l is a penalty term on the number of selected pairwise patterns during training. By minimizing this penalty term, more pairwise patterns will be selected for training until all pairwise patterns are included.
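Assuming the labels r come from a selection step such as the select_pairs sketch above (with -1 marking unlabeled pairs), the masked objective of Eq. (9) with the indicator coefficients of Eq. (10) can be sketched as follows; this illustrates the formula and is not the released implementation.

```python
import numpy as np

def dsec_objective(I, r, u, l):
    """Objective of Eq. (9) for a fixed selection: masked pairwise squared
    loss plus the penalty s(l, u) = u - l.

    I : (n, k) indicator features
    r : (n, n) labels in {1, 0, -1}, where -1 means unlabeled
    """
    v = (r >= 0).astype(float)            # Eq. (10): v_ij = 1 iff r_ij is labeled
    G = I @ I.T                           # dot-product similarities I_i . I_j
    loss = np.sum(v * (np.where(r >= 0, r, 0) - G) ** 2)
    return loss + (u - l)                 # penalty term s(l, u)
```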


4 DSEC ALGORITHM

In this section, a self-evolution optimization algorithm is developed to train the DSEC model in Eq. (9), and a label inference tactic is presented to cluster each pattern based on the corresponding indicator feature purely.

4.1 Self-Evolution Clustering Training

Akin to [25], [26], an alternating iterative optimization algorithm called Self-Evolution Clustering Training (SECT) is developed to optimize the DSEC model. The algorithm focuses on two issues, namely the implementation of the clustering constraints and the iterative optimization.

A constraint layer is devised to meet the requirements of the clustering constraints in Eq. (7). Formally, the function of this layer is formulated as follows:

I^{tem}_h := \exp\left( I^{in}_h - \max_h (I^{in}_h) \right), \quad h = 1, \cdots, k,    (11a)

I^{out}_h := \frac{I^{tem}_h}{\| I^{tem} \|_p}, \quad h = 1, \cdots, k,    (11b)

where I^{in}, I^{tem} and I^{out} ∈ R^k indicate the input, intermediate variable and output of the constraint layer, in order. Additionally, I^{in}_h, I^{tem}_h and I^{out}_h represent the h-th elements of I^{in}, I^{tem} and I^{out}. Note that all elements of the output I^{out} are mapped into [0, +∞) by Eq. (11a), and the output I^{out} is simultaneously constrained to meet the norm constraint by Eq. (11b). In DSEC, the DNNs are always followed by the constraint layer. That is, for arbitrary i, I_i invariably satisfies the clustering constraints in Eq. (7).
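A NumPy sketch of the constraint layer in Eqs. (11a)-(11b); the paper appends this layer to the DNN inside a deep learning framework, whereas here it is written as a standalone function purely for illustration.

```python
import numpy as np

def constraint_layer(I_in, p=2.0):
    """Constraint layer of Eqs. (11a)-(11b).

    I_in : (batch, k) raw DNN outputs.
    Returns outputs that are non-negative with unit l_p norm, i.e. they
    satisfy the clustering constraint C_p of Eq. (7).
    """
    shifted = I_in - np.max(I_in, axis=-1, keepdims=True)   # shift in Eq. (11a)
    I_tem = np.exp(shifted)                                  # positive elements
    norm = np.sum(I_tem ** p, axis=-1, keepdims=True) ** (1.0 / p)
    return I_tem / norm                                      # Eq. (11b)

out = constraint_layer(np.array([[2.0, -1.0, 0.5]]), p=2.0)
print(out, np.linalg.norm(out, ord=2, axis=-1))              # unit l2 norm
```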

The optimizations of w and {u, l} are performed in an alternating iterative manner. Once {u, l} are fixed, r and v can be obtained, and the DSEC model degenerates into:

\min_{w} E(w) = \sum_{i=1}^{n} \sum_{j=1}^{n} v_{ij} L(r_{ij}, f(x_i; w) \cdot f(x_j; w)).    (12)

Since r and v are available, optimizing Eq. (12) is a supervised learning problem, so the back-propagation algorithm can be utilized to update w. Besides, the storage complexity of r and v is O(n^2) because the similarities between pairwise patterns need to be calculated and stored, which is too high for handling large datasets. To deal with this problem, w is updated on batches that are gradually sampled from the original dataset, as illustrated in lines 3 to 8 of Algorithm 1.

Similarly, the DSEC model can be simplified as follows when w is fixed:

\min_{l, u} E(l, u) = s(l, u).    (13)

According to the gradient descent algorithm, in each iteration, the update rules of l and u can be written as:

u := u - \eta \cdot \frac{\partial s(l, u)}{\partial u}, \qquad l := l - \eta \cdot \frac{\partial s(l, u)}{\partial l},    (14)

where η ≥ 0 is the learning rate. Since ∂s(l, u)/∂l = −1 < 0 and ∂s(l, u)/∂u = 1 > 0 always hold, s(l, u) will gradually decrease until l = u is met. This corresponds to our target scenario in which pairwise patterns are gradually appended for training as the optimization progresses, until all pairwise patterns are considered.
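Because ∂s/∂u = 1 and ∂s/∂l = −1, the update of Eq. (14) simply lowers u and raises l by η in each outer iteration until they cross, which is the stopping condition of Algorithm 1 below. The numbers in this sketch follow the settings quoted later in Section 5.4 and are shown only to illustrate the schedule.

```python
u, l, eta = 0.99, 0.8, 0.0019     # thresholds and learning rate from Section 5.4

outer_iters = 0
while l <= u:                     # Algorithm 1 stops once l > u
    # ... select pairs with (u, l) and train the DNN on them (Eq. (12)) ...
    u -= eta                      # u := u - eta * ds/du, with ds/du = +1
    l += eta                      # l := l - eta * ds/dl, with ds/dl = -1
    outer_iters += 1
print(outer_iters)                # roughly 50 outer iterations with these values
```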

Algorithm 1 Deep Self-Evolution Clustering

Input: Dataset X = {x_i}_{i=1}^n, f_w, k, u, l, η, m.
Output: Clustering label c_i of x_i ∈ X.

 1: Randomly initialize w;
 2: repeat
 3:   for all q ∈ {1, 2, ..., ⌊n/m⌋} do
 4:     Sample batch X_q from X;                 // m patterns per batch
 5:     Select training data from X_q;           // Eq. (8)
 6:     Calculate the indicator coefficient v;   // Eq. (10)
 7:     Update w by minimizing Eq. (12);
 8:   end for
 9:   Update u and l according to Eq. (14);
10: until l > u
11: for all x_i ∈ X do
12:   I_i := f(x_i; w);
13:   c_i := argmax_h(I_{ih});
14: end for

4.2 Label Inference for Clustering

Clustering labels can be inferred from the learned indicator features purely, which benefits from the predictable modes of the learned indicator features, which are ideally k-dimensional one-hot vectors. In practice, however, they may not be strictly one-hot vectors due to the following two reasons, which also widely exist in supervised learning. First, it is hard to reach the global optimum when training DNNs due to the strong non-convexity. Second, even if a global optimum is achieved on the training data, it is almost impossible to achieve a global optimum for all data (including the unobservable data). To manage this issue, patterns are clustered by locating the largest response of the indicator features, i.e.,

c_i := \arg\max_h (I_{ih}), \quad h = 1, \cdots, k,    (15)

where c_i is the clustering label of pattern x_i. Note that labels with high likelihood are assigned to patterns.
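The label inference of Eq. (15) amounts to a single argmax over each indicator feature, e.g.:

```python
import numpy as np

def infer_labels(I):
    """Eq. (15): assign each pattern to the cluster with the largest
    indicator-feature response."""
    return np.argmax(I, axis=1)

print(infer_labels(np.array([[0.9, 0.1, 0.0],
                             [0.2, 0.1, 0.7]])))   # -> [0 2]
```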

In summary, the developed SECT algorithm is listed in Algorithm 1. During each iteration, the algorithm alternately selects similar and dissimilar pairwise patterns via the fixed DNNs and trains the DNNs based on the selected pairwise patterns. When all pairwise patterns have been considered for training and the objective function in Eq. (12) cannot be improved further, the algorithm converges. Conclusively, patterns are clustered by locating the largest response of the learned indicator features.
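Putting the pieces together, the following framework-agnostic sketch mirrors Algorithm 1; `f` (the DNN followed by the constraint layer) and `train_step` (one optimizer update minimizing Eq. (12)) are hypothetical placeholders, and `select_pairs` and `infer_labels` refer to the sketches given earlier. It is an outline of the procedure, not the released Keras code.

```python
import numpy as np

def sect(X, f, train_step, u=0.99, l=0.8, eta=0.0019, m=1000):
    """Sketch of Self-Evolution Clustering Training (Algorithm 1).

    X          : (n, d) data matrix.
    f          : callable mapping a batch of patterns to (batch, k) indicator
                 features (DNN + constraint layer) -- placeholder.
    train_step : callable(batch, r, v) performing one update of the DNN
                 parameters w by minimizing Eq. (12) -- placeholder.
    """
    n = X.shape[0]
    while l <= u:                                    # outer self-evolution loop
        for q in range(n // m):                      # mini-batches of m patterns
            batch = X[q * m:(q + 1) * m]
            I = f(batch)                             # indicator features, Eq. (4)
            r = select_pairs(I, u, l)                # labeled pairs, Eq. (8)
            v = (r >= 0).astype(float)               # indicator coefficients, Eq. (10)
            train_step(batch, np.where(r >= 0, r, 0), v)  # minimize Eq. (12)
        u -= eta                                     # threshold updates, Eq. (14)
        l += eta
    return infer_labels(f(X))                        # clustering labels, Eq. (15)
```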

5 EXPERIMENTS

In this section, the clustering performance of the DSEC model is evaluated on several popular datasets with three frequently used metrics. Further, numerous ablation experiments are conducted to systematically and comprehensively analyze the developed DSEC model. Specifically, the core code² of the DSEC model is released at https://github.com/vector-1127/DSEC.

2. Relies on Keras [55] with the Theano [56] backend.


5.1 Datasets

We perform experiments on twelve popular datasets, including images, texts, and speech. The number of patterns, the number of clusters, and the dimension of patterns are listed in order in Table 1. As described in [30], [42], the training and testing patterns of each dataset are jointly utilized in our experiments.

Specifically, the image datasets include MNIST [57], CIFAR-10 [58], CIFAR-100 [58], STL-10 [59] and ILSVRC2012-1K [60]. For the CIFAR-100 dataset, additionally, the 20 superclasses are considered in our experiments. We randomly choose 10 subjects from the ILSVRC2012-1K dataset [60] and resize these images to 96 × 96 × 3 to construct the ImageNet-10 dataset for our experiments. To compare the clustering methods on a more complex dataset, we also randomly select 15 kinds of dog images from ILSVRC2012-1K to establish the fine-grained ImageNet-Dog dataset. The text datasets include 20NEWS [61], REUTERS [62], and REUTERS-10k [62] (a subset of 10000 texts sampled from REUTERS following [42]). Three speech datasets, i.e., AudioSet-20, AudioSet-Music, and AudioSet-Human, are randomly chosen from AudioSet [63] for comparison in our experiments. Among these, AudioSet-Music and AudioSet-Human are two fine-grained datasets sampled from the coarse classes “Music” and “Human sounds” in AudioSet [63], respectively. Specifically, term frequency-inverse document frequency features³ (TF-IDF) and Mel-frequency cepstral coefficients⁴ are employed to recode texts and speeches, respectively.
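As a rough illustration of this recoding step, the sketch below uses scikit-learn's TfidfVectorizer and librosa's MFCC routine referenced in the footnotes; the vocabulary cap, the number of coefficients and the temporal pooling are assumptions for the example, not the paper's exact preprocessing.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import librosa

# Text: TF-IDF features restricted to the most frequent terms
# (the 10000-dimensional inputs in Table 1 suggest a similar cap).
docs = ["rec.sport.hockey game last night", "space shuttle launch delayed"]
tfidf = TfidfVectorizer(max_features=10000).fit_transform(docs).toarray()

# Audio: MFCCs pooled over time as a fixed-length clip descriptor (assumed).
y, sr = librosa.load(librosa.ex("trumpet"))          # bundled example clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (20, frames)
clip_feature = mfcc.mean(axis=1)                     # shape (20,)
print(tfidf.shape, clip_feature.shape)
```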

5.2 Evaluation Metrics

Three popular metrics are utilized to evaluate the performance of clustering methods, including Normalized Mutual Information (NMI) [64], Adjusted Rand Index (ARI) [65], and clustering Accuracy (ACC) [66]. Specifically, these metrics range in [0, 1], and higher scores signify that more accurate clustering results are achieved.
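A sketch of how these metrics are commonly computed: NMI and ARI come directly from scikit-learn, while ACC requires the best one-to-one matching between predicted clusters and ground-truth classes, obtained here with the Hungarian algorithm; this is a standard formulation rather than code from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best accuracy over all one-to-one cluster-to-class mappings."""
    k = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    row, col = linear_sum_assignment(-count)          # maximize matched count
    return count[row, col].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])                 # same partition, relabeled
print(normalized_mutual_info_score(y_true, y_pred),   # 1.0
      adjusted_rand_score(y_true, y_pred),            # 1.0
      clustering_accuracy(y_true, y_pred))            # 1.0
```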

5.3 Compared Methods

Several existing clustering methods are employed for comparison with our approach. Specifically, the traditional methods K-means [11] and SC [31] are adopted for comparison. For the representation-based clustering approaches, as described in [42], we employ several unsupervised learning methods, including AE [18], DAE [19] and CILR [47], to learn feature representations, and use K-means [11] to cluster the data as a post-processing step. For a comprehensive comparison, recent single-stage methods, including CatGAN [50], GMVAE [51], DEC [42], JULE-SF [30], and JULE-RC [30], are also used for comparison.

5.4 Experimental Settings

For the traditional clustering methods, i.e., K-means [11] and SC [31], following the previous work [42], we concatenate the HOG feature [16] and an 8 × 8 color map as inputs when we experiment on STL-10, ImageNet-10 and ImageNet-Dog.

3. https://scikit-learn.org/stable
4. https://librosa.github.io/librosa/feature.html

TABLE 1
The datasets utilized in the experiments.

Dataset               Numbers   Clusters   Dimensions
MNIST [57]            70000     10         784
CIFAR-10 [58]         60000     10         3072
STL-10 [59]           13000     10         27648
ImageNet-10 [60]      13000     10         27648
ImageNet-Dog [60]     19500     15         27648
CIFAR-100 [58]        60000     20         3072
20NEWS [61]           18846     20         10000
REUTERS-10k [62]      10000     4          2000
REUTERS [62]          685071    4          2000
AudioSet-20 [63]      30000     20         1920
AudioSet-Music [63]   22500     15         1920
AudioSet-Human [63]   22500     15         1920

For the remaining datasets and methods, the pixel intensities on the image datasets and the extracted feature representations on the text and audio datasets are employed as inputs, respectively.

In our experiments, DNNs with the constraint layer are devised to learn indicator features (the details of the devised DNNs can be found in the supplementary material). Specifically, we set C_{2.0} (the cosine similarity) as the default clustering constraint in our experiments. Since the prior probability of pairwise patterns belonging to different clusters is higher than that of belonging to the same cluster, we set u = 0.99 and l = 0.8 for selecting similar and dissimilar pairwise patterns, respectively. The learning rate is η = (u − l)/(2·e) = 0.0019, where e signifies the number of iterations and is set to 50. In SECT, we gradually sample m = 1000 patterns as a batch from which labeled pairwise patterns are selected. The normalized Gaussian initialization strategy [67] is utilized to initialize the devised DNNs. The RMSProp optimizer [68] with an initial learning rate of 0.001 is utilized to train the DNNs. The batch size is 32 in the learning procedure. For a reasonable evaluation, we perform 10 random restarts for all experiments and the average results are employed for comparison with the other methods. Following the previous works [19], [47], in addition, the unsupervised data augmentation technique in [47] is used to combat over-fitting and to guide clustering in our experiments.
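As a quick arithmetic check of the stated threshold learning rate (values taken directly from this section):

```python
u, l, e = 0.99, 0.8, 50            # thresholds and number of iterations
eta = (u - l) / (2 * e)            # threshold learning rate used in Eq. (14)
print(round(eta, 4))               # 0.0019
```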

5.5 Clustering

5.5.1 Quantitative Results

In Table 2, we report the quantitative results of these clustering methods on the experimental datasets. For each dataset, note that DSEC dramatically outperforms the other methods with significant margins on all three evaluation metrics. The results imply that DSEC achieves state-of-the-art performance on the clustering task.

Upon further analysis, several tendencies can be observed from Table 2. First, the representation-based clustering methods (e.g., AE [18], CILR [47]) are currently superior to the traditional methods (e.g., K-means [11], SC [31]). This indicates that the clustering techniques themselves have only a minor impact on performance, while the representation extraction is more crucial. This scenario also exists in supervised tasks, e.g., representation learning plays a crucial role in the classification task [36].


TABLE 2
The clustering results of the compared methods on the experimental datasets. The best results are highlighted in bold.

Dataset        MNIST [57]              CIFAR-10 [58]           CIFAR-100 [58]          STL-10 [59]
Metric         NMI    ARI    ACC       NMI    ARI    ACC       NMI    ARI    ACC       NMI    ARI    ACC
K-means [11]   0.4997 0.3652 0.5723    0.0871 0.0487 0.2289    0.0839 0.0280 0.1297    0.1245 0.0608 0.1920
SC [31]        0.6626 0.5214 0.6958    0.1028 0.0853 0.2467    0.0901 0.0218 0.1360    0.0978 0.0479 0.1588
AE [18]        0.7257 0.6139 0.8123    0.2393 0.1689 0.3135    0.1004 0.0476 0.1645    0.2496 0.1610 0.3030
DAE [19]       0.7563 0.6467 0.8316    0.2506 0.1627 0.2971    0.1105 0.0460 0.1505    0.2242 0.1519 0.3022
CILR [47]      0.7986 0.7756 0.8855    0.1105 0.0612 0.2583    0.0943 0.0453 0.1554    0.1398 0.0652 0.2452
GMVAE [51]     0.7964 0.7670 0.8854    0.2847 0.1873 0.3364    0.1366 0.0492 0.1584    0.2365 0.1572 0.3177
CatGAN [50]    0.7637 0.7360 0.8279    0.2646 0.1757 0.3152    0.1200 0.0453 0.1510    0.2100 0.1390 0.2984
JULE-SF [30]   0.9063 0.9139 0.9592    0.1919 0.1357 0.2643    0.0944 0.0352 0.1434    0.1754 0.1622 0.2741
JULE-RC [30]   0.9130 0.9270 0.9640    0.1923 0.1377 0.2715    0.1026 0.0327 0.1367    0.1815 0.1643 0.2769
DEC [42]       0.7716 0.7414 0.8430    0.2568 0.1607 0.3010    0.1358 0.0495 0.1852    0.2760 0.1861 0.3590
DSEC           0.9522 0.9631 0.9833    0.4379 0.3399 0.4778    0.2124 0.1100 0.2552    0.4033 0.2864 0.4818

Dataset        ImageNet-10 [60]        ImageNet-Dog [60]       20NEWS [61]             REUTERS-10k [62]
Metric         NMI    ARI    ACC       NMI    ARI    ACC       NMI    ARI    ACC       NMI    ARI    ACC
K-means [11]   0.1186 0.0571 0.2409    0.0548 0.0204 0.1054    0.2154 0.0826 0.2328    0.3124 0.2505 0.5242
SC [31]        0.1511 0.0757 0.2740    0.0383 0.0133 0.1111    0.2147 0.0934 0.2486    0.3066 0.2301 0.4384
AE [18]        0.2099 0.1516 0.3170    0.1039 0.0728 0.1851    0.4439 0.3237 0.4907    0.3254 0.2878 0.5671
DAE [19]       0.2064 0.1376 0.3044    0.1043 0.0779 0.1903    0.4558 0.3274 0.5027    0.3542 0.3042 0.5825
CILR [47]      0.2135 0.1357 0.3072    0.1029 0.0703 0.1826    0.4835 0.3758 0.5275    0.3447 0.2955 0.5732
GMVAE [51]     0.2365 0.1742 0.3532    0.1281 0.0792 0.1782    0.4266 0.3356 0.4792    0.3796 0.2966 0.6008
CatGAN [50]    0.2250 0.1571 0.3459    0.1213 0.0776 0.1738    0.4064 0.3024 0.4754    0.3238 0.3214 0.5932
JULE-SF [30]   0.1597 0.1205 0.2927    0.0492 0.0261 0.1115    0.4976 0.4506 0.5689    0.3713 0.3535 0.6144
JULE-RC [30]   0.1752 0.1382 0.3004    0.0537 0.0284 0.1377    0.5126 0.4679 0.5834    0.4054 0.3686 0.6264
DEC [42]       0.2819 0.2031 0.3809    0.1216 0.0788 0.1949    0.5286 0.4683 0.5832    0.6585 0.5318 0.7217
DSEC           0.5827 0.5218 0.6737    0.2362 0.1241 0.2644    0.6133 0.5283 0.6513    0.7085 0.5936 0.7832

Dataset        REUTERS [62]            AudioSet-20 [63]        AudioSet-Music [63]     AudioSet-Human [63]
Metric         NMI    ARI    ACC       NMI    ARI    ACC       NMI    ARI    ACC       NMI    ARI    ACC
K-means [11]   0.3275 0.2593 0.5329    0.1901 0.0780 0.2074    0.1967 0.0742 0.2231    0.2023 0.0840 0.2375
SC [31]        N/A    N/A    N/A       0.1774 0.0741 0.2195    0.1897 0.0823 0.2264    0.1836 0.0752 0.2254
AE [18]        0.3564 0.3164 0.7197    0.1933 0.1000 0.2300    0.2154 0.1146 0.2475    0.2024 0.1054 0.2363
DAE [19]       0.3675 0.3386 0.7246    0.1964 0.1048 0.2366    0.2015 0.1144 0.2325    0.1924 0.1136 0.2475
CILR [47]      0.3356 0.3064 0.7056    0.2007 0.1071 0.2387    0.2007 0.1231 0.2324    0.2246 0.1224 0.2642
GMVAE [51]     0.3823 0.3475 0.6365    0.2224 0.1066 0.2441    0.2397 0.1376 0.2498    0.2365 0.1287 0.2653
CatGAN [50]    0.3553 0.3246 0.6245    0.2102 0.0946 0.2174    0.2274 0.1036 0.2274    0.2297 0.1137 0.2434
JULE-SF [30]   0.3986 0.3594 0.6111    0.1935 0.1128 0.2163    0.2103 0.1018 0.2362    0.2261 0.1095 0.2283
JULE-RC [30]   0.4012 0.3825 0.6336    0.2106 0.1108 0.2336    0.2196 0.1196 0.2497    0.2386 0.1297 0.2457
DEC [42]       0.6832 0.5635 0.7563    0.2280 0.1102 0.2461    0.2280 0.1297 0.2585    0.2308 0.1249 0.2546
DSEC           0.7246 0.6044 0.8143    0.2881 0.1635 0.2981    0.3048 0.1722 0.3350    0.2989 0.1708 0.3284

Fig. 4. The clustering results of DSEC on the experimental datasets. For each dataset, the name of the dataset is written on the upward side; the clustering results in the initial, intermediate and final stages are orderly illustrated in the top, intermediate and bottom lines. In addition, the clustering accuracies at the different stages are listed under the results. Large figures can be found in the supplement.

Second, thanks to the effective representations learned by these unsupervised methods, impressive improvements are achieved, but they are limited, especially compared against the single-stage clustering methods (e.g., JULE [30], DSEC). This demonstrates that single-stage clustering methods can observably improve the clustering performance, since better representations can be captured during clustering. Thirdly, dramatic superiority is yielded by DSEC on the CIFAR-100, ImageNet, and AudioSet datasets, which signifies that DSEC can be utilized on large-scale datasets and is not merely limited to some simple datasets (e.g., MNIST). Fourthly, competitive results are obtained by DSEC on the image, text and audio datasets concurrently, which demonstrates that DSEC is efficient for managing general clustering tasks in practice. Additionally, the performance of JULE approximates that of our method on MNIST. However, there is a conspicuous margin on the other datasets. A possible reason is that the initialization strategy of JULE is invalidated on intricate datasets, which consequently degenerates the performance. Compared with JULE, DSEC alleviates the dependence on additional techniques by introducing the Self-Evolution Clustering Training algorithm, which elevates the independence of DSEC.


Fig. 5. The indicator features of the first ten clusters on MNIST, ImageNet-10, 20NEWS and AudioSet-20. For each dataset, ground-truth labels are written on the upward side, indicator features are shown on the right side of examples, followed by typical failures at the bottom.

5.5.2 Results Visualization

The clustering results of DSEC on our experimental datasets are intuitively illustrated in Fig. 4. For clarity, the learned indicator features are mapped into the 2-dimensional space with t-SNE [69]. In the figure, different colors correspond to different clusters. Visually, this illustration demonstrates that the developed DSEC model can gradually cluster patterns with the same indicator features if the patterns belong to the same clusters, and cluster patterns with different indicator features otherwise.

5.5.3 Indicator Features Visualization

In Fig. 5, we qualitatively analyze the indicator features learned by DSEC on the experimental datasets. For clarity, word clouds⁵ and spectrograms⁶ are employed to visualize text and audio patterns, respectively.

5. https://pypi.python.org/pypi/wordcloud
6. https://librosa.github.io/librosa/display.html

the same neurons are distinctly activated in the indicator features when patterns belong to the same clusters, and the failure cases always carry fine-grained information different from the correct cases. That is, DSEC attempts to learn high-level representations rather than a simple combination of lower-level representations. This is why more complicated images, such as the airliner and airship images in ImageNet-10, can be distinguished by DSEC. Interestingly, most of the indicator features of the failure modes appear reasonable. For example, for a car, only other types of vehicles (e.g., truck) are considered plausible labels, rather than clusters beyond vehicles. This implies that more interpretable and high-level knowledge is captured by DSEC for clustering.

5.5.4 The Promotion of DSEC
To observe the self-evolution capability of SECT, Fig. 6 depicts the precisions of the selected similar and dissimilar


TABLE 3
The comparison of the clustering results on the experimental datasets with various scenarios. The best results are highlighted in bold.

Dataset        MNIST [57]               CIFAR-10 [58]            CIFAR-100 [58]           STL-10 [59]
Metric         NMI     ARI     ACC      NMI     ARI     ACC      NMI     ARI     ACC      NMI     ARI     ACC
DSEC-SECT      0.9271  0.9310  0.9682   0.4034  0.3102  0.4458   0.1837  0.0946  0.2365   0.3602  0.2415  0.4472
DSEC+K-means   0.9504  0.9618  0.9826   0.4313  0.3332  0.4743   0.2102  0.1091  0.2564   0.3981  0.2828  0.4715
DSEC+SC        0.9538  0.9623  0.9819   0.4209  0.3245  0.4686   0.2080  0.1048  0.2543   0.4018  0.2827  0.4802
DSEC           0.9522  0.9631  0.9833   0.4379  0.3399  0.4778   0.2124  0.1100  0.2552   0.4033  0.2864  0.4818

Dataset        ImageNet-10 [60]         ImageNet-Dog [60]        20NEWS [61]              REUTERS-10k [62]
Metric         NMI     ARI     ACC      NMI     ARI     ACC      NMI     ARI     ACC      NMI     ARI     ACC
DSEC-SECT      0.5132  0.4421  0.6309   0.2183  0.1109  0.2127   0.5955  0.5074  0.6495   0.7088  0.5915  0.7827
DSEC+K-means   0.5817  0.5198  0.6692   0.2429  0.1265  0.2635   0.6074  0.5032  0.6455   0.7047  0.5912  0.7833
DSEC+SC        0.5820  0.5186  0.6703   0.2389  0.1230  0.2604   0.6246  0.5049  0.6483   0.7042  0.5933  0.7834
DSEC           0.5827  0.5218  0.6737   0.2362  0.1241  0.2644   0.6133  0.5283  0.6513   0.7085  0.5936  0.7832

Dataset        REUTERS [62]             AudioSet-20 [63]         AudioSet-Music [63]      AudioSet-Human [63]
Metric         NMI     ARI     ACC      NMI     ARI     ACC      NMI     ARI     ACC      NMI     ARI     ACC
DSEC-SECT      0.7235  0.6037  0.8135   0.2856  0.1623  0.2945   0.3038  0.1179  0.3319   0.2946  0.1168  0.3280
DSEC+K-means   0.7236  0.6015  0.8146   0.2838  0.1612  0.2955   0.3041  0.1217  0.3314   0.2968  0.1185  0.3271
DSEC+SC        N/A     N/A     N/A      0.2870  0.1623  0.2944   0.3009  0.1202  0.3353   0.2961  0.1192  0.3250
DSEC           0.7246  0.6044  0.8143   0.2881  0.1635  0.2981   0.3048  0.1222  0.3350   0.2989  0.1208  0.3284

[Figure 6 shows two line plots of precision versus training epoch (0–50), with one curve per experimental dataset (MNIST, CIFAR-10, STL-10, ImageNet-10, ImageNet-Dog, CIFAR-100, 20NEWS, REUTERS-10k, REUTERS, AudioSet-20, AudioSet-Music, AudioSet-Human): the left panel for the selected similar pairs, the right panel for the selected dissimilar pairs.]

Fig. 6. The precisions of the selected similar (left) or dissimilar (right) pairwise patterns in different stages of clustering.

patterns during the learning stages on the experimental datasets. It shows that SECT gradually selects more accurate pairwise patterns for training, which verifies that SECT steadily self-evolves as learning progresses. Furthermore, two tendencies can be observed from the illustration. First, the precisions of the selected dissimilar patterns are always higher than those of the similar patterns, which is in agreement with the analysis in Theorem 2. Second, the improvements mainly occur during the initial stages, which is similar to the tendency in supervised learning [70]. This also indicates that the initial status is significant for DSEC, which is further verified with an experiment in Section 5.6.8.
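The precision curves of Fig. 6 can be computed directly from the selected pairs and the ground-truth class labels (used only for evaluation). A minimal sketch, with our own variable names, is:

```python
import numpy as np

def pair_selection_precision(i, j, selected_as_similar, labels):
    """Precision of the selected similar and dissimilar pairs.

    i, j                : index arrays of the selected pairs,
    selected_as_similar : boolean array, True for pairs selected as similar,
    labels              : ground-truth class labels (evaluation only).
    """
    same_class = labels[i] == labels[j]
    prec_similar = same_class[selected_as_similar].mean()
    prec_dissimilar = (~same_class[~selected_as_similar]).mean()
    return prec_similar, prec_dissimilar
```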

5.6 Ablation Study
In this section, extensive ablation studies are performed to comprehensively analyze the DSEC model. Specifically, the experimental settings are inherited from Section 5.4, except for the particular settings described in each study.

5.6.1 Effect of SECT
To empirically demonstrate the practicality of the SECT algorithm, DSEC-SECT, which utilizes all pairwise patterns to train the networks, is compared with DSEC. In DSEC-SECT, specifically, we set u = l = 0.99 at the beginning, followed by an annealing phase that decreases both linearly to u = l = 0.8. Table 3 shows that DSEC achieves better performance than DSEC-SECT.

A further analysis: since the DNNs are initialized randomly, more noisy data are involved in training DSEC-SECT. In contrast, DSEC selects highly confident pairwise patterns via the SECT algorithm. By using these selected pairwise patterns, DSEC can begin with more refined clusters and ultimately improve the clustering performance.
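To make the comparison concrete, the following sketch illustrates a SECT-style selection step under our own simplifying assumptions: pairwise similarity is taken as the dot product of normalized indicator features, pairs above the upper threshold u are treated as similar, pairs below the lower threshold l as dissimilar, and all other pairs are ignored; in the DSEC-SECT ablation the two thresholds coincide and are annealed linearly from 0.99 to 0.8.

```python
import numpy as np

def select_pairs(indicator_features, u, l):
    """Select confident pairs from (n, k) indicator features (sketch, O(n^2) memory)."""
    h = indicator_features / (np.linalg.norm(indicator_features, axis=1, keepdims=True) + 1e-12)
    sim = h @ h.T                             # pairwise dot-product similarities
    i, j = np.triu_indices(len(h), k=1)       # all unordered pairs (i < j)
    s = sim[i, j]
    keep = (s > u) | (s < l)                  # drop the ambiguous middle region
    return i[keep], j[keep], s[keep] > u      # last output: selected as similar?

def dsec_sect_threshold(epoch, num_epochs, start=0.99, end=0.8):
    """Linear annealing of the single threshold u = l used in DSEC-SECT."""
    t = epoch / max(num_epochs - 1, 1)
    return start + t * (end - start)
```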

5.6.2 Effect of Label Inference

To evaluate the effect of the label inference tactic in Eq. (15), we employ traditional methods, including K-means [11] (DSEC+K-means) and SC [31] (DSEC+SC), to cluster patterns via the generated indicator features. The results listed in Table 3 show that the label inference tactic outperforms these traditional methods. Furthermore, compared with them, the label inference tactic is straightforward and simple, since DSEC only needs to locate the largest response of the generated indicator features to yield clustering labels.
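The two inference routes compared in Table 3 are easy to state in code. A minimal sketch (variable names are ours) contrasts DSEC's largest-response label inference with the DSEC+K-means baseline that clusters the same indicator features:

```python
import numpy as np
from sklearn.cluster import KMeans

def infer_labels_argmax(indicator_features):
    """DSEC label inference: the index of the largest response per pattern."""
    return np.argmax(indicator_features, axis=1)

def infer_labels_kmeans(indicator_features, k):
    """DSEC+K-means baseline: run K-means on the generated indicator features."""
    return KMeans(n_clusters=k, n_init=10).fit_predict(indicator_features)
```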

5.6.3 Impact of Number of Clusters

We conduct an experiment on 9 different datasets to study the stability of these methods with respect to the number of clusters. For each dataset, we randomly select k subjects (classes) from ILSVRC2012-1K [60]. By varying k from 10 to 50 with an interval of 5, 9 different datasets are established. As the number of clusters increases, Fig. 7 (a) shows that all the methods generally degrade, because more uncertainty is introduced. However, in contrast to the other methods, the superiority of DSEC is retained, which implies that DSEC has an outstanding capability to deal with clustering tasks with a larger number of clusters.

5.6.4 Impact of Number of Patterns

To observe the effect of the number of patterns on the clustering methods, we vary it from 10000 to 60000 with an interval of 10000 on CIFAR-10. Fig. 7 (b) illustrates that the performance of most methods improves with more patterns, which implies that more patterns are beneficial for clustering. Furthermore, we observe that the


[Figure 7 has four panels, each plotting clustering accuracy for the compared methods (K-means, SC, AE, DAEC, ILR, GMVAE, CatGAN, JULE-RC, DEC, DSEC): (a) versus the number of clusters (10–50) on ImageNet, (b) versus the number of samples (10k–60k) on CIFAR-10, (c) versus the minimum retention rate r (0.1–0.9) on MNIST, and (d) for BoW versus TF-IDF features on 20NEWS.]

Fig. 7. The clustering accuracies of the compared methods with various scenarios. The dashed lines represent the results of DSEC-SECT. The comparison of the clustering results (a) with the increasing number of clusters on the ILSVRC2012-1K dataset (1300 images per cluster), (b) with the increasing number of patterns on the CIFAR-10 dataset, and (c) on the imbalanced datasets sampled from MNIST.

[Figure 8 has three panels: (a) histograms of the frequencies of indicator-feature elements over the intervals 0–0.1, 0.1–0.2, ..., 0.9–1.0 in the initial and final stages on MNIST and ImageNet-10; (b) clustering accuracy versus epoch (0–50) on 20NEWS for DSEC, DSEC+AE and DSEC+S(n) with n in {500, 1000, 2000, 5000}; (c) clustering accuracy versus epoch (0–50) on ImageNet-10 for VGG-13, VGG-16 and VGG-19.]

Fig. 8. (a) The distributions of the elements in the generated indicator features in the initial and final stages on MNIST and ImageNet-10. (b) The contributions of the pre-training techniques on DSEC. Triangles with different colors indicate various initial states. Specifically, DSEC+S(n) represents that n labeled patterns are utilized to pre-train the experimental networks in a supervised manner. (c) The influence of network configurations on DSEC. Specifically, the three popular variants of VGG [34], i.e., VGG-13, VGG-16 and VGG-19, are compared.

performance of DSEC increases rapidly when more patterns are considered, which is similar to the influence of the number of labeled patterns in supervised learning tasks. A significant reason is that sufficient patterns are essential to learn the numerous parameters in deep models that map patterns from inputs to outputs.

5.6.5 Performance on Imbalanced Datasets
To study the performance of DSEC on imbalanced datasets, an additional experiment is executed on MNIST. In this experiment, we randomly sample five subsets from MNIST with various minimum retention rates. For a minimum retention rate r, the patterns of the first class are kept with probability r and those of the last class with probability 1, with the other classes interpolated linearly in between. From Fig. 7 (c), we observe that DSEC is more robust than the other methods on the imbalanced datasets. A primary reason is that DSEC performs clustering based only on similarities between patterns, which reduces the influence of the imbalance of datasets.
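The sampling protocol can be written down in a few lines. The sketch below follows the description above (function and variable names are ours; the five retention rates match the x-axis of Fig. 7 (c)):

```python
import numpy as np

def sample_imbalanced(labels, r, num_classes=10, rng=None):
    """Keep class 0 with probability r, the last class with probability 1,
    and interpolate the retention probabilities of the other classes linearly."""
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = np.linspace(r, 1.0, num_classes)          # per-class retention rates
    keep = rng.random(len(labels)) < keep_prob[labels]
    return np.flatnonzero(keep)                           # indices of retained patterns

# Five imbalanced subsets of MNIST, one per minimum retention rate.
# subsets = {r: sample_imbalanced(y_train, r) for r in (0.1, 0.3, 0.5, 0.7, 0.9)}
```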

5.6.6 Impact of Feature Extraction
To explore the impact of feature extraction, we employ two different approaches, i.e., bag-of-words7 (BoW) and TF-IDF features, to represent the texts in 20NEWS. As shown in Fig. 7 (d), feature extraction significantly influences all these clustering methods, which is akin to its impact in supervised learning, i.e., better features bring better performance. Furthermore, it is worth noting that superior performance with significant margins is consistently achieved by DSEC, which verifies the stability of DSEC.
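As a reference for this setting, the two text representations can be produced with scikit-learn (the footnotes point to other packages; the vocabulary size of 2000 below is our illustrative choice, not necessarily the one used in the paper):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Build BoW and TF-IDF representations of the 20NEWS corpus.
texts = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data
bow = CountVectorizer(max_features=2000, stop_words="english").fit_transform(texts)
tfidf = TfidfVectorizer(max_features=2000, stop_words="english").fit_transform(texts)
print(bow.shape, tfidf.shape)   # sparse matrices of shape (n_documents, 2000)
```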

5.6.7 Contribution of Clustering Constraint
To investigate the contribution of the clustering constraints in Eq. (9), Fig. 8 (a) exhibits the distributions of the elements in the generated indicator features on MNIST

7. https://pypi.python.org/pypi/bagofwords

and ImageNet-10 in the initial and final stages. In this experiment, we count the number of elements falling into each of the following ten disjoint intervals, i.e., [0, 0.1), [0.1, 0.2), · · · , [0.9, 1]. Initially, most elements of the indicator features lie in [0, 0.1) and [0.1, 0.2). Finally, most elements move to [0, 0.1) and [0.9, 1]. This implies that the generated indicator features are sparse and that the non-zero elements simultaneously tend to 1. This evolution also corresponds to our target that DSEC learns one-hot vectors to encode and cluster patterns.
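The interval statistics of Fig. 8 (a) amount to a normalized histogram over ten bins. A minimal sketch (the array name is ours):

```python
import numpy as np

def interval_frequencies(indicator_features):
    """Frequencies of indicator-feature elements over [0,0.1), [0.1,0.2), ..., [0.9,1]."""
    bins = np.linspace(0.0, 1.0, 11)
    counts, _ = np.histogram(indicator_features.ravel(), bins=bins)
    return counts / counts.sum()

# For perfectly one-hot (n, k) indicator features, a fraction (k-1)/k of the mass
# falls in [0, 0.1) and 1/k in [0.9, 1], matching the final-stage distribution.
```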

5.6.8 Contribution of Pre-training
To explore the contribution of pre-training to the DSEC model, we run DSEC on networks pre-trained with unsupervised and supervised pre-training techniques. As a frequently used unsupervised method, the AE [18] (DSEC+AE) is employed to pre-train the networks in an unsupervised way. For supervised pre-training, DSEC+S(n) denotes that n labeled patterns are used to pre-train the networks in a supervised manner. From Fig. 8 (b), the pre-training techniques can significantly improve the performance of DSEC. This is expected, since pre-training initializes the networks to a justified status, which may help DSEC select more accurate labeled pairwise patterns during the initial stages. Further, supervised pre-training is always superior to unsupervised pre-training. A plausible reason is that labeled patterns guide the networks to a more informative initial status, which leads to better performance in DSEC.

5.6.9 Influence of Network Configuration
Three popular variants of VGG [34], i.e., VGG-13, VGG-16 and VGG-19, are employed to study the influence of the network configuration on DSEC. Fig. 8 (c) visually shows the differences between these networks. A possible reason for the performance difference in Fig. 8 (c) is over-fitting. Specifically, VGG-19 possesses about 6 million and 10 million more parameters than VGG-16 and VGG-13, respectively.


[Figure 9 has three panels: (a) clustering accuracy versus epoch (0–50) on ImageNet-Dog for initial thresholds (l, u) in {(0.60, 1.00), (0.70, 1.00), (0.80, 1.00), (0.90, 1.00), (0.80, 0.80), (0.80, 0.85), (0.80, 0.90), (0.80, 0.95)}; (b) clustering accuracy versus epoch (0–50) on CIFAR-10 for the constraints C, C0.5, C1.0, C2.0, C3.0, C4.0 and the baselines Null1 and Null2; (c) classification accuracy versus epoch (0–300) on ImageNet-10 for two families of curves with n in {32, 128, 512, 2048} labeled patterns (trained from scratch versus fine-tuned from DSEC).]

Fig. 9. (a) DSEC with diverse initial values of l and u. (b) DSEC with diverse clustering constraints. Specifically, Null1 means that the clustering constraint is omitted and our label inference is used for clustering, and Null2 indicates that the clustering constraint is omitted and K-means is used for clustering. (c) The contribution of pre-training for DSEC. Specifically, L(n) signifies that n labeled patterns are used to train networks from randomly initialized status, and L(n) means that n labeled patterns are utilized to fine-tune the networks pre-trained by DSEC.

Since there are only 13000 images in ImageNet-10, VGG-19 is prone to over-fitting, which consequently degrades its performance. Despite this difference, the clustering accuracies of all three networks (VGG-13, VGG-16 and VGG-19) are approximately 0.60, which outperforms the compared methods by significant margins (+0.20). These results indicate that DSEC achieves superior performance on diverse networks.

5.6.10 Sensitivity of Initialized Thresholds

To investigate the stability of the SECT algorithm, we compare the performance of DSEC under various initial values of l and u. From Fig. 9 (a), we observe that the initial values have a significant influence on the performance of DSEC. Because the DNNs are initialized randomly, more noisy labeled pairwise patterns may be employed for training if l is too large or u is too small. With more appropriate values, DSEC can select a sufficient number of highly confident labeled pairwise patterns. By learning with these selected pairwise patterns, DSEC can begin with more refined clusters and ultimately achieve better clustering results.

5.6.11 Sensitivity of Clustering Constraint

A series of experiments is conducted on ImageNet-Dog to study the sensitivity of DSEC to the clustering constraints. As depicted in Fig. 9 (b), the clustering constraint C is incapable of yielding one-hot vectors to recode patterns, even though it is possible in theory. A conceivable reason is that the search space of the clustering constraint C is too large and the back-propagation learning algorithm cannot attain the theoretical values. Compared with C, the search space of Cp is significantly reduced, which efficiently guides DSEC to learn one-hot vectors. Furthermore, the impact of Cp varies with the hyper-parameter p. Specifically, the peak performance of DSEC is obtained when p is approximately in the range [1, 2], while a too large (e.g., 4) or too small (e.g., 0.5) p may degrade the learning stability and the convergence speed, which is predictable. Intrinsically, the optimal indicator features consist purely of two essential elements, i.e., 0 and 1. For too small a p, Cp tends to learn small numbers to compose the indicator features, which makes it hard to learn the value 1. Conversely, too large a p is inefficient at learning the value 0 to form the optimal indicator features.

To verify the necessity of the clustering constraints, we compare our model with two baselines, i.e., Null1 and Null2. Specifically, Null1 means that the clustering

Fig. 10. Filters learned in the first convolutional layers based on (a) the developed DSEC model and (b) supervised learning.

constraint is omitted and our label inference is used for clustering, and Null2 indicates that the clustering constraint is omitted and K-means is used for clustering. The results in Fig. 9 (b) show that our clustering constraints are necessary for our model. If such constraints are omitted, Theorem 1 is not satisfied. As a result, DSEC can only learn feature representations, and an additional clustering method is required to cluster the data.

5.6.12 Benefit for Semi-supervised Learning Tasks
A semi-supervised task, i.e., classification with numerous unlabeled patterns and a small number of labeled patterns, is executed on the ImageNet-10 dataset to study the usefulness of the networks pre-trained by DSEC. In Fig. 9 (c), L(n) signifies that only n labeled patterns are employed to train networks from a randomly initialized status, and L(n) indicates that n labeled patterns are utilized to fine-tune the networks pre-trained by DSEC. Overall, the experimental results empirically confirm that superior classification results are achieved by the pre-trained networks with stable training processes. This demonstrates that the networks can be trained to a justified status and can be adapted with very limited labeled patterns.
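A rough sketch of the fine-tuning route, written with Keras and entirely under our own assumptions about the interface (a backbone pre-trained by DSEC is reused, a small softmax head is attached, and only n labeled patterns are used for training):

```python
from tensorflow import keras

def finetune(backbone, x_labeled, y_labeled, num_classes=10, epochs=50):
    """Fine-tune a DSEC-pre-trained backbone with a handful of labeled patterns."""
    head = keras.layers.Dense(num_classes, activation="softmax")(backbone.output)
    model = keras.Model(backbone.input, head)
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_labeled, y_labeled, epochs=epochs, batch_size=32)
    return model
```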

5.6.13 Filters Visualization
To study whether the proposed DSEC model can learn meaningful parameters, a DNN (whose configuration can be found in the supplementary material) is devised to present the learned filters in the first convolutional layer. As shown in Fig. 10, DSEC learns a variety of frequency- and orientation-selective filters, as well as various colored blobs, which are similar to the filters learned with label information. This verifies that DSEC is capable of learning informative filters that capture high-level representations in a purely unsupervised manner.

6 CONCLUSION

In this paper, we have developed a DNN-based clustering method, i.e., deep self-evolution clustering, which recasts the clustering task as judging with a DNN whether pairwise patterns belong to the same clusters. To generate more informative representations for clustering, constrained


indicator features have been introduced into the DNN to recode patterns with one-hot vectors. Since the ground-truth similarities are unavailable in clustering, the self-evolution clustering training algorithm has been proposed to select similar and dissimilar pairwise patterns and to train the DNN with the selected pairwise patterns alternately. Finally, DSEC clusters data purely by locating the largest response of the generated indicator features. In comparison with existing approaches, our approach achieves superior performance on image, text, and audio datasets concurrently.

Future work may include exploring more justified clustering constraints to learn the indicator features and generalizing DSEC to select the number of clusters automatically. For the first problem, a series of clustering constraints has been introduced as pioneering work. Nevertheless, which one is the best remains an open problem. It will be interesting in the future to adaptively learn clustering constraints to improve the clustering performance. For the second problem, we have predefined the number of clusters. However, it is hard to define it before clustering. Inspired by the DCD method [14], a possible direction is to introduce a residual similar to the Normalized Output Similarity Approximation Clusterability [14] into our model. Consequently, the number of clusters may be optimized and selected automatically.

ACKNOWLEDGMENTS

This work is supported in part by the National Natural Science Foundation of China (NSFC No. 91646207).

REFERENCES

[1] A. Ross, L. Samushia, C. Howlett, W. Percival, A. Burden, and M. Manera, “The clustering of the SDSS DR7 main galaxy sample – I. A 4 per cent distance measure at z = 0.15,” Eprint Arxiv, vol. 449, no. 1, pp. 835–847, 2015.
[2] N. Thanh, H. Le, and M. Ali, “Neutrosophic recommender system for medical diagnosis based on algebraic similarity measure and clustering,” in FUZZ-IEEE, 2017, pp. 1–6.
[3] S. Selva Kumar and H. Hannah Inbarani, “Analysis of mixed c-means clustering approach for brain tumour gene expression data,” International Journal of Data Analysis Techniques and Strategies, vol. 5, no. 5, pp. 214–228, 2013.
[4] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR, 2015, pp. 815–823.
[5] K. Wang, R. He, L. Wang, W. Wang, and T. Tan, “Joint feature selection and subspace learning for cross-modal retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2010–2023, 2016.
[6] L. Feng and B. Bhanu, “Semantic concept co-occurrence patterns for image annotation and retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, pp. 785–799, 2016.
[7] S. Husain and M. Bober, “Improving large-scale image retrieval through robust aggregation of local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, pp. 1783–1796, 2017.
[8] K. Gowda and G. Krishna, “Agglomerative clustering using the concept of mutual nearest neighbourhood,” Pattern Recognition, vol. 10, no. 2, pp. 105–112, 1978.
[9] A. Topchy, A. Jain, and W. Punch, “Clustering ensembles: Models of consensus and weak partitions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 12, pp. 1866–1881, 2005.
[10] R. Vidal, “Subspace clustering,” IEEE Signal Process. Mag., vol. 28, no. 2, pp. 52–68, 2011.
[11] J. Wang, J. Wang, J. Song, X. Xu, H. T. Shen, and S. Li, “Optimized cartesian k-means,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 1, pp. 180–192, 2015.
[12] S. Hong, J. Choi, J. Feyereisl, B. Han, and L. Davis, “Joint image clustering and labeling by matrix factorization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 7, pp. 1411–1424, 2016.
[13] P. Purkait, T. Chin, A. Sadri, and D. Suter, “Clustering with hypergraphs: The case for large hyperedges,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, pp. 1697–1711, 2017.
[14] Z. Yang, J. Corander, and E. Oja, “Low-rank doubly stochastic matrix decomposition for cluster analysis,” Journal of Machine Learning Research, vol. 17, pp. 187:1–187:25, 2016.
[15] Z. Yang, T. Hao, O. Dikmen, X. Chen, and E. Oja, “Clustering by nonnegative matrix factorization using graph random walk,” in NIPS, 2012, pp. 1088–1096.
[16] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005, pp. 886–893.
[17] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[18] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in NIPS, 2006, pp. 153–160.
[19] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[20] D. Kingma and M. Welling, “Auto-encoding variational bayes,” CoRR, vol. abs/1312.6114, 2013.
[21] A. Jain, M. Murty, and P. Flynn, “Data clustering: A review,” ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[22] B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii, “Scalable k-means++,” PVLDB, pp. 622–633, 2012.
[23] W. Wang, Y. Wu, C. Tang, and M. Hor, “Adaptive density-based spatial clustering of applications with noise (DBSCAN) according to data,” in ICMLC, 2015, pp. 445–451.
[24] A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
[25] F. Torre and T. Kanade, “Discriminative cluster analysis,” in ICML, 2006, pp. 241–248.
[26] J. Ye, Z. Zhao, and M. Wu, “Discriminative k-means for clustering,” in NIPS, 2007, pp. 1649–1656.
[27] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan, “Deep adaptive image clustering,” in ICCV, 2017, pp. 5879–5887.
[28] P. Franti, O. Virmajoki, and V. Hautamaki, “Fast agglomerative clustering using a k-nearest neighbor graph,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1875–1881, 2006.
[29] F. Zhou, F. Torre, and J. Hodgins, “Hierarchical aligned cluster analysis for temporal clustering of human motion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 3, pp. 582–596, 2013.
[30] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in CVPR, 2016, pp. 5147–5156.
[31] L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in NIPS, 2004, pp. 1601–1608.
[32] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[33] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105.
[34] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015, pp. 1–9.
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
[37] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in CVPR, 2015, pp. 1520–1528.
[38] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142–158, 2016.
[39] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, 2017.
[40] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, K. Wang, J. Yan, C. Loy, and X. Tang, “Deepid-net: Object detection with deformable part based convolutional neural networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 7, pp. 1320–1334, 2017.
[41] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.
[42] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in ICML, 2016, pp. 478–487.
[43] M. Paulin, J. Mairal, M. Douze, Z. Harchaoui, F. Perronnin, and C. Schmid, “Convolutional patch representations for image retrieval: an unsupervised approach,” arXiv preprint arXiv:1603.00438, 2016.
[44] J. Xie, G. Dai, F. Zhu, E. Wong, and Y. Fang, “Deepshape: Deep-learned shape descriptor for 3d shape retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 7, pp. 1335–1345, 2017.
[45] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[46] A. Ng, “Sparse autoencoder,” CS294A Lecture notes, vol. 72, pp. 1–19, 2011.
[47] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan, “Deep unsupervised learning with consistent inference of latent representations,” Pattern Recognition, vol. 77, pp. 438–453, 2017.
[48] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014, pp. 2672–2680.
[49] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow, “Adversarial autoencoders,” CoRR, vol. abs/1511.05644, 2015.
[50] J. Springenberg, “Unsupervised and semi-supervised learning with categorical generative adversarial networks,” CoRR, vol. abs/1511.06390, 2015.
[51] N. Dilokthanakul, P. Mediano, M. Garnelo, M. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan, “Deep unsupervised clustering with gaussian mixture variational autoencoders,” CoRR, vol. abs/1611.02648, 2016.
[52] C. Gross, “Genealogy of the ‘grandmother cell’,” Neuroscientist: A Review Journal Bringing Neurobiology, Neurology and Psychiatry, vol. 8, no. 5, pp. 512–518, 2002.
[53] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in ICML, 2009, pp. 41–48.
[54] Theano Development Team, “Theano: Deep learning tutorials - convolutional neural networks,” http://www.deeplearning.net/tutorial/lenet.html#lenet.
[55] F. Chollet, “Keras,” https://github.com/fchollet/keras, 2015.
[56] Theano Development Team, “Theano,” http://deeplearning.net/software/theano/.
[57] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[58] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Master's Thesis, Department of Computer Science, University of Toronto, 2009.
[59] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in AISTATS, 2011, pp. 215–223.
[60] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
[61] K. Lang, “Newsweeder: Learning to filter netnews,” in ICML, 1995, pp. 331–339.
[62] D. Lewis, Y. Yang, T. Rose, and F. Li, “RCV1: A new benchmark collection for text categorization research,” Journal of Machine Learning Research, vol. 5, pp. 361–397, 2004.
[63] J. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017, pp. 776–780.
[64] A. Strehl and J. Ghosh, “Cluster ensembles - a knowledge reuse framework for combining multiple partitions,” Journal of Machine Learning Research, vol. 3, pp. 583–617, 2002.
[65] L. Hubert and P. Arabie, “Comparing partitions,” Journal of Classification, vol. 2, no. 1, pp. 193–218, 1985.
[66] T. Li and C. Ding, “The relationships among various nonnegative matrix factorization methods for clustering,” in ICDM, 2006, pp. 362–371.
[67] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in ICCV, 2015, pp. 1026–1034.
[68] T. Tieleman and G. Hinton, “Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.
[69] L. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[70] P. Mehta and D. Schwab, “An exact mapping between the variational renormalization group and deep learning,” CoRR, vol. abs/1410.3831, 2014.

Jianlong Chang received the B.S. degree from the School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu, China, in 2015. He is currently pursuing the Ph.D. degree at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include pattern recognition, machine learning and deep reinforcement learning.

Gaofeng Meng received the B.S. degree in Applied Mathematics from Northwestern Polytechnical University in 2002, the M.S. degree in Applied Mathematics from Tianjin University in 2005, and the Ph.D. degree in Control Science and Engineering from Xi'an Jiaotong University in 2009. He is currently an associate professor at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. He also serves as an Associate Editor of Neurocomputing. His current research interests include document image processing, computer vision and pattern recognition.

Lingfeng Wang received his B.S. degree in computer science from Wuhan University, Wuhan, China, in 2007, and his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, in 2013. He is currently an associate professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include computer vision and image processing.

Shiming Xiang received the B.S. degree in mathematics from Chongqing Normal University, Chongqing, China, in 1993, the M.S. degree from Chongqing University, Chongqing, China, in 1996, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2004. From 1996 to 2001, he was a Lecturer with the Huazhong University of Science and Technology, Wuhan, China. He was a Postdoctorate Candidate with the Department of Automation, Tsinghua University, Beijing, China, until 2006. He is currently a Professor with the National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include pattern recognition, machine learning, and computer vision.

Chunhong Pan received his B.S. degree in automatic control from Tsinghua University, Beijing, China, in 1987, his M.S. degree from the Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, China, in 1990, and his Ph.D. degree in pattern recognition and intelligent system from the Institute of Automation, Chinese Academy of Sciences, Beijing, in 2000. He is currently a professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include computer vision, image processing, computer graphics, and remote sensing.