
Semi-Supervised Clustering with Multi-Viewpoint based Similarity Measure

Yang Yan, Lihui Chen, Duc Thang Nguyen School of Electric and Electronic Engineering

Nanyang Technological University Singapore, 639798

[email protected], [email protected], [email protected]

Abstract—The traditional (dis)similarity measure between a pair of data objects in a clustering method uses only a single viewpoint, usually the origin, as the only reference point. Recently, a novel multi-viewpoint based similarity (MVS) measure [1] was proposed, which uses many different viewpoints in the similarity measure and has been successfully applied to data clustering. In this paper, we study how semi-supervised MVS-based clustering can be developed by incorporating prior knowledge in the form of class labels, when they are available to the user. A novel search-based semi-supervised clustering method called CMVS is proposed in the MVS manner with the help of a small percentage of labeled objects. Two new clustering criterion functions are formulated accordingly, in which only these labeled objects are considered as viewpoints in the multi-viewpoint based similarity measure. A theoretical discussion is provided to ensure the newly proposed criterion functions make good use of the prior knowledge for the similarity measure, beyond seeding. An empirical study is performed on various benchmark datasets to demonstrate the effectiveness and verify the merit of the proposed semi-supervised MVS clustering.

Keywords-semi-supervised clustering; class label; similarity measure; multi-viewpoints

I. INTRODUCTION

Clustering is one of the most important and interesting topics in the data mining field. The purpose of clustering is to find the intrinsic structure of data and to group the data into meaningful subgroups based on some explicit or implicit similarity measure. Plenty of clustering algorithms have been proposed for very different research fields and developed with different techniques. Besides the well-known kmeans algorithm, which still remains one of the top 10 data mining algorithms today [2], other state-of-the-art techniques such as fuzzy clustering [3], non-negative matrix factorization (NMF) [4], spectral clustering [5], co-clustering [6], model-based clustering discussed in the survey paper [7] and relational clustering [8] have their own merits in various aspects and domains. For example, co-clustering is generally effective at handling high-dimensional data by simultaneously grouping documents and words based on their high co-occurrence, which performs the equivalent of a dimensionality reduction. Fuzzy clustering is used for categorization applications that require a realistic representation of overlapping clusters. Model-based clustering is good at outlier detection but usually has a higher complexity.

However, despite the various advanced underlying approaches, it is sometimes still difficult to categorize complicated datasets well with a completely unsupervised method, due to the complexity of the dataset, noise, and so on. It has been noticed that, in many real applications, various kinds of prior knowledge, which may be available to the users in the form of class labels or pairwise constraints, can be incorporated into clustering to guide the search process and thereby improve its performance. This strategy is called semi-supervised clustering [9]. In recent years, a number of semi-supervised clustering methods have been developed based on the clustering frameworks mentioned above.

Semi-supervised clustering strategies can be divided into search-based methods and similarity-adapting methods. The former [9-11] make use of the prior knowledge to guide the clustering process, while the latter [12, 13] focus on improving the effectiveness of the similarity measure through distance metric learning so that the prior knowledge can be satisfied. Note that when the clustering problem is described as an optimization problem, an optimal partition is found by optimizing a particular criterion function defined over the similarities among the data. In other words, the true intrinsic structure of the data can only be correctly discovered with a suitably defined similarity. Therefore, the similarity measure also plays a very important role in the effectiveness of clustering methods.

While the similarity between two objects is traditionally measured using only one reference point, Nguyen et al. [1] recently proposed a novel multi-viewpoint based similarity (MVS) measure, which utilizes many different viewpoints at the same time to assess the similarity between data objects in sparse and high-dimensional spaces, particularly text documents. In MVS, each object assumed not to be in the same cluster as the two objects being measured is treated as a single viewpoint. Theoretical analysis and empirical study have shown that two clustering criterion functions based on MVS not only provide better performance than a series of single-viewpoint based similarity (SVS) clustering approaches, but are also fast and scalable like kmeans.

In this paper, we focus on how a semi-supervised MVS-based clustering can be developed by incorporating some prior


knowledge in the form of class labels, when they are available to the user. With the help of a small percentage of labeled objects, a novel search-based semi-supervised clustering method is proposed. Two new criterion functions are formulated accordingly, in which only the labeled objects in the dataset are considered as viewpoints in this new similarity measure. Discussions focusing on the specific terms in the objective functions are provided to show that the proposed functions make good use of the prior knowledge in terms of the similarity measure, so that the misleading effect caused by improper viewpoints during the clustering process can be significantly reduced. Finally, we perform an extensive experimental study on a number of benchmark datasets to verify the effectiveness of the proposed methods by comparing them with popular semi-supervised clustering algorithms that use single-viewpoint or implicit similarity measures.

The rest of this paper is organized as follows. In Section II, the idea of MVS is explained in detail, followed by the formulation of two existing MVS-based criterion functions. We propose our new clustering method in Section III: two new criterion functions are formulated, and an incremental optimization algorithm to perform the clustering is presented. Experimental results on real-world benchmark datasets are reported and discussed in Section IV. Finally, conclusions and potential future work are given in Section V.

II. RELATED WORKS

First of all, Table I summarizes the basic notations that will be used extensively throughout this paper. Each object (referring to a single document) in a text document corpus corresponds to an m-dimensional normalized vector d, where m is the total number of terms (words) in the corpus.

A. Single Viewpoint Similarity

Before the multi-viewpoint based similarity is explained, a few of the most popular traditional measures using a single viewpoint are briefly reviewed. In the literature, the Euclidean distance given in (1) and the cosine similarity (CS) given in (2) are two of the most popular measures. The former is used in the traditional kmeans algorithm, while the latter is used in spherical kmeans [14] to handle data in a sparse and high-dimensional space, such as text documents. The Euclidean distance between an object and its cluster center should be minimized, while the cosine similarity between them should be maximized. Cosine similarity is widely applied as a core similarity measurement in many other document clustering methods, such as Min-Max cut [15] and Normalized Cut [5] graph-based clustering. Another popular measure, the extended Jaccard coefficient [16], combines features of cosine similarity and Euclidean distance: both the magnitude and the direction of the document vectors are taken into account. In general, cosine similarity is the most popular one because of its simple interpretation. It is also the base of the MVS measure.

\[ \mathrm{Dist}(d_i, d_j) = \lVert d_i - d_j \rVert \tag{1} \]

\[ \mathrm{Sim}(d_i, d_j) = \cos(d_i, d_j) = d_i^{t} d_j \tag{2} \]
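As a quick illustration of these two single-viewpoint measures, here is a minimal Python sketch (our own, not code from the paper) that evaluates (1) and (2) for a pair of L2-normalized document vectors; with unit-length vectors the cosine similarity reduces to a plain dot product.

```python
import numpy as np

def euclidean_dist(di, dj):
    # Eq. (1): Euclidean distance between two document vectors.
    return np.linalg.norm(di - dj)

def cosine_sim(di, dj):
    # Eq. (2): with ||d|| = 1 the cosine of the angle is just the dot product.
    return float(di @ dj)

# Toy example with two unit-length "document" vectors.
di = np.array([0.6, 0.8, 0.0])
dj = np.array([0.0, 0.6, 0.8])
print(euclidean_dist(di, dj), cosine_sim(di, dj))
```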

TABLE I. NOTATIONS

Notation                 Description
n                        number of documents in the collection
nr                       number of documents in cluster r
g                        number of labeled documents in the collection
gr                       number of labeled documents in cluster r
m                        number of terms
c                        number of classes
k                        number of clusters
d                        object (document vector), ||d|| = 1
S = {d1, ..., dn}        set of all the documents
Sr                       set of the documents in cluster r
LS = {l1, ..., lg}       set of all the labeled documents
LSr                      set of the labeled documents in cluster r
D = \sum_{d_i \in S} d_i         composite vector of all the documents
Dr = \sum_{d_i \in S_r} d_i      composite vector of cluster r
L = \sum_{l_i \in LS} l_i        composite vector of all labeled documents
Lr = \sum_{l_i \in LS_r} l_i     composite vector of labeled documents in cluster r
Cr = Dr / nr             centroid vector of cluster r
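The composite vectors of Table I are the only cluster statistics the criterion functions below will need. The following small Python helper (our own naming and layout, shown only as a sketch) collects them from an L2-normalized document matrix X and a cluster assignment.

```python
import numpy as np

def cluster_composites(X, assign, k, labeled_idx=None):
    """X: (n, m) L2-normalized document vectors; assign: cluster index per document;
    labeled_idx: indices of the labeled documents (the set LS), if any."""
    D = X.sum(axis=0)                                                # D: all documents
    Dr = np.stack([X[assign == r].sum(axis=0) for r in range(k)])    # D_r per cluster
    nr = np.array([(assign == r).sum() for r in range(k)])           # n_r per cluster
    Cr = Dr / np.maximum(nr, 1)[:, None]                             # C_r = D_r / n_r
    L, Lr = None, None
    if labeled_idx is not None:
        XL, aL = X[labeled_idx], assign[labeled_idx]
        L = XL.sum(axis=0)                                           # L: all labeled docs
        Lr = np.stack([XL[aL == r].sum(axis=0) for r in range(k)])   # L_r per cluster
    return D, Dr, nr, Cr, L, Lr
```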

B. Multi-Viewpoint based Similarity

As pointed out in [1], the cosine similarity in (2) can be understood as the cosine of the angle between two document vectors measured at the origin, i.e. the vector 0. Hence, it is a single viewpoint-based measure. The motivation of MVS is that a more accurate assessment of how close or distant a pair of documents (di and dj) is can be obtained if we measure them from more than one viewpoint. For example, from a third point dh, the directions and distances to di and dj are indicated by the two new vectors (di - dh) and (dj - dh), respectively. Therefore, working on difference vectors with respect to a number of different viewpoints, the similarity between a pair of documents is defined as:

\[
\begin{aligned}
\mathrm{Sim}(d_i, d_j \mid d_i, d_j \in S_r)
 &= \frac{1}{n-n_r}\sum_{d_h \in S\setminus S_r}\mathrm{Sim}(d_i-d_h,\ d_j-d_h) \\
 &= \frac{1}{n-n_r}\sum_{d_h \in S\setminus S_r}\cos(d_i-d_h,\ d_j-d_h)\,\lVert d_i-d_h\rVert\,\lVert d_j-d_h\rVert \\
 &= \frac{1}{n-n_r}\sum_{d_h \in S\setminus S_r}\big(d_i^{t}d_j - d_i^{t}d_h - d_j^{t}d_h + d_h^{t}d_h\big)
\end{aligned}
\tag{3}
\]

As described by (3), the similarity of two documents di and dj, given that they are in cluster r, is defined as the average of the similarities measured relative to all the documents outside cluster r, taken as viewpoints. Each individual relative similarity is defined by the dot product of the two difference vectors (di - dh) and (dj - dh), which is equivalent to the product of the cosine of the angle between them and their Euclidean lengths. The interesting point is that the MVS measure not only reflects the intra-similarity between di and dj relative to dh, but also provides a measure of inter-similarity between di / dj and dh through the Euclidean distances, since dh must not belong to cluster r. It implies that if dh in fact has a higher chance of being clustered together with di and dj, the similarity weight based on viewpoint dh also becomes smaller, being multiplied by a smaller ||di - dh|| or ||dj - dh|| value.
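To make (3) concrete, the sketch below (an illustration under our own variable names, not the authors' code) averages the relative similarities of di and dj over a matrix of viewpoints taken from outside cluster r.

```python
import numpy as np

def mvs_similarity(di, dj, viewpoints):
    """Eq. (3): average of (d_i - d_h)^t (d_j - d_h) over all viewpoints d_h,
    where `viewpoints` is the (n - n_r, m) matrix of documents outside cluster r."""
    rel = (di - viewpoints) * (dj - viewpoints)   # elementwise products, one row per viewpoint
    return rel.sum(axis=1).mean()                 # dot products, then the average over viewpoints
```

Because each term is the dot product of two difference vectors, a viewpoint lying close to di and dj automatically contributes a smaller weight, exactly as discussed above.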

As argued in [1], while most viewpoints are useful, some of them may give misleading information. Therefore, a large enough number of viewpoints is usually required to balance out the effect of the misleading ones. In that case, if the majority of them are useful, a more informative similarity can be offered than the single origin-point based similarity measure.

C. Clustering criterion functions based on MVS

As proposed in [1], two clustering criterion functions were formulated based on the MVS measure. The first one, called IR, is the cluster size-weighted sum of the average pairwise similarities of documents in the same cluster. The sum can be expressed as:

\[ F = \sum_{r=1}^{k} n_r\Big[\frac{1}{n_r^{2}}\sum_{d_i,d_j\in S_r}\mathrm{Sim}(d_i,d_j)\Big] \tag{4} \]

According to the last expression of Sim(di, dj | di, dj ∈ Sr) in (3), we have:

\[
\begin{aligned}
F &= \sum_{r=1}^{k}\frac{1}{n_r}\sum_{d_i,d_j\in S_r}\Big[\frac{1}{n-n_r}\sum_{d_h\in S\setminus S_r}\big(d_i^{t}d_j - d_i^{t}d_h - d_j^{t}d_h + d_h^{t}d_h\big)\Big] \\
  &= \sum_{r=1}^{k}\frac{1}{n_r}\Big[\lVert D_r\rVert^{2} - \frac{2n_r}{n-n_r}D_r^{t}(D-D_r) + n_r^{2}\Big] \\
  &= \sum_{r=1}^{k}\frac{1}{n_r}\Big[\frac{n+n_r}{n-n_r}\lVert D_r\rVert^{2} - \frac{2n_r}{n-n_r}D_r^{t}D\Big] + n
\end{aligned}
\]

As reported in [1], this formulation is expected to be quite sensitive to the cluster size nr without the help of a regulating factor α. In addition, n is a constant; therefore, the final form of IR can be expressed as:

\[ I_R = \sum_{r=1}^{k}\frac{1}{n_r^{1-\alpha}}\Big[\frac{n+n_r}{n-n_r}\lVert D_r\rVert^{2} - \frac{2n_r}{n-n_r}D_r^{t}D\Big] \tag{5} \]

In this formulation with α, a better cluster quality is obtained by a higher IR value, although the criterion may still be sensitive to the cluster sizes. The second criterion function IV, which instead considers the similarity between each document vector and its cluster's centroid, may prevent this problem. It is expressed through the criterion function G below:

\[ G = \sum_{r=1}^{k}\frac{1}{n-n_r}\sum_{d_i\in S_r}\sum_{d_h\in S\setminus S_r}\mathrm{Sim}\Big(d_i-d_h,\ \frac{C_r}{\lVert C_r\rVert}-d_h\Big) \tag{6} \]

Similar to IR, the final formulation of IV can be derived by expanding the vector dot products in a few steps:

\[ I_V = \sum_{r=1}^{k}\Big[\lVert D_r\rVert + \frac{n_r+\lVert D_r\rVert}{n-n_r}\lVert D_r\rVert - \frac{n_r+\lVert D_r\rVert}{n-n_r}\cdot\frac{D_r^{t}D}{\lVert D_r\rVert}\Big] \tag{7} \]

As derived from (3), both IR and IV contain an intra-cluster similarity term and an inter-cluster similarity term. The former is represented by ||Dr||^2 in IR and by ||Dr|| in IV, respectively, while the latter is represented by Dr^t D.
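As the reconstructed forms (5) and (7) above show, both criteria can be evaluated from n, the cluster sizes nr and the composite vectors Dr alone. A hedged Python sketch of that computation (our own code, assuming the Dr/nr arrays produced by the earlier helper) is:

```python
import numpy as np

def criterion_IR(Dr, nr, alpha=0.3):
    """Eq. (5) as reconstructed above: cluster-size regulated MVS criterion."""
    n, D = nr.sum(), Dr.sum(axis=0)
    val = 0.0
    for r in range(len(nr)):
        intra = (n + nr[r]) / (n - nr[r]) * (Dr[r] @ Dr[r])   # ||D_r||^2 term
        inter = 2.0 * nr[r] / (n - nr[r]) * (Dr[r] @ D)       # D_r^t D term
        val += (intra - inter) / nr[r] ** (1.0 - alpha)
    return val

def criterion_IV(Dr, nr):
    """Eq. (7) as reconstructed above: centroid-based MVS criterion."""
    n, D = nr.sum(), Dr.sum(axis=0)
    val = 0.0
    for r in range(len(nr)):
        norm_r = np.linalg.norm(Dr[r])
        w = (nr[r] + norm_r) / (n - nr[r])
        val += norm_r + w * norm_r - w * (Dr[r] @ D) / norm_r
    return val
```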

In the next section, we aim to develop suitable MVS-based clustering criterion functions with the help of some prior knowledge.

III. CONSTRAINED-MVS CLUSTERING

A. New viewpoint set based on labeling

As reviewed in Section II, the effectiveness of MVS clustering mainly depends on the overall quality of the viewpoints. The distribution of the viewpoints is formed immediately after the first cluster partitioning is done. Therefore, MVSC suffers from the same sensitivity to random initialization as kmeans.

We believe that it is possible to reduce the negative effect of misleading viewpoints if we can create a smaller but higher-quality viewpoint set for each object, rather than needing a very large number of viewpoints in order to achieve a good balance between the good and the bad ones.

Now suppose some prior knowledge, in the form of the exact class labels of a few objects in the dataset, is available to the user. These labeled objects, collected in the label set LS, can be used as seeds of the initial clusters instead of randomly choosing some objects in the dataset as seeds, which is the traditional way applied in most search-based methods, to improve the performance. More than that, LS is also a good choice to fulfill the requirement of a high-quality viewpoint set. In this proposal, to measure the similarity between two objects in the same cluster, or between an object and its centroid, only the labeled objects not within that cluster serve as viewpoints. In principle, the members of a particular cluster should stay "far away" from those which belong to other clusters. Therefore, a labeled object has a much higher chance of serving as a good viewpoint than an unlabeled object. Once a few class labels are available, the concept of MVS can then be successfully applied.

However, as the definition (scope) of the viewpoints is changed due to the use of the prior knowledge, the expression of Sim(di, dj | di, dj ∈ Sr) is modified as below:

\[
\begin{aligned}
\mathrm{Sim}(d_i, d_j \mid d_i, d_j \in S_r)
 &= \frac{1}{g-g_r}\sum_{d_h\in LS\setminus LS_r}\mathrm{Sim}(d_i-d_h,\ d_j-d_h) \\
 &= \frac{1}{g-g_r}\sum_{d_h\in LS\setminus LS_r}\big(d_i^{t}d_j - d_i^{t}d_h - d_j^{t}d_h + d_h^{t}d_h\big)
\end{aligned}
\tag{8}
\]

B. Two new criterion functions CMVS-IR and CMVS-IV

Having the new similarity measure in (8), two new criterion functions can be formulated by again summing the average pairwise similarities of documents in the same cluster, or the similarity between each document and its cluster's centroid. Since in this manner the choice of viewpoints is restricted to the labeled document set, we denote our new clustering framework by CMVS, meaning Constrained Clustering with Multi-Viewpoint based Similarity. Accordingly, we name these two new criterion functions CMVS-IR and CMVS-IV, respectively.

Eq. (4) can be used again for the case of pairwise similarities. Applying (8) to all the object pairs in cluster r, we have:

\[
\begin{aligned}
\sum_{d_i,d_j\in S_r}\mathrm{Sim}(d_i,d_j)
 &= \sum_{d_i,d_j\in S_r}\frac{1}{g-g_r}\sum_{d_h\in LS\setminus LS_r}\big(d_i^{t}d_j - d_i^{t}d_h - d_j^{t}d_h + d_h^{t}d_h\big) \\
 &= \lVert D_r\rVert^{2} - \frac{2n_r}{g-g_r}D_r^{t}(L-L_r) + n_r^{2}
\end{aligned}
\]

Substituting this into (4), adding the regulating factor and removing the constant n, we finally obtain the function to be maximized:

\[ I_{RL} = \sum_{r=1}^{k}\frac{1}{n_r^{1-\alpha}}\Big[\lVert D_r\rVert^{2} - \frac{2n_r}{g-g_r}D_r^{t}(L-L_r)\Big] \tag{9} \]

According to (8), (6) can be modified as below to optimize the similarity between each document and its cluster centroid:

\[ G = \sum_{r=1}^{k}\frac{1}{g-g_r}\sum_{d_i\in S_r}\sum_{d_h\in LS\setminus LS_r}\mathrm{Sim}\Big(d_i-d_h,\ \frac{C_r}{\lVert C_r\rVert}-d_h\Big) \tag{10} \]

Since \(C_r/\lVert C_r\rVert = D_r/\lVert D_r\rVert\), expanding the vector dot products we have:

\[
\begin{aligned}
\sum_{d_i\in S_r}\sum_{d_h\in LS\setminus LS_r}\mathrm{Sim}\Big(d_i-d_h,\ \frac{C_r}{\lVert C_r\rVert}-d_h\Big)
 &= \sum_{d_i\in S_r}\sum_{d_h\in LS\setminus LS_r}(d_i-d_h)^{t}\Big(\frac{D_r}{\lVert D_r\rVert}-d_h\Big) \\
 &= (g-g_r)\lVert D_r\rVert - \Big(1+\frac{n_r}{\lVert D_r\rVert}\Big)D_r^{t}(L-L_r) + n_r(g-g_r)
\end{aligned}
\]

Substituting the above into (10) and again eliminating the constant n, we find that maximizing G is equivalent to maximizing IVL below:

\[ I_{VL} = \sum_{r=1}^{k}\Big[\lVert D_r\rVert - \frac{n_r+\lVert D_r\rVert}{g-g_r}\cdot\frac{D_r^{t}(L-L_r)}{\lVert D_r\rVert}\Big] \tag{11} \]

Comparing (9) and (11) with (5) and (7) respectively, we can see clearly how the prior knowledge has been incorporated into this new MVS framework via the new inter-similarity term Dr^t(L - Lr). During the clustering process, this term is minimized to ensure that all the objects in a particular cluster r, represented by Dr, stay "far away" from the set of labeled non-cluster-r objects, represented by L - Lr. In principle, this is what we desire if the similarity measure is to appropriately reveal the intrinsic structure of the dataset.
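A corresponding sketch for the two constrained criteria, again assuming the composite vectors of Table I and the equations as reconstructed in (9) and (11) (our own illustrative code, not the authors' implementation):

```python
import numpy as np

def criterion_IRL(Dr, nr, L, Lr, gr, alpha=0.3):
    """Eq. (9): CMVS-IR, with only the labeled documents as viewpoints."""
    g = gr.sum()
    val = 0.0
    for r in range(len(nr)):
        inter = 2.0 * nr[r] / (g - gr[r]) * (Dr[r] @ (L - Lr[r]))   # D_r^t (L - L_r)
        val += (Dr[r] @ Dr[r] - inter) / nr[r] ** (1.0 - alpha)
    return val

def criterion_IVL(Dr, nr, L, Lr, gr):
    """Eq. (11): CMVS-IV, the centroid-based constrained variant."""
    g = gr.sum()
    val = 0.0
    for r in range(len(nr)):
        norm_r = np.linalg.norm(Dr[r])
        w = (nr[r] + norm_r) / (g - gr[r])
        val += norm_r - w * (Dr[r] @ (L - Lr[r])) / norm_r
    return val
```

Maximizing either value rewards the intra-cluster terms while penalizing the inter-similarity term Dr^t(L - Lr) discussed above.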

C. Optimization algorithm and complexity

This section shows how to perform clustering using a greedy algorithm that optimizes these functions. An incremental k-way algorithm [17] is employed to perform document clustering by optimizing a given criterion function. The expressions of IRL/IVL in the CMVS framework depend only on nr and Dr, r = 1, ..., k; L and Lr are fixed once the clustering process begins and remain so until convergence. Hence, IRL/IVL can be written in the general form:

\[ I = \sum_{r=1}^{k} I_r(n_r, D_r) \tag{12} \]

where Ir(nr, Dr) corresponds to the criterion function value of cluster r. The same applies to IR/IV. With this general form, the algorithm contains two major steps, namely Initialization with the labeled object set and Refinement on the unlabeled object set, as described in Fig. 1. At Initialization, a few labeled objects are randomly selected from the whole dataset and kept in the cluster each of them truly belongs to; the mean of these objects in each cluster is computed and treated as the center, and the initial partition of the whole dataset is formed based on these centers. Refinement is a procedure consisting of a number of iterations. In each iteration, the objects in the unlabeled set are visited one by one in a totally random order without repetition. Each object is checked to see whether moving it to another cluster would improve the criterion function. If yes, the object is moved to the cluster

Procedure INITIALIZATION
    Select a set of labeled objects LS = {l1, ..., lg} randomly
    Initialize the centers c1, ..., ck by computing the mean vector of the labeled objects within each cluster
    Select the centroids for the clusters that have no label assigned
    Form the initial partition by assigning every unlabeled document to its closest center;
        Dr = \sum_{d_i \in S_r} d_i,  nr = |Sr|,  r = 1, ..., k
end procedure

Procedure REFINEMENT
    repeat
        {v[1 : (n - g)]} <- random permutation of the indices of S \ LS
        for j = 1 : (n - g) do
            i <- v[j];  p <- cluster of di
            ΔIp <- Ip(np - 1, Dp - di) - Ip(np, Dp)
            q <- argmax_{q != p} { Iq(nq + 1, Dq + di) - Iq(nq, Dq) }
            ΔIq <- Iq(nq + 1, Dq + di) - Iq(nq, Dq)
            if ΔIp + ΔIq > 0 then
                Move di to cluster q
                Update Dp, np, Dq, nq
            end if
        end for
    until no move for all (n - g) unlabeled documents
end procedure

Figure 1. Detailed steps of the CMVS algorithm

1http://glaros.dtc.umn.edu/gkhome/views/cluto

which leads to the highest improvement; otherwise, the object stays where it is. The clustering process terminates when an iteration completes without any object being moved to a new cluster.
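The following Python sketch mirrors the two-step procedure of Fig. 1 for an arbitrary per-cluster criterion Ir(r, nr, Dr), cf. Eq. (12). It is our own illustration: the cosine-based initial assignment, the random fallback for clusters that receive no labeled object, and the guard against emptying a cluster are assumptions, not details taken from the paper.

```python
import numpy as np

def cmvs_cluster(X, k, labeled_idx, labeled_y, Ir, max_iter=200, rng=None):
    """X: (n, m) L2-normalized documents; labeled_idx / labeled_y: numpy arrays with the
    seed set LS and its class labels; Ir(r, nr, Dr): value of cluster r's term in Eq. (12)."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    # Initialization: centers are the means of the labeled objects of each class.
    centers = np.zeros((k, X.shape[1]))
    for r in range(k):
        seeds = X[labeled_idx][labeled_y == r]
        centers[r] = seeds.mean(axis=0) if len(seeds) else X[rng.integers(n)]  # assumption
    assign = np.argmax(X @ centers.T, axis=1)      # initial partition by cosine (assumption)
    assign[labeled_idx] = labeled_y                # seeds stay in their true clusters
    nr = np.bincount(assign, minlength=k).astype(float)
    Dr = np.stack([X[assign == r].sum(axis=0) for r in range(k)])
    # Refinement: visit unlabeled documents in random order until no move helps.
    unlabeled = np.setdiff1d(np.arange(n), labeled_idx)
    for _ in range(max_iter):
        moved = False
        for i in rng.permutation(unlabeled):
            p, d = assign[i], X[i]
            if nr[p] <= 1:
                continue                           # do not empty a cluster (simplification)
            delta_p = Ir(p, nr[p] - 1, Dr[p] - d) - Ir(p, nr[p], Dr[p])
            delta_q, q = max((Ir(c, nr[c] + 1, Dr[c] + d) - Ir(c, nr[c], Dr[c]), c)
                             for c in range(k) if c != p)
            if delta_p + delta_q > 0:              # the move improves the criterion
                assign[i], moved = q, True
                nr[p] -= 1; Dr[p] -= d
                nr[q] += 1; Dr[q] += d
        if not moved:
            break
    return assign
```

In practice Ir would be the single-cluster term of (9) or (11), closed over L, Lr and gr.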

During the optimization procedure, the main sources of computational cost in each iteration are:

• Searching for the optimum cluster to move each individual unlabeled document to: O(nz · k).

• Updating the composite vectors as a result of such moves: O(m · k).

where nz is the total number of non-zero entries in all unlabeled document vectors. If τ denotes the number of iterations the algorithm takes, then since nz is always much larger than m in the document domain, the overall computational complexity is O(nz · k · τ). We can therefore see that the complexity of the semi-supervised MVS clustering is lower than that of the unsupervised version, since only the (n - g) unlabeled documents are visited.

IV. EXPERIMENTS & DISCUSSION

A. Document Collections

To verify the advantages of our proposed methods, we evaluate their performance in experiments on 9 benchmark text datasets. These datasets are provided together with CLUTO1 by the toolkit's authors. The corpora present a diversity of sizes, numbers of classes and class balances; Table II summarizes their characteristics. Balance is the ratio of the smallest class size to the largest class size in a particular dataset. All datasets are highly unbalanced except for classic. They were all preprocessed by standard procedures, including stop-word removal, stemming, removal of too rare as well as too frequent words, TF-IDF weighting and normalization.
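For readers who want to reproduce a comparable setup from raw text rather than from the CLUTO-provided matrices, a typical scikit-learn preprocessing sketch is shown below (our illustration only; the min_df/max_df cut-offs are arbitrary, and stemming, which scikit-learn does not perform by itself, would need a custom tokenizer).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny placeholder corpus; replace with the actual documents.
docs = [
    "kmeans clustering of text documents",
    "spectral clustering of text data",
    "document clustering with similarity measures",
    "similarity measures for text documents",
]

vectorizer = TfidfVectorizer(
    stop_words="english",   # stop-word removal
    min_df=2,               # drop words that are too rare
    max_df=0.9,             # drop words that are too frequent
    norm="l2",              # unit-length vectors, so that ||d|| = 1 as in Table I
)
X = vectorizer.fit_transform(docs)   # (n, m) sparse TF-IDF document-term matrix
```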

B. Experimental Setup

To demonstrate how well the two constrained MVS clustering approaches perform, we compare them with other label-constrained clustering approaches based on spherical kmeans and NMF. These two approaches are both well known and widely used in various applications [14, 18]. The former uses the single-viewpoint cosine similarity measure, while the latter uses an implicit similarity measure. Through the comparison with these two, we aim not only to validate the design of the newly proposed semi-supervised clustering approach, but also to verify whether the strength of MVS carries over to its semi-supervised form. In summary, the four algorithms are:

• CMVS-IR: CMVS using criterion function IR

• CMVS-IV: CMVS using criterion function IV

• C-spkmeans [9]: constrained spherical kmeans

• SS-NMF [19] : label-based semi-supervised NMF

The regulating factor α in IR is fixed to 0.3 during the experiments; as reported in [1], it is one of the most appropriate values. For each dataset, the number of clusters is predefined to be equal to the number of true classes, i.e. k = c. The amount of prior knowledge is measured as a proportion of n in a particular dataset. In order to show that our proposed CMVS-IR and CMVS-IV

TABLE II. DOCUMENT DATASETS

Data      Source                c    n     m      Balance
classic   CACM/CISI/CRAN/MED    4    7089  12009  0.323
re0       Reuters               13   1504  2886   0.018
re1       Reuters               25   1657  3758   0.027
reviews   TREC                  5    4069  23220  0.099
tr11      TREC                  9    414   6424   0.045
tr23      TREC                  6    204   5831   0.066
tr31      TREC                  7    927   10127  0.006
sports    TREC                  7    8580  18324  0.036
k1b       WebACE                6    2340  13859  0.043

still work well with a very small number of labels, we first show the performance on 4 datasets with only 1% to 3% labels. Then, we gradually increase the prior knowledge from 5% to 20%. The stopping threshold for C-spkmeans and SS-NMF is set to 1E-8 and τmax is set to 200. Similar to [1], at each prior-knowledge level ten independent tests with random initialization and random label selection are performed, and in each test the trial with the best corresponding objective function value is chosen. The results reported in this paper for each dataset and each clustering method are the average of these 10 best trials.

C. Performance Evaluation

The clustering result is evaluated by comparing each document's assigned cluster with its true class label. Two external evaluation metrics, namely Accuracy and NMI, are used to assess the performance. Accuracy measures the fraction of documents that are correctly assigned, assuming a one-to-one correspondence between true classes and assigned clusters. Let q denote any possible permutation of the index set {1, ..., k}; Accuracy is calculated by:

\[ \mathrm{Accuracy} = \frac{1}{n}\max_{q}\sum_{r=1}^{k} n_{r,\,q(r)} \]

where n_{r,q(r)} is the number of documents in cluster r whose true class is q(r).

From another perspective, Normalized Mutual Information (NMI) measures the information shared by the true class partition and the cluster assignment, i.e. how much knowing the clusters helps us know the classes:

\[ \mathrm{NMI} = \frac{\displaystyle\sum_{i=1}^{k}\sum_{j=1}^{k} n_{i,j}\,\log\frac{n\cdot n_{i,j}}{n_i\,\hat{n}_j}}{\sqrt{\Big(\displaystyle\sum_{i=1}^{k} n_i\log\frac{n_i}{n}\Big)\Big(\displaystyle\sum_{j=1}^{k}\hat{n}_j\log\frac{\hat{n}_j}{n}\Big)}} \]

where n_i denotes the number of documents in cluster i, n̂_j the number of documents in class j, and n_{i,j} the number of documents that belong to both cluster i and class j.
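Both metrics are straightforward to compute from the contingency matrix between clusters and true classes. The sketch below is our own; the best permutation q in the Accuracy formula is found with the Hungarian algorithm (scipy's linear_sum_assignment), which is equivalent to the maximization over all permutations when k = c, as in these experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(clusters, classes, k):
    # M[i, j] = number of documents placed in cluster i whose true class is j
    # (cluster and class indices are both assumed to lie in 0..k-1).
    M = np.zeros((k, k))
    for i, j in zip(clusters, classes):
        M[i, j] += 1
    return M

def accuracy(clusters, classes, k):
    M = contingency(clusters, classes, k)
    rows, cols = linear_sum_assignment(-M)      # best one-to-one cluster/class matching
    return M[rows, cols].sum() / len(clusters)

def nmi(clusters, classes, k):
    M = contingency(clusters, classes, k)
    n, ni, nj = M.sum(), M.sum(axis=1), M.sum(axis=0)
    nz = M > 0
    mutual = (M[nz] * np.log(n * M[nz] / np.outer(ni, nj)[nz])).sum()
    h_clu = -(ni[ni > 0] * np.log(ni[ni > 0] / n)).sum()
    h_cls = -(nj[nj > 0] * np.log(nj[nj > 0] / n)).sum()
    return mutual / np.sqrt(h_clu * h_cls)
```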

D. Results & Discussion

Due to space limitations, Table III shows the Accuracy and NMI results of the four semi-supervised clustering algorithms on only four datasets with 1% to 3% labels. More results on all nine datasets with more available prior knowledge are then presented in Table IV. For each dataset at a specific prior-knowledge level, the value in bold and underlined is the best result, while the value in bold only is the second best. The results of the original MVSC-IR, MVSC-IV, spherical kmeans and NMF are provided at a prior-knowledge level of 0.

First, we see from Table III that, with such a low rate of prior knowledge, in other words with a limited number of high-

TABLE III. CLUSTERING RESULTS ON 4 DATASETS WITH 1-3% LABELS

                   Accuracy                                     NMI
Data     % label   CMVS-IR  CMVS-IV  C-spkmeans  SS-NMF    CMVS-IR  CMVS-IV  C-spkmeans  SS-NMF
classic  0         0.648    0.724    0.618       0.593     0.574    0.644    0.577       0.560
         1%        0.719    0.828    0.717       0.627     0.586    0.678    0.643       0.581
         2%        0.719    0.830    0.721       0.672     0.586    0.684    0.646       0.587
         3%        0.729    0.841    0.725       0.684     0.578    0.690    0.649       0.601
re1      0         0.442    0.420    0.443       0.399     0.591    0.583    0.593       0.523
         1%        0.487    0.443    0.440       0.402     0.598    0.587    0.561       0.548
         2%        0.519    0.489    0.476       0.429     0.619    0.607    0.581       0.561
         3%        0.527    0.529    0.531       0.448     0.623    0.624    0.618       0.568
sports   0         0.717    0.752    0.697       0.651     0.669    0.719    0.633       0.619
         1%        0.814    0.866    0.722       0.694     0.723    0.760    0.690       0.654
         2%        0.848    0.868    0.726       0.703     0.751    0.762    0.698       0.660
         3%        0.860    0.872    0.744       0.724     0.756    0.744    0.686       0.656
k1b      0         0.825    0.761    0.779       0.824     0.701    0.671    0.649       0.680
         1%        0.863    0.814    0.801       0.825     0.736    0.723    0.715       0.711
         2%        0.889    0.866    0.835       0.859     0.744    0.743    0.742       0.749
         3%        0.902    0.896    0.870       0.863     0.773    0.758    0.755       0.752

TABLE IV. CLUSTERING RESULTS ON 9 DATASETS WITH VARIOUS PERCENTAGES OF LABELS

                   Accuracy                                     NMI
Data     % label   CMVS-IR  CMVS-IV  C-spkmeans  SS-NMF    CMVS-IR  CMVS-IV  C-spkmeans  SS-NMF
classic  5%        0.741    0.856    0.734       0.702     0.599    0.705    0.653       0.600
         10%       0.797    0.868    0.755       0.762     0.642    0.720    0.663       0.627
         15%       0.841    0.875    0.774       0.796     0.680    0.728    0.674       0.671
         20%       0.859    0.883    0.795       0.821     0.701    0.737    0.688       0.694
re0      5%        0.521    0.487    0.437       0.443     0.485    0.458    0.448       0.421
         10%       0.566    0.542    0.478       0.521     0.529    0.504    0.468       0.479
         15%       0.607    0.590    0.545       0.539     0.568    0.545    0.496       0.482
         20%       0.654    0.617    0.577       0.588     0.579    0.573    0.525       0.538
re1      5%        0.585    0.548    0.568       0.462     0.643    0.630    0.640       0.569
         10%       0.641    0.625    0.646       0.514     0.675    0.644    0.673       0.591
         15%       0.707    0.663    0.715       0.533     0.706    0.677    0.724       0.606
         20%       0.744    0.689    0.748       0.548     0.728    0.697    0.738       0.629
reviews  5%        0.800    0.852    0.811       0.743     0.680    0.712    0.684       0.658
         10%       0.811    0.868    0.826       0.796     0.688    0.731    0.703       0.669
         15%       0.832    0.885    0.836       0.824     0.710    0.753    0.711       0.704
         20%       0.840    0.897    0.850       0.831     0.740    0.770    0.727       0.706
tr11     5%        0.753    0.736    0.722       0.666     0.734    0.714    0.703       0.691
         10%       0.817    0.796    0.773       0.653     0.769    0.758    0.741       0.671
         15%       0.824    0.833    0.819       0.694     0.779    0.776    0.765       0.717
         20%       0.849    0.848    0.836       0.715     0.799    0.790    0.785       0.739
tr23     5%        0.524    0.524    0.461       0.450     0.468    0.470    0.413       0.384
         10%       0.557    0.582    0.520       0.453     0.479    0.500    0.430       0.399
         15%       0.595    0.614    0.543       0.519     0.535    0.527    0.452       0.438
         20%       0.664    0.635    0.611       0.548     0.563    0.562    0.506       0.462
tr31     5%        0.817    0.796    0.684       0.592     0.737    0.748    0.629       0.564
         10%       0.821    0.823    0.741       0.667     0.755    0.752    0.698       0.596
         15%       0.825    0.841    0.742       0.730     0.757    0.775    0.671       0.619
         20%       0.845    0.857    0.792       0.716     0.779    0.786    0.707       0.642
sports   5%        0.870    0.875    0.758       0.763     0.746    0.768    0.680       0.691
         10%       0.879    0.882    0.786       0.809     0.785    0.779    0.709       0.711
         15%       0.893    0.891    0.803       0.828     0.790    0.791    0.722       0.743
         20%       0.908    0.900    0.831       0.845     0.800    0.802    0.744       0.762
k1b      5%        0.914    0.888    0.881       0.877     0.789    0.756    0.760       0.736
         10%       0.929    0.903    0.909       0.895     0.810    0.770    0.780       0.760
         15%       0.934    0.908    0.917       0.906     0.813    0.784    0.808       0.770
         20%       0.947    0.915    0.934       0.925     0.828    0.790    0.817       0.782

TABLE V. STATISTICAL SIGNIFICANCE OF COMPARISONS WITH C-SPKMEANS

                   5%              10%             15%             20%
Accuracy
  CMVS-IR    >>              >>              >>              >>
             0.013           0.011           0.011           0.012
  CMVS-IV    >> (>>)         >> (>>)         >> (>>)         > (>>)
             0.012 (5.0e-3)  0.010 (3.5e-3)  0.027 (5.0e-3)  0.091 (0.012)
NMI
  CMVS-IR    > (>>)          >>              >>              >>
             0.095 (0.018)   0.045           0.041           0.027
  CMVS-IV    >> (>>)         >> (>>)         > (>>)          > (>>)
             0.028 (0.019)   0.023 (4.7e-3)  0.051 (0.011)   0.061 (0.014)

TABLE VI. STATISTICAL SIGNIFICANCE OF COMPARISONS WITH SS-NMF

                   5%        10%       15%       20%
Accuracy
  CMVS-IR    >>        >>        >>        >>
             1.4e-3    2.1e-3    2.0e-3    3.1e-3
  CMVS-IV    >>        >>        >>        >>
             1.1e-3    7.5e-4    4.6e-4    2.1e-3
NMI
  CMVS-IR    >>        >>        >>        >>
             5.4e-3    1.8e-3    4.0e-3    3.6e-3
  CMVS-IV    >>        >>        >>        >>
             2.8e-3    1.0e-3    8.3e-4    1.7e-3

quality viewpoints, improvement is observed for both IR and IV. A significant improvement is achieved with only 1% of labeled objects on 3 out of the 4 datasets, the exception being re1. This result not only shows the potential of our proposed method, but also indicates that in the MVS manner the quantity of the viewpoints is not the most critical issue; the quality is. On the other hand, the improvement on these 3 datasets slows down after 1%, whereas the improvement on re1 grows relatively consistently as the prior knowledge increases. Moreover, from Table III we find that the only exceptions are on classic, where IR shows comparable Accuracy but lower NMI than C-spkmeans, and on k1b, where IV gives Accuracy and NMI comparable to SS-NMF. For the others, IR and IV are always the best two among all.

From Table IV, we first see that the top two performances are always given by IR and IV, or one of them. When more prior knowledge is given, IR and IV improve consistently with the increase of prior knowledge. IR significantly outperforms SS-NMF without exception and outperforms C-spkmeans on 7 out of 9 datasets, the exceptions being re1 and reviews, on which the Accuracy and NMI values of C-spkmeans are comparable to IR. Meanwhile, the performance of C-spkmeans on tr11 and k1b and the performance of SS-NMF on k1b are comparable to IV. In addition, the improvement of IV on re1 becomes much slower than that of C-spkmeans when a high level of prior knowledge is provided. For the remaining datasets, IV significantly outperforms the other two.

We have also carried out statistical significance tests to justify the clustering performance comparisons. Each of CMVS-IR and CMVS-IV was paired with C-spkmeans and with SS-NMF for a paired t-test. Given two paired sets X and Y of N measured values, the null hypothesis of the test is that the differences between X and Y come from a population with mean 0; the alternative hypothesis is that the paired sets differ in a significant way. In our experiments, these tests were performed on the evaluation values obtained on the nine datasets, using the typical 5% significance level. If the t-test returns a p-value smaller than 0.05, we reject the null hypothesis and say that the difference in performance between the two algorithms is significant; otherwise, the null hypothesis cannot be rejected and the comparison is considered insignificant.
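As an illustration of this protocol, the paired t-test for CMVS-IR versus C-spkmeans at the 5% level can be reproduced directly from the Accuracy column of Table IV with scipy; the resulting p-value comes out around the 0.013 reported in Table V.

```python
from scipy import stats

# Accuracy of CMVS-IR and C-spkmeans at the 5% label level, one value per dataset
# (classic, re0, re1, reviews, tr11, tr23, tr31, sports, k1b), copied from Table IV.
cmvs_ir    = [0.741, 0.521, 0.585, 0.800, 0.753, 0.524, 0.817, 0.870, 0.914]
c_spkmeans = [0.734, 0.437, 0.568, 0.811, 0.722, 0.461, 0.684, 0.758, 0.881]

t_stat, p_value = stats.ttest_rel(cmvs_ir, c_spkmeans)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant at 5%: {p_value < 0.05}")
```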

The outcomes of the t-tests against C-spkmeans and SS-NMF are presented in Tables V and VI, respectively. The ">>" symbol indicates that the algorithm in the row performs significantly better than the compared algorithm at that prior-knowledge level, while ">" indicates an insignificant comparison; the value right below each symbol is the p-value of the t-test. As the t-tests show, the advantage of CMVS-IR and CMVS-IV over SS-NMF is statistically significant at all prior-knowledge levels. A few special cases occur in the tests against C-spkmeans. CMVS-IR is not significantly better than C-spkmeans in terms of NMI when only 5% labels are given. The reason is that, as observed in Table III, the NMI of C-spkmeans on classic is much higher than that of CMVS-IR, although their Accuracy values are quite comparable. This phenomenon persists at the 5% level but gradually disappears as more class labels are provided. Therefore, the value 0.653 in NMI may be considered an outlier with respect to the majority of the results. We also report, in brackets, the p-values obtained when classic is removed and only the results on the other 8 datasets are used for NMI at the 5% prior-knowledge level. Under this circumstance, a much smaller p-value is obtained and CMVS-IR is confirmed by the t-test to significantly outperform C-spkmeans. A similar situation occurs for CMVS-IV, for which the insignificant conclusions appear when the prior knowledge is high. Here, re1 is the corresponding "outlier", as C-spkmeans yields outstanding results on re1 compared with IV but does not perform very well on most of the other datasets. Excluding re1, CMVS-IV is confirmed to significantly outperform C-spkmeans with good p-values.

V. CONCLUSION

In this paper, we have studied how to develop a new search-based semi-supervised clustering method based on a multi-viewpoint similarity measure, when some prior knowledge in the form of class labels is available to users. We explore the best use of a few available class labels in this novel semi-supervised approach by making good use of the prior knowledge in terms of the similarity measure during the clustering process, beyond seeding in the initialization. Two new criterion functions, namely CMVS-IR and CMVS-IV, have been formulated by applying the MVS measure in such a way that only the labeled objects, rather than arbitrary objects in the corpus, serve as viewpoints. A theoretical discussion has also been provided to ensure the newly proposed criterion functions make good use of the prior knowledge in terms of the similarity measure. An extensive empirical study has been conducted on a number of benchmark datasets with various amounts of prior knowledge under different evaluation metrics, to show the advantages of the proposed method.

REFERENCES

[1] D. T. Nguyen, L. Chen, and C. K. Chan, "Clustering with Multi-Viewpoint Based Similarity Measure," IEEE Transactions on Knowledge and Data Engineering, vol. PP, 2011.
[2] X. Wu, V. Kumar, Q. J. Ross, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z. H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, pp. 1-37, 2008.
[3] M. E. S. Mendes and L. Sacks, "Evaluating fuzzy clustering for relevance-based information access," in IEEE International Conference on Fuzzy Systems, 2003, pp. 648-653.
[4] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," SIGIR Forum (ACM Special Interest Group on Information Retrieval), pp. 267-273, 2003.
[5] I. S. Dhillon, Y. Guan, and B. Kulis, "Kernel k-means, spectral clustering and normalized cuts," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004), 2004, pp. 551-556.
[6] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003), 2003, pp. 89-98.
[7] S. Zhong and J. Ghosh, "Generative model-based document clustering: a comparative study," Knowledge and Information Systems, vol. 8, pp. 374-384, 2005.
[8] B. Long, Z. M. Zhang, and P. S. Yu, "A probabilistic framework for relational clustering," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 470-479.
[9] S. Basu, A. Banerjee, and R. Mooney, "Semi-supervised clustering by seeding," in Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 19-26.
[10] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, "Constrained k-means clustering with background knowledge," in Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 577-584.
[11] N. Grira, M. Crucianu, and N. Boujemaa, "Active semi-supervised fuzzy clustering," Pattern Recognition, vol. 41, pp. 1851-1861, 2008.
[12] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, "Distance metric learning, with application to clustering with side-information," Advances in Neural Information Processing Systems, vol. 15, pp. 505-512, 2003.
[13] X. Yin, S. Chen, E. Hu, and D. Zhang, "Semi-supervised clustering with metric learning: an adaptive kernel method," Pattern Recognition, vol. 43, pp. 1320-1333, 2010.
[14] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, pp. 143-175, 2001.
[15] C. H. Q. Ding, X. He, H. Zha, M. Gu, and H. D. Simon, "A min-max cut algorithm for graph partitioning and data clustering," 2001, pp. 107-114.
[16] G. Karypis, "CLUTO: a clustering toolkit," Department of Computer Science, University of Minnesota, Tech. Rep., 2003.
[17] Y. Zhao and G. Karypis, "Empirical and theoretical comparisons of selected criterion functions for document clustering," Machine Learning, vol. 55, pp. 311-331, 2004.
[18] T. Li and C. Ding, "The relationships among various nonnegative matrix factorization methods for clustering," 2007, pp. 362-371.
[19] H. Lee, J. Yoo, and S. Choi, "Semi-supervised nonnegative matrix factorization," IEEE Signal Processing Letters, vol. 17, pp. 4-7, 2010.