
Clustering with Multiviewpoint-Based Similarity Measure

    Duc Thang Nguyen, Lihui Chen, Senior Member, IEEE, and Chee Keong Chan

Abstract: All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects assumed to not be in the same cluster with the two objects being measured. Using multiple viewpoints, more informative assessment of similarity could be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.

Index Terms: Document clustering, text mining, similarity measure.

    1 INTRODUCTION

CLUSTERING is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data, and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year. They are proposed for very distinct research fields, and developed using totally different techniques and approaches. Nevertheless, according to a recent study [1], more than half a century after it was introduced, the simple algorithm k-means still remains one of the top 10 data mining algorithms nowadays. It is the most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [2] states that k-means is the favorite algorithm that practitioners in the related fields choose to use. Needless to say, k-means has more than a few basic drawbacks, such as sensitiveness to initialization and to cluster size, and its performance can be worse than other state-of-the-art algorithms in many domains. In spite of that, its simplicity, understandability, and scalability are the reasons for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios could be preferable to one with better performance in some cases but limited usage due to high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems.

A common approach to the clustering problem is to treat it as an optimization process. An optimal partition is found by optimizing a particular function of similarity (or distance) among data. Basically, there is an implicit assumption that the true intrinsic structure of data could be correctly described by the similarity formula defined and embedded in the clustering criterion function. Hence, the effectiveness of clustering algorithms under this approach depends on the appropriateness of the similarity measure to the data at hand. For instance, the original k-means has a sum-of-squared-error objective function that uses euclidean distance. In a very sparse and high-dimensional domain like text documents, spherical k-means, which uses cosine similarity (CS) instead of euclidean distance as the measure, is deemed to be more suitable [3], [4].

In [5], Banerjee et al. showed that euclidean distance was indeed one particular form of a class of distance measures called Bregman divergences. They proposed the Bregman hard-clustering algorithm, in which any kind of the Bregman divergences could be applied. Kullback-Leibler divergence was a special case of Bregman divergences that was said to give good clustering results on document data sets. Kullback-Leibler divergence is a good example of a nonsymmetric measure. Also on the topic of capturing dissimilarity in data, Pekalska et al. [6] found that the discriminative power of some distance measures could increase when their noneuclidean and nonmetric attributes were increased. They concluded that noneuclidean and nonmetric measures could be informative for statistical learning of data. In [7], Pelillo even argued that the symmetry and nonnegativity assumption of similarity measures was actually a limitation of current state-of-the-art clustering approaches. Simultaneously, clustering still requires more robust dissimilarity or similarity measures; recent works such as [8] illustrate this need.

The work in this paper is motivated by investigations from the above and similar research findings. It appears to us that the nature of the similarity measure plays a very important role in the success or failure of a clustering method. Our first objective is to derive a novel method for measuring similarity between data objects in sparse and high-dimensional domains, particularly text documents. From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms.


. The authors are with the Division of Information Engineering, School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Avenue, Republic of Singapore 639798. E-mail: [email protected], {elhchen, eckchan}@ntu.edu.sg.



These clustering algorithms are fast and scalable like k-means, but are also capable of providing high-quality and consistent performance.

The remainder of this paper is organized as follows: In Section 2, we review related literature on similarity and clustering of documents. We then present our proposal for a document similarity measure in Section 3.1. It is followed by two criterion functions for document clustering and their optimization algorithms in Section 4. Extensive experiments on real-world benchmark data sets are presented and discussed in Sections 5 and 6. Finally, conclusions and potential future work are given in Section 7.

    2 RELATED WORK

First of all, Table 1 summarizes the basic notations that will be used extensively throughout this paper to represent documents and related concepts. Each document in a corpus corresponds to an m-dimensional vector d, where m is the total number of terms that the document corpus has. Document vectors are often subjected to some weighting scheme, such as the standard Term Frequency-Inverse Document Frequency (TF-IDF), and normalized to have unit length.

The principal definition of clustering is to arrange data objects into separate clusters such that the intracluster similarity as well as the intercluster dissimilarity is maximized. The problem formulation itself implies that some form of measurement is needed to determine such similarity or dissimilarity. There are many state-of-the-art clustering approaches that do not employ any specific form of measurement, for instance, probabilistic model-based methods [9], nonnegative matrix factorization [10], information theoretic coclustering [11], and so on. In this paper, though, we primarily focus on methods that do utilize a specific measure. In the literature, euclidean distance is one of the most popular measures:

\[ \mathrm{Dist}(d_i, d_j) = \| d_i - d_j \|. \qquad (1) \]

It is used in the traditional k-means algorithm. The objective of k-means is to minimize the euclidean distance between objects of a cluster and that cluster's centroid:

\[ \min \sum_{r=1}^{k} \sum_{d_i \in S_r} \| d_i - C_r \|^2. \qquad (2) \]

However, for data in a sparse and high-dimensional space, such as that in document clustering, cosine similarity is more widely used. It is also a popular similarity score in text mining and information retrieval [12]. Particularly, the similarity of two document vectors d_i and d_j, Sim(d_i, d_j), is defined as the cosine of the angle between them. For unit vectors, this equals their inner product:

\[ \mathrm{Sim}(d_i, d_j) = \cos(d_i, d_j) = d_i^t d_j. \qquad (3) \]

The cosine measure is used in a variant of k-means called spherical k-means [3]. While k-means aims to minimize euclidean distance, spherical k-means intends to maximize the cosine similarity between documents in a cluster and that cluster's centroid:

\[ \max \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{d_i^t C_r}{\| C_r \|}. \qquad (4) \]

The major difference between euclidean distance and cosine similarity, and therefore between k-means and spherical k-means, is that the former focuses on vector magnitudes, while the latter emphasizes vector directions. Besides its direct application in spherical k-means, the cosine of document vectors is also widely used in many other document clustering methods as a core similarity measurement. The min-max cut graph-based spectral method is an example [13]. In the graph partitioning approach, the document corpus is considered as a graph G = (V, E), where each document is a vertex in V and each edge in E has a weight equal to the similarity between a pair of vertices. The min-max cut algorithm tries to minimize the criterion function

\[ \min \sum_{r=1}^{k} \frac{\mathrm{Sim}(S_r, S \setminus S_r)}{\mathrm{Sim}(S_r, S_r)}, \quad \text{where } \mathrm{Sim}(S_q, S_r)\Big|_{1 \le q, r \le k} = \sum_{d_i \in S_q,\, d_j \in S_r} \mathrm{Sim}(d_i, d_j), \qquad (5) \]

and when the cosine as in (3) is used, minimizing the criterion in (5) is equivalent to

\[ \min \sum_{r=1}^{k} \frac{D_r^t D}{\| D_r \|^2}. \qquad (6) \]

There are many other graph partitioning methods with different cutting strategies and criterion functions, such as Average Weight [14] and Normalized Cut [15], all of which have been successfully applied for document clustering using cosine as the pairwise similarity score [16], [17]. In [18], an empirical study was conducted to compare a variety of criterion functions for document clustering.

Another popular graph-based clustering technique is implemented in a software package called CLUTO [19]. This method first models the documents with a nearest neighbor graph, and then splits the graph into clusters using a min-cut algorithm. Besides the cosine measure, the extended Jaccard coefficient can also be used in this method to represent similarity between nearest documents. Given nonunit document vectors u_i, u_j (d_i = u_i/||u_i||, d_j = u_j/||u_j||), their extended Jaccard coefficient is

\[ \mathrm{Sim}_{\mathrm{eJacc}}(u_i, u_j) = \frac{u_i^t u_j}{\| u_i \|^2 + \| u_j \|^2 - u_i^t u_j}. \qquad (7) \]


TABLE 1. Notations.


Compared with euclidean distance and cosine similarity, the extended Jaccard coefficient takes into account both the magnitude and the direction of the document vectors. If the documents are instead represented by their corresponding unit vectors, this measure has the same effect as cosine similarity. In [20], Strehl et al. compared four measures: euclidean, cosine, Pearson correlation, and extended Jaccard, and concluded that cosine and extended Jaccard are the best ones on web documents.
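To make the three measures above concrete, the following is a minimal sketch (in Python, not part of the original paper) that computes euclidean distance (1), cosine similarity (3), and the extended Jaccard coefficient (7) on a pair of toy document vectors; the vector values are purely illustrative.

```python
import numpy as np

def euclidean_dist(di, dj):
    # Eq. (1): Dist(d_i, d_j) = ||d_i - d_j||
    return float(np.linalg.norm(di - dj))

def cosine_sim(di, dj):
    # Eq. (3): for unit vectors the cosine is simply the inner product
    return float(np.dot(di, dj))

def extended_jaccard(ui, uj):
    # Eq. (7): defined on the non-unit (raw) document vectors u_i, u_j
    dot = float(np.dot(ui, uj))
    return dot / (np.dot(ui, ui) + np.dot(uj, uj) - dot)

# Toy TF-IDF-like vectors (illustrative values only)
u1 = np.array([0.0, 2.0, 1.0, 3.0])
u2 = np.array([1.0, 1.0, 0.0, 2.0])
d1, d2 = u1 / np.linalg.norm(u1), u2 / np.linalg.norm(u2)  # unit-length versions

print(euclidean_dist(d1, d2), cosine_sim(d1, d2), extended_jaccard(u1, u2))
```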

In nearest neighbor graph clustering methods, such as CLUTO's graph method above, the concept of similarity is somewhat different from the previously discussed methods. Two documents may have a certain value of cosine similarity, but if neither of them is in the other one's neighborhood, they have no connection between them. In such a case, some context-based knowledge or relativeness property is already taken into account when considering similarity. Recently, Ahmad and Dey [21] proposed a method to compute the distance between two categorical values of an attribute based on their relationship with all other attributes. Subsequently, Ienco et al. [22] introduced a similar context-based distance learning method for categorical data. However, for a given attribute, they only selected a relevant subset of attributes from the whole attribute set to use as the context for calculating the distance between its two values.

More related to text data, there are phrase-based and concept-based document similarities. Lakkaraju et al. [23] employed a conceptual tree-similarity measure to identify similar documents. This method requires representing documents as concept trees with the help of a classifier. For clustering, Chim and Deng [24] proposed a phrase-based document similarity by combining the suffix tree model and the vector space model. They then used a Hierarchical Agglomerative Clustering algorithm to perform the clustering task. However, a drawback of this approach is the high computational complexity due to the need to build the suffix tree and calculate pairwise similarities explicitly before clustering. There are also measures designed specifically for capturing structural similarity among XML documents [25]. They are essentially different from the document-content measures that are discussed in this paper.

In general, cosine similarity still remains the most popular measure because of its simple interpretation and easy computation, though its effectiveness is yet fairly limited. In the following sections, we propose a novel way to evaluate similarity between documents, and consequently formulate new criterion functions for document clustering.

    3 MULTIVIEWPOINT-BASED SIMILARITY

    3.1 Our Novel Similarity Measure

The cosine similarity in (3) can be expressed in the following form without changing its meaning:

\[ \mathrm{Sim}(d_i, d_j) = \cos(d_i - 0, d_j - 0) = (d_i - 0)^t (d_j - 0), \qquad (8) \]

where 0 is the zero vector that represents the origin point.

According to this formula, the measure takes 0 as the one and only reference point. The similarity between two documents d_i and d_j is determined w.r.t. the angle between the two points when looking from the origin.

To construct a new concept of similarity, it is possible to use more than just one point of reference. We may have a more accurate assessment of how close or distant a pair of points are if we look at them from many different viewpoints. From a third point d_h, the directions and distances to d_i and d_j are indicated, respectively, by the difference vectors (d_i - d_h) and (d_j - d_h). By standing at various reference points d_h to view d_i, d_j and working on their difference vectors, we define the similarity between the two documents as

\[ \mathrm{Sim}(d_i, d_j \mid d_i, d_j \in S_r) = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}(d_i - d_h, d_j - d_h). \qquad (9) \]

As described by the above equation, the similarity of two documents d_i and d_j, given that they are in the same cluster, is defined as the average of similarities measured relatively from the views of all other documents outside that cluster. What is interesting is that the similarity here is defined in a close relation to the clustering problem. A presumption of cluster memberships has been made prior to the measure. The two objects to be measured must be in the same cluster, while the points from where to establish this measurement must be outside of the cluster. We call this proposal the Multiviewpoint-based Similarity, or MVS. From this point onwards, we will denote the proposed similarity measure between two document vectors d_i and d_j by MVS(d_i, d_j | d_i, d_j ∈ S_r), or occasionally MVS(d_i, d_j) for short.

The final form of MVS in (9) depends on the particular formulation of the individual similarities within the sum. If the relative similarity is defined by the dot product of the difference vectors, we have

\[ \begin{aligned} \mathrm{MVS}(d_i, d_j \mid d_i, d_j \in S_r) &= \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h) \\ &= \frac{1}{n - n_r} \sum_{d_h} \cos(d_i - d_h, d_j - d_h) \, \| d_i - d_h \| \, \| d_j - d_h \|. \end{aligned} \qquad (10) \]

The similarity between two points d_i and d_j inside cluster S_r, viewed from a point d_h outside this cluster, is equal to the product of the cosine of the angle between d_i and d_j looking from d_h and the euclidean distances from d_h to these two points. This definition is based on the assumption that d_h is not in the same cluster with d_i and d_j. The smaller the distances ||d_i - d_h|| and ||d_j - d_h|| are, the higher the chance that d_h is in fact in the same cluster with d_i and d_j, and the similarity based on d_h should also be small to reflect this potential. Therefore, through these distances, (10) also provides a measure of intercluster dissimilarity, given that points d_i and d_j belong to cluster S_r, whereas d_h belongs to another cluster. The overall similarity between d_i and d_j is determined by taking the average over all the viewpoints not belonging to cluster S_r. It is possible to argue that while most of these viewpoints are useful, there may be some of them giving misleading information, just as may happen with the origin point. However, given a large enough number of viewpoints and their variety, it is reasonable to assume that the majority of them will be useful. Hence, the effect of misleading viewpoints is constrained and reduced by the averaging step.


It can be seen that this method offers a more informative assessment of similarity than the single origin point-based similarity measure.
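As a concrete illustration of (10), the following is a small sketch (Python, not from the paper) that computes the multiviewpoint similarity of two unit document vectors by averaging the dot products of their difference vectors over a set of viewpoints; the data are randomly generated and purely hypothetical.

```python
import numpy as np

def mvs_pairwise(di, dj, viewpoints):
    """Eq. (10): average over viewpoints d_h outside the cluster of
    (d_i - d_h)^t (d_j - d_h). `viewpoints` holds one d_h per row."""
    diffs_i = di - viewpoints          # each row: d_i - d_h
    diffs_j = dj - viewpoints          # each row: d_j - d_h
    return float(np.mean(np.sum(diffs_i * diffs_j, axis=1)))

# Tiny illustration with random unit vectors (hypothetical data)
rng = np.random.default_rng(0)
docs = rng.random((6, 4))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
in_cluster, outside = docs[:2], docs[2:]   # d_i, d_j assumed in S_r; the rest serve as viewpoints
print(mvs_pairwise(in_cluster[0], in_cluster[1], outside))
```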

    3.2 Analysis and Practical Examples of MVS

In this section, we present an analytical study to show that the proposed MVS could be a very effective similarity measure for data clustering. In order to demonstrate its advantages, MVS is compared with cosine similarity on how well they reflect the true group structure in document collections. First, exploring (10), we have

\[ \begin{aligned} \mathrm{MVS}(d_i, d_j \mid d_i, d_j \in S_r) &= \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right) \\ &= d_i^t d_j - \frac{1}{n - n_r} d_i^t \sum_{d_h} d_h - \frac{1}{n - n_r} d_j^t \sum_{d_h} d_h + 1, \quad \text{since } \| d_h \| = 1 \\ &= d_i^t d_j - \frac{1}{n - n_r} d_i^t D_{S \setminus S_r} - \frac{1}{n - n_r} d_j^t D_{S \setminus S_r} + 1 \\ &= d_i^t d_j - d_i^t C_{S \setminus S_r} - d_j^t C_{S \setminus S_r} + 1, \end{aligned} \qquad (11) \]

where D_{S\S_r} = Σ_{d_h ∈ S\S_r} d_h is the composite vector of all the documents outside cluster r, called the outer composite w.r.t. cluster r, and C_{S\S_r} = D_{S\S_r} / (n - n_r) is the outer centroid w.r.t. cluster r, for all r = 1, ..., k. From (11), when comparing two pairwise similarities MVS(d_i, d_j) and MVS(d_i, d_l), document d_j is more similar to document d_i than the other document d_l is, if and only if

\[ \begin{aligned} & d_i^t d_j - d_j^t C_{S \setminus S_r} > d_i^t d_l - d_l^t C_{S \setminus S_r} \\ \Leftrightarrow\; & \cos(d_i, d_j) - \cos(d_j, C_{S \setminus S_r}) \, \| C_{S \setminus S_r} \| > \cos(d_i, d_l) - \cos(d_l, C_{S \setminus S_r}) \, \| C_{S \setminus S_r} \|. \end{aligned} \qquad (12) \]

From this condition, it is seen that even when d_l is considered closer to d_i in terms of CS, i.e., cos(d_i, d_j) ≤ cos(d_i, d_l), d_l can still possibly be regarded as less similar to d_i based on MVS if, on the contrary, it is close enough to the outer centroid C_{S\S_r}, closer than d_j is. This is intuitively reasonable, since the closer d_l is to C_{S\S_r}, the greater the chance it actually belongs to another cluster rather than S_r and is, therefore, less similar to d_i. For this reason, MVS brings to the table an additional useful measure compared with CS.
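The closed form in (11) also suggests a cheap way to evaluate MVS without iterating over viewpoints. The sketch below (Python, illustrative only, with randomly generated data) computes MVS(d_i, d_j) from the outer composite vector and checks numerically that it agrees with the direct average of (10) for unit-length documents.

```python
import numpy as np

def mvs_via_outer_centroid(di, dj, outer_composite, n_outside):
    """Eq. (11): MVS(d_i, d_j) = d_i^t d_j - d_i^t C - d_j^t C + 1,
    where C = D_{S\\S_r} / (n - n_r) is the outer centroid (documents are unit vectors)."""
    c = outer_composite / n_outside
    return float(di @ dj - di @ c - dj @ c + 1.0)

# The closed form agrees with averaging over all viewpoints as in Eq. (10)
rng = np.random.default_rng(1)
docs = rng.random((8, 5))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
di, dj, outside = docs[0], docs[1], docs[2:]
direct = np.mean([(di - dh) @ (dj - dh) for dh in outside])
closed = mvs_via_outer_centroid(di, dj, outside.sum(axis=0), len(outside))
print(np.isclose(direct, closed))   # True for unit-length documents
```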

To further justify the above proposal and analysis, we carried out a validity test for MVS and CS. The purpose of this test is to check how much a similarity measure coincides with the true class labels. It is based on one principle: if a similarity measure is appropriate for the clustering problem, then for any document in the corpus, the documents that are closest to it based on this measure should be in the same cluster with it.

The validity test is designed as follows: For each type of similarity measure, a similarity matrix A = {a_ij}_{n×n} is created. For CS, this is simple, as a_ij = d_i^t d_j. The procedure for building the MVS matrix is described in Fig. 1.

First, the outer composite w.r.t. each class is determined. Then, for each row a_i of A, i = 1, ..., n, if the pair of documents d_i and d_j, j = 1, ..., n, are in the same class, a_ij is calculated as in line 10, Fig. 1. Otherwise, d_j is assumed to be in d_i's class, and a_ij is calculated as in line 12, Fig. 1. After matrix A is formed, the procedure in Fig. 2 is used to get its validity score.

For each document d_i corresponding to row a_i of A, we select the q_r documents closest to d_i. The value of q_r is chosen relatively as a percentage of the size of the class r that contains d_i, where percentage ∈ (0, 1]. Then, the validity w.r.t. d_i is calculated as the fraction of these q_r documents having the same class label as d_i, as in line 12, Fig. 2. The final validity is determined by averaging over all the rows of A, as in line 14, Fig. 2. It is clear that the validity score is bounded within 0 and 1. The higher the validity score a similarity measure has, the more suitable it should be for the clustering task.
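Since Figs. 1 and 2 are not reproduced here, the following is a rough sketch (Python, not the paper's exact procedure) of the validity scoring step: given any precomputed similarity matrix A and the true labels, it averages, over all documents, the fraction of the q_r nearest documents that share the document's class.

```python
import numpy as np

def validity_score(A, labels, percentage=1.0):
    """Sketch of the validity test (in the spirit of Fig. 2): for each document,
    take the q_r most similar documents (q_r is a fraction of its true class size)
    and measure how many share its class label."""
    labels = np.asarray(labels)
    n = len(labels)
    class_sizes = {c: int(np.sum(labels == c)) for c in np.unique(labels)}
    scores = []
    for i in range(n):
        qr = max(1, int(round(percentage * class_sizes[labels[i]])))
        sims = A[i].copy()
        sims[i] = -np.inf                      # do not count the document itself
        nearest = np.argsort(-sims)[:qr]       # q_r most similar documents
        scores.append(np.mean(labels[nearest] == labels[i]))
    return float(np.mean(scores))

# For CS, the matrix is simply a_ij = d_i^t d_j on unit vectors (toy data below):
rng = np.random.default_rng(2)
docs = rng.random((10, 6)); docs /= np.linalg.norm(docs, axis=1, keepdims=True)
labels = [0] * 5 + [1] * 5
A_cs = docs @ docs.T
print(validity_score(A_cs, labels, percentage=1.0))
```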

Two real-world document data sets are used as examples in this validity test. The first is reuters7, a subset of the famous collection, Reuters-21578 Distribution 1.0, of Reuters newswire articles.1


    Fig. 1. Procedure: Build MVS similarity matrix.

    Fig. 2. Procedure: Get validity score.

    1. http://www.daviddlewis.com/resources/testcollections/reuters21578/.


Reuters-21578 is one of the most widely used test collections for text categorization. In our validity test, we selected 2,500 documents from the largest seven categories: acq, crude, interest, earn, money-fx, ship, and trade to form reuters7. Some of the documents may appear in more than one category. The second data set is k1b, a collection of 2,340 webpages from the Yahoo! subject hierarchy, including six topics: health, entertainment, sport, politics, tech, and business. It was created from a past study in information retrieval called WebACE [26], and is now available with the CLUTO toolkit [19].

The two data sets were preprocessed by stop-word removal and stemming. Moreover, we removed words that appear in fewer than two documents or in more than 99.5 percent of the total number of documents. Finally, the documents were weighted by TF-IDF and normalized to unit vectors. The full characteristics of reuters7 and k1b are presented in Fig. 3.
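A comparable preprocessing pipeline can be sketched with scikit-learn's TfidfVectorizer as below; this is only an approximation of the paper's setup (the corpus is made up, and stemming, which the paper applies, would need an external stemmer and is omitted here).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus used only to show the transformation
corpus = [
    "oil prices rise as crude supply falls",
    "crude oil shipments delayed at the port",
    "stocks rally after earnings beat expectations",
    "company earnings grow despite weak sales",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # stop-word removal
    min_df=2,              # drop words appearing in fewer than two documents
    max_df=0.995,          # drop words appearing in more than 99.5% of documents
    norm="l2",             # normalize document vectors to unit length
)
X = vectorizer.fit_transform(corpus)   # sparse n x m TF-IDF matrix with unit rows
print(X.shape, vectorizer.get_feature_names_out())
```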

Fig. 4 shows the validity scores of CS and MVS on the two data sets relative to the parameter percentage. The value of percentage is set at 0.001, 0.01, 0.05, 0.1, 0.2, ..., 1.0. According to Fig. 4, MVS is clearly better than CS for both data sets in this validity test. For example, with the k1b data set at percentage = 1.0, the MVS validity score is 0.80, while that of CS is only 0.67. This indicates that, on average, when we pick any document and consider its neighborhood of size equal to its true class size, only 67 percent of that document's neighbors based on CS actually belong to its class. If based on MVS, the number of valid neighbors increases to 80 percent. The validity test has illustrated the potential advantage of the new multiviewpoint-based similarity measure compared to the cosine measure.

    4 MULTIVIEWPOINT-BASED CLUSTERING

4.1 Two Clustering Criterion Functions I_R and I_V

Having defined our similarity measure, we now formulate our clustering criterion functions. The first function, called I_R, is the cluster size-weighted sum of average pairwise similarities of documents in the same cluster. First, let us express this sum in a general form by function F:

\[ F = \sum_{r=1}^{k} n_r \left[ \frac{1}{n_r^2} \sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) \right]. \qquad (13) \]

We would like to transform this objective function into some suitable form, such that it could facilitate the optimization procedure to be performed in a simple, fast, and effective way. According to (10),

\[ \begin{aligned} \sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) &= \sum_{d_i, d_j \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h) \\ &= \frac{1}{n - n_r} \sum_{d_i, d_j} \sum_{d_h} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right). \end{aligned} \]

Since
\[ \sum_{d_i \in S_r} d_i = \sum_{d_j \in S_r} d_j = D_r, \qquad \sum_{d_h \in S \setminus S_r} d_h = D - D_r \quad \text{and} \quad \| d_h \| = 1, \]

we have
\[ \begin{aligned} \sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) &= \sum_{d_i, d_j \in S_r} d_i^t d_j - \frac{2 n_r}{n - n_r} \sum_{d_i \in S_r} d_i^t \sum_{d_h \in S \setminus S_r} d_h + n_r^2 \\ &= D_r^t D_r - \frac{2 n_r}{n - n_r} D_r^t (D - D_r) + n_r^2 \\ &= \frac{n + n_r}{n - n_r} \| D_r \|^2 - \frac{2 n_r}{n - n_r} D_r^t D + n_r^2. \end{aligned} \]

Substituting this into (13), we get

\[ F = \sum_{r=1}^{k} \frac{1}{n_r} \left[ \frac{n + n_r}{n - n_r} \| D_r \|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right] + n. \]

Because n is a constant, maximizing F is equivalent to maximizing \(\hat{F}\):

\[ \hat{F} = \sum_{r=1}^{k} \frac{1}{n_r} \left[ \frac{n + n_r}{n - n_r} \| D_r \|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (14) \]

Comparing \(\hat{F}\) with the min-max cut in (5), both functions contain the two terms ||D_r||^2 (an intracluster similarity measure) and D_r^t D (an intercluster similarity measure).


    Fig. 3. Characteristics of reuters7 and k1b data sets.

    Fig. 4. CS and MVS validity test.


Nonetheless, while the objective of min-max cut is to minimize the inverse ratio between these two terms, our aim here is to maximize their weighted difference. In \(\hat{F}\), this difference term is determined for each cluster. The terms are weighted by the inverse of the cluster's size before being summed up over all the clusters. One problem is that this formulation is expected to be quite sensitive to cluster size. From the formulation of COSA [27], a widely known subspace clustering algorithm, we have learned that it is desirable to have a set of weight factors {λ_r}_{r=1}^k to regulate the distribution of these cluster sizes in the clustering solutions. Hence, we integrate λ_r into the expression of \(\hat{F}\) to have it become

\[ \hat{F}_\lambda = \sum_{r=1}^{k} \frac{\lambda_r}{n_r} \left[ \frac{n + n_r}{n - n_r} \| D_r \|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (15) \]

In common practice, {λ_r}_{r=1}^k are often taken to be simple functions of the respective cluster sizes {n_r}_{r=1}^k [28]. Let us use a parameter α, called the regulating factor, which has some constant value (α ∈ [0, 1]), and let λ_r = n_r^α in (15); the final form of our criterion function I_R is

\[ I_R = \sum_{r=1}^{k} \frac{1}{n_r^{1-\alpha}} \left[ \frac{n + n_r}{n - n_r} \| D_r \|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (16) \]

In the empirical study of Section 5.4, it appears that I_R's performance dependency on the value of α is not very critical. The criterion function yields relatively good clustering results for α ∈ (0, 1).

In the formulation of I_R, a cluster's quality is measured by the average pairwise similarity between documents within that cluster. However, such an approach can lead to sensitiveness to the size and tightness of the clusters. With CS, for example, the pairwise similarity of documents in a sparse cluster is usually smaller than that in a dense cluster. Though not as clear as with CS, it is still possible that the same effect may hinder MVS-based clustering if pairwise similarity is used. To prevent this, an alternative approach is to consider the similarity between each document vector and its cluster's centroid instead. This is expressed in objective function G:

\[ \begin{aligned} G &= \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}\!\left( d_i - d_h, \frac{C_r}{\| C_r \|} - d_h \right) \\ &= \sum_{r=1}^{k} \frac{1}{n - n_r} \sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\| C_r \|} - d_h \right). \end{aligned} \qquad (17) \]

Similar to the formulation of I_R, we would like to express this objective in a simple form that we could optimize more easily. Expanding the vector dot product, we get

\[ \begin{aligned} \sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\| C_r \|} - d_h \right) &= \sum_{d_i} \sum_{d_h} \left( d_i^t \frac{C_r}{\| C_r \|} - d_i^t d_h - d_h^t \frac{C_r}{\| C_r \|} + 1 \right) \\ &= (n - n_r) D_r^t \frac{D_r}{\| D_r \|} - D_r^t (D - D_r) - n_r (D - D_r)^t \frac{D_r}{\| D_r \|} + n_r (n - n_r), \quad \text{since } \frac{C_r}{\| C_r \|} = \frac{D_r}{\| D_r \|} \\ &= (n + \| D_r \|) \| D_r \| - \frac{\| D_r \| + n_r}{\| D_r \|} D_r^t D + n_r (n - n_r). \end{aligned} \]

Substituting the above into (17), we have

\[ G = \sum_{r=1}^{k} \left[ \frac{n + \| D_r \|}{n - n_r} \| D_r \| - \left( \frac{n + \| D_r \|}{n - n_r} - 1 \right) \frac{D_r^t D}{\| D_r \|} \right] + n. \]

Again, we can eliminate n because it is a constant. Maximizing G is equivalent to maximizing I_V below:

\[ I_V = \sum_{r=1}^{k} \left[ \frac{n + \| D_r \|}{n - n_r} \| D_r \| - \left( \frac{n + \| D_r \|}{n - n_r} - 1 \right) \frac{D_r^t D}{\| D_r \|} \right]. \qquad (18) \]

I_V calculates the weighted difference between the two terms ||D_r|| and D_r^t D/||D_r||, which again represent an intracluster similarity measure and an intercluster similarity measure, respectively. The first term is actually equivalent to an element of the sum in the spherical k-means objective function in (4); the second one is similar to an element of the sum in the min-max cut criterion in (6), but with ||D_r|| as the scaling factor instead of ||D_r||^2. We have presented our clustering criterion functions I_R and I_V in their simple forms. Next, we show how to perform clustering by using a greedy algorithm to optimize these functions.
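For reference, a small sketch (Python, not the authors' Java implementation) of evaluating I_R (16) and I_V (18) for a given partition from the composite vectors D_r and D might look as follows; the data and the two-way partition at the end are purely illustrative.

```python
import numpy as np

def criterion_values(docs, labels, alpha=0.3):
    """Sketch of the criterion functions I_R (Eq. 16) and I_V (Eq. 18), computed
    from the cluster composites D_r and the corpus composite D. `docs` holds unit
    document vectors (one per row), `labels` a cluster index per row."""
    docs = np.asarray(docs, dtype=float)
    labels = np.asarray(labels)
    n = len(docs)
    D = docs.sum(axis=0)                       # composite vector of the whole corpus
    I_R, I_V = 0.0, 0.0
    for r in np.unique(labels):
        Dr = docs[labels == r].sum(axis=0)     # composite vector of cluster r
        nr = int(np.sum(labels == r))
        norm_Dr = np.linalg.norm(Dr)
        w = (n + nr) / (n - nr)
        I_R += (1.0 / nr ** (1 - alpha)) * (w * norm_Dr**2 - (w - 1) * (Dr @ D))
        v = (n + norm_Dr) / (n - nr)
        I_V += v * norm_Dr - (v - 1) * (Dr @ D) / norm_Dr
    return I_R, I_V

# Example call on random unit vectors with an arbitrary 2-way partition
rng = np.random.default_rng(3)
X = rng.random((12, 5)); X /= np.linalg.norm(X, axis=1, keepdims=True)
print(criterion_values(X, [0] * 6 + [1] * 6, alpha=0.3))
```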

    4.2 Optimization Algorithm and Complexity

We denote our clustering framework by MVSC, meaning Clustering with Multiviewpoint-based Similarity. Subsequently, we have MVSC-IR and MVSC-IV, which are MVSC with criterion function I_R and I_V, respectively. The main goal is to perform document clustering by optimizing I_R in (16) and I_V in (18). For this purpose, the incremental k-way algorithm [18], [29], a sequential version of k-means, is employed. Considering that the expression of I_V in (18) depends only on n_r and D_r, r = 1, ..., k, I_V can be written in a general form
\[ I_V = \sum_{r=1}^{k} I_r(n_r, D_r), \qquad (19) \]
where I_r(n_r, D_r) corresponds to the objective value of cluster r. The same applies to I_R. With this general form, the incremental optimization algorithm, which has two major steps, Initialization and Refinement, is described in Fig. 5.

At Initialization, k arbitrary documents are selected to be the seeds from which initial partitions are formed. Refinement is a procedure that consists of a number of iterations. During each iteration, the n documents are visited one by one in a totally random order. Each document is checked to see whether moving it to another cluster results in an improvement of the objective function. If yes, the document is moved to the cluster that leads to the highest improvement.


If no cluster is better than the current cluster, the document is not moved. The clustering process terminates when an iteration completes without any documents being moved to new clusters. Unlike the traditional k-means, this algorithm is a stepwise optimal procedure. While k-means only updates after all n documents have been reassigned, the incremental clustering algorithm updates immediately whenever a document is moved to a new cluster. Since every move, when it happens, increases the objective function value, convergence to a local optimum is guaranteed.
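Since Fig. 5 is not reproduced here, the following is a minimal sketch (Python, with the assumptions noted in the comments, not the paper's Java code) of this incremental refinement for the I_V criterion: cluster sizes and composite vectors are maintained, and a document is moved only when the move increases I_V.

```python
import numpy as np

def cluster_objective(nr, Dr, D, n):
    # I_r(n_r, D_r) for the I_V criterion (Eq. 18); one term of the sum in Eq. (19)
    norm_Dr = np.linalg.norm(Dr)
    if nr == 0 or norm_Dr == 0:
        return 0.0
    v = (n + norm_Dr) / (n - nr)
    return v * norm_Dr - (v - 1) * (Dr @ D) / norm_Dr

def incremental_mvsc_iv(docs, k, n_iter=20, seed=0):
    """Sketch of the incremental k-way refinement described above: documents are
    visited in random order and moved to the cluster giving the largest gain in I_V."""
    rng = np.random.default_rng(seed)
    docs = np.asarray(docs, dtype=float)
    n = len(docs)
    D = docs.sum(axis=0)
    labels = rng.integers(0, k, size=n)                     # random initial partition
    sizes = np.bincount(labels, minlength=k)
    comps = np.array([docs[labels == r].sum(axis=0) for r in range(k)])
    for _ in range(n_iter):
        moved = False
        for i in rng.permutation(n):
            r = labels[i]
            if sizes[r] == 1:
                continue                                    # keep clusters non-empty
            base = cluster_objective(sizes[r], comps[r], D, n)
            best_gain, best_s = 0.0, r
            for s in range(k):
                if s == r:
                    continue
                gain = (cluster_objective(sizes[r] - 1, comps[r] - docs[i], D, n)
                        + cluster_objective(sizes[s] + 1, comps[s] + docs[i], D, n)
                        - base - cluster_objective(sizes[s], comps[s], D, n))
                if gain > best_gain:
                    best_gain, best_s = gain, s
            if best_s != r:                                 # move only if I_V increases
                comps[r] -= docs[i]; comps[best_s] += docs[i]
                sizes[r] -= 1; sizes[best_s] += 1
                labels[i] = best_s
                moved = True
        if not moved:
            break
    return labels

# Usage: labels = incremental_mvsc_iv(X, k=3), with X an n x m array of unit vectors
```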

During the optimization procedure, in each iteration, the main sources of computational cost are:

. Searching for the optimum cluster to move each individual document to: O(nz · k).

. Updating composite vectors as a result of such moves: O(m · k),

where nz is the total number of nonzero entries in all document vectors. Our clustering approach is partitional and incremental; therefore, computing a similarity matrix is not needed at all. If τ denotes the number of iterations the algorithm takes, then, since nz is often several tens of times larger than m for the document domain, the computational complexity required for clustering with I_R and I_V is O(nz · k · τ).

    5 PERFORMANCE EVALUATION OF MVSC

To verify the advantages of our proposed methods, we evaluate their performance in experiments on document data. The objective of this section is to compare MVSC-IR and MVSC-IV with existing algorithms that also use specific similarity measures and criterion functions for document clustering. The similarity measures to be compared include euclidean distance, cosine similarity, and the extended Jaccard coefficient.

    5.1 Document Collections

The data corpora that we used for the experiments consist of 20 benchmark document data sets. Besides reuters7 and k1b, which have been described in detail earlier, we included another 18 text collections so that the examination of the clustering methods is more thorough and exhaustive. Similar to k1b, these data sets are provided together with CLUTO by the toolkit's authors [19]. They had been used for experimental testing in previous papers, and their source and origin had also been described in detail [30], [31]. Table 2 summarizes their characteristics. The corpora present a diversity of size, number of classes, and class balance. They were all preprocessed by standard procedures, including stop-word removal, stemming, removal of too rare as well as too frequent words, TF-IDF weighting, and normalization.

    5.2 Experimental Setup and Evaluation

To demonstrate how well the MVSCs can perform, we compare them with five other clustering methods on the 20 data sets in Table 2. In summary, the seven clustering algorithms are

    . MVSC-IR: MVSC using criterion function IR

    . MVSC-IV: MVSC using criterion function IV

    . k-means: standard k-means with euclidean distance

    . Spkmeans: spherical k-means with CS

. graphCS: CLUTO's graph method with CS

. graphEJ: CLUTO's graph method with extended Jaccard

    . MMC: Spectral Min-Max Cut algorithm [13].

Our MVSC-IR and MVSC-IV programs are implemented in Java. The regulating factor α in I_R is always set at 0.3 during the experiments. We observed that this is one of the most appropriate values. A study on MVSC-IR's performance relative to different α values is presented in a later section. The other algorithms are provided by the C library interface which is available freely with the CLUTO toolkit [19].


    Fig. 5. Algorithm: Incremental clustering.

TABLE 2. Document Data Sets.


For each data set, the cluster number is predefined to equal the number of true classes, i.e., k = c.

None of the above algorithms are guaranteed to find the global optimum, and all of them are initialization-dependent. Hence, for each method, we performed clustering a few times with randomly initialized values, and chose the best trial in terms of the corresponding objective function value. In all the experiments, each test run consisted of 10 trials. Moreover, the result reported here on each data set by a particular clustering method is the average of 10 test runs.

After a test run, the clustering solution is evaluated by comparing the documents' assigned labels with their true labels provided by the corpus. Three types of external evaluation metric are used to assess clustering performance: the FScore, Normalized Mutual Information (NMI), and Accuracy. FScore is an equally weighted combination of the precision (P) and recall (R) values used in information retrieval. Given a clustering solution, FScore is determined as

\[ \mathrm{FScore} = \sum_{i=1}^{k} \frac{n_i}{n} \max_j F_{i,j}, \quad \text{where } F_{i,j} = \frac{2 \, P_{i,j} \, R_{i,j}}{P_{i,j} + R_{i,j}}, \quad P_{i,j} = \frac{n_{i,j}}{n_j}, \quad R_{i,j} = \frac{n_{i,j}}{n_i}, \]

where n_i denotes the number of documents in class i, n_j the number of documents assigned to cluster j, and n_{i,j} the number of documents shared by class i and cluster j. From another aspect, NMI measures the information that the true class partition and the cluster assignment share. It measures how much knowing about the clusters helps us know about the classes:

\[ \mathrm{NMI} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} n_{i,j} \log \frac{n \, n_{i,j}}{n_i n_j}}{\sqrt{\left( \sum_{i=1}^{k} n_i \log \frac{n_i}{n} \right) \left( \sum_{j=1}^{k} n_j \log \frac{n_j}{n} \right)}}. \]

Finally, Accuracy measures the fraction of documents that are correctly labeled, assuming a one-to-one correspondence between true classes and assigned clusters. Let q denote any possible permutation of the index set {1, ..., k}; Accuracy is calculated by
\[ \mathrm{Accuracy} = \frac{1}{n} \max_q \sum_{i=1}^{k} n_{i, q(i)}. \]

The best mapping q to determine Accuracy can be found by the Hungarian algorithm.2 For all three metrics, the range is from 0 to 1, and a greater value indicates a better clustering solution.
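A possible sketch of the FScore and Accuracy computations (Python, with hypothetical label arrays) is given below; the Hungarian matching is delegated to scipy's linear_sum_assignment, and NMI could be obtained analogously from the same contingency table (for instance via scikit-learn's normalized_mutual_info_score).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(true_labels, pred_labels):
    classes, cls_idx = np.unique(true_labels, return_inverse=True)
    clusters, clu_idx = np.unique(pred_labels, return_inverse=True)
    M = np.zeros((len(classes), len(clusters)), dtype=int)
    for a, b in zip(cls_idx, clu_idx):
        M[a, b] += 1                      # n_{i,j}: docs shared by class i and cluster j
    return M

def fscore(true_labels, pred_labels):
    M = contingency(true_labels, pred_labels)
    n = M.sum()
    ni = M.sum(axis=1, keepdims=True)     # class sizes n_i
    nj = M.sum(axis=0, keepdims=True)     # cluster sizes n_j
    with np.errstate(divide="ignore", invalid="ignore"):
        P, R = M / nj, M / ni
        F = np.where(P + R > 0, 2 * P * R / (P + R), 0.0)
    return float(np.sum(ni.ravel() / n * F.max(axis=1)))

def accuracy(true_labels, pred_labels):
    M = contingency(true_labels, pred_labels)
    row, col = linear_sum_assignment(M, maximize=True)   # Hungarian algorithm
    return float(M[row, col].sum() / M.sum())

y_true = [0, 0, 0, 1, 1, 2, 2, 2]   # made-up labels for illustration
y_pred = [1, 1, 0, 0, 0, 2, 2, 2]
print(fscore(y_true, y_pred), accuracy(y_true, y_pred))
```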

    5.3 Results

Fig. 6 shows the Accuracy of the seven clustering algorithms on the 20 text collections. Presented in a different way, clustering results based on FScore and NMI are reported in Tables 3 and 4, respectively. For each data set in a row, the value in bold and underlined is the best result, while the value in bold only is the second best.

It can be observed that MVSC-IR and MVSC-IV perform consistently well. In Fig. 6, for 19 out of the 20 data sets, all except reviews, either both or one of the MVSC approaches are among the top two algorithms. The next most consistent performer is Spkmeans. The other algorithms might work well on certain data sets. For example, graphEJ yields an outstanding result on classic; graphCS and MMC are good on reviews. But they do not fare very well on the rest of the collections.

To have a statistical justification of the clustering performance comparisons, we also carried out statistical significance tests. Each of MVSC-IR and MVSC-IV was paired up with one of the remaining algorithms for a paired t-test [32]. Given two paired sets X and Y of N measured values, the null hypothesis of the test is that the differences between X and Y come from a population with mean 0. The alternative hypothesis is that the paired sets differ from each other in a significant way. In our experiment, these tests were done based on the evaluation values obtained on the 20 data sets. The typical 5 percent significance level was used. For example, consider the pair (MVSC-IR, k-means): from Table 3, it is seen that MVSC-IR dominates k-means w.r.t. FScore. If the paired t-test returns a p-value smaller than 0.05, we reject the null hypothesis and say that the dominance is significant. Otherwise, the null hypothesis is accepted and the comparison is considered insignificant.
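In practice such a test is a one-liner; the sketch below (Python, with made-up per-data-set FScore values that are not the paper's results) applies a paired t-test at the 5 percent level.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical FScore values of two algorithms on the same data sets
fscore_mvsc_ir = np.array([0.81, 0.74, 0.69, 0.88, 0.77, 0.72])
fscore_kmeans  = np.array([0.63, 0.70, 0.61, 0.79, 0.66, 0.68])

t_stat, p_value = ttest_rel(fscore_mvsc_ir, fscore_kmeans)
significant = p_value < 0.05          # 5 percent significance level
print(t_stat, p_value, significant)
```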

The outcomes of the paired t-tests are presented in Table 5. As the paired t-tests show, the advantage of MVSC-IR and MVSC-IV over the other methods is statistically significant. A special case is the graphEJ algorithm. On the one hand, MVSC-IR is not significantly better than graphEJ if based on FScore or NMI. On the other hand, even when MVSC-IR and MVSC-IV test as clearly better than graphEJ, the p-values can still be considered relatively large, although they are smaller than 0.05. The reason is that, as observed before, graphEJ's results on the classic data set are very different from those of the other algorithms. While interesting, these values can be considered as outliers, and including them in the statistical tests would affect the outcomes greatly. Hence, we also report in Table 5 the tests where classic was excluded and only results on the other 19 data sets were used. Under this circumstance, both MVSC-IR and MVSC-IV outperform graphEJ significantly with good p-values.

5.4 Effect of α on MVSC-IR's Performance

It has been known that criterion function-based partitional clustering methods can be sensitive to cluster size and balance. In the formulation of I_R in (16), there exists a parameter α called the regulating factor, α ∈ [0, 1]. To examine how the choice of α could affect MVSC-IR's performance, we evaluated MVSC-IR with different values of α from 0 to 1, with a 0.1 incremental interval. The assessment was done based on the clustering results in NMI, FScore, and Accuracy, each averaged over all the 20 given data sets. Since the evaluation metrics for different data sets could be very different from each other, simply taking the average over all the data sets would not be very meaningful. Hence, we employed the method used in [18] to transform the metrics into relative metrics before averaging.


    2. http://en.wikipedia.org/wiki/Hungarian_algorithm.


On a particular document collection S, the relative FScore measure of MVSC-IR with α = α_i is determined as follows:
\[ \mathrm{relative\_FScore}(I_R; S, \alpha_i) = \frac{\max_j \{ \mathrm{FScore}(I_R; S, \alpha_j) \}}{\mathrm{FScore}(I_R; S, \alpha_i)}, \]
where α_i, α_j ∈ {0.0, 0.1, ..., 1.0} and FScore(I_R; S, α_i) is the FScore result on data set S obtained by MVSC-IR with α = α_i. The same transformation was applied to NMI and Accuracy to yield relative_NMI and relative_Accuracy, respectively. MVSC-IR performs best with an α_i if its relative measure has a value of 1. Otherwise its relative measure is greater than 1; the larger this value is, the worse MVSC-IR with α_i performs in comparison with other settings of α. Finally, the average relative measures were calculated over all the data sets to present the overall performance.

Fig. 7 shows the plot of average relative FScore, NMI, and Accuracy w.r.t. different values of α. In a broad view, MVSC-IR performs the worst at the extreme values of α (0 and 1), and tends to get better when α is set at some soft value in between 0 and 1. Based on our experimental study, MVSC-IR always produces results within 5 percent of the best case, regarding any type of evaluation metric, with α from 0.2 to 0.8.
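The relative-metric averaging itself is straightforward to express; the sketch below (Python) follows the formula above on a hypothetical score table (the numbers are made up, not the paper's measurements).

```python
import numpy as np

def average_relative_metric(scores):
    """`scores[s][i]` is the metric value (e.g., FScore) on data set s for the
    i-th alpha setting; relative value = best score on that data set / this score,
    so 1.0 marks the best setting and larger means worse."""
    scores = np.asarray(scores, dtype=float)
    relative = scores.max(axis=1, keepdims=True) / scores
    return relative.mean(axis=0)      # average over data sets, one value per alpha

alphas = np.round(np.arange(0.0, 1.01, 0.1), 1)
# Hypothetical FScore values on three data sets for the eleven alpha settings
fscores = np.array([
    [0.60, 0.64, 0.66, 0.67, 0.68, 0.68, 0.67, 0.67, 0.66, 0.63, 0.59],
    [0.71, 0.74, 0.76, 0.77, 0.77, 0.78, 0.77, 0.76, 0.75, 0.73, 0.70],
    [0.55, 0.58, 0.60, 0.61, 0.62, 0.62, 0.62, 0.61, 0.60, 0.58, 0.54],
])
print(dict(zip(alphas.tolist(), np.round(average_relative_metric(fscores), 3))))
```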

6 MVSC AS REFINEMENT FOR k-MEANS

From the analysis of (12) in Section 3.2, MVS provides an additional criterion for measuring the similarity among documents compared with CS. Alternatively, MVS can be considered as a refinement of CS, and consequently the MVSC algorithms as refinements of spherical k-means, which uses CS. To further investigate the appropriateness and effectiveness of MVS and its clustering algorithms, we carried out another set of experiments in which solutions obtained by Spkmeans were further optimized by MVSC-IR and MVSC-IV. The rationale for doing so is that if the final solutions by MVSC-IR and MVSC-IV are better than the intermediate ones obtained by Spkmeans, MVS is indeed good for the clustering problem.


    Fig. 6. Clustering results in Accuracy. Left-to-right in legend corresponds to left-to-right in the plot.


These experiments would reveal more clearly whether MVS actually improves the clustering performance compared with CS.

In the previous section, the MVSC algorithms were compared against the existing algorithms that are closely related to them, i.e., ones that also employ similarity measures and criterion functions. In this section, we make use of the extended experiments to further compare MVSC with a different type of clustering approach, the NMF methods [10], which do not use any form of explicitly defined similarity measure for documents.

6.1 TDT2 and Reuters-21578 Collections

For variety and thoroughness, in this empirical study, we used two new document corpora described in Table 6: TDT2 and Reuters-21578. The original TDT2 corpus,3 which consists of 11,201 documents in 96 topics (i.e., classes), has been one of the most standard sets for the document clustering purpose. We used a subcollection of this corpus which contains 10,021 documents in the largest 56 topics. The Reuters-21578 Distribution 1.0 has been mentioned earlier in this paper. The original corpus consists of 21,578 documents in 135 topics.


TABLE 4. Clustering Results in NMI.

TABLE 3. Clustering Results in FScore.

    3. http://www.nist.gov/speech/tests/tdt/tdt98/index.html.


We used a subcollection having 8,213 documents from the largest 41 topics. The same two document collections had been used in the paper of the NMF methods [10]. Documents that appear in two or more topics were removed, and the remaining documents were preprocessed in the same way as in Section 5.1.

    6.2 Experiments and Results

    The following clustering methods:

    . Spkmeans: spherical k-means

    . rMVSC-IR: refinement of Spkmeans by MVSC-IR

    . rMVSC-IV: refinement of Spkmeans by MVSC-IV

    . MVSC-IR: normal MVSC using criterion IR

    . MVSC-IV: normal MVSC using criterion IV,

and two new document clustering approaches that do not use any particular form of similarity measure:

    . NMF: Nonnegative Matrix Factorization method

    . NMF-NCW: Normalized Cut Weighted NMF

were involved in the performance comparison. When used as a refinement for Spkmeans, the algorithms rMVSC-IR and rMVSC-IV worked directly on the output solution of Spkmeans. The cluster assignment produced by Spkmeans was used as the initialization for both rMVSC-IR and rMVSC-IV. We also investigated the performance of the original MVSC-IR and MVSC-IV further on the new data sets. Besides, it would be interesting to see how they and their Spkmeans-initialized versions fare against each other. What is more, two well-known document clustering approaches based on nonnegative matrix factorization, NMF and NMF-NCW [10], are also included for a comparison with our algorithms, which use the explicit MVS measure.

During the experiments, each of the two corpora in Table 6 was used to create six different test cases, each of which corresponded to a distinct number of topics used (c = 5, ..., 10). For each test case, c topics were randomly selected from the corpus and their documents were mixed together to form a test set. This selection was repeated 50 times so that each test case had 50 different test sets. The average performance of the clustering algorithms with k = c was calculated over these 50 test sets. This experimental setup is inspired by the similar experiments conducted in the NMF paper [10]. Furthermore, similar to the previous experimental setup in Section 5.2, each algorithm (including NMF and NMF-NCW) actually considered 10 trials on any test set before using the solution with the best obtainable objective function value as its final output.
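A test-set generator of this kind could be sketched as follows (Python); corpus_labels, the label array it would be applied to, is a placeholder for whichever corpus is being sampled.

```python
import numpy as np

def make_test_sets(labels, c, n_repeats=50, seed=0):
    """Sketch of the test-set construction: pick c topics at random and gather
    their documents, repeated n_repeats times; returns lists of document indices."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    topics = np.unique(labels)
    test_sets = []
    for _ in range(n_repeats):
        chosen = rng.choice(topics, size=c, replace=False)
        test_sets.append(np.where(np.isin(labels, chosen))[0])
    return test_sets

# e.g., 50 test sets with c = 5 topics drawn from a hypothetical label array:
# sets = make_test_sets(corpus_labels, c=5)
```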

The clustering results on TDT2 and Reuters-21578 are shown in Tables 7 and 8, respectively. For each test case in a column, the value in bold and underlined is the best among the results returned by the algorithms, while the value in bold only is the second best. From the tables, several observations can be made. First, MVSC-IR and MVSC-IV continue to show that they are good clustering algorithms by outperforming other methods frequently. They are always the best in every test case of TDT2.


Fig. 7. MVSC-IR's performance with respect to α.

TABLE 6. TDT2 and Reuters-21578 Document Corpora.

TABLE 5. Statistical Significance of Comparisons Based on Paired t-Tests with 5 Percent Significance Level.


Compared with NMF-NCW, they are better in almost all the cases, except only the case of Reuters-21578 with k = 5, where NMF-NCW is the best based on Accuracy.

The second observation, which is also the main objective of this empirical study, is that by applying MVSC to refine the output of spherical k-means, clustering solutions are improved significantly. Both rMVSC-IR and rMVSC-IV lead to higher NMI and Accuracy than Spkmeans in all the cases. Interestingly, there are many circumstances where the Spkmeans result is worse than that of the NMF clustering methods, but after being refined by the MVSCs, it becomes better. To have a more descriptive picture of the improvements, we can refer to the radar charts in Fig. 8. The figure shows details of a particular test case where k = 5. Remember that a test case consists of 50 different test sets. The charts display the result on each test set.


TABLE 7. Clustering Results on TDT2.

TABLE 8. Clustering Results on Reuters-21578.

Fig. 8. Accuracies on the 50 test sets (in sorted order of Spkmeans) in the test case k = 5.


They include the accuracy obtained by Spkmeans, and the results after refinement by MVSC, namely rMVSC-IR and rMVSC-IV. For effective visualization, they are sorted in ascending order of the accuracies obtained by Spkmeans (clockwise). As the patterns in both Figs. 8a and 8b reveal, improvement in accuracy is most likely attainable by rMVSC-IR and rMVSC-IV. Many of the improvements are by a considerably large margin, especially when the original accuracy obtained by Spkmeans is low. There are only a few exceptions where, after refinement, accuracy becomes worse. Nevertheless, the decreases in such cases are small.

Finally, it is also interesting to notice from Tables 7 and 8 that MVSC preceded by spherical k-means does not necessarily yield better clustering results than MVSC with random initialization. There are only a small number of cases in the two tables where rMVSC can be found to be better than MVSC. This phenomenon, however, is understandable. Given a locally optimal solution returned by spherical k-means, the rMVSC algorithms as a refinement method would be constrained by this local optimum itself and, hence, their search space might be restricted. The original MVSC algorithms, on the other hand, are not subject to this constraint, and are able to follow the search trajectory of their objective function from the beginning. Hence, while the performance improvement gained by refining the spherical k-means result with MVSC proves the appropriateness of MVS and its criterion functions for document clustering, this observation in fact only reaffirms its potential.

    7 CONCLUSIONS AND FUTURE WORK

In this paper, we propose a Multiviewpoint-based Similarity measuring method, named MVS. Theoretical analysis and empirical examples show that MVS is potentially more suitable for text documents than the popular cosine similarity. Based on MVS, two criterion functions, I_R and I_V, and their respective clustering algorithms, MVSC-IR and MVSC-IV, have been introduced. Compared with other state-of-the-art clustering methods that use different types of similarity measure, on a large number of document data sets and under different evaluation metrics, the proposed algorithms show that they can provide significantly improved clustering performance.

The key contribution of this paper is the fundamental concept of similarity measure from multiple viewpoints. Future methods could make use of the same principle, but define alternative forms for the relative similarity in (10), or not use the average but have other methods to combine the relative similarities according to the different viewpoints. Besides, this paper focuses on partitional clustering of documents. In the future, it would also be possible to apply the proposed criterion functions to hierarchical clustering algorithms. Finally, we have shown the application of MVS and its clustering algorithms to text data. It would be interesting to explore how they work on other types of sparse and high-dimensional data.

REFERENCES

[1] X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.-H. Zhou, M. Steinbach, D.J. Hand, and D. Steinberg, "Top 10 Algorithms in Data Mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37, 2007.

[2] I. Guyon, U.V. Luxburg, and R.C. Williamson, "Clustering: Science or Art?," Proc. NIPS Workshop Clustering Theory, 2009.

[3] I. Dhillon and D. Modha, "Concept Decompositions for Large Sparse Text Data Using Clustering," Machine Learning, vol. 42, nos. 1/2, pp. 143-175, Jan. 2001.

[4] S. Zhong, "Efficient Online Spherical K-means Clustering," Proc. IEEE Int'l Joint Conf. Neural Networks (IJCNN), pp. 3180-3185, 2005.

[5] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman Divergences," J. Machine Learning Research, vol. 6, pp. 1705-1749, Oct. 2005.

[6] E. Pekalska, A. Harol, R.P.W. Duin, B. Spillmann, and H. Bunke, "Non-Euclidean or Non-Metric Measures Can Be Informative," Structural, Syntactic, and Statistical Pattern Recognition, vol. 4109, pp. 871-880, 2006.

[7] M. Pelillo, "What Is a Cluster? Perspectives from Game Theory," Proc. NIPS Workshop Clustering Theory, 2009.

[8] D. Lee and J. Lee, "Dynamic Dissimilarity Measure for Support-Based Clustering," IEEE Trans. Knowledge and Data Eng., vol. 22, no. 6, pp. 900-905, June 2010.

[9] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Clustering on the Unit Hypersphere Using Von Mises-Fisher Distributions," J. Machine Learning Research, vol. 6, pp. 1345-1382, Sept. 2005.

[10] W. Xu, X. Liu, and Y. Gong, "Document Clustering Based on Non-Negative Matrix Factorization," Proc. 26th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 267-273, 2003.

[11] I.S. Dhillon, S. Mallela, and D.S. Modha, "Information-Theoretic Co-Clustering," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 89-98, 2003.

[12] C.D. Manning, P. Raghavan, and H. Schutze, An Introduction to Information Retrieval. Cambridge Univ. Press, 2009.

[13] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "A Min-Max Cut Algorithm for Graph Partitioning and Data Clustering," Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 107-114, 2001.

[14] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, "Spectral Relaxation for K-Means Clustering," Proc. Neural Information Processing Systems (NIPS), pp. 1057-1064, 2001.

[15] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.

[16] I.S. Dhillon, "Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning," Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 269-274, 2001.

[17] Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis. Springer-Verlag, 2007.

[18] Y. Zhao and G. Karypis, "Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering," Machine Learning, vol. 55, no. 3, pp. 311-331, June 2004.

[19] G. Karypis, "CLUTO: A Clustering Toolkit," technical report, Dept. of Computer Science, Univ. of Minnesota, http://glaros.dtc.umn.edu/~gkhome/views/cluto, 2003.

[20] A. Strehl, J. Ghosh, and R. Mooney, "Impact of Similarity Measures on Web-Page Clustering," Proc. 17th Nat'l Conf. Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI), pp. 58-64, July 2000.

[21] A. Ahmad and L. Dey, "A Method to Compute Distance Between Two Categorical Values of Same Attribute in Unsupervised Learning for Categorical Data Set," Pattern Recognition Letters, vol. 28, no. 1, pp. 110-118, 2007.

[22] D. Ienco, R.G. Pensa, and R. Meo, "Context-Based Distance Learning for Categorical Data Clustering," Proc. Eighth Int'l Symp. Intelligent Data Analysis (IDA), pp. 83-94, 2009.

[23] P. Lakkaraju, S. Gauch, and M. Speretta, "Document Similarity Based on Concept Tree Distance," Proc. 19th ACM Conf. Hypertext and Hypermedia, pp. 127-132, 2008.

[24] H. Chim and X. Deng, "Efficient Phrase-Based Document Similarity for Clustering," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 9, pp. 1217-1229, Sept. 2008.

[25] S. Flesca, G. Manco, E. Masciari, L. Pontieri, and A. Pugliese, "Fast Detection of XML Structural Similarity," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 2, pp. 160-175, Feb. 2005.

[26] E.-H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "WebACE: A Web Agent for Document Categorization and Exploration," Proc. Second Int'l Conf. Autonomous Agents (AGENTS '98), pp. 408-415, 1998.


[27] J. Friedman and J. Meulman, "Clustering Objects on Subsets of Attributes," J. Royal Statistical Soc. Series B (Statistical Methodology), vol. 66, no. 4, pp. 815-839, 2004.

[28] L. Hubert, P. Arabie, and J. Meulman, Combinatorial Data Analysis: Optimization by Dynamic Programming. SIAM, 2001.

[29] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. John Wiley & Sons, 2001.

[30] S. Zhong and J. Ghosh, "A Comparative Study of Generative Models for Document Clustering," Proc. SIAM Int'l Conf. Data Mining Workshop Clustering High Dimensional Data and Its Applications, 2003.

[31] Y. Zhao and G. Karypis, "Criterion Functions for Document Clustering: Experiments and Analysis," technical report, Dept. of Computer Science, Univ. of Minnesota, 2002.

[32] T.M. Mitchell, Machine Learning. McGraw-Hill, 1997.

Duc Thang Nguyen received the BEng degree in electrical and electronic engineering from Nanyang Technological University, Singapore, where he is also working toward the PhD degree in the Division of Information Engineering. Currently, he is an operations planning analyst at PSA Corporation Ltd., Singapore. His research interests include algorithms, information retrieval, data mining, optimizations, and operations research.

Lihui Chen received the BEng degree in computer science and engineering at Zhejiang University, China, and the PhD degree in computational science at the University of St. Andrews, United Kingdom. Currently, she is an associate professor in the Division of Information Engineering at Nanyang Technological University in Singapore. Her research interests include machine learning algorithms and applications, data mining, and web intelligence. She has published more than 70 refereed papers in international journals and conferences in these areas. She is a senior member of the IEEE, and a member of the IEEE Computational Intelligence Society.

Chee Keong Chan received the BEng degree from the National University of Singapore in electrical and electronic engineering, the MSc degree and DIC in computing from Imperial College, University of London, and the PhD degree from Nanyang Technological University. Upon graduation from NUS, he worked as an R&D engineer in Philips Singapore for several years. Currently, he is an associate professor in the Information Engineering Division, lecturing in subjects related to computer systems, artificial intelligence, software engineering, and cyber security. Through his years in NTU, he has published numerous research papers in conferences, journals, books, and book chapters. He has also provided numerous consultations to industry. His current research interest areas include data mining (text and solar radiation data mining), evolutionary algorithms (scheduling and games), and renewable energy.
