

Co-clustering documents and words using Bipartite Spectral Graph Partitioning

Inderjit S. Dhillon
Department of Computer Sciences
University of Texas, Austin, TX 78712
[email protected]

ABSTRACT

Both document clustering and word clustering are well studied problems. Most existing algorithms cluster documents and words separately but not simultaneously. In this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. To solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. The spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the NP-complete graph bipartitioning problem. We present experimental results to verify that the resulting co-clustering algorithm works well in practice.

1. INTRODUCTION

Clustering is the grouping together of similar objects. Given a collection of unlabeled documents, document clustering can help in organizing the collection, thereby facilitating future navigation and search. A starting point for applying clustering algorithms to document collections is to create a vector space model [20]. The basic idea is (a) to extract unique content-bearing words from the set of documents, treating these words as features, and (b) to then represent each document as a vector in this feature space. Thus the entire document collection may be represented by a word-by-document matrix A whose rows correspond to words and columns to documents. A non-zero entry in A, say A_ij, indicates the presence of word i in document j, while a zero entry indicates an absence. Typically, a large number of words exist in even a moderately sized set of documents, for example, in one test case we use 4303 words in 3893 documents. However, each document generally contains only a small number of words and hence, A is typically very sparse with almost 99% of the matrix entries being zero.
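As a rough, self-contained illustration of the vector space model described above (not code from the paper), the sketch below builds a sparse word-by-document matrix A from a toy corpus with scikit-learn; the corpus and all variable names are assumptions made for this example.

```python
# Illustrative sketch: build a sparse word-by-document matrix A
# (rows = words, columns = documents) from a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "patients blood hormone cancer",       # toy "medical" document
    "supersonic wing heat transfer",       # toy "aeronautics" document
    "library retrieval science book",      # toy "information retrieval" document
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # documents x words, sparse
A = X.T.tocsr()                            # words x documents, as in the paper's A
words = vectorizer.get_feature_names_out()

print(A.shape, A.nnz)                      # A is typically very sparse
```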

Existing document clustering methods include agglomerative clustering [25], the partitional k-means algorithm [7], projection-based methods including LSA [21], self-organizing maps [18] and multidimensional scaling [16]. For computational efficiency required in on-line clustering, hybrid approaches have been considered such as in [5]. Graph-theoretic techniques have also been considered for clustering; many earlier hierarchical agglomerative clustering algorithms [9] and some recent work [3, 23] model the similarity between documents by a graph whose vertices correspond to documents and weighted edges or hyperedges give the similarity between vertices. However these methods are computationally prohibitive for large collections since the amount of work required just to form the graph is quadratic in the number of documents.

Words may be clustered on the basis of the documents in which they co-occur; such clustering has been used in the automatic construction of a statistical thesaurus and in the enhancement of queries [4]. The underlying assumption is that words that typically appear together should be associated with similar concepts. Word clustering has also been profitably used in the automatic classification of documents, see [1]. More on word clustering may be found in [24].

In this paper, we consider the problem of simultaneous or co-clustering of documents and words. Most of the existing work is on one-way clustering, i.e., either document or word clustering. A common theme among existing algorithms is to cluster documents based upon their word distributions, while word clustering is determined by co-occurrence in documents. This points to a duality between document and term clustering. We pose this dual clustering problem in terms of finding minimum cut vertex partitions in a bipartite graph between documents and words. Finding a globally optimal solution to such a graph partitioning problem is NP-complete; however, we show that the second left and right singular vectors of a suitably normalized word-document matrix give an optimal solution to the real relaxation of this discrete optimization problem. Based upon this observation, we present a spectral algorithm that simultaneously partitions documents and words, and demonstrate that the algorithm gives good global solutions in practice.

A word about notation: small-bold letters such as x, u, p will denote column vectors, capital-bold letters such as A, M, B will denote matrices, and script letters such as V, D, W will usually denote vertex sets.

2. BIPARTITE GRAPH MODEL

First we introduce some relevant terminology about graphs.

A graph G = (V, E) is a set of vertices V = {1, 2, . . . , |V|} and a set of edges {i, j}, each with edge weight E_ij. The adjacency matrix M of a graph is defined by

M_{ij} = \begin{cases} E_{ij}, & \text{if there is an edge } \{i, j\}, \\ 0, & \text{otherwise.} \end{cases}

Given a partitioning of the vertex set V into two subsets V1 and V2, the cut between them will play an important role in this paper. Formally,

cut(V_1, V_2) = \sum_{i \in V_1,\, j \in V_2} M_{ij}.   (1)

The definition of cut is easily extended to k vertex subsets,

cut(V_1, V_2, \ldots, V_k) = \sum_{i < j} cut(V_i, V_j).
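To make these definitions concrete, here is a small sketch of ours (not the paper's) that evaluates cut(V1, V2) and the k-way extension on a toy symmetric adjacency matrix; the example graph and function names are illustrative.

```python
# Illustrative sketch: cut(V1, V2) = sum of M_ij over i in V1, j in V2,
# and the k-way cut as the sum of all pairwise cuts.
import numpy as np

M = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)   # toy symmetric adjacency matrix

def cut(M, V1, V2):
    return M[np.ix_(V1, V2)].sum()

def kway_cut(M, parts):
    return sum(cut(M, parts[i], parts[j])
               for i in range(len(parts))
               for j in range(i + 1, len(parts)))

print(cut(M, [0, 1], [2, 3]))               # 2.0 for this toy graph
print(kway_cut(M, [[0], [1], [2, 3]]))      # 4.0 for this toy graph
```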


5. If the graph G has c connected components then L has c eigenvalues that equal 0.

6. For any vector x, x^T L x = \sum_{\{i,j\} \in E} E_{ij} (x_i - x_j)^2.

7. For any vector x, and scalars \alpha and \beta,

(\alpha x + \beta e)^T L (\alpha x + \beta e) = \alpha^2 x^T L x.   (4)

Proof.

1. Part 1 follows from the definition of L.

2. This is easily seen by multiplying I_G and I_G^T.

3. By part 2, x^T L x = x^T I_G I_G^T x = y^T y \geq 0 for all x, where y = I_G^T x. This implies that L is symmetric positive semi-definite. All such matrices have non-negative real eigenvalues and a full set of n orthogonal eigenvectors [13].

4. Given any vector x, Lx = I_G (I_G^T x). Let k be the row of I_G^T x that corresponds to the edge {i, j}; then it is easy to see that

(I_G^T x)_k = \sqrt{E_{ij}}\, (x_i - x_j),   (5)

and so when x = e, Le = 0.

5. See [11].

6. This follows from equation (5).

7. This follows from part 4 above.

For the rest of the paper, we will assume that the graph G consists of exactly one connected component. We now see how the eigenvalues and eigenvectors of L give us information about partitioning the graph. Given a bipartitioning of V into V1 and V2 (V1 ∪ V2 = V), let us define the partition vector p that captures this division,

p_i = \begin{cases} +1, & i \in V_1, \\ -1, & i \in V_2. \end{cases}   (6)

Theorem 2. Given the Laplacian matrix L of G and a partition vector p, the Rayleigh Quotient

\frac{p^T L p}{p^T p} = \frac{4 \, cut(V_1, V_2)}{n}.

Proof. Clearly p^T p = n. By part 6 of Theorem 1, p^T L p = \sum_{\{i,j\} \in E} E_{ij} (p_i - p_j)^2. Thus edges within V1 or V2 do not contribute to the above sum, while each edge between V1 and V2 contributes a value of 4 times the edge-weight.

3.2 Eigenvectors as optimal partition vectors

Clearly, by Theorem 2, the cut is minimized by the trivial solution when all p_i are either -1 or +1. Informally, the cut captures the association between different partitions. We need an objective function that, in addition to small cut values, also captures the need for more balanced clusters.
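As a quick numeric sanity check of Theorem 2, here is a sketch of ours on a toy graph. It assumes the standard weighted graph Laplacian L = D - M, with D the diagonal degree matrix, which is consistent with the properties listed in Theorem 1; the graph and partition are illustrative.

```python
# Illustrative check: for the partition vector p of (6),
# p^T L p / p^T p should equal 4 cut(V1, V2) / n.
import numpy as np

M = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)
D = np.diag(M.sum(axis=1))
L = D - M                                    # assumed standard weighted Laplacian

p = np.array([1, 1, -1, -1], dtype=float)    # V1 = {0, 1}, V2 = {2, 3}
n = len(p)
rayleigh = (p @ L @ p) / (p @ p)
cut_val = M[np.ix_([0, 1], [2, 3])].sum()

print(rayleigh, 4 * cut_val / n)             # both are 2.0 for this toy graph
```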

We now present such an objective function. Let each vertex i be associated with a positive weight, denoted by weight(i), and let W be the diagonal matrix of such weights. For a subset of vertices V_l, define its weight to be weight(V_l) = \sum_{i \in V_l} weight(i) = \sum_{i \in V_l} W_{ii}. We consider subsets V1 and V2 to be balanced if their respective weights are equal.

The following objective function favors balanced clusters,

Q(V_1, V_2) = \frac{cut(V_1, V_2)}{weight(V_1)} + \frac{cut(V_1, V_2)}{weight(V_2)}.   (7)

Given two different partitionings with the same cut value, the above objective function value is smaller for the more balanced partitioning. Thus minimizing Q(V1, V2) favors partitions that have a small cut value and are balanced.
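A minimal sketch of the objective in (7), again on an illustrative toy graph and partition of our choosing: with unit vertex weights Q reduces to the ratio-cut, and with degree weights it becomes the normalized cut discussed in Section 3.3.

```python
# Illustrative sketch: Q(V1, V2) = cut/weight(V1) + cut/weight(V2).
import numpy as np

M = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)

def Q(M, weights, V1, V2):
    c = M[np.ix_(V1, V2)].sum()
    return c / weights[V1].sum() + c / weights[V2].sum()

unit_weights = np.ones(4)            # ratio-cut choice: weight(i) = 1
degree_weights = M.sum(axis=1)       # normalized-cut choice: weight(i) = sum_k E_ik

print(Q(M, unit_weights, [0, 1], [2, 3]))
print(Q(M, degree_weights, [0, 1], [2, 3]))
```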

We now show that the Rayleigh Quotient of the following generalized partition vector q equals the above objective function value.

Lemma 1. Given graph G, let L and W be its Laplacian and vertex weight matrices respectively. Let \nu_1 = weight(V_1) and \nu_2 = weight(V_2). Then the generalized partition vector q with elements

q_i = \begin{cases} +\sqrt{\nu_2 / \nu_1}, & i \in V_1, \\ -\sqrt{\nu_1 / \nu_2}, & i \in V_2, \end{cases}

satisfies q^T W e = 0 and q^T W q = weight(V).

Proof. Let y = We; then y_i = weight(i) = W_{ii}. Thus

q^T W e = \sqrt{\frac{\nu_2}{\nu_1}} \sum_{i \in V_1} weight(i) - \sqrt{\frac{\nu_1}{\nu_2}} \sum_{i \in V_2} weight(i) = 0.

Similarly, q^T W q = \sum_{i=1}^{n} W_{ii} q_i^2 = \nu_2 + \nu_1 = weight(V).

Theorem 3. Using the notation of Lemma 1,

\frac{q^T L q}{q^T W q} = \frac{cut(V_1, V_2)}{weight(V_1)} + \frac{cut(V_1, V_2)}{weight(V_2)}.

Proof. It is easy to show that the generalized partition vector q may be written as

q = \frac{\nu_1 + \nu_2}{2\sqrt{\nu_1 \nu_2}}\, p + \frac{\nu_2 - \nu_1}{2\sqrt{\nu_1 \nu_2}}\, e,

where p is the partition vector of (6). Using part 7 of Theorem 1, we see that

q^T L q = \frac{(\nu_1 + \nu_2)^2}{4 \nu_1 \nu_2}\, p^T L p.

Substituting the values of p^T L p and q^T W q, from Theorem 2 and Lemma 1 respectively, proves the result.

Thus to find the global minimum of (7), we can restrict our attention to generalized partition vectors of the form in Lemma 1. Even though this problem is still NP-complete, the following theorem shows that it is possible to find a real relaxation to the optimal generalized partition vector.

Theorem 4. The problem

\min_{q \neq 0} \frac{q^T L q}{q^T W q}, \quad \text{subject to} \quad q^T W e = 0,

is solved when q is the eigenvector corresponding to the 2nd smallest eigenvalue \lambda_2 of the generalized eigenvalue problem,

L z = \lambda W z.   (8)

Proof. This is a standard result from linear algebra [13].

3.3 Ratio-cut and Normalized-cut objectives

Thus far we have not specified the particular choice of vertex weights. A simple choice is to have weight(i) = 1 for all vertices i. This leads to the ratio-cut objective, which has been considered in [14] (for circuit partitioning),

Ratio-cut(V_1, V_2) = \frac{cut(V_1, V_2)}{|V_1|} + \frac{cut(V_1, V_2)}{|V_2|}.

An interesting choice is to make the weight of each vertex equal to the sum of the weights of edges incident on it, i.e., weight(i) = \sum_k E_{ik}. This leads to the normalized-cut criterion that was used in [22] for image segmentation. Note that for this choice of vertex weights, the vertex weight matrix W equals the degree matrix D, and weight(V_i) = cut(V_1, V_2) + within(V_i) for i = 1, 2, where within(V_i) is the sum of the weights of edges with both end-points in V_i. Then the normalized-cut objective function may be expressed as

N(V_1, V_2) = \frac{cut(V_1, V_2)}{\sum_{i \in V_1} \sum_k E_{ik}} + \frac{cut(V_1, V_2)}{\sum_{i \in V_2} \sum_k E_{ik}} = 2 - S(V_1, V_2),

where

S(V_1, V_2) = \frac{within(V_1)}{weight(V_1)} + \frac{within(V_2)}{weight(V_2)}.

Note that S(V1, V2) measures the strengths of associations within each partition. Thus minimizing the normalized-cut is equivalent to maximizing the proportion of edge weights that lie within each partition.

4. THE SVD CONNECTION

In the previous section, we saw that the second eigenvector of the generalized eigenvalue problem Lz = \lambda Dz provides a real relaxation to the discrete optimization problem of finding the minimum normalized cut. In this section, we present algorithms to find document and word clusterings using our bipartite graph model. In the bipartite case,

L = \begin{bmatrix} D_1 & -A \\ -A^T & D_2 \end{bmatrix}, \qquad D = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix},

where D_1 and D_2 are diagonal matrices such that D_1(i, i) = \sum_j A_{ij} and D_2(j, j) = \sum_i A_{ij}. Thus Lz = \lambda Dz may be written as

\begin{bmatrix} D_1 & -A \\ -A^T & D_2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \lambda \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}.   (9)

Assuming that both D_1 and D_2 are nonsingular, we can rewrite the above equations as

D_1^{1/2} x - D_1^{-1/2} A y = \lambda D_1^{1/2} x,
-D_2^{-1/2} A^T x + D_2^{1/2} y = \lambda D_2^{1/2} y.

Letting u = D_1^{1/2} x and v = D_2^{1/2} y, and after a little algebraic manipulation, we get

D_1^{-1/2} A D_2^{-1/2} v = (1 - \lambda) u,
D_2^{-1/2} A^T D_1^{-1/2} u = (1 - \lambda) v.

These are precisely the equations that define the singular value decomposition (SVD) of the normalized matrix A_n = D_1^{-1/2} A D_2^{-1/2}. In particular, u and v are the left and right singular vectors respectively, while (1 - \lambda) is the corresponding singular value. Thus instead of computing the eigenvector corresponding to the second (smallest) eigenvalue of (9), we can compute the left and right singular vectors corresponding to the second (largest) singular value of A_n,

A_n v_2 = \sigma_2 u_2, \qquad A_n^T u_2 = \sigma_2 v_2,   (10)

where \sigma_2 = 1 - \lambda_2. Computationally, working on A_n is much better since A_n is of size w \times d while the matrix L is of the larger size (w + d) \times (w + d).
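The computation above can be sketched with standard sparse linear algebra. The following is our illustration (random data, assumed variable names), not the paper's code; it assumes A has no empty rows or columns so that D_1 and D_2 are nonsingular.

```python
# Illustrative sketch: form A_n = D1^{-1/2} A D2^{-1/2} and get the singular
# vectors for its second-largest singular value, as in (10).
import numpy as np
from scipy.sparse import random as sparse_random, diags
from scipy.sparse.linalg import svds

A = sparse_random(100, 80, density=0.2, format="csr", random_state=0)  # words x docs
d1 = np.asarray(A.sum(axis=1)).ravel()        # D1(i, i) = sum_j A_ij
d2 = np.asarray(A.sum(axis=0)).ravel()        # D2(j, j) = sum_i A_ij
An = diags(d1 ** -0.5) @ A @ diags(d2 ** -0.5)

U, s, Vt = svds(An, k=2)                      # two largest singular values of A_n
order = np.argsort(-s)                        # svds does not guarantee sorted order
u2, v2 = U[:, order[1]], Vt[order[1], :]      # second left/right singular vectors
print(s[order[0]])                            # ~1.0 when the bipartite graph is connected
```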

The right singular vector v_2 will give us a bipartitioning of documents while the left singular vector u_2 will give us a bipartitioning of the words. By examining the relations (10) it is clear that this solution agrees with our intuition that a partitioning of documents should induce a partitioning of words, while a partitioning of words should imply a partitioning of documents.

4.1 The Bipartitioning Algorithm

The singular vectors u_2 and v_2 of A_n give a real approximation to the discrete optimization problem of minimizing the normalized cut. Given u_2 and v_2, the key task is to extract the optimal partition from these vectors.

The optimal generalized partition vector of Lemma 1 is two-valued. Thus our strategy is to look for a bi-modal distribution in the values of u_2 and v_2. Let m_1 and m_2 denote the bi-modal values that we are looking for. From the previous section, the second eigenvector of L is given by

z_2 = \begin{bmatrix} D_1^{-1/2} u_2 \\ D_2^{-1/2} v_2 \end{bmatrix}.   (11)

One way to approximate the optimal bipartitioning is by the assignment of z_2(i) to the bi-modal values m_j (j = 1, 2) such that the following sum-of-squares criterion is minimized,

\sum_{j=1}^{2} \sum_{z_2(i) \in m_j} (z_2(i) - m_j)^2.

The above is exactly the objective function that the classical k-means algorithm tries to minimize [9]. Thus we use the following algorithm to co-cluster words and documents:

Algorithm Bipartition

1. Given A, form A_n = D_1^{-1/2} A D_2^{-1/2}.
2. Compute the second singular vectors of A_n, u_2 and v_2, and form the vector z_2 as in (11).
3. Run the k-means algorithm on the 1-dimensional data z_2 to obtain the desired bipartitioning.

The surprising aspect of the above algorithm is that we run k-means simultaneously on the reduced representations of both words and documents to get the co-clustering.
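A hedged end-to-end sketch of Algorithm Bipartition using SciPy and scikit-learn follows; the function and variable names are ours, not the paper's, and it assumes a nonnegative word-by-document matrix with no empty rows or columns.

```python
# Illustrative implementation sketch of Algorithm Bipartition.
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def bipartition(A):
    A = csr_matrix(A)
    d1 = np.asarray(A.sum(axis=1)).ravel()
    d2 = np.asarray(A.sum(axis=0)).ravel()
    An = diags(d1 ** -0.5) @ A @ diags(d2 ** -0.5)      # step 1: A_n

    U, s, Vt = svds(An, k=2)                            # step 2: second singular vectors
    i2 = np.argsort(-s)[1]
    u2, v2 = U[:, i2], Vt[i2, :]
    z2 = np.concatenate([d1 ** -0.5 * u2,               # equation (11)
                         d2 ** -0.5 * v2])

    labels = KMeans(n_clusters=2, n_init=10).fit_predict(z2.reshape(-1, 1))  # step 3
    n_words = A.shape[0]
    return labels[:n_words], labels[n_words:]           # word labels, document labels
```

Words and documents that receive the same k-means label form one co-cluster, which is how a single run of k-means on z_2 partitions both sides of the bipartite graph at once.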

4.2 The Multipartitioning Algorithm

We can adapt our bipartitioning algorithm for the more general problem of finding k word and document clusters. One possibility is to use Algorithm Bipartition in a recursive manner. However, we favor a more direct approach. Just as the second singular vectors contain bi-modal information, the \ell = \lceil \log_2 k \rceil singular vectors u_2, u_3, \ldots, u_{\ell+1} and v_2, v_3, \ldots, v_{\ell+1} often contain k-modal information about the data set. Thus we can form the \ell-dimensional data set

Z = \begin{bmatrix} D_1^{-1/2} U \\ D_2^{-1/2} V \end{bmatrix},   (12)

where U = [u_2, \ldots, u_{\ell+1}] and V = [v_2, \ldots, v_{\ell+1}]. From this reduced-dimensional data set, we look for the best k-modal fit to the \ell-dimensional points m_1, \ldots, m_k by assigning each \ell-dimensional row Z(i) to m_j such that the sum-of-squares

\sum_{j=1}^{k} \sum_{Z(i) \in m_j} \| Z(i) - m_j \|^2

is minimized. This can again be done by the classical k-means algorithm. Thus we obtain the following algorithm.

Algorithm Multipartition(k)

1. Given A, form A_n = D_1^{-1/2} A D_2^{-1/2}.
2. Compute \ell = \lceil \log_2 k \rceil singular vectors of A_n, u_2, \ldots, u_{\ell+1} and v_2, \ldots, v_{\ell+1}, and form the matrix Z as in (12).
3. Run the k-means algorithm on the \ell-dimensional data Z to obtain the desired k-way multipartitioning.
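A corresponding sketch of Algorithm Multipartition(k), again with assumed names and the same caveats as the bipartitioning sketch above:

```python
# Illustrative implementation sketch of Algorithm Multipartition(k).
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def multipartition(A, k):
    A = csr_matrix(A)
    d1 = np.asarray(A.sum(axis=1)).ravel()
    d2 = np.asarray(A.sum(axis=0)).ravel()
    An = diags(d1 ** -0.5) @ A @ diags(d2 ** -0.5)      # step 1: A_n

    l = int(np.ceil(np.log2(k)))                        # l = ceil(log2 k)
    U, s, Vt = svds(An, k=l + 1)                        # step 2: vectors 2 .. l+1
    order = np.argsort(-s)
    U_l = U[:, order[1:l + 1]]
    V_l = Vt[order[1:l + 1], :].T
    Z = np.vstack([d1[:, None] ** -0.5 * U_l,           # equation (12)
                   d2[:, None] ** -0.5 * V_l])

    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)   # step 3
    n_words = A.shape[0]
    return labels[:n_words], labels[n_words:]
```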


Name               # Docs   # Words   # Nonzeros(A)
MedCran            2433     5042      117987
MedCran All        2433     17162     224325
MedCisi            2493     5447      109119
MedCisi All        2493     19194     213453
Classic3           3893     4303      176347
Classic3 30docs    30       1073      1585
Classic3 150docs   150      3652      7960
Yahoo K5           2340     1458      237969
Yahoo K1           2340     21839     349792

Table 1: Details of the data sets

5. EXPERIMENTAL RESULTS

For some of our experiments, we used the popular Medline (1033 medical abstracts), Cranfield (1400 aeronautical systems abstracts) and Cisi (1460 information retrieval abstracts) collections. These document sets can be downloaded from ftp://ftp.cs.cornell.edu/pub/smart. For testing Algorithm Bipartition, we created mixtures consisting of 2 of these 3 collections. For example, MedCran contains documents from the Medline and Cranfield collections. Typically, we removed stop words, and words occurring in < 0.2% and > 15% of the documents. However, our algorithm has an in-built scaling scheme and is robust in the presence of a large number of noise words, so we also formed word-document matrices by including all words, even stop words.

For testing Algorithm Multipartition, we created the Classic3 data set by mixing together Medline, Cranfield and Cisi, which gives a total of 3893 documents. To show that our algorithm works well on small data sets, we also created subsets of Classic3 with 30 and 150 documents respectively.

Our final data set is a collection of 2340 Reuters news articles downloaded from Yahoo in October 1997 [2]. The articles are from 6 categories: 142 from Business, 1384 from Entertainment, 494 from Health, 114 from Politics, 141 from Sports and 60 news articles from Technology. In the preprocessing, HTML tags were removed and words were stemmed using Porter's algorithm. We used 2 matrices from this collection: Yahoo K5 contains 1458 words while Yahoo K1 includes all 21839 words obtained after removing stop words. Details on all our test collections are given in Table 1.

5.1 Bipartitioning Results

In this section, we present bipartitioning results on the MedCran and MedCisi collections. Since we know the true class label for each document, the confusion matrix captures the goodness of document clustering. In addition, the measures of purity and entropy are easily derived from the confusion matrix [6].
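For reference, purity and entropy can be computed from such a confusion matrix along the following lines. This is only a sketch using one common set of definitions; the exact formulas in [6] may differ in normalization. The numbers below are taken from Table 2.

```python
# Illustrative sketch: purity and entropy of a clustering from its confusion
# matrix (rows = document clusters, columns = true classes).
import numpy as np

C = np.array([[1026,    0],          # MedCran confusion matrix from Table 2
              [   7, 1400]], dtype=float)

n = C.sum()
purity = C.max(axis=1).sum() / n                 # fraction in each cluster's majority class

p = C / C.sum(axis=1, keepdims=True)             # class distribution within each cluster
logp = np.log2(p, out=np.zeros_like(p), where=p > 0)
cluster_entropy = -(p * logp).sum(axis=1)
entropy = (C.sum(axis=1) / n) @ cluster_entropy  # size-weighted average over clusters

print(purity, entropy)
```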

Table 2 summarizes the results of applying Algorithm Bipartition to the MedCran data set. The confusion matrix at the top of the table shows that the document cluster D0 consists entirely of the Medline collection, while 1400 of the 1407 documents in D1 are from Cranfield. The bottom of Table 2 displays the top 7 words in each of the word clusters W0 and W1. The top words are those whose internal edge weights are the greatest. By the co-clustering, the word cluster Wi is associated with document cluster Di. It should be observed that the top 7 words clearly convey the concept of the associated document cluster.

Similarly, Table 3 shows that good bipartitions are also obtained on the MedCisi data set.

        Medline   Cranfield
D0:     1026      0
D1:     7         1400

W0: patients cells blood children hormone cancer renal
W1: shock heat supersonic wing transfer buckling laminar

Table 2: Bipartitioning results for MedCran

        Medline   Cisi
D0:     970       0
D1:     63        1460

W0: cells patients blood hormone renal rats cancer
W1: libraries retrieval scientific research science system book

Table 3: Bipartitioning results for MedCisi

Algorithm Bipartition uses the global spectral heuristic of using singular vectors, which makes it robust in the presence of noise words. To demonstrate this, we ran the algorithm on the data sets obtained without removing even the stop words. The confusion matrices of Table 4 show that the algorithm is able to recover the original classes despite the presence of stop words.

5.2 Multipartitioning Results

In this section, we show that Algorithm Multipartition gives us good results. Table 5 gives the confusion matrix for the document clusters and the top 7 words of the associated word clusters found in Classic3. Note that since k = 3 in this case, the algorithm uses \ell = \lceil \log_2 k \rceil = 2 singular vectors for co-clustering.

As mentioned earlier, the Yahoo K1 and Yahoo K5 data sets contain 6 classes of news articles. Entertainment is the dominant class containing 1384 documents while Technology contains only 60 articles. Hence the classes are of varied sizes. Table 6 gives the multipartitioning result obtained by using \ell = \lceil \log_2 k \rceil = 3 singular vectors. It is clearly difficult to recover the original classes. However, the presence of many zeroes in the confusion matrix is encouraging. Table 6 shows that clusters D1 and D2 consist mainly of the Entertainment class, while D4 and D5 are purely from Health and Sports respectively. The word clusters show the underlying concepts in the associated document clusters (recall that the words are stemmed in this example). Table 7 shows that similar document clustering is obtained when fewer words are used.

Finally, Algorithm Multipartition does well on small collections also. Table 8 shows that even when mixing small (and random) subsets of Medline, Cisi and Cranfield, our algorithm is able to recover these classes. This is in stark contrast to the spherical k-means algorithm that gives poor results on small document collections [7].

6. CONCLUSIONS

In this paper, we have introduced the novel idea of modeling a document collection as a bipartite graph, using which we proposed a spectral algorithm for co-clustering words and documents. This algorithm has some nice theoretical properties as it provides the optimal solution to a real relaxation of the NP-complete co-clustering objective. In addition, our algorithm works well on real examples as illustrated by our experimental results.

MedCran All:    Medline   Cranfield
D0:             1014      0
D1:             19        1400

MedCisi All:    Medline   Cisi
D0:             925       0
D1:             108       1460

Table 4: Results for MedCran All and MedCisi All


        Med   Cisi   Cran
D0:     965   0      0
D1:     65    1458   10
D2:     3     2      1390

W0: patients cells blood hormone renal cancer rats
W1: library libraries retrieval scientific science book system
W2: boundary layer heat shock mach supersonic wing

Table 5: Multipartitioning results for Classic3

        Bus   Entertain   Health   Politics   Sports   Tech
D0:     120   82          0        52         0        57
D1:     0     833         0        1          100      0
D2:     0     259         0        0          0        0
D3:     22    215         102      61         1        3
D4:     0     0           392      0          0        0
D5:     0     0           0        0          40       0

W0: clinton campaign senat house court financ white
W1: septemb tv am week music set top
W2: film emmi star hollywood award comedi fienne
W3: world health new polit entertain tech sport
W4: surgeri injuri undergo hospit england accord recommend
W5: republ advanc wildcard match abdelatif ac adolph

Table 6: Multipartitioning results for Yahoo K1

        Bus   Entertain   Health   Politics   Sports   Tech
D0:     120   113         0        1          0        59
D1:     0     1175        0        0          136      0
D2:     19    95          4        73         5        1
D3:     1     6           217      0          0        0
D4:     0     0           273      0          0        0
D5:     2     0           0        40         0        0

W0: compani stock financi pr busi wire quote
W1: film tv emmi comedi hollywood previou entertain
W2: presid washington bill court militari octob violat
W3: health help pm death famili rate lead
W4: surgeri injuri undergo hospit england recommend discov
W5: senat clinton campaign house white financ republicn

Table 7: Multipartitioning results for Yahoo K5

Classic3 30docs:    Med   Cisi   Cran
D0:                 9     0      0
D1:                 0     10     0
D2:                 1     0      10

Classic3 150docs:   Med   Cisi   Cran
D0:                 49    0      0
D1:                 0     50     0
D2:                 1     0      50

Table 8: Results for Classic3 30docs and Classic3 150docs


7. REFERENCES

[1] L. D. Baker and A. McCallum. Distributional clustering of words for text classification. In ACM SIGIR, pages 96-103, 1998.

[2] D. Boley. Hierarchical taxonomies using divisive partitioning. Technical Report TR-98-012, University of Minnesota, 1998.

[3] D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Document categorization and query generation on the World Wide Web using WebACE. AI Review, 1998.

[4] C. J. Crouch. A cluster-based approach to thesaurus construction. In ACM SIGIR, pages 309-320, 1988.

[5] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In ACM SIGIR, 1992.

[6] I. S. Dhillon, J. Fan, and Y. Guan. Efficient clustering of very large document collections. In R. Grossman, C. Kamath, V. Kumar, and R. Namburu, editors, Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.

[7] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143-175, January 2001. Also appears as IBM Research Report RJ 10147, July 1999.

[8] W. E. Donath and A. J. Hoffman. Lower bounds for the partitioning of graphs. IBM Journal of Research and Development, 17:420-425, 1973.

[9] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2nd edition, 2000.

[10] C. M. Fiduccia and R. M. Mattheyses. A linear time heuristic for improving network partitions. Technical Report 82CRD130, GE Corporate Research, 1982.


[11] M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23:298-305, 1973.

[12] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Company, 1979.

[13] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.

[14] L. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on CAD, 11:1074-1085, 1992.

[15] K. M. Hall. An r-dimensional quadratic placement algorithm. Management Science, 11(3):219-229, 1970.

[16] R. V. Katter. Study of document representations: Multidimensional scaling of indexing terms. System Development Corporation, Santa Monica, CA, 1967.

[17] B. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 29(2):291-307, 1970.

[18] T. Kohonen. Self-organizing Maps. Springer, 1995.

[19] A. Pothen, H. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11(3):430-452, July 1990.

[20] G. Salton and M. J. McGill. Introduction to Modern Retrieval. McGraw-Hill Book Company, 1983.

[21] H. Schutze and C. Silverstein. Projections for efficient document clustering. In ACM SIGIR, 1997.

[22] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):888-905, August 2000.

[23] A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In AAAI 2000 Workshop on AI for Web Search, 2000.

[24] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, second edition, 1979.

[25] E. M. Voorhees. The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval. PhD thesis, Cornell University, 1986.