
A Generic Framework for Efficient Subspace Clustering of High-Dimensional Data

Hans-Peter Kriegel, Peer Kröger, Matthias Renz, Sebastian Wurst

    Institute for Computer Science, University of Munich

    {kriegel,kroegerp,renz,wurst}@dbs.ifi.lmu.de

    Abstract

Subspace clustering has been investigated extensively since traditional clustering algorithms often fail to detect meaningful clusters in high-dimensional data spaces. Many recently proposed subspace clustering methods suffer from two severe problems: First, the algorithms typically scale exponentially with the data dimensionality and/or the subspace dimensionality of the clusters. Second, for performance reasons, many algorithms use a global density threshold for clustering, which is quite questionable since clusters in subspaces of significantly different dimensionality will most likely exhibit significantly varying densities. In this paper, we propose a generic framework to overcome these limitations. Our framework is based on an efficient filter-refinement architecture that scales at most quadratically w.r.t. the data dimensionality and the dimensionality of the subspace clusters. It can be applied to any clustering notion, including notions that are based on a local density threshold. A broad experimental evaluation on synthetic and real-world data empirically shows that our method achieves a significant gain of runtime and quality in comparison to state-of-the-art subspace clustering algorithms.

    1 Introduction

One of the primary data mining tasks is clustering, which aims at partitioning the data objects into groups (clusters) of similar objects. A lot of work has been done in the area of clustering (see e.g. [10] for an overview). Many real-world data sets consist of very high dimensional feature spaces. In such high-dimensional feature spaces, many features may be irrelevant for clustering. In addition, different subgroups of features may be irrelevant w.r.t. varying clusters, and different clusters in varying subspaces may overlap. Thus, clusters can no longer be found in the full-dimensional space. As a consequence, traditional clustering algorithms usually have scalability and/or efficiency problems. Usually, global dimensionality reduction techniques such as PCA cannot be applied to these data sets because they cannot account for local trends of overlapping clusters in the data.

A prominent application for these problems is the analysis of gene expression data. Gene expression data contains the expression level of thousands of genes, indicating how active each gene is, across a set of samples. A common task is to find clusters of functionally related genes having a similar expression level. Genes may exhibit several different functionalities that may be needed in different sets of samples. Thus, the genes may be clustered differently in varying subsets of samples. If the samples are patients, another task is clustering the patients into homogeneous groups according to specific phenotypes such as gender, age, or diseases. Since different genes are responsible for different phenotypes, we face the problem that the patients can be clustered differently in varying subsets of genes.

To cope with the problem of local feature irrelevance, the procedure of feature selection, deciding about the relevance of different features, has to be linked more closely with the clustering process. In recent years, several methods have been proposed that couple the analysis of feature relevance, i.e. variance along an attribute, with the process of clustering. These existing methods can be distinguished into: (1) Algorithms that compute a discrete partitioning of the data objects, i.e. each object is uniquely assigned to one cluster; an arbitrary subset of attributes may be relevant for the cluster. (2) Algorithms that automatically detect all clusters in all subspaces of the original feature space. Clusters in different subspaces may overlap.

Obviously, algorithms that allow overlapping clusters generate more information than algorithms that compute a unique assignment, because they uncover the information that objects may be clustered differently in varying subspaces. As discussed above, our motivating applications call for solutions that explicitly allow overlapping clusters. Thus, in this paper, we focus on overlapping clusters. Throughout the rest of the paper, when we speak of subspace clustering, we mean the task of finding overlapping clusters.

In this paper, we present FIRES (FIlter REfinement Subspace clustering), a general framework for efficient subspace clustering. It is generic in the sense that it works with all kinds of clustering notions. It starts with 1D clusters that can be constructed with a clustering method of choice and merges these 1D clusters to generate approximations of subspace clusters. An optional refinement step can compute the true subspace clusters, again using any clustering notion of choice and a variable cluster criterion. Thus, FIRES overcomes two severe limitations of existing subspace clustering methods, as we will discuss later (cf. Section 2).

The rest of this paper is organized as follows. We discuss related work and point out our contributions in Section 2. The generic FIRES framework is described in Section 3. Section 4 presents a broad experimental evaluation of FIRES. Section 5 concludes the paper.

    2 Related Work and Contributions

The complexity of the exhaustive search for all subspace clusters is O(2^d), where d is the data dimensionality. Most subspace clustering algorithms work bottom-up, similar to the Apriori algorithm for finding frequent itemsets, starting with 1D clusters and merging these clusters to compute clusters of higher dimensionality. To reduce the search space, the algorithms usually rely on a downward closure property of density enabling the Apriori-like search strategy. [13] provides a review of subspace clustering algorithms.

CLIQUE [2], the pioneering approach to subspace clustering, uses a grid-based clustering notion. The data space is partitioned by an axis-parallel grid into equi-sized units of fixed width. Only units which contain at least a threshold number of points are considered as dense. Dense units satisfy the downward closure property. A cluster is defined as a maximal set of adjacent dense units. Obviously, the accuracy and the efficiency of CLIQUE heavily depend on the granularity and the positioning of the grid; on the other hand, a higher grid granularity results in higher runtimes. Modifications of CLIQUE include ENCLUS [5] and MAFIA [12].

SUBCLU [11] uses the DBSCAN cluster model of density-connected sets [8]. It is shown that density-connected sets satisfy the downward closure property. Compared to the grid-based approaches, SUBCLU achieves a better clustering quality but requires a higher runtime.

Another recent approach called DOC [14] uses a random seed of points to guide a greedy search for subspace clusters. DOC measures the density of subspace clusters using hypercubes of fixed width w and thus has problems similar to those of CLIQUE.

Recently, some feature selection techniques have been proposed that evaluate the clustering quality of subspaces [6, 3]. Most of these methods cannot use an Apriori-style approach but use random or greedy search methods instead, producing incomplete results or suffering from high runtimes.

Pattern-based clustering [16] aims at grouping points into clusters that exhibit a similar trend in a subset of attributes, rather than points that are dense in some subspace. This approach is too general for our focus and usually suffers from high runtimes.

As discussed above, projected clustering algorithms that do not allow overlapping clusters, e.g. PROCLUS [1] and PreDeCon [4], are also not considered here.

Our Contributions. The existing subspace clustering algorithms usually have the following two drawbacks. First, all methods usually scale exponentially with the number of features and/or the dimensionality of the subspace clusters [13]. Second, most subspace clustering approaches use a global density threshold for performance reasons. Usually, the global density threshold provides the downward closure property ensuring the applicability of the Apriori-style algorithm. However, it is quite questionable that a global density threshold is applicable to clusters in subspaces of considerably different dimensionality, since density naturally decreases with increasing dimensionality. In addition, the density of clusters in a common subspace may vary significantly.

Here, we propose a framework for efficient subspace clustering that (i) scales at most quadratically with the data dimensionality and the subspace dimensionality, and (ii) allows the use of any sophisticated clustering model, e.g. one that accounts for the subspace dimensionality and for local point density.

    3 Efficient Subspace Clustering

In the following, we assume that D is a database of n points in a d-dimensional feature space, i.e. D ⊆ ℝ^d.

    3.1 General Idea

Subspace clustering is a complex problem due to the high complexity of the search and the output space. The time-consuming step is to identify the relevant attributes of the subspace cluster. We claim that it is desirable to generate subspace clusters of maximum dimensionality. Since each cluster in a subspace S is also hidden in each projection of S, this redundancy can be encapsulated in the subspace cluster of maximal dimensionality. This reduces the search space and especially the output space significantly. Our key idea is to use a filter-refinement architecture to speed up the subspace finding process, i.e. we efficiently compute approximations of the subspace clusters, which are of maximum dimensionality and can be refined to generate the true subspace clusters.

    Our framework FIRES (FIlter REfinement Subspaceclustering) consists of the following three steps:

1. Preclustering: First, all 1D clusters, called base-clusters, are computed. This is similar to existing subspace clustering approaches and can be done using any clustering algorithm of choice.

2. Generation of Subspace Cluster Approximations: In a second step, the base-clusters are merged to find maximal-dimensional subspace cluster approximations. However, we do not merge them in an Apriori style but use an algorithm that scales at most quadratically w.r.t. the number of dimensions.

3. Postprocessing of Subspace Clusters: A third step can be applied to refine the cluster approximations retrieved after the second step.

The key benefits of our method are (1) the dramatically reduced runtime and (2) the possibility to apply, in Steps 1 and 3, a clustering model that can detect clusters of different densities.
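As a reading aid, here is a minimal Python sketch of this control flow. It is an illustration under our own assumptions, not the authors' implementation; the callables precluster_1d, merge_base_clusters, and refine are placeholders for the components described in Sections 3.2-3.4.

```python
# Illustrative sketch of the FIRES filter-refinement pipeline (not the original code).
# A base-cluster is represented as a pair (dimension, set of point ids).

def fires(data, precluster_1d, merge_base_clusters, refine):
    """data: list of d-dimensional points (tuples of floats).
    precluster_1d(values) -> lists of point indices (1D clusters) for one dimension.
    merge_base_clusters(base_clusters) -> list of subspace cluster approximations.
    refine(data, approximation) -> refined subspace cluster."""
    d = len(data[0])

    # Step 1: preclustering -- compute 1D base-clusters in every dimension.
    base_clusters = []
    for dim in range(d):
        projection = [p[dim] for p in data]
        for member_ids in precluster_1d(projection):
            base_clusters.append((dim, set(member_ids)))

    # Step 2: merge base-clusters into maximal-dimensional approximations
    # (at most quadratic in the number of dimensions, see Section 3.3).
    approximations = merge_base_clusters(base_clusters)

    # Step 3: optional pruning and refinement of every approximation (Section 3.4).
    return [refine(data, approx) for approx in approximations]
```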

    3.2 Preclustering

Similar to existing subspace clustering approaches, FIRES first generates clusterings in each dimension of the data space. Thereby we assume that the clusterings of the single dimensions suffice to identify all higher-dimensional subspace clusters. For this preprocessing step, we can choose any clustering method, e.g. DBSCAN [8], k-means [10], SNN [7] or grid-based clustering. In the following, we call the 1D clusters resulting from the preclustering base-clusters, denoted by C1.

For the preclustering, we propose a filter step which drops irrelevant base-clusters that do not contain any vital subspace cluster information. Small base-clusters are unlikely to include significant subspace clusters, because they usually indicate a sparse area in the higher-dimensional space. For this reason, we compute the average size savg of all base-clusters and skip those base-clusters whose size is smaller than 25% of savg. The threshold of 25% is empirically determined and has shown the best results in our experiments.
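A hedged sketch of this size filter follows; representing base-clusters as sets of point ids is an assumption of the illustration, not part of the paper.

```python
def filter_base_clusters(base_clusters, fraction=0.25):
    """Drop base-clusters smaller than `fraction` (default 25%) of the average size.
    base_clusters: list of sets of point ids, one set per 1D base-cluster."""
    if not base_clusters:
        return []
    s_avg = sum(len(c) for c in base_clusters) / len(base_clusters)
    return [c for c in base_clusters if len(c) >= fraction * s_avg]
```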

Figure 1. Overlapping subspace clusters: base-clusters cA, cB, and cC indicate the two subspace clusters cAB and cAC.

3.3 Generation of Subspace Cluster Approximations

Approximations of maximal-dimensional subspace clusters are determined by suitably merging the base-clusters derived from our preprocessing step. Out of all merge possibilities, whose number is exponential in the number of base-clusters, we have to find the most promising merge candidates by searching for all base-clusters which contain nearly the same objects. In the following, we call capp a subspace cluster approximation of a subspace cluster csub, iff capp is built by merging all base-clusters sharing objects with csub. Those objects which are shared by a set SB of base-clusters must be close in all dimensions of the base-clusters in SB, and thus must be close in the subspace spanned by SB. Consequently, a good indication for the existence of a subspace cluster is a large intersection between base-clusters. In the following, we denote base-clusters as similar if they share a sufficiently high number of objects. The similarity between base-clusters c1, c2 ∈ C1 is defined as sim(c1, c2) = |c1 ∩ c2|.

In contrast to functions commonly used to measure the similarity between sets, such as the Jaccard distance, our similarity function is insensitive to noise. This is necessary because the base-clusters usually contain, besides true subspace-cluster objects, a lot of noise in higher-dimensional space.
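In code, the similarity is just the size of the intersection of the two member sets. The sketch below (base-clusters as Python sets of point ids, an assumption of this illustration) contrasts it with the Jaccard coefficient mentioned above.

```python
def sim(c1, c2):
    """Similarity of two base-clusters: the number of shared objects, |c1 ∩ c2|."""
    return len(c1 & c2)

def jaccard(c1, c2):
    """Jaccard coefficient, shown only for contrast: normalizing by the union makes
    it shrink when the base-clusters contain much noise, which sim avoids."""
    union = c1 | c2
    return len(c1 & c2) / len(union) if union else 0.0
```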

Before starting with the merge phase, we have to detect promising merge candidates, possibly hidden in the set of base-clusters. Different subspace clusters may overlap in one or more dimensions, thus one base-cluster can include objects of multiple subspace clusters. In our approach, we try to merge similar base-clusters. The example depicted in Figure 1 shows two subspace clusters cAB and cAC, indicated by the base-clusters cA, cB and cC. The base-cluster cA is similar to the base-cluster cB and to cC. However, cB and cC are obviously not similar. In order to avoid merging all three base-clusters cA, cB, and cC, we have to split cA into two separate instances, such that one instance mainly contains the objects of cAB and the other contains the objects of cAC. More generally, each base-cluster which includes more than one overlapping subspace cluster should be split into multiple base-clusters in such a way that each of them contains at most one of the subspace clusters. In the example of Figure 1, the base-cluster cA is similar to two base-clusters which are from the same dimension. In this special case, we can easily detect that cA indicates two subspace clusters, since two base-clusters of the same dimension must be disjoint. However, it may happen that one base-cluster of dimension D includes separate subspace clusters which may differ in all dimensions except for D. In our approach, we apply simple heuristics for the decision when to split base-clusters. First, we search for the most-similar-cluster of a split candidate, which is defined as follows:

Definition 1 (Most Similar Cluster) Let c ∈ C1. The most similar base-cluster MSC(c) ∈ C1 of c (MSC(c) ≠ c) fulfills the following condition: ∀cp ∈ C1: sim(c, MSC(c)) ≥ sim(c, cp).

Let the base-cluster cS be the split candidate and let cT be the most similar cluster of cS, i.e. cT = MSC(cS). If the number of objects shared by both clusters cS and cT is very high, then the shared objects probably indicate a subspace cluster. If there is another subspace cluster hidden in cS, the number of the remaining objects o ∈ (cS \ cT) must also be large. We simply split cS iff both the number of objects shared by cS and cT and the number of the remaining objects in cS \ cT exceed a specific threshold. Let savg again be the average size of all base-clusters; then we split cS into cS1 = (cS ∩ cT) and cS2 = (cS \ cT), iff |cS ∩ cT| ≥ (2/3)·savg and |cS \ cT| ≥ (2/3)·savg.
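A hedged sketch of this split rule is shown below, again with base-clusters as sets of point ids; most_similar_cluster is a plain reading of Definition 1 under that assumption.

```python
def most_similar_cluster(c, base_clusters):
    """MSC(c): the base-cluster (other than c) sharing the most objects with c."""
    return max((x for x in base_clusters if x is not c), key=lambda x: len(c & x))

def split_candidate(c_s, base_clusters, s_avg):
    """Split c_s into (c_s intersect c_t) and (c_s minus c_t), where c_t = MSC(c_s),
    iff both parts contain at least 2/3 of the average base-cluster size s_avg."""
    c_t = most_similar_cluster(c_s, base_clusters)
    shared, remainder = c_s & c_t, c_s - c_t
    if len(shared) >= (2 / 3) * s_avg and len(remainder) >= (2 / 3) * s_avg:
        return [shared, remainder]
    return [c_s]          # no split: keep the base-cluster as it is
```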

After detecting all base-clusters, we have to identify the most promising merge candidates. Our similarity function does not suffice to identify them, because it penalizes small clusters and prefers large clusters. In addition, our goal is not to merge only two base-clusters, but to consider a set of base-clusters to be merged simultaneously. Thus, we need a relative similarity measure to identify suitable merge candidates.

Definition 2 (k-Most-Similar-Clusters) Let c ∈ C1. We call MSCk(c) ⊆ C1 the k-most-similar-clusters of c, iff MSCk(c) is the smallest subset of C1 that contains at least k base-clusters and the following condition holds: ∀cp ∈ MSCk(c), ∀cq ∈ C1 \ MSCk(c): sim(c, cp) > sim(c, cq).

The k-most-similar-clusters of cA contain those k base-clusters that share the most objects with cA.

In our approach, we try to merge those base-clusters which are all mostly similar to each other, i.e. that share many of their k-most-similar-clusters. This avoids that a set of well-fitting base-clusters is destroyed if we add a merge candidate which is similar to one of the base-clusters but does not fit well to the entire set. Thus, in order to obtain base-cluster sets with homogeneous similarity between their elements, we prefer to merge base-clusters where many of their k-most-similar-clusters are equal.

Definition 3 (Best-Merge-Candidates) Let c ∈ C1 and k, μ ∈ ℕ+ (μ ≤ k). The best-merge-candidates BMC(c) of c are defined as the set BMC(c) := {x ∈ C1 | |MSCk(c) ∩ MSCk(x)| ≥ μ}.

Merging all best-merge-candidates should suffice to filter out unfitting merge candidates and seems very promising for finding high-quality subspace clusters. However, due to the limiting parameter k, this method may be too restrictive, and thus important cluster information may be lost. We bypass this problem by allowing best-merge-candidate sets to be merged. Two base-clusters are merged if they share at least one base-cluster which fulfills the properties of a best-merge-cluster, defined as follows:

Definition 4 (Best-Merge-Cluster) Let c ∈ C1 and minClu ∈ ℕ+. Then c is called a best-merge-cluster iff |BMC(c)| ≥ minClu.

The generation of subspace cluster approximations proceeds as follows: We first determine all best-merge-clusters. Then we group all pairs of best-merge-clusters cA and cB for which cA ∈ BMC(cB) and vice versa. Finally, we add all their best-merge-candidates to these groups. Each resulting set of base-clusters forms one of our subspace cluster approximations.

    3.4 Postprocessing

Significant subspace clusters should achieve a good trade-off between dimensionality and cluster size. However, the information in the base-clusters does not suffice to generate correct and complete subspace clusters. Therefore, we require postprocessing steps in order to achieve good results. We propose two steps which are subsequently applied to all subspace cluster approximations: (1) Pruning improves the quality of the mergeable-cluster set by identifying and removing meaningless base-clusters. (2) Refinement removes noise and completes the subspace clusters.


Pruning. There may be clusters in the subspace cluster approximations which do not contain relevant information of the corresponding subspace cluster. In order to eliminate these irrelevant clusters, we need a notion of the quality of a cluster C w.r.t. its dimensionality dim(C) and the number of objects shared by all of its base-clusters, size(C) = |c1 ∩ ... ∩ cn|. We define the quality of a subspace cluster C as score(C) = f(size(C)) · dim(C), where f: ℕ → ℝ₀⁺ denotes a weighting function for the cluster size. Obviously, the size of the corresponding subspace cluster should have less weight than its dimensionality. In our experiments, we achieved the best results with f(x) = √x.

The pruning removes a merge candidate c ∈ C1 from a subspace cluster approximation C if the following condition holds:

score(C \ {c}) > score(C) and ∀cp ∈ C, cp ≠ c: score(C \ {cp}) ≤ score(C \ {c}).

After removing the merge candidate c, we continue the top-down pruning with the remaining merge candidates until we achieve the best score. The score function prevents large high-dimensional subspace clusters from degenerating into low-dimensional clusters.
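A hedged sketch of the score function and this top-down pruning loop: an approximation is taken to be a list of base-clusters, each a set of point ids (an assumption of the sketch), and the square-root weighting is the one reported above.

```python
from math import sqrt

def score(approx):
    """score(C) = f(size(C)) * dim(C) with f(x) = sqrt(x), where size(C) is the
    number of objects shared by all base-clusters of the approximation."""
    if not approx:
        return 0.0
    shared = set.intersection(*approx)
    return sqrt(len(shared)) * len(approx)

def prune(approx):
    """Repeatedly drop the base-cluster whose removal improves the score the most,
    until no single removal improves the score any further."""
    current = list(approx)
    while len(current) > 1:
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        best = max(candidates, key=score)
        if score(best) > score(current):
            current = best
        else:
            break
    return current
```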

Refinement. The subspace cluster csub which is approximated by capp can be built from the base-clusters c1, ..., cn merged in capp. Two building methods can be distinguished: the union csub = c1 ∪ ... ∪ cn and the intersection csub = c1 ∩ ... ∩ cn. Obviously, neither variant yields the true subspace cluster csub. Due to the fact that the base-clusters contain a lot of noise in a higher-dimensional subspace, the intersection variant seems to produce more accurate approximations than the union variant. However, the intersection merge has significant drawbacks: the detected subspace cluster would be too strict, so many promising candidates would be lost. Furthermore, parts of the true subspace cluster can lie outside of the approximation, which would lead to incomplete results.

Figure 2. Comparison of accuracy: precision of FIRES, SUBCLU, and CLIQUE for Cluster 1 (320 points), Cluster 2 (313 points), Cluster 3 (354 points), Noise (1653 points), and overall.

We achieve better subspace cluster results when applying an additional refinement step. First, we use the union of all base-clusters in order to avoid that potential cluster candidates are lost or become incomplete. Then, we cluster the merged set of objects again in the corresponding subspace. Thereby, we can apply any clustering algorithm, e.g. DBSCAN, SNN clustering, k-means, grid-based clustering or even hierarchical clustering. We propose the use of SNN to detect clusters of different density, or the use of DBSCAN. In the latter case, the density threshold is adjusted to the subspace dimensionality, i.e. the minPts value is kept constant and we adjust the ε-range as follows: ε = ε1 · n / n^(1/d), where d denotes the dimensionality of the subspace and ε1 the ε-value in the 1D subspace.
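For the refinement step, a minimal sketch is given below. It uses DBSCAN from scikit-learn as a stand-in for "any clustering algorithm of choice" (the library choice is an assumption, not part of the paper) and adjusts ε to the subspace dimensionality following the formula above.

```python
import numpy as np
from sklearn.cluster import DBSCAN   # stand-in for any density-based clustering

def refine(data, point_ids, dims, eps_1, min_pts):
    """Re-cluster the union of the merged base-clusters (point_ids) in the
    subspace spanned by the dimensions `dims`.  minPts is kept constant while
    eps is adjusted to the subspace dimensionality d: eps = eps_1 * n / n**(1/d)."""
    n = len(data)
    d = len(dims)
    eps = eps_1 * n / n ** (1.0 / d)

    ids = sorted(point_ids)                                  # fix an ordering once
    subspace = np.asarray(data, dtype=float)[ids][:, list(dims)]
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(subspace)

    # Collect one refined cluster per DBSCAN label; label -1 marks noise.
    return [{ids[i] for i in np.flatnonzero(labels == lab)}
            for lab in set(labels.tolist()) if lab != -1]
```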

    4 Evaluation

We evaluated FIRES in comparison with CLIQUE and SUBCLU. All tests were run on a Linux workstation featuring a 2.4 GHz CPU and 3.6 GB RAM.

    4.1 Accuracy

To measure the effectiveness of FIRES, SUBCLU, and CLIQUE, we measure the ratio of points found in each cluster/noise to the points that should be in the respective cluster/noise. We do not consider the redundant projections of all subspace clusters generated by the Apriori style of SUBCLU and CLIQUE but only concentrate on the true clusters hidden by the data generator. We apply DBSCAN to generate the base-clusters, using a parameter setting as suggested in [8], and as refinement method, with parameter settings for ε and minPts as proposed in Section 3.4. Figure 2 illustrates the results of FIRES in comparison to SUBCLU and CLIQUE applied to a synthetic dataset containing three clusters of significantly varying dimensionality and density. For each method, the result with the best parameter setting is presented. It can be observed that FIRES clearly outperforms SUBCLU and CLIQUE w.r.t. accuracy. Only FIRES finds the third cluster, having the lowest density in a 10D subspace, with a satisfying precision. Both CLIQUE and SUBCLU miss major parts of this cluster.

In Figure 3 we evaluate the impact of different algorithms for generating the base-clusters on the final accuracy. We compare DBSCAN, k-means, and a CLIQUE-like approach which partitions each dimension into intervals of equal size. An interval is considered as dense if it contains more than minPts points. Neighboring dense cells are linked to form base-clusters. The parameters for each method are optimized to achieve a fair comparison. Again, the data set used contains three clusters in subspaces of different dimensionality and with significantly different densities. Obviously, using DBSCAN as the preclustering method yields the best results. Using k-means, cluster 2 cannot be detected. The grid-based approach has problems with cluster 3, which has the highest subspace dimensionality.

Figure 3. Comparison of preclustering methods: precision per cluster, noise, and overall for DBSCAN, grid-based, and k-means preclustering.

We evaluated the impact of the parameterization by changing one of the three parameters k, μ, and minClu at a time, while the other two are fixed at an optimized value. The results are illustrated in Figure 4. The parameter μ is not critical as long as it is not set too low. If it is set too low, some clusters are still found with high precision but some of the clusters are not found at all. Roughly the same can be said about the parameter k. If k is reduced, the precision drops slightly, whereas precision drops significantly if k is increased. In our experiments, we observed that we achieve good results if μ does not differ significantly from k. The third parameter minClu turned out to be stable throughout the range of tested settings, i.e. the precision of FIRES is very high at any tested minClu value. Accuracy only decreases very slightly when we increase the value for minClu. However, we made the observation that reducing minClu increases the number of generated subspace clusters, because we get a larger number of best-merge-clusters.

We found that DBSCAN with self-adapting parameters is slightly more accurate than SNN (detailed results are not shown due to space limitations). Nevertheless, if the densities of clusters in subspaces of equal dimensionality differ significantly, SNN achieves better results than DBSCAN. However, SNN clustering requires a considerably higher runtime than DBSCAN, and thus we decided to use DBSCAN in our remaining tests. The next experiment compared the effectiveness of the results when using and when leaving out certain aspects of our postprocessing step. We compared the results of using both steps (pruning and refinement), using only one of the two steps, and using none of them. The results (not shown due to space limitations) suggest that applying no postprocessing decreases the accuracy of FIRES considerably. Clearly, the best results are achieved when using both postprocessing steps. The reason for this is that the pruning step removes base-clusters, i.e. dimensions, from the cluster approximation that reduce the quality of the subspace cluster. Without any dimensions that do not belong to the cluster, the refinement finds only those points within the approximation that really belong to the cluster.

    4.2 Scalability

We compared the scalability of FIRES, CLIQUE and SUBCLU on synthetic data sets.

For evaluating the scalability w.r.t. the number of points in the dataset (cf. Figure 5(a)), we use 25D datasets, each containing one 5D subspace cluster. We analyze the time FIRES needs for the generation of the base-clusters and the time needed for the merge step. Obviously, FIRES scales slightly superlinearly w.r.t. the number of points in the dataset, more steeply than CLIQUE but less steeply than SUBCLU. This is due to the use of DBSCAN as the method for preclustering and refinement. The runtime of CLIQUE scales linearly because of its rather simple but efficient grid-based clustering model. However, the computation of subspaces is obviously the bottleneck for the overall runtime. Since FIRES scales better than CLIQUE w.r.t. data dimensionality, on higher-dimensional datasets the number of points may become a less critical parameter. Due to the linear scalability of the merge step of FIRES, one can use a simpler, e.g. grid-based, clustering model rather than DBSCAN to speed up FIRES.

We investigate the scalability w.r.t. the data dimensionality using datasets of 1,000 points, each containing one 5D subspace cluster. As can be seen from Figure 5(b), the runtimes of both CLIQUE and SUBCLU increase clearly superlinearly, whereas the runtime of FIRES increases only linearly. Our experiments show that FIRES is the only method that can efficiently be applied to a dataset with more than 1,000 dimensions.

The impact of the highest dimensionality of a subspace cluster on the runtime is evaluated using datasets of 1,000 50D points. From Figure 5(c) we observe that FIRES again clearly outperforms SUBCLU and CLIQUE.

Last, we evaluate the runtime of FIRES using different preclustering methods. We compare DBSCAN, k-means and the grid-based approach on 10D datasets with one 5D subspace cluster and a varying number of points.


Figure 4. Impact of parameterization on the accuracy of FIRES: average precision as a function of μ (values 1-4), k (values 3-6), and minClu (values 1-4).

Figure 5. Scalability of FIRES, SUBCLU, and CLIQUE: runtime in seconds w.r.t. (a) the number of points, (b) the data dimensionality, and (c) the subspace cluster dimensionality.

The results (not shown due to space limitations) illustrate that the runtime of FIRES using any of these algorithms is still approximately quadratic w.r.t. the number of points in the dataset, but FIRES with k-means as well as with the grid-based approach is clearly faster than with DBSCAN. This experiment shows that the runtime can be decreased significantly by applying simpler but more efficient clustering methods for preclustering and/or postprocessing.

    4.3 Real-World Data

We applied FIRES to two real-world gene expression datasets. The first data set, called Bio1 [15], contains the expression levels of about 4,000 genes measured at 24 different time slots during the yeast mitotic cell cycle. The task is to find clusters of functionally related genes. The second dataset, called Bio2 [9], contains the expression levels of 72 patients suffering from acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). For each patient, approximately 7,100 genes are measured. The task is to find groups of patients with homogeneous phenotypes. The patients are labeled according to their leukemia type. For some of them, information on additional phenotypes such as gender is provided.

Some sample clusters FIRES found on the Bio1 dataset are visualized in Figure 6. Biological criteria for interesting and meaningful clusters are genes that have a similar function, participate in a common pathway, and/or build larger complexes. Based on these criteria, we evaluated the clusters FIRES generates using the public yeast genome database SGD (http://www.yeastgenome.org/). As can be seen, in both datasets FIRES uncovers subspace clusters that fulfill at least one of the biological criteria and thus are biologically meaningful and relevant. On the first dataset, one cluster contains the genes CKB1 and CKA2, both subunits of the protein kinase CK2. In the same cluster, DFR1 and ARO1 are two genes participating in the chorismate pathway. A second cluster contains four genes with the same biological function (YIP1, SED5, BFR2, and SEC21). A third cluster contains three genes that are part of the large ribosomal complex (RPL12B, RPL14A, and RPL16A). CLIQUE and SUBCLU could not reproduce these clusters.

The results of FIRES on the Bio2 dataset were also very interesting. This dataset is problematic since it has only a very small number of points (72) but a very high dimensionality (7,070). In general, the 72 patients may be clustered rather differently, depending on different phenotypes.



Figure 6. Results on Bio1 (ORF, gene, annotation):

Cluster 1: YGL019W (CKB1, subunit of CK2), YOR061W (CKA2, subunit of CK2), YOR236W (DFR1, chorismate pathway), YDR127W (ARO1, chorismate pathway).
Cluster 2: YGR172C (YIP1, ER to Golgi transport), YLR026C (SED5, ER to Golgi transport), YDR299W (BFR2, ER to Golgi transport), YNL287W (SEC21, ER to Golgi transport).
Cluster 3: YDR418W (RPL12B, part of ribosome), YIL133C (RPL16A, part of ribosome), YKL006W (RPL14A, part of ribosome).

FIRES generated a large number of subspace clusters of varying dimensionality that may represent partitionings according to such phenotypes. Unfortunately, most clusters could not be evaluated since further information on the patients is not known. However, three clusters found by FIRES provide very interesting insights. One cluster consists of 16 patients, 15 of which suffer from ALL. In fact, all these ALL patients are marked as B-cell ALL patients; no T-cell ALL patient is in that cluster. A second cluster contained 25 patients, 18 male and only three female; the gender of the remaining four patients is not annotated. A third cluster contained 27 of the 32 ALL patients (of both B-cell and T-cell type). Let us point out that the response time of FIRES on this very high-dimensional dataset was quite low at around 15 seconds. Both SUBCLU and CLIQUE produced memory overflows due to the high number of candidate subspaces.

    5 Conclusions

In this paper, we proposed the new filter-refinement subspace clustering algorithm FIRES, which overcomes two problems of existing subspace clustering methods: inadequate scalability w.r.t. data dimensionality and/or subspace dimensionality, and the use of a global density threshold. FIRES efficiently computes maximum-dimensional cluster approximations from 1D clusters that can be refined to obtain the true clusters. Any clustering notion can be incorporated into FIRES in order to account for clusters of varying density. A thorough experimental evaluation has shown that the effectiveness of FIRES is significantly better than that of well-known algorithms such as SUBCLU and CLIQUE. In addition, FIRES clearly outperforms SUBCLU and CLIQUE in terms of scalability and runtime w.r.t. data dimensionality and subspace dimensionality. We demonstrated the usability of FIRES by finding interesting and biologically meaningful clusters in several different gene expression datasets.

    References

[1] C. C. Aggarwal and C. Procopiuc. Fast Algorithms for Projected Clustering. In Proc. ACM SIGMOD, 1999.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proc. ACM SIGMOD, 1998.
[3] C. Baumgartner, K. Kailing, H.-P. Kriegel, P. Kröger, and C. Plant. Subspace Selection for Clustering High-Dimensional Data. In Proc. ICDM, 2004.
[4] C. Böhm, K. Kailing, H.-P. Kriegel, and P. Kröger. Density Connected Clustering with Local Subspace Preferences. In Proc. ICDM, 2004.
[5] C.-H. Cheng, A.-C. Fu, and Y. Zhang. Entropy-Based Subspace Clustering for Mining Numerical Data. In Proc. ACM SIGKDD, 1999.
[6] M. Dash, K. Choi, P. Scheuermann, and H. Liu. Feature Selection for Clustering - A Filter Solution. In Proc. ICDM, 2002.
[7] L. Ertöz, M. Steinbach, and V. Kumar. Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data. In Proc. SIAM Data Mining, 2003.
[8] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. KDD, 1996.
[9] T. R. Golub et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286:531-537, 1999.
[10] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Academic Press, 2001.
[11] K. Kailing, H.-P. Kriegel, and P. Kröger. Density-Connected Subspace Clustering for High-Dimensional Data. In Proc. SIAM Data Mining, 2004.
[12] H. Nagesh, S. Goil, and A. Choudhary. Adaptive Grids for Clustering Massive Data Sets. In Proc. SIAM Data Mining, 2001.
[13] L. Parsons, E. Haque, and H. Liu. Subspace Clustering for High Dimensional Data: A Review. SIGKDD Explorations, 6(1):90-105, 2004.
[14] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo Algorithm for Fast Projective Clustering. In Proc. ACM SIGMOD, 2002.
[15] P. Spellman et al. Comprehensive Identification of Cell Cycle-Regulated Genes of the Yeast Saccharomyces Cerevisiae by Microarray Hybridization. Molecular Biology of the Cell, 9:3273-3297, 1998.
[16] J. Yang, W. Wang, H. Wang, and P. S. Yu. Delta-Clusters: Capturing Subspace Correlation in a Large Data Set. In Proc. ICDE, 2002.
