
Data Mining: Clustering Approaches & Techniques based on Real-life Data

    24/01/2014

    White Paper

    BSNL CDR Project

    Hrishav Bakul Barua

    &

    Anupam Roy

    Telecom

    [email protected], [email protected]


    Confidentiality Statement

    Confidentiality and Non-Disclosure Notice The information contained in this document is confidential and proprietary to TATA Consultancy Services. This information may not be disclosed, duplicated or used for any other purposes. The information contained in this document may not be released in whole or in part outside TCS for any purpose without the express written permission of TATA Consultancy Services.

    Tata Code of Conduct We, in our dealings, are self-regulated by a Code of Conduct as enshrined in the Tata Code of Conduct. We request your support in helping us adhere to the Code in letter and spirit. We request that any violation or potential violation of the Code by any person be promptly brought to the notice of the Local Ethics Counselor or the Principal Ethics Counselor or the CEO of TCS. All communication received in this regard will be treated and kept as confidential.


Table of Contents

Abstract
About the Authors
1. Data Mining
  1.1 Cluster Analysis
    1.1.1 What does a good clustering technique/algorithm demand?
    1.1.2 A Categorization of Major Clustering Approaches
    1.1.3 Hierarchical Method
    1.1.4 Partitioning Method
    1.1.5 Density-Based Method
    1.1.6 Grid-Based Methods
    1.1.7 Constraint-Based Clustering
    1.1.8 Clustering Over Multi-Density Data Space
    1.1.9 Clustering Over Variable-Density Space
    1.1.10 Clustering Higher Dimensional Data
    1.1.11 Massive Data Clustering Using Distributed and Parallel Approach
    1.1.12 How Are Clustering Algorithms Compared?
    1.1.13 Cluster Validation
2. Conclusion
3. Acknowledgements
4. References


    Abstract

Finding meaningful patterns and useful trends in large datasets has attracted considerable interest recently. One of the most widely studied problems in this area is the identification and formation of clusters, or densely populated regions, in a dataset. Cluster analysis divides data into meaningful or useful groups called clusters. The objective of this paper is to present a clear analysis and survey of the various existing clustering approaches and techniques, along with some of the famous and pioneering algorithms applied under these approaches. The paper then highlights the best of these techniques and explains why they stand out.

In this paper, the technique of data clustering has been examined, which is a particular kind of data mining problem. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters [1]. Given a large set of data points (that is, data objects), the data space is usually not uniformly occupied. Data clustering identifies the sparse and the crowded places and hence discovers the overall distribution patterns of the dataset. Besides, the derived clusters can be visualised more efficiently and effectively than the original dataset. Mining knowledge from large amounts of spatial data is known as spatial data mining. It has become a highly demanding field because huge amounts of spatial data have been collected in various applications, ranging from geospatial and industrial data to biomedical knowledge. The amount of spatial data being collected is increasing exponentially and has far exceeded our ability to analyse it. Recently, clustering has been recognised as a primary data mining method for knowledge discovery in spatial databases. The development of clustering algorithms has received a lot of attention in the last few years, and new clustering algorithms continue to be proposed. A variety of algorithms have recently emerged that meet the requirements of data mining using cluster analysis and have been successfully applied to real-life data mining problems.

About the Authors

Hrishav Bakul Barua joined TCS on September 10, 2012. A student of Sikkim Manipal University (SMU), he has published his research work on data mining clustering techniques in the International Journal of Computer Applications (FCS), New York, USA: http://www.ijcaonline.org/archives/volume58/number2/9252-3418

Anupam Roy has four years of project experience in TCS. He is currently working in the BSNL CDR Project and pursuing an ME in Software Engineering from Jadavpur University, Kolkata. He has worked on attacks on distributed databases and intrusion detection/prevention systems.


1. Data Mining

Data mining refers to extracting or mining knowledge from large volumes of data. Many other terms carry a similar or slightly different meaning, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology and data dredging. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD.

1.1 Cluster Analysis

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters [1]. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with many attributes of different types, which imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and have been successfully applied to real-life data mining problems; they are the subject of this survey. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, business management, archaeology, insurance, libraries and many others. In recent years, due to the rapid increase of online documents, text clustering has become important.

Clustering quality is judged with a distance (similarity or dissimilarity) function: a good clustering maximises the inter-cluster distance and minimises the intra-cluster distance.

Figure 1: Formation of clusters
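As an illustration of this quality criterion (not part of the original figure), the following Python sketch computes a simple intra-cluster and an inter-cluster distance for a labeled point set; the dataset, the Euclidean metric and the centroid-based summaries are illustrative choices:

```python
# Minimal sketch: measuring clustering quality with Euclidean distances.
# Assumes points is an (N, d) NumPy array and labels assigns each point a cluster id.
import numpy as np

def intra_cluster_distance(points, labels):
    """Mean distance of points to their own cluster centroid (lower is better)."""
    total, n = 0.0, len(points)
    for c in np.unique(labels):
        members = points[labels == c]
        centroid = members.mean(axis=0)
        total += np.linalg.norm(members - centroid, axis=1).sum()
    return total / n

def inter_cluster_distance(points, labels):
    """Minimum pairwise distance between cluster centroids (higher is better)."""
    centroids = np.array([points[labels == c].mean(axis=0) for c in np.unique(labels)])
    dists = [np.linalg.norm(a - b) for i, a in enumerate(centroids)
             for b in centroids[i + 1:]]
    return min(dists)

# Illustrative data: two well-separated blobs.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(intra_cluster_distance(points, labels), inter_cluster_distance(points, labels))
```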


1.1.1 What does a good clustering technique/algorithm demand?

A good clustering technique/algorithm demands the following:

    Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or a mixture of these data types.

    Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.

    Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.

    Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.

Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and instead must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.

    High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high dimensional space is challenging, especially considering that such data can be sparse and highly skewed.

Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.

    Interpretability and usability: Users expect clustering results to be interpretable, comprehensible and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.


    Time Complexity: The time required for a particular clustering algorithm to run/execute and produce the output.

Labeling or assignment: hard or strict (each data object is in one and only one cluster) vs. soft or fuzzy (each data object has a probability of being in each cluster); a small sketch contrasting the two follows.
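A minimal sketch of the two labeling styles, assuming fixed centroids and an FCM-style membership formula; the points, centroids and fuzzifier m are illustrative, not from the paper:

```python
# Sketch contrasting hard and fuzzy (soft) assignment for fixed centroids.
import numpy as np

def hard_assign(points, centroids):
    """Each point belongs to exactly one cluster: the nearest centroid."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def fuzzy_assign(points, centroids, m=2.0):
    """FCM-style memberships: each point gets a degree of membership per cluster."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(hard_assign(points, centroids))   # one cluster id per point
print(fuzzy_assign(points, centroids))  # soft memberships per cluster
```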

    1.1.2 A Categorization of Major Clustering Approaches

Hierarchical Method
Partitioning Method
Density-Based Methods
Grid-Based Methods
Methods Based on Co-Occurrence of Categorical Data
Constraint-Based Clustering
Clustering Algorithms Used in Machine Learning
Scalable Clustering Algorithms
Model-Based Methods
Algorithms for High-Dimensional Data

    1.1.3 Hierarchical Method

    Hierarchical clustering builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram as represented in the following figure:

Figure 2: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}.

    Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows exploring data on different levels of granularity. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down). An agglomerative clustering starts with one-point (singleton) clusters and recursively merges two or more most appropriate clusters. A divisive clustering starts with one cluster of all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved.


Figure 3: Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}.

    Advantages of hierarchical clustering include:

Embedded flexibility regarding the level of granularity
Ease of handling any form of similarity or distance
Consequently, applicability to any attribute type

    Disadvantages of hierarchical clustering are related to:

Vagueness of termination criteria
The fact that most hierarchical algorithms do not revisit once-constructed (intermediate) clusters with the purpose of their improvement

Hierarchical clustering based on linkage metrics results in clusters of proper (convex) shapes. Active contemporary efforts to build cluster systems that incorporate our intuitive concept of clusters as connected components of arbitrary shape, including the algorithms CURE and CHAMELEON [13], are surveyed in the sub-section Hierarchical Clusters of Arbitrary Shapes. Divisive techniques based on binary taxonomies are presented in the sub-section Binary Divisive Partitioning. The sub-section Other Developments contains information related to incremental learning, model-based clustering and cluster refinement. One of the most striking developments in hierarchical clustering is the algorithm BIRCH [8]. Data squashing, used by BIRCH to achieve scalability, has independent importance. Hierarchical clustering of large datasets can be very sub-optimal, even if the data fits in memory; compressing the data may improve the performance of hierarchical algorithms.
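To make the agglomerative process concrete, here is a minimal sketch using SciPy's hierarchical clustering; the five 2D points standing in for {a, b, c, d, e}, the single-linkage metric and the cut at k = 2 are illustrative choices, not prescriptions from the paper:

```python
# A minimal agglomerative (bottom-up) run using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0],   # a
                   [0.2, 0.1],   # b
                   [4.0, 4.0],   # c
                   [4.1, 3.9],   # d
                   [4.0, 4.2]])  # e

# 'single' linkage merges the two clusters with the closest pair of points;
# the matrix Z encodes the dendrogram (which clusters merged, at what distance).
Z = linkage(points, method='single')

# Cut the dendrogram at k = 2 clusters (a common stopping criterion).
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 2 2 2]: {a, b} and {c, d, e}
```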

    1.1.4 Partitioning Method

In this section we survey data partitioning algorithms, which divide data into several subsets. Because checking all possible subset systems is computationally infeasible, certain greedy heuristics are used in the form of iterative optimization. Specifically, this means different relocation schemes that iteratively reassign points between the k clusters. Unlike traditional hierarchical methods, in which clusters are not revisited after being constructed, relocation algorithms gradually improve clusters. With appropriate data, this results in high-quality clusters. One approach to data partitioning is to take a conceptual point of view that identifies the cluster with a certain model whose unknown parameters have to be found. More specifically, probabilistic models assume that the data comes from a mixture of several populations whose distributions and priors we want to find. Corresponding algorithms are described in the sub-section Probabilistic Clustering. One clear advantage of probabilistic methods is the interpretability of the constructed clusters. Having a concise cluster representation also allows inexpensive computation of intra-cluster measures of fit that give rise to a global objective function.


Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Most applications adopt one of a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.
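A bare-bones sketch of the k-means relocation scheme described above; the random initialisation and fixed iteration count are illustrative simplifications (production implementations add convergence tests and smarter seeding):

```python
# Minimal k-means (Lloyd's algorithm): each cluster is represented by the mean
# of its members, and points are iteratively reassigned to the nearest mean.
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
labels, centroids = kmeans(data, k=2)
print(centroids)  # should land near (0, 0) and (8, 8)
```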

1.1.5 Density-Based Method

Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing a given cluster as long as the density (number of objects or data points) in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. The density-based approach is famous for its capability of discovering arbitrarily shaped clusters of good quality even in noisy datasets [2]. Density-based algorithms also have good scalability. Figure 4 illustrates some cluster shapes that present a problem for partitioning relocation clustering (e.g., k-means) but are handled properly by density-based algorithms.

Figure 4: Irregular shapes difficult for k-means

There are two major approaches for density-based methods. The first approach pins density to a training data point and is reviewed in the sub-section Density-Based Connectivity; representative algorithms include DBSCAN, GDBSCAN, OPTICS, and DBCLASD. The second approach pins density to a point in the attribute space and is explained in the sub-section Density Functions; it includes the algorithm DENCLUE. DBSCAN [2] and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions.
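As a concrete illustration of density-based connectivity, the following sketch runs DBSCAN from scikit-learn on a ring-plus-blob dataset that a spherical method such as k-means would struggle with; the dataset and the eps/min_samples values are illustrative:

```python
# Density-based clustering with DBSCAN: a cluster grows while the
# eps-neighbourhood of its points contains at least min_samples points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 300)
ring = np.c_[np.cos(angles), np.sin(angles)] * 5  # non-spherical cluster
blob = rng.normal(0, 0.3, (100, 2))               # dense inner cluster
X = np.vstack([ring, blob])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks points treated as noise
```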

1.1.6 Grid-Based Methods

The grid-based clustering approach uses a multi-resolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed [3]. The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension of the quantized space.


There is a high probability that all data points falling into the same grid cell belong to the same cluster. Therefore, all data points belonging to the same cell can be aggregated and treated as one object. It is due to this nature that grid-based clustering algorithms are computationally efficient: their cost depends on the number of cells in each dimension of the quantized space. The approach has further advantages: the total number of grid cells is independent of the number of data points, and the result is insensitive to the order of the input data points.
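A minimal sketch of this quantisation step; the cell size and the density threshold that seeds clusters are illustrative parameters:

```python
# Quantise points into cells and aggregate per-cell counts, so later work
# depends on the number of occupied cells, not on the number of points.
import numpy as np
from collections import Counter

def grid_summary(points, cell_size=1.0):
    cells = np.floor(points / cell_size).astype(int)  # map each point to its cell
    return Counter(map(tuple, cells))                 # points aggregated per cell

rng = np.random.default_rng(0)
points = rng.normal(0, 2, (10_000, 2))
summary = grid_summary(points, cell_size=1.0)
# Dense cells (count above a threshold) would seed clusters; the rest is noise.
dense = {cell for cell, n in summary.items() if n >= 50}
print(len(summary), len(dense))
```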

Some of the popular grid-based clustering techniques are STING [4], WaveCluster [5], CLIQUE [6], pMAFIA [7] and so on. CLIQUE [6] is a hybrid clustering method that combines the ideas of both density-based and grid-based approaches. pMAFIA [7] is an optimized and improved version of CLIQUE. It uses the concept of adaptive grids for detecting clusters, and it scales exponentially in the dimension of the highest-dimensional cluster in the data set. The algorithm STING (STatistical INformation Grid-based method) [4] works with numerical attributes (spatial data) and is designed to facilitate region-oriented queries. In doing so, STING constructs data summaries in a way similar to BIRCH [8]. It, however, assembles statistics in a hierarchical tree of nodes that are grid cells. Figure 5 presents the proliferation of cells in 2-dimensional space and the construction of the corresponding tree. Each cell has four (default) children and stores a point count and attribute-dependent measures: mean, standard deviation, minimum, maximum, and distribution type. Measures are accumulated starting from bottom-level cells and are further propagated to higher-level cells (e.g., a cell's minimum is the minimum among its children's minimums). Only the distribution type presents a problem: a χ²-test is used after bottom-cell distribution types are handpicked. When the cell tree is constructed (in O(N) time), certain cells are identified and connected into clusters similarly to DBSCAN. If the number of leaves is K, the cluster-construction phase depends on K and not on N. This algorithm has a simple structure suitable for parallelization and allows for multi-resolution, though defining an appropriate granularity is not straightforward. STING has been further enhanced into the algorithm STING+ [9], which targets dynamically evolving spatial databases and uses a hierarchical cell organization similar to its predecessor. In addition, STING+ enables active data mining. To do so, it supports user-defined trigger conditions (e.g., there is a region where at least 10 cellular phones are in use per square mile with a total area of at least 10 square miles, or usage drops by 20% in a described region). The related measures, sub-triggers, are stored and updated over the hierarchical cell tree. They are suspended until the trigger fires with a user-defined action. Four types of conditions are supported: absolute and relative conditions on regions (a set of adjacent cells), and absolute and relative conditions on certain attributes.

Figure 5: Cell generation and tree construction in STING
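The bottom-up accumulation of cell statistics can be sketched as follows; the two-level 4×4/2×2 grid and the stored measures are an illustrative miniature of STING's tree, not the full algorithm:

```python
# Sketch of STING-style bottom-up statistics propagation: each parent cell
# aggregates count/mean/min/max from its four children.
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 4, (1_000, 2))

# Bottom level: a 4x4 grid of cells with per-cell statistics.
idx = np.floor(points).astype(int)  # cell coordinates 0..3 in each dimension
bottom = {}
for (i, j) in {tuple(c) for c in idx}:
    vals = points[(idx[:, 0] == i) & (idx[:, 1] == j)]
    bottom[(i, j)] = dict(count=len(vals), mean=vals.mean(axis=0),
                          lo=vals.min(axis=0), hi=vals.max(axis=0))

# Parent level: 2x2 cells, each aggregating four children (e.g. the parent's
# minimum is the minimum over its children's minimums).
parents = {}
for (pi, pj) in {(i // 2, j // 2) for (i, j) in bottom}:
    kids = [s for (i, j), s in bottom.items() if (i // 2, j // 2) == (pi, pj)]
    n = sum(k['count'] for k in kids)
    parents[(pi, pj)] = dict(
        count=n,
        mean=sum(k['mean'] * k['count'] for k in kids) / n,
        lo=np.min([k['lo'] for k in kids], axis=0),
        hi=np.max([k['hi'] for k in kids], axis=0))
print(parents[(0, 0)]['count'])
```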

1.1.7 Constraint-Based Clustering

In real-world applications, customers are rarely interested in unconstrained solutions. Clusters are frequently subjected to some problem-specific limitations that make them suitable for particular business actions. Building such conditioned cluster partitions is the subject of active research; for example, see the survey [10].


The framework for constraint-based clustering is introduced in [11]. The taxonomy of clustering constraints includes constraints on individual objects (for example, customers who recently made a purchase) and parameter constraints (like the number of clusters) that can be addressed through preprocessing or external cluster parameters. The taxonomy also includes constraints on individual clusters that can be described in terms of bounds on aggregate functions (min, avg, and so on) over each cluster. Another approach to building balanced clusters is to convert the task into a graph partitioning problem [12]. An important constraint-based clustering application is to cluster 2D spatial data in the presence of obstacles. Instead of the regular Euclidean distance, the length of the shortest path between two points can be used as an obstacle distance. The Clustering with Obstructed Distance (COD) algorithm [11] deals with this problem. It is best illustrated by Figure 6, which shows the difference in constructing three clusters in the absence of an obstacle (left) and in the presence of a river with a bridge (right).

Figure 6: Obstacle (river with the bridge) makes a difference
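A toy sketch of an obstructed distance in the spirit of COD; the grid map, the 4-connected moves and the breadth-first shortest path are illustrative simplifications, not the actual COD algorithm:

```python
# The distance between two points is the shortest grid path avoiding obstacle
# cells (the "river"), except where a "bridge" cell crosses it.
from collections import deque

# 0 = free cell, 1 = obstacle; column 3 is a river with a bridge at row 2.
grid = [[0, 0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],   # bridge
        [0, 0, 0, 1, 0, 0]]

def obstacle_distance(start, goal):
    """Breadth-first search: length of the shortest 4-connected free path."""
    rows, cols = len(grid), len(grid[0])
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), d = queue.popleft()
        if (r, c) == goal:
            return d
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and \
               grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), d + 1))
    return float('inf')

# Straight-line neighbours, but the river forces a detour through the bridge.
print(obstacle_distance((0, 2), (0, 4)))  # 6 rather than 2
```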

1.1.8 Clustering Over Multi-Density Data Space

One of the main applications of clustering spatial databases is to find clusters of spatial objects which are close to each other. Most traditional clustering algorithms try to discover clusters of arbitrary densities, shapes and sizes, but very few show acceptable efficiency when clustering multi-density datasets. This is partly because small clusters with few points in a local area can be missed by a global density threshold. TDCT [16] is a triangle-density clustering technique for large multi-density as well as embedded clusters.

1.1.9 Clustering Over Variable-Density Space

Most real-life datasets have a skewed distribution and may also contain nested cluster structures, the discovery of which is very difficult. Therefore, we discuss two density-based approaches, OPTICS [14] and EnDBSCAN [15], which attempt to handle datasets with variable density. OPTICS can identify embedded clusters over varying-density space. However, its execution-time performance degrades on large datasets with variable-density space, and it cannot detect nested cluster structures successfully over massive datasets. In EnDBSCAN [15], an attempt is made to detect embedded or nested clusters using an integrated approach. Based on our experimental analysis on very large synthetic datasets, it has been observed that EnDBSCAN can detect embedded clusters; however, with the increase in the volume of data, its performance also degrades. EnDBSCAN is highly sensitive to the parameters MinPts and ε; in addition to these parameters, OPTICS requires an additional parameter, ε′.

1.1.10 Clustering Higher Dimensional Data

Most of the clustering methods stated in section 1.1 are implemented on 2D spatial datasets, but clustering of 3D spatial datasets is in high demand: space research, geo-spatial data and 3D object detection all require an efficient clustering algorithm. CLIQUE is a dimension-growth subspace clustering method [6]. Here, the process starts in a single-dimensional subspace and extends to higher-dimensional ones. CLIQUE is a combination of density-based and grid-based clustering methods. In it, the data space is partitioned into non-overlapping rectangular units, and the dense units among them are identified. 3D-CATD [17] is a clustering technique for massive numeric three-dimensional (3D) datasets. The algorithm is based on the density approach and can detect global as well as embedded clusters; experimental results are reported to establish its superiority on several synthetic data sets. We have only considered three-dimensional objects, but many real-life problems deal with dimensionalities higher than 2D/3D datasets.

1.1.11 Massive Data Clustering Using Distributed and Parallel Approach

Parallel and distributed computing is expected to relieve current clustering methods from the sequential bottleneck, providing the ability to scale to massive datasets and improving response time. Such algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. In [18], a Density Based Distributed Clustering (DBDC) [21] algorithm was presented where the data are first clustered locally at different sites, independently of each other. The aggregated information about locally created clusters is extracted and transmitted to a central site, where a global clustering is performed based on the local representatives; the result is sent back to the local sites. The local sites update their clustering based on the global model, that is, they merge two local clusters into one or assign local noise to global clusters. For both the local and global clustering, density-based algorithms are used. This approach is scalable to large datasets and gives clusters of good quality. GDCT [19], [20] is a distributed algorithm for intrinsic cluster detection over large spatial data; a sketch of the general local-then-global pattern follows.
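A sketch of that local-then-global pattern, using DBSCAN for both stages and cluster centroids as the local representatives; all of these choices are illustrative simplifications of the scheme in [18]:

```python
# DBDC idea: cluster locally at each site, ship only compact local
# representatives to a central site, and cluster those globally.
import numpy as np
from sklearn.cluster import DBSCAN

def local_representatives(site_points, eps=0.5, min_samples=5):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(site_points)
    return np.array([site_points[labels == c].mean(axis=0)
                     for c in set(labels) if c != -1])  # drop local noise

rng = np.random.default_rng(0)
sites = [np.vstack([rng.normal(0, 0.3, (80, 2)), rng.normal(6, 0.3, (80, 2))])
         for _ in range(3)]  # three sites observing the same two regions

# Each site sends only its representatives (a few points) to the central site.
reps = np.vstack([local_representatives(s) for s in sites])

# Global clustering over representatives maps local clusters that describe the
# same region onto one global cluster; the result is sent back to the sites.
global_labels = DBSCAN(eps=1.0, min_samples=1).fit_predict(reps)
print(reps.round(1), global_labels)
```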

1.1.12 How Are Clustering Algorithms Compared?

There are many factors on the basis of which clustering algorithms are compared. A few of them are listed as follows:

The size of the datasets
The number of clusters
The type of datasets
The type of software used for implementation
The time complexity of execution
The number of user parameters
Noise-handling accuracy

1.1.13 Cluster Validation

A large number of clustering algorithms have been developed to deal with specific applications. Several questions arise, such as:

Which clustering algorithm is best suited to the application at hand?
How many clusters are there in the studied data?
Is there a better clustering scheme?

These questions are related to evaluating the quality of clustering results, that is, cluster validation. Cluster validation is a procedure for assessing the quality of clustering results and finding a fitting clustering strategy for a specific application. It aims at finding the optimal cluster scheme and interpreting the cluster patterns. Cluster validation is an indispensable part of cluster analysis, because no clustering algorithm can guarantee the discovery of genuine clusters from real datasets, and because different clustering algorithms often impose different cluster structures on a data set even if there is no cluster structure present in it. Cluster validation is needed in data mining to solve the following problems:


To measure a partition of a real data set generated by a clustering algorithm
To identify the genuine clusters from the partition
To interpret the clusters

    Generally speaking, cluster validation approaches are classified into the following three categories:

Internal approaches
Relative approaches
External approaches

The cluster validation methods are discussed as follows:

1.1.13.1 Internal Approaches

Internal cluster validation evaluates the quality of clusters using statistics devised to capture the quality of the induced clusters from the available data objects only. In other words, internal cluster validation excludes any information beyond the clustering data, and focuses on assessing cluster quality based on the clustering data themselves. Statistical methods of quality assessment are employed as internal criteria; for example, the root-mean-square standard deviation (RMSSTD) is used for the compactness of clusters, R-squared (RS) for the dissimilarity between clusters, and S_Dbw for a compound evaluation of compactness and dissimilarity [1]. The formulas of RMSSTD, RS and S_Dbw are shown below.

Formula 1 (RMSSTD):

$$\mathrm{RMSSTD} = \left[\frac{\sum_{i=1}^{n_c}\sum_{j=1}^{d}\sum_{k=1}^{n_{ij}} \left(x_k - \bar{x}_j\right)^2}{\sum_{i=1}^{n_c}\sum_{j=1}^{d} \left(n_{ij} - 1\right)}\right]^{1/2}$$

where $\bar{x}_j$ is the expected value in the jth dimension; $n_{ij}$ is the number of elements in the ith cluster, jth dimension; $n_j$ is the number of elements in the jth dimension of the whole data set; $d$ is the number of dimensions; and $n_c$ is the number of clusters.

Formula 2 (RS):

$$RS = \frac{SS_t - SS_w}{SS_t}$$

where $SS_t$ is the total sum of squared distances of all points from the mean of the data set, and $SS_w$ is the sum of squared distances of points from their respective cluster means.
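A short sketch computing RMSSTD and RS for a labeled dataset under the sum-of-squares definitions given above; the data and labels are illustrative:

```python
import numpy as np

def rmsstd(points, labels):
    num, den = 0.0, 0.0
    for c in np.unique(labels):
        m = points[labels == c]
        num += ((m - m.mean(axis=0)) ** 2).sum()  # within-cluster squared error
        den += (len(m) - 1) * points.shape[1]     # pooled degrees of freedom
    return np.sqrt(num / den)

def rs(points, labels):
    ss_t = ((points - points.mean(axis=0)) ** 2).sum()  # total sum of squares
    ss_w = sum(((points[labels == c] - points[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))               # within-cluster part
    return (ss_t - ss_w) / ss_t  # near 1: compact, well-separated clusters

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.4, (60, 2)), rng.normal(5, 0.4, (60, 2))])
labels = np.array([0] * 60 + [1] * 60)
print(rmsstd(points, labels), rs(points, labels))
```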


The formula of S_Dbw is given as:

$$S\_Dbw = \mathrm{Scat}(c) + \mathrm{Dens\_bw}(c) \qquad \text{(Formula 3)}$$

where Scat(c) is the average scattering within the c clusters:

$$\mathrm{Scat}(c) = \frac{1}{c}\sum_{i=1}^{c}\frac{\lVert \sigma(v_i)\rVert}{\lVert \sigma(S)\rVert} \qquad \text{(Formula 4)}$$

The value of Scat(c) is the degree to which the data points are scattered within clusters; it reflects the compactness of clusters. The term $\sigma(S)$ is the variance of the data set, and the term $\sigma(v_i)$ is the variance of cluster $c_i$. Dens_bw(c) indicates the average number of points between the c clusters (that is, an indication of inter-cluster density) in relation to the density within clusters:

$$\mathrm{Dens\_bw}(c) = \frac{1}{c(c-1)}\sum_{i=1}^{c}\sum_{\substack{j=1 \\ j\neq i}}^{c}\frac{\mathrm{density}(u_{ij})}{\max\{\mathrm{density}(v_i), \mathrm{density}(v_j)\}} \qquad \text{(Formula 5)}$$

where $u_{ij}$ is the middle point of the line segment between the centers of clusters $v_i$ and $v_j$. The density function of a point is defined as the number of points around that point within a given radius.

1.1.13.2 Relative Approaches

Relative assessment compares two structures and measures their relative merit. The idea is to run the clustering algorithm for a range of parameter values (for example, for each possible number of clusters) and identify the clustering scheme that best fits the dataset; that is, the clustering results are assessed by applying an algorithm with different parameters on a data set and finding the optimal solution. In practice, relative criteria methods also use RMSSTD, RS and S_Dbw to find the best cluster scheme, in terms of compactness and dissimilarity, among all the clustering results. Relative cluster validity is also called cluster stability; recent work on relative cluster validity is presented in the literature.


1.1.13.3 External Approaches

The results of a clustering algorithm are evaluated based on a pre-specified structure, which reflects the user's intuition about the clustering structure of the data set. As a necessary post-processing step, external cluster validation is a procedure of hypothesis testing: given a set of class labels produced by a cluster scheme, it is compared with the clustering results obtained by applying the same cluster scheme to the other partitions of a database, as shown in Figure 7.

Figure 7: External criteria based validation

External cluster validation is based on the assumption that an understanding of the output of the clustering algorithm can be achieved by finding a resemblance of the clusters with existing classes. Statistical methods for quality assessment employed in external cluster validation, such as the Rand statistic, the Jaccard coefficient, the Folkes and Mallows index, Hubert's Γ statistic and normalized Γ statistic, and the Monte Carlo method, measure the similarity between the a priori modeled partitions and the clustering results of a dataset.
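A small sketch of pair-counting external validation, computing the Rand statistic and the Jaccard coefficient against illustrative ground-truth labels:

```python
# Compare a clustering against pre-specified class labels by counting
# how pairs of objects are treated by the two partitions.
from itertools import combinations

def pair_counts(classes, clusters):
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:       ss += 1  # agree: together
        elif same_class and not same_cluster: sd += 1
        elif not same_class and same_cluster: ds += 1
        else:                                 dd += 1  # agree: apart
    return ss, sd, ds, dd

classes  = [0, 0, 0, 1, 1, 1]   # "ground truth" partition
clusters = [0, 0, 1, 1, 1, 1]   # clustering result to validate

ss, sd, ds, dd = pair_counts(classes, clusters)
rand    = (ss + dd) / (ss + sd + ds + dd)  # Rand statistic
jaccard = ss / (ss + sd + ds)              # Jaccard coefficient
print(rand, jaccard)
```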

Based on our selected survey and experimental analysis, it has been observed that:

The density-based approach is most suitable for quality cluster detection over massive datasets in 2D, 3D or higher dimensions.

The grid-based approach is suitable for fast processing of large datasets in 2D, 3D or higher dimensions.

Almost all clustering algorithms require input parameters, which are very difficult to determine, especially for real-world data sets containing high-dimensional objects. Moreover, the algorithms are highly sensitive to those parameters.

The distribution of most real-life datasets is skewed, so handling such datasets for qualitative cluster detection based on a single global input parameter seems impractical.

Only some of the techniques falling under density or density-grid hybrid approaches (TDCT, GDCT, DGCL, etc.) can handle multi-density datasets as well as multiple intrinsic or nested clusters over massive datasets qualitatively.

Only a few of the techniques (especially under the grid-based approach) can handle higher-dimensional datasets.

Algorithms under the density-based and grid-based approaches employ fewer user-defined parameters.

The density- and grid-based approaches can handle the single-linkage problem well and can detect multi-density as well as embedded clusters.


A tabular comparison of various pioneering clustering algorithms under various approaches is represented as follows:

Table 1: Clustering Algorithms

| Approach | Sl. No. | Algorithm | No. of Parameters | Optimized for | Structure | Multi-Density Clusters | Embedded Clusters | Complexity | Noise Handling |
|---|---|---|---|---|---|---|---|---|---|
| Partitioning | 1 | K-means | No. of clusters | Separated clusters | Spherical | No | No | O(tkN) | No |
| Partitioning | 2 | K-medoids | No. of clusters | Separated clusters, large-valued objects | Spherical | No | No | O(k(N−k)²) | No |
| Partitioning | 3 | K-modes | No. of clusters | Separated clusters, large datasets | Spherical | No | No | O(tk(N−k)²) | No |
| Partitioning | 4 | FCM (Fuzzy C-Means) | No. of clusters | Separated clusters | Non-convex shapes | No | No | O(N) | No |
| Partitioning | 5 | PAM (Partitioning Around Medoids) | No. of clusters | Separated clusters, large datasets | Spherical | No | No | O(tk(N−k)²) | No |
| Partitioning | 6 | CLARA (Clustering LARge Applications) | No. of clusters | Relatively large datasets | Spherical | No | No | O(ks² + k(N−k)) | No |
| Partitioning | 7 | CLARANS (Clustering Algorithm based on RANdomized Search) | No. of clusters, max. no. of neighbors | Better than PAM & CLARA | Spherical | No | No | O(kN²) | No |
| Hierarchical | 1 | BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) | Branching factor, diameter, threshold | Large data | Spherical | No | No | O(N) | Yes |
| Hierarchical | 2 | CURE (Clustering Using REpresentatives) | No. of clusters, no. of representatives | Any-shaped large data | Arbitrary | No | No | O(N² log N) | Yes |
| Hierarchical | 3 | ROCK (RObust Clustering using linKs) | No. of clusters | Small noisy data | Arbitrary | No | No | O(N² + Nm_m m_a + N² log N) | Yes |
| Hierarchical | 4 | CHAMELEON | 3 (k-nearest neighbors, MIN-SIZE, c) | Small datasets | Arbitrary | Yes | No | O(N²) | Yes |
| Density-based | 1 | DBSCAN (Density-Based Spatial Clustering of Applications with Noise) | 2 (MinPts, ε) | Large datasets | Arbitrary | No | No | O(N log N) using R*-tree | Yes |
| Density-based | 2 | OPTICS (Ordering Points To Identify the Clustering Structure) | 3 (MinPts, ε, ε′) | Large datasets | Arbitrary | Yes | Yes | O(N log N) using R*-tree | Yes |
| Density-based | 3 | DENCLUE | 2 (MinPts, …) | Large datasets | Arbitrary | No | No | O(N log N) using R*-tree | Yes |
| Density-based | 4 | TDCT (Triangle-Density Clustering Technique) | 2 | Large spatial datasets | Arbitrary | Yes | Yes | O(n_c²·m·N) | Yes |
| Density-based | 5 | 3D-CATD (3-Dimensional Clustering Algorithm using Tetrahedron Density) | 2 | Large datasets, 3D datasets | Arbitrary | Yes | Yes | O(n_c·m·N) | Yes |
| Grid-based | 1 | WaveCluster | No. of cells per dimension, no. of applications of transform | Any shape, large data | Any | Yes | No | O(N) | Yes |
| Grid-based | 2 | STING | No. of cells in lowest level, no. of objects in cell | Large spatial datasets | Vertical and horizontal boundaries | No | No | O(N) | Yes |
| Grid-based | 3 | CLIQUE | Size of the grid, min. no. of points per grid cell | High-dimensional, large datasets | Arbitrary | No | No | O(N) | Yes |
| Grid-based | 4 | MAFIA | Size of the grid, min. no. of points per grid cell | High-dimensional, large datasets | Arbitrary | No | No | O(ckl) | Yes |
| Grid-density hybrid | 1 | GDCT (Grid-Density Clustering Technique) | 2 (n, …) | Large datasets, 2D datasets | Arbitrary | Yes | Yes | O(N/k + t) | Yes |
| Grid-density hybrid | 2 | GDCT using distributed computing | 2 (n, …) | Large datasets, 2D datasets | Arbitrary | Yes | Yes | O(N) | Yes |
| Grid-density hybrid | 3 | DisClus (Distributed Clustering) | 2 (n, …) | High-resolution multi-spectral satellite datasets | Arbitrary | Yes | Yes | O(N) | Yes |
| Graph-based | 1 | AUTOCLUST | Nil | Massive data | Arbitrary | No | No | O(N log N) | Yes |


2. Conclusion

Clustering lies at the heart of data analysis and data mining applications. The ability to discover highly correlated regions of objects when their number becomes very large is highly desirable, as data sets grow and their properties and data interrelationships change. Every research paper that presents a new clustering technique shows its superiority to other techniques, and it is hard to judge how well the technique will actually work. In this paper we described the process of clustering from the data mining point of view. We gave the properties of a good clustering technique and the methods used to find meaningful partitionings. We have also presented a selected survey of various clustering approaches and the pioneering algorithms of these approaches. From the survey we can conclude that density-based and grid-based clustering approaches can produce optimum solutions in clustering. Density-based clustering techniques can find clusters of any shape and size in large datasets, with good noise handling and fewer parameters. Grid-based techniques can find clusters with the least time complexity, as they can process datasets very fast. So, the density-grid hybrid clustering approach can be one of the best solutions for any kind of clustering problem; GDCT, DGCL and DisClus, which fall in this category, are some of the best algorithms in the arena. The clusters obtained from the techniques discussed can further be refined by smoothing the cluster boundaries. This can be performed by employing membership functions and fuzzy logic on the boundary data points to find the probability and membership of these points with respect to the clusters, and hence predicting the exact cluster where each point belongs.

3. Acknowledgements

We would sincerely like to thank Mr. Sarbeswar Das, Project Manager, BSNL CDR Project, for his encouragement to write this paper and also for sharing his experience and expertise.

Hrishav & Anupam


4. References

[1] J. Han and M. Kamber (2004), Data Mining: Concepts and Techniques. India: Morgan Kaufmann Publishers.
[2] M. Ester, H. P. Kriegel, J. Sander and X. Xu (1996), "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, pp. 226-231.
[3] C. Hsu and M. Chen (2004), "Subspace Clustering of High Dimensional Spatial Data with Noises," PAKDD, pp. 31-40.
[4] W. Wang, J. Yang and R. R. Muntz (1997), "STING: A Statistical Information Grid Approach to Spatial Data Mining," in Proc. 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, Morgan Kaufmann Publishers, pp. 186-195.
[5] G. Sheikholeslami, S. Chatterjee and A. Zhang (1998), "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases," in SIGMOD '98, Seattle.
[6] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan (1998), "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," in SIGMOD Record, ACM Special Interest Group on Management of Data, pp. 94-105.
[7] H. S. Nagesh, S. Goil and A. N. Choudhary (2000), "A Scalable Parallel Subspace Clustering Algorithm for Massive Data Sets," in Proc. International Conference on Parallel Processing, p. 477.
[8] T. Zhang, R. Ramakrishnan and M. Livny (1996), "BIRCH: An Efficient Data Clustering Method for Very Large Databases," in Proc. 1996 ACM SIGMOD International Conference on Management of Data, pp. 103-114, ACM, New York, NY, USA.
[9] W. Wang, J. Yang and R. R. Muntz (1999), "STING+: An Approach to Active Spatial Data Mining," in Proc. 15th ICDE, pp. 116-125, Sydney, Australia.
[10] J. Han, M. Kamber and A. K. H. Tung (2001), "Spatial Clustering Methods in Data Mining: A Survey," in H. Miller and J. Han (Eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis.
[11] A. K. H. Tung, R. T. Ng, L. V. S. Lakshmanan and J. Han (2001), "Constraint-Based Clustering in Large Databases," in Proc. 8th ICDT, London, UK.
[12] A. Strehl and J. Ghosh (2000), "A Scalable Approach to Balanced, High-Dimensional Clustering of Market Baskets," in Proc. 17th International Conference on High Performance Computing, Springer LNCS, pp. 525-536, Bangalore, India.
[13] L. Ertoz, M. Steinbach and V. Kumar (2003), "Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data," in SIAM International Conference on Data Mining (SDM '03).
[14] M. Ankerst, M. M. Breunig, H. P. Kriegel and J. Sander (1999), "OPTICS: Ordering Points To Identify the Clustering Structure," in ACM SIGMOD, pp. 49-60.
[15] S. Roy and D. K. Bhattacharyya (2005), "An Approach to Find Embedded Clusters Using Density Based Techniques," in Proc. ICDCIT, LNCS 3816, pp. 523-535.
[16] H. B. Barua, D. K. Das and S. Sarmah (2012), "A Density Based Clustering Technique for Large Spatial Data Using Polygon Approach (TDCT)," IOSR Journal of Computer Engineering (IOSRJCE), ISSN: 2278-0661, Volume 3, Issue 6 (July-Aug. 2012), pp. 01-10.
[17] H. B. Barua and S. Sarmah (2012), "An Extended Density Based Clustering Algorithm for Large Spatial 3D Data Using Polyhedron Approach (3D-CATD)," International Journal of Computer Applications 58(2):4-15, November 2012. Published by Foundation of Computer Science, New York, USA (ISBN: 973-93-80871-32-3, ISSN: 0975-8887).
[18] E. Januzaj, H. P. Kriegel and M. Pfeifle (2003), "Towards Effective and Efficient Distributed Clustering," Workshop on Clustering Large Data Sets, ICDM '03, Melbourne, Florida.
[19] S. Sarmah, R. Das and D. K. Bhattacharyya (2007), "Intrinsic Cluster Detection Using Adaptive Grids," in Proc. ADCOM '07, Guwahati.
[20] S. Sarmah, R. Das and D. K. Bhattacharyya (2008), "A Distributed Algorithm for Intrinsic Cluster Detection over Large Spatial Data: A Grid-Density Based Clustering Technique (GDCT)," World Academy of Science, Engineering and Technology 45, pp. 856-866.
[21] E. Januzaj et al. (2003), "Towards Effective and Efficient Distributed Clustering," in Proc. ICDM 2003.

Thank You

    For more information, contact

    [email protected], [email protected]

About Tata Consultancy Services (TCS)

Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model™, recognized as the benchmark of excellence in software development. A part of the Tata Group, India's largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange and Bombay Stock Exchange in India.

For more information, visit us at www.tcs.com.

IT Services | Business Solutions | Consulting

All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content / information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced, republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS. Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws, and could result in criminal or civil penalties. Copyright © 2011 Tata Consultancy Services Limited
