

A new approach for accurate distributed cluster analysis for Big Data: competitive K-Means

Rui Máximo Esteves* Department of Electrical and Computer Engineering, University of Stavanger, Norway E-mail: [email protected] *Corresponding author

Thomas Hacker Computer and Information Technology, Purdue University, West Lafayette, Indiana E-mail: [email protected]

Chunming Rong Department of Electrical and Computer Engineering, University of Stavanger, Norway E-mail: [email protected]

Abstract: The tremendous growth in data volumes has created a need for new tools and algorithms to quickly analyse large datasets. Cluster analysis techniques, such as K-Means, can be distributed across several machines. The accuracy of K-Means depends on the selection of seed centroids during initialisation. K-Means++ improves on the K-Means seeder, but suffers from problems when it is applied to large datasets. In this paper, we describe a new algorithm and a MapReduce implementation we developed that address these problems. We compared the performance with three existing algorithms and found that our algorithm improves cluster analysis accuracy and decreases variance. Our results show that our new algorithm produced a speedup of 76 ± 9 times compared with the serial K-Means++ and is as fast as the streaming K-Means. Our work provides a method to select a good initial seeding in less time, facilitating fast, accurate cluster analysis over large datasets.

Keywords: K-Means; K-Means++; streaming K-Means; SK-Means; MapReduce.

Reference to this paper should be made as follows: Esteves, R.M., Hacker, T. and Rong, C. (2014) ‘A new approach for accurate distributed cluster analysis for Big Data: competitive K-Means’, Int. J. Big Data Intelligence, Vol. 1, Nos. 1/2, pp.50–64.

Biographical notes: Rui Máximo Esteves is a researcher at the University of Stavanger (UiS) in Norway, where his work focuses on data-intensive (Big Data) machine learning, optimisation and cloud computing. He was a Guest Editor for the special issue 'Cloud computing and Big Data' in the Journal of Internet Technology and Chair of the Cloud Computing Contest at the International Conference on Cloud Computing Technology and Science (CloudCom). He was a Professor Assistant in Pattern Recognition and in Web Semantic Technologies at UiS. He lectured at the University of Trás-os-Montes in Portugal in Forestry Statistics and in Forestry Remote Detection. He worked for the National Institute of Statistics in Portugal. He participated in research projects related to optimisation of energy consumption, statistics and remote detection applied to forestry.

Thomas Hacker is an Associate Professor of Computer and Information Technology at Purdue University and a Visiting Professor in the Department of Electrical Engineering and Computer Science at the University of Stavanger in Norway. His research interests centre around high performance computing and networking at the operating system and middleware layers. Recently, his research has focused on cloud computing, cyberinfrastructure, scientific workflows and data-oriented infrastructure. He is also a co-Leader for Information Technology for the Network for Earthquake Engineering Simulation (NEES), which brings together researchers from 14 universities across the country to share innovations in earthquake research and engineering. He received his BS in Physics and BS in Computer Science from Oakland University in Rochester, Michigan, USA. He received his MS and PhD in Computer Science and Engineering from the University of Michigan, Ann Arbor, Michigan.

Chunming Rong is the Head of the Center for IP-based Service Innovation (CIPSI) at the University of Stavanger in Norway, where his work focuses on big data analytics, cloud computing, security and privacy. He is an IEEE Senior Member and has been a member of the Norwegian Academy of Technological Sciences since 2011. He is a Visiting Chair Professor at Tsinghua University (2011–2014) and also served as an Adjunct Professor at the University of Oslo (2005–2009). He is the co-founder and Chairman of the Cloud Computing Association (CloudCom.org) and its associated IEEE conference and workshop series.

This paper is a revised and expanded version of a paper entitled ‘Competitive K-Means, a new accurate and distributed K-Means algorithm for large datasets’ presented at the 5th IEEE Cloudcom Conference, Bristol, UK, 2–5 December 2013.

1 Introduction

The volume of data generated daily is growing at an exponential rate. This growth is due in part to the proliferation of sensors and the increase in the resolution of those sensors. To distil meaningful information from this growing mountain of data, there is a growing need for advanced data analysis techniques, such as cluster analysis. Clustering is a key element of the Big Data problem. In a Big Data context it is not feasible to 'label' large collections of objects, and it is common to have no prior knowledge of the underlying data structure, the number of groups or their nature. Moreover, in the Big Data context the data tend to change over time, so clustering methods can produce clusters that are dynamic. Clustering supports efficient browsing, search, recommendation and document classification, all of which are relevant tasks for Big Data. As a consequence, cluster analysis faces new challenges in processing tremendously large and complex datasets stored and analysed across many computers. Since moving a large amount of data between machines is more costly than moving the computation to the data, a recent trend is to move algorithms, which typically represent a few KB, to process chunks of the dataset independently. The MapReduce approach is a seamless solution to distributed computation that can be used to solve this problem; however, it requires new algorithms that can benefit from MapReduce technology.

K-Means (Zhao et al., 2009; Ekanayake et al., 2008) is a cluster analysis algorithm that can be implemented using an embarrassingly parallel approach for clustering large datasets distributed across several machines.

A proper initialisation of K-Means is crucial to obtaining a good final solution. There are no efficient lightweight techniques for improving the choice of initial centroids when the dataset has the following characteristics:

1 a large number of clusters

2 a high-feature dimensionality

3 a large number of data points

4 storage across several systems.

K-Means++ (developed by Arthur and Vassilvitskii, 2007) employs an improved seeding method that increases K-Means quality by choosing initial cluster centroids that are largely distant from each other.

However, we observed two major problems when we applied the K-Means++ seeding method to large datasets. First, K-Means++ is a stochastic algorithm, which means that the results it produced were considerably different across several analysis runs using the same initial conditions. We observed that the difference in the results grows as the dataset contains more points and has higher feature dimensionality. Still, we found that the quality of K-Means++ was much better than that of K-Means with a random selection of initial centroids. Second, K-Means++ is an inherently serial algorithm that is time-consuming for large datasets.

In related work, several authors proposed improvements to the second problem without considering the first (Esteves et al., 2012; Bahmani et al., 2012; Ailon et al., 2009). Pavan et al. (2010) address the first problem, but at the expense of worsening the second.

In this paper, an extended version of a paper we presented at CloudCom 2013 (Esteves, 2013), we propose a new parallel seeding algorithm named competitive K-Means (CK-Means) that addresses both problems affecting serial K-Means++. We also propose an efficient MapReduce implementation of our new CK-Means that we found scales well with large datasets.


2 Theory and background

2.1 K-Means

K-Means is a partition-based cluster analysis algorithm that tries to solve the following clustering problem. Given an integer k, a distance measure dm and a set of n data points in a d-dimensional space, the goal is to choose k centroids so as to minimise a cost function, usually defined as the total distance between each point in the dataset and the closest centroid to that point. The exact solution to this problem is NP-hard, and K-Means provides an approximate solution that has O(nkd) running time (Ailon et al., 2009).
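Written out, the cost that K-Means minimises can be expressed as follows (we show the squared-distance form, which matches the WSSQ fitness measure used later in Section 3.3; the exponent is our assumption, since the text above only says 'total distance'):

cost(C) = Σ_{p ∈ X} min_{c ∈ C} dm(p, c)², where C = {c1, …, ck} is the set of centroids.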

The K-Means algorithm is simple and straightforward: first, it randomly selects k points from the whole dataset. These points represent the initial centroids (or seeds). Each remaining point in the dataset is assigned to the cluster whose centroid is closest to that point. The coordinates of the centroids are then recalculated: the new coordinates of a centroid correspond to the average of all points assigned to its cluster. This process iterates until a cost function converges to an optimum, with no guarantee that it is the global one. Therefore, selecting the best possible set of centroids during the initialisation process is essential (Ostrovsky and Rabani, 2006). The accuracy of large dataset cluster analyses using K-Means depends on centroid initialisation methods that are accurate and adapted to datasets distributed across several machines.
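As an illustration of the assignment and update steps just described, a minimal base-R sketch of one K-Means pass could look as follows (the function and variable names are ours and are not part of the paper):

# One Lloyd iteration: assign each point to its nearest centroid, then
# recompute every centroid as the mean of the points assigned to it.
lloyd_step <- function(X, centroids) {
  # squared Euclidean distance from every point to every centroid (n x k matrix)
  d2 <- apply(centroids, 1, function(c) rowSums(sweep(X, 2, c, "-")^2))
  assignment <- max.col(-d2)   # index of the closest centroid for each point
  for (j in seq_len(nrow(centroids))) {
    members <- X[assignment == j, , drop = FALSE]
    if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
  }
  list(centroids = centroids, assignment = assignment)
}
# The iteration is repeated until the cost stops decreasing; base R's
# kmeans(X, centers, algorithm = "Lloyd") implements the same loop.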

2.2 The problem of selecting initial centroids for large Big Data datasets

A considerable amount of prior work has been undertaken in cluster analysis and in developing new approaches to select an excellent set of initial centroids for cluster analysis techniques such as K-Means. In the era of Big Data, however, the techniques that have been developed for initial centroid selection may not scale to the hundreds of thousands to millions of data points present in large datasets, or to the high degree of dimensionality in these large datasets. The inherent characteristics of Big Data, known as the 4-V's, present considerable challenges to existing methods. The volume of data is the first critical characteristic. Big Data involves the management and analysis of tremendous volumes of data reaching into the petabytes. One example is from the Compact Muon Solenoid project, which produces petabytes of data annually from the Large Hadron Collider at CERN. Another example of volume is the production of log records for large-scale supercomputer systems, which can be comprised of thousands of individual computational nodes, each producing thousands of log entries each day. Another distinguishing characteristic is variety. The inherent dimensional complexity of Big Data, along with the much larger potential degrees of freedom, requires faster and more accurate techniques for cluster analysis and centroid selection. Moreover, the growing pervasive availability of multi-core processors and commodity-based cluster computing systems, coupled with high-performance parallel storage systems such as Hadoop and Lustre, provides fertile new ground for the development of new parallel algorithms and data analytic techniques that can take full advantage of these new architectures and platforms. This is the motivation for the work described in this paper.

According to Meilă and Heckerman (1998), there is no formal delimitation between initial search and search. The task of finding the initial position of the centroids means finding the global optimum to a certain accuracy. This task represents a search per se, and in some circumstances it can be a more challenging one than that performed by the main clustering algorithm. The concept of ‘best’ generally implies a tradeoff between accuracy and computation cost. To Meilă and Heckerman ‘the best initialisation method’ is therefore an ill-defined notion, which depends on the data, the accuracy/cost tradeoff and the clustering algorithm. Meilă and Heckerman studied the effect of three different initialisation methods on the expectation-maximisation clustering algorithm: the Random approach; the noisy-marginal method of Thiesson et al. (1997) and the hierarchical agglomerative clustering. The last two initialisation methods are data dependent while the first one is data independent. On a synthetic dataset created by Meilă and Heckerman, the Random method performed worse than did the data-dependent initialisation method, but this difference was not seen for the real-world data.

Khan and Ahmad (2004) present their cluster centre initialisation method (CCIA) for K-Means. Their algorithm first computes K cluster centres for individual attributes. After this first step, the method chooses seeds by examining each of the attributes individually to extract a list of K′ possible seed locations. It might happen that K′ > K (more cluster centres were computed than needed). To reduce the number of centres to K, their method uses density-based multiscale data condensation (DBMSDC). DBMSDC computes the density of the data at a point and then sorts the points according to their density. The point at the top of the sorted list is chosen, and all the points within a radius inversely proportional to the density of that point are pruned. The DBMSDC algorithm then moves to the next point that has not been pruned from the list and repeats. This is repeated until the desired number of points K remains. Their method assumes that each attribute is normally distributed. It also assumes that individual attributes can provide some hints to initial cluster centres. The authors tested their method on a small real dataset of 87 samples and 6 dimensions. In the Big Data scenario, where we have hundreds or thousands of dimensions, these assumptions are difficult to guarantee as a priori conditions.

Al-Daoud (2007) introduced a new algorithm for initialisation. The idea is to find the dimension with maximum variance, sort the data by that dimension, divide it into a set of groups of data points, find the median for each group, and use the corresponding data points (vectors) to initialise K-Means. Using the median instead of the average for selecting the centres increases the robustness of their approach to outliers. The authors tested their algorithm with an artificial dataset and with an image dataset (described as the 'well known baboon image' in their paper). Both datasets tested by the authors have at most eight dimensions. Relying on the distribution of a single dimension to determine the initial position of centroids assumes that the chosen dimension is representative of the distribution of the dataset. To be representative, the variance of the selected dimension has to be significantly higher than that of all the others. Suppose that we have several dimensions with similar high variance, leading to several possible choices. If each choice corresponds to a different sorting of the data, the choice of dimension will affect the selection of the initial centroids. Their approach also assumes that the maximum variance is observed in only one dimension. These assumptions can be reasonable in datasets with a limited number of dimensions. However, in the Big Data scenario, we face the challenge of high dimensionality. For example, consider the use case of document clustering applied to Wikipedia, as presented in our previous work (Esteves et al., 2011). The analysed Wikipedia dataset is 30 GB and has 11,500 dimensions after pre-processing. A 30 GB dataset is a modest use case of Big Data, yet it is sufficient to show that, with 11,500 dimensions, chances are that several dimensions have similar variance with distinct sortings. Thus, the method presented by Al-Daoud is not suitable for high dimensionality and, consequently, for Big Data.

Redmond and Heneghan (2007) propose a method for initialising K-Means using kd-trees. The kd-tree is a binary tree in which every node is a k-dimensional point. Redmond and Heneghan use the kd-tree as a top-down hierarchical scheme for partitioning data. Every non-leaf node represents a splitting of the data along the longest dimension of the parent node. The median value computed for the longest dimension of the parent node is used as the splitting criterion. After creating the kd-tree, the Redmond and Heneghan method computes the density and the mean value for each leaf node. It also computes the distances between the mean values of the leaf nodes. The method combines the information obtained from the computed densities and distances to select centres of leaf nodes that have high density and are far apart from each other. The selected centres are the initial centroids for K-Means. The authors tested their algorithm with artificial and real-world datasets that have fewer than 20 dimensions. According to Redmond and Heneghan, kd-trees have poor ability to scale to high dimensions. Therefore, this method is not appropriate for Big Data.

El Agha and Ashour (2012) presented an initialisation method for K-Means. Taking a two-dimensional dataset as an example, ElAgha initialisation first finds the boundaries of the data points and then divides the area covered by the points into K rows and K columns, forming a 2D grid. ElAgha initialisation then uses the upper left corners of the cells lying on the diagonal as base points. The base points are then randomly biased to generate the actual initial centroids for K-Means. In other words, ElAgha initialisation generates K points using a semi-random technique: it takes the diagonal of the data as a starting line and selects the points randomly around it. El Agha and Ashour tested their algorithm with three artificial datasets created by the authors and with the well-known, publicly available Iris dataset by Fisher (1936). The three artificial datasets have just two dimensions, with no outliers and well-defined clusters that are homogeneously distributed in the sample space. With just four dimensions and four classes, the Iris dataset is also a simple dataset. ElAgha initialisation achieved better results than Random initialisation, mainly on the artificial datasets. Our interpretation is that, by setting the base points from a diagonal of the data, the ElAgha method assumes that the data obey a particular geometric structure. For example, the ElAgha initialisation method does not select initial centroids from the opposite corners of the diagonal. In contrast, our new CK-Means picks initial centroids based on a probability distribution according to the distance between the centroids. Thus, we do not favour or exclude any specific regions that are pre-mapped in the clustering space, nor do we assume any particular structure in the data. Therefore, if a dataset has clusters situated in the opposite corners of the diagonal, ElAgha does not pick any point there, while our CK-Means picks points in those regions with high probability as long as the centroids are far away from each other. Making assumptions about the structure of datasets in advance might not be a major problem for clustering small and simple datasets. In a Big Data scenario this is usually not the case; therefore, our new algorithm is more suitable for Big Data.

2.3 Related work in centroid initialisation methods for K-Means

Several studies have investigated the parallelisation of K-Means (Zhang et al., 2006; Gursoy, 2004; Stoffel and Belkoniene, 1999; Zhao et al., 2009; Jin et al., 2006; Kumar et al., 2011; Wasif and Narayanan, 2011; Ekanayake et al., 2008) to improve performance, but neglected the problem of the seeding initialisation needed to select good centroids for cluster analysis. Niknam et al. (2011) present a survey of these methods. Evolutionary algorithms are well suited to parallelisation (Knysh and Kureichik, 2010; Crainic and Toulouse, 2010), but they require input parameters that may not be easy to determine (Eiben et al., 2007). The K-Means++ algorithm (Arthur and Vassilvitskii, 2007) improves the initialisation of K-Means by selecting an initial set of centroids that has a higher probability of being closer to the optimum solution.

A few studies investigate the parallelisation of K-Means++. Bahmani et al. (2012) present a scalable algorithm inspired by the original serial K-Means++. Their main idea is that instead of sampling a single point in each pass of the K-Means++ algorithm, O(k) points can be sampled per iteration. The process repeats for approximately O(log n) iterations. At the end of the iterations, O(k log n) points are candidates to form a solution. In the next step, these candidates are clustered into k clusters using serial K-Means++. The resulting k centroids are then the seed centroids for the K-Means analysis.

The scalable K-Means++ from Bahmani et al. has one major downside, which is the requirement of an extra input parameter called the oversampling factor l. The value of l is not intuitive, and its choice can dramatically change the quality of the results. As the authors admit, the analysis of the provable guarantees is highly non-trivial and requires new insights compared with the analysis of K-Means++.

In previous work (Esteves et al., 2012), we described a solution to parallelise the most intensive calculations of the serial K-Means++. Our approach maintains the batch structure of the original serial K-Means++; therefore, the provable guarantee of being an expected O(log k) approximation to the K-Means problem is maintained. The approach presented in Esteves et al. (2012) reduced the execution time by half compared with the serial K-Means++, with no need for additional parameters. However, these results were obtained using only a single multicore machine. Since our solution presented in Esteves et al. (2012) relies on the same batch principles as the serial K-Means++, it is inefficient for datasets stored across several machines.

Ailon et al. (2009) proposed an algorithm named streaming K-Means (SK-Means) that is another scalable approximation to K-Means++.

Ackermann et al. (2010) introduced a streaming algorithm based on K-Means++ that shares similarities with the one presented in Ailon et al. (2009). Both algorithms follow a divide-and-conquer strategy, whereby the dataset is partitioned into smaller subsets and analyses are run in parallel on each partition. Both algorithms address the problem of speeding up the K-Means++ running time; however, they do not address the problem that K-Means++ is a stochastic algorithm that can produce considerably different results across several runs using similar initial conditions.

Pavan et al. (2010) proposed a deterministic algorithm based on K-Means++ that addresses the stochastic problem. To find each cluster centroid, Pavan's method calculates a distance matrix between every pair of points in the dataset. Pavan's algorithm is serial and has O(n²kd) complexity; its worst-case running time is therefore even larger than that of serial K-Means++.

Our new CK-Means reduces the running time compared with serial K-Means++. Several runs of our new algorithm using the same initial conditions produce results with less variance and better quality than the serial K-Means++. Thus, our new algorithm improves the seeding compared with K-Means++.

2.4 Using MapReduce/Hadoop for distributed cluster analysis

Hadoop (White, 2010) provides a distributed file system and a framework for the analysis and transformation of very large datasets using the MapReduce (Dean and Ghemawat, 2008) programming model. The Hadoop distributed file system (HDFS) (Shvachko et al., 2010) is designed to reliably store very large datasets across several systems. HDFS is suitable for storing very large files (up to TBs in size) to be processed in a write-once, read-many-times pattern. A typical MapReduce job involves reading a large proportion, if not all, of the dataset; so the time required for reading the whole dataset is more important than the latency incurred in reading the first record (Shvachko et al., 2010).

Our new CK-Means approach is suitable for efficiently performing cluster analyses on large datasets distributed across several machines. The algorithm can be executed in parallel by a cluster of computers. The heavy calculations can be performed by each machine in a cluster on a chunk of data independently of the remaining dataset. Thus our new algorithm can be easily parallelised to use MapReduce and Hadoop.

2.5 Hadoop and R for distributed cluster analysis

R (The Comprehensive R Archive Network, 2012), an open-source statistics package, is by default a serial program that uses only a single computational core. We can use the doMC package (The Comprehensive R Archive Network, 2012) to extend R to use multiple cores in a single machine. There are three approaches available at the moment (Holmes, 2012) to integrate R with a Hadoop cluster:

a R + streaming – by using Hadoop streaming functionality, a user launches a streaming job and provides the map-side and reduce-side R scripts

b RHadoop – an R package integrated with the R environment. RHadoop provides an R wrapper on top of Hadoop and streaming

c RHIPE – similar to RHadoop, also integrated with the R environment.

Rather than using streaming, RHIPE uses its own Java map and reduce functions. Since it does not rely on Hadoop streaming, RHIPE is the fastest of the three approaches. Therefore, we chose RHIPE to implement our new CK-Means approach.

3 Algorithms

In this section, we describe the existing serial K-Means++ (Arthur and Vassilvitskii, 2007) (Algorithm 1), the existing SK-Means (Ailon et al., 2009) (Algorithm 2), which is a parallel algorithm that has been theoretically proven to provide results similar to the serial K-Means++, and our new CK-Means (Algorithm 3). At the end of this section, we describe a MapReduce implementation of our new CK-Means (Algorithms 4 to 6).

3.1 Serial K-Means++

The main idea in the K-Means++ algorithm shown in Algorithm 1 is to choose the set of initial centroids IC for K-Means one-by-one in a sequential manner, where the current set of chosen centroids will stochastically bias the selection of the next centroid (Arthur and Vassilvitskii, 2007).

Algorithm 1 Serial K-Means++

Input: A set of data points X and the number of desired centroids k

Output: X points grouped into k clusters and respective final centroids FC

1: IC ← a single data point uniformly sampled at random from X

2: While ||IC|| < k do

3: For each data point dp ∈ X, compute D(dp, IC), where D is the shortest distance from dp to the closest centroid ic ∈ IC

4: Sample dp ∈ X with probability D(dp, IC)² / Σ_{dp′ ∈ X} D(dp′, IC)²

5: IC ← IC ∪ {dp}

6: End while

7: K-Means on X using the set of initial centroids IC

Source: Arthur and Vassilvitskii (2007)

K-Means++ is a serial algorithm that repeats steps 2 to 5 k times to select the set of initial centroids IC for K-Means. When the selection of the IC centroids is complete, K-Means++ proceeds with step 7 and performs cluster analysis on X, using K-Means with IC as the set of initial centroids. At the end of step 7, K-Means has grouped the X points into k clusters and calculated the final set of centroids FC, which corresponds to the averages of the points within each cluster.
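For illustration, the seeding loop of Algorithm 1 can be sketched in a few lines of base R (the function name kmeanspp_seed is ours; the sketch assumes squared Euclidean distance and k ≥ 2):

# Sketch of the K-Means++ seeder in Algorithm 1.
# X: numeric matrix with one data point per row; k: number of initial centroids.
kmeanspp_seed <- function(X, k) {
  n  <- nrow(X)
  IC <- matrix(NA_real_, nrow = k, ncol = ncol(X))
  IC[1, ] <- X[sample.int(n, 1), ]                       # step 1: uniform first seed
  d2 <- rep(Inf, n)
  for (i in 2:k) {
    # step 3: squared distance of every point to its closest chosen centroid
    d2 <- pmin(d2, rowSums(sweep(X, 2, IC[i - 1, ], "-")^2))
    # step 4: sample the next centroid with probability proportional to D^2
    IC[i, ] <- X[sample.int(n, 1, prob = d2 / sum(d2)), ]
  }
  IC
}
# Step 7 then runs K-Means with these seeds, e.g.
# kmeans(X, centers = kmeanspp_seed(X, k), algorithm = "Lloyd")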

3.2 Streaming K-Means

SK-Means, shown in Algorithm 2, divides the input into m equal-sized groups in step 1. In step 2, the algorithm runs on each group a variant of K-Means++ that selects 3 × log(k) points in each iteration, a total of k times (traditional K-Means++ selects only a single point per iteration). In step 4, the algorithm weights the m sets of points. The process of weighting the points is not clearly specified in Ailon et al. (2009), and the authors provide no guidelines on how to weight them. In step 6, the algorithm runs serial K-Means++ on the weighted set of these points to reduce the number of centroids to k (Ailon et al., 2009).

Algorithm 2 Streaming K-Means

Input: A set of data points X; the number of clusters k and the number of partitions m. A: K-Means++ modified to select 3 log(k) points per iteration. A′: K-Means++ selecting one point per iteration.

Output: X points grouped into k clusters and respective final centroids FC

1: Partition X into X1, X2, ..., Xm
2: For each i ∈ {1, 2, …, m} do
3: Run A on Xi to get 3k × log(k) centroids Ti = {ti1, ti2, …}
4: Denote the induced clusters of Xi as Si1 ∪ Si2 ∪ …
5: Sw ← T1 ∪ T2 ∪ ... ∪ Tm
6: Run A′ on Sw to get k centroids C
7: Run K-Means on X using C as the set of initial centroids to obtain FC

Source: Ailon et al. (2009)

3.3 Our new CK-Means

We observed that running K-Means++ over large datasets produces varying results because of the algorithm's inherent stochastic nature. We can decrease the variability of the results if we run several instances of serial K-Means++ in parallel and select for our cluster analysis the instance that produces the most accurate clustering results. However, running serial K-Means++ over large datasets is time-consuming. To solve this problem, our new approach reduces the execution time of the serial K-Means++ by performing several cluster analysis instances of K-Means++ over subsets of the dataset in parallel. The result of each cluster analysis instance on a subset of the dataset is then scored using a fitness measure. The K-Means++ instances compete with each other, and the winner is the instance with the best-fit cluster analysis. The set of initial centroids IC of the winning K-Means++ instance is then used as the set of initial centroids IC for K-Means cluster analysis over the entire dataset.

For an overview of our CK-Means, consider a dataset X for which cluster analysis is needed. We can randomly select x, a subset of X, to be used as input for K-Means++ with the aim of selecting a set of initial centroids IC that will be used to perform cluster analysis on the entire dataset X and produce a set CLx as the result of such analysis. We define f as a fitness measure used to score the results of a cluster analysis. For this paper, we defined the fitness measure f to be the within sum of squares (WSSQ) function (Hartigan and Wong, 1979). Any of a number of fitness measures, such as the Silhouette (Rousseeuw, 1987), the intra-cluster similarity technique or the centroid similarity technique (Steinbach et al., 2000), could be used; however, we found that the WSSQ was adequate. For a given clustering problem where we perform cluster analysis several times using WSSQ as f, the best-fit cluster analysis corresponds to the one with the lowest WSSQ.
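For reference, the WSSQ of a clustering with clusters S1, …, Sk and centroids c1, …, ck is the usual within-cluster sum of squares (standard definition, our notation):

WSSQ = Σ_{j = 1..k} Σ_{p ∈ Sj} ||p − cj||²

so the best-fit analysis is simply the run that minimises this quantity.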

From our experiments, we found that the expectation value E[f(CLx)] is a good approximation to the expectation value E[f(CLX)], where CLX is the resulting output of cluster analysis on the full dataset X. Consequently, we inferred that f(CLx) ≈ f(CLX), comparing K-Means++ run on a randomly selected representative subset x with K-Means++ run on the complete dataset X.

Let clx be the cluster analysis of x with initial k centroids selected by running K-Means++ over x.

We observed that when x is representative of X, there is a strong correlation (~0.7–0.9) between f(CLx) and f(clx). Thus, we can use f(clx) to select the best-fit set of initial centroids IC and use IC as the initial centroids for K-Means to perform cluster analysis over the full dataset X.

Our new strategy is shown in Algorithm 3. The choice of parameter m is optimal if it equals the total number of available computational cores and simultaneously meets the restriction whereby each partition xi must have more than the minimum sample size to be representative of X. The minimum sample size is O(k), and it is usually determined empirically by trial and error (Davidson and Satyanarayana, 2003).

Algorithm 3 Our new CK-Means

Input: A set of data points X shuffled into a random order; the number of centroids k; the number of competitors m and a fitness measure f

Output: X points grouped into k clusters and respective final centroids FC

1: Partition X into x1, x2, ..., xm
2: For each i ∈ {1, 2, …, m} do
3: Run K-Means++ on xi to get k centroids ICi and clusters clxi
4: Si = f(clxi)
5: C ← ICi, where i ← Best-fit(Si)
6: Run K-Means on X with C as initial centroids to obtain the cluster analysis output
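A minimal single-machine sketch of Algorithm 3, using the R, foreach and doMC setup described later in Section 4, might look as follows (all function names other than kmeans, foreach and registerDoMC are ours; kmeanspp_seed is the illustrative sketch from Section 3.1):

library(foreach)
library(doMC)
registerDoMC(cores = 6)                                 # one core per competitor

ck_means <- function(X, k, m) {
  X <- X[sample.int(nrow(X)), , drop = FALSE]           # shuffle into a random order
  part_id <- rep_len(seq_len(m), nrow(X))               # step 1: partition X into x1..xm
  # steps 2-4: every competitor seeds and clusters its own partition in parallel
  competitors <- foreach(i = seq_len(m)) %dopar% {
    xi <- X[part_id == i, , drop = FALSE]
    IC <- kmeanspp_seed(xi, k)
    cl <- kmeans(xi, centers = IC, algorithm = "Lloyd")
    list(IC = IC, S = cl$tot.withinss)                  # fitness f = WSSQ
  }
  # step 5: the winner is the competitor with the lowest WSSQ on its partition
  best <- which.min(vapply(competitors, function(r) r$S, numeric(1)))
  # step 6: K-Means on the full dataset, seeded with the winning centroids
  kmeans(X, centers = competitors[[best]]$IC, algorithm = "Lloyd")
}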

3.4 CK-Means: a MapReduce implementation

Our new CK-Means can be easily implemented using MapReduce and HDFS. We can benefit from the distributed platform provided by MapReduce and HDFS, using the approach described in Algorithm 4 to write X into HDFS where each partition is assigned to a unique MapReduce key.

Algorithm 4 Our approach to write the dataset into HDFS

Input: A set of data points X shuffled into a random order and the number of partitions m

1: Partition X into x1, x2, ..., xm
2: Assign to each xkey a unique key ∈ {1, 2, …, m}
3: Write the pairs (key, xkey) to HDFS
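A sketch of the keying step in Algorithm 4 is shown below; the actual write to HDFS is done with RHIPE's I/O facilities, which we omit here, and the helper name is ours:

# Shuffle X, split it into m partitions and pair each partition with a key.
make_keyed_partitions <- function(X, m) {
  X <- X[sample.int(nrow(X)), , drop = FALSE]           # random order
  key <- rep_len(seq_len(m), nrow(X))
  lapply(seq_len(m), function(i)
    list(key = i, xkey = X[key == i, , drop = FALSE]))  # (key, xkey) pairs
}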

Ideally, the input parameter m should match the maximum number of mappers that the computer cluster can run simultaneously. In a scenario where we have many machines and a relatively small dataset, m should be selected by considering that each partition must have sufficient data points to represent X. When RHIPE is used, the maximum size of each partition cannot exceed 256 MB, a limitation imposed by Google Protocol Buffers (Protocol Buffers – Google Developers, 2012), the serialisation protocol used by RHIPE. HDFS automatically partitions and distributes X across the machines in the cluster. Steps 3 to 5 of Algorithm 3 can be easily parallelised using a map function. The MapReduce platform launches several map tasks, each one processing a partition xi on the machine where it is stored. At the end of the computation, each map emits to a reducer a candidate solution with a set of initial centroids and a fitness score. The reducer chooses the fittest set of centroids. We present the details of the MapReduce function in Algorithm 5.

Algorithm 5 Our MapReduce seeder

Input: A path to the data points stored in HDFS; the number of clusters k and a fitness measure f

Output: k centroids C

1: Map every (key, xkey) pair
2: Run K-Means++ on xkey to get k centroids ICkey and k clusters clxkey
3: Skey = f(clxkey)
4: Emit to the reducer the (key′, (Skey, ICkey)) pair, where key′ is a constant
5: Reduce {(S1, IC1), (S2, IC2), …, (Sm, ICm)}
6: C ← ICy, where Sy = min_{1 ≤ i ≤ m}(Si)

Our new CK-Means uses a limited amount of bandwidth. The input values to the reducer are a fixed set of m pairs (S, IC) emitted by the mappers, which is independent of the total number of points in the dataset. Since S is a single value and IC is a set of k centroids, each with d values, a single map emits to the reducer (k × d) + 1 values; for example, with k = 100 and d = 37 (the KDD99 case), each mapper emits only 3,701 values. This represents an advantage of our new algorithm, since the network is a bottleneck-shared resource in large MapReduce jobs (Zaharia et al., 2008). The second step of Algorithm 6 calls a MapReduce implementation of K-Means, as described in Zhao et al. (2009).

Algorithm 6 Our CK-Means MapReduce

Input: An HDFS path to the stored data points and the number of clusters k

Output: X points grouped into k clusters and respective centroids C

1: Run the MapReduce seeder on the HDFS path to get k centroids C
2: Run K-Means_MR on the HDFS path with C as initial centroids and get CLx
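To make the map and reduce phases of Algorithm 5 concrete, the following is a local, non-distributed emulation in base R; the real implementation uses RHIPE map and reduce expressions on Hadoop, which we do not reproduce here. It reuses the illustrative kmeanspp_seed and make_keyed_partitions sketches above.

# Local emulation of the MapReduce seeder (Algorithm 5).
seeder_local <- function(keyed_partitions, k) {
  # map phase: each (key, xkey) pair yields one (Skey, ICkey) candidate
  emitted <- lapply(keyed_partitions, function(p) {
    IC <- kmeanspp_seed(p$xkey, k)
    cl <- kmeans(p$xkey, centers = IC, algorithm = "Lloyd")
    list(S = cl$tot.withinss, IC = IC)
  })
  # reduce phase: keep the candidate with the smallest fitness score
  best <- which.min(vapply(emitted, function(e) e$S, numeric(1)))
  emitted[[best]]$IC
}
# Algorithm 6 then runs a MapReduce K-Means (Zhao et al., 2009) with these
# centroids; in this local emulation that is simply
# kmeans(X, centers = seeder_local(make_keyed_partitions(X, m), k), algorithm = "Lloyd")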

4 Experimental setup

In this section, we demonstrate the significant improvements in processing time and cluster quality possible through the use of our new algorithm described in Section 3. We show this by evaluating the following five hypotheses:

H1 A correlation exists between f (CLx) and f (clx).

H2 Our new CK-Means improves the quality of cluster analysis compared with SK-Means, which is a parallel algorithm that has been theoretically proven by Ailon et al. (2009) to provide similar results as serial K-Means++.

H3 Our new CK-Means has a similar running time when compared with SK-Means.

H4 Our new CK-Means benefits from the usage of MapReduce to reduce the execution time needed for cluster analysis.

H5 Our new CK-Means scales with the dataset size, both with the total number of points and the dimensionality of the dataset.

4.1 The datasets

To test our hypotheses, we used 4 datasets with different characteristics.

• Hypercube – this is a synthetic dataset that we generated to test the behaviour of our new algorithm in an especially difficult situation of multiple overlapping clusters. We created the dataset by using the hypercube function in the R package mlbench. We generated seven hypercubes, each with ten dimensions. The lengths of the hypercube sides were sampled using a Gaussian distribution with a variance of 0.25 and an average of 1. We added points around the vertices sampled, using a Gaussian distribution with a variance of 0.25. The dataset has 10 K points with ten dimensions.

• Electrical – this is a dataset publicly available at the UCI machine learning repository (Frank and Asuncion, 2010) that consists of real measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost four years. The dataset has 2 M points with nine dimensions. We chose it to test our algorithm in a relatively small real-world clustering problem. Its public availability makes it easy for the reader to replicate the tests and benchmark with other algorithms.

• KDD99 – this dataset is publicly available at the UCI machine learning repository (Frank and Asuncion, 2010) and has been used for clustering benchmarks. We chose it to test our algorithm on a medium-sized real-world clustering problem with a medium number of features. The dataset consists of data from a network intrusion detector, with 8 M points in 37 dimensions. The network was subjected to 24 distinct attacks, which suggests the possibility of 24 existing clusters. We created two variations of the KDD99 dataset specifically to test H5. The KDD99n2 also has 37 dimensions, but twice the number of points (16 M). The KDD99d2 has the same number of points as the KDD99, but twice the number of dimensions (74).

• Google – a dataset of failures collected from a Google computer cluster (TraceVersion2 – Google Cluster Data – Second Format of Cluster-Usage Traces – Traces of Google Workloads – Google Project Hosting, 2012). Using this dataset, we tested the performance of our new algorithm in a specific clustering situation where we have a great many points with few dimensions. The dataset has 13 M points with two dimensions, which represent the time stamp when a node has a failure and the location identifier of the node where the failure occurs. We aim to reduce the spatial-temporal dependency present in this dataset by compressing the data into cluster prototypes and using the prototypes as data representatives for further processing. The problem is described by Hacker et al. (2009).

4.2 The equipment

To follow the Hadoop paradigm, we used a cluster based on commodity hardware. Each machine has the following configuration: 16 GB 1,333 MHz DDR3 RAM; 1 TB hard drive (SATA 6 Gb/s, 64 MB cache, 7,200 RPM); 1 AMD Phenom™ II X6 CPU with 6 cores at 3,300 MHz; onboard 1,000 Mbps LAN; Linux CentOS with kernel version 2.6.32-358.2.1.el6.centos.plus.x86_64; Hadoop version 0.20.203.0; R version 2.15.2; RHIPE version 0.69; doMC version 1.3.0.

4.3 The experiments

For all of the experiments, we normalised the data, used the Euclidean distance as the distance measure and used WSSQ as the fitness measure f.

4.3.1 Experiment A

The aim of Experiment A was to test Hypothesis 1. To achieve this aim, we ran serial K-Means++ on each partition set element {x1, x2, ..., xm} of the dataset X to obtain IC1, IC2, ..., ICm sets of k initial centroids. We performed cluster analyses CLx1, CLx2, ..., CLxm using K-Means on X with IC1, IC2, ..., ICm as the initial centroids. We then performed clx1, clx2, ..., clxm cluster analyses using K-Means on each partition set element {x1, x2, ..., xm}, using the respective IC1, IC2, ..., ICm as initial centroids. We measured the correlation between the fitness of CLx and the fitness of clx. We used the entire hypercube and electrical datasets. Several days were required to run one serial K-Means++ over the entire KDD99 and Google datasets; thus, we performed the experiment with 10% of the KDD99 and Google datasets. All datasets were tested for a number of centroids k = 50, 100, 500 and 1,000, and with a number of competitors m = 6. We calculated each correlation coefficient based on 100 repetitions of cluster analyses under the same circumstances. We used a single machine with six cores running R and the doMC package.

4.3.2 Experiment B

The aim of Experiment B was to test H2 and H3. We performed cluster analysis using the following algorithms:

a K-Means with random initial centroids

b SK-Means

c our new CK-Means.

We performed the experiment using a random sample of 10% of the KDD99 and Google datasets; the 10% sample was the same for all tests. We used the entire hypercube and electrical datasets. All datasets were tested for k = 50, 100, 500 and 1,000. We repeated each test 100 times and measured the WSSQ. To measure the execution times, we repeated each experiment five times. We used a single machine with six cores running R and the doMC package.

4.3.3 Experiment C

The aim of Experiment C was to test H4. For this experiment, we performed cluster analysis using a MapReduce implementation of our new CK-Means. We used 15 machines with six cores each, running Hadoop, R and the RHIPE package, and compared the results with one machine with six cores running R and the doMC package. The HDFS block size was 32 MB. We performed the experiment with the entire KDD99 and Google datasets for k = 50, 100. Because we have 15 machines with six cores each, we chose m = 90. We repeated each test five times and measured the execution times used by our MapReduce seeder (Algorithm 5).

4.3.4 Experiment D

The aim of Experiment D was to test H5. For this experiment we performed cluster analysis using a MapReduce implementation of our new CK-Means; the KDD99 dataset was the baseline. For testing how the algorithm scales with an increasing number of points, we compared the execution times of cluster analysis of the KDD99 dataset with the KDD99n2 dataset. To test how it scales with an increasing number of dimensions, we compared the execution times of cluster analysis of the KDD99 dataset with the KDD99d2 dataset.

The KDD99d2 and KDD99n2 datasets each occupy 4 GB of disk space. We used 15 machines with six cores each, running Hadoop, R and the RHIPE package. The HDFS block size was 32 MB. We tested the three datasets with k = 50, 100. Since we have 15 machines with six cores each, we chose m = 90. We repeated each test five times and measured the execution times used by our MapReduce seeder (Algorithm 5).

5 Experimental results

5.1 Experiment A

Table 1 shows the correlation coefficients obtained from a correlation analysis of the fitness of CLx and the fitness of clx, as defined in Section 3. We observe that the correlation increases as the size of the dataset increases: the rise in the number of points in a dataset is associated with an increase in the correlation. The Google dataset has 13 M points, while the hypercube has only 10 K points. We also observe that the correlation is stronger for smaller values of k.

Table 1 Correlation coefficients between f(CLx) and f(clx) for the four datasets of size n and varying k

Dataset      n      k = 50   k = 100   k = 500   k = 1,000
Hypercube    10 K   0.80     0.70      0.56      0.24
Electrical   2 M    0.80     0.71      0.53      0.43
KDD99        8 M    0.79     0.76      0.74      0.72
Google       13 M   0.92     0.86      0.78      0.65

The hypercube and electrical datasets have weak correlations for k = 500 and 1,000. The drop in the correlation is explained by the diminishing representativeness of each partition of X as k increases. As we increase the number of clusters, the minimum number of points per partition necessary to represent the dataset rises. Thus, this problem is more noticeable in the smallest dataset tested, which is the hypercube. The solution to this problem is to reduce the number of partitions m for k equal to 500 and 1,000 in both the hypercube and electrical datasets. However, we maintained m for Experiment B to explore the effects of the lack of representativeness on the quality of the cluster analysis.

The results show that a correlation exists between f(CLx) and f(clx), which proves that our first hypothesis is true. However, the correlation decreases when each partition of X lacks enough points to represent X.

5.2 Experiment B

Figure 1 compares the results of cluster analysis of the hypercube dataset between the following methods:

a the K-Means using random initial centroids

b SK-Means, which is a parallel implementation with equivalent results to the serial K-Means++

c our new CK-Means.


The comparison is repeated for k = 50, 100, 500 and 1,000. Figures 2 to 4 show the same information as Figure 1 for the electrical, KDD99 and Google datasets, respectively. The y-axes in Figures 1 to 4 represent the fitness function f = WSSQ. A lower WSSQ value indicates a better selection of initial centroids for cluster analysis.

We observe in Figure 1 that our CK-Means is more accurate than SK-Means and K-Means for k = 50 and k = 100. However, the relative accuracy of CK-Means in comparison with SK-Means slightly deteriorates when k = 500 and dramatically deteriorates when k = 1,000. This decline occurs because each partition of X does not have the minimum sample size: the increase in the number of clusters requires an increase in the minimum sample size necessary to represent more clusters. However, the same undesired behaviour is exhibited by SK-Means: the accuracy of SK-Means compared with K-Means drops dramatically in the same way as that of CK-Means.

Figure 2 shows, in contrast to Figure 1, that the accuracy of CK-Means compared with SK-Means and K-Means for the electrical dataset did not decrease for higher values of k. Although the correlation coefficient is below 0.5 for k equal to 1,000 (see Table 1), CK-Means still produces better results than SK-Means and K-Means.

Figure 1 WSSQ box plots of K-Means, SK-Means and our new CK-Means for k = 50, 100, 500 and 1,000 (see online version for colours). Each panel shows the minimum and maximum WSSQ values. Dataset: hypercube. Lower is better.


Figure 2 WSSQ box plots of K-Means, SK-Means and our new CK-Means for k = 50, 100, 500 and 1,000 (see online version for colours). Each panel shows the minimum and maximum WSSQ values. Dataset: electrical. Lower is better.

Figures 3 and 4 show for the KDD99 and Google datasets that CK-Means is more accurate than SK-Means and K-Means in a consistent pattern for the four different values of k. A lower box means better cluster quality, and a thinner box means less variance. Our CK-Means produced lower and thinner boxes than SK-Means and K-Means for all tested situations.

These results show that our new CK-Means improves the cluster quality compared with SK-Means, thus proving our second hypothesis. In terms of execution time, CK-Means consumes (1 ± 1)% more time than SK-Means. When using one machine with six cores, our new approach reduces the serial K-Means++ execution time by (86.96 ± 0.65)%.

5.3 Experiment C

Table 2 shows a comparison of running times between a MapReduce distributed implementation and a non-distributed implementation of our MapReduce seeder (Algorithm 5).

Table 2 Comparison of CK-Means MapReduce seeder execution times in seconds (mean x̄ ± standard deviation σ) for 15-node MapReduce vs. a single node

               KDD99                       Google
               k = 50       k = 100        k = 50      k = 100
Single node    828 ± 13     1,877 ± 18     885 ± 15    1,638 ± 21
15 nodes       82 ± 8       142 ± 12       81 ± 9      142 ± 10

Using MapReduce to extend the computation to a cluster of 15 nodes, we can speed up CK-Means by 10 ± 1 times compared with one node with six cores. Compared with a single node, MapReduce reduces the execution time consistently for different values of k and for both datasets, despite their different characteristics. Using Hadoop, we also found that our new approach produced a speedup of 76 ± 9 times versus the serial K-Means++ seeder proposed by Arthur and Vassilvitskii. Our results show that our new CK-Means benefits from MapReduce to reduce the execution time, thus proving that Hypothesis 4 is true.

5.4 Experiment D

Table 3 shows how the execution time of our CK-Means MapReduce seeder scales with the dataset growth and with the number of clusters. The execution time is more sensitive to increases in n and k than to an increase in d.

Table 3 Comparison of CK-Means MapReduce seeder execution times (seconds) with varying dataset size

            k = 50   k = 100   k = 500   k = 1,000
KDD99       82       142       577       1,107
KDD99d2     112      172       713       1,394
KDD99n2     148      254       1,137     2,196

We observed that when the dataset is doubled in size, by increasing either n or d, the execution time of CK-Means grows less than twofold. This means that CK-Means becomes proportionally faster as the size of the dataset increases. The explanation is that our MapReduce implementation of the seeder launches a fixed number of mappers that depends on m and k and is independent of n and d. Increasing n or d while maintaining m and k produces the same number of mappers, but assigns each mapper more data to process. Thus, all three variants of KDD99 spend the same amount of execution time on setting up the mappers and on communication between the mappers and the reducer over the network. Figure 5 shows that increasing k has an almost linear effect on the execution time for the different variants of the KDD99 dataset. Our new algorithm scales with the size of the dataset, and thus Hypothesis 5 is true.

Figure 3 WSSQ box plots of K-Means, SK-Means and our new CK-Means for k = 50, 100, 500 and 1,000 (see online version for colours). Each panel shows the minimum and maximum WSSQ values. Dataset: KDD99. Lower is better.


Figure 4  WSSQ of K-Means, SK-Means and our new CK-Means with different values of k (see online version for colours)

[Four panels of WSSQ box plots (minimum and maximum values) for K-Means, SK-Means and CK-Means at k = 50, 100, 500 and 1,000]

Notes: Dataset: Google. Lower is better.

Figure 5  Scaling of the CK-Means MapReduce seeder with the KDD99 variants for different values of k


6 Conclusions

K-Means is a fast clustering method. However, initial seeding heavily influences the cluster quality. In this paper, we presented a new strategy to parallelise K-Means++ that improves the speed and accuracy of cluster analysis for large datasets. Our new CK-Means is highly scalable and benefits from the use of Hadoop and MapReduce. We observed that a Hadoop cluster of 15 machines running our algorithm produced a speedup of 76 ± 9 times compared with serial K-Means++, with improved accuracy. We found that our new CK-Means consistently improves cluster analysis accuracy compared with SK-Means. We proved that the third hypothesis is true, since our new algorithm is only 1% slower than the SK-Means. We found that MapReduce greatly decreased the running time, yielding a speedup of 10 ± 1 times compared with a non-distributed implementation. We found that our new algorithm scales with the dimension of the dataset. The running time is more sensitive to variations in the number of data points and clusters than to variations in the number of dimensions.

With these findings, we have addressed the problem of finding a good initial seeding in less time; accurate cluster analysis over large datasets can thus be performed using our new CK-Means approach.

References

Ackermann, M., Lammersen, C., Märtens, M., Raupach, C., Sohler, C. and Swierkot, K. (2010) ‘StreamKM++: a clustering algorithm for data streams’, ALENEX, pp.173–187.

Ailon, N., Jaiswal, R. and Monteleoni, C. (2009) ‘Streaming K-Means approximation’ [online] http://scholar.google.com.au/scholar.bib?q=info:eeMPmjm4TNsJ:scholar.google.com/&output=citation&hl=en&as_sdt=2000&ct=citation&cd=0.

Al-Daoud, M.B. (2007) ‘A new algorithm for cluster initialization’, World Academy of Science, Engineering and Technology, Vol. 1, No. 4, pp.568–570.

Arthur, D. and Vassilvitskii, S. (2007) ‘K-Means++: the advantages of careful seeding’, SODA ‘07 Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms [online] http://ilpubs.stanford.edu:8090/778/.

Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S. (2012) ‘Scalable K-Means++’, Proc. VLDB Endow, Vol. 5, No. 7, pp.622–633.

Crainic, T.G. and Toulouse, M. (2010) Handbook of Metaheuristics, Vol. 146, pp.497–541; International Series in Operations Research & Management Science, Springer, USA [online] http://dx.doi.org/10.1007/978-1-4419-1665-5_17.

Davidson, I. and Satyanarayana, A. (2003) ‘Speeding up K-Means clustering by bootstrap averaging’, in IEEE Data Mining Workshop on Clustering Large Datasets.

Dean, J. and Ghemawat, S. (2008) ‘MapReduce: simplified data processing on large clusters’, Commun. ACM, Vol. 51, No. 1, pp.107–113, doi:10.1145/1327452.1327492.

Eiben, A., Michalewicz, Z., Schoenauer, M. and Smith, J. (2007) ‘Parameter setting in evolutionary algorithms’, Studies in Computational Intelligence, Vol. 54, pp.19–46, Springer Berlin/Heidelberg [online] http://dx.doi.org/10.1007/978-3-540-69432-8_2.

Ekanayake, J., Pallickara, S. and Fox, G. (2008) ‘MapReduce for data intensive scientific analyses’, in eScience ’08, IEEE Fourth International Conference on, pp.277–284.

El Agha, M. and Ashour, W.M. (2012) ‘Efficient and fast initialization algorithm for K-Means clustering’, International Journal of Intelligent Systems and Applications (IJISA), Vol. 4, No. 1, p.21.

Esteves, R.M. and Rong, C. (2011) ‘Using mahout for clustering Wikipedia’s latest articles: a comparison between K-Means and fuzzy C-Means in the cloud’, Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on, pp.565–569, 29 November to 1 December, doi: 10.1109/CloudCom.2011.86.

Esteves, R.M., Hacker, T. and Rong, C. (2012) ‘Cluster analysis for the cloud: parallel competitive fitness and parallel K-Means++ for large dataset analysis’, in Proceedings of the 2012 4th IEEE International Conference on Cloud Computing Technology and Science, IEEE Computer Society, Taipei, Taiwan.

Frank, A. and Asuncion, A. (2010) UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences [online] http://archive.ics.uci.edu/ml.

Gursoy, A. (2004) ‘Data decomposition for parallel K-Means clustering’, Lecture Notes in Computer Science, Vol. 3019, pp.241–248, Springer Berlin/Heidelberg [online] http://dx.doi.org/10.1007/978-3-540-24669-5_31.

Hacker, T.J., Romero, F. and Carothers, C.D. (2009) ‘An analysis of clustered failures on large supercomputing systems’, J. Parallel Distrib. Comput., Vol. 69, No. 7, pp.652–665.

Hartigan, J.A. and Wong, M.A. (1979) ‘Algorithm AS 136: a K-Means clustering algorithm’, Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 28, No. 1, pp.100–108.

Holmes, A. (2012) Hadoop in Practice, Manning Publications Co., New York.

Jin, R., Goswami, A. and Agrawal, G. (2006) ‘Fast and exact out-of-core and distributed K-Means clustering’, Knowledge and Information Systems, Vol. 10, No. 1, pp.17–40, doi:10.1007/s10115-005-0210-0.

Khan, S.S. and Ahmad, A. (2004) ‘Cluster center initialization algorithm for K-Means clustering’, Pattern Recognition Letters, Vol. 25, No. 11, pp.1293–1302, doi:10.1016/j.patrec.2004.04.007.

Knysh, D. and Kureichik, V. (2010) ‘Parallel genetic algorithms: a survey and problem state of the art’, Journal of Computer and Systems Sciences International, Vol. 49, No. 4, pp.579–589, doi:10.1134/S1064230710040088.

Kumar, J., Mills, R.T., Hoffman, F.M. and Hargrove, W.W. (2011) ‘Parallel K-Means clustering for quantitative ecoregion delineation using large data sets’, Procedia Computer Science, Vol. 4, pp.1602–1611.

Meilă, M. and Heckerman, D. (1998) ‘An experimental comparison of several clustering and initialization methods’, in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp.386–395, Morgan Kaufmann Publishers Inc., Madison, Wisconsin.

Page 15: Accurate Distributed Cluster Analysis for Big Data Competitive K-Means

64 R.M. Esteves et al.

Niknam, T., Fard, E.T., Pourjafarian, N. and Rousta, A. (2011) ‘An efficient hybrid algorithm based on modified imperialist competitive algorithm and K-Means for data clustering’, Engineering Applications of Artificial Intelligence, Vol. 24, No. 2, pp.306–317, doi:10.1016/j.engappai.2010.10.001.

Ostrovsky, R. and Rabani, Y. (2006) ‘The effectiveness of Lloyd-type methods for the K-Means problem’, in 47th IEEE Symposium on the Foundations of Computer Science (FOCS), pp.165–176.

Pavan, K.K., Rao, A.A., Rao, A.V.D. and Sridhar, G.R. (2010) ‘Single pass seed selection algorithm for K-Means’, Journal of Computer Science, Vol. 6, No. 1, pp.60–66, doi:10.3844/jcssp.2010.60.66.

Protocol Buffers – Google Developers (2012) [online] https://developers.google.com/protocol-buffers/ (accessed 16 November).

Redmond, S.J. and Heneghan, C. (2007) ‘A method for initialising the K-Means clustering algorithm using Kd-trees’, Pattern Recognition Letters, Vol. 28, No. 8, pp.965–973, doi:10.1016/j.patrec.2007.01.001.

Rousseeuw, P.J. (1987) ‘Silhouettes: a graphical aid to the interpretation and validation of cluster analysis’, Journal of Computational and Applied Mathematics, November, Vol. 20, pp.53–65, doi:10.1016/0377-0427(87)90125-7.

Shvachko, K., Kuang, H., Radia, S. and Chansler, R. (2010) ‘The Hadoop distributed file system’, in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp.1–10, IEEE Computer Society.

Steinbach, M., Karypis, G. and Kumar, V. (2000) ‘A comparison of document clustering techniques’, in KDD Workshop on Text Mining.

Stoffel, K. and Belkoniene, A. (1999) ‘Parallel K/h-means clustering for large data sets’, in Euro-Par’99 Parallel Processing, Lecture Notes in Computer Science, Vol. 1685, pp.1451–1454, Springer, Berlin/Heidelberg [online] http://dx.doi.org/10.1007/3-540-48311-X_205.

The Comprehensive R Archive Network (2012) [online] http://cran.r-project.org/ (accessed May 10).

TraceVersion2 – Google Cluster Data – Second Format of Cluster-Usage Traces – Traces of Google Workloads – Google Project Hosting (2012) [online] http://code.google.com/p/googleclusterdata/wiki/TraceVersion2 (accessed 20 November).

Wasif, M.K. and Narayanan, P.J. (2011) ‘Scalable clustering using multiple GPUs’, in High Performance Computing (HiPC), 2011 18th International Conference on, pp.1–10, doi:10.1109/HiPC.2011.6152713.

White, T. (2010) Hadoop: The Definitive Guide, 2nd ed., O’Reilly Media/Yahoo Press.

Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R. and Stoica, I. (2008) ‘Improving MapReduce performance in heterogeneous environments’, in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pp.29–42, USENIX Association, San Diego, California.

Zhang, Y., Xiong, Z., Mao, J. and Ou, L. (2006) ‘The study of parallel K-Means algorithm’, in Intelligent Control and Automation, WCICA, The Sixth World Congress on, Vol. 2, pp.5868–5871.

Zhao, W., Ma, H. and He, Q. (2009) ‘Parallel K-Means clustering based on MapReduce’, Lecture Notes in Computer Science, Vol. 5931, pp.674–679, Springer, Berlin/Heidelberg [online] http://dx.doi.org/10.1007/978-3-642-10665-1_71.