final report: kmeans clustering - computer scienceark/fall2014/654/team/7/report.pdf · 2014. 12....

FINAL REPORT: KMEANS CLUSTERING

SAPNA GANESH (sg1368)
VAIBHAV GANDHI (vrg5913)

Overview

Clustering is the partitioning of data points into small groups according to certain features of the points, so that points that are similar with respect to those features are grouped together. Clustering has several applications, and the growth of data has made it applicable in fields ranging from Artificial Intelligence to Economics. The KMeans algorithm partitions such data points into k clusters. It is an unsupervised learning algorithm. The user must supply k, the number of clusters. In outline, the method finds k centers in the space of data points and assigns each data point to the nearest cluster center. KMeans is one of the most popular clustering methods because its computations need only the distances between points, which are placed according to their features. Distance calculations use the Euclidean distance in its squared form:

$$d(x, p) = \lVert x - p \rVert^2 = \sum_j (x_j - p_j)^2$$

where x is a data point, p is the position of a cluster center, and j runs over the features.
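The distance above can be sketched in a few lines of Java. This is only an illustration: the class name DistanceSketch and the method name squaredDistance are hypothetical and are not taken from the project code.

```java
class DistanceSketch {
    /** Squared Euclidean distance between data point x and center p (hypothetical helper). */
    static double squaredDistance(double[] x, double[] p) {
        if (x.length != p.length) {
            throw new IllegalArgumentException("x and p must have the same number of features");
        }
        double sum = 0.0;
        for (int j = 0; j < x.length; j++) {
            double diff = x[j] - p[j];
            sum += diff * diff; // (x_j - p_j)^2
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = {0.0, 0.0};
        double[] p = {3.0, 4.0};
        System.out.println(squaredDistance(x, p)); // prints 25.0
    }
}
```

Skipping the square root does not change which center is nearest, since the square root is monotonic.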

The KMeans algorithm is as follows.

1) Initialize k centers for the clusters.
2) Assign each data point to its closest cluster center.
3) Update each center to the mean of the data points in that cluster.
4) Repeat steps 2 and 3 until the centers no longer change.
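The four steps can be sketched as a short sequential Java program. This is a minimal illustration, not the project's KMeansSeq class: the names KMeansSketch and cluster are hypothetical, step 1 is left to the caller (the initial centers are passed in), and an empty cluster simply keeps its previous center, which is an assumption of this sketch.

```java
import java.util.Arrays;

class KMeansSketch {
    /** Runs steps 2 to 4 on the given initial centers (updated in place); returns the assignment. */
    static int[] cluster(double[][] points, double[][] centers, int maxIterations) {
        int n = points.length, k = centers.length, dim = points[0].length;
        int[] assignment = new int[n];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Step 2: assign each point to its closest center.
            for (int i = 0; i < n; i++) {
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double d = 0.0;
                    for (int j = 0; j < dim; j++) {
                        double diff = points[i][j] - centers[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; assignment[i] = c; }
                }
            }
            // Step 3: move each center to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dim; j++) sums[assignment[i]][j] += points[i][j];
            }
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // assumption: an empty cluster keeps its old center
                for (int j = 0; j < dim; j++) {
                    double mean = sums[c][j] / counts[c];
                    if (mean != centers[c][j]) changed = true;
                    centers[c][j] = mean;
                }
            }
            // Step 4: stop when no center moved.
            if (!changed) break;
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
        double[][] centers = {{1, 1}, {8, 8}}; // step 1 done by hand for the example
        int[] assignment = cluster(points, centers, 10);
        System.out.println(Arrays.toString(assignment));  // prints [0, 0, 1, 1]
        System.out.println(Arrays.deepToString(centers)); // prints [[1.0, 1.5], [8.5, 8.0]]
    }
}
```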

The final output depends heavily on the initialization of the cluster centers, so the algorithm does not necessarily give the same result every time it is run. It converges to a local optimum (a local minimum of the within-cluster sum of squared distances), which need not be the global one.

Computational Problem

The traditional approach is to develop a simple program that runs sequentially, but this has several problems. Current computers have extremely fast multicore processors but are limited in memory, so a large dataset would have to be partitioned and processed piece by piece, which increases the running time significantly. The KMeans algorithm contains several computations that are independent of each other, and running them sequentially takes a long time. The most intensive computation is the calculation of the distances between the points and the cluster centers. This can be implemented in parallel, as every such distance is independent of the others.
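Because each point's nearest center can be found independently, the assignment step maps directly onto a parallel loop. The sketch below uses the standard java.util.stream parallel streams as a stand-in, not the Parallel Java 2 parallelFor used in the project; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class ParallelAssignSketch {
    /** Index of the center closest to point x (squared Euclidean distance). */
    static int nearest(double[] x, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double d = 0.0;
            for (int j = 0; j < x.length; j++) {
                double diff = x[j] - centers[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    /** Computes the nearest center of every point; iterations are independent and run in parallel. */
    static int[] assign(double[][] points, double[][] centers) {
        return IntStream.range(0, points.length)
                .parallel()                          // split the index range across worker threads
                .map(i -> nearest(points[i], centers))
                .toArray();                          // results keep the original point order
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
        double[][] centers = {{1, 1.5}, {8.5, 8}};
        System.out.println(Arrays.toString(assign(points, centers))); // prints [0, 0, 1, 1]
    }
}
```

Each index is handled by exactly one thread here, so the points are divided among the threads rather than every thread processing all of them.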

After the distance calculation, each center needs to be updated to the mean of the points in its group. This needs to be done sequentially, as the computation involves the relationship of all the data points.

Analysis of Research Papers

Research Paper 1 [1]

The KMeans algorithm is highly sensitive to the initial placement of the cluster centers, so their initialization is an important factor. This paper discusses several initialization approaches, grouped by time complexity into linear time methods and log-linear time (O(n log n)) methods.

Some of the linear time initialization methods:
1. Forgy's method: Assign each point to one of the k clusters uniformly at random. The centers are the centroids of these initial clusters.
2. Jancey's method: Assign to each cluster a synthetic point in the data space. Empty clusters are a problem.
3. MacQueen's method: Either take the first k points as the centers (which makes the result depend on the data order) or choose k points at random from the data. With random selection, points from high-density regions are more likely to be chosen.

Some of the log-linear time (O(n log n)) initialization methods:
1. Hartigan's method: Sorts the points; the i'th center is the (1 + (i − 1)N/K)th point. It is invariant to data ordering and gives well separated centers.
2. Al-Daoud's variance-based method: Sort the points on the attribute with the greatest variance, then partition them into K groups along the same dimension.
3. The ROBIN method: Avoids outliers.

The paper convinces us that many of the linear time methods do not perform well, but the performance of every method depends on the dataset.

Research Paper 2 [2]

When data from the dataset is distributed to the different workers, some workers might get data that takes more processing time than others. This paper addresses such load balancing using the Master-Worker approach. There are three parallel strategies: disk parallel, task parallel, and both data and disk parallel. The Master reads the dataset from the file and selects the initial centers. A Slave executes the clustering operation on the data it receives and returns the clustering results to the Master. The Master then partitions a new sub-dataset and sends it to the Slave. This continues until the Master has no more data.

Load Balancing: If the data division is static, data deflection may be produced and some processors will be idle. So, after a Slave finishes computing on its assigned sub-dataset, it requests the next sub-dataset of the same size from the Master, until there is no more data to be allocated. This balances the load better.

Research Paper 3 [3]

This paper identifies the sequential and parallel parts of the algorithm and explains how Map Reduce is used to solve the KMeans problem on a parallel cluster. The calculation of the distances from the centroids is the parallel part, and updating the centroids is the sequential part. The framework is explained in detail below.

1) map (key, value)
Input: the global variable centers, the offset key, and the sample value.
Output: a <key′, value′> pair, where key′ is the index of the closest center point and value′ is a string comprising the sample information.

2) combine (key, V)
Input: key is the index of the cluster; V is the list of the samples assigned to the same cluster.
Output: a <key′, value′> pair, where key′ is the index of the cluster and value′ is a string comprising the sum of the samples in the same cluster and the number of samples.

3) reduce (key, V)
Input: key is the index of the cluster; V is the list of the partial sums from different hosts.
Output: a <key′, value′> pair, where key′ is the index of the cluster and value′ is a string representing the new center.
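The three functions can be imitated in plain Java to show how one KMeans iteration flows through them. This sketch is a single-process simulation, not a use of any real Map Reduce library: the class MapReduceKMeansSketch, the Partial type and the iterate driver are all hypothetical, values are kept as objects rather than strings, and combine calls map directly instead of receiving already-grouped samples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapReduceKMeansSketch {
    /** Partial result for one cluster: coordinate sums plus the number of samples. */
    static final class Partial {
        final double[] sum;
        int count;
        Partial(int dim) { sum = new double[dim]; }
        void add(double[] s) { for (int j = 0; j < sum.length; j++) sum[j] += s[j]; count++; }
        void merge(Partial o) { for (int j = 0; j < sum.length; j++) sum[j] += o.sum[j]; count += o.count; }
    }

    /** map: returns key', the index of the center closest to the sample. */
    static int map(double[] sample, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double d = 0.0;
            for (int j = 0; j < sample.length; j++) {
                double diff = sample[j] - centers[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    /** combine: on one host, sums the samples of each cluster and counts them. */
    static Map<Integer, Partial> combine(List<double[]> hostSamples, double[][] centers) {
        Map<Integer, Partial> partials = new HashMap<>();
        for (double[] s : hostSamples) {
            partials.computeIfAbsent(map(s, centers), c -> new Partial(s.length)).add(s);
        }
        return partials;
    }

    /** reduce: merges the partial sums of one cluster from all hosts into the new center. */
    static double[] reduce(List<Partial> parts) {
        Partial total = new Partial(parts.get(0).sum.length);
        for (Partial p : parts) total.merge(p);
        double[] center = new double[total.sum.length];
        for (int j = 0; j < center.length; j++) center[j] = total.sum[j] / total.count;
        return center;
    }

    /** Hypothetical driver: one iteration over the samples held by several hosts. */
    static double[][] iterate(List<List<double[]>> hosts, double[][] centers) {
        Map<Integer, List<Partial>> grouped = new HashMap<>();
        for (List<double[]> host : hosts) {
            combine(host, centers).forEach(
                (key, partial) -> grouped.computeIfAbsent(key, c -> new ArrayList<>()).add(partial));
        }
        double[][] updated = new double[centers.length][];
        for (int c = 0; c < centers.length; c++) {
            List<Partial> parts = grouped.get(c);
            updated[c] = (parts == null) ? centers[c].clone() : reduce(parts); // empty cluster keeps its center
        }
        return updated;
    }

    public static void main(String[] args) {
        List<List<double[]>> hosts = Arrays.asList(
            Arrays.asList(new double[]{1, 1}, new double[]{8, 8}),
            Arrays.asList(new double[]{1, 2}, new double[]{9, 8}));
        double[][] centers = {{1, 1}, {8, 8}};
        System.out.println(Arrays.deepToString(iterate(hosts, centers))); // prints [[1.0, 1.5], [8.5, 8.0]]
    }
}
```

Running KMeans to convergence means calling iterate repeatedly, feeding each result back in as the next set of centers.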

This paper also illustrates the performance in terms of SpeedUp, ScaleUp and SizeUp.

Usage Details

DEVELOPERS MANUAL

To compile the code, run the following lines in the folder containing the java files. The program can be run on the nessie, kraken or champ computers, as we run on more than 4 cores.

Build path for Parallel Java 2 Library:
$ export CLASSPATH=.:/var/tmp/parajava/pj2/pj2.jar
$ export PATH=/usr/local/dcs/versions/jdk1.7.0_11_x64/bin:$PATH

Compile the software:
$ javac *.java

Refer to the Users Manual to run the program.

USERS MANUAL

Sequential Program:
$ java pj2 KMeansSeq <number of clusters> <number of iterations> <number of lines from dataset> <columns> <input filename.csv>

Parallel Program:
$ java pj2 threads=<threads> debug=makespan schedule=dynamic KMeansSmp <number of clusters> <number of iterations> <number of inputs from the dataset> <input filename.csv>

For reference, use /var/tmp/vrg5913/dataset.csv on nessie.

Design and Operation of Sequential Program

The general design of the sequential KMeans program is as follows.

The program takes the number of clusters and the number of iterations among other arguments. The number of lines and the number of columns are also specified in the arguments.

The program initializes a KMeans object. The constructor sets up the data points and the number of clusters. The data is read from the CSV file by the filereader function, which calls the stringToDouble function and returns an ArrayList of data. The stringToDouble function converts the data, which is read as strings, into the double datatype so that computations can be performed. The clustering function performs the KMeans clustering: it iteratively updates the center positions and the assignment of the data points, and it also prints the output. The mean function calculates the mean of the given numbers and the EuclideanDistance function calculates the distances.

Design and Operation of Parallel Program

We implemented the Map Reduce framework, but it does not work for this problem because it is not meant to be used iteratively; that is, the output of the first iteration cannot be used as the input to the second iteration. The master-worker configuration had similar issues, and in that program we also had several problems splitting the dataset to distribute to the clusters. The data had to be read every time such an operation took place, which in Map Reduce was handled by the library. We settled for a single-node multicore program.

The main method takes in the arguments, reads the data and converts it into an array list. It also initializes the clusters. It then performs the given number of iterations, which is the sequential part of the program. Inside this iterative loop, we run the parallelFor method to compute the distances for the new clusters. At the end of the parallelFor loop, the new clusters are updated. The stringToDouble function converts the data, which is read as strings, into the double datatype so that computations can be performed. computeDist calculates the distances. We also use a reduction variable, myVbl (which extends Vbl from the pj2 library); it has two data members, an array and a counter. This variable takes care of reducing the means calculated by the parallelFor.

Strong Scaling

The code was run on 30,000 to 150,000 lines of input, on 1 through 8 cores (except for 30,000 lines, where we ran on 1 through 16 cores). The results are given below. We notice that the efficiency falls sharply after 3 or 4 cores. The data is tabulated below. Below every table is its graph.

DataSize  Cores  Time (msec)  Speed Up  Efficiency
30,000    Seq    50569
          1      50560        1.0002    1.0002
          2      49413        1.0234    0.5117
          3      28092        1.8001    0.6000
          4      34983        1.4455    0.3614
          5      41674        1.2134    0.2427
          6      60051        0.8421    0.1404
          7      65133        0.7764    0.1109
          8      54531        0.9273    0.1159
          9      56795        0.8904    0.0989
          12     56724        0.8915    0.0743
          16     66771        0.7573    0.0473

          1      38140        1.0001    1.0001
          2      27105        1.4072    0.7036
          3      29666        1.2857    0.4286
          4      37423        1.0192    0.2548
          5      52847        0.7218    0.1444
          6      92988        0.4102    0.0684
          7      94192        0.4049    0.0578
          8      99498        0.3834    0.0479

          1      58992        0.9985    0.9985
          2      40481        1.4550    0.7275
          3      36682        1.6057    0.5352
          4      53954        1.0917    0.2729
          5      75256        0.7827    0.1565
          6      161830       0.3640    0.0607
          7      166954       0.3528    0.0504
          8      164032       0.3591    0.0449

          1      79311        1.0001    1.0001
          2      52312        1.5163    0.7582
          3      53472        1.4834    0.4945
          4      76327        1.0392    0.2598
          5      135581       0.5850    0.1170
          6      210974       0.3760    0.0627
          7      209594       0.3785    0.0541
          8      207765       0.3818    0.0477

          1      115145       1.0001    1.0001
          2      77639        1.4832    0.7416
          3      81361        1.4153    0.4718
          4      112380       1.0247    0.2562
          5      183336       0.6281    0.1256
          6      195223       0.5899    0.0983
          7      224344       0.5133    0.0733
          8      275667       0.4177    0.0522
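The Speed Up and Efficiency columns are consistent with the usual definitions, speedup = T_seq / T_par(K) and efficiency = speedup / K, where K is the number of cores. The small Java sketch below recomputes one row of the 30,000-line block as a check; the class and method names are hypothetical.

```java
import java.util.Locale;

class StrongScalingSketch {
    /** Speedup: sequential running time divided by parallel running time. */
    static double speedup(double seqMillis, double parMillis) {
        return seqMillis / parMillis;
    }

    /** Efficiency: speedup divided by the number of cores. */
    static double efficiency(double speedup, int cores) {
        return speedup / cores;
    }

    public static void main(String[] args) {
        // 30,000 lines: sequential 50569 msec, 2 cores 49413 msec (values from the table).
        double s = speedup(50569, 49413);
        double e = efficiency(s, 2);
        System.out.println(String.format(Locale.ROOT, "%.4f %.4f", s, e)); // prints 1.0234 0.5117
    }
}
```

An efficiency of 1.0 would mean that K cores finish K times faster than the sequential program; values well below 1.0, as in most rows above, mean that the added cores are mostly wasted.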

Issues with Strong Scaling

The KMeans algorithm requires the center positions to be updated in every iteration, so all the threads must complete execution before a new iteration can begin. Essentially, the parallel part of the program runs within the sequential loop. This may cause some loss in efficiency. We timed the sequential part, the parallel part, and the total run time of the program for several inputs. However, the sequential part of the program does not take as much time as expected; it is not the outer loop that causes the major delay in execution. As we add more threads, the run time and the total time are similar. The time required for the completion of each thread is added to the total run time. This is because the data is not divided among the threads, so every thread does all the computation. We conclude that the parallel loop in the program causes the lack of efficiency. However, the loop is implemented as per the KMeans requirements. We intend to work on it in the future.

Weak scaling

The code was run for 5 different cases, 10,000, 15,000, 20,000, 25,000 and 30,000 lines of input, on 1 through 8 cores. As shown in the tables, every input size is multiplied by 1 to 8 for 1 to 8 cores respectively. The results are given below. We notice that the efficiency falls sharply after 3 or 4 cores. The data is tabulated below. Below every table is its graph.

DataSize  Multiplier  Cores  Time (msec)  Size Up  Efficiency
10,000                Seq    8120
          1           1      8112         1.0010   1.0010
          2           2      10449        1.5542   0.7771
          3           3      15492        1.5724   0.5241
          4           4      29590        1.0977   0.2744
          5           5      86413        0.4698   0.0940
          6           6      123693       0.3939   0.0656
          7           7      162655       0.3495   0.0499
          8           8      180499       0.3599   0.0450

15,000
          1           1      26624        1.0005   1.0005
          2           2      45156        1.1797   0.5899
          3           3      77436        1.0319   0.3440
          4           4      86306        1.2345   0.3086
          5           5      104335       1.2765   0.2553
          6           6      170205       0.9390   0.1565
          7           7      228754       0.8151   0.1164
          8           8      286553       0.7436   0.0930

20,000                Seq    32267
          1           1      32349        0.9975   0.9975
          2           2      66112        0.9761   0.4881
          3           3      60849        1.5908   0.5303
          4           4      95229        1.3553   0.3388
          5           5      129819       1.2428   0.2486
          6           6      235394       0.8225   0.1371
          7           7      230772       0.9788   0.1398
          8           8      242992       1.0623   0.1328

25,000
          1           1      42079        0.9988   0.9988
          2           2      76689        1.0961   0.5481
          3           3      78320        1.6099   0.5366
          4           4      122446       1.3730   0.3433
          5           5      291197       0.7217   0.1443
          6           6      291040       0.8665   0.1444
          7           7      297937       0.9875   0.1411
          8           8      317733       1.0582   0.1323

30,000
          1           1      38140        1.0001   1.0001
          2           2      52379        1.4564   0.7282
          3           3      85259        1.3421   0.4474
          4           4      146313       1.0428   0.2607
          5           5      254869       0.7483   0.1497
          6           6      612172       0.3738   0.0623
          7           7      756196       0.3531   0.0504
          8           8      740313       0.4122   0.0515
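The Size Up column in the rows that have a sequential time agrees with sizeup = N × T_seq / T_par(N, K), where N is the data size multiplier, T_seq is the sequential time on the base data size and T_par(N, K) is the time on K cores with N times the data; efficiency is again sizeup / K. The sketch below, with hypothetical names, recomputes one row of the 10,000-line block.

```java
import java.util.Locale;

class WeakScalingSketch {
    /** Sizeup: multiplier times the base sequential time, divided by the parallel time on the scaled input. */
    static double sizeup(int multiplier, double seqMillisBase, double parMillisScaled) {
        return multiplier * seqMillisBase / parMillisScaled;
    }

    public static void main(String[] args) {
        // 10,000 lines: sequential 8120 msec; multiplier 2 on 2 cores took 10449 msec (values from the table).
        double s = sizeup(2, 8120, 10449);
        double e = s / 2; // efficiency = sizeup / cores
        System.out.println(String.format(Locale.ROOT, "%.4f %.4f", s, e)); // prints 1.5542 0.7771
    }
}
```

With ideal weak scaling the running time stays constant as data and cores grow together, so sizeup would equal K and the efficiency would stay at 1.0.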

Issues with weak scaling

As the data size is multiplied, we assumed that each thread would take a balanced share of the lines of input. As mentioned in the strong scaling issues, however, every thread performs the reading and the computations on all the data. Since the load is not being balanced correctly, the weak scaling efficiency is poor.

Future Work

In future work, we intend to increase the efficiency of the program that is already implemented. We also intend to extend the current program from a single-node multiprocessor program to a cluster program.

Lessons Learned

We primarily learnt the complete working of the KMeans algorithm. We also learnt the implementation of the Map Reduce framework and its correct usage. Our implementation of the master-worker configuration was unsuccessful, but we came to understand several subtle errors that were taking place. We also learnt some different initialization methods for finding the initial cluster positions in the KMeans algorithm. We explored the Parallel Java 2 library, especially the Map Reduce framework.

Statement on contributions of the team members

Sapna Ganesh: Summarized the first two papers, implemented a part of the sequential program and performed debugging, implemented the master-worker configuration and implemented a part of the multicore program. Ran the program for weak scaling and obtained results.

Vaibhav Gandhi: Summarized the third paper, implemented a part of the sequential program, aided in developing the master-worker configuration, implemented the Map Reduce framework and performed debugging on it, and implemented a part of the multicore program. Ran the program for strong scaling and obtained results.

References

[1] Title: A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm
Authors: M. Emre Celebi, Hassan A. Kingravi and Patricio A. Vela
Journal: Expert Systems with Applications, 40(1):200–210, 2013

[2] Title: The Study of Parallel K-Means Algorithm
Authors: Yufang Zhang, Zhongyang Xiong, Jiali Mao and Ling Ou
Conference: Proceedings of the 6th World Congress on Intelligent Control and Automation, June 21–23, 2006, Dalian, China

[3] Title: Parallel K-Means Clustering Based on MapReduce
Authors: Weizhong Zhao, Huifang Ma, and Qing He
Conference: International Conference on Cloud Computing Technology and Science (CloudCom), pp. 674–679, 2009
