
Performance Analysis of Fuzzy C-Means Algorithm

LIST OF ABBREVIATIONS

Abbreviation  Description

1. FCM        Fuzzy C-Means
2. MR         Map-Reduce
3. DBMS       Database Management System
4. HDFS       Hadoop Distributed File System
5. RDD        Resilient Distributed Datasets
6. HKM        Hadoop-Based K-Means
7. HFCM       Hadoop-Based Fuzzy C-Means
8. GoogleFS   Google File System


1. Introduction

Clustering is the unsupervised classification of patterns. The clustering problem arises in many research areas and is especially important in the field of data mining. Clustering covers many application areas, such as image segmentation, object recognition, and information retrieval, and falls under the umbrella of data analytics. It is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems there is little prior information (e.g., statistical models) available about the data, and the decision-maker must make as few assumptions about the data as possible. It is under these restrictions that clustering methodology is particularly appropriate for exploring the interrelationships among the data points to make an assessment (perhaps preliminary) of their structure [Jain, 1999].

Figure 1: Stages of Clustering

The aim of data clustering is to find structure in data; it is therefore exploratory in nature. This report reviews well-known clustering methods, discusses the major challenges and key issues in designing clustering algorithms, and points out some emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering.


Data clustering has been used for the following three main purposes.

1. Underlying structure: to gain insight into data, generate hypotheses, detect anomalies, and identify salient features.

2. Natural classification: to identify the degree of similarity among forms or organisms (phylogenetic relationship).

3. Compression: as a method for organizing the data and summarizing it through cluster prototypes. [2]

Defining big data requires considering many aspects; here we define it as an amount of data, structured or unstructured, that cannot be processed by today's Database Management Systems (DBMS) or any related tools. The question then arises of how to handle such a huge amount of data, when the market demands that this data be processed within a fraction of a second. DBMS, however, are unable to fulfill this requirement [3].

One solution to the above problem is data clustering. Data clustering identifies patterns among the data and then models the unstructured data. Modeling leads to a vectored representation of the data, which is useful for similarity matching; according to this similarity, very similar data are grouped together. Grouping similar data eases searching and data retrieval. The other solution is to use big data processing tools such as Hadoop's MR framework, based on the Hadoop Distributed File System (HDFS). Clustering big data with big data processing tools solves the above problem. Hence, we have carried out a detailed survey of data clustering algorithms on big data and studied how they work.

Hard clustering algorithms assign each pattern to one and only one cluster during their operation and in their output. A fuzzy clustering method instead assigns degrees of membership in several clusters to each input pattern, placing points in many clusters. In what follows we concentrate on fuzzy data clustering algorithms such as Fuzzy C-means, which is similar to K-means with slight variations [4].


1.1. Fuzzy C-Means

The idea of fuzzy membership, i.e., placing each input pattern in many clusters, was introduced by Bezdek in 1984. In the FCM method, membership cardinality ranges from 1 to N: each pattern in the dataset may belong to anywhere from one cluster up to N clusters. FCM is also considered an advanced version of the K-means algorithm. In FCM, data are bound to each cluster by means of a membership function, which represents the fuzzy behavior of the algorithm.

Looking at Figure 2, we may identify two clusters in the proximity of the two data concentrations; we will refer to them as 'A' and 'B'. In the first approach, the k-means algorithm, each data pattern is associated with a specific centroid; therefore, the membership function looks like this:

Figure 2: Clustering Algorithm Behavior

In the figure above, the data pattern shown as a red spot belongs more to cluster B than to cluster A. The value m = 0.2 indicates its degree of membership in A. Now, instead of using a graphical representation, we introduce a matrix U whose entries are taken from the membership functions:

U = | 1  0 |
    | 0  1 |     Matrix for K-means


U = | 0.8  0.2 |
    | 0.3  0.7 |     Matrix for Fuzzy C-Means
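The entries of U can be computed from a point's distances to the cluster centers. The following is a minimal sketch in plain Python of the standard FCM membership update with fuzzifier m = 2; the two centers and the sample point are illustrative, not taken from the report:

```python
def fcm_memberships(point, centers, m=2.0):
    """Degree of membership of `point` in each cluster (the memberships sum to 1)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
             for center in centers]
    # If the point coincides with a center, it belongs to that cluster completely.
    for j, d in enumerate(dists):
        if d == 0.0:
            return [1.0 if k == j else 0.0 for k in range(len(centers))]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((dists[j] / dists[k]) ** exp for k in range(len(centers)))
            for j in range(len(centers))]

# Two cluster centers A and B; a point close to B gets a small membership in A.
A, B = (0.0, 0.0), (10.0, 0.0)
u = fcm_memberships((8.0, 0.0), [A, B])
print(u)  # memberships sum to 1; u[1] (cluster B) dominates
```

With m close to 1 the memberships approach the hard 0/1 assignment of K-means; larger m makes the clusters fuzzier.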

1.2. Hadoop

Hadoop is a collection of related subprojects that fall under the umbrella of

infrastructure for distributed computing. These projects are hosted by the Apache

Software Foundation, which provides support for a community of open source software

projects. Although Hadoop is best known for Map-Reduce and its distributed file system

(HDFS, renamed from NDFS), the other subprojects provide complementary services, or

build on the core to add higher-level abstractions. The subprojects, and where they sit in

the technology stack, are shown in the figure and described briefly here:

1. Core

A set of components and interfaces for distributed file systems and general I/O

(serialization, Java RPC, persistent data structures).

2. Avro

A data serialization system for efficient, cross-language RPC, and persistent data

storage. (At the time of this writing, Avro had been created only as a new subproject, and

no other Hadoop subprojects were using it yet.)

3. Map-Reduce

A distributed data processing model and execution environment that runs on large

clusters of commodity machines.

4. HDFS

A distributed file system that runs on large clusters of commodity machines.


5. Pig

A data flow language and execution environment for exploring very large datasets.

Pig runs on HDFS and Map-Reduce clusters.

6. HBase

A distributed, column-oriented database. HBase uses HDFS for its underlying

storage, and supports both batch-style computations using Map-Reduce and point queries

(random reads).

7. ZooKeeper

A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.

8. Hive

A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine to Map-Reduce jobs) for querying the data.

9. Chukwa

A distributed data collection and analysis system. Chukwa runs collectors that store

data in HDFS, and it uses Map-Reduce to produce reports. (At the time of this writing,

Chukwa had only recently graduated from a “contrib” module in Core to its own

subproject.) [Hadoop: The Definitive Guide]

1.3. Hadoop (Map-Reduce Framework)

Google implemented a highly scalable and easily adaptable processing and storage architecture, centered around the 'map-reduce' paradigm borrowed from functional programming languages, and GoogleFS, a fault-tolerant distributed filesystem. Map-Reduce is an abstraction for large-scale computation: a simple programming model that applies to many large-scale computing problems.

The Map-Reduce runtime library hides the messy details:

1. Automatic parallelization

2. Load balancing

3. Network and disk transfer optimization

4. Handling of machine failures

5. Robustness

Map: extract something you care about from each record

Reduce: aggregate, summarize, filter, or transform
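The two roles above can be illustrated with the canonical word-count example, simulated here in plain Python. The shuffle step, which groups the mapper's intermediate (key, value) pairs by key, is modeled with a dictionary; this is a sketch of the programming model, not Hadoop's actual API:

```python
from collections import defaultdict

def map_fn(record):
    """Map: extract (word, 1) pairs from each input record."""
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    """Reduce: aggregate the counts collected for one key."""
    return (key, sum(values))

def map_reduce(records):
    shuffled = defaultdict(list)          # shuffle: group intermediate pairs by key
    for record in records:
        for key, value in map_fn(record):
            shuffled[key].append(value)
    return dict(reduce_fn(k, v) for k, v in shuffled.items())

print(map_reduce(["big data big clusters", "big data"]))
# {'big': 3, 'data': 2, 'clusters': 1}
```

In a real Hadoop job the map calls run in parallel on different splits of the input and the framework performs the shuffle over the network; the logic per function is the same.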

1.4. SPARK

Map-Reduce and its variants have been highly successful in implementing large-scale

data-intensive applications on commodity clusters. However, most of these systems are

built around an acyclic data flow model that is not suitable for other popular applications.

This includes many iterative machine learning algorithms, as well as interactive data

analysis tools. Spark introduces an abstraction called “Resilient Distributed Datasets”

(RDDs). An RDD is a read-only collection of objects partitioned across a set of machines

that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative

machine learning jobs [8].

1.5. Applications

Big data clustering has typically been used to provide a fast solution to the data heterogeneity problem: heterogeneous data can be converted into clusters of similar data gathered at a single place. Data clustering is a computation-intensive task; indeed, it is NP-hard even for two clusters. Here, speedup is achieved over an on-the-fly network of distributed systems, which may have different processing capacities or different platforms.


Fuzzy C-means produces overlapping results on the same dataset. Text-based clustering is a very computation-intensive task, but it is required in many areas of day-to-day life [9].

Table 1: Data clustering applications [10]

Application               What is clustered         Benefit
Search result clustering  Search results            More effective information presentation to the user
Scatter-Gather            (Subsets of) collection   Alternative user interface: "search without typing"
Collection clustering     Collection                Effective information presentation for exploratory browsing
Language modeling         Collection                Increased precision and/or recall
Cluster-based retrieval   Collection                Higher efficiency: faster search

1.6. Proposed Work

1. Problem Statement

Design and implementation of the Fuzzy C-means clustering algorithm using Map-Reduce.

2. Significance

Fuzzy C-means has a wide variety of applications in image-processing-based clustering, and many text-based applications can also be solved with it. Evaluating the performance of Fuzzy C-means on text-based clustering will provide a new platform for fuzzy clustering algorithms on distributed computing. The reason for choosing distributed computing is that it allows on-the-fly network formation: in a distributed computing environment, heterogeneous systems can come under one roof and increase the processing capacity. This increase in processing capacity leads to solving the data clustering problem in less time. Text-based clustering has many applications, such as finding a good way to distribute a big graph [11].

3. Objectives

1. Design of a processing model of the C-means algorithm for Map-Reduce

2. Implementation of the C-means algorithm on Map-Reduce

3. Testing and performance analysis of the above algorithm with big data on Map-Reduce

4. Comparison of C-means with other equivalent works

1.7. Organization of Dissertation Report

Chapter 2 gives a literature survey of the evolution of data clustering algorithms on big data, with a table of applications of clustering using big data. Chapter 3 presents the methodology adopted for the dissertation work, including flow charts and pseudo code for the proposed algorithms. Chapter 4 gives the implementation details in depth. Chapter 5 is dedicated to the test data, hardware configuration specifications, and experimental results; the results for classical K-means, Map-Reduce-based K-means, classical FCM, and Map-Reduce-based FCM, with the related speedup graphs, are presented in this chapter. Chapter 6 gives the conclusion and exhibits the future scope, ending the dissertation report.


2. Literature Review

2.1. Background

In the field of data mining, K-means is the most popular clustering algorithm because of its simplicity. It clusters similar kinds of data based on a similarity criterion. K-means is NP-hard even for two clusters. K-means uses the Euclidean distance for calculating the distance between two points lying on the same plane, and aims at minimizing an objective function.
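That objective function is the sum of squared Euclidean distances from each point to its assigned center. A minimal single-iteration (Lloyd step) sketch in plain Python, with an illustrative toy dataset:

```python
def euclidean(a, b):
    """Euclidean distance between two points on the same plane."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans_step(points, centers):
    """One Lloyd iteration: assign each point to its nearest center,
    then recompute each center as the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
        clusters[nearest].append(p)
    new_centers = []
    for j, members in enumerate(clusters):
        if not members:                  # keep an empty cluster's old center
            new_centers.append(centers[j])
            continue
        new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centers

def objective(points, centers):
    """Sum of squared distances to the nearest center: what K-means minimizes."""
    return sum(min(euclidean(p, c) ** 2 for c in centers) for p in points)

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
c0 = [(0, 0), (10, 0)]
c1 = kmeans_step(pts, c0)
print(c1, objective(pts, c1))  # centers move to the cluster means; the objective drops
```

Each iteration can only decrease (or keep) the objective, which is why the algorithm converges, although possibly to a local minimum.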

2.2. Clustering Taxonomy Based on Similarity Measure [12]

1. Agglomerative vs. divisive: This taxonomy relates to algorithmic structure and operation. An agglomerative approach begins with each pattern in a separate (singleton) cluster and successively merges clusters together until a stopping criterion is met. A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met [13][14].

2. Hard vs. fuzzy: A hard clustering algorithm assigns each pattern to one and only one cluster during its operation and in its output. A fuzzy clustering method assigns degrees of membership in several clusters to each input pattern, placing points in many clusters [15][16][17][18].

3. Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm [19].

4. Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared-error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.


2.3. Fuzzy C-means Algorithm (FCM)

FCM was developed from basic K-means, as a variation of the K-means algorithm, to achieve accuracy, ease of implementation, and speed-up. In 1974, Dunn proposed a fuzzy version of K-means as a fuzzy version of the optimal least-squared-error partitioning problem [20].

Robert, Jitendra, and Bezdek (1986) [23] experimented with two versions of the FCM clustering algorithm, including an approximate fuzzy c-means (AFCM) implementation based on replacing the necessary "exact" variates in the FCM equations with integer-valued or real-valued estimates. This approximation enables AFCM to exploit a lookup-table approach for computing Euclidean distances and for exponentiation. The net effect of the proposed implementation is that CPU time per iteration is reduced to approximately one sixth of that required for a literal implementation of the algorithm, while apparently preserving the overall quality of the terminal clusters produced. The two implementations were tested numerically on a nine-band digital image. The authors also proposed a table-driven literal fuzzy C-means algorithm (LFCM).

Bezdek, Ehrlich, and Full (1984) [24] implemented a FORTRAN-IV coding of the fuzzy c-means (FCM) clustering program. The FCM program is applicable to a wide variety of geostatistical data analysis problems. It generates fuzzy partitions and prototypes for any set of numerical data; these partitions are useful for corroborating known substructures or suggesting substructure in unexplored data. The clustering criterion used to aggregate subsets is a generalized least-squares objective function. Features of the program include a choice of three norms (Euclidean, diagonal, or Mahalanobis), an adjustable weighting factor that essentially controls sensitivity to noise, acceptance of variable numbers of clusters, and outputs that include several measures of cluster validity.
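The three norms mentioned above differ only in the weight matrix A of the quadratic form d^2(x, v) = (x - v)^T A (x - v): A = I gives the Euclidean norm, A = diag(1/variance) the diagonal norm, and A = the inverse covariance matrix the Mahalanobis norm. A small NumPy sketch with illustrative sample data:

```python
import numpy as np

def induced_distance_sq(x, v, A):
    """Squared distance in the norm induced by the weight matrix A."""
    d = np.asarray(x, float) - np.asarray(v, float)
    return float(d @ A @ d)

data = np.array([[0.0, 0.0], [2.0, 1.0], [4.0, 1.0], [6.0, 4.0]])
cov = np.cov(data, rowvar=False)

A_euclidean = np.eye(2)                      # identity: ordinary Euclidean norm
A_diagonal = np.diag(1.0 / np.diag(cov))     # rescale each axis by its variance
A_mahalanobis = np.linalg.inv(cov)           # full covariance correction

x, v = data[3], data.mean(axis=0)
for name, A in [("Euclidean", A_euclidean),
                ("Diagonal", A_diagonal),
                ("Mahalanobis", A_mahalanobis)]:
    print(name, induced_distance_sq(x, v, A))
```

The diagonal and Mahalanobis norms matter when the features have very different scales or are correlated, which is common in geostatistical data.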

Alexandre and Leandro (2011) [25] proposed an FCM implementation based on the particle swarm optimization technique, named Fuzzy Particle Swarm Clustering (FPSC). This algorithm is an extension of the crisp particle swarm clustering (PSC) algorithm. The main structural changes made to the original PSC algorithm to design FPSC occur in the selection and evaluation steps of the winner particle, comparing the degree of membership of each object from the database in relation to the particles in the swarm.

Nikhil, Kuhu, James, and Bezdek (2005) [26] observed drawbacks of FPCM (Fuzzy Possibilistic C-means) and proposed a new algorithm based on fuzzy C-means, named Possibilistic Fuzzy C-means (PFCM) clustering. FPCM generates both membership and typicality values when clustering unlabeled data, and constrains the typicality values so that the sum of typicalities to a cluster over all data points is one. This row sum constraint produces unrealistic typicality values for large data sets. PFCM produces memberships and possibilities simultaneously, along with the usual point prototypes or cluster centers for each cluster. PFCM is a hybridization of possibilistic c-means (PCM) and fuzzy c-means (FCM) that often avoids various problems of PCM, FCM, and FPCM: it solves the noise sensitivity defect of FCM, overcomes the coincident clusters problem of PCM, and eliminates the row sum constraints of FPCM.

Andrew and Khaled (2013) [27] experimented with sentence-based clustering of data. Sentences are at the heart of documents and of any kind of communication, and may belong to more than one theme or topic present within a document or set of documents. However, because most sentence similarity measures do not represent sentences in a common metric space, conventional fuzzy clustering approaches based on prototypes or mixtures of Gaussians are generally not applicable to sentence clustering. The authors propose a novel fuzzy clustering algorithm that operates on relational input data, i.e., data in the form of a square matrix of pairwise similarities between data objects. The algorithm uses a graph representation of the data and operates in an Expectation-Maximization framework in which the graph centrality of an object is interpreted as a likelihood. Results of applying the algorithm to sentence clustering tasks demonstrate that it is capable of identifying overlapping clusters of semantically related sentences, and that it is therefore of potential use in a variety of text mining tasks.


Suganya and Shanthi (2012) [28] reviewed three different implementations of the Fuzzy C-means algorithm, with the objectives and the advantages and disadvantages of each method presented in tabular format. Table 2 summarizes several variations of Fuzzy C-Means, together with the advantages and disadvantages of each variation.

Table 2: Various Fuzzy C-means Implementations

Algorithm Advantages DisadvantagesFuzzy C-Means Algorithm [20]

- Unsupervised- Converges

-Long computational time-Sensitivity to the initial guess (speed, local minima)- Sensitivity to noise and One expects low (or even no) membership degree for outliers (noisy points)

Possibilistic C-Means (PCM) [26]

- Clustering noisy data samples

- Very sensitive to good initialization- Coincident clusters may result Because the columns and rows of the typicality matrix are independent of each other

Fuzzy Possibilistic C Means Algorithm(FPCM) [26]

- Ignores the noise sensitivity deficiency of FCM- Overcomes the coincident clusters problem of PCM

- The row sum constraints must be equal to one

Possibilistic Fuzzy C-Means Algorithm(PFCM) [26]

- Ignores the noise sensitivity deficiency of FCM- Overcomes the coincident clusters problem of PCM- Eliminates the row sum constraints of FPCM


2.4. Fuzzy C-means Algorithm on Big Data

Clustering of data simplifies big data for processing. Clustering is defined as gathering similar kinds of data in one place and differentiating them from dissimilar content: a similarity measure is defined between items of data, and then similar items are picked to form a cluster. Equivalently, clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). Clustering is an NP-hard problem; it becomes more complex and time-consuming for big data, yet it is a fact that clustering helps in the efficient handling of big data. Clustering data is a fundamental problem in a variety of areas of computer science and related fields, such as machine learning, data mining, pattern recognition, image analysis, and information retrieval; clustering itself is not one specific algorithm. The concept of parallelization appears because of large datasets, or big data. However, parallelization faces the obstacle of data dependency: in the presence of data dependency, synchronization is required, or serial code will execute in the critical section. This reduces the performance of parallel algorithms. So, some part of the processing needs to be 100% parallel, which will utilize all the available resources; in addition, a simple programming framework is required that provides an abstract level of programming over parallelization, data distribution, and load balancing.

The distributed environment and the Map-Reduce architecture appear to solve the data-clustering problem while achieving speedup compared to the sequential approach. The Map-Reduce library was created as an abstraction: it allows the developer to express a simple computation while hiding the details of parallelization, fault tolerance, data distribution, and load balancing in the library [1]. Developing an algorithm best suited to big data processing is a challenging task. For big data processing, Apache developed the Hadoop architecture. Hadoop is a distributed environment architecture: components are located at remote places, and computers at remote places can communicate and coordinate their actions by passing messages to each other, which is what is known as a distributed environment. Hadoop supports the Hadoop Distributed File System (HDFS). HDFS is a Java-based file system that provides expandable and reliable data storage, designed to span large clusters of commodity servers. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with Map-Reduce [2].

Map-Reduce is an architecture for a distributed programming model, and an implementation, for processing and generating large data sets in parallel. A Map-Reduce program is based on three functions: Map(), Shuffle(), and Reduce(). Map() distributes data to the different nodes located in the distributed environment; Reduce() gathers the results at a single location. The "Map-Reduce system" (also called "infrastructure" or "framework") orchestrates the processing by marshaling the scattered servers, running the various tasks in parallel, managing all communications and data transfers among the various parts of the system, and providing for redundancy and fault tolerance.

Anchalia, Prajesh P., Anjan K. Koundinya, and N. K. Srinath [9] carried out research in the area of data mining, with faster information retrieval as their main goal; for fast retrieval of information, clustering is useful. The authors proposed a parallel implementation of the K-means clustering algorithm: they implemented K-means on the Map-Reduce architecture to achieve speedup in the formation of data clusters. They observed that outlier handling is very important in an implementation of K-means on Map-Reduce, and that improvements in the stopping criteria and proper initialization of the clusters may lead to better results. The algorithm and implementation details for K-means on Map-Reduce are given by the authors.

Zhao, Weizhong, Huifang, and Qinge [10] state, based on their experimental research, that data clustering has been an important research area: as data increase, handling and maintenance become difficult and clustering becomes a much more complex task. To deal with this problem, they propose a parallel K-means clustering algorithm based on Map-Reduce; K-means is a simple and easy-to-implement clustering algorithm. The map function performs the procedure of assigning each sample to the closest center, and the reduce function performs the procedure of updating the center values. To decrease network communication, a combiner function is introduced to handle the intermediate values. The key is the offset in bytes of a record from the start of the data file, and the value is a string with the content of the record. The dataset is divided and broadcast to all mappers, and distance calculations are performed simultaneously. For each map task, the algorithm constructs a global variable holding an array that stores details about the centers of the clusters; with this information, the mapper can calculate the closest center point for each point. The intermediate values are then composed of two parts: the index of the closest center point and the sample information.
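The division of labor described above (the mapper assigns points to the nearest center, the combiner pre-aggregates per-node partial sums, and the reducer recomputes the centers) can be sketched in plain Python. This is a simulation of the data flow with illustrative helper names, not the paper's code:

```python
from collections import defaultdict

def mapper(split, centers):
    """Map: emit (index of the nearest center, point) for each sample in the split."""
    for p in split:
        j = min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
        yield j, p

def combiner(mapped):
    """Combine: collapse one mapper's output to per-cluster (coordinate sum, count),
    so only one small record per cluster crosses the network."""
    acc = defaultdict(lambda: (None, 0))
    for j, p in mapped:
        s, n = acc[j]
        acc[j] = (p if s is None else tuple(a + b for a, b in zip(s, p)), n + 1)
    return dict(acc)

def reducer(partials):
    """Reduce: merge the partial sums from all nodes and recompute each center."""
    total = defaultdict(lambda: (None, 0))
    for part in partials:
        for j, (s, n) in part.items():
            ts, tn = total[j]
            total[j] = (s if ts is None else tuple(a + b for a, b in zip(ts, s)), tn + n)
    return {j: tuple(x / n for x in s) for j, (s, n) in total.items()}

# Two input splits on two simulated nodes; initial centers near each group.
centers = [(0.0, 0.0), (10.0, 0.0)]
split1, split2 = [(0, 0), (1, 1)], [(9, 0), (11, 0)]
parts = [combiner(mapper(split1, centers)), combiner(mapper(split2, centers))]
print(reducer(parts))  # new centers: the mean of each group
```

The combiner is what makes the approach scale: without it, every (index, point) pair would be shuffled across the network instead of one (sum, count) pair per cluster per node.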

Ene, Alina, Sungjin Im, and Benjamin Moseley [11] designed clustering algorithms that can be implemented on the Map-Reduce architecture, focusing on two well-studied problems: k-median and k-center. Evaluations were performed on various serial and parallel algorithms for the k-median problem. A k-median sample contains more information than a normal sample for each unsampled point x; as a good solution to the overhead of this extra information, the authors propose selecting the sampled point that is nearest to x, and adding to each sampled point y a weight equal to the number of unsampled points that picked y as their closest point. The Map-Reduce k-median algorithm takes extra time to assign a weight to each point in the sample; this overhead needs to be gradually removed from the Map-Reduce implementation. The Map-Reduce architecture is not considered the best programming model for iterative programming.

Esteves, Rui Máximo, and Chunming Rong [12] experimented on a big, realistic, noisy dataset using data clustering algorithms, namely k-means and FCM (fuzzy c-means). The evaluation was made using a free cloud computing solution, Apache Mahout, and Wikipedia's latest articles. The authors showed that dimensionality reduction plays a crucial role in document clustering, and that in the presence of noise FCM gives worse results than k-means. The initialization of the cluster centers affects the convergence speed of both algorithms, and different initialization methods give different convergence times on a big dataset. In general FCM is faster than k-means, but with random initialization the results differ, and no one can predict which algorithm will be faster.

Xiaojun, Junying, and Haitao [13] proposed a distance regulatory factor to improve the fuzzy C-means algorithm. Conventional FCM uses the Euclidean distance as the similarity measurement criterion between data points when clustering. Euclidean distance has the limitation of assuming an equal distribution of the dataset, and clustering performance is degraded by the data structure, including cluster shape and cluster density. The solution to this problem is a distance regulatory factor, which corrects the similarity measurement when calculating the similarity between a cluster center and a sample point. The factor takes into account the cluster density, which represents the global distribution information of the points in a cluster, and is applied to conventional FCM as a distance correction. The results show good tolerance across different cluster densities.

Xie, Jiong, Shu Yin, et al. [14] show that ignoring data locality in a heterogeneous environment reduces Map-Reduce performance. While the other reviewed works apply Map-Reduce to clustering, these authors show that data clustering is a prerequisite for Map-Reduce jobs to improve performance. They point out the problem of data distribution in a distributed environment with the Map-Reduce architecture: instead of distributing random data to random nodes, data clustering is performed first, and then similar data are transferred to the same node in the distributed environment. Data clustering is thus used to enhance data locality in a single read. Data-intensive applications will see clearly positive results in terms of time and of balancing data across the nodes.

Ferreira Cordeiro, Robson Leonardo, Caetano Traina Junior, et al. [15] observe some problems while using Map-Reduce for data clustering, one of which is the I/O bottleneck. To overcome this problem, the I/O cost must be minimized, taking into account the already existing data partition, which minimizes the network cost among processing nodes. To remove the bottleneck the authors proposed the Best of Both Worlds (BOW) strategy, with a cost function that can choose the best strategy. It works with almost any serial clustering method as a plug-in subroutine, balancing the cost of disk accesses against the cost of network accesses, achieving very good results between the two, and it uses no user-defined parameters. The authors report experiments on real and synthetic data with billions of points, using up to 1,024 cores in parallel; 0.2 TB of multi-dimensional data took only 8 minutes to cluster using 128 cores.

Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, and Seung-Hee Bae [16] designed and implemented Twister, a distributed in-memory Map-Reduce runtime. It is an improvement over the Map-Reduce architecture, carried out to optimize iterative operations and computations over Map-Reduce. The authors present Twister's extended programming model and its architecture, and compare their model with Map-Reduce in their experiments. The result is an abstraction over Map-Reduce for a wide variety of applications, mostly related to data clustering, computer vision, and machine learning. The authors present results on a set of applications with big data sets; some of the benchmarks performed indicate that Twister performs and scales well for many iterative Map-Reduce computations.

Table 3 lists clustering algorithms that are implemented on Hadoop and

related technologies.

Table 3: List of Hadoop-based Clustering Applications

Paper: Parallel k-means clustering based on Map-Reduce [10]
  Pros: - A Combiner() is included to reduce communication overhead
        - Distance calculations are performed simultaneously
  Cons: - Initialization of the center values is not given
        - The stopping criterion is not mentioned

Paper: Fast clustering using Map-Reduce
  Pros: - Constant number of Map-Reduce rounds
        - Works in O(log n log log n) time
  Cons: - All experimentation performed on a single machine

Paper: Using Mahout for clustering Wikipedia's latest articles: a comparison between K-means and fuzzy C-means in the cloud [12]
  Pros: - Showed that the Mahout library is a promising clustering tool
        - Conclusions on the convergence rate and the preprocessing tool are given based on experimentation
  Cons: - The considered dataset is artificial

Paper: Improved fuzzy C-means clustering algorithm based on cluster density [13]
  Pros: - The similarity criterion is changed
        - Good results for different cluster densities
  Cons: - Not designed for the Map-Reduce architecture or parallel implementation

Paper: Improving Map-Reduce performance through data placement in heterogeneous Hadoop clusters [14]
  Pros: - Data placement is considered before data clustering
  Cons: - Overhead increases before data clustering, i.e., during data distribution

Paper: Clustering very large multi-dimensional datasets with Map-Reduce [15]
  Pros: - Minimizes I/O and network cost
        - Works as a plug-in
        - Experimentation on both real and artificial data
  Cons: - The design is not useful for overlapping datasets

Paper: Twister: a runtime for iterative Map-Reduce [16]
  Pros: - An improvement over the Map-Reduce architecture
        - Performs well with iterative computations
        - Suitable for data clustering applications
  Cons: - Implemented on top of the Map-Reduce architecture

2.5. Summary

Hard clustering is not considered the best solution for overlapping datasets and their

clusters; soft clustering techniques are preferred for clustering overlapping data. Among

soft clustering techniques, Fuzzy C-Means is a well-known algorithm and can be

treated as an advanced version of the K-means algorithm, in which the cluster

assignment of a data point depends on its fuzziness. Fuzzy C-Means allows the same

sample to belong to many clusters according to its fuzziness with respect to each cluster

center. Fuzzy clustering is a promising idea for data clustering. Map-Reduce (Hadoop) is

an architecture for distributed computing. Many details of parallelization and load

balancing are kept abstract from the programmer, so the programmer can concentrate on

the actual logic. It is also an open-source programming framework capable of handling

large datasets, or Bigdata. The authors investigate solving the data clustering problem

using fuzzy clustering with Map-Reduce for voluminous overlapping datasets, or

Bigdata.

Along with fuzziness over Map-Reduce on Bigdata, fast and accurate clustering

requires several aspects:

1. A proper initialization method for the cluster center points

2. A similarity criterion that requires less computation and enables the algorithm to

converge early

3. A stopping criterion for the algorithm

4. Data placement in a heterogeneous network as preprocessing for data

clustering, etc.

The above points are considered as variations over the simple soft clustering

algorithm, or Fuzzy C-Means, which improve the performance

of soft clustering. These algorithms need to be parallelized for performance evaluation.

The above points are also considered optimization points for the Fuzzy C-Means algorithm.


3. Design & Implementation

Fuzzy C-means (FCM) starts by vectorizing the elements and choosing centroids from

them as a reference for further operations. Based on a similarity criterion, each point is

placed in its matching cluster: each point's distance to each centroid serves as the

similarity criterion, and membership values are computed based on the fuzziness. Among

the various distance calculation methods, we consider the Euclidean distance because of its

computational simplicity. FCM can moreover be seen as an advanced version of

K-means.

In many situations, fuzzy clustering is more natural than hard clustering. Objects on

the boundaries between several classes are not forced to fully belong to one of the

classes, but rather are assigned membership degrees between zero and one indicating

their partial membership. Fuzzy c-means clustering was first reported in the literature for

a special case (m = 2) by Joe Dunn in 1974. Jim Bezdek developed the general case (for

any m greater than 1) in his PhD thesis at Cornell University in 1973, and

improved it in 1981. FCM employs fuzzy partitioning such that a data point can

belong to all groups with different membership grades between 0 and 1.

3.1. Fuzzy C-means

The objective function of the algorithm defines the clustering criterion used to aggregate

subsets; it is a generalized least-squares objective function. Features of this algorithm are

the Euclidean distance, which is simple to calculate; an adjustable weighting factor that

essentially controls sensitivity to noise; acceptance of variable numbers of clusters; and

outputs that include overlapping clusters.

3.2. Objective Function

The FCM algorithm is best described by recasting the conditions (equation 3.1) in

matrix-theoretic terms. Towards this end, let U be a real c × N matrix, U = [u_ik]. U is the

matrix representation of the partition {Y_i} [28].


u_i(y_k) = u_ik = { 1 if y_k ∈ Y_i; 0 otherwise }          (3.1)

∑_{k=1}^{N} u_ik > 0   for all i                           (3.2)

∑_{i=1}^{c} u_ik = 1   for all k                           (3.3)

u_i is a function, u_i : Y → {0, 1}. In conventional models, u_i is the characteristic

function of Y_i; in fact, u_i and Y_i determine one another, so there is no harm in labeling

u_i the ith hard subset of the partition.

U is referred to as a fuzzy c-partition of Y when the elements of U are numbers in the unit

interval [0, 1] that continue to satisfy both equations (3.2) and (3.3). The basis for this

definition is the set of c functions u_i : Y → [0, 1], whose values u_i(y_k) ∈ [0, 1] are interpreted as the

grades of membership of the y_k in the "fuzzy subsets" u_i of Y.

J_m(U, v) = ∑_{k=1}^{N} ∑_{i=1}^{c} (u_ik)^m ‖y_k − v_i‖_A²          (3.4)

Where,

Y = {y_1, y_2, ..., y_N} ⊂ R^n = the data,                 (3.5)

c = number of clusters in Y; 2 ≤ c ≤ n,                    (3.6)

m = weighting exponent; 1 ≤ m < ∞,                         (3.7)

U = a fuzzy c-partition of Y; u_ik ∈ [0, 1],               (3.8)

v = (v_1, v_2, ..., v_c) = the vector of centers,          (3.9)

v_i = (v_i1, v_i2, ..., v_in) = the center of cluster i.   (3.10)


The weight attached to each squared error is (u_ik)^m, the mth power of y_k's

membership in cluster i. The vectors {v_i} in equation (3.10) are viewed as "cluster

centers" or centers of mass of the partitioning subsets. If m = 1, it can be shown

that J_m is minimized only at hard U's ∈ M_c.

Table 4: Objective Function Described

d_ik²                                      squared A-distance from point y_k to the center of mass v_i

(u_ik)^m d_ik²                             squared A-error incurred by representing y_k by v_i, weighted by (a power of) the membership of y_k in cluster i

∑_{i=1}^{c} (u_ik)^m d_ik²                 sum of squared A-errors due to y_k's partial replacement by all c of the centers {v_i}

∑_{k=1}^{N} ∑_{i=1}^{c} (u_ik)^m d_ik²     overall weighted sum of generalized A-errors due to replacing Y by v

3.3. Importance of Fuzzification factor (m)

The weighting exponent m controls the relative weights placed on each of the squared

errors d_ik². As m → 1, partitions that minimize J_m become

increasingly hard (and, as mentioned before, at m = 1, are necessarily hard). Conversely,

each entry of an optimal U approaches (1/c) as m → ∞. Consequently, increasing m

tends to degrade membership towards the fuzziest state. Each choice of m defines, all

other parameters being fixed, one FCM algorithm. No theoretical or computational

evidence distinguishes an optimal m; the range of useful values seems to be [1, 30] or so.

If a test set is available for the process under investigation, the best strategy for selecting

m at present seems to be experimental. For most data, 1.5 ≤ m ≤ 3.0 gives good results

[28].
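The membership computation itself is not spelled out above; the following minimal Java sketch (class and variable names are mine, not the report's code) implements the standard FCM membership update u_ik = 1 / ∑_j (d_ik / d_jk)^(2/(m−1)) and illustrates how m controls fuzziness: a small m gives nearly hard memberships, while a large m pushes every membership towards 1/c.

```java
// Minimal sketch (not the report's implementation) of the standard FCM
// membership update; assumes all distances are strictly positive.
public class FcmMembership {
    // distances: d_ik from one point y_k to each of the c cluster centers
    public static double[] memberships(double[] distances, double m) {
        int c = distances.length;
        double[] u = new double[c];
        double exponent = 2.0 / (m - 1.0);
        for (int i = 0; i < c; i++) {
            double sum = 0.0;
            for (int j = 0; j < c; j++) {
                sum += Math.pow(distances[i] / distances[j], exponent);
            }
            u[i] = 1.0 / sum; // memberships over all clusters sum to 1
        }
        return u;
    }

    public static void main(String[] args) {
        double[] d = {1.0, 2.0};               // point twice as far from center 2
        double[] crisp = memberships(d, 1.1);  // small m: nearly hard assignment
        double[] fuzzy = memberships(d, 10.0); // large m: memberships approach 1/c
        System.out.printf("m=1.1 : %.4f %.4f%n", crisp[0], crisp[1]);
        System.out.printf("m=10.0: %.4f %.4f%n", fuzzy[0], fuzzy[1]);
    }
}
```

Running the sketch with the two distances above shows the near-hard assignment for m close to 1 and the near-uniform memberships for large m, matching the behavior described in this section.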


3.4. Document Clustering

Document clustering, also known as text data mining or knowledge discovery

from textual databases, is, generally, the process of extracting interesting and

non-trivial patterns or knowledge from unstructured text documents. All the extracted

information is linked together to form new facts or new hypotheses to be explored further

by more conventional means of experimentation.

The classic approach of keyword-based information retrieval on the WWW

makes it cumbersome for users to look up exact and precise information in

the search results: it is up to the user to go through each document to extract the relevant

and necessary information, an impractical and tedious task. Document clustering can be

a better solution, as it links all the extracted information together, pushes irrelevant

information aside, and keeps the relevant items based on the question of interest.

3.5. Dataset Conversion

The 20_newsgroups dataset is converted from text format into numerical format as

shown in Figure 3. First, find the unique keywords across all the documents and their

counts, i.e., the total count of each unique keyword over all documents, and store the

unique keywords and counts in Java's TreeMap collection. Display the unique keywords

to the user; the program then collects from the user the keywords on which clustering is

to be done. For each chosen keyword, find its occurrences in all the documents and store

the counts.

TF = Count of Keyword / ∑(Keywords appeared in Document)          (3.11)

Using the above formula, the text data is converted into the numerical data that

represents the 20_newsgroups dataset. The properties of a Vectorized object are "Key",

"Value", and "Location" (the ID of the object); Value represents the term frequency of the

keyword in that document. The same is done for the centroids.
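As a concrete illustration of the TF formula above, the sketch below (hypothetical names; the report's actual code is not shown, and Java 8 collection methods are used for brevity) counts tokens with a TreeMap and computes the term frequency of each user-chosen keyword in one document:

```java
import java.util.TreeMap;

// Illustrative sketch: TF = count(keyword) / total keywords in the document,
// with counts held in a TreeMap as described in section 3.5.
public class TfVectorizer {
    public static TreeMap<String, Double> termFrequencies(String document, String[] keywords) {
        String[] tokens = document.toLowerCase().split("\\s+");
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String t : tokens) counts.merge(t, 1, Integer::sum); // per-word counts
        TreeMap<String, Double> tf = new TreeMap<>();
        for (String k : keywords) {
            tf.put(k, counts.getOrDefault(k, 0) / (double) tokens.length);
        }
        return tf;
    }

    public static void main(String[] args) {
        String doc = "space shuttle launch space mission";
        // "space" appears in 2 of the 5 tokens, "hockey" in none
        System.out.println(termFrequencies(doc, new String[]{"space", "hockey"}));
    }
}
```

Each document's TF values, keyed by keyword, would then populate one Vectorized object as described above.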


Figure 3 : Dataset Conversion


3.6. Classical K-Means

Algorithm


Procedure KMeans
Begin
    Repeat
        Count the frequency of unique words in the document
        Count the global_Frequency of unique words in each document
    Until all documents are scanned
    Display the unique words to the user
    Get user input of the keywords on which document clustering should be done
    Repeat
        Calculate TF per Document:Keyword
            TF = Count of Keyword / ∑(Keywords appeared in Document)
        Store TF in a Vector_Dataset
    Until all documents are scanned
    Randomly select K points from Vector_Dataset as Centroids
    Step I: Vectorization
        Store TF in a Vectorized object
        Store the Centroids in a Vectorized object
    Repeat
        Repeat
            Step II: Calculate distance (Euclidean distance)
                1. Calculate the Euclidean distance between the current document and each cluster centroid:
                   d(x, y) = √( ∑_{i=1}^{N} (x_i − y_i)² )
                2. Check whether the distance is smaller than the previous distance
                3. If yes, update: the current document belongs to this ith cluster
        Until all documents are scanned
        Update the Centroids
    Until all iterations are over
End

Figure 4 : Algorithm of classical K-Means


Flowchart

Figure 5 : Flowchart of Classical K-Means


Initially, convert the Vector_Dataset to Vectorized objects as mentioned in section 3.5.

Start iterating over the vectorized data and compute the distance (Euclidean distance)

between the current document's Vectorized object and each centroid. The document

belongs to the centroid that gives the shortest distance: compare the distance to each

centroid and store the smallest distance and the centroid's index. Now copy the point into

the new cluster it belongs to and update the centroid values. The next iteration runs over

the modified centroid values.

The above procedure is known as hard clustering, where each document belongs

to one and only one cluster.
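The assignment step described above can be sketched as follows (a minimal illustration with invented names, not the report's code): compute the Euclidean distance to every centroid and keep the index of the smallest.

```java
// Illustrative sketch of the hard (nearest-centroid) assignment in
// classical K-means over TF vectors.
public class HardAssignment {
    public static double euclidean(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the nearest centroid for one document vector.
    public static int nearestCentroid(double[] doc, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double d = euclidean(doc, centroids[i]);
            if (d < bestDist) { // keep the smallest distance and its index
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {1.0, 1.0}};
        double[] doc = {0.9, 0.8}; // TF vector for one document, nearer to centroid 1
        System.out.println(nearestCentroid(doc, centroids));
    }
}
```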


3.7. Map-Reduce Based K-Means

Algorithm


Procedure MapReduceKMeans
Begin (sequential)
    Repeat
        Count the frequency of unique words in the document
        Count the global_Frequency of unique words in each document
    Until all documents are scanned
    Display the unique words to the user
    Get user input of the keywords on which document clustering should be done
    Repeat
        Calculate TF per Document:Keyword
            TF = Count of Keyword / ∑(Keywords appeared in Document)
        Store TF in a Vector_Dataset
    Until all documents are scanned
    Randomly select K points from Vector_Dataset as Centroids
End (sequential)
Begin (MapReduce)
    Map the given dataset over all available nodes (based on split size)
    MAP PHASE (on all nodes)
        Step I: Vectorization
            Store TF in a Vectorized object
            Store the Centroids in a Vectorized object
        Repeat
            Repeat
                Step II: Calculate distance (Euclidean distance)
                    1. Calculate the Euclidean distance between the current document and each cluster centroid:
                       d(x, y) = √( ∑_{i=1}^{N} (x_i − y_i)² )
                    2. Check whether the distance is smaller than the previous distance
                    3. If yes, update: the current document belongs to this ith cluster
            Until all documents are scanned
            Update the Centroids
        Until all iterations are over
    REDUCE PHASE (from all nodes)
        Gather all results and store them on a single node
End

Figure 6 : Algorithm of Hadoop Based K-Means


Flowchart

Figure 7: Flowchart of Hadoop Based K-Means


Initially, follow the data conversion method mentioned in section 3.5. The converted

vectorized data is distributed over the Hadoop nodes; the distribution is done

according to the Hadoop replication configuration. During programming, slices of the dataset

are created according to the split size and distributed over the Hadoop nodes for processing.

The difference between the above two statements is that the first replicates the data a number of

times to reduce communication overhead, sacrificing space on a number of nodes, while the

second slices the required data from an available replica optimally nearest

to a processing node that does not hold a copy of the data. Sometimes data splitting increases the

communication overhead and consequently decreases the throughput of the system.

Follow section 3.6 for the K-means operation, except that here each node runs the

same K-means code. On completion of the distance calculation, the data from each

node is reduced to one single node and the centroids are updated. The same procedure is repeated for n iterations.
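The reduce step described above can be simulated in plain Java (an illustrative sketch under assumed names, not the report's Hadoop code): each node emits per-cluster partial sums and counts for its data split, and a single node merges them to produce the updated centroids.

```java
import java.util.Arrays;

// Plain-Java simulation of the Map-Reduce centroid update described
// in section 3.7 (not the actual Hadoop implementation).
public class CentroidReduce {
    // One node's partial result over its data split.
    public static class Partial {
        public double[][] sums; // per-cluster coordinate sums
        public int[] counts;    // per-cluster point counts
        public Partial(int k, int dim) { sums = new double[k][dim]; counts = new int[k]; }
    }

    // "Map": accumulate each point of the split into its assigned cluster slot.
    public static Partial mapSplit(double[][] split, int[] assignments, int k, int dim) {
        Partial p = new Partial(k, dim);
        for (int n = 0; n < split.length; n++) {
            int c = assignments[n];
            p.counts[c]++;
            for (int d = 0; d < dim; d++) p.sums[c][d] += split[n][d];
        }
        return p;
    }

    // "Reduce": merge all partials on one node and compute the new centroids.
    public static double[][] reduce(Partial[] partials, int k, int dim) {
        double[][] centroids = new double[k][dim];
        int[] counts = new int[k];
        for (Partial p : partials) {
            for (int c = 0; c < k; c++) {
                counts[c] += p.counts[c];
                for (int d = 0; d < dim; d++) centroids[c][d] += p.sums[c][d];
            }
        }
        for (int c = 0; c < k; c++)
            for (int d = 0; d < dim; d++)
                if (counts[c] > 0) centroids[c][d] /= counts[c];
        return centroids;
    }

    public static void main(String[] args) {
        // Two "nodes", one cluster, 1-D points {1, 3} and {5, 7}: mean is 4.
        Partial a = mapSplit(new double[][]{{1}, {3}}, new int[]{0, 0}, 1, 1);
        Partial b = mapSplit(new double[][]{{5}, {7}}, new int[]{0, 0}, 1, 1);
        System.out.println(Arrays.deepToString(reduce(new Partial[]{a, b}, 1, 1)));
    }
}
```

In the actual system the per-split accumulation runs as Hadoop map tasks and the merge as the reduce task; the arithmetic is the same.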


3.8. Classical Fuzzy C-Means

Algorithm


Procedure FCMeans
Begin
    Repeat
        Count the frequency of unique words in the document
        Count the global_Frequency of unique words in each document
    Until all documents are scanned
    Display the unique words to the user
    Get user input of the keywords on which document clustering should be done
    Repeat
        Calculate TF per Document:Keyword
            TF = Count of Keyword / ∑(Keywords appeared in Document)
        Store TF in a Vector_Dataset
    Until all documents are scanned
    Randomly select K points from Vector_Dataset as Centroids
    Step I: Vectorization
        1. Store TF in a Vectorized object
        2. Store the Centroids in a Vectorized object
    Repeat
        Repeat
            Step II: Calculate distance (Euclidean distance)
                Repeat
                    Calculate the Euclidean distance between the current document and the cluster centroid
                Until all centroids are covered
                Calculate the fuzzification factor
            Until the fuzzification factor of the point with all centroids is calculated
            Step III:
                1. If Fuzzification Factor > Threshold,
                2. update: the current document belongs to this ith cluster
        Until all documents are scanned
        Update the Centroids
    Until all iterations are over
End

Figure 8: Algorithm of Classical Fuzzy C-Means


Flowchart

Figure 9: Flowchart of Classical Fuzzy C-Means


Fuzzy C-Means follows the same data conversion methods mentioned in

section 3.5, and then the same sequence of steps as K-means, except for the distance

calculation formula mentioned in section 3.8.1: the fuzzification factor is calculated in

addition to the Euclidean distance. Using the fuzzification factor, the current point may

belong to all clusters for which the factor is beyond the given threshold value.

This is also known as soft clustering: the same point may belong to one cluster or to

many clusters, depending upon the threshold value and the fuzzification factor.


3.9. Map-Reduce Based Fuzzy C-Means

Algorithm

Procedure MapReduceFCMeans
Begin (sequential)
    Repeat
        Count the frequency of unique words in the document
        Count the global_Frequency of unique words in each document
    Until all documents are scanned
    Display the unique words to the user
    Get user input of the keywords on which document clustering should be done
    Repeat
        Calculate TF per Document:Keyword
            TF = Count of Keyword / ∑(Keywords appeared in Document)
        Store TF in a Vector_Dataset
    Until all documents are scanned
    Randomly select K points from Vector_Dataset as Centroids
End (sequential)
Begin (MapReduce)
    Map the given dataset over all available nodes (based on split size)
    MAP PHASE (on all nodes)
        Step I: Vectorization
            1. Store TF in a Vectorized object
            2. Store the Centroids in a Vectorized object
        Repeat
            Repeat
                Step II: Calculate distance (Euclidean distance)
                    Repeat
                        Calculate the Euclidean distance between the current document and the cluster centroid
                    Until all centroids are covered
                    Calculate the fuzzification factor
                Until the fuzzification factor of the point with all centroids is calculated
                Step III:
                    If Fuzzification Factor > Threshold,
                    update: the current document belongs to this ith cluster
            Until all documents are scanned
            Update the Centroids
        Until all iterations are over
    REDUCE PHASE (from all nodes)
        Gather all results and store them on a single node
End

Figure 10: Algorithm of Map-Reduce Based Fuzzy C-Means


Flowchart

Figure 11: Flowchart of Map-Reduce Based Fuzzy C-Means


Map-Reduce based Fuzzy C-Means is the combination of the classical Fuzzy C-Means

implementation and the Map-Reduce based K-means implementation. Data conversion and data

distribution serve as sequential preprocessing for Map-Reduce based Fuzzy C-Means.

Then classical Fuzzy C-Means runs on each node in the Hadoop cluster up to the

fuzzification factor calculation. Finally, all fuzzification factors are gathered on a single

node, each point is checked for where it belongs, and the centroids are updated. This is

iterated a number of times.


3.10. Summary

Sequential Fuzzy C-Means is converted into parallel Fuzzy C-Means. Only

programming parallelism is achieved; the sequential nature of the algorithm is retained.

Programming parallelism is carried out using the Hadoop distributed computing

framework.


4. Results and Discussion

4.1. Performance Evaluation Criteria

Fuzzy C-Means is evaluated on the following performance parameters: the

speed-up of Map-Reduce based Fuzzy C-Means versus sequential K-Means

on multiple-node Hadoop setups, i.e., 2, 4, and 8 nodes, for 4 and 6 centroids, with

Map-Reduce based programming using 4 Mb, 8 Mb, and 32 Mb splits.

4.2. Speed Up

Speed-up is the ratio of sequential run time to parallel run time, as described in

equation 4.1.

Speed-up = Sequential run time / Parallel run time          (4.1)
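As a worked example of this ratio, two run times reported later (Table 10: classical K-means versus 4-node HKM, 3 centroids, 4th iteration) give a speed-up of 420000 / 182000 ≈ 2.31. A trivial sketch (names are mine):

```java
// Illustrative speed-up calculation; the sample run times are taken from
// Table 10 (4 Mb split, 4th iteration, 3 centroids).
public class SpeedUp {
    public static double speedUp(double sequentialTime, double parallelTime) {
        return sequentialTime / parallelTime; // sequential run time / parallel run time
    }

    public static void main(String[] args) {
        double classical = 420000; // classical K-means, 4th iteration
        double hkm4Node = 182000;  // 4-node Hadoop based K-means, 4th iteration
        System.out.printf("Speed-up = %.2f%n", speedUp(classical, hkm4Node));
    }
}
```

For the sample values this prints a speed-up of about 2.31.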

4.3. Hardware Configuration

The results were taken on an Intel Quad Core i5-2400S CPU (2.5 GHz × 4). The

machine is equipped with 4 GB of RAM and 1 TB of hard disk capacity, running the

Ubuntu 14.04 LTS operating system. The IDE used for development is Eclipse LUNA

with Java 6 and above.

4.4. Test Data

The 20 Newsgroups dataset is considered here for document clustering. It contains

approximately 20,000 articles representing documents. The documents are structured

into 20 folders, i.e., logically ordered.

The 20 folders are named:

1. comp.graphics
2. comp.os.ms-windows.misc
3. comp.sys.ibm.pc.hardware
4. comp.sys.mac.hardware
5. comp.windows.x
6. misc.forsale
7. rec.autos
8. rec.motorcycles
9. rec.sport.baseball
10. rec.sport.hockey
11. talk.politics.misc
12. talk.politics.guns
13. talk.politics.mideast
14. sci.crypt
15. sci.electronics
16. sci.med
17. sci.space
18. talk.religion.misc
19. alt.atheism
20. soc.religion.christian

1000 Usenet articles were taken from each of the 20 newsgroups.

Approximately 4% of the articles are cross-posted. The articles are typical postings and

thus have headers including:

a. subject lines,

b. signature files,

c. quoted portions of other articles.

4.5. Data Format

Each newsgroup is stored in a subdirectory, with each article stored as a separate file.

It is a popular dataset for text applications in machine learning, classification, and text

clustering.

4.6. Data Specification

Table 5: Dataset Specification

Data Set Characteristics:    Text
Number of Instances:         20000
Area:                        N/A
Attribute Characteristics:   N/A
Number of Attributes:        N/A
Date Donated:                1999-09-09
Associated Tasks:            N/A
Missing Values?              NO

4.7. Experimental Setup

The following table describes the combinations of experiments that have been

performed for this dissertation.

Table 6: Experimental Permutations and Combinations

                          3 Centroids    4 Centroids    5 Centroids    6 Centroids
Experimental Setup        4 Itr  6 Itr   4 Itr  6 Itr   4 Itr  6 Itr   4 Itr  6 Itr
Classical K-Means
  4 Mb Split               √      √       √      √       √      √       √      √
  8 Mb Split               √      √       √      √       √      √       √      √
  32 Mb Split              √      √       √      √       √      √       √      √
Hadoop Based K-Means
  4 Mb Split               √      √       √      √       √      √       √      √
  8 Mb Split               √      √       √      √       √      √       √      √
  32 Mb Split              √      √       √      √       √      √       √      √
Classical FCMeans
  4 Mb Split               √      √       √      √       √      √       √      √
  8 Mb Split               √      √       √      √       √      √       √      √
  32 Mb Split              √      √       √      √       √      √       √      √
Hadoop Based FCMeans
  4 Mb Split               √      √       √      √       √      √       √      √
  8 Mb Split               √      √       √      √       √      √       √      √
  32 Mb Split              √      √       √      √       √      √       √      √

Where, in Table 6, Itr = Iteration.

Based on the combinations in Table 6, results are taken and discussed in the current

chapter.

4.8. Hadoop Version and number of nodes

1. Hadoop 1.2.1

2. 2, 4, 8 Node Hadoop Clusters


4.9. Other Required Software

Eclipse IDE (LUNA version)

4.10. Experiments and Discussion

4.10.1. Classical K-Means vs. Hadoop Based K-Means

4.10.1.1. 4Mb Split 4 Iterations

Table 7: 2 Nodes Hadoop Cluster 4Mb Split 4 Iterations

K-Means    3 centroid   4 centroid   5 centroid   6 centroid
1st ITR    60000        60000        60000        60000
2nd ITR    300000       300000       300000       300000
3rd ITR    300000       300000       420000       540000
4th ITR    420000       420000       480000       480000
HKM
1st ITR    28000        35000        28000        31000
2nd ITR    363000       736000       498000       555000
3rd ITR    399000       921000       517000       617000
4th ITR    369000       935000       799000       615000

Table 8 : 4 Nodes Hadoop Cluster 4Mb Split 4 Iterations

K-Means    3 centroid   4 centroid   5 centroid   6 centroid
1st ITR    60000        60000        60000        60000
2nd ITR    300000       300000       300000       300000
3rd ITR    300000       300000       420000       540000
4th ITR    420000       420000       480000       480000
HKM
1st ITR    28000        3400         27000        31000
2nd ITR    162000       251000       206000       224000
3rd ITR    188000       284000       215000       225000
4th ITR    182000       272000       211000       226000

Table 9: 8 Nodes Hadoop Cluster 4Mb Split 4 Iterations

K-Means    3 centroid   4 centroid   5 centroid   6 centroid
1st ITR    60000        60000        60000        60000
2nd ITR    300000       300000       300000       300000
3rd ITR    300000       300000       420000       540000
4th ITR    420000       420000       480000       480000
HKM
1st ITR    29000        36000        29000        31000
2nd ITR    209000       357000       217000       266000
3rd ITR    237000       423000       258000       280000
4th ITR    239000       407000       300000       292000

Table 10: 2, 4, 8 Node 4Mb Split 4 Iterations

                     3 centroid   4 centroid   5 centroid   6 centroid
Classical K-Means
1st ITR              60000        60000        60000        60000
2nd ITR              300000       300000       300000       300000
3rd ITR              300000       300000       420000       540000
4th ITR              420000       420000       480000       480000
2 Nodes HKM
1st ITR              28000        35000        28000        31000
2nd ITR              363000       736000       498000       555000
3rd ITR              399000       921000       517000       617000
4th ITR              369000       935000       799000       615000
4 Nodes HKM
1st ITR              28000        3400         27000        31000
2nd ITR              162000       251000       206000       224000
3rd ITR              188000       284000       215000       225000
4th ITR              182000       272000       211000       226000
8 Nodes HKM
1st ITR              29000        36000        29000        31000
2nd ITR              209000       357000       217000       266000
3rd ITR              237000       423000       258000       280000
4th ITR              239000       407000       300000       292000


Figure 12: 2 Nodes Hadoop Cluster 4Mb Split 4 Iterations

Figure 13: 4 Nodes Hadoop Cluster 4Mb Split 4 Iterations

Figure 14: 8 Nodes Hadoop Cluster 4Mb Split 4 Iterations



Figure 15: 2, 4, 8 Node 4Mb Split 4 Iterations

4.10.1.2. 4Mb Split 6 Iterations

Table 11: 2 Nodes Hadoop Cluster 4Mb Split 6 Iterations

K-Means    3 centroid   4 centroid   5 centroid   6 centroid
1st ITR    60000        60000        60000        60000
2nd ITR    300000       360000       300000       360000
3rd ITR    300000       420000       540000       600000
4th ITR    420000       480000       540000       660000
5th ITR    540000       720000       720000       780000
6th ITR    600000       660000       660000       900000
HKM
1st ITR    28000        35000        28000        31000
2nd ITR    455000       887000       498000       547000
3rd ITR    402000       852000       558000       578000
4th ITR    528000       741000       733000       680000
5th ITR    429000       953000       930000       582000
6th ITR    396000       699000       720000       720000

Table 12: 4 Nodes Hadoop Cluster 4Mb Split 6 Iterations

K-Means    3 centroid   4 centroid   5 centroid   6 centroid
1st ITR    60000        60000        60000        60000
2nd ITR    300000       360000       300000       360000
3rd ITR    300000       420000       540000       600000
4th ITR    420000       480000       540000       660000
5th ITR    540000       720000       720000       780000
6th ITR    600000       660000       660000       900000
HKM
1st ITR    27000        38000        29000        31000
2nd ITR    158000       262000       222000       232000
3rd ITR    187000       284000       234000       235000
4th ITR    193000       269000       270000       241000
5th ITR    195000       272000       233000       238000
6th ITR    196000       289000       247000       228000

Table 13: 8 Nodes Hadoop Cluster 4Mb Split 6 Iterations

K-Means    3 centroid   4 centroid   5 centroid   6 centroid
1st ITR    60000        60000        60000        60000
2nd ITR    300000       360000       300000       360000
3rd ITR    300000       420000       540000       600000
4th ITR    420000       480000       540000       660000
5th ITR    540000       720000       720000       780000
6th ITR    600000       660000       660000       900000
HKM
1st ITR    27000        38000        29000        31000
2nd ITR    158000       262000       222000       232000
3rd ITR    187000       284000       234000       235000
4th ITR    193000       269000       270000       241000
5th ITR    195000       272000       233000       238000
6th ITR    196000       289000       247000       228000

Table 14: 2, 4, 8 Node 4Mb Split 6 Iterations

                     3 centroid   4 centroid   5 centroid   6 centroid
Classical K-Means
1st ITR              60000        60000        60000        60000
2nd ITR              300000       360000       300000       360000
3rd ITR              300000       420000       540000       600000
4th ITR              420000       480000       540000       660000
5th ITR              540000       720000       720000       780000
6th ITR              600000       660000       660000       900000
2 Nodes HKM
1st ITR              28000        35000        28000        31000
2nd ITR              455000       887000       498000       547000
3rd ITR              402000       852000       558000       578000
4th ITR              528000       741000       733000       680000
5th ITR              429000       953000       930000       582000
6th ITR              396000       699000       720000       720000
4 Nodes HKM
1st ITR              27000        38000        29000        31000
2nd ITR              158000       262000       222000       232000
3rd ITR              187000       284000       234000       235000
4th ITR              193000       269000       270000       241000
5th ITR              195000       272000       233000       238000
6th ITR              196000       289000       247000       228000
8 Nodes HKM
1st ITR              27000        38000        29000        31000
2nd ITR              158000       262000       222000       232000
3rd ITR              187000       284000       234000       235000
4th ITR              193000       269000       270000       241000
5th ITR              195000       272000       233000       238000
6th ITR              196000       289000       247000       228000


Figure 16: 2 Nodes Hadoop Cluster 4Mb Split 6 Iterations

Figure 17: 4 Nodes Hadoop Cluster 4Mb Split 6 Iterations

Figure 18: 8 Nodes Hadoop Cluster 4Mb Split 6 Iterations


Figure 19: 2, 4, 8 Node 4Mb Split 6 Iterations


4.10.1.3. 8Mb Split 4 Iterations

Table 15: 2 Nodes Hadoop Cluster 8Mb Split 4 Iterations

K-Means    3 centroid   4 centroid   5 centroid   6 centroid
1st ITR    60000        60000        60000        60000
2nd ITR    300000       300000       300000       300000
3rd ITR    300000       300000       420000       540000
4th ITR    420000       420000       480000       480000
HKM
1st ITR    28000        35000        28000        32000
2nd ITR    358000       603000       477000       480000
3rd ITR    391000       912000       603000       622000
4th ITR    375000       926000       808000       639000

Table 16: 4 Nodes Hadoop Cluster 8Mb Split 4 Iterations

K-Means    3 centroid   4 centroid   5 centroid   6 centroid
1st ITR    60000        60000        60000        60000
2nd ITR    300000       300000       300000       300000
3rd ITR    300000       300000       420000       540000
4th ITR    420000       420000       480000       480000
HKM
1st ITR    29000        38000        29000        31000
2nd ITR    180000       262000       200000       213000
3rd ITR    189000       275000       231000       235000
4th ITR    194000       322000       242000       231000

Table 17: 8 Nodes Hadoop Cluster 8Mb Split 4 Iterations

K-Means    3 centroid   4 centroid   5 centroid   6 centroid
1st ITR    60000        60000        60000        60000
2nd ITR    300000       300000       300000       300000
3rd ITR    300000       300000       420000       540000
4th ITR    420000       420000       480000       480000
HKM
1st ITR    29000        36000        28000        31000
2nd ITR    308000       615000       405000       469000
3rd ITR    307000       655000       451000       468000
4th ITR    307000       682000       573000       465000

Table 18: 2, 4, 8 Node 8Mb Split 4 Iterations

                     3 centroid   4 centroid   5 centroid   6 centroid
Classical K-Means
1st ITR              60000        60000        60000        60000
2nd ITR              300000       300000       300000       300000
3rd ITR              300000       300000       420000       540000
4th ITR              420000       420000       480000       480000
2 Nodes HKM
1st ITR              28000        35000        28000        32000
2nd ITR              358000       603000       477000       480000
3rd ITR              391000       912000       603000       622000
4th ITR              375000       926000       808000       639000
4 Nodes HKM
1st ITR              29000        38000        29000        31000
2nd ITR              180000       262000       200000       213000
3rd ITR              189000       275000       231000       235000
4th ITR              194000       322000       242000       231000
8 Nodes HKM
1st ITR              29000        36000        28000        31000
2nd ITR              308000       615000       405000       469000
3rd ITR              307000       655000       451000       468000
4th ITR              307000       682000       573000       465000

Figure 20: 2 Nodes Hadoop Cluster 8Mb Split 4 Iterations

Figure 21: 4 Nodes Hadoop Cluster 8Mb Split 4 Iterations

Figure 22: 8 Nodes Hadoop Cluster 8Mb Split 4 Iterations



Figure 23: 2, 4, 8 Node 8Mb Split 4 Iterations

4.10.1.4. 8Mb Split 6 Iterations

Table 19: 2 Nodes Hadoop Cluster 8Mb Split 6 Iterations

K-Means    3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       60000       60000
2nd ITR    300000      360000      300000      360000
3rd ITR    300000      420000      540000      600000
4th ITR    420000      480000      540000      660000
5th ITR    540000      720000      720000      780000
6th ITR    600000      660000      660000      900000
HKM
1st ITR    28000       35000       28000       31000
2nd ITR    350000      784000      545000      850000
3rd ITR    396000      843000      850000      622000
4th ITR    388000      927000      802000      607000
5th ITR    419000      877000      662000      565000
6th ITR    411000      876000      720000      545000

Table 20: 4 Nodes Hadoop Cluster 8Mb Split 6 Iterations

K-Means    3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       60000       60000
2nd ITR    300000      360000      300000      360000
3rd ITR    300000      420000      540000      600000
4th ITR    420000      480000      540000      660000
5th ITR    540000      720000      720000      780000
6th ITR    600000      660000      660000      900000
HKM
1st ITR    28000       37000       27000       31000
2nd ITR    168000      287000      192000      240000
3rd ITR    198000      273000      253000      219000
4th ITR    204000      305000      276000      251000
5th ITR    194000      288000      257000      248000
6th ITR    194000      282000      245000      248000

Table 21: 8 Nodes Hadoop Cluster 8Mb Split 6 Iterations

K-Means    3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       60000       60000
2nd ITR    300000      360000      300000      360000
3rd ITR    300000      420000      540000      600000
4th ITR    420000      480000      540000      660000
5th ITR    540000      720000      720000      780000
6th ITR    600000      660000      660000      900000
HKM
1st ITR    29000       35000       28000       31000
2nd ITR    281000      579000      369000      440000
3rd ITR    309000      657000      444000      460000
4th ITR    308000      699000      573000      512000
5th ITR    340000      687000      625000      482000
6th ITR    312000      740000      584000      538000

Table 22: 2, 4, 8 Node 8Mb Split 6 Iterations

                    3 centroid  4 centroid  5 centroid  6 centroid
Classical K-Means   60000       60000       60000       60000
                    300000      360000      300000      360000
                    300000      420000      540000      600000
                    420000      480000      540000      660000
                    540000      720000      720000      780000
                    600000      660000      660000      900000
2 Node K-Means      28000       35000       28000       31000
                    350000      784000      545000      850000
                    396000      843000      850000      622000
                    388000      927000      802000      607000
                    419000      877000      662000      565000
                    411000      876000      720000      545000
4 Node K-Means      28000       37000       27000       31000
                    168000      287000      192000      240000
                    198000      273000      253000      219000
                    204000      305000      276000      251000
                    194000      288000      257000      248000
                    194000      282000      245000      248000
8 Node K-Means      29000       35000       28000       31000
                    281000      579000      369000      440000
                    309000      657000      444000      460000
                    308000      699000      573000      512000
                    340000      687000      625000      482000
                    312000      740000      584000      538000

Figure 24: 2 Nodes Hadoop Cluster 8Mb Split 6 Iterations

Figure 25: 4 Nodes Hadoop Cluster 8Mb Split 6 Iterations

Figure 26: 8 Nodes Hadoop Cluster 8Mb Split 6 Iterations


Figure 27: 2, 4, 8 Node 8Mb Split 6 Iterations

4.10.1.5. 32Mb Split 4 Iterations

Table 23: 2 Nodes Hadoop Cluster 32Mb Split 4 Iterations

K-Means    3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       60000       60000
2nd ITR    300000      300000      300000      300000
3rd ITR    300000      300000      420000      540000
4th ITR    420000      420000      480000      480000
HKM
1st ITR    29000       35000       28000       31000
2nd ITR    343000      693000      475000      516000
3rd ITR    387000      841000      577000      621000
4th ITR    395000      655000      632000      609000

Table 24: 4 Nodes Hadoop Cluster 32Mb Split 4 Iterations

K-Means    3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       60000       60000
2nd ITR    300000      300000      300000      300000
3rd ITR    300000      300000      420000      540000
4th ITR    420000      420000      480000      480000
HKM
1st ITR    30000       41000       41000       32000
2nd ITR    163000      263000      214000      233000
3rd ITR    195000      290000      228000      245000
4th ITR    183000      319000      254000      217000

Table 25: 8 Nodes Hadoop Cluster 32Mb Split 4 Iterations

K-Means    3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       60000       60000
2nd ITR    300000      300000      300000      300000
3rd ITR    300000      300000      420000      540000
4th ITR    420000      420000      480000      480000
HKM
1st ITR    29000       36000       28000       31000
2nd ITR    276000      577000      427000      464000
3rd ITR    300000      660000      476000      491000
4th ITR    303000      692000      573000      501000

Table 26: 2, 4, 8 Node 32Mb Split 4 Iterations

                    3 centroid  4 centroid  5 centroid  6 centroid
Classical K-Means   60000       60000       60000       60000
                    300000      300000      300000      300000
                    300000      300000      420000      540000
                    420000      420000      480000      480000
2 Node K-Means      29000       35000       28000       31000
                    343000      693000      475000      516000
                    387000      841000      577000      621000
                    395000      655000      632000      609000
4 Node K-Means      30000       41000       41000       32000
                    163000      263000      214000      233000
                    195000      290000      228000      245000
                    183000      319000      254000      217000
8 Node K-Means      29000       36000       28000       31000
                    276000      577000      427000      464000
                    300000      660000      476000      491000
                    303000      692000      573000      501000

Figure 28: 2 Nodes Hadoop Cluster 32Mb Split 4 Iterations

Figure 29: 4 Nodes Hadoop Cluster 32Mb Split 4 Iterations

Figure 30: 8 Nodes Hadoop Cluster 32Mb Split 4 Iterations


Figure 31: 2, 4, 8 Node 32Mb Split 4 Iterations


4.10.1.6. 32Mb Split 6 Iterations

Table 27: 2 Nodes Hadoop Cluster 32Mb Split 6 Iterations

K-Means    3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       60000       60000
2nd ITR    300000      360000      300000      360000
3rd ITR    300000      420000      540000      600000
4th ITR    420000      480000      540000      660000
5th ITR    540000      720000      720000      780000
6th ITR    600000      660000      660000      900000
HKM
1st ITR    29000       43000       29000       32000
2nd ITR    387000      1377000     569000      517000
3rd ITR    363000      835000      543000      636000
4th ITR    443000      875000      694000      1058000
5th ITR    499000      806000      771000      916000
6th ITR    387000      956000      858000      862000

Table 28: 4 Nodes Hadoop Cluster 32Mb Split 6 Iterations

K-Means    3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       60000       60000
2nd ITR    300000      360000      300000      360000
3rd ITR    300000      420000      540000      600000
4th ITR    420000      480000      540000      660000
5th ITR    540000      720000      720000      780000
6th ITR    600000      660000      660000      900000
HKM
1st ITR    28000       34000       28000       31000
2nd ITR    172000      275000      195000      262000
3rd ITR    185000      285000      229000      260000
4th ITR    182000      303000      241000      268000
5th ITR    194000      273000      258000      254000
6th ITR    185000      277000      285000      266000

Table 29: 8 Nodes Hadoop Cluster 32Mb Split 6 Iterations

K-Means    3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       60000       60000
2nd ITR    300000      360000      300000      360000
3rd ITR    300000      420000      540000      600000
4th ITR    420000      480000      540000      660000
5th ITR    540000      720000      720000      780000
6th ITR    600000      660000      660000      900000
HKM
1st ITR    28000       35000       28000       31000
2nd ITR    323000      604000      397000      439000
3rd ITR    320000      661000      474000      460000
4th ITR    333000      678000      607000      505000
5th ITR    369000      687000      610000      479000
6th ITR    366000      708000      581000      516000

Table 30: 2, 4, 8 Node 32Mb Split 6 Iterations

                    3 centroid  4 centroid  5 centroid  6 centroid
Classical K-Means   60000       60000       60000       60000
                    300000      360000      300000      360000
                    300000      420000      540000      600000
                    420000      480000      540000      660000
                    540000      720000      720000      780000
                    600000      660000      660000      900000
2 Node K-Means      29000       43000       29000       32000
                    387000      1377000     569000      517000
                    363000      835000      543000      636000
                    443000      875000      694000      1058000
                    499000      806000      771000      916000
                    387000      956000      858000      862000
4 Node K-Means      28000       34000       28000       31000
                    172000      275000      195000      262000
                    185000      285000      229000      260000
                    182000      303000      241000      268000
                    194000      273000      258000      254000
                    185000      277000      285000      266000
8 Node K-Means      28000       35000       28000       31000
                    323000      604000      397000      439000
                    320000      661000      474000      460000
                    333000      678000      607000      505000
                    369000      687000      610000      479000
                    366000      708000      581000      516000


Figure 32: 2 Nodes Hadoop Cluster 32Mb Split 6 Iterations

Figure 33: 4 Nodes Hadoop Cluster 32Mb Split 6 Iterations

Figure 34: 8 Nodes Hadoop Cluster 32Mb Split 6 Iterations


Figure 35: 2, 4, 8 Node 32Mb Split 6 Iterations

The tables above describe the parameter settings employed in the experimentation. The experiments combine three components and their subcomponents:

1. Number of nodes
   a. 2 Node Hadoop cluster
   b. 4 Node Hadoop cluster
   c. 8 Node Hadoop cluster
2. Number of iterations
   a. 4 iterations
   b. 6 iterations
3. Size of split
   a. 4Mb split
   b. 8Mb split
   c. 32Mb split

Classical K-Means vs. Hadoop Based K-Means

Tables 4 to 30 show that a speedup is gained using the Hadoop-based K-Means algorithm, and Figures 5 to 35 plot the time consumed by the 2-, 4-, and 8-node Hadoop clusters against the single-node classical K-Means. All of the above results compare the execution times of the classical K-Means implementation and the Hadoop-based K-Means implementation. In most cases the 2-node Hadoop-based K-Means takes as long as classical K-Means, or even longer, because of the communication overhead between the two nodes. Hadoop assigns tasks to its slaves: in a two-node system the master first transfers the data to the slave and copies the program to each node, and then the Mapper phase executes. On completion of the Mapper phase, communication takes place again to gather the data at one location or node; this phase is called the Reducer phase, and it works much like sequential execution. With a 4Mb data split the 120Mb 20_newsgroup dataset produces 24 splits, whereas a 32Mb split produces only 3. Because of this communication overhead, the 2-node Hadoop-based K-Means is much slower than classical K-Means.
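The Mapper/Reducer round described above can be sketched in a few lines. This is a minimal single-process illustration, not the dissertation's Hadoop code: the mapper emits (centroid index, point) pairs for the nearest centroid, and the reducer averages the points gathered for each centroid. The point and centroid values are made up for the example.

```python
def mapper(points, centroids):
    """Map phase: emit (nearest centroid index, point) for each point."""
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i]))
        yield nearest, p

def reducer(mapped):
    """Reduce phase: average the points assigned to each centroid."""
    sums, counts = {}, {}
    for idx, p in mapped:
        sums[idx] = sums.get(idx, 0.0) + p
        counts[idx] = counts.get(idx, 0) + 1
    return {idx: sums[idx] / counts[idx] for idx in sums}

points = [1.0, 1.5, 8.0, 9.0]
centroids = [1.0, 9.0]
new_centroids = reducer(mapper(points, centroids))
# Points 1.0 and 1.5 go to centroid 0, points 8.0 and 9.0 to centroid 1,
# so the new centroids are {0: 1.25, 1: 8.5}.
```

One such map-and-reduce round corresponds to one iteration ("ITR") in the tables above, which is why the per-iteration communication cost dominates on small clusters.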

In Tables 4 to 30 above, the times are given in milliseconds, listed by number of nodes and number of iterations, and the tables are separated by the data split size used in Hadoop. The experiments were run with initial centroid counts of 3, 4, 5, and 6, and the combinations of all these parameters are considered in this dissertation report.

The 4-node Hadoop cluster gives the best speedup for Hadoop-based K-Means, whereas with 8 nodes the slaves and master are again involved in communication most of the time instead of processing.

The experiment shows that Hadoop suits less iterative algorithms and, for small datasets, workloads where the communication burden is kept as low as possible. For larger datasets Hadoop suits a large data split: as the number of splits falls, the communication overhead shrinks and the processing elements get more time for computation.

K-Means works with memberships of 0 and 1 only. For that reason the Fuzzy C-Means concept comes into the picture in place of the K-Means algorithm: Fuzzy C-Means supports one-to-many membership. An exponential speedup of the Hadoop-based FCM implementation over classical FCM is expected as the number of nodes increases.
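The one-to-many membership that distinguishes FCM from K-Means can be sketched as follows. This is an illustrative computation of the standard FCM membership update with fuzzifier m = 2; the point and centroid values are made up, not taken from the experiments.

```python
def memberships(point, centroids, m=2.0):
    """Standard FCM update: u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)).

    Returns one membership in [0, 1] per cluster; they sum to 1,
    unlike K-Means' hard 0/1 assignment. Assumes no distance is zero.
    """
    dists = [abs(point - c) for c in centroids]
    return [1.0 / sum((d_i / d_k) ** (2.0 / (m - 1)) for d_k in dists)
            for d_i in dists]

u = memberships(2.0, [1.0, 9.0])
# The point is much closer to centroid 1.0, so u = [0.98, 0.02]:
# a strong but not exclusive membership, and the values sum to 1.
```

Computing these fractional memberships for every point-cluster pair in each iteration is why the FCM times in the following tables are an order of magnitude larger than the K-Means times.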

Figure 36: Communication Overhead


4.10.2. Classical Fuzzy C-Means vs. Hadoop Based Fuzzy C-Means

4.10.2.1. 4Mb Split 4 Iterations

Table 31: 2 Nodes Hadoop Cluster 4Mb Split 4 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       120000      180000
2nd ITR    2580000     3360000     6120000     6600000
3rd ITR    3120000     3960000     6660000     6660000
4th ITR    3120000     4140000     7020000     8280000
HFCM
1st ITR    69000       81000       116000      127000
2nd ITR    945000      1543000     2418000     3132000
3rd ITR    863000      1716000     2431000     3145000
4th ITR    859000      1714000     2437000     3242000

Table 32: 4 Nodes Hadoop Cluster 4Mb Split 4 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       120000      180000
2nd ITR    2580000     3360000     6120000     6600000
3rd ITR    3120000     3960000     6660000     6660000
4th ITR    3120000     4140000     7020000     8280000
HFCM
1st ITR    66000       79000       97000       121000
2nd ITR    1556000     1278000     2218000     3845000
3rd ITR    824000      1366000     2836000     4037000
4th ITR    765000      1440000     3042000     5828000

Table 33: 8 Nodes Hadoop Cluster 4Mb Split 4 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       120000      180000
2nd ITR    2580000     3360000     6120000     6600000
3rd ITR    3120000     3960000     6660000     6660000
4th ITR    3120000     4140000     7020000     8280000
HFCM
1st ITR    55000       70000       97000       111000
2nd ITR    798000      1357000     1695000     2125000
3rd ITR    458000      1145000     1800000     2155000
4th ITR    785000      1547000     1508000     2135000

Table 34: 2, 4, 8 Node 4Mb Split 4 Iterations

                    3 centroid  4 centroid  5 centroid  6 centroid
Classical FCM       60000       60000       120000      180000
                    2580000     3360000     6120000     6600000
                    3120000     3960000     6660000     6660000
                    3120000     4140000     7020000     8280000
2 Node FCM          69000       81000       116000      127000
                    945000      1543000     2418000     3132000
                    863000      1716000     2431000     3145000
                    859000      1714000     2437000     3242000
4 Node FCM          66000       79000       97000       121000
                    1556000     1278000     2218000     3845000
                    824000      1366000     2836000     4037000
                    765000      1440000     3042000     5828000
8 Node FCM          55000       70000       97000       111000
                    798000      1357000     1695000     2125000
                    458000      1145000     1800000     2155000
                    785000      1547000     1508000     2135000

Figure 37: 2 Nodes Hadoop Cluster 4Mb Split 4 Iterations

Figure 38: 4 Nodes Hadoop Cluster 4Mb Split 4 Iterations

Figure 39: 8 Nodes Hadoop Cluster 4Mb Split 4 Iterations


Figure 40: 2, 4, 8 Node 4Mb Split 4 Iterations

4.10.2.2. 4Mb Split 6 Iterations

Table 35: 2 Nodes Hadoop Cluster 4Mb Split 6 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       120000      120000      180000
2nd ITR    2460000     4200000     6240000     6660000
3rd ITR    2460000     4260000     7020000     7920000
4th ITR    2520000     4320000     6720000     8400000
5th ITR    2520000     4560000     6600000     9120000
6th ITR    2460000     4620000     6840000     9420000
HFCM
1st ITR    63000       79000       112000      129000
2nd ITR    942000      1541000     2420000     3281000
3rd ITR    1096000     1698000     2420000     3264000
4th ITR    1100000     1696000     2423000     3273000
5th ITR    1088000     1694000     2431000     4477000
6th ITR    1088000     1695000     2418000     3211000

Table 36: 4 Nodes Hadoop Cluster 4Mb Split 6 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       120000      120000      180000
2nd ITR    2460000     4200000     6240000     6660000
3rd ITR    2460000     4260000     7020000     7920000
4th ITR    2520000     4320000     6720000     8400000
5th ITR    2520000     4560000     6600000     9120000
6th ITR    2460000     4620000     6840000     9420000
HFCM
1st ITR    57000       70000       100000      124000
2nd ITR    641000      1112000     1900000     2274000
3rd ITR    833000      1165000     1785000     3281000
4th ITR    732000      1136000     1824000     2972000
5th ITR    792000      1180000     2016000     3544000
6th ITR    733000      1417000     1839000     3709000

Table 37: 8 Nodes Hadoop Cluster 4Mb Split 6 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       120000      120000      180000
2nd ITR    2460000     4200000     6240000     6660000
3rd ITR    2460000     4260000     7020000     7920000
4th ITR    2520000     4320000     6720000     8400000
5th ITR    2520000     4560000     6600000     9120000
6th ITR    2460000     4620000     6840000     9420000
HFCM
1st ITR    53000       68000       94000       106000
2nd ITR    486000      779000      1210000     1532000
3rd ITR    590000      885000      1150000     1504000
4th ITR    616000      847000      1153000     1532000
5th ITR    584000      853000      1162000     1528000
6th ITR    588000      875000      1144000     1846000

Table 38: 2, 4, 8 Node 4Mb Split 6 Iterations

                    3 centroid  4 centroid  5 centroid  6 centroid
Classical FCM       60000       120000      120000      180000
                    2460000     4200000     6240000     6660000
                    2460000     4260000     7020000     7920000
                    2520000     4320000     6720000     8400000
                    2520000     4560000     6600000     9120000
                    2460000     4620000     6840000     9420000
2 Node FCM          63000       79000       112000      129000
                    942000      1541000     2420000     3281000
                    1096000     1698000     2420000     3264000
                    1100000     1696000     2423000     3273000
                    1088000     1694000     2431000     4477000
                    1088000     1695000     2418000     3211000
4 Node FCM          57000       70000       100000      124000
                    641000      1112000     1900000     2274000
                    833000      1165000     1785000     3281000
                    732000      1136000     1824000     2972000
                    792000      1180000     2016000     3544000
                    733000      1417000     1839000     3709000
8 Node FCM          53000       68000       94000       106000
                    486000      779000      1210000     1532000
                    590000      885000      1150000     1504000
                    616000      847000      1153000     1532000
                    584000      853000      1162000     1528000
                    588000      875000      1144000     1846000


Figure 41: 2 Nodes Hadoop Cluster 4Mb Split 6 Iterations

Figure 42: 4 Nodes Hadoop Cluster 4Mb Split 6 Iterations

Figure 43: 8 Nodes Hadoop Cluster 4Mb Split 6 Iterations


Figure 44: 2, 4, 8 Node 4Mb Split 6 Iterations

4.10.2.3. 8Mb Split 4 Iterations

Table 39: 2 Nodes Hadoop Cluster 8Mb Split 4 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       120000      180000
2nd ITR    2580000     3360000     6120000     6600000
3rd ITR    3120000     3960000     6660000     6660000
4th ITR    3120000     4140000     7020000     8280000
HFCM
1st ITR    58000       76000       105000      124000
2nd ITR    796000      1356000     2127000     2935000
3rd ITR    911000      1466000     2147000     2918000
4th ITR    1032000     1696000     2287000     3138000

Table 40: 4 Nodes Hadoop Cluster 8Mb Split 4 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       120000      180000
2nd ITR    2580000     3360000     6120000     6600000
3rd ITR    3120000     3960000     6660000     6660000
4th ITR    3120000     4140000     7020000     8280000
HFCM
1st ITR    50000       78000       101000      120000
2nd ITR    842000      2603000     2881000     5357000
3rd ITR    1045000     1786000     3188000     6360000
4th ITR    1349000     1894000     3667000     4541000

Table 41: 8 Nodes Hadoop Cluster 8Mb Split 4 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       120000      180000
2nd ITR    2580000     3360000     6120000     6600000
3rd ITR    3120000     3960000     6660000     6660000
4th ITR    3120000     4140000     7020000     8280000
HFCM
1st ITR    52000       64000       91000       110000
2nd ITR    403000      650000      1047000     1423000
3rd ITR    471000      737000      1024000     1363000
4th ITR    467000      772000      1067000     1447000

Table 42: 2, 4, 8 Node 8Mb Split 4 Iterations

                    3 centroid  4 centroid  5 centroid  6 centroid
Classical FCM       60000       60000       120000      180000
                    2580000     3360000     6120000     6600000
                    3120000     3960000     6660000     6660000
                    3120000     4140000     7020000     8280000
2 Node FCM          58000       76000       105000      124000
                    796000      1356000     2127000     2935000
                    911000      1466000     2147000     2918000
                    1032000     1696000     2287000     3138000
4 Node FCM          50000       78000       101000      120000
                    842000      2603000     2881000     5357000
                    1045000     1786000     3188000     6360000
                    1349000     1894000     3667000     4541000
8 Node FCM          52000       64000       91000       110000
                    403000      650000      1047000     1423000
                    471000      737000      1024000     1363000
                    467000      772000      1067000     1447000

Figure 45: 2 Nodes Hadoop Cluster 8Mb Split 4 Iterations

Figure 46: 4 Nodes Hadoop Cluster 8Mb Split 4 Iterations

Figure 47: 8 Nodes Hadoop Cluster 8Mb Split 4 Iterations


Figure 48: 2, 4, 8 Node 8Mb Split 4 Iterations

4.10.2.4. 8Mb Split 6 Iterations

Table 43: 2 Nodes Hadoop Cluster 8Mb Split 6 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       120000      120000      180000
2nd ITR    2460000     4200000     6240000     6660000
3rd ITR    2460000     4260000     7020000     7920000
4th ITR    2520000     4320000     6720000     8400000
5th ITR    2520000     4560000     6600000     9120000
6th ITR    2460000     4620000     6840000     9420000
HFCM
1st ITR    58000       76000       103000      125000
2nd ITR    794000      1332000     2127000     2938000
3rd ITR    922000      1468000     2153000     2931000
4th ITR    921000      1460000     2136000     2929000
5th ITR    1052000     1598000     2297000     3088000
6th ITR    1072000     1590000     2323000     3031000

Table 44: 4 Nodes Hadoop Cluster 8Mb Split 6 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       120000      120000      180000
2nd ITR    2460000     4200000     6240000     6660000
3rd ITR    2460000     4260000     7020000     7920000
4th ITR    2520000     4320000     6720000     8400000
5th ITR    2520000     4560000     6600000     9120000
6th ITR    2460000     4620000     6840000     9420000
HFCM
1st ITR    60000       78000       104000      115000
2nd ITR    1425000     2437000     3089000     5049000
3rd ITR    1353000     2804000     4886000     4387000
4th ITR    1517000     2377000     3164000     4606000
5th ITR    1754000     2008000     3159000     5324000
6th ITR    2096000     2249000     3652000     5115000

Table 45: 8 Nodes Hadoop Cluster 8Mb Split 6 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       120000      120000      180000
2nd ITR    2460000     4200000     6240000     6660000
3rd ITR    2460000     4260000     7020000     7920000
4th ITR    2520000     4320000     6720000     8400000
5th ITR    2520000     4560000     6600000     9120000
6th ITR    2460000     4620000     6840000     9420000
HFCM
1st ITR    518000      65000       92000       106000
2nd ITR    422000      639000      1147000     1362000
3rd ITR    463000      703000      1073000     1505000
4th ITR    517000      720000      1080000     1437000
5th ITR    477000      900000      1013000     1648000
6th ITR    474000      742000      1161000     1544000

Table 46: 2, 4, 8 Node 8Mb Split 6 Iterations

                    3 centroid  4 centroid  5 centroid  6 centroid
Classical FCM       60000       120000      120000      180000
                    2460000     4200000     6240000     6660000
                    2460000     4260000     7020000     7920000
                    2520000     4320000     6720000     8400000
                    2520000     4560000     6600000     9120000
                    2460000     4620000     6840000     9420000
2 Node FCM          58000       76000       103000      125000
                    794000      1332000     2127000     2938000
                    922000      1468000     2153000     2931000
                    921000      1460000     2136000     2929000
                    1052000     1598000     2297000     3088000
                    1072000     1590000     2323000     3031000
4 Node FCM          60000       78000       104000      115000
                    1425000     2437000     3089000     5049000
                    1353000     2804000     4886000     4387000
                    1517000     2377000     3164000     4606000
                    1754000     2008000     3159000     5324000
                    2096000     2249000     3652000     5115000
8 Node FCM          518000      65000       92000       106000
                    422000      639000      1147000     1362000
                    463000      703000      1073000     1505000
                    517000      720000      1080000     1437000
                    477000      900000      1013000     1648000
                    474000      742000      1161000     1544000

Figure 49: 2 Nodes Hadoop Cluster 8Mb Split 6 Iterations

Figure 50: 4 Nodes Hadoop Cluster 8Mb Split 6 Iterations

Figure 51: 8 Nodes Hadoop Cluster 8Mb Split 6 Iterations


Figure 52: 2, 4, 8 Node 8Mb Split 6 Iterations


4.10.2.5. 32Mb Split 4 Iterations

Table 47: 2 Nodes Hadoop Cluster 32Mb Split 4 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       120000      180000
2nd ITR    2580000     3360000     6120000     6600000
3rd ITR    3120000     3960000     6660000     6660000
4th ITR    3120000     4140000     7020000     8280000
HFCM
1st ITR    52000       73000       105000      166000
2nd ITR    913000      1506000     2506000     3524000
3rd ITR    1021000     1703000     2508000     3512000
4th ITR    1148000     1873000     2628000     3640000

Table 48: 4 Nodes Hadoop Cluster 32Mb Split 4 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       120000      180000
2nd ITR    2580000     3360000     6120000     6600000
3rd ITR    3120000     3960000     6660000     6660000
4th ITR    3120000     4140000     7020000     8280000
HFCM
1st ITR    172000      75000       112000      127000
2nd ITR    898000      1628000     2595000     3406000
3rd ITR    1027000     1591000     2595000     3237000
4th ITR    949000      1663000     2301000     3239000

Table 49: 8 Nodes Hadoop Cluster 32Mb Split 4 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       120000      180000
2nd ITR    2580000     3360000     6120000     6600000
3rd ITR    3120000     3960000     6660000     6660000
4th ITR    3120000     4140000     7020000     8280000
HFCM
1st ITR    68000       83000       107000      126000
2nd ITR    989000      1768000     2617000     3420000
3rd ITR    1053000     1743000     2540000     3450000
4th ITR    1035000     1697000     2512000     3407000

Table 50: 2, 4, 8 Node 32Mb Split 4 Iterations

                    3 centroid  4 centroid  5 centroid  6 centroid
Classical FCM       60000       60000       120000      180000
                    2580000     3360000     6120000     6600000
                    3120000     3960000     6660000     6660000
                    3120000     4140000     7020000     8280000
2 Node FCM          52000       73000       105000      166000
                    913000      1506000     2506000     3524000
                    1021000     1703000     2508000     3512000
                    1148000     1873000     2628000     3640000
4 Node FCM          172000      75000       112000      127000
                    898000      1628000     2595000     3406000
                    1027000     1591000     2595000     3237000
                    949000      1663000     2301000     3239000
8 Node FCM          68000       83000       107000      126000
                    989000      1768000     2617000     3420000
                    1053000     1743000     2540000     3450000
                    1035000     1697000     2512000     3407000

Figure 53: 2 Nodes Hadoop Cluster 32Mb Split 4 Iterations

Figure 54: 4 Nodes Hadoop Cluster 32Mb Split 4 Iterations

Figure 55: 8 Nodes Hadoop Cluster 32Mb Split 4 Iterations


Figure 56: 2, 4, 8 Node 32Mb Split 4 Iterations

4.10.2.6. 32Mb Split 6 Iterations

Table 51: 2 Nodes Hadoop Cluster 32Mb Split 6 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       120000      120000      180000
2nd ITR    2460000     4200000     6240000     6660000
3rd ITR    2460000     4260000     7020000     7920000
4th ITR    2520000     4320000     6720000     8400000
5th ITR    2520000     4560000     6600000     9120000
6th ITR    2460000     4620000     6840000     9420000
HFCM
1st ITR    57000       73000       74000       123000
2nd ITR    922000      1498000     2494000     3564000
3rd ITR    1030000     1695000     2510000     3132000
4th ITR    1028000     1695000     2752000     3538000
5th ITR    1140000     1955000     3110000     3630000
6th ITR    1165000     1930000     3245000     3565000

Table 52: 4 Nodes Hadoop Cluster 32Mb Split 6 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       120000      120000      180000
2nd ITR    2460000     4200000     6240000     6660000
3rd ITR    2460000     4260000     7020000     7920000
4th ITR    2520000     4320000     6720000     8400000
5th ITR    2520000     4560000     6600000     9120000
6th ITR    2460000     4620000     6840000     9420000
HFCM
1st ITR    55000       73000       106000      127000
2nd ITR    925000      1492000     2269000     3552000
3rd ITR    1019000     1776000     2579000     3222000
4th ITR    1015000     1660000     2577000     3407000
5th ITR    1016000     1671000     2468000     3225000
6th ITR    1014000     1772000     2587000     3580000

Table 53: 8 Nodes Hadoop Cluster 32Mb Split 6 Iterations

FCM        3 centroid  4 centroid  5 centroid  6 centroid
1st ITR    60000       60000       60000       60000
2nd ITR    300000      360000      300000      360000
3rd ITR    300000      420000      540000      600000
4th ITR    420000      480000      540000      660000
5th ITR    540000      720000      720000      780000
6th ITR    600000      660000      660000      900000
HFCM
1st ITR    28000       35000       28000       31000
2nd ITR    323000      604000      397000      439000
3rd ITR    320000      661000      474000      460000
4th ITR    333000      678000      607000      505000
5th ITR    369000      687000      610000      479000
6th ITR    366000      708000      581000      516000

Table 54: 2, 4, 8 Node 32Mb Split 6 Iterations

                    3 centroid  4 centroid  5 centroid  6 centroid
Classical K-Means   60000       60000       60000       60000
                    300000      360000      300000      360000
                    300000      420000      540000      600000
                    420000      480000      540000      660000
                    540000      720000      720000      780000
                    600000      660000      660000      900000
2 Node K-Means      29000       43000       29000       32000
                    387000      1377000     569000      517000
                    363000      835000      543000      636000
                    443000      875000      694000      1058000
                    499000      806000      771000      916000
                    387000      956000      858000      862000
4 Node K-Means      28000       34000       28000       31000
                    172000      275000      195000      262000
                    185000      285000      229000      260000
                    182000      303000      241000      268000
                    194000      273000      258000      254000
                    185000      277000      285000      266000
8 Node K-Means      28000       35000       28000       31000
                    323000      604000      397000      439000
                    320000      661000      474000      460000
                    333000      678000      607000      505000
                    369000      687000      610000      479000
                    366000      708000      581000      516000


Figure 57: 2 Nodes Hadoop Cluster 32Mb Split 6 Iterations

Figure 58: 4 Nodes Hadoop Cluster 32Mb Split 6 Iterations

Figure 59: 8 Nodes Hadoop Cluster 32Mb Split 6 Iterations


Figure 60: 2, 4, 8 Node 32Mb Split 6 Iterations

Classical FCM vs. Hadoop Based FCM

Tables 31 to 54 show that a speedup is gained using the Hadoop-based FCM algorithm. All of the above results compare the execution times of the classical FCM implementation and the Hadoop-based FCM implementation. In most cases the 2-node Hadoop-based FCM gives a speedup of 100%, i.e. it is twice as fast as classical FCM, or even faster. With a 4Mb data split the 120Mb 20_newsgroup dataset produces 24 splits, whereas a 32Mb split produces only 3. Four nodes face a communication-overhead problem because the dataset is small, and FCM is more iterative than K-Means. With 2 nodes the cluster uses its full processing capacity for FCM, and the master needs to handle only one node in the communication process; with 4 nodes the master must handle communication with 3 nodes. During Hadoop setup the default number of data copies is set to 3, so in a 4-node cluster the 4th node never holds a copy of the data and must fetch it every time. This is why there is little difference in performance between 2 nodes and 4 nodes.

In Tables 31 to 54 above, the times are given in milliseconds, listed by number of nodes and number of iterations, and the tables are separated by the data split size used in Hadoop. The experiments were run with initial centroid counts of 3, 4, 5, and 6, and the combinations of all these parameters are considered in this dissertation report.

The 8-node Hadoop cluster gives the best speedup for Hadoop-based FCM; in the 8-node case the default number of data copies set during Hadoop setup is 6. One reason for the 8-node cluster's best speedup is the increase in the number of nodes together with the decrease in communication overhead. Before starting Hadoop programming, one must understand that Hadoop nodes always communicate with each other as part of Hadoop's internal process. Advanced versions of Hadoop take care of data distribution, load balancing, and communication overhead; such a version will give better performance than the 4-node Hadoop cluster used here, and the performance improvement will grow as the number of nodes increases.

The experiment shows that Hadoop suits large datasets and, for small datasets, workloads where the communication burden is kept as low as possible. FCM always takes more time than the classical K-Means implementation, but it gives better results for overlapping datasets.

80

Page 81: FinalReport-Ashutosh with indentation (1)

Performance Analysis of Fuzzy C-Means Algorithm

It is observed that, as the number of nodes increases, the speedup increases and then

levels off, limited by the sequential part of the code. The corresponding graphs show

this speedup.
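The flattening of speedup described above can be illustrated with Amdahl's law, a standard result stating that the sequential fraction of a program bounds the achievable speedup; the 15% sequential fraction below is purely illustrative, not a measurement from this dissertation:

```python
# Amdahl's law: speedup = 1 / (s + (1 - s) / n), where s is the fraction of
# the work that is strictly sequential and n is the number of nodes.
def amdahl_speedup(seq_fraction: float, nodes: int) -> float:
    """Ideal speedup when `seq_fraction` of the work cannot be parallelized."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / nodes)

# With an assumed 15% sequential part, speedup grows with nodes but can never
# exceed 1 / 0.15, roughly 6.7 -- which is why the curve levels off.
print(round(amdahl_speedup(0.15, 8), 2))   # ~3.9 on 8 nodes
```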

5. Conclusion and Future Scope

Hadoop Based K-means and Hadoop Based Fuzzy C-Means for document clustering

are implemented in this dissertation. Speedup has been achieved using a Hadoop-based

multi-node cluster. Fuzzy C-Means gives better results for overlapping datasets,

whereas K-means is known for hard clustering and behaves accordingly. Hadoop Based

Fuzzy C-Means gives a 5-fold speedup on an 8-node Hadoop cluster as compared to the

classical Fuzzy C-Means algorithm.

5.1. Hadoop Based K-means

The classical K-means algorithm is available in different implementations, and

advanced K-means variants exist. These advanced K-means variants need to be

implemented with Hadoop for document clustering and checked for better results

than those produced in this dissertation.

The Hadoop based K-means algorithm needs to minimize its iterations,

because Hadoop gives poor results for highly iterative algorithms.

The Hadoop based K-means algorithm also needs a better technique for

selecting its initial random centroids.
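One well-known alternative to purely random centroid selection is k-means++-style seeding, sketched below; this is an illustrative option, not a method implemented or evaluated in this dissertation:

```python
# k-means++-style seeding: after the first random centroid, each new centroid
# is picked with probability proportional to the squared distance from the
# nearest centroid already chosen, spreading the seeds out.
import random

def kmeanspp_seed(points, k, rng=None):
    rng = rng or random.Random(0)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # squared distance of each point to its nearest chosen centroid
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ct)) for ct in centroids)
              for pt in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(pt)
                break
    return centroids

pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
print(kmeanspp_seed(pts, 2))
```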

5.2. Hadoop Based Fuzzy C-Means

More execution time is observed for Fuzzy C-Means in both cases, the classical

algorithm and the Hadoop based implementation; the required time is very high as

compared to the corresponding K-means implementations.

The Fuzzy C-Means algorithm is available in different implementations, and

advanced Fuzzy C-Means variants (mentioned in the literature review) exist. These

advanced Fuzzy C-Means variants need to be implemented with Hadoop for document

clustering and checked for better results than those produced in this dissertation.

The Hadoop Based Fuzzy C-Means algorithm needs to minimize its iterations,

because Hadoop gives poor results for highly iterative algorithms.

The Hadoop Based Fuzzy C-Means algorithm also needs a better technique for

selecting its initial random centroids.

5.3. Future Scope

The implemented Hadoop based Fuzzy C-Means algorithm performs many

calculations in each iteration and carries the burden of storing the fuzzification

factor every time it computes how strongly a point belongs to each centroid and uses

that value in the calculation of each centroid. An alternative advanced design could

be created with fewer iterations and less burden of storing the fuzzification factor.
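The membership (fuzzification) update that creates this storage burden can be sketched with the standard FCM formula, u_ij = 1 / Σ_k (d_ij / d_ik)^(2/(m-1)); the code below is a generic NumPy sketch of that textbook formula, not the dissertation's Hadoop implementation:

```python
# Standard FCM membership update: for every point, one membership value per
# centroid must be computed and stored -- the per-iteration burden noted above.
import numpy as np

def fcm_memberships(points, centroids, m=2.0):
    """points: (n, d), centroids: (c, d); returns an (n, c) membership matrix."""
    # d[i, j] = Euclidean distance from point i to centroid j
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)  # avoid division by zero for points on a centroid
    # ratio[i, j, k] = (d[i, j] / d[i, k]) ** (2 / (m - 1)); sum over k
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)  # each row sums to 1

pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
ctr = np.array([[0.0, 0.0], [10.0, 10.0]])
u = fcm_memberships(pts, ctr)
print(np.round(u, 3))  # rows sum to 1; the nearer centroid gets higher weight
```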

A distance calculation method more effective than Euclidean distance would

help achieve better speedup.
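For document vectors, one commonly used alternative to Euclidean distance is cosine dissimilarity, which ignores document length and compares only term proportions; this is an illustrative suggestion, not a method evaluated in this dissertation:

```python
# Cosine dissimilarity: 1 - cosine similarity. Often preferred over Euclidean
# distance for sparse, high-dimensional document vectors (e.g. TF-IDF).
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

# Two documents with identical term proportions are identical under cosine
# distance, even though their Euclidean distance is large.
print(round(cosine_distance([1.0, 2.0, 0.0], [10.0, 20.0, 0.0]), 6))  # 0.0
```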


6. References


APPENDIX A


Vitae

Name of student: Ashutosh Shrikant Sathe

Native Place: Solapur, Maharashtra

Date of Birth: 14th May, 1989

Address: 130B, Vidyut Sahwas Society, Near Dhumma Vasti, Laxmi Peth, Solapur

Email: [email protected] .in

Objective: To be involved in work that helps to utilize, share, and improve knowledge and experience.

Short Term Goal: To complete a doctorate in Engineering

Areas of Interest: Data Mining, Distributed Computing, Distributed Computing Administration
