Performance Analysis of Fuzzy C-Means Algorithm
LIST OF ABBREVIATIONS
1. FCM - Fuzzy C-Means
2. MR - Map-Reduce
3. DBMS - Database Management System
4. HDFS - Hadoop Distributed File System
5. RDD - Resilient Distributed Datasets
6. HKM - Hadoop-Based K-Means
7. HFCM - Hadoop-Based Fuzzy C-Means
8. GoogleFS - Google File System
1. Introduction
Clustering is the unsupervised classification of patterns. The clustering problem arises in many research areas and is especially important in the field of data mining. Clustering covers application areas such as image segmentation, object recognition, and information retrieval, and falls under the roof of data analytics. It is useful in several pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems there is little prior information (e.g., statistical models) available about the data, and the decision-maker must make as few assumptions about the data as possible. It is under these restrictions that clustering methodology is particularly appropriate for exploring the interrelationships among the data points in order to make a (perhaps preliminary) assessment of their structure [1].
Figure 1: Stages of Clustering
The aim of data clustering is to find structure in data; it is therefore exploratory in nature. This report reviews well-known clustering methods, discusses the major challenges and key issues in designing clustering algorithms, and points out some emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during clustering, and large-scale data clustering.
Data clustering has been used for three main purposes:
1. Underlying structure: to gain insight into data, generate hypotheses, detect anomalies, and identify salient features.
2. Natural classification: to identify the degree of similarity among forms or organisms (phylogenetic relationships).
3. Compression: as a method for organizing the data and summarizing it through cluster prototypes. [2]
Defining big data requires considering many aspects; here we define it as an amount of data, structured or unstructured, that cannot be processed by today's Database Management Systems (DBMS) or any related tools. The question then arises of how to handle such a huge amount of data: the market demands that these volumes be processed within fractions of a second, yet DBMSs are unable to fulfill this requirement [3].
One solution to the above problem is data clustering. Data clustering identifies patterns among the data and then models the unstructured data. Modeling leads to a vector representation of the data, which is useful for similarity matching; according to similarity, very similar data are grouped together. Grouping similar data provides ease of searching and ease of data retrieval. The other solution is to use big-data processing tools such as Hadoop's Map-Reduce (MR) framework, based on the Hadoop Distributed File System (HDFS). Clustering big data using big-data processing tools solves the above problem. Hence, we have carried out a detailed survey of data clustering algorithms for big data and studied how they work.
Hard clustering algorithms assign each pattern to one and only one cluster, both during operation and in the output. A fuzzy clustering method instead assigns each input pattern degrees of membership in several clusters, so a point may belong to many clusters. In this work we concentrate on fuzzy data clustering algorithms such as Fuzzy C-means, which is similar to K-means with slight variations [4].
1.1. Fuzzy C-Means
The idea of fuzzy membership, i.e., assigning each input pattern to many clusters, was introduced by Bezdek in 1984. The FCM method allows membership cardinality from 1 to N: each pattern in the dataset may belong to anywhere from one to N clusters. FCM can also be considered an advanced version of the K-means algorithm. In FCM, data are bound to each cluster by means of a membership function, which represents the fuzzy behavior of the algorithm.
Looking at Figure 2, we may identify two clusters in the proximity of the two data concentrations; we will refer to them as 'A' and 'B'. In the first approach, the k-means algorithm, each data pattern is associated with a specific centroid, so the membership function looks like the matrix below.
Figure 2: Clustering Algorithm Behavior
In the figure above, the data pattern shown as a red spot belongs more to cluster B than to cluster A. The value m = 0.2 indicates its degree of membership in A. Instead of a graphical representation, we now introduce a matrix U whose entries are taken from the membership functions:

U = [[1, 0],
     [0, 1]]   (membership matrix for K-means)
U = [[0.8, 0.2],
     [0.3, 0.7]]   (membership matrix for Fuzzy C-Means)
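A membership matrix like the one above can be computed from the point-to-centroid distances. The following is a minimal sketch of the standard FCM membership formula with fuzzifier m = 2 (it assumes NumPy; the function name and the sample data are illustrative, not taken from this report):

```python
import numpy as np

# Hedged sketch: derive a fuzzy membership matrix U from point-to-centroid
# distances using the standard FCM membership formula,
#   u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
def fuzzy_memberships(points, centers, m=2.0):
    # distances: one row per point, one column per cluster center
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)                 # avoid division by zero at a center
    power = 2.0 / (m - 1.0)
    ratio = (d[:, :, None] / d[:, None, :]) ** power
    return 1.0 / ratio.sum(axis=2)        # each row sums to 1

points = np.array([[0.0, 0.0], [1.0, 0.1], [4.0, 4.0]])
centers = np.array([[0.5, 0.0], [4.0, 4.0]])
U = fuzzy_memberships(points, centers)
print(U)  # the third point belongs almost wholly to the second cluster
```

Each row of U sums to 1, which is the probabilistic constraint FCM places on memberships.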
1.2. Hadoop
Hadoop is a collection of related subprojects that fall under the umbrella of
infrastructure for distributed computing. These projects are hosted by the Apache
Software Foundation, which provides support for a community of open source software
projects. Although Hadoop is best known for Map-Reduce and its distributed file system
(HDFS, renamed from NDFS), the other subprojects provide complementary services, or
build on the core to add higher-level abstractions. The subprojects, and where they sit in
the technology stack, are shown in Figure and described briefly here:
1. Core
A set of components and interfaces for distributed file systems and general I/O
(serialization, Java RPC, persistent data structures).
2. Avro
A data serialization system for efficient, cross-language RPC, and persistent data
storage. (At the time of this writing, Avro had been created only as a new subproject, and
no other Hadoop subprojects were using it yet.)
3. Map-Reduce
A distributed data processing model and execution environment that runs on large
clusters of commodity machines.
4. HDFS
A distributed file system that runs on large clusters of commodity machines.
5. Pig
A data flow language and execution environment for exploring very large datasets.
Pig runs on HDFS and Map-Reduce clusters.
6. HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying
storage, and supports both batch-style computations using Map-Reduce and point queries
(random reads).
7. ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
8. Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to Map-Reduce jobs) for querying the data.
9. Chukwa
A distributed data collection and analysis system. Chukwa runs collectors that store
data in HDFS, and it uses Map-Reduce to produce reports. (At the time of this writing,
Chukwa had only recently graduated from a “contrib” module in Core to its own
subproject.) [Hadoop Definitive guide]
1.3. Hadoop (Map-Reduce Framework)
Google implemented a highly scalable and easily adaptable processing and storage
architecture, centered around the ‘map-reduce’ paradigm borrowed from functional
programming languages, and GoogleFS, a fault-tolerant distributed filesystem. It is an
abstraction for large-scale computation: a simple programming model that applies to many large-scale computing problems.
The Map-Reduce runtime library hides the messy details:
1. Automatic parallelization
2. Load balancing
3. Network and disk transfer optimization
4. Handling of machine failures
5. Robustness
Map: extract something you care about from each record
Reduce: aggregate, summarize, filter, or transform
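The map/shuffle/reduce pattern just described can be illustrated with the canonical word-count example. This is a plain-Python sketch of the pattern itself, not the Hadoop API:

```python
from collections import defaultdict

# Illustrative sketch of the Map-Reduce pattern applied to word counting.
def map_phase(record):
    # "extract something you care about from each record"
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # group intermediate values by key, as the runtime does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # "aggregate, summarize, filter, or transform"
    return key, sum(values)

records = ["big data clustering", "fuzzy clustering of big data"]
intermediate = [pair for r in records for pair in map_phase(r)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 2, 'data': 2, 'clustering': 2, 'fuzzy': 1, 'of': 1}
```

In the real framework, the map and reduce calls run in parallel on many machines, and the shuffle is performed by the runtime over the network.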
1.4. Spark
Map-Reduce and its variants have been highly successful in implementing large-scale
data-intensive applications on commodity clusters. However, most of these systems are
built around an acyclic data flow model that is not suitable for other popular applications.
This includes many iterative machine learning algorithms, as well as interactive data
analysis tools. Spark introduces an abstraction called “Resilient Distributed Datasets”
(RDDs). An RDD is a read-only collection of objects partitioned across a set of machines
that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative
machine learning jobs [8].
1.5. Applications
Big data clustering is typically used to provide a fast solution to the data heterogeneity problem: heterogeneous data can be converted into clusters of similar data gathered at a single place. Clustering is a computationally intensive task; indeed, data clustering is NP-Hard even for two nodes. Here, speedup is achieved over an on-the-fly network of distributed systems, which may have different processing capacities or different platforms.
Fuzzy C-means produces overlapping results on the same dataset. Text-based clustering is a very computationally intensive task, but it is required in many areas of day-to-day life [9].
Table 1: Data Clustering Applications [10]

Application | What is clustered | Benefit
Search result clustering | Search results | More effective information presentation to the user
Scatter-Gather | (subsets of) collection | Alternative user interface: "search without typing"
Collection clustering | Collection | Effective information presentation for exploratory browsing
Language modeling | Collection | Increased precision and/or recall
Cluster-based retrieval | Collection | Higher efficiency: faster search
1.6. Proposed Work
1. Problem Statement
Design and implementation of the Fuzzy C-means clustering algorithm using Map-Reduce.
2. Significance
Fuzzy C-means has a wide variety of applications in image-processing-based clustering, and many text-based applications can also be solved using it. A performance evaluation of Fuzzy C-means on text-based clustering will provide a new platform for fuzzy clustering algorithms on distributed computing. The reason for selecting distributed computing is that it allows on-the-fly network formation: heterogeneous systems can come under one roof and increase the overall processing capacity, which in turn solves the data clustering problem in less time. Text-based clustering has many applications, such as finding a good way to distribute a big graph [11].
3. Objectives
1. Design of a processing model of the C-means algorithm for Map-Reduce
2. Implementation of the C-means algorithm on Map-Reduce
3. Testing and performance analysis of the above algorithm with big data on Map-Reduce
4. Comparison of C-means with other equivalent work
1.7. Organization of Dissertation Report
Chapter 2 gives a literature survey of the evolution of data clustering algorithms on big data, together with a table of clustering applications using big data. Chapter 3 presents the methodology adopted for the dissertation work, including flow charts and pseudo code for the proposed algorithms. Chapter 4 gives the implementation details. Chapter 5 is dedicated to test data, hardware configuration specifications, and experimental results; the results for classical K-means, Map-Reduce-based K-means, classical FCM, and Map-Reduce-based FCM, with their related speedup graphs, are presented in this chapter. Chapter 6 gives the conclusion and future scope, which ends the dissertation report.
2. Literature Review
2.1. Background
In the field of data mining, K-means is the most popular clustering algorithm because of its simplicity. It clusters similar data based on a similarity criterion. K-means is NP-Hard even for two nodes. K-means uses the Euclidean distance to measure the distance between two points lying in the same plane, and it aims at minimizing an objective function.
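The objective function referred to above is the sum of squared Euclidean distances from each point to its nearest center. A small sketch of computing it (assuming NumPy; the function name and the data are illustrative):

```python
import numpy as np

# Hedged sketch of the K-means objective: the sum of squared Euclidean
# distances from each point to its assigned (nearest) center.
def kmeans_objective(points, centers):
    # distance from every point to every center
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assignment = d.argmin(axis=1)       # hard assignment: one cluster per point
    return (d[np.arange(len(points)), assignment] ** 2).sum()

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
centers = np.array([[0.0, 1.0], [10.0, 10.0]])
print(kmeans_objective(points, centers))  # 1.0 + 1.0 + 0.0 = 2.0
```

Each Lloyd iteration of K-means (reassign points, recompute centers as means) never increases this quantity, which is why the algorithm converges.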
2.2. Clustering Taxonomy Based on Similarity Measure [12]
1. Agglomerative vs. divisive: This taxonomy relates to algorithmic structure and operation. An agglomerative approach begins with each pattern in a separate (singleton) cluster and successively merges clusters until a stopping criterion is met. A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met [13][14].
2. Hard vs. fuzzy: A hard clustering algorithm assigns each pattern to one and only one cluster, both during its operation and in its output. A fuzzy clustering method assigns each input pattern degrees of membership in several clusters, so points may lie in many clusters [15][16][17][18].
3. Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm [19].
4. Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared-error function. The optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.
2.3. Fuzzy C-means Algorithm (FCM)
FCM was developed from basic K-means as a variation intended to achieve better accuracy, ease of implementation, and speed-up. In 1974, Dunn proposed a fuzzy version of K-means as an optimal, fuzzy version of the least-squared-error partitioning problem [20].
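This line of work converged on the now-standard alternating iteration: update the membership matrix from the current centers, recompute the centers as membership-weighted means, and stop when the memberships stabilize. A compact sketch under those assumptions (NumPy-based; not code from any of the papers reviewed here):

```python
import numpy as np

# Hedged sketch of the classical FCM alternating iteration.
def fcm(points, k, m=2.0, tol=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((len(points), k))
    u /= u.sum(axis=1, keepdims=True)              # rows of U sum to 1
    for _ in range(max_iter):
        um = u ** m
        # centers: membership-weighted means of the points
        centers = (um.T @ points) / um.sum(axis=0)[:, None]
        d = np.fmax(np.linalg.norm(points[:, None] - centers[None], axis=2), 1e-12)
        # membership update: u_ij = 1 / sum_k (d_ij/d_ik)^(2/(m-1))
        new_u = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
        if np.abs(new_u - u).max() < tol:          # stopping criterion
            return centers, new_u
        u = new_u
    return centers, u

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centers, u = fcm(points, k=2)
# with well-separated data, each center settles near one concentration
```

The fuzzifier m controls how soft the memberships are: as m approaches 1 the result approaches hard K-means, while larger m spreads membership across clusters.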
Robert, Jitendra, and Bezdek (1986) [23] experimented with two versions of the FCM clustering algorithm: an approximate fuzzy c-means (AFCM) implementation, based on replacing the exact quantities in the FCM equations with integer-valued or real-valued estimates, and a literal fuzzy C-means algorithm (LFCM) with a table-driven approach. The approximation enables AFCM to exploit a lookup-table approach for computing Euclidean distances and for exponentiation. The net effect of the proposed implementation is that CPU time per iteration is reduced to approximately one sixth of the time required for a literal implementation of the algorithm, while apparently preserving the overall quality of the terminal clusters produced. The two implementations were tested numerically on a nine-band digital image.
Bezdek, Ehrlich, and Full (1984) [24] implemented a FORTRAN-IV coding of the fuzzy c-means (FCM) clustering program. The FCM program is applicable to a wide variety of geostatistical data analysis problems: it generates fuzzy partitions and prototypes for any set of numerical data. These partitions are useful for corroborating known substructures or suggesting substructure in unexplored data. The clustering criterion used to aggregate subsets is a generalized least-squares objective function. Features of the program include a choice of three norms (Euclidean, Diagonal, or Mahalanobis), an adjustable weighting factor that essentially controls sensitivity to noise, acceptance of variable numbers of clusters, and outputs that include several measures of cluster validity.
Alexandre and Leandro (2011) [25] proposed an FCM implementation based on the particle swarm optimization technique, named Fuzzy Particle Swarm Clustering (FPSC). The algorithm is an extension of the crisp particle swarm clustering (PSC) algorithm. The main structural changes to the original PSC algorithm in designing FPSC occurred in the selection and evaluation steps of the winner particle, comparing the degree of membership of each object from the database with respect to the particles in the swarm.
Nikhil, Kuhu, James, and Bezdek (2005) [26] observed drawbacks of FPCM (Fuzzy Possibilistic C-Means) and proposed a new algorithm based on fuzzy C-means, named Possibilistic Fuzzy C-Means (PFCM) clustering. FPCM generates both membership and typicality values when clustering unlabeled data, and it constrains the typicality values so that the sum of typicalities to a cluster over all data points is one. This row sum constraint produces unrealistic typicality values for large datasets. PFCM produces memberships and possibilities simultaneously, along with the usual point prototypes or cluster centers for each cluster. PFCM is a hybridization of possibilistic c-means (PCM) and fuzzy c-means (FCM) that often avoids various problems of PCM, FCM, and FPCM: it solves the noise sensitivity defect of FCM, overcomes the coincident-clusters problem of PCM, and eliminates the row sum constraints of FPCM.
Andrew and Khaled (2013) [27] experimented with sentence-based clustering of data. Sentences are at the heart of documents and of any kind of communication, and they may belong to more than one theme or topic present within a document or set of documents. However, because most sentence similarity measures do not represent sentences in a common metric space, conventional fuzzy clustering approaches based on prototypes or mixtures of Gaussians are generally not applicable to sentence clustering. The authors propose a novel fuzzy clustering algorithm that operates on relational input data, i.e., data in the form of a square matrix of pairwise similarities between data objects. The algorithm uses a graph representation of the data and operates in an Expectation-Maximization framework in which the graph centrality of an object is interpreted as a likelihood. Results of applying the algorithm to sentence clustering tasks demonstrate that it is capable of identifying overlapping clusters of semantically related sentences, and that it is therefore of potential use in a variety of text mining tasks.
Suganya and Shanthi (2012) [28] reviewed three different implementations of the Fuzzy C-means algorithm, presenting the objectives and the advantages and disadvantages of each method in tabular format. Table 2 summarizes several variations of Fuzzy C-Means, along with the advantages and disadvantages of each.
Table 2: Various Fuzzy C-Means Implementations

Algorithm | Advantages | Disadvantages
Fuzzy C-Means (FCM) [20] | Unsupervised; converges | Long computational time; sensitive to the initial guess (speed, local minima); sensitive to noise: low (or even no) membership degrees are expected for outliers (noisy points)
Possibilistic C-Means (PCM) [26] | Can cluster noisy data samples | Very sensitive to good initialization; coincident clusters may result because the columns and rows of the typicality matrix are independent of each other
Fuzzy Possibilistic C-Means (FPCM) [26] | Avoids the noise sensitivity deficiency of FCM; overcomes the coincident-clusters problem of PCM | The row sums of typicalities must equal one
Possibilistic Fuzzy C-Means (PFCM) [26] | Avoids the noise sensitivity deficiency of FCM; overcomes the coincident-clusters problem of PCM; eliminates the row sum constraints of FPCM | (none noted)
2.4. Fuzzy C-means Algorithm on Big Data
Clustering simplifies big data for processing. Clustering is defined as gathering similar data at one place and separating it from dissimilar content: a similarity measure is defined between items, and similar items are picked to form a cluster. Equivalently, clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). Clustering is an NP-Hard problem, and it becomes more complex and time-consuming for big data; nevertheless, it is a fact that clustering helps in the efficient handling of big data. Clustering data is a fundamental problem in a variety of areas of computer science and related fields, such as machine learning, data mining, pattern recognition, image analysis, and information retrieval; clustering itself is not one specific algorithm.

The concept of parallelization appears because of large datasets, or big data. However, parallelization faces the obstacle of data dependency. Where data dependencies exist, synchronization is required, or serial code must execute in the critical section; this reduces the performance of parallel algorithms. So, some part of the processing needs to be fully parallel in order to utilize all the available resources, and a simple programming framework is required that provides an abstraction over parallelization, data distribution, and load balancing.
Distributed environments and the Map-Reduce architecture appeared to solve the data clustering problem with a speedup compared to sequential execution. The Map-Reduce library was created as an abstraction: it allows the developer to express a simple computation while hiding the details of parallelization, fault tolerance, data distribution, and load balancing in the library [1]. Developing an algorithm that best suits big data processing is a challenging task. For big data processing, Apache developed the Hadoop architecture. Hadoop is a distributed-environment architecture: its components are located at remote places, and the computers at those places communicate and coordinate their actions by passing messages to each other. Hadoop supports the Hadoop Distributed File System (HDFS), a Java-based file system that provides expandable and reliable data storage designed to span large clusters of servers. HDFS is designed to be a scalable, fault-tolerant, distributed storage system that works closely with Map-Reduce [2].
Map-Reduce is a distributed programming model and an implementation for processing and generating large datasets in parallel. A Map-Reduce program is based on three functions: Map(), Shuffle(), and Reduce(). Map() distributes work over the data on the different nodes of the distributed environment, and Reduce() gathers the results at a single location. The Map-Reduce system (also called the "infrastructure" or "framework") orchestrates the processing by marshaling the distributed servers, running the various tasks in parallel, managing all communications and data transfers among the various parts of the system, and providing redundancy and fault tolerance.
Anchalia, Prajesh P., Anjan K. Koundinya, and N. K. Srinath [9] research in the area of data mining, with faster information retrieval as the main goal; clustering is useful for fast retrieval of information. The authors proposed a parallel implementation of the K-means clustering algorithm, implemented on the Map-Reduce architecture to achieve speedup in the formation of data clusters. They observed that outlier handling is very important when implementing K-means on Map-Reduce, and that improvements in the stopping criterion and proper initialization of the clusters may lead to better results. The algorithm and implementation details for K-means on the Map-Reduce architecture are given by the authors.
Zhao, Weizhong, Huifang, and Qinge [10] state, based on their experimental research, that data clustering has been an important research area: as data increases, handling and maintenance become difficult and clustering becomes a much more complex task. To deal with this problem, they propose a parallel K-means clustering algorithm based on Map-Reduce; K-means is a simple and easy-to-implement clustering algorithm. The map function assigns each sample to its closest center, and the reduce function updates the center values. To decrease network communication, a combiner function is introduced to handle the intermediate values. The key is the offset in bytes of a record from the start of the data file, and the value is a string with the content of that record. The dataset is divided and broadcast to all mappers, and the distance calculations are performed simultaneously. For each map task, K-means maintains a global variable: an array storing the centers of the clusters. With this information, the mapper can compute the closest center for each point; the intermediate values are then composed of two parts, the index of the closest center and the sample information.
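The mapper/combiner/reducer division described above can be sketched in plain Python (a hypothetical illustration of the pattern, not Zhao et al.'s code; the sample data, the function names, and the 2-D dimensionality are assumptions):

```python
import numpy as np
from collections import defaultdict

# Sketch of one parallel K-means iteration in the style described above:
# the mapper assigns each sample to its closest center, a combiner
# pre-aggregates partial sums per split to cut network traffic, and the
# reducer recomputes the center values.
def mapper(split, centers):
    # emit (closest-center index, sample) for each sample in this split
    for x in split:
        idx = int(np.argmin([np.linalg.norm(x - c) for c in centers]))
        yield idx, x

def combiner(pairs):
    # pre-aggregate per center: the partial sum and count from one split
    partial = defaultdict(lambda: [np.zeros(2), 0])   # 2-D samples assumed
    for idx, x in pairs:
        partial[idx][0] += x
        partial[idx][1] += 1
    return [(idx, (s, n)) for idx, (s, n) in partial.items()]

def reducer(grouped):
    # merge partial sums from all splits and emit the new center values
    new = {}
    for idx, parts in grouped.items():
        total = sum(s for s, _ in parts)
        count = sum(n for _, n in parts)
        new[idx] = total / count
    return new

centers = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
splits = [np.array([[0.0, 1.0], [1.0, 0.0]]),
          np.array([[9.0, 10.0], [11.0, 10.0]])]

grouped = defaultdict(list)
for split in splits:                       # map + combine, split by split
    for idx, part in combiner(mapper(split, centers)):
        grouped[idx].append(part)
new_centers = reducer(grouped)
print(new_centers)                         # centers move to [0.5, 0.5] and [10.0, 10.0]
```

Because the combiner ships only one (sum, count) pair per center per split, the data crossing the network is independent of the number of samples.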
Ene, Alina, Sungjin Im, and Benjamin Moseley [11] experimented with clustering by designing algorithms that could be implemented on the Map-Reduce architecture, focusing on two well-studied problems: k-median and k-center. Evaluations were performed on various serial and parallel algorithms for the k-median problem. A k-median sample contains more information than a normal sample for each unsampled point x; to handle the overhead of this extra information, the authors propose selecting the sampled point nearest to x and adding to each sampled point y an additional weight equal to the number of unsampled points that picked y as their closest point. The Map-Reduce k-median algorithm takes extra time to assign a weight to each point in the sample, an overhead that needs to be reduced in a Map-Reduce implementation. The authors also note that the Map-Reduce architecture is not considered the best programming model for iterative programming.
Esteves, Rui Máximo, and Chunming Rong [12] experimented on a big, realistic, noisy dataset using data clustering algorithms, namely k-means and FCM (fuzzy c-means). The evaluation was made using a free cloud computing solution, Apache Mahout, and Wikipedia's latest articles. The authors showed that dimensionality reduction plays a crucial role in document clustering, and that in the presence of noise FCM gives worse results than k-means. Initialization of the cluster centers affects the convergence speed of both algorithms, and different initialization methods yield different convergence times on a big dataset. In general FCM is faster than k-means, but with random initialization the results differ, and no one can predict which algorithm will be faster.
Xiaojun, Junying, and Haitao [13] proposed a distance regulatory factor to improve the fuzzy C-means algorithm. Conventional Fuzzy C-Means (FCM) uses Euclidean distance as the similarity measurement criterion between data points. Euclidean distance is limited to equally distributed datasets, and clustering performance decreases with the data structure, including cluster shape and cluster density. The solution to this problem is a distance regulatory factor that corrects the similarity measurement when computing the similarity between a cluster center and a sample point. The factor takes cluster density into account, which represents the global distribution information of points in a cluster, and it is applied to conventional FCM for distance correction. The results show good tolerance over different cluster densities.
Xie, Jiong, Shu Yin, et al. [14] show that ignoring data locality in a heterogeneous environment reduces Map-Reduce performance. While the other reviewed works relate to clustering using Map-Reduce, these authors show that data clustering is a prerequisite for Map-Reduce jobs to improve performance. They point out the problem of data distribution in a distributed environment with the Map-Reduce architecture: instead of distributing random data to random nodes, data clustering should be performed first, and then similar data transferred to the same node in the distributed environment. Data clustering is thus used to enhance data locality in a single read; data-intensive applications will give much better results in terms of time and of balancing data across the nodes.
Ferreira Cordeiro, Robson Leonardo, Caetano Traina Junior, et al. [15] observe some problems when using Map-Reduce for data clustering, one of which is the I/O bottleneck. To overcome this problem, they minimize the I/O cost and consider the already existing data partition, which minimizes the network cost among processing nodes. To remove the bottleneck, the authors propose the Best of Both Worlds (BOW) strategy, with a cost function that is able to choose the best strategy. It works with almost any serial clustering method as a plug-in subroutine, balancing the cost of disk accesses against that of network accesses, achieving very good results between the two, and it uses no user-defined parameters. The authors' reports are based on experimentation on real and artificial data with billions of points, using up to 1,024 cores in parallel: 0.2 TB of multi-dimensional data took only 8 minutes to cluster using 128 cores.
Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, and Seung-Hee Bae [16] designed and implemented Twister, a distributed in-memory Map-Reduce runtime. It is an improvement over the Map-Reduce architecture, carried out to optimize iterative operations and computations over Map-Reduce. The authors present Twister's extended programming model and its architecture, compare their model with Map-Reduce, and report their experiments. The result is an abstraction over Map-Reduce for a wide variety of applications, mostly related to data clustering, computer vision, and machine learning. The authors present results on a set of applications with big datasets; some of the benchmarks performed indicate that Twister performs and scales well for many iterative Map-Reduce computations.
Table 3 lists clustering algorithms that have been implemented on Hadoop and its related technologies.
Table 3: List of Hadoop-Based Clustering Applications

Paper | Pros | Cons
Parallel k-means clustering based on Map-Reduce [10] | Combiner() included to reduce communication overhead; distance calculations performed simultaneously | Initialization of center values not given; stopping criterion not mentioned
Fast clustering using Map-Reduce [11] | Constant number of Map-Reduce rounds; works in O(log n log log n) time | All experimentation performed on a single machine
Using Mahout for clustering Wikipedia's latest articles: a comparison between K-means and fuzzy C-means in the cloud [12] | Showed that the Mahout library is a promising clustering technique; conclusions on convergence rate and preprocessing tools given based on experimentation | Considered dataset is artificial
Improved fuzzy C-means clustering algorithm based on cluster density [13] | Similarity criterion changed; good results for different cluster densities | Not designed for the Map-Reduce architecture or parallel implementation
Improving Map-Reduce performance through data placement in heterogeneous Hadoop clusters [14] | Data placement considered before data clustering | Overhead increases before data clustering or during data distribution
Clustering very large multi-dimensional datasets with Map-Reduce [15] | Minimizes I/O and network cost; works as a plugin; experimentation on both real and artificial data | Design not useful for overlapping datasets
Twister: a runtime for iterative Map-Reduce [16] | Improvement over the Map-Reduce architecture; performs well with iterative computations; suitable for data clustering applications | Implemented on top of the Map-Reduce architecture
2.5. Summary
Hard clustering is not considered the best solution for overlapping datasets and their clusters. Soft clustering is considered for data clustering on overlapping datasets. Among soft clustering techniques, Fuzzy C-Means is a well known algorithm and can be treated as an advanced version of the K-means algorithm, in which the assignment of data points to clusters depends on the fuzziness of each data point. Fuzzy C-Means allows the same sample into many clusters according to its fuzzy membership with each cluster center. Fuzzy clustering is a promising idea for data clustering. Map-Reduce (Hadoop) is an architecture for distributed computing. Many details of parallelization and load balancing are kept abstract from the programmer, so that the programmer can concentrate on the actual logic. It is also an open source programming framework capable of handling large datasets, or Bigdata. The authors are looking into solving the data clustering problem using fuzzy clustering with Map-Reduce for voluminous overlapping datasets, or Bigdata.
Along with fuzziness over Map-Reduce on Bigdata, fast and accurate clustering requires many aspects, such as:
1. A proper initialization method for the cluster center points
2. A similarity criterion that requires less computation and helps the algorithm converge early
3. A stopping criterion for the algorithm
4. Data placement in a heterogeneous network as a preprocessing step for data clustering
The points mentioned above are considered as variations over simple soft clustering algorithms such as the Fuzzy C-Means algorithm, which will improve the performance of soft clustering. These algorithms need to be parallelized for performance evaluation. The above points can also be considered as optimization points for the Fuzzy C-Means algorithm.
3. Design & Implementation
Fuzzy C-Means (FCM) starts by vectorizing the elements and choosing centroids from them as a reference for further operations. Based on a similarity criterion, each point has to be placed in its matching cluster. Here, each individual point calculates its distance to each centroid as the similarity criterion, and membership values are based on fuzziness. Among the various distance calculation methods, we consider the Euclidean distance because of its simplicity of calculation. FCM is, moreover, an advanced version of K-means.
In many situations, fuzzy clustering is more natural than hard clustering. Objects on the boundaries between several classes are not forced to fully belong to one of the classes, but rather are assigned membership degrees between zero and one indicating their partial membership. Fuzzy c-means clustering was first reported in the literature for a special case (m=2) by Joe Dunn in 1974. Jim Bezdek developed the general case (for any m greater than 1) in his PhD thesis at Cornell University in 1973, and improved it in 1981. FCM employs fuzzy partitioning such that a data point can belong to all groups with different membership grades between 0 and 1.
3.1. Fuzzy C-means
The objective function of the algorithm defines the clustering criterion used to aggregate subsets; it is a generalized least-squares objective function. Features of this algorithm are the Euclidean distance, which is simple to calculate; an adjustable weighting factor that essentially controls sensitivity to noise; acceptance of variable numbers of clusters; and outputs that include overlapping clusters.
3.2. Objective Function
The FCM algorithms has best described by recasting conditions (equation 1) in
matrix-theoretic terms. Towards this end, let U be a real c × N matrix, U = [uik]. U is the
matrix representation of the partition {Yi} [28]
u_i(y_k) = u_ik = { 1 if y_k ∈ Y_i ; 0 otherwise }   (3.1)

∑_{k=1}^{N} u_ik > 0   for all i   (3.2)

∑_{i=1}^{c} u_ik = 1   for all k   (3.3)
u_i is a function, u_i: Y → {0, 1}. In conventional models, u_i is the characteristic function of Y_i; in fact, u_i and Y_i determine one another, so there is no harm in labeling u_i the ith hard subset of the partition.
U is referred to as a fuzzy c-partition of Y when the elements of U are numbers in the unit interval [0, 1] that continue to satisfy both equations (3.2) and (3.3). The basis for this definition is c functions u_i: Y → [0, 1] whose values u_i(y_k) ∈ [0, 1] are interpreted as the grades of membership of the y_k in the "fuzzy subsets" u_i of Y.
J_m(U, v) = ∑_{k=1}^{N} ∑_{i=1}^{c} (u_ik)^m ‖y_k − v_i‖²_A   (3.4)

Where,
Y = {y_1, y_2, ..., y_N} ⊂ R^n = the data,   (3.5)
c = number of clusters in Y; 2 ≤ c ≤ n,   (3.6)
m = weighting exponent; 1 ≤ m < ∞,   (3.7)
U = fuzzy c-partition of Y,   (3.8)
v = (v_1, v_2, ..., v_c) = vectors of centers,   (3.9)
v_i = (v_i1, v_i2, ..., v_in) = center of cluster i.   (3.10)
The weight attached to each squared error is (u_ik)^m, the mth power of y_k's membership in cluster i. The vectors {v_i} in equation (3.10) are viewed as "cluster centers" or centers of mass of the partitioning subsets. If m = 1, it can be shown that J_m is minimized only at hard U's ∈ M_c.
Table 4: Objective Function Described

d_ik²                                   squared A-distance from point y_k to the center of mass v_i
(u_ik)^m d_ik²                          squared A-error incurred by representing y_k by v_i, weighted by (a power of) the membership of y_k in cluster i
∑_{i=1}^{c} (u_ik)^m d_ik²              sum of squared A-errors due to y_k's partial replacement by all c of the centers {v_i}
∑_{k=1}^{N} ∑_{i=1}^{c} (u_ik)^m d_ik²  overall weighted sum of generalized A-errors due to replacing Y by v
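The decomposition in Table 4 can be sketched directly as code. The following is a minimal illustrative sketch (not the thesis implementation) that computes J_m from equation (3.4), with the plain Euclidean norm standing in for the A-norm:

```java
// Hedged sketch: J_m(U, v) from equation (3.4), using the plain Euclidean
// norm for ||.||_A. Rows of u index clusters, columns index data points.
public class ObjectiveFunction {
    public static double jm(double[][] u, double[][] y, double[][] v, double m) {
        double total = 0.0;
        for (int k = 0; k < y.length; k++) {           // over data points y_k
            for (int i = 0; i < v.length; i++) {       // over cluster centers v_i
                double d2 = 0.0;                        // squared distance d_ik^2
                for (int a = 0; a < y[k].length; a++) {
                    double diff = y[k][a] - v[i][a];
                    d2 += diff * diff;
                }
                total += Math.pow(u[i][k], m) * d2;     // (u_ik)^m * d_ik^2
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] y = {{0.0}, {1.0}};                  // two 1-D points
        double[][] v = {{0.0}, {1.0}};                  // centers coincide with points
        double[][] u = {{1.0, 0.0}, {0.0, 1.0}};        // hard partition: perfect fit
        System.out.println(jm(u, y, v, 2.0));           // 0.0: no clustering error
    }
}
```

A perfect hard partition whose centers sit exactly on the data gives J_m = 0; any displacement of a center increases the weighted error.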
3.3. Importance of the Fuzzification Factor (m)
The weighting exponent m controls the relative weights placed on each of the squared errors d_ik². As m → 1, partitions that minimize J_m become increasingly hard (and, as mentioned before, at m = 1 they are necessarily hard). Conversely, each entry of an optimal U approaches 1/c as m → ∞. Consequently, increasing m tends to degrade memberships towards the fuzziest state. Each choice of m defines, all other parameters being fixed, a different FCM algorithm. No theoretical or computational evidence distinguishes an optimal m. The range of useful values seems to be [1, 30] or so. If a test set is available for the process under investigation, the best strategy for selecting m at present seems to be experimental. For most data, 1.5 ≤ m ≤ 3.0 gives good results [28].
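The effect of m can be seen numerically with the standard FCM membership update, u_ik = 1 / ∑_j (d_ik/d_jk)^(2/(m−1)). This is a small illustrative sketch (not the thesis code) for a single point with distances d to c = 3 centroids:

```java
// Sketch: standard FCM membership update for one data point, showing how the
// weighting exponent m controls fuzziness (assumes all distances are nonzero).
public class FuzzinessDemo {
    public static double[] memberships(double[] d, double m) {
        double[] u = new double[d.length];
        for (int i = 0; i < d.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < d.length; j++) {
                sum += Math.pow(d[i] / d[j], 2.0 / (m - 1.0));
            }
            u[i] = 1.0 / sum;   // memberships over all clusters sum to 1
        }
        return u;
    }

    public static void main(String[] args) {
        double[] d = {1.0, 2.0, 4.0};              // distances to 3 centroids
        // Near m = 1 the partition is almost hard: the nearest centroid dominates.
        double[] hard = memberships(d, 1.05);
        // Large m degrades memberships towards the fuzziest state, 1/c each.
        double[] fuzzy = memberships(d, 30.0);
        System.out.printf("m=1.05: u_0=%.3f   m=30: u_0=%.3f%n", hard[0], fuzzy[0]);
    }
}
```

With m close to 1 the nearest centroid receives nearly all the membership; with m = 30 every membership is close to 1/3, matching the 1/c limit described above.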
3.4. Document Clustering
Document clustering, also known as text data mining or knowledge discovery from textual databases, is, generally, the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents. All the extracted information is linked together to form new facts or new hypotheses to be explored further by more conventional means of experimentation.
The classic approach of information retrieval, based on keyword search over the WWW, makes it cumbersome for users to look up exact and precise information in the search results. It is up to the user to go through each document to extract the relevant and necessary information from those search results, which is an impractical and tedious task. Document clustering can be a better solution, as it links all the extracted information together, pushes the irrelevant information aside, and keeps the relevant information based on the question of interest.
3.5. Dataset Conversion
The 20_newsgroups dataset is converted from text format into numerical format, as shown in the figure. First, find the unique keywords across all the documents and their counts, i.e. the total count of each unique keyword over all documents. The unique keywords are displayed to the user, who is asked to choose keywords. The unique keywords and their counts are stored in Java's TreeMap collection. The program collects from the user the keywords on which the documents are to be clustered. For each chosen keyword, find the number of occurrences of that keyword in each document and store its count.

TF = Count of Keyword / ∑(Keywords appeared in Document)   (3.11)

Using the above formula, the text data is converted into the numerical data that represents the 20_newsgroups dataset. The properties of a Vectorized object are "Key", "Value" and "Location" (the ID of the object); Value represents the term frequency of the keyword in that document. The same is done for the centroids.
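The TF step above can be sketched as follows. This is an illustrative sketch with hypothetical names (not the thesis code), using a TreeMap for the keyword counts as the text suggests:

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Sketch of equation (3.11): TF = count of keyword / total words in document.
// Class and method names are illustrative, not from the thesis implementation.
public class TfVectorizer {
    public static double tf(List<String> documentWords, String keyword) {
        // TreeMap keeps the unique keywords sorted, as described in section 3.5
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String w : documentWords) {
            counts.merge(w.toLowerCase(), 1, Integer::sum);
        }
        int count = counts.getOrDefault(keyword.toLowerCase(), 0);
        return (double) count / documentWords.size();
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("space", "shuttle", "space", "launch");
        System.out.println(tf(doc, "space"));   // 2 occurrences out of 4 words
    }
}
```

The resulting TF values per (document, keyword) pair form the Vector_Dataset consumed by the clustering algorithms below.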
Figure 3 : Dataset Conversion
3.6. Classical K-Means
Algorithm

Procedure KMeans
Begin
    Repeat
        Count the frequency of unique words in the document
        Count the global_Frequency of unique words in each document
    Until all documents are scanned
    Display the keywords to the user
    Get user input of keywords on which document clustering should be done
    Repeat
        Calculate TF per Document:Keyword
            TF = Count of Keyword / ∑(Keywords appeared in Document)
        Store TF in a Vector_Dataset
    Until all documents are scanned
    Randomly select K points from Vector_Dataset as centroids
    Step I: Vectorization:
        Store TF in a Vectorized object
        Store centroids in a Vectorized object
    Repeat
        Repeat
            Step II: Calculate distance (Euclidean distance):
                1. Calculate the Euclidean distance between the current document and each cluster centroid:
                       d(x, y) = √( ∑_{i=1}^{N} (x_i − y_i)² )
                2. Check whether the distance is smaller than the previous minimum
                3. If yes, update: the current document belongs to this ith cluster
        Until all documents are scanned
        Update centroids
    Until all iterations are over
End

Figure 4: Algorithm of Classical K-Means
Flowchart
Figure 5 : Flowchart of Classical K-Means
Initially, convert the Vector_Dataset to Vectorized objects as mentioned in section 3.5. Start iterating over the vectorized data and compute the distance (Euclidean distance) between the current document's Vectorized object and each centroid. The document will belong to the centroid that gives the shortest distance. Compare the distances for all centroids and store the smallest distance and the index of that centroid. Now, copy the point into the new cluster to which it belongs and update the centroid values. The next iteration iterates over the modified centroid values.
The above procedure is known as hard clustering, where each document belongs to one and only one cluster.
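The hard assignment step described above can be sketched in a few lines. This is an illustrative sketch (not the thesis implementation) of Step II of Figure 4:

```java
// Sketch of Step II: assign a document vector to the nearest centroid by
// Euclidean distance. Names are illustrative, not from the thesis code.
public class NearestCentroid {
    public static double distance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);     // d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    }

    // returns the index of the centroid with the smallest distance to doc
    public static int assign(double[] doc, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double d = distance(doc, centroids[i]);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {1.0, 1.0}};
        System.out.println(assign(new double[]{0.9, 0.8}, centroids)); // nearer to {1,1}
    }
}
```

Each document ends up in exactly one cluster, which is what makes this hard clustering.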
3.7. Map-Reduce Based K-Means
Algorithm

Procedure MapReduceKMeans
Begin (sequential)
    Repeat
        Count the frequency of unique words in the document
        Count the global_Frequency of unique words in each document
    Until all documents are scanned
    Display the keywords to the user
    Get user input of keywords on which document clustering should be done
    Repeat
        Calculate TF per Document:Keyword
            TF = Count of Keyword / ∑(Keywords appeared in Document)
        Store TF in a Vector_Dataset
    Until all documents are scanned
    Randomly select K points from Vector_Dataset as centroids
End (sequential)

Begin (MapReduce)
    Map the given dataset over all available nodes (based on split size)
    MAP PHASE (on all nodes)
        Step I: Vectorization:
            Store TF in a Vectorized object
            Store centroids in a Vectorized object
        Repeat
            Repeat
                Step II: Calculate distance (Euclidean distance):
                    1. Calculate the Euclidean distance between the current document and each cluster centroid:
                           d(x, y) = √( ∑_{i=1}^{N} (x_i − y_i)² )
                    2. Check whether the distance is smaller than the previous minimum
                    3. If yes, update: the current document belongs to this ith cluster
            Until all documents are scanned
            Update centroids
        Until all iterations are over
    REDUCE PHASE (from all nodes)
        Gather all results and store them on a single node
End

Figure 6: Algorithm of Hadoop Based K-Means
Flowchart
Figure 7: Flowchart of Hadoop Based K-Means
Initially, follow the data conversion method mentioned in section 3.5. The converted vectorized data is distributed over the Hadoop nodes, according to the Hadoop replication configuration. During processing, slices of the dataset are created according to the split size and distributed over the Hadoop nodes. The difference between these two statements is that the first replicates the data a number of times, reducing communication overhead at the cost of space on several nodes, while the second slices the required data from an available replica optimally nearest to a processing node that does not hold a copy of the data. Sometimes data splitting increases the communication overhead and consequently decreases the throughput of the system.
Follow section 3.6 for the K-means operation, but in this case each node runs the same K-means code. On completion of the distance calculation, reduce the data from each node to one single node and update the centroids. Repeat the same procedure for n iterations.
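The map/reduce split described above can be sketched without the Hadoop API, as a plain-Java simulation under the assumption that "map" emits (nearest-centroid-index, document-vector) pairs and "reduce" averages the vectors per key into updated centroids (names are illustrative, not the thesis code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch of one Hadoop-style K-means round, simulated in plain Java:
// MAP groups each document under its nearest centroid index; REDUCE averages
// each group into a new centroid. Not the thesis implementation.
public class MapReduceSketch {
    static double dist(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }

    public static Map<Integer, double[]> step(List<double[]> docs, double[][] centroids) {
        // MAP: emit (nearestCentroidIndex, documentVector)
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (double[] doc : docs) {
            int best = 0;
            for (int i = 1; i < centroids.length; i++)
                if (dist(doc, centroids[i]) < dist(doc, centroids[best])) best = i;
            groups.computeIfAbsent(best, k -> new ArrayList<>()).add(doc);
        }
        // REDUCE: average the vectors per key to produce updated centroids
        Map<Integer, double[]> updated = new HashMap<>();
        for (Map.Entry<Integer, List<double[]>> e : groups.entrySet()) {
            double[] mean = new double[centroids[0].length];
            for (double[] doc : e.getValue())
                for (int i = 0; i < mean.length; i++) mean[i] += doc[i] / e.getValue().size();
            updated.put(e.getKey(), mean);
        }
        return updated;
    }

    public static void main(String[] args) {
        List<double[]> docs = Arrays.asList(new double[]{0, 0}, new double[]{0.2, 0},
                                            new double[]{1, 1}, new double[]{0.8, 1});
        Map<Integer, double[]> c = step(docs, new double[][]{{0, 0}, {1, 1}});
        System.out.println(Arrays.toString(c.get(0)));   // mean of the points near {0,0}
    }
}
```

In the real framework the grouping is done by Hadoop's shuffle on the emitted keys; the logic per mapper and reducer is the same as in this sketch.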
3.8. Classical Fuzzy C-Means
Algorithm

Procedure FCMeans
Begin
    Repeat
        Count the frequency of unique words in the document
        Count the global_Frequency of unique words in each document
    Until all documents are scanned
    Display the keywords to the user
    Get user input of keywords on which document clustering should be done
    Repeat
        Calculate TF per Document:Keyword
            TF = Count of Keyword / ∑(Keywords appeared in Document)
        Store TF in a Vector_Dataset
    Until all documents are scanned
    Randomly select K points from Vector_Dataset as centroids
    Step I: Vectorization:
        1. Store TF in a Vectorized object
        2. Store centroids in a Vectorized object
    Repeat
        Repeat
            Step II: Calculate distance (Euclidean distance):
                Repeat
                    Calculate the Euclidean distance between the current document and the cluster centroid
                Until all centroids are covered
                Fuzzification factor calculation
            Until the fuzzification factor of the point with all centroids is calculated
            Step III:
                1. If fuzzification factor > threshold:
                2. Update: the current document belongs to this ith cluster
        Until all documents are scanned
        Update centroids
    Until all iterations are over
End

Figure 8: Algorithm of Classical Fuzzy C-Means
Flowchart
Figure 9: Flowchart of Classical Fuzzy C-Means
Fuzzy C-Means follows the same data conversion methods mentioned in section 3.5. Fuzzy C-Means then has the same sequence of steps as K-means, except for the distance calculation formula mentioned in section 3.8.1: the fuzzification factor is calculated in addition to the Euclidean distance. Using the fuzzification factor, the current point is concluded to belong to every cluster for which the factor is beyond the given threshold value.
This is also known as soft clustering. The same point may belong to one cluster or many clusters, depending upon the threshold value and the fuzzification factor.
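Step III of Figure 8 reduces to a simple threshold test over the per-cluster memberships. A minimal illustrative sketch (hypothetical names, not the thesis code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Step III: with fuzzy memberships, a document joins every cluster
// whose membership exceeds the threshold, so one document can land in several
// clusters. Illustrative only, not the thesis implementation.
public class SoftAssignment {
    // memberships of one document in each cluster, e.g. from the FCM update rule
    public static List<Integer> clustersFor(double[] memberships, double threshold) {
        List<Integer> clusters = new ArrayList<>();
        for (int i = 0; i < memberships.length; i++)
            if (memberships[i] > threshold) clusters.add(i);
        return clusters;
    }

    public static void main(String[] args) {
        // an overlapping document: strong membership in clusters 0 and 2
        double[] u = {0.45, 0.10, 0.45};
        System.out.println(clustersFor(u, 0.3));   // belongs to more than one cluster
    }
}
```

With a high threshold the assignment degenerates to hard clustering (at most one cluster per document); lowering it lets overlapping documents appear in several clusters.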
3.9. Map-Reduce Based Fuzzy C-Means
Algorithm

Procedure MapReduceFCMeans
Begin (sequential)
    Repeat
        Count the frequency of unique words in the document
        Count the global_Frequency of unique words in each document
    Until all documents are scanned
    Display the keywords to the user
    Get user input of keywords on which document clustering should be done
    Repeat
        Calculate TF per Document:Keyword
            TF = Count of Keyword / ∑(Keywords appeared in Document)
        Store TF in a Vector_Dataset
    Until all documents are scanned
    Randomly select K points from Vector_Dataset as centroids
End (sequential)

Begin (MapReduce)
    Map the given dataset over all available nodes (based on split size)
    MAP PHASE (on all nodes)
        Step I: Vectorization:
            1. Store TF in a Vectorized object
            2. Store centroids in a Vectorized object
        Repeat
            Repeat
                Step II: Calculate distance (Euclidean distance):
                    Repeat
                        Calculate the Euclidean distance between the current document and the cluster centroid
                    Until all centroids are covered
                    Fuzzification factor calculation
                Until the fuzzification factor of the point with all centroids is calculated
                Step III:
                    If fuzzification factor > threshold:
                        Update: the current document belongs to this ith cluster
            Until all documents are scanned
            Update centroids
        Until all iterations are over
    REDUCE PHASE (from all nodes)
        Gather all results and store them on a single node
End

Figure 10: Algorithm of Map-Reduce Based Fuzzy C-Means
Flowchart
Figure 11: Flowchart of Map-Reduce Based Fuzzy C-Means
Map-Reduce based Fuzzy C-Means is the combination of the classical Fuzzy C-Means implementation and the Map-Reduce based K-Means implementation. Data conversion and data distribution work as preprocessing to Map-Reduce based Fuzzy C-Means, i.e. they run sequentially. Then classical Fuzzy C-Means runs on each node in the Hadoop cluster up to the fuzzification factor calculation. Finally, gather all the fuzzification factors on a single node and check, for each point, where it belongs. Update the centroids and iterate the required number of times.
3.10. Summary
Sequential Fuzzy C-Means is converted into parallel Fuzzy C-Means. Only programming parallelism is achieved; the sequential nature of the algorithm is retained. The programming parallelism is carried out using the Hadoop distributed computing framework.
4. Results and Discussion
4.1. Performance Evaluation Criteria
Fuzzy C-Means is evaluated on the following performance parameters: the speed-up of Map-Reduce based Fuzzy C-Means vs. sequential K-Means on multiple-node Hadoop setups (2, 4, and 8 nodes), for 3 to 6 centroids over 4 and 6 iterations, with Map-Reduce based programming using 4 Mb, 8 Mb, and 32 Mb splits.
4.2. Speed Up
Speed-up is the ratio of sequential run time to parallel run time, as described in equation 4.1.

Speed-up = T_sequential / T_parallel   (4.1)
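Equation 4.1 in code form, with illustrative run times (not measurements quoted from the tables below):

```java
// Sketch of equation (4.1): speed-up is sequential run time divided by
// parallel run time. The sample times below are illustrative values.
public class SpeedUp {
    public static double speedUp(double sequentialTime, double parallelTime) {
        return sequentialTime / parallelTime;
    }

    public static void main(String[] args) {
        // e.g. a sequential run of 420000 vs. a parallel run of 182000
        System.out.println(speedUp(420000, 182000));   // > 1 means parallel was faster
    }
}
```

A value above 1 indicates the parallel run was faster; below 1 indicates the parallel overheads outweighed the gain.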
4.3. Hardware Configuration
The results were taken on an Intel Quad Core i5-2400S CPU (2.5 GHz × 4). The machine is equipped with 4 GB of RAM and a 1 TB hard disk. The operating system is Ubuntu 14.04 LTS. The IDE used for development is Eclipse Luna with Java 6 and above.
4.4. Test Data
The 20 Newsgroups dataset has been considered here for document clustering. It has approximately 20000 articles representing documents. These documents are structured in 20 folders, i.e. logically ordered.

The 20 folders are named:
1. comp.graphics
2. comp.os.ms-windows.misc
3. comp.sys.ibm.pc.hardware
4. comp.sys.mac.hardware
5. comp.windows.x
6. misc.forsale
7. rec.autos
8. rec.motorcycles
9. rec.sport.baseball
10. rec.sport.hockey
11. talk.politics.misc
12. talk.politics.guns
13. talk.politics.mideast
14. sci.crypt
15. sci.electronics
16. sci.med
17. sci.space
18. talk.religion.misc
19. alt.atheism
20. soc.religion.christian

1000 Usenet articles were taken from each of the above 20 newsgroups. Approximately 4% of the articles are cross-posted. The articles are typical postings and thus have headers including:
a. subject lines,
b. signature files,
c. quoted portions of other articles.
4.5. Data Format
Each newsgroup is stored in a subdirectory, with each article stored as a separate file. It is a popular dataset for text applications in machine learning, classification, and text clustering.
4.6. Data Specification
Table 5: Dataset Specification

Data Set Characteristics:     Text
Number of Instances:          20000
Area:                         N/A
Attribute Characteristics:    N/A
Number of Attributes:         N/A
Date Donated:                 1999-09-09
Associated Tasks:             N/A
Missing Values?               No
4.7. Experimental Setup
The following table describes the combinations of experiments performed for this dissertation. Every listed combination was run.

Table 6: Experimental Permutations and Combinations

Algorithm               Split sizes         Centroids     Iterations (Itr)
Classical K-Means       4 Mb, 8 Mb, 32 Mb   3, 4, 5, 6    4, 6
Hadoop Based K-Means    4 Mb, 8 Mb, 32 Mb   3, 4, 5, 6    4, 6
Classical FCMeans       4 Mb, 8 Mb, 32 Mb   3, 4, 5, 6    4, 6
Hadoop Based FCMeans    4 Mb, 8 Mb, 32 Mb   3, 4, 5, 6    4, 6

Based on Table 6, results for these combinations are taken and discussed in the current chapter.
4.8. Hadoop Version and Number of Nodes
1. Hadoop 1.2.1
2. 2, 4, 8 Node Hadoop Clusters
4.9. Other Required Software
Eclipse IDE (Luna version)
4.10. Experiments and Discussion
4.10.1. Classical K-Means vs. Hadoop Based K-Means
4.10.1.1. 4Mb Split 4 Iterations
Table 7: 2 Nodes Hadoop Cluster 4Mb Split 4 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      300000      300000      300000
3rd ITR   300000      300000      420000      540000
4th ITR   420000      420000      480000      480000
HKM
1st ITR   28000       35000       28000       31000
2nd ITR   363000      736000      498000      555000
3rd ITR   399000      921000      517000      617000
4th ITR   369000      935000      799000      615000
Table 8: 4 Nodes Hadoop Cluster 4Mb Split 4 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      300000      300000      300000
3rd ITR   300000      300000      420000      540000
4th ITR   420000      420000      480000      480000
HKM
1st ITR   28000       3400        27000       31000
2nd ITR   162000      251000      206000      224000
3rd ITR   188000      284000      215000      225000
4th ITR   182000      272000      211000      226000
Table 9: 8 Nodes Hadoop Cluster 4Mb Split 4 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      300000      300000      300000
3rd ITR   300000      300000      420000      540000
4th ITR   420000      420000      480000      480000
HKM
1st ITR   29000       36000       29000       31000
2nd ITR   209000      357000      217000      266000
3rd ITR   237000      423000      258000      280000
4th ITR   239000      407000      300000      292000
Table 10: 2, 4, 8 Node 4Mb Split 4 Iterations

                   3 centroid  4 centroid  5 centroid  6 centroid
Classical K-Means
1st ITR            60000       60000       60000       60000
2nd ITR            300000      300000      300000      300000
3rd ITR            300000      300000      420000      540000
4th ITR            420000      420000      480000      480000
2 Nodes HKM
1st ITR            28000       35000       28000       31000
2nd ITR            363000      736000      498000      555000
3rd ITR            399000      921000      517000      617000
4th ITR            369000      935000      799000      615000
4 Nodes HKM
1st ITR            28000       3400        27000       31000
2nd ITR            162000      251000      206000      224000
3rd ITR            188000      284000      215000      225000
4th ITR            182000      272000      211000      226000
8 Nodes HKM
1st ITR            29000       36000       29000       31000
2nd ITR            209000      357000      217000      266000
3rd ITR            237000      423000      258000      280000
4th ITR            239000      407000      300000      292000
Figure 12: 2 Nodes Hadoop Cluster 4Mb Split 4 Iterations
Figure 13: 4 Nodes Hadoop Cluster 4Mb Split 4 Iterations
Figure 14: 8 Nodes Hadoop Cluster 4Mb Split 4 Iterations
Figure 15: 2, 4, 8 Node 4Mb Split 4 Iterations
4.10.1.2. 4Mb Split 6 Iterations
Table 11: 2 Nodes Hadoop Cluster 4Mb Split 6 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      360000      300000      360000
3rd ITR   300000      420000      540000      600000
4th ITR   420000      480000      540000      660000
5th ITR   540000      720000      720000      780000
6th ITR   600000      660000      660000      900000
HKM
1st ITR   28000       35000       28000       31000
2nd ITR   455000      887000      498000      547000
3rd ITR   402000      852000      558000      578000
4th ITR   528000      741000      733000      680000
5th ITR   429000      953000      930000      582000
6th ITR   396000      699000      720000      720000
Table 12: 4 Nodes Hadoop Cluster 4Mb Split 6 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      360000      300000      360000
3rd ITR   300000      420000      540000      600000
4th ITR   420000      480000      540000      660000
5th ITR   540000      720000      720000      780000
6th ITR   600000      660000      660000      900000
HKM
1st ITR   27000       38000       29000       31000
2nd ITR   158000      262000      222000      232000
3rd ITR   187000      284000      234000      235000
4th ITR   193000      269000      270000      241000
5th ITR   195000      272000      233000      238000
6th ITR   196000      289000      247000      228000
Table 13: 8 Nodes Hadoop Cluster 4Mb Split 6 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      360000      300000      360000
3rd ITR   300000      420000      540000      600000
4th ITR   420000      480000      540000      660000
5th ITR   540000      720000      720000      780000
6th ITR   600000      660000      660000      900000
HKM
1st ITR   27000       38000       29000       31000
2nd ITR   158000      262000      222000      232000
3rd ITR   187000      284000      234000      235000
4th ITR   193000      269000      270000      241000
5th ITR   195000      272000      233000      238000
6th ITR   196000      289000      247000      228000
Table 14: 2, 4, 8 Node 4Mb Split 6 Iterations

                   3 centroid  4 centroid  5 centroid  6 centroid
Classical K-Means
1st ITR            60000       60000       60000       60000
2nd ITR            300000      360000      300000      360000
3rd ITR            300000      420000      540000      600000
4th ITR            420000      480000      540000      660000
5th ITR            540000      720000      720000      780000
6th ITR            600000      660000      660000      900000
2 Node K-Means
1st ITR            28000       35000       28000       31000
2nd ITR            455000      887000      498000      547000
3rd ITR            402000      852000      558000      578000
4th ITR            528000      741000      733000      680000
5th ITR            429000      953000      930000      582000
6th ITR            396000      699000      720000      720000
4 Node K-Means
1st ITR            27000       38000       29000       31000
2nd ITR            158000      262000      222000      232000
3rd ITR            187000      284000      234000      235000
4th ITR            193000      269000      270000      241000
5th ITR            195000      272000      233000      238000
6th ITR            196000      289000      247000      228000
8 Node K-Means
1st ITR            27000       38000       29000       31000
2nd ITR            158000      262000      222000      232000
3rd ITR            187000      284000      234000      235000
4th ITR            193000      269000      270000      241000
5th ITR            195000      272000      233000      238000
6th ITR            196000      289000      247000      228000
Figure 16: 2 Nodes Hadoop Cluster 4Mb Split 6 Iterations
Figure 17: 4 Nodes Hadoop Cluster 4Mb Split 6 Iterations
Figure 18: 8 Nodes Hadoop Cluster 4Mb Split 6 Iterations
Figure 19: 2, 4, 8 Node 4Mb Split 6 Iterations
4.10.1.3. 8Mb Split 4 Iterations
Table 15: 2 Nodes Hadoop Cluster 8Mb Split 4 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      300000      300000      300000
3rd ITR   300000      300000      420000      540000
4th ITR   420000      420000      480000      480000
HKM
1st ITR   28000       35000       28000       32000
2nd ITR   358000      603000      477000      480000
3rd ITR   391000      912000      603000      622000
4th ITR   375000      926000      808000      639000
Table 16: 4 Nodes Hadoop Cluster 8Mb Split 4 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      300000      300000      300000
3rd ITR   300000      300000      420000      540000
4th ITR   420000      420000      480000      480000
HKM
1st ITR   29000       38000       29000       31000
2nd ITR   180000      262000      200000      213000
3rd ITR   189000      275000      231000      235000
4th ITR   194000      322000      242000      231000
Table 17: 8 Nodes Hadoop Cluster 8Mb Split 4 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      300000      300000      300000
3rd ITR   300000      300000      420000      540000
4th ITR   420000      420000      480000      480000
HKM
1st ITR   29000       36000       28000       31000
2nd ITR   308000      615000      405000      469000
3rd ITR   307000      655000      451000      468000
4th ITR   307000      682000      573000      465000
Table 18: 2, 4, 8 Node 8Mb Split 4 Iterations

                   3 centroid  4 centroid  5 centroid  6 centroid
Classical K-Means
1st ITR            60000       60000       60000       60000
2nd ITR            300000      300000      300000      300000
3rd ITR            300000      300000      420000      540000
4th ITR            420000      420000      480000      480000
2 Node K-Means
1st ITR            28000       35000       28000       32000
2nd ITR            358000      603000      477000      480000
3rd ITR            391000      912000      603000      622000
4th ITR            375000      926000      808000      639000
4 Node K-Means
1st ITR            29000       38000       29000       31000
2nd ITR            180000      262000      200000      213000
3rd ITR            189000      275000      231000      235000
4th ITR            194000      322000      242000      231000
8 Node K-Means
1st ITR            29000       36000       28000       31000
2nd ITR            308000      615000      405000      469000
3rd ITR            307000      655000      451000      468000
4th ITR            307000      682000      573000      465000
Figure 20: 2 Nodes Hadoop Cluster 8Mb Split 4 Iterations
Figure 21: 4 Nodes Hadoop Cluster 8Mb Split 4 Iterations
Figure 22: 8 Nodes Hadoop Cluster 8Mb Split 4 Iterations
Figure 23: 2, 4, 8 Node 8Mb Split 4 Iterations
4.10.1.4. 8Mb Split 6 Iterations
Table 19: 2 Nodes Hadoop Cluster 8Mb Split 6 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      360000      300000      360000
3rd ITR   300000      420000      540000      600000
4th ITR   420000      480000      540000      660000
5th ITR   540000      720000      720000      780000
6th ITR   600000      660000      660000      900000
HKM
1st ITR   28000       35000       28000       31000
2nd ITR   350000      784000      545000      850000
3rd ITR   396000      843000      850000      622000
4th ITR   388000      927000      802000      607000
5th ITR   419000      877000      662000      565000
6th ITR   411000      876000      720000      545000
Table 20: 4 Nodes Hadoop Cluster 8Mb Split 6 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      360000      300000      360000
3rd ITR   300000      420000      540000      600000
4th ITR   420000      480000      540000      660000
5th ITR   540000      720000      720000      780000
6th ITR   600000      660000      660000      900000
HKM
1st ITR   28000       37000       27000       31000
2nd ITR   168000      287000      192000      240000
3rd ITR   198000      273000      253000      219000
4th ITR   204000      305000      276000      251000
5th ITR   194000      288000      257000      248000
6th ITR   194000      282000      245000      248000
Table 21: 8 Nodes Hadoop Cluster 8Mb Split 6 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      360000      300000      360000
3rd ITR   300000      420000      540000      600000
4th ITR   420000      480000      540000      660000
5th ITR   540000      720000      720000      780000
6th ITR   600000      660000      660000      900000
HKM
1st ITR   29000       35000       28000       31000
2nd ITR   281000      579000      369000      440000
3rd ITR   309000      657000      444000      460000
4th ITR   308000      699000      573000      512000
5th ITR   340000      687000      625000      482000
6th ITR   312000      740000      584000      538000
Table 22: 2, 4, 8 Node 8Mb Split 6 Iterations

                   3 centroid  4 centroid  5 centroid  6 centroid
Classical K-Means
1st ITR            60000       60000       60000       60000
2nd ITR            300000      360000      300000      360000
3rd ITR            300000      420000      540000      600000
4th ITR            420000      480000      540000      660000
5th ITR            540000      720000      720000      780000
6th ITR            600000      660000      660000      900000
2 Node K-Means
1st ITR            28000       35000       28000       31000
2nd ITR            350000      784000      545000      850000
3rd ITR            396000      843000      850000      622000
4th ITR            388000      927000      802000      607000
5th ITR            419000      877000      662000      565000
6th ITR            411000      876000      720000      545000
4 Node K-Means
1st ITR            28000       37000       27000       31000
2nd ITR            168000      287000      192000      240000
3rd ITR            198000      273000      253000      219000
4th ITR            204000      305000      276000      251000
5th ITR            194000      288000      257000      248000
6th ITR            194000      282000      245000      248000
8 Node K-Means
1st ITR            29000       35000       28000       31000
2nd ITR            281000      579000      369000      440000
3rd ITR            309000      657000      444000      460000
4th ITR            308000      699000      573000      512000
5th ITR            340000      687000      625000      482000
6th ITR            312000      740000      584000      538000
Figure 24: 2 Nodes Hadoop Cluster 8Mb Split 6 Iterations
Figure 25: 4 Nodes Hadoop Cluster 8Mb Split 6 Iterations
Figure 26: 8 Nodes Hadoop Cluster 8Mb Split 6 Iterations
Figure 27: 2, 4, 8 Node 8Mb Split 6 Iterations
4.10.1.5. 32Mb Split 4 Iterations
Table 23: 2 Nodes Hadoop Cluster 32Mb Split 4 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      300000      300000      300000
3rd ITR   300000      300000      420000      540000
4th ITR   420000      420000      480000      480000
HKM
1st ITR   29000       35000       28000       31000
2nd ITR   343000      693000      475000      516000
3rd ITR   387000      841000      577000      621000
4th ITR   395000      655000      632000      609000
Table 24: 4 Nodes Hadoop Cluster 32Mb Split 4 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      300000      300000      300000
3rd ITR   300000      300000      420000      540000
4th ITR   420000      420000      480000      480000
HKM
1st ITR   30000       41000       41000       32000
2nd ITR   163000      263000      214000      233000
3rd ITR   195000      290000      228000      245000
4th ITR   183000      319000      254000      217000
Table 25: 8 Nodes Hadoop Cluster 32Mb Split 4 Iterations

K-Means   3 centroid  4 centroid  5 centroid  6 centroid
1st ITR   60000       60000       60000       60000
2nd ITR   300000      300000      300000      300000
3rd ITR   300000      300000      420000      540000
4th ITR   420000      420000      480000      480000
HKM
1st ITR   29000       36000       28000       31000
2nd ITR   276000      577000      427000      464000
3rd ITR   300000      660000      476000      491000
4th ITR   303000      692000      573000      501000
Table 26: 2, 4, 8 Node 32Mb Split 4 Iterations
            3 centroid  4 centroid  5 centroid  6 centroid
Classical       60000       60000       60000       60000
K-Means        300000      300000      300000      300000
               300000      300000      420000      540000
               420000      420000      480000      480000
2 Node          29000       35000       28000       31000
K-Means        343000      693000      475000      516000
               387000      841000      577000      621000
               395000      655000      632000      609000
4 Node          30000       41000       41000       32000
K-Means        163000      263000      214000      233000
               195000      290000      228000      245000
               183000      319000      254000      217000
8 Node          29000       36000       28000       31000
K-Means        276000      577000      427000      464000
               300000      660000      476000      491000
               303000      692000      573000      501000
Figure 28: 2 Nodes Hadoop Cluster 32Mb Split 4 Iterations
Figure 29: 4 Nodes Hadoop Cluster 32Mb Split 4 Iterations
Figure 30: 8 Nodes Hadoop Cluster 32Mb Split 4 Iterations
Figure 31: 2, 4, 8 Node 32Mb Split 4 Iterations
4.10.1.6. 32Mb Split 6 Iterations
Table 27: 2 Nodes Hadoop Cluster 32Mb Split 6 Iterations
K-Means     3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000       60000       60000
  2nd ITR      300000      360000      300000      360000
  3rd ITR      300000      420000      540000      600000
  4th ITR      420000      480000      540000      660000
  5th ITR      540000      720000      720000      780000
  6th ITR      600000      660000      660000      900000
HKM
  1st ITR       29000       43000       29000       32000
  2nd ITR      387000     1377000      569000      517000
  3rd ITR      363000      835000      543000      636000
  4th ITR      443000      875000      694000     1058000
  5th ITR      499000      806000      771000      916000
  6th ITR      387000      956000      858000      862000
Table 28: 4 Nodes Hadoop Cluster 32Mb Split 6 Iterations
K-Means     3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000       60000       60000
  2nd ITR      300000      360000      300000      360000
  3rd ITR      300000      420000      540000      600000
  4th ITR      420000      480000      540000      660000
  5th ITR      540000      720000      720000      780000
  6th ITR      600000      660000      660000      900000
HKM
  1st ITR       28000       34000       28000       31000
  2nd ITR      172000      275000      195000      262000
  3rd ITR      185000      285000      229000      260000
  4th ITR      182000      303000      241000      268000
  5th ITR      194000      273000      258000      254000
  6th ITR      185000      277000      285000      266000
Table 29: 8 Nodes Hadoop Cluster 32Mb Split 6 Iterations
K-Means     3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000       60000       60000
  2nd ITR      300000      360000      300000      360000
  3rd ITR      300000      420000      540000      600000
  4th ITR      420000      480000      540000      660000
  5th ITR      540000      720000      720000      780000
  6th ITR      600000      660000      660000      900000
HKM
  1st ITR       28000       35000       28000       31000
  2nd ITR      323000      604000      397000      439000
  3rd ITR      320000      661000      474000      460000
  4th ITR      333000      678000      607000      505000
  5th ITR      369000      687000      610000      479000
  6th ITR      366000      708000      581000      516000
Table 30: 2, 4, 8 Node 32Mb Split 6 Iterations
            3 centroid  4 centroid  5 centroid  6 centroid
Classical       60000       60000       60000       60000
K-Means        300000      360000      300000      360000
               300000      420000      540000      600000
               420000      480000      540000      660000
               540000      720000      720000      780000
               600000      660000      660000      900000
2 Node          29000       43000       29000       32000
K-Means        387000     1377000      569000      517000
               363000      835000      543000      636000
               443000      875000      694000     1058000
               499000      806000      771000      916000
               387000      956000      858000      862000
4 Node          28000       34000       28000       31000
K-Means        172000      275000      195000      262000
               185000      285000      229000      260000
               182000      303000      241000      268000
               194000      273000      258000      254000
               185000      277000      285000      266000
8 Node          28000       35000       28000       31000
K-Means        323000      604000      397000      439000
               320000      661000      474000      460000
               333000      678000      607000      505000
               369000      687000      610000      479000
               366000      708000      581000      516000
Figure 32: 2 Nodes Hadoop Cluster 32Mb Split 6 Iterations
Figure 33: 4 Nodes Hadoop Cluster 32Mb Split 6 Iterations
Figure 34: 8 Nodes Hadoop Cluster 32Mb Split 6 Iterations
Figure 35: 2, 4, 8 Node 32Mb Split 6 Iterations
Table x above describes the parameter settings employed in the experimentation. This report uses combinations of three components and their subcomponents:
1. Number of nodes
   a. 2 Node Hadoop cluster
   b. 4 Node Hadoop cluster
   c. 8 Node Hadoop cluster
2. Number of iterations
   a. 4 iterations
   b. 6 iterations
3. Size of split
   a. 4Mb split
   b. 8Mb split
   c. 32Mb split
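The full experimental grid is the cross product of these settings together with the initial-centroid counts used in the tables; a small sketch of how large that grid is:

```python
import itertools

# Experimental factors from this report
nodes = [2, 4, 8]
iterations = [4, 6]
split_mb = [4, 8, 32]
centroids = [3, 4, 5, 6]

# Every (nodes, iterations, split, centroids) combination that was timed
grid = list(itertools.product(nodes, iterations, split_mb, centroids))
print(len(grid))  # 3 * 2 * 3 * 4 = 72 configurations per algorithm
```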
Classical K-Means vs. Hadoop Based K-Means
Tables 4 to 27 exhibit the speedup gained by using the Hadoop based K-Means algorithm. Figures 5 to 27 show the time consumed by the 2, 4, and 8 node Hadoop clusters compared to single-node classical K-Means. All of the above results are statistics of the time required to execute the classical K-Means implementation and the Hadoop based K-Means implementation. In most cases the 2 node Hadoop based K-Means takes about as much time as classical K-Means, or even more. This happens because of the communication overhead between the two nodes. Moreover, Hadoop assigns tasks to its slaves: in a two node system the master first transfers data to the slave and copies the program to each node, then the Mapper phase executes, and on completion of the Mapper phase communication takes place again to gather the data at one node. This phase is called the Reducer phase, and it works similarly to sequential execution. With a 4 Mb data split, the 120 Mb 20_newsgroup dataset produces 24 splits, whereas a 32 Mb split produces only 3 splits. Because of this communication overhead, the 2 node Hadoop based K-Means is much slower than classical K-Means.
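The Mapper and Reducer phases described above can be sketched in miniature (an illustrative Python sketch of the general map/reduce K-Means pattern, not this dissertation's actual Hadoop implementation):

```python
import math
from collections import defaultdict

def nearest(point, centroids):
    # Mapper helper: index of the closest centroid (Euclidean distance)
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_map(points, centroids):
    # Mapper phase: each data split emits (centroid_id, point) pairs
    for p in points:
        yield nearest(p, centroids), p

def kmeans_reduce(pairs, dim):
    # Reducer phase: gather the pairs at one node and average the
    # points assigned to each centroid to obtain the new centroids
    sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    for cid, p in pairs:
        counts[cid] += 1
        for d in range(dim):
            sums[cid][d] += p[d]
    return {cid: [v / counts[cid] for v in vec] for cid, vec in sums.items()}
```

In a real Hadoop job the pairs emitted by the mappers are shuffled across the network to the reducer, which is exactly the communication step discussed above.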
In tables 5 to 27 above, the times are given in milliseconds. The number of nodes, the number of iterations, and the corresponding times are reported, and the tables are separated by the data split size used in Hadoop. The experiment was carried out with 3, 4, 5, and 6 initial centroids, and the combinations of all these parameters are considered in this dissertation report.
In the case of the 4 node Hadoop cluster, Hadoop gives the best speedup for Hadoop based K-Means, whereas with 8 nodes the slaves and master again spend most of their time in communication instead of processing.
The above experiment shows that Hadoop suits algorithms that are not highly iterative and, for small datasets, situations where the communication burden is kept as low as possible. For larger datasets Hadoop works better with a large data split: as the number of splits falls, the communication overhead falls with it, and the processing elements get more time for actual processing.
K-Means works with hard memberships of 0 and 1 only. Because of that, the Fuzzy C-Means concept comes into the picture in place of the K-Means algorithm: Fuzzy C-Means supports "one to many" membership, where a point can belong to several clusters with fractional degrees. With the Hadoop based FCM implementation, an exponential speedup over classical FCM is expected as the number of nodes increases.
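The "one to many" membership can be made concrete with the standard FCM membership formula, u_i = 1 / Σ_k (d_i / d_k)^(2/(m-1)); a small illustrative sketch, assuming the usual fuzzifier m = 2:

```python
import math

def fcm_memberships(point, centroids, m=2.0):
    # Degree of membership of one point in every cluster: fractional,
    # summing to 1, unlike the 0/1 assignment of K-Means.
    dists = [math.dist(point, c) for c in centroids]
    if any(d == 0.0 for d in dists):
        # a point sitting exactly on a centroid belongs fully to it
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** exp for dk in dists) for di in dists]
```

For a point three times closer to one centroid than the other, this yields memberships of roughly 0.9 and 0.1 rather than a hard 1 and 0.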
Figure 36: Communication Overhead
2. Classical Fuzzy C-Means vs. Hadoop Based Fuzzy C-Means
4.10.2.1. 4Mb Split 4 Iterations
Table 31: 2 Nodes Hadoop Cluster 4Mb Split 4 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000      120000      180000
  2nd ITR     2580000     3360000     6120000     6600000
  3rd ITR     3120000     3960000     6660000     6660000
  4th ITR     3120000     4140000     7020000     8280000
HFCM
  1st ITR       69000       81000      116000      127000
  2nd ITR      945000     1543000     2418000     3132000
  3rd ITR      863000     1716000     2431000     3145000
  4th ITR      859000     1714000     2437000     3242000
Table 32: 4 Nodes Hadoop Cluster 4Mb Split 4 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000      120000      180000
  2nd ITR     2580000     3360000     6120000     6600000
  3rd ITR     3120000     3960000     6660000     6660000
  4th ITR     3120000     4140000     7020000     8280000
HFCM
  1st ITR       66000       79000       97000      121000
  2nd ITR     1556000     1278000     2218000     3845000
  3rd ITR      824000     1366000     2836000     4037000
  4th ITR      765000     1440000     3042000     5828000
Table 33: 8 Nodes Hadoop Cluster 4Mb Split 4 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000      120000      180000
  2nd ITR     2580000     3360000     6120000     6600000
  3rd ITR     3120000     3960000     6660000     6660000
  4th ITR     3120000     4140000     7020000     8280000
HFCM
  1st ITR       55000       70000       97000      111000
  2nd ITR      798000     1357000     1695000     2125000
  3rd ITR      458000     1145000     1800000     2155000
  4th ITR      785000     1547000     1508000     2135000
Table 34: 2, 4, 8 Node 4Mb Split 4 Iterations
            3 centroid  4 centroid  5 centroid  6 centroid
Classical       60000       60000      120000      180000
FCM           2580000     3360000     6120000     6600000
              3120000     3960000     6660000     6660000
              3120000     4140000     7020000     8280000
2 Node          69000       81000      116000      127000
FCM            945000     1543000     2418000     3132000
               863000     1716000     2431000     3145000
               859000     1714000     2437000     3242000
4 Node          66000       79000       97000      121000
FCM           1556000     1278000     2218000     3845000
               824000     1366000     2836000     4037000
               765000     1440000     3042000     5828000
8 Node          55000       70000       97000      111000
FCM            798000     1357000     1695000     2125000
               458000     1145000     1800000     2155000
               785000     1547000     1508000     2135000
Figure 37: 2 Nodes Hadoop Cluster 4Mb Split 4 Iterations
Figure 38: 4 Nodes Hadoop Cluster 4Mb Split 4 Iterations
Figure 39: 8 Nodes Hadoop Cluster 4Mb Split 4 Iterations
Figure 40: 2, 4, 8 Node 4Mb Split 4 Iteration
4.10.2.2. 4Mb Split 6 Iterations
Table 35: 2 Nodes Hadoop Cluster 4Mb Split 6 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000      120000      120000      180000
  2nd ITR     2460000     4200000     6240000     6660000
  3rd ITR     2460000     4260000     7020000     7920000
  4th ITR     2520000     4320000     6720000     8400000
  5th ITR     2520000     4560000     6600000     9120000
  6th ITR     2460000     4620000     6840000     9420000
HFCM
  1st ITR       63000       79000      112000      129000
  2nd ITR      942000     1541000     2420000     3281000
  3rd ITR     1096000     1698000     2420000     3264000
  4th ITR     1100000     1696000     2423000     3273000
  5th ITR     1088000     1694000     2431000     4477000
  6th ITR     1088000     1695000     2418000     3211000
Table 36: 4 Nodes Hadoop Cluster 4Mb Split 6 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000      120000      120000      180000
  2nd ITR     2460000     4200000     6240000     6660000
  3rd ITR     2460000     4260000     7020000     7920000
  4th ITR     2520000     4320000     6720000     8400000
  5th ITR     2520000     4560000     6600000     9120000
  6th ITR     2460000     4620000     6840000     9420000
HFCM
  1st ITR       57000       70000      100000      124000
  2nd ITR      641000     1112000     1900000     2274000
  3rd ITR      833000     1165000     1785000     3281000
  4th ITR      732000     1136000     1824000     2972000
  5th ITR      792000     1180000     2016000     3544000
  6th ITR      733000     1417000     1839000     3709000
Table 37: 8 Nodes Hadoop Cluster 4Mb Split 6 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000      120000      120000      180000
  2nd ITR     2460000     4200000     6240000     6660000
  3rd ITR     2460000     4260000     7020000     7920000
  4th ITR     2520000     4320000     6720000     8400000
  5th ITR     2520000     4560000     6600000     9120000
  6th ITR     2460000     4620000     6840000     9420000
HFCM
  1st ITR       53000       68000       94000      106000
  2nd ITR      486000      779000     1210000     1532000
  3rd ITR      590000      885000     1150000     1504000
  4th ITR      616000      847000     1153000     1532000
  5th ITR      584000      853000     1162000     1528000
  6th ITR      588000      875000     1144000     1846000
Table 38: 2, 4, 8 Node 4Mb Split 6 Iterations
            3 centroid  4 centroid  5 centroid  6 centroid
Classical       60000      120000      120000      180000
FCM           2460000     4200000     6240000     6660000
              2460000     4260000     7020000     7920000
              2520000     4320000     6720000     8400000
              2520000     4560000     6600000     9120000
              2460000     4620000     6840000     9420000
2 Node          63000       79000      112000      129000
FCM            942000     1541000     2420000     3281000
              1096000     1698000     2420000     3264000
              1100000     1696000     2423000     3273000
              1088000     1694000     2431000     4477000
              1088000     1695000     2418000     3211000
4 Node          57000       70000      100000      124000
FCM            641000     1112000     1900000     2274000
               833000     1165000     1785000     3281000
               732000     1136000     1824000     2972000
               792000     1180000     2016000     3544000
               733000     1417000     1839000     3709000
8 Node          53000       68000       94000      106000
FCM            486000      779000     1210000     1532000
               590000      885000     1150000     1504000
               616000      847000     1153000     1532000
               584000      853000     1162000     1528000
               588000      875000     1144000     1846000
Figure 41: 2 Nodes Hadoop Cluster 4Mb Split 6 Iterations
Figure 42: 4 Nodes Hadoop Cluster 4Mb Split 6 Iterations
Figure 43: 8 Nodes Hadoop Cluster 4Mb Split 6 Iterations
Figure 44: 2, 4, 8 Node 4Mb Split 6 Iterations
4.10.2.3. 8Mb Split 4 Iterations
Table 39: 2 Nodes Hadoop Cluster 8Mb Split 4 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000      120000      180000
  2nd ITR     2580000     3360000     6120000     6600000
  3rd ITR     3120000     3960000     6660000     6660000
  4th ITR     3120000     4140000     7020000     8280000
HFCM
  1st ITR       58000       76000      105000      124000
  2nd ITR      796000     1356000     2127000     2935000
  3rd ITR      911000     1466000     2147000     2918000
  4th ITR     1032000     1696000     2287000     3138000
Table 40: 4 Nodes Hadoop Cluster 8Mb Split 4 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000      120000      180000
  2nd ITR     2580000     3360000     6120000     6600000
  3rd ITR     3120000     3960000     6660000     6660000
  4th ITR     3120000     4140000     7020000     8280000
HFCM
  1st ITR       50000       78000      101000      120000
  2nd ITR      842000     2603000     2881000     5357000
  3rd ITR     1045000     1786000     3188000     6360000
  4th ITR     1349000     1894000     3667000     4541000
Table 41: 8 Nodes Hadoop Cluster 8Mb Split 4 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000      120000      180000
  2nd ITR     2580000     3360000     6120000     6600000
  3rd ITR     3120000     3960000     6660000     6660000
  4th ITR     3120000     4140000     7020000     8280000
HFCM
  1st ITR       52000       64000       91000      110000
  2nd ITR      403000      650000     1047000     1423000
  3rd ITR      471000      737000     1024000     1363000
  4th ITR      467000      772000     1067000     1447000
Table 42: 2, 4, 8 Node 8Mb Split 4 Iterations
            3 centroid  4 centroid  5 centroid  6 centroid
Classical       60000       60000      120000      180000
FCM           2580000     3360000     6120000     6600000
              3120000     3960000     6660000     6660000
              3120000     4140000     7020000     8280000
2 Node          58000       76000      105000      124000
FCM            796000     1356000     2127000     2935000
               911000     1466000     2147000     2918000
              1032000     1696000     2287000     3138000
4 Node          50000       78000      101000      120000
FCM            842000     2603000     2881000     5357000
              1045000     1786000     3188000     6360000
              1349000     1894000     3667000     4541000
8 Node          52000       64000       91000      110000
FCM            403000      650000     1047000     1423000
               471000      737000     1024000     1363000
               467000      772000     1067000     1447000
Figure 45: 2 Nodes Hadoop Cluster 8Mb Split 4 Iterations
Figure 46: 4 Nodes Hadoop Cluster 8Mb Split 4 Iterations
Figure 47: 8 Nodes Hadoop Cluster 8Mb Split 4 Iterations
Figure 48: 2, 4, 8 Node 8Mb Split 4 Iterations
4.10.2.4. 8Mb Split 6 Iterations
Table 43: 2 Nodes Hadoop Cluster 8Mb Split 6 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000      120000      120000      180000
  2nd ITR     2460000     4200000     6240000     6660000
  3rd ITR     2460000     4260000     7020000     7920000
  4th ITR     2520000     4320000     6720000     8400000
  5th ITR     2520000     4560000     6600000     9120000
  6th ITR     2460000     4620000     6840000     9420000
HFCM
  1st ITR       58000       76000      103000      125000
  2nd ITR      794000     1332000     2127000     2938000
  3rd ITR      922000     1468000     2153000     2931000
  4th ITR      921000     1460000     2136000     2929000
  5th ITR     1052000     1598000     2297000     3088000
  6th ITR     1072000     1590000     2323000     3031000
Table 44: 4 Nodes Hadoop Cluster 8Mb Split 6 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000      120000      120000      180000
  2nd ITR     2460000     4200000     6240000     6660000
  3rd ITR     2460000     4260000     7020000     7920000
  4th ITR     2520000     4320000     6720000     8400000
  5th ITR     2520000     4560000     6600000     9120000
  6th ITR     2460000     4620000     6840000     9420000
HFCM
  1st ITR       60000       78000      104000      115000
  2nd ITR     1425000     2437000     3089000     5049000
  3rd ITR     1353000     2804000     4886000     4387000
  4th ITR     1517000     2377000     3164000     4606000
  5th ITR     1754000     2008000     3159000     5324000
  6th ITR     2096000     2249000     3652000     5115000
Table 45: 8 Nodes Hadoop Cluster 8Mb Split 6 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000      120000      120000      180000
  2nd ITR     2460000     4200000     6240000     6660000
  3rd ITR     2460000     4260000     7020000     7920000
  4th ITR     2520000     4320000     6720000     8400000
  5th ITR     2520000     4560000     6600000     9120000
  6th ITR     2460000     4620000     6840000     9420000
HFCM
  1st ITR      518000       65000       92000      106000
  2nd ITR      422000      639000     1147000     1362000
  3rd ITR      463000      703000     1073000     1505000
  4th ITR      517000      720000     1080000     1437000
  5th ITR      477000      900000     1013000     1648000
  6th ITR      474000      742000     1161000     1544000
Table 46: 2, 4, 8 Node 8Mb Split 6 Iterations
            3 centroid  4 centroid  5 centroid  6 centroid
Classical       60000      120000      120000      180000
FCM           2460000     4200000     6240000     6660000
              2460000     4260000     7020000     7920000
              2520000     4320000     6720000     8400000
              2520000     4560000     6600000     9120000
              2460000     4620000     6840000     9420000
2 Node          58000       76000      103000      125000
FCM            794000     1332000     2127000     2938000
               922000     1468000     2153000     2931000
               921000     1460000     2136000     2929000
              1052000     1598000     2297000     3088000
              1072000     1590000     2323000     3031000
4 Node          60000       78000      104000      115000
FCM           1425000     2437000     3089000     5049000
              1353000     2804000     4886000     4387000
              1517000     2377000     3164000     4606000
              1754000     2008000     3159000     5324000
              2096000     2249000     3652000     5115000
8 Node         518000       65000       92000      106000
FCM            422000      639000     1147000     1362000
               463000      703000     1073000     1505000
               517000      720000     1080000     1437000
               477000      900000     1013000     1648000
               474000      742000     1161000     1544000
Figure 49: 2 Nodes Hadoop Cluster 8Mb Split 6 Iterations
Figure 50: 4 Nodes Hadoop Cluster 8Mb Split 6 Iterations
Figure 51: 8 Nodes Hadoop Cluster 8Mb Split 6 Iterations
Figure 52: 2, 4, 8 Node 8Mb Split 6 Iterations
4.10.2.5. 32Mb Split 4 Iterations
Table 47: 2 Nodes Hadoop Cluster 32Mb Split 4 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000      120000      180000
  2nd ITR     2580000     3360000     6120000     6600000
  3rd ITR     3120000     3960000     6660000     6660000
  4th ITR     3120000     4140000     7020000     8280000
HFCM
  1st ITR       52000       73000      105000      166000
  2nd ITR      913000     1506000     2506000     3524000
  3rd ITR     1021000     1703000     2508000     3512000
  4th ITR     1148000     1873000     2628000     3640000
Table 48: 4 Nodes Hadoop Cluster 32Mb Split 4 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000      120000      180000
  2nd ITR     2580000     3360000     6120000     6600000
  3rd ITR     3120000     3960000     6660000     6660000
  4th ITR     3120000     4140000     7020000     8280000
HFCM
  1st ITR      172000       75000      112000      127000
  2nd ITR      898000     1628000     2595000     3406000
  3rd ITR     1027000     1591000     2595000     3237000
  4th ITR      949000     1663000     2301000     3239000
Table 49: 8 Nodes Hadoop Cluster 32Mb Split 4 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000      120000      180000
  2nd ITR     2580000     3360000     6120000     6600000
  3rd ITR     3120000     3960000     6660000     6660000
  4th ITR     3120000     4140000     7020000     8280000
HFCM
  1st ITR       68000       83000      107000      126000
  2nd ITR      989000     1768000     2617000     3420000
  3rd ITR     1053000     1743000     2540000     3450000
  4th ITR     1035000     1697000     2512000     3407000
Table 50: 2, 4, 8 Node 32Mb Split 4 Iterations
            3 centroid  4 centroid  5 centroid  6 centroid
Classical       60000       60000      120000      180000
FCM           2580000     3360000     6120000     6600000
              3120000     3960000     6660000     6660000
              3120000     4140000     7020000     8280000
2 Node          52000       73000      105000      166000
FCM            913000     1506000     2506000     3524000
              1021000     1703000     2508000     3512000
              1148000     1873000     2628000     3640000
4 Node         172000       75000      112000      127000
FCM            898000     1628000     2595000     3406000
              1027000     1591000     2595000     3237000
               949000     1663000     2301000     3239000
8 Node          68000       83000      107000      126000
FCM            989000     1768000     2617000     3420000
              1053000     1743000     2540000     3450000
              1035000     1697000     2512000     3407000
Figure 53: 2 Nodes Hadoop Cluster 32Mb Split 4 Iterations
Figure 54: 4 Nodes Hadoop Cluster 32Mb Split 4 Iterations
Figure 55: 8 Nodes Hadoop Cluster 32Mb Split 4 Iterations
Figure 56: 2, 4, 8 Node 32Mb Split 4 Iterations
4.10.2.6. 32Mb Split 6 Iterations
Table 51: 2 Nodes Hadoop Cluster 32Mb Split 6 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000      120000      120000      180000
  2nd ITR     2460000     4200000     6240000     6660000
  3rd ITR     2460000     4260000     7020000     7920000
  4th ITR     2520000     4320000     6720000     8400000
  5th ITR     2520000     4560000     6600000     9120000
  6th ITR     2460000     4620000     6840000     9420000
HFCM
  1st ITR       57000       73000       74000      123000
  2nd ITR      922000     1498000     2494000     3564000
  3rd ITR     1030000     1695000     2510000     3132000
  4th ITR     1028000     1695000     2752000     3538000
  5th ITR     1140000     1955000     3110000     3630000
  6th ITR     1165000     1930000     3245000     3565000
Table 52: 4 Nodes Hadoop Cluster 32Mb Split 6 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000      120000      120000      180000
  2nd ITR     2460000     4200000     6240000     6660000
  3rd ITR     2460000     4260000     7020000     7920000
  4th ITR     2520000     4320000     6720000     8400000
  5th ITR     2520000     4560000     6600000     9120000
  6th ITR     2460000     4620000     6840000     9420000
HFCM
  1st ITR       55000       73000      106000      127000
  2nd ITR      925000     1492000     2269000     3552000
  3rd ITR     1019000     1776000     2579000     3222000
  4th ITR     1015000     1660000     2577000     3407000
  5th ITR     1016000     1671000     2468000     3225000
  6th ITR     1014000     1772000     2587000     3580000
Table 53: 8 Nodes Hadoop Cluster 32Mb Split 6 Iterations
FCM         3 centroid  4 centroid  5 centroid  6 centroid
  1st ITR       60000       60000       60000       60000
  2nd ITR      300000      360000      300000      360000
  3rd ITR      300000      420000      540000      600000
  4th ITR      420000      480000      540000      660000
  5th ITR      540000      720000      720000      780000
  6th ITR      600000      660000      660000      900000
HFCM
  1st ITR       28000       35000       28000       31000
  2nd ITR      323000      604000      397000      439000
  3rd ITR      320000      661000      474000      460000
  4th ITR      333000      678000      607000      505000
  5th ITR      369000      687000      610000      479000
  6th ITR      366000      708000      581000      516000
Table 54: 2, 4, 8 Node 32Mb Split 6 Iterations
            3 centroid  4 centroid  5 centroid  6 centroid
Classical       60000       60000       60000       60000
K-Means        300000      360000      300000      360000
               300000      420000      540000      600000
               420000      480000      540000      660000
               540000      720000      720000      780000
               600000      660000      660000      900000
2 Node          29000       43000       29000       32000
K-Means        387000     1377000      569000      517000
               363000      835000      543000      636000
               443000      875000      694000     1058000
               499000      806000      771000      916000
               387000      956000      858000      862000
4 Node          28000       34000       28000       31000
K-Means        172000      275000      195000      262000
               185000      285000      229000      260000
               182000      303000      241000      268000
               194000      273000      258000      254000
               185000      277000      285000      266000
8 Node          28000       35000       28000       31000
K-Means        323000      604000      397000      439000
               320000      661000      474000      460000
               333000      678000      607000      505000
               369000      687000      610000      479000
               366000      708000      581000      516000
Figure 57: 2 Nodes Hadoop Cluster 32Mb Split 6 Iterations
Figure 58: 4 Nodes Hadoop Cluster 32Mb Split 6 Iterations
Figure 59: 8 Nodes Hadoop Cluster 32Mb Split 6 Iterations
Figure 60: 2, 4, 8 Node 32Mb Split 6 Iterations
Classical FCM vs. Hadoop Based FCM
Tables 28 to 51 exhibit the speedup gained by using the Hadoop based FCM algorithm. All of the above results are statistics of the time required to execute the classical FCM implementation and the Hadoop based FCM implementation. In most cases the 2 node Hadoop based FCM gives a speedup of 100%, i.e. twice that of the classical implementation, or even more. With a 4 Mb data split, the 120 Mb 20_newsgroup dataset produces 24 splits, whereas a 32 Mb split produces only 3 splits. The 4 node cluster faces the communication overhead problem because of the small dataset. FCM is more iterative than K-Means. The 2 node cluster utilises its full processing capacity for FCM, and its master has to handle only one node during communication, while the master of a 4 node cluster has to handle communication with 3 nodes. During Hadoop setup the default number of data copies is set to 3, so in a 4 node cluster the 4th node never holds a data copy and must have data copied to it every time. This is the reason there is only a small difference between the performance of 2 nodes and 4 nodes.
In tables 28 to 51 above, the times are given in milliseconds. The number of nodes, the number of iterations, and the corresponding times are reported, and the tables are separated by the data split size used in Hadoop. The experiment was carried out with 3, 4, 5, and 6 initial centroids, and the combinations of all these parameters are considered in this dissertation report.
In the case of the 8 node Hadoop cluster, Hadoop gives the best speedup for Hadoop based FCM. With 8 nodes, the default number of data copies mentioned in the Hadoop setup settings is 6, and the 8 node cluster gives the best speedup for the FCM implementation. One reason for this is the increase in the number of nodes combined with the decrease in communication overhead. Before starting Hadoop programming, one always needs to understand that Hadoop nodes constantly communicate with each other as part of their internal processes. Advanced versions of Hadoop take care of data distribution, load balancing, and communication overhead, so an advanced version of Hadoop should give better performance than the 4 node Hadoop cluster, and as the number of nodes increases the performance improvement should grow.
The above experiment shows that Hadoop suits large datasets, and that for small datasets the communication burden must be kept as low as possible. FCM always takes more time than the classical K-Means implementation, but FCM gives better results for overlapping datasets.
It is observed that, as the number of nodes increases, the speedup increases until it levels off, limited by the sequential part of the code. The corresponding graphs show this speedup.
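This levelling off is described by Amdahl's law; a quick sketch (the 10% sequential fraction below is an assumed illustrative value, not one measured in these experiments):

```python
def amdahl_speedup(serial_fraction, nodes):
    # Amdahl's law: overall speedup on `nodes` processors when a fixed
    # fraction of the work is inherently sequential; the speedup can
    # never exceed 1 / serial_fraction, however many nodes are added.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)

# With an assumed 10% sequential part, the gain flattens out quickly:
for n in (2, 4, 8, 64):
    print(n, round(amdahl_speedup(0.10, n), 2))
```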
5. Conclusion and Future Scope
Hadoop based K-Means and Hadoop based Fuzzy C-Means for document clustering are implemented in this dissertation. Speedup has been achieved using a Hadoop based multi-node cluster. Fuzzy C-Means gives better results for overlapping datasets, whereas K-Means is known for hard clustering and behaves accordingly. Hadoop based Fuzzy C-Means gives a 5-fold speedup on an 8 node Hadoop cluster as compared to the classical Fuzzy C-Means algorithm.
5.1. Hadoop Based K-means
Different implementations of the classical K-Means algorithm, and advanced K-Means variants, are available. These advanced K-Means variants need to be implemented with Hadoop for document clustering and checked for better results than those produced in this dissertation.
The Hadoop based K-Means algorithm needs to minimize the number of iterations, because Hadoop gives its worst results for highly iterative algorithms.
The Hadoop based K-Means algorithm also needs its initial random centroid selection to be replaced with a better technique.
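One widely used better seeding technique is k-means++, which picks each new centroid with probability proportional to its squared distance from the nearest centroid already chosen; a minimal illustrative sketch, not part of this dissertation's implementation:

```python
import math
import random

def kmeans_pp_seeds(points, k, rng=None):
    # k-means++ seeding: the first centroid is uniform at random; each
    # further centroid is drawn with probability proportional to the
    # squared distance to the nearest centroid chosen so far.
    rng = rng or random.Random(0)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        r = rng.uniform(0.0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    return centroids
```

The squared-distance weighting makes well separated regions of the data very likely to each receive a seed, which typically reduces the number of iterations needed to converge.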
5.2. Hadoop Based Fuzzy C-Means
More execution time is observed in both cases, classical Fuzzy C-Means and Hadoop based Fuzzy C-Means; the required time is very high as compared to the K-Means implementations.
Different implementations of Hadoop based Fuzzy C-Means, and advanced Fuzzy C-Means variants (mentioned in the literature review), are available. These advanced Fuzzy C-Means variants need to be implemented with Hadoop for document clustering and checked for better results than those produced in this dissertation.
The Hadoop based Fuzzy C-Means algorithm needs to minimize the number of iterations, because Hadoop gives its worst results for highly iterative algorithms.
The Hadoop based Fuzzy C-Means algorithm also needs its initial random centroid selection to be replaced with a better technique.
5.3. Future Scope
The implemented Hadoop based Fuzzy C-Means algorithm performs many calculations in each iteration and carries the burden of storing the fuzzification factor every time it calculates which centroid a point belongs to, since every point participates in the calculation of every centroid. An alternative advanced design can be created with fewer iterations and a lower burden of storing the fuzzification factor.
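One possible alternative design folds each point's contribution into running weighted sums as the data is scanned, so the full point-by-cluster membership matrix never has to be stored; a sketch of that idea (illustrative only, not the implemented algorithm):

```python
import math

def fcm_update_centroids(points, centroids, m=2.0):
    # One FCM update pass that never materialises the membership matrix:
    # each point's memberships are computed on the fly and immediately
    # folded into u^m-weighted sums (num) and weights (den).
    dim = len(centroids[0])
    num = [[0.0] * dim for _ in centroids]
    den = [0.0] * len(centroids)
    exp = 2.0 / (m - 1.0)
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        if any(d == 0.0 for d in dists):
            u = [1.0 if d == 0.0 else 0.0 for d in dists]
        else:
            u = [1.0 / sum((di / dk) ** exp for dk in dists) for di in dists]
        for j, uj in enumerate(u):
            w = uj ** m
            den[j] += w
            for d in range(dim):
                num[j][d] += w * p[d]
    return [[num[j][d] / den[j] for d in range(dim)]
            for j in range(len(centroids))]
```

Because only the per-cluster sums and weights are kept, the memory cost per split is proportional to the number of clusters rather than the number of points.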
A distance calculation method more effective than Euclidean distance will help achieve better speedup.
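For document vectors, cosine distance is one common candidate; a small illustrative sketch, assuming dense term-frequency vectors:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; for sparse, high dimensional document
    # vectors it depends only on shared non-zero terms, which can be
    # cheaper to evaluate than Euclidean distance over the vocabulary.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)
```

Cosine distance also ignores document length, so a long and a short document on the same topic are treated as close, which often suits text clustering better than raw Euclidean distance.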
6. References
APPENDIX A
Vitae
Name of student: Ashutosh Shrikant Sathe
Native Place: Solapur, Maharashtra
Date of Birth: 14th May, 1989
Address: 130B, Vidyut Sahwas Society, Near Dhumma Vasti, Laxmi Peth, Solapur
Email: [email protected] .in
Objective: To be involved in work that helps to utilize, share, and improve knowledge and experience.
Short Term Goal: To complete a doctorate in Engineering
Areas of Interest: Data Mining, Distributed Computing, Distributed Computing Administration