
CHAPTER 1

INTRODUCTION

1.1 Introduction to Clustering

Clustering can be considered the most important unsupervised learning problem; as with every other problem of this kind, it deals with finding a structure in a collection of

unlabeled data. A loose definition of clustering could be “the process of organizing

objects into groups whose members are similar in some way”. A cluster is therefore a

collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters. Two or more objects belong to the same cluster if they are “close” according to a given distance measure (typically a geometric distance). This

is called distance-based clustering. Another kind of clustering is conceptual clustering:

two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects. In other words, objects are grouped according to their fit to

descriptive concepts, not according to simple similarity measures. The goal of

clustering is to determine the intrinsic grouping in a set of unlabeled data. But how to

decide what constitutes a good clustering? It can be shown that there is no absolute

“best” criterion which would be independent of the final aim of the clustering.

Consequently, it is the user who must supply this criterion, in such a way that the

result of the clustering will suit their needs. For instance, we could be interested in

finding representatives for homogeneous groups (data reduction), in finding “natural

clusters” and describing their unknown properties (“natural” data types), in finding

useful and suitable groupings (“useful” data classes) or in finding unusual data objects

(outlier detection) [1].

Possible Applications

Clustering algorithms can be applied in many fields, for instance:

Marketing: finding groups of customers with similar behavior given a large

database of customer data containing their properties and past buying

records;

Biology: classification of plants and animals given their features;

Libraries: book ordering;


Insurance: identifying groups of motor insurance policy holders with a

high average claim cost; identifying frauds;

City-planning: identifying groups of houses according to their house type,

value and geographical location;

Earthquake studies: clustering observed earthquake epicenters to identify

dangerous zones;

WWW: document classification; clustering weblog data to discover groups

of similar access patterns.

Requirements

The main requirements that a clustering algorithm should satisfy are:

scalability;

dealing with different types of attributes;

discovering clusters with arbitrary shape;

minimal requirements for domain knowledge to determine input parameters;

ability to deal with noise and outliers;

insensitivity to order of input records;

ability to handle high dimensionality;

interpretability and usability.

1.2 Types of Clustering

There exist a large number of clustering algorithms in the literature. The choice of

clustering algorithm depends both on the type of data available and on the particular

purpose and application. If cluster analysis is used as a descriptive or exploratory tool,

it is possible to try several algorithms on the same data to see what the data may

disclose. In general, major clustering methods can be classified into the following

categories [1].

1.2.1 Partitioning methods

Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it

classifies the data into k groups, which together satisfy the following requirements:


Each group must contain at least one object, and

Each object must belong to exactly one group. Note that the second requirement can be relaxed in some fuzzy partitioning techniques.

Given K, the number of partitions to construct, a partitioning method creates an initial

partitioning. It then uses an iterative relocation technique that attempts to improve the

partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects in different clusters are "far apart" or very different. There are

various kinds of other criteria for judging the quality of partitions. To achieve global

optimality in partitioning-based clustering would require the exhaustive enumeration

of all of the possible partitions. Instead, most applications adopt one of two popular

heuristic methods:

1. The k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and

2. The k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.

These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and for clustering very large data sets, partitioning-based methods need to be extended. Partitioning-based clustering methods are studied in depth later.

Given a database of objects and k, the number of clusters to form, a partitioning

algorithm organizes the objects into k partitions (k<=n), where each partition

represents a cluster. The clusters are formed to optimize an objective partitioning criterion, often called a similarity function, such as distance, so that the objects within

a cluster are "similar," whereas the objects of different clusters are "dissimilar" in

terms of the database attributes [2].

1.2.1.1 Classical Partitioning Methods: k-means and k-medoids

The most well-known and commonly used partitioning methods are k-means, k-

medoids, and their variations.


Centroid-Based Technique: The K-Means method

The k-means algorithm takes the input parameter, k, and partitions a set of n objects

into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster

similarity is low. Cluster similarity is measured in regard to the mean value of the

objects in a cluster, which can be viewed as the cluster's center of gravity. "How does

the k-means algorithm work?" The k-means algorithm proceeds as follows. First, it

randomly selects k of the objects, each of which initially represents a cluster mean or

center. For each of the remaining objects, an object is assigned to the cluster to which

it is the most similar, based on the distance between the object and the cluster mean. It

then computes the new mean for each cluster. This process iterates until the criterion

function converges. Typically, the squared-error criterion is used, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

where E is the sum of squared error for all objects in the database, p is the point in space representing a given object, and m_i is the mean of cluster C_i (both p and m_i are multidimensional). This criterion tries to make the resulting k clusters as compact and as separate as possible. The algorithm attempts to determine k partitions that

minimize the squared-error function. It works when the clusters are compact clouds

that are rather well separated from one another. The method is relatively scalable and

efficient in processing large data sets because the computational complexity of the

algorithm is O (nkt), where n is the total number of objects, k is the number of

clusters, and t is the number of iterations. Normally, k << n and t << n. The method

often terminates at a local optimum. The k-means method, however, can be applied

only when the mean of a cluster is defined. This may not be the case in some applications, such as when data with categorical attributes are involved. The necessity for users to specify k, the number of clusters, in advance can also be seen as a disadvantage. The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes. Moreover, it is sensitive to noise and

outlier data points since a small number of such data can substantially influence the

mean value [3].
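To make the procedure concrete, the following is a minimal sketch of the centroid-based k-means loop in Python using NumPy. The function name and structure are illustrative only; they are not taken from the thesis or from any particular library.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly select k objects as the initial cluster means (centers).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each object to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster mean; keep the old center if a cluster became empty.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Squared-error criterion E = sum over clusters of |p - m_i|^2.
    sse = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, sse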


1.2.2 Hierarchical methods

A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all the objects in the same cluster and successively splits a cluster into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds. Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity

is useful in that it leads to smaller computation costs by not worrying about a

combinatorial number of different choices. However, a major problem of such

techniques is that they cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: perform a more careful analysis of object linkages at each hierarchical partitioning, such as in CURE and Chameleon, or integrate hierarchical agglomeration and iterative relocation by first using a hierarchical agglomerative algorithm and then refining the result using iterative relocation, as in BIRCH.

A hierarchical clustering method works by grouping data objects into a tree of clusters.

Hierarchical clustering methods can be further classified into agglomerative and

divisive hierarchical clustering, depending on whether the hierarchy decomposition is

formed in a bottom-up or top-down fashion. The quality of a pure hierarchical

clustering method suffers from its inability to perform adjustment once a merge or

split decision has been executed. Recent studies have emphasized the integration of

hierarchical agglomeration with iterative relocation methods [4].


1.2.2.1 Agglomerative and Divisive Hierarchical Clustering

In general, there are two types of hierarchical clustering methods:

Agglomerative hierarchical clustering:

This bottom-up strategy starts by placing each object in its own cluster and then

merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of inter-cluster similarity.

Divisive hierarchical clustering:

This top-down strategy does the reverse of agglomerative hierarchical clustering by

starting with all objects in one cluster. It subdivides the cluster into smaller and

smaller pieces, until each object forms a cluster on its own or until it satisfies certain

termination conditions, such as a desired number of clusters is obtained or the distance

between the two closest clusters is above a certain threshold distance. Four widely

used measures for distance between clusters are as follows, where |p-p'| is the distance

between two objects or points p and p', m_i is the mean of cluster C_i, and n_i is the number of objects in C_i [5]; a short computational sketch follows the list.

minimum distance: d_{\min}(C_i, C_j) = \min_{p \in C_i,\; p' \in C_j} |p - p'|

maximum distance: d_{\max}(C_i, C_j) = \max_{p \in C_i,\; p' \in C_j} |p - p'|

mean distance: d_{\mathrm{mean}}(C_i, C_j) = |m_i - m_j|

average distance: d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|
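As a companion to the four measures above, the short sketch below computes them for two clusters given as NumPy arrays; the function name and dictionary keys are illustrative only.

import numpy as np

def cluster_distances(Ci, Cj):
    """Minimum, maximum, mean, and average inter-cluster distances."""
    # Pairwise Euclidean distances |p - p'| for every p in Ci and p' in Cj.
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    return {
        "minimum": pairwise.min(),
        "maximum": pairwise.max(),
        # Distance between the cluster means m_i and m_j.
        "mean": np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0)),
        # Average over all n_i * n_j pairwise distances.
        "average": pairwise.mean(),
    }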

1.2.3 Density-Based Methods

Most partitioning methods cluster objects based on the distance between objects. Such

methods can find only spherical-shaped clusters and encounter difficulty at

discovering clusters of arbitrary shapes. DBSCAN's definition of a cluster is based on the notion of density-reachability. Basically, a point q is directly density-reachable from a point p if it is not farther away than a given distance ε (i.e., q is part of the ε-neighborhood of p), and if p is surrounded by sufficiently many points that one may consider p and q to be part of a cluster. q is called density-reachable (note: this is different from "directly density-reachable") from p if there is a sequence of points p1, ..., pn with p1 = p and pn = q, where each pi+1 is directly density-reachable from pi. Note that the relation of density-reachability is not symmetric, so the notion of density-connectedness is introduced: two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o. Other clustering methods

have been developed based on the notion of density. Their general idea is to continue growing a given cluster as long as the density in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can

be used to filter out noise and discover clusters of arbitrary shape. DBSCAN is a

typical density-based method that grows clusters according to a density threshold;

OPTICS is a density-based method that computes an augmented clustering ordering

for automatic and interactive cluster analysis. Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure. The main

advantage of this approach is its fast processing time, which is typically independent

of the number of data objects and dependent only on the number of cells in each

dimension in the quantized space. STING is a typical example of a grid-based method.

CLIQUE and WaveCluster are two clustering algorithms that are both grid-based and density-based. Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a

density function that reflects the spatial distribution of the data points. It also leads to a

way of automatically determining the number of clusters based on standard statistics,

taking "noise" or outliers into account and thus yielding robust clustering methods. Model-based clustering methods are studied below. Some clustering algorithms

integrate the ideas of several clustering methods, so that it is sometimes difficult to

classify a given algorithm as uniquely belonging to only one clustering method

category. Furthermore, some applications may have clustering criteria that require the

integration of several clustering techniques. In the following sections, we examine

each of the above five clustering methods in detail. We also introduce algorithms that

integrate the ideas of several clustering methods. Outlier analysis, which typically

involves clustering, is described at the end of this section [6].
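For illustration, the following is a compact, unoptimized DBSCAN-style sketch that follows the ε-neighborhood and minimum-points definitions given above; it is a didactic sketch, not the reference implementation of DBSCAN.

import numpy as np

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise. O(n^2) didactic version."""
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def neighbors(i):
        # Indices of all points within distance eps of point i (its eps-neighborhood).
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = list(neighbors(i))
        if len(seeds) < min_pts:
            continue                    # point i is not a core point
        labels[i] = cluster_id
        # Grow the cluster through density-reachable points.
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = neighbors(j)
                if len(j_neighbors) >= min_pts:
                    seeds.extend(j_neighbors)
            if labels[j] == -1:
                labels[j] = cluster_id
        cluster_id += 1
    return labels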


1.3 Thesis Statement

Several clustering methods are studied, including the density-based DBSCAN, particle swarm optimization (PSO), hierarchical clustering, hierarchical agglomerative clustering (HAC), C-means, and K-means algorithms. Some of these integrate the ideas of several clustering methods, so it is sometimes difficult to classify a given algorithm as uniquely belonging to only one clustering method category. Furthermore, some applications may have clustering criteria that require the integration of several clustering techniques. In the following sections, we examine the K-means clustering algorithm combined with a genetic algorithm on the breast cancer, thyroid, and E. coli datasets in detail.

1.4 Problem Statement

In the field of data mining there are many clustering techniques, such as k-means, C-means, hierarchical clustering, and DBSCAN. Clustering is used in important areas like pattern recognition, image analysis, bioinformatics, earthquake studies, and insurance. However, clustering techniques face three basic problems: first, the seed generation problem; second, the generation of the right number of clusters; and third, the cluster validation problem. In this thesis our main interest is to overcome these problems by using a genetic algorithm.

1.5 Scopes of this Thesis

Among existing clustering techniques, the k-means algorithm is a powerful clustering algorithm for numeric data sets and is the most widely used in data mining for clustering. The algorithm randomly generates the cluster centers, called the seeds of the clusters, and then builds the whole clusters by measuring the minimum distances between points. The objective of this thesis is therefore to generate an initial seed that is the optimal seed of the cluster, and then to generate the whole cluster using the KVG algorithm, so that the result is more accurate and has fewer errors compared to k-means. This work is done by using a genetic algorithm through VSM. VSM is the vector sequence method; it maintains the flow of iterations and manipulates the selection of seeds for the genetic algorithm.


1.6 Document Organization

This dissertation consists of seven chapters.

Chapter 1 Introduction

This chapter introduces some basic concepts, the problems to address, and the thesis

statement and provides the motivation and justification for the work described in this

dissertation.

Chapter 2 Literature Survey

This chapter provides a brief description of different clustering algorithms and the genetic algorithm.

Chapter 3 Theoretical Aspects

This chapter describes the basics of the K-means and genetic algorithms on different datasets (breast cancer dataset, thyroid dataset, E. coli dataset).

Chapter 4 Setting up Environment

This chapter provides details for setting up the MATLAB environment.

Chapter 5 Proposed Models for Classification

This chapter provides the details of the proposed model for clustering. It also describes the use of the genetic algorithm with the K-means algorithm.

Chapter 6 Result Analysis

This chapter provides the results on various datasets using clustering algorithms and the combination of the genetic algorithm with K-means. This chapter also analyses the results obtained.

Chapter 7 Conclusion and Future Work

This chapter includes the conclusion and future scope of the dissertation.


CHAPTER 2

LITERATURE SURVEY

2.1 K-Means Clustering

The idea is to choose random cluster centers, one for each cluster. These centers are

preferred to be as far as possible from each other. Starting points affect the clustering

process and results. After that, each point will be taken into consideration to calculate

similarity with all cluster centers through a distance measure, and it will be assigned

to the most similar cluster, the nearest cluster center. When this assignment process is

over, a new center will be calculated for each cluster using the points in it. For each

cluster, the mean value will be calculated for the coordinates of all the points in that

cluster and set as the coordinates of the new center. Once we have these k new

centroids or center points, the assignment process must start over. As a result of this

loop we may notice that the k centroids change their locations step by step until no

more changes are made. When the centroids do not move any more or no more errors

exist in the clusters, we say the clustering has reached a minimum. Finally, this

algorithm aims at minimizing an objective function, which is in this case a squared

error function. A simple approach is to compare the results of multiple runs with

different k clusters and choose the best one according to a given criterion. However,

we need to be careful as increasing k results in smaller error-function values by

definition, due to the smaller number of data points each center will represent, and thus it will lose its generalization ability, as well as increasing the risk of overfitting.
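One simple way to carry out the comparison described above is to run k-means for several values of k and record the squared-error criterion for each; the helper below assumes a kmeans(X, k) function that returns (labels, centers, sse), like the earlier sketch, and is illustrative only.

def sse_by_k(X, k_values, kmeans):
    """Run k-means for each candidate k and collect the squared error (SSE)."""
    scores = {}
    for k in k_values:
        _, _, sse = kmeans(X, k)
        # SSE always shrinks as k grows, so look for an "elbow" rather than the minimum.
        scores[k] = sse
    return scores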

2.1.1 Initialization for Clustering Techniques

The main purpose of clustering algorithm modifications is to improve the

performance of the underlying algorithms by fixing their weaknesses. Because randomness is one of the techniques used in initializing many clustering techniques, giving each point an equal opportunity to be an initial one, it is considered the main point of weakness that has to be solved. Because K-Means is highly sensitive to its initial points, we have to make them as near to the global minimum as possible in order to improve the clustering performance [3, 5].


K-Means is one of the most common algorithms used for clustering. The algorithm

classifies pixels to a predefined number of clusters (assume k clusters). The idea is to

choose random cluster centers called centroids, one for each cluster. These centroids

are preferred to be as far as possible from each other. Initial points affect the

clustering process and results. After that, each pixel will be taken into consideration to

calculate similarity with all cluster centers through a distance measure, and it will be

assigned to the most similar cluster, the nearest cluster center. When this assignment

process is over, a new centroid is calculated for each cluster using the pixels in it. For

each cluster,the mean value will be calculated for the coordinates of all the points in

that cluster and set as the coordinates of the new center. Once we have these k new

centroids or center points, the assignment process must start over. This process is

repeated until there is no change in centroids. Finally, this algorithm aims at

minimizing an objective function, which in this case is a squared error function as given by Eq. (1).

E = \sum_{k=1}^{K} \sum_{x \in C_k} \sum_{a=1}^{A} (x_a - m_{k,a})^2 \qquad (1)

In this formula K is the number of clusters, x represents a data point, Ck represents

cluster k, mk represents the mean of the cluster k, and A is the total number of

attributes for a data point.

The K-means algorithm starts by initializing the k cluster centers. The input data

points are then allocated to one of the existing clusters according to the square of the

Euclidean distance from the clusters, choosing the closest. The mean (centroid) of

each cluster is then computed so as to update the cluster center. This update occurs as

a result of the change in the membership of each cluster. The processes of re-assigning the input vectors and updating the cluster centers are repeated until there is no more change in the value of any of the cluster centers. The K-Means clustering method can be considered a cunning method because, to obtain a better result, the centroids are kept as far as possible from each other.


The steps for the K-means algorithm are given below:

1. Initialization: choose randomly K input vectors (data points) to initialize the

clusters.

2. Nearest-neighbor search: for each input vector, find the cluster center that is

closest, and assign that input vector to the corresponding cluster.

3. Mean update: update the cluster centers in each cluster using the mean

(Centroid) of the input vectors assigned to that cluster.

4. Stopping rule: repeat steps 2 and 3 until no more change in the value of the

means.

One drawback of K-means is that it is sensitive to the initially selected points, and so

it does not always produce the same output. To avoid this problem, the algorithm may

be run many times before taking the average values over all runs, or at least taking the median value [3].
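One common way to soften this sensitivity to the initial points is to run the algorithm several times from different random seeds and keep the partition with the lowest squared error. The sketch below assumes a kmeans(X, k, seed=...) function as in the earlier sketch; it is not the seed-selection method proposed later in this thesis.

def best_of_restarts(X, k, kmeans, n_restarts=10):
    """Run k-means from several random initializations and keep the smallest-SSE run."""
    best = None
    for seed in range(n_restarts):
        labels, centers, sse = kmeans(X, k, seed=seed)
        if best is None or sse < best[2]:
            best = (labels, centers, sse)
    return best   # (labels, centers, sse) of the best run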

2.2 A Fast Genetic K-means Clustering Algorithm

In this paper, we propose a new clustering algorithm called the Fast Genetic K-means Algorithm (FGKA). FGKA is inspired by the Genetic K-means Algorithm (GKA) but features several improvements over it, including efficient calculation of TWCVs (total within-cluster variations), avoidance of the illegal-string elimination overhead, and a simplified mutation operator. The initialization phase and the three operators are redefined to achieve these improvements. The experiments indicate that, while the K-means algorithm might converge to a local optimum, both FGKA and GKA always converge to the global optimum eventually, but FGKA runs much faster than GKA [7].

FGKA starts with the initialization phase, which generates the initial population P0. The population in the next generation Pi+1 is obtained by applying the following genetic operators sequentially: the selection, the mutation, and the K-means operator on the current population Pi. The evolution takes place until the termination condition is reached. The initialization phase randomly generates the initial population P0 of Z solutions, which might end up with illegal strings. At first sight, illegal strings are undesirable. For this reason, the GKA algorithm makes significant effort to eliminate illegal strings. Illegal strings, however, are permitted in FGKA, but are considered as

the most undesirable solutions by defining their TWCVs as +∞ and assigning them

with lower fitness values. Our flexibility of allowing illegal strings in the evolution

process avoids the overhead of illegal string elimination, and thus improves the time

performance of the algorithm. In the following, we give a brief description of the

three genetic operators.

FGKA maintains a population (set) of Z coded solutions (partitions), where Z is a parameter specified by the user. Each solution is coded by a string a1…aN of length N. Given a solution Sz = a1…aN, the legality ratio of Sz, e(Sz), is defined as the number of non-empty clusters in Sz divided by K. Sz is legal if e(Sz) = 1, and illegal otherwise.
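As an illustration of the ideas above, the sketch below scores a coded solution string a1…aN: the total within-cluster variation (TWCV) is used as the objective, and illegal strings (legality ratio below 1) are kept but assigned an infinite TWCV so that they receive the lowest fitness. The actual FGKA operators in [7] are more involved; this is only a simplified, assumed formulation.

import numpy as np

def legality_ratio(solution, k):
    """e(Sz): fraction of the K clusters that are non-empty in the string."""
    return len(set(solution)) / k

def twcv(X, solution, k):
    """Total within-cluster variation; +inf for illegal strings (with empty clusters)."""
    solution = np.asarray(solution)
    if legality_ratio(solution, k) < 1.0:
        return np.inf                  # illegal string: treated as the most undesirable solution
    total = 0.0
    for j in range(k):
        members = X[solution == j]
        total += float(((members - members.mean(axis=0)) ** 2).sum())
    return total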



2.3 Agglomerative Hierarchical Clustering based on Affinity

Propagation Algorithm

A hierarchical clustering based approach for identifying refactoring that would

improve the class structure of a software system was introduced [8]. In this direction,

a hierarchical agglomerative clustering algorithm (HAC) was developed. The

algorithm suggests the refactoring needed in order to improve the structure of the

software system. The main idea is that clustering is used in order to obtain a better

design, suggesting the needed refactorings. Real applications evolve in time, and new application classes are added in order to meet new requirements. Obviously, for

obtaining in these conditions a restructuring model of the modified software system,

the clustering algorithm (HAC in our approach) can be applied from scratch, every

time when the application classes set changes. This means that every time when the

software system changes, the extended system is analyzed starting from the entire set

of classes, methods and attributes, and HAC is applied to obtain an improved

structure of the system. But this can be inefficient, particularly for large software systems. This limitation is addressed by proposing an adaptive method to cope with the evolving application classes set. The method is based on detecting stable structures (cores) inside the

restructuring model of the system and resuming the clustering process from these

structures, when the application classes set increases. We aim to reach the result more

efficiently than applying HAC again from scratch on the extended software

system.

In order to implement the hierarchical structure which extends the single-level

classification to a hierarchical multilevel one, the first task is to divide all classes’

hypotheses into groups level-by-level. For this, an agglomerative clustering can be

used to merge same classes or group of classes, level by level until all classes are

merged together. For example classes in Figure 1 can be agglomerated as in Figure 2.

According to Figure 2, class1 and class2 have greater similarity or smaller distance

and are merged together in the first level. The distance between classes or group of

them is performance based. The most straight-forward performance-based distance

between two classes or class groups is probably the accuracy of identifying them from

each other. Higher accuracy indicates that they are easier to discriminate (larger

distance). To calculate this distance, a series of pair-wise classification experiments


are performed at each classification level, and the classes in each pair that have the smallest performance are chosen to be merged.

Figure 2.1. An example of class distribution

The affinity propagation (AP) algorithm doesn't fix the number of clusters and doesn't rely on random sampling [8]. It exhibits fast execution speed with a low error rate. However, it is hard to generate optimal clusters. This paper proposes an agglomerative clustering based on the affinity propagation method to overcome this limitation. It puts forward k-cluster closeness to merge the clusters yielded by AP. In comparison to AP, the method has better performance and is better than or equal to the quality of the AP method, and it has an advantage in time complexity compared to adaptive affinity propagation. The paper studies properties of the AP algorithm, and then proposes agglomerative hierarchical clustering based on AP. It generates the initial division by AP partitioning, and it defines a novel cluster closeness based on the neighbor relationship, which can evade the influence of density. Based on this, the algorithm can quickly and effectively perform agglomerative hierarchical clustering and generate better clusters. Experiments show that it works better than the original AP, gets a more accurate division, and has an advantage in time complexity compared with adaptive affinity propagation. How to deal with data with complicated structure and noise is a

direction for future research.


2.4 Hybridized Improved Genetic Algorithm with Variable Length

Chromosome for Image Clustering

Clustering is a process of putting similar data into groups. This paper presents data

clustering using improved genetic algorithm (IGA) in which an efficient method of

crossover and mutation are implemented. Further it is hybridized with the popular

Nelder-Mead (NM) Simplex search and K-means to exploit the potentiality of both in

the hybridized algorithm. The performance of hybrid approach is evaluated with few

data clustering problems. Further a Variable Length IGA is proposed which optimally

finds the clusters of benchmark image datasets and the performance is compared with

K-means and GCUK. The results revealed are very encouraging for IGA and its hybridization with other algorithms. Improved GA-based clustering was evaluated on some well-known data sets. Although K-means clustering is a very well established approach, it has some demerits of initialization and falling into local minima. GA, being a randomization-based approach, has the capability to alleviate the problems faced by K-means. In this paper an improved version of GA (IGA) was discussed and implemented for data clustering, in which a new approach of crossover and offspring formation is adopted. When applied to the data clustering problem, IGA performs better compared to K-means on all data sets under study in this paper. However, to further improve the performance of IGA on data clustering, K-means was hybridized with it, resulting in KM-IGA; and to boost KM-IGA further, it has been hybridized with Nelder-Mead, resulting in KM-NM-IGA. In the hybrid algorithm (KM-NM-IGA) the outcome of K-means becomes one of the chromosomes in the initial population of NM-IGA. The results reveal that the hybrid algorithm gives better results compared to K-means, IGA and Nelder-Mead. Since the clustering results achieved by the IGA are satisfactory, we have applied the IGA to the image clustering

problem by proposing a new variable length IGA (VLIGA) for automatic evolution of

clusters. Experiments were carried out with three standard natural grey scale images

to evaluate the performance of the proposed VLIGA. It was evident from the results

that the VLIGA algorithm was effective compared to the GCUK and traditional K-means algorithms. Further enhancements will include the study of higher-dimensional data sets and large data sets for clustering. Also, datasets with mixed data can be studied. It is also planned to study the appropriateness of the hybrid algorithm (K-NM-IGA) for


image clustering and extend the same to color images. The K-means algorithm tends

to converge faster than the IGA as it requires fewer function evaluations, but it

usually results in less accurate clustering. One can take advantage of its speed at the

inception of the clustering process and leave accuracy to be achieved by other

methods at a later stage of the process. This statement shall be verified in later

sections of this paper by showing that the results of clustering by IGA can further be

improved by seeding the initial population with the outcome of the K-means

algorithm (denoted as KM–IGA and KM–NM–IGA). More specifically, the hybrid

algorithm first executes the K-means algorithm, which terminates when there is no

change in centroid vectors. In the case of KM–IGA, the result of the K-means

algorithm is used as one of the chromosomes, while the remaining chromosomes are

initialized randomly. The IGA algorithm then proceeds as presented above. In the case

of KM–NM–IGA, the first chromosome is seeded from the k-means algorithm and the rest of the 3N particles, or vertices as they are termed, are randomly generated, and NM–IGA is then carried out

to its completion [9].

2.5 Gene Expression Analysis Using Clustering

Data Mining has become an important topic in effective analysis of gene expression

data due to its wide application in the biomedical industry. In this paper, k-means

clustering algorithm has been extensively studied for gene expression analysis. Since

our purpose is to demonstrate the effectiveness of the k-means algorithm for a wide

variety of data sets, two pattern recognition data sets and thirteen microarray data sets with both overlapping and non-overlapping class boundaries were taken for the studies, where the number of features/genes ranges from 4 to 7129 and the number of samples ranges from 32 to 683. The number of clusters ranges from two to eleven. For pattern recognition, we use IRIS and WBCD data, and for microarray data we use serum data, yeast data, leukemia data, breast data, lymphoma data, lung cancer data, and St. Jude

leukemia data. To identify common subtypes in independent disease data, four

different types of breast data and four Diffused Large B-cell Lymphoma (DLBCL)

data were used. Clustering error rate (or, clustering accuracy) is used as evaluation

metrics to measure the performance of k-means algorithm. Clustering is an efficient

way of analyzing information from microarray data and K-means is a basic method

for it. K-means can be very easily applied to Microarray data. Depending on the


nature and complexity of the data, the performance of K-means varies. We achieve maximum accuracy for the IRIS data, whereas the lowest is for DLBCL D. K-means has some serious drawbacks. Many papers have been presented in the past to improve K-Means. In the

future we are planning to study K-Means clustering with other heuristic based search

methods like SA and GA or some others [10].
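The clustering error rate (equivalently, clustering accuracy) used as the evaluation metric above can be computed, for example, by mapping each cluster to the most frequent true class among its members; the helper below is one such illustrative definition and is not necessarily the exact metric used in [10].

from collections import Counter

def clustering_accuracy(true_labels, cluster_labels):
    """Map each cluster to its majority class and return the fraction of matched samples."""
    correct = 0
    for c in set(cluster_labels):
        members = [t for t, g in zip(true_labels, cluster_labels) if g == c]
        correct += Counter(members).most_common(1)[0][1]   # size of the majority class
    return correct / len(true_labels)                      # error rate = 1 - accuracy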

2.6 Enhancing Cluster Compactness using GA Initialized K-means

This paper presents a new initialization technique for clustering. Genetic algorithm

has been used for optimal centroid selection. These centroids act as starting points for

k-means. Previous research used GA (genetic algorithm) initialized K-means

(GAIK) for clustering. In this paper some modification is done and a partition based

GA initialized K-means (PGAIK) technique is introduced in order to improve the

clustering performance. To measure the cluster compactness, a within-cluster scatter criterion has been used. Experimental results show that PGAIK yields more compact

clusters as compared to simple GAIK. The initialization step is very important for any

clustering algorithm. The experimental results show that the partition based random

initialization method performs well and yields more compact clusters as compared to

the normal random selection [11].
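The within-cluster scatter criterion mentioned above can be computed, for instance, as the trace of the pooled within-cluster scatter matrix; the following sketch shows one common formulation, which may differ in detail from the criterion used in [11].

import numpy as np

def within_cluster_scatter(X, labels):
    """Trace of the within-cluster scatter matrix (smaller values = more compact clusters)."""
    scatter = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        diffs = members - members.mean(axis=0)
        scatter += float(np.trace(diffs.T @ diffs))   # sum of squared deviations from the centroid
    return scatter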


2.7 Ant-based Clustering Algorithms

Ant-based clustering is a biologically inspired data clustering technique. Clustering

task aims at the unsupervised classification of patterns in different groups. Clustering

problem has been approached from different disciplines during the last years. In recent

years, many algorithms have been developed for solving numerical and combinatorial

optimization problems. Most promising among them are swarm intelligence

algorithms. Clustering with swarm-based algorithms is emerging as an alternative to

more conventional clustering techniques. These algorithms have recently been shown

to produce good results in a wide variety of real-world applications. During the last

five years, research on and with the ant-based clustering algorithms has reached a

very promising state. In this paper, a brief survey on ant-based clustering algorithms

is described. We also present some applications of ant-based clustering algorithms.

Ant-based clustering algorithms are an appropriate alternative to traditional clustering


algorithms. The algorithm has a number of features that make it an interesting approach to cluster analysis. It has the ability to automatically discover the number of clusters. It scales linearly with the dimensionality of the data. The nature of the

algorithm makes it fairly robust to the effects of outliers within the data. Research on

ant-based clustering algorithms is still an on-going field of research. In this paper, we

address a brief survey of ant-based clustering algorithms and an overview of some of

its applications. There are a number of directions in which research on ant-based

clustering can be continued. We summarize and conclude the survey with listing some

important future works and research trends for ant-based clustering algorithms: a

comparative study of ant clustering performance with respect to other clustering

algorithms; applying ant clustering algorithms to real-world applications; effects on

performance of user-defined parameters; a hierarchical analysis of the input data by

varying some of the user-defined parameters; sensitivity analysis of various user-

defined parameters of ant clustering algorithms; to determine optimal values of

parameters other than pick and drop policies; developing new probabilistic rules for

picking and dropping objects; study the effect based on reasonably good validity

index function to judge the fitness of several possible partitioning of the data of ant-

based clustering schemes and validating mathematically; study the possibility of

dynamic clustering using ant clustering with data mining applications; applying ant

clustering algorithms for multi-objective optimization problems; study of

transformation of ant clustering algorithms into supervised algorithms; developing

new theoretical results of behavior of ant clustering algorithms and study of

hierarchical ant-based clustering algorithms; to analyze the working principles that

ant-based clustering shares with other clustering methods; hybridization of ant-

clustering algorithm with alternative clustering methods [12].

2.8 Fuzzy Kernel K-Means Clustering Method Based on Immune GA

A fuzzy kernel k-means clustering method based on immune genetic algorithm (IGA-

FKKM) is proposed in this paper to overcome the dependence on the shape of the sample space and the local optimization of the fuzzy k-means algorithm. By mapping samples from a low-dimensional space into a high-dimensional feature space with a Mercer kernel, the method eliminates the influence of the shape of the sample space on clustering

accuracy. Meanwhile, the probability of gaining the global optimal value is also


increased by using the immune genetic algorithm. Compared with the fuzzy k-means

clustering method (FKM) and the fuzzy k-means clustering method based on genetic

algorithm (GA-FKM), IGA-FKKM is validated by experimental results to achieve

higher classification accuracy. We propose a Fuzzy Kernel K-Means clustering

method based on the Immune Genetic Algorithm (IGA-FKKM). The dependence of fuzzy K-Means clustering on the distribution of samples is eliminated with the introduction of a kernel function. The immune genetic algorithm is also used to suppress fluctuations occurring at later stages of the evolution and to avoid local optima. Compared with FKM and GA-

FKM, the experimental results show that IGA-FKKM obtains the global optimum,

and has higher cluster accuracy. Further study will focus on dealing with the

sensitivity of the clustering algorithm to the initial value.
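To make the kernel idea concrete: with a Mercer kernel K(·,·), the squared distance between a mapped sample φ(x) and the mean of a cluster in feature space can be evaluated without computing φ explicitly. The sketch below uses an RBF kernel as an example; it illustrates only the kernel trick and is not the IGA-FKKM algorithm itself.

import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Mercer kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return float(np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def feature_space_dist_sq(x, cluster, gamma=1.0):
    """||phi(x) - mean of phi(y) over y in cluster||^2 via the kernel trick."""
    n = len(cluster)
    k_xx = rbf_kernel(x, x, gamma)
    k_xc = sum(rbf_kernel(x, y, gamma) for y in cluster) / n
    k_cc = sum(rbf_kernel(y, z, gamma) for y in cluster for z in cluster) / (n * n)
    return k_xx - 2.0 * k_xc + k_cc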

The experiments show that while the K-means algorithm might converge to a local optimum, both FGKA and GKA always converge to the global optimum, but FGKA runs almost 20 times faster than GKA. They also show that the three improvements (efficient calculation of TWCVs, avoiding the illegal string elimination overhead, and the simplification of the mutation operator) have different impacts on the improvement over GKA. More details

are available in [13].

2.9 An IGA for Document Clustering with Semantic Similarity

Measure

This paper proposes a self-organized IGA (improved genetic algorithm) for document

clustering based on semantic similarity measure. The traditional method to represent

text is that the document is organized as a string of words, while the conceptual

similarity is ignored. We take advantage of thesaurus-based ontology to overcome this

problem. To investigate how ontology method could be used effectively in document

clustering, a hybrid strategy which combines the thesaurus-based semantic similarity

measure and the vector space model (VSM) measure to provide a more accurate assessment of similarity between documents is implemented. Considering the interplay between the diversity of the population and the selective pressure, an approach of dynamic evolution operators is put forward in this article. In our experiments, two data sets of

200 and 600 documents from the Reuters-21578 corpus are excerpted for testing, and the

experiment results show that our method of genetic algorithm in conjunction with the

hybrid semantic strategy, the combination of the thesaurus-based measure and VSM-


based measure, outperforms that with the sole VSM measure. Our clustering

algorithm also efficiently enhances the performance of precision and recall in

comparison with k-means in the same similarity environments. In this article a

modified genetic algorithm with the semantic similarity measure is proposed for

document clustering. The common problem in the fields of text clustering is that the

document is represented as a bag of words, while the conceptual similarity between

each pairs of documents is ignored. We take advantage of thesaurus based ontology to

overcome this problem. In our experiments, data set 1 with 200 documents from four

topics and data set 2 with 600 documents from 6 topics are selected for testing. The

results show that our genetic algorithm in conjunction with the hybrid strategy, the

combination of the VSM-based and thesaurus-based similarity measure, gets the best

clustering performance in terms of the precision and recall. Furthermore, the proposed

self-organized genetic algorithm, considering the influence between the diversity of

the population and the selective pressure, efficiently evolves the clustering of the

documents in comparison with standard k-means algorithm in the same similarity

strategy. As we discussed, some important words which transform to incomplete

forms after stemming are not included in the WordNet lexicon and will not be considered

as concepts for similarity evaluation. In the future we will refine our algorithm by

using a better parser, for example, Text Analyst, or combining it with the corpus-

based method to overcome this problem for clustering [14].
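As an illustration of combining a VSM measure with a thesaurus-based one, the sketch below computes the cosine similarity of two document vectors and blends it with a semantic similarity score through a weight lam. The weighting scheme and the semantic_sim placeholder are assumptions for illustration, not the exact formulation of [14].

import numpy as np

def cosine_sim(u, v):
    """Standard VSM similarity between two document vectors (e.g. TF-IDF weights)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def hybrid_sim(u, v, semantic_sim, lam=0.5):
    """Blend the VSM cosine similarity with a thesaurus/ontology-based score in [0, 1]."""
    # semantic_sim(u, v) stands in for a WordNet/ontology-based measure (assumed here).
    return lam * cosine_sim(u, v) + (1.0 - lam) * semantic_sim(u, v)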

2.10 Web Clustering Based on GA with Latent Semantic Indexing

Technology

This paper constructs a latent semantic text model using a genetic algorithm (GA) for

web clustering. The main difficulty in the application of GA for text clustering is

thousands or even tens of thousands of dimensions in the feature space. Latent

semantic indexing (LSI) is a successful technology which attempts to explore the

latent semantic structure in textual data, and furthermore, it reduces this large space

to a smaller one and provides a robust space for clustering. GA belongs to the class of search

techniques that efficiently evolve the optimal solution for the problem. Evolved in the

reduced latent semantic indexing model, GA can improve clustering accuracy and

speed, which is typically suitable for real-time clustering. We used the SSTRESS criterion


to analyze the dissimilarity between the original term-by-document corpus matrix and the

approximate decomposition matrix with different ranks corresponding to the

performance of our algorithm evolved in the reduced space. The superiority of GA

applied in LSI model over K-means and conventional GA in the vector space model

(VSM) is demonstrated by providing good Reuters text clustering results. In this paper,

we propose a genetic algorithm with latent semantic indexing method for web text

clustering. The analysis of LSI shows that not only does it provide an underlying

semantic structure for text model, but also reduces the dimension drastically which is

very suitable for GA to evolve the optimal clusters. Our algorithm is applied to

Reuter-21578 text collection and demonstrates the effectiveness of our clustering

algorithm which is superior to that of GA in VSM and traditional K-means algorithm.

200 documents from 4 topics are chosen for simulating real time web clustering.

When the dimensions are reduced to 100, GA with 900 terms in LSI model obtains its

best performance with the computational time of 11.3 seconds. Therefore, the less

important dimensions corresponding to “noise” are ignored. A reduced-rank

approximation matrix to the original matrix is constructed by dropping these noisy

dimensions. Furthermore, the experimental results verify that we have succeeded in

reducing the number of terms. We then use the SSTRESS criterion to analyze the

dissimilarity between original term-by-document corpus matrix and the approximate

decomposition matrix with different ranks corresponding to the performance of our

algorithm evolved in the reduced space. In the future, we will refine our algorithm by

decreasing computational time of clustering [15].
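Latent semantic indexing as described above reduces the term-by-document matrix with a truncated singular value decomposition; the short sketch below shows the reduction step with NumPy, with the target rank chosen by the user (100 dimensions in the experiment quoted above).

import numpy as np

def lsi_reduce(term_doc_matrix, rank):
    """Project documents into a rank-r latent semantic space via truncated SVD."""
    # term_doc_matrix: terms x documents (e.g. TF-IDF weights).
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    # Keep only the top `rank` singular triplets; the dropped ones correspond to "noise".
    doc_vectors = (np.diag(s[:rank]) @ Vt[:rank, :]).T   # documents x rank
    return doc_vectors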

2.11 Hierarchical Clustering for Adaptive Refactoring Identification

This paper studies an adaptive refactoring problem. It is well-known that improving

the software systems design through refactoring is one of the most important issues

during the evolution of object oriented software systems. We focus on identifying the

refactoring needed in order to improve the class structure of software systems, in an

adaptive manner, when new application classes are added to the system. We propose

an adaptive clustering method based on a hierarchical agglomerative approach that

adjusts the structure of the system that was established by applying a hierarchical

agglomerative clustering algorithm before the application classes set changed. The

adaptive method identifies, more efficiently, the refactoring that would improve the


structure of the extended software system, without decreasing the accuracy of the

obtained results. An experiment testing the method’s efficiency is also reported. We

have proposed in this paper a new method (HAR) for adapting a restructuring scheme

of a software system when new application classes are added to the system. The

considered experiment proves that the result is reached more efficiently using HAR

method than running HAC again from scratch on the extended software system.

Further work will be done in the following directions: To isolate conditions to decide

when it is more effective to adapt (using HAR) the partitioning of the extended

software system than to recalculate it from scratch using the HAC algorithm. To apply the

adaptive algorithm HAR on open source case studies and real software systems.

To identify adaptive extensions of other existing automatic methods for refactoring

identification [16].

2.12 Summarization of Text Clustering Based Vector Space Model

Text clustering is an important task of natural language processing and is widely

applicable in areas such as information retrieval and web mining. The representation

of document and the clustering algorithm are the key issues of text clustering. This

paper discusses Vector Space Model (VSM)-based clustering algorithms. This paper

reviews the text clustering algorithms. Text clustering has three main issues: sparse high dimensionality, multi-word synonyms and polysemy. They lead the clustering to very high time complexity, greatly interfere with the accuracy of the clustering algorithm and cause a sharp decline in the performance of clustering. This is the main difficulty of the technique. The paper describes some clustering algorithms which are widely used for document clustering based on the VSM model [17].


2.13 Initializing K-Means using Genetic Algorithms

K-Means (KM) is considered one of the major algorithms widely used in clustering.

However, it still has some problems, and one of them is in its initialization step where

it is normally done randomly. Another problem for KM is that it converges to local

minima. Genetic algorithms are one of the evolutionary algorithms inspired from

nature and utilized in the field of clustering. In this paper, we propose two algorithms

to solve the initialization problem, Genetic Algorithm Initializes KM (GAIK) and KM

Initializes Genetic Algorithm (KIGA). To show the effectiveness and efficiency of

our algorithms, a comparative study was done among GAIK, KIGA, Genetic-based

Clustering Algorithm (GCA), and FCM. Our experimental evaluation scheme was

used to provide a common base of performance assessment and comparison with

other methods. From the experiments on the eight data sets, we find that pre-initialized algorithms work well and yield meaningful and useful results in terms of finding good clustering configurations, which contain interdependence information within clusters and discriminative information for clustering. They are also more effective in selecting, from each cluster, significant centers with high multiple interdependence with the other points of that cluster. Finally, when comparing the experimental results of K-Means, GKA, GAIK and KIGA, we find that KIGA is better than the others; as shown by the results on all datasets, KIGA achieves higher clustering accuracy than the other algorithms [18].


CHAPTER 3

THEORETICAL ASPECTS

3.1 K-mean algorithm

In statistics and data mining, k-means clustering is a method of cluster analysis which

aims to partition n observations into k clusters in which each observation belongs to

the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms, commonly employed, that converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, as both use an iterative refinement approach. Additionally, both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

Almost all partitioned clustering methods are based upon the idea of optimizing a

function F referred to as clustering criterion which, hopefully, translates one's

intuitive notions on cluster into a reasonable mathematical formula. The function

value usually depends on the current partition of the database {C1; . . . ; Ck}.

Concretely, the K-Means algorithm finds locally optimal solutions using as clustering criterion F the sum of the squared L2 distances between each element and its nearest cluster centre (centroid). This criterion is usually referred to as the square-error criterion.

Therefore, it follows that

F = \sum_{i=1}^{K} \sum_{j=1}^{K_i} \lVert w_{ij} - \bar{w}_i \rVert_2^2

where K is the number of clusters, K_i is the number of objects of cluster i, w_{ij} is the j-th object of the i-th cluster and \bar{w}_i is the centroid of the i-th cluster, which is defined as

\bar{w}_i = \frac{1}{K_i} \sum_{j=1}^{K_i} w_{ij}


As can be seen below where the pseudo-code is presented, the K-Means algorithm is

provided somehow with an initial partition of the database and the centroids of these

initial clusters are calculated. Then, the instances of the database are relocated to the

cluster represented by the nearest centroid in an attempt to reduce the square-error.

This relocation of the instances is done following the instance order. If an instance in

the relocation step (Step 3) changes its cluster membership, then the centroids of the

clusters Cs and Ct and the square-error should be recomputed. This process is

repeated until convergence, that is, until the square-error cannot be further reduced

which means no instance changes its cluster membership.

*******************************************************************

Step 1: Select (somehow) an initial partition of the database into K clusters {C1, ..., CK}
Step 2: Calculate the cluster centroids \bar{w}_i, i = 1, ..., K
Step 3: FOR every instance w in the database, following the instance order, DO
Step 3.1: Reassign instance w to its closest cluster centroid:
          w is moved from C_s to C_t if \lVert w - \bar{w}_t \rVert \le \lVert w - \bar{w}_j \rVert for all j = 1, ..., K, j \ne s
Step 3.2: Recalculate the centroids of clusters C_s and C_t
Step 4: IF cluster membership is stabilized THEN stop ELSE go to Step 3.

*******************************************************************
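The relocation scheme above can also be written compactly. The following is a minimal Python/NumPy sketch of the K-Means loop just described; it uses batch reassignment rather than strict instance-order relocation (a simplification), and the data matrix X and the initial centroids are assumed to be supplied by the caller.
*******************************************************************
import numpy as np

def kmeans(X, centroids, max_iter=100):
    """Minimal K-Means: X is (n, d) data, centroids is (K, d) initial guess."""
    for _ in range(max_iter):
        # Step 3: assign every instance to its nearest centroid (squared L2 distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3.2: recompute the centroids of the clusters
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))])
        # Step 4: stop when no centroid moves (no membership changes)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    F = ((X - centroids[labels]) ** 2).sum()   # square-error criterion F
    return labels, centroids, F
*******************************************************************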

3.1.1 Drawbacks of the K-Means algorithm

Despite being used in a wide array of applications, the K-Means algorithm is not exempt from drawbacks. Some of these drawbacks have been extensively reported in the literature. The most important are listed below:
- As many clustering methods do, the K-Means algorithm assumes that the number of clusters K in the database is known beforehand, which, obviously, is not necessarily true in real-world applications;
- As an iterative technique, the K-Means algorithm is especially sensitive to initial starting conditions (initial clusters and instance order);
- The K-Means algorithm converges finitely to a local minimum; the running of the algorithm defines a deterministic mapping from the initial solution to the final one.

To overcome the lack of knowledge on the real value in the database of the input

parameter K, we adopt a rough but usual approach: to try clustering with several

values of K. The problem of initial starting conditions is not exclusive to the K-Means

algorithm but shared with many clustering algorithms that work as a hill-climbing

strategy whose deterministic behavior leads to a local minimum dependent on the initial solution and on the instance order. Although there is no guarantee of achieving a global minimum, at least the convergence of the K-Means algorithm is ensured. Milligan shows

the strong dependence of the K-Means algorithm on initial clustering and suggests

that good final cluster structures can be obtained using Ward's hierarchical method to

provide the K-Means algorithm with initial clusters. Fisher proposes creating the initial clusters by constructing an initial hierarchical clustering. Another author suggests using a MaxMin algorithm in order to select a subset of the original database as the initial centroids that establish the initial clusters. Others present experimental results of an instance of the EM algorithm, reminiscent of K-Means, with three different initialization methods (one of them being a hierarchical agglomerative clustering method). Most of the initialization methods mentioned above are not only initialization methods: they are clustering methods themselves and, when used with the K-Means algorithm, result in a hybrid clustering algorithm. Thus, these initialization methods suffer from the same problem as the K-Means algorithm and have to be provided with an initial clustering. For the remaining part of this work, we focus on much simpler and less expensive initialization methods that constitute the first initialization of any other, more complex, clustering method. This is the reason that motivates the development of an algorithm for refining the initial seeds of the K-Means algorithm. To overcome the possible bad effects of instance order, a procedure to order the instances of the database has been proposed; ordering the instances so that consecutive observations are dissimilar in terms of the L2 distance leads to good clusterings. A local strategy has also been proposed to reduce the effect of the instance-ordering problem; although it focuses on incremental clustering procedures, the strategy is not coupled to any particular procedure and may be adapted to the K-Means algorithm [19].
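Because the final partition depends on the initial clusters and on the instance order, one simple and inexpensive safeguard is to run K-Means from several random initializations and keep the run with the smallest square-error. A small illustrative sketch, reusing the kmeans function sketched in Section 3.1:
*******************************************************************
import numpy as np

def kmeans_best_of(X, K, restarts=10, seed=0):
    """Run K-Means from several random initial centroid sets, keep the best run."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):
        init = X[rng.choice(len(X), size=K, replace=False)]  # K distinct random seeds
        labels, centroids, F = kmeans(X, init.astype(float))
        if best is None or F < best[2]:
            best = (labels, centroids, F)
    return best   # (labels, centroids, square-error) of the best restart
*******************************************************************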


3.2 Genetic algorithms

As a part of our main objective, we aim to find the best and the worst set of initial

starting conditions to approach the extremes of the probability distributions of the

square-error values. Due to the computational expense of performing an exhaustive

search, we tackle the problem using genetic algorithms. Roughly speaking, we can

say that Genetic Algorithms (GAs) are kinds of evolutionary algorithms, that is,

probabilistic search algorithms which simulate natural evolution. GAs are used to solve combinatorial optimization problems following the rules of natural selection and natural genetics. They are based upon the survival of the fittest among string structures together with a structured yet randomized information exchange. Working in this way and under certain conditions, GAs evolve to the global optimum with

probability arbitrarily close to 1. When dealing with GAs, the search space of a

problem is represented as a collection of individuals. The individuals are represented

by character strings. Each individual encodes a solution to the problem. In addition, each individual has an associated fitness measure. The part of the space to be examined is called the population. The purpose of the use of a GA is to find the individual from the search space with the best ``genetic material''. The pseudo-code of the GA that we use is shown below. First, the initial population is chosen and the fitness of each of its individuals is determined. Next, at each iteration two parents are selected from the population. This parental couple produces children which, with a probability near zero, are mutated, i.e., their hereditary distinctions are changed. After the evaluation of the children, the worst individual of the population is replaced by the fittest of the children. This process is iterated until a convergence criterion is satisfied. The

operators which define the children production process and the mutation process are

the crossover operator and the mutation operator respectively. Both operators are

applied with different probabilities and play different roles in the GA. Mutation is

needed to explore new areas of the search space and helps the algorithm avoid local

optima. Crossover aims to increase the average quality of the population. By

choosing adequate crossover and mutation operators as well as an appropriate

reduction mechanism, the probability that the GA reaches a near-optimal solution in a

reasonable number of iterations increases [20].


*********************************************************************

BEGIN AGA

Choose initial population at random

Evaluate initial population

WHILE NOT convergence criterion DO

BEGIN

Select two parents from the current population

Produce children from the selected parents

Mutate the children

Replace the worst individual of the population by the best

child

END

Output the best individual found

END AGA

*******************************************************************
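For illustration only, the steady-state scheme in the pseudo-code above can be sketched in Python as follows; the fitness, crossover and mutate callbacks are placeholders (the encoding is problem-dependent), and fitness values are assumed to be non-negative so that they can be used as selection weights.
*******************************************************************
import random

def steady_state_ga(init_population, fitness, crossover, mutate,
                    p_mutation=0.05, generations=1000):
    """Steady-state GA: two parents produce children each generation and the
    best child replaces the worst individual of the population."""
    population = list(init_population)
    scores = [fitness(ind) for ind in population]
    for _ in range(generations):
        # select two parents with probability proportional to fitness
        parents = random.choices(population, weights=scores, k=2)
        children = crossover(parents[0], parents[1])
        # mutate the children with a small probability
        children = [mutate(c) if random.random() < p_mutation else c for c in children]
        child_scores = [fitness(c) for c in children]
        best_child = max(range(len(children)), key=lambda i: child_scores[i])
        worst = min(range(len(population)), key=lambda i: scores[i])
        # replace the worst individual of the population by the best child
        if child_scores[best_child] > scores[worst]:
            population[worst] = children[best_child]
            scores[worst] = child_scores[best_child]
    best = max(range(len(population)), key=lambda i: scores[i])
    return population[best], scores[best]
*******************************************************************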

Genetic algorithm (GA) is a class of optimization procedures inspired by the

biological mechanism of reproduction. In the past, it has been used to solve various

problems including target recognition , object recognition, face recognition , and face

detection/verification [21]. GA operates iteratively on a population of structures, each

one of which represents a candidate solution to the problem at hand, properly encoded

as a string of symbols. A randomly generated set of such strings forms the initial

population from which the GA starts its search. Three basic genetic operators guide

this search: selection, crossover, and mutation. The genetic search process is iterative:

evaluating, selecting, and recombining strings in the population during the iteration

(generation) until reaching some termination condition. Evaluation of each string is

based on a fitness function that is problem-dependent. It determines which of the

candidate solutions are better. This corresponds to the environmental determination of

survivability in national selection. Selection of a string, which represents a point in

the search space, depends on the string’s fitness relative to those of other strings in the

population. It probabilistically removes, from the population, those points that have

relatively low fitness. Mutation, as in natural systems, is a very low probability

operator and just flips a specific bit. Mutation plays the role of restoring lost genetic


material. In contrast, crossover is applied with high probability. It is a randomized yet

structured operator that allows information exchange between points. Its goal is to

preserve the fittest individual without introducing any new value. The goal of feature

subset selection is to use fewer features to achieve the same or better performance. Therefore, the fitness evaluation contains two terms: (1) the classification error and (2) the number of features selected. We use the fitness function shown below to combine the two terms:

Fitness = Error + J × Ones (1)

Where the Error corresponds to the classification error rate and Ones corresponds to

the number of features selected (i.e., Ones in the chromosome). The Ones term ranges

from 1 to L where L is the length of chromosome (for the second dataset, L=21).

Finding the best balance between the number of features and the classification error

rate is an important issue. According to equation (1), the lower the error rate, the better the fitness; also, the fewer the number of features, the better the fitness. In this study, we prefer to achieve the best accuracy rate with the fewest number of features. Therefore, the first and the second terms should be in the same range.

In the figure below, the GA is used to find an optimal binary vector, where each bit is associated with a feature. If the i-th bit of this vector is equal to 1, then the i-th feature is allowed to participate in classification; if the bit is equal to 0, then the corresponding feature does not participate.

[Figure: a d-dimensional binary vector, e.g. 0 1 0 1 ... 1, comprising a single member of the GA population for GA-based feature selection; a bit set to 1 means the corresponding feature is included in the classifier, a bit set to 0 means it is not.]
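As a hedged sketch of equation (1): the chromosome is used as a mask over the feature columns, a classification error is obtained from a user-supplied classifier, and the number of selected features is penalised. The error_rate callback and the weight J are illustrative placeholders, not values fixed by this chapter.
*******************************************************************
import numpy as np

def feature_selection_fitness(chromosome, X, y, error_rate, J=0.01):
    """Equation (1): Fitness = Error + J * Ones (here, lower values are better)."""
    mask = np.asarray(chromosome, dtype=bool)    # bit i = 1: feature i participates
    if not mask.any():                           # an empty feature set cannot classify
        return float("inf")
    error = error_rate(X[:, mask], y)            # user-supplied classifier error rate
    ones = int(mask.sum())                       # number of selected features (Ones)
    return error + J * ones
*******************************************************************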


3.3 Vector space model

The vector-space models for information retrieval are just one subclass of retrieval

techniques that have been studied in recent years. The taxonomy provided labels the

class of techniques that resemble vector-space models ``formal, feature-based,

individual, partial match'' retrieval techniques since they typically rely on an

underlying, formal mathematical model for retrieval, model the documents as sets of

terms that can be individually weighted and manipulated, perform queries by

comparing the representation of the query to the representation of each document in

the space, and can retrieve documents that don't necessarily contain one of the search

terms. Although the vector-space techniques share common characteristics with other

techniques in the information retrieval hierarchy, they all share a core set of

similarities that justify their own class. Vector-space models rely on the premise that

the meaning of a document can be derived from the document's constituent terms.

They represent documents as vectors of terms d= (t1, t2…tn) where ti (1≤ i≤ n) is a

non-negative value denoting the single or multiple occurrences of term i in document

d. Thus, each unique term in the document collection corresponds to a dimension in

the space. Similarly, a query is represented as a vector q = (t1, t2, ..., tm), where tj (1 ≤ j ≤ m) is a non-negative value denoting the number of occurrences of term j (or merely a 1 to signify the occurrence of term j) in the query [22]. Both the document vectors and

the query vector provide the locations of the objects in the term-document space. By

computing the distance between the query and other objects in the space, objects with

similar semantic content to the query presumably will be retrieved. Vector-space

models that don't attempt to collapse the dimensions of the space treat each term

independently, essentially mimicking an inverted index. However, vector-space

models are more flexible than inverted indices since each term can be individually

weighted, allowing that term to become more or less important within a document or

the entire document collection as a whole. Also, by applying different similarity

measures to compare queries to terms and documents, properties of the document

collection can be emphasized or deemphasized. For example, the dot product (or,

inner product) similarity measure finds the Euclidean distance between the query and

a term or document in the space. The cosine similarity measure, on the other hand, by

computing the angle between the query and a term or document rather than the

distance, deemphasizes the lengths of the vectors. In some cases, the directions of the


vectors are a more reliable indication of the semantic similarities of the objects than

the distance between the objects in the term-document space. Vector-space models

were developed to eliminate many of the problems associated with exact, lexical

matching techniques. In particular, since words often have multiple meanings

(polysemy), it is difficult for a lexical matching technique to differentiate between

two documents that share a given word, but use it differently, without understanding

the context in which the word was used. Also, since there are many ways to describe a

given concept (synonymy), related documents may not use the same terminology to

describe their shared concepts. A query using the terminology of one document will

not retrieve the other related documents. In the worst case, a query using terminology

different than that used by related documents in the collection may not retrieve any

documents using lexical matching, even though the collection contains related

documents. Vector-space models, by placing terms, documents, and queries in a term-

document space and computing similarities between the queries and the terms or

documents, allow the results of a query to be ranked according to the similarity

measure used. Unlike lexical matching techniques that provide no ranking or a very

crude ranking scheme (for example, ranking one document before another document

because it contains more occurrences of the search terms), the vector-space models,

by basing their rankings on the Euclidean distance or the angle measure between the

query and terms or documents in the space, are able to automatically guide the user to

documents that might be more conceptually similar and of greater use than other

documents. Also, by representing terms and documents in the same space, vector-

space models often provide an elegant method of implementing relevance feedback.

Relevance feedback, by allowing documents as well as terms to form the query, and

using the terms in those documents to supplement the query, increases the length and

precision of the query, helping the user to more accurately specify what he or she

desires from the search. Information retrieval models typically express the retrieval

performance of the system in terms of two quantities: precision and recall. Precision is

the ratio of the number of relevant documents retrieved by the system to the total

number of documents retrieved. Recall is the ratio of the number of relevant

documents retrieved for a query to the number of documents relevant to that query in

the entire document collection. Both precision and recall are expressed as values

between 0 and 1. An optimal retrieval system would provide precision and recall


values of 1, although precision tends to decrease with greater recall in real-world

systems.

The VSM has been a standard model of representing documents in information

retrieval for almost three decades. Let D be a document collection and Q the set of queries representing users' information needs. Let also t_i symbolize term i used to index the documents in the collection, with i = 1, ..., n. The VSM assumes that for each term t_i there exists a vector \vec{t_i} in the vector space that represents it. It then considers the set of all term vectors {\vec{t_i}} to be the generating set of the vector space, thus the space basis. If each d_k (for k = 1, ..., p) denotes a document of the collection, then there exists a linear combination of the term vectors {\vec{t_i}} which represents each d_k in the vector space. Similarly, any query q can be modeled as a vector \vec{q} that is a linear combination of the term vectors. In the standard VSM, the term vectors are considered pairwise orthogonal, meaning that they are linearly independent. But this assumption is unrealistic, since it enforces a lack of relatedness between any pair of terms, whereas the terms in a language often relate to each other. Provided that the orthogonality assumption holds, the similarity between a document vector \vec{d_k} and a query vector \vec{q} in the VSM can be expressed by the cosine measure given below:

cos(\vec{d_k}, \vec{q}) = \frac{\vec{d_k} \cdot \vec{q}}{\lVert \vec{d_k} \rVert \, \lVert \vec{q} \rVert}
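To make the ranking step concrete, the following is a minimal sketch of a raw term-frequency VSM with cosine ranking; term weighting (e.g. tf-idf) and any dimensionality reduction are deliberately omitted.
*******************************************************************
import numpy as np
from collections import Counter

def build_vectors(docs):
    """Represent each document as a raw term-frequency vector d = (t1, ..., tn)."""
    vocab = sorted({t for d in docs for t in d.lower().split()})
    index = {t: i for i, t in enumerate(vocab)}
    M = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for t, c in Counter(d.lower().split()).items():
            M[r, index[t]] = c
    return M, index

def cosine_rank(query, M, index):
    """Rank documents by the cosine of the angle between query and document vectors."""
    q = np.zeros(M.shape[1])
    for t, c in Counter(query.lower().split()).items():
        if t in index:
            q[index[t]] = c
    norms = np.linalg.norm(M, axis=1) * (np.linalg.norm(q) or 1.0)
    sims = M @ q / np.where(norms == 0, 1.0, norms)
    return np.argsort(-sims)        # document indices, most similar first
*******************************************************************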

3.4 Thyroid Dataset

In order to perform the research reported in this manuscript, two thyroid disease

datasets are used. These thyroid datasets are taken from UCI machine learning

repository [23]. The first thyroid dataset consists of 215 patients and 5 features. These

features are T3-resin uptake test (A percentage), Total serum thyroxin as measured by

the isotopic displacement method, Total serum triiodothyronine as measured by

radioimmuno assay, Basal thyroid-stimulating hormone (TSH) as measured by

radioimmuno assay, Maximal absolute difference of TSH value after injection of 200

mg of thyrotropin-releasing hormone as compared to the basal value. This dataset

consist of 3 classes which are normal, hyperthyroidism and hypothyroidism. The


second thyroid dataset consists of 7200 patients of 21 features each, 15 binary (from

x2 to x16) and 6 continuous (x1, x17, x18, x19, x20, x21). The training data and the

test data consist of 3772 and 3428 data, respectively. 92% of samples in this dataset

belong to normal class. The features are age, sex, on thyroxin, maybe on thyroxin, on

antithyroid medication, sick (patient reports malaise), pregnant, thyroid surgery, I131

treatment, test hypothyroid, test hyperthyroid, on lithium, has goiter, has tumor,

hypopituitary, psychological symptoms, TSH, T3, TT4, T4U, and FTI.

This directory contains 6 databases, corresponding test set, and corresponding

documentation. They were left at the University of California at Irvine by Ross

Quinlan during his visit in 1987 for the 1987 Machine Learning Workshop. The

documentation files (with file extension "names") are formatted to be read by

Quinlan's C4 decision tree program. Though briefer than the other documentation

files found in this database repository, they should suffice to describe the database,

specifically:

1. Source

2. Number and names of attributes (including class names)

3. Types of values that each attribute takes

In general, these databases are quite similar and can be characterized somewhat as

follows:

1. Many attributes (29 or so, mostly the same set over all the databases)

2. Mostly numeric or Boolean valued attributes

3. Thyroid disease domains (records provided by the Garavan Institute of

Sydney, Australia)

4. Several missing attribute values (signified by "?")

5. Small number of classes (under 10, changes with each database)
6. 2800 instances in each data set
7. 972 instances in each test set (it seems that the test sets' instances are disjoint with respect to the corresponding data sets, but this has not been verified)

This database now also contains an additional two data files, named hypothyroid.data and sick-euthyroid.data. They have approximately the same data format and

set of attributes as the other 6 databases, but their integrity is questionable. Ross

Quinlan is concerned that they may have been corrupted since they first arrived at


UCI, but we have not yet established the validity of this possibility. These 2

databases differ in terms of their number of instances (3163) and lack of

corresponding test files. They each have 2 concepts (negative/hypothyroid and sick-

euthyroid/negative respectively). Their source also appears to be the Garavan

institute. Each contains several missing values. Another relatively recent file

thyroid0387.data has been added that contains the latest version of an archive of

thyroid diagnoses obtained from the Garvan Institute, consisting of 9172 records from

1984 to early 1987. A domain theory related to thyroid disease has also been added

recently (thyroid.theory). The files new-thyroid.[names, data] were donated by Stefan Aeberhard [23].

3.5 Breast Cancer Dataset

Breast cancer is one of the most common cancers among women worldwide, and its incidence is about one million new patients annually by the year 2000. There is an

overall increase of 2% in the incidence of breast cancer throughout the world per year.

Worldwide it is estimated that 420000 deaths would occur annually as a result of

breast cancer by the year 2000. Although breast cancer is a potentially fatal condition,

early diagnosis of disease can lead to successful treatment. One of the important steps

to diagnose the breast cancer is classification of tumor. Tumors can be either benign

or malignant, but only the latter is cancer; thus, malignant tumors are generally more

serious than benign tumors. Early diagnosis needs a precise and reliable diagnosis

procedure that allows physicians to distinguish between benign breast tumors and

malignant ones. For this purpose, there are various computer-based solutions to serve

as the diagnosis procedure and assist the physicians to specify the type of breast mass.

These systems, called Medical Diagnostic Decision Support (MDDS) systems, can

augment the natural capabilities of human diagnosticians incorporating imprecise

models about the incompletely understood and exceptionally complex process of

medical diagnosis. For evaluating the model, Wisconsin Diagnostic Breast Cancer

(WDBC) Dataset is used. Each record of this dataset is represented with 30 numerical

features. Features are computed from a digitized image of a fine needle aspirate

(FNA) of a breast mass. They describe characteristics of the cell nuclei present in the

image. The diagnosis of each record is “benign” or “malignant”. This dataset contains


569 instances. 357 instances are benign and 212 malignant. There is no missing value

in the dataset [24].

Table 3.1 Data set table of Breast Cancer

Data Set Characteristics:    Multivariate
Number of Instances:         569
Area:                        Life
Attribute Characteristics:   Real
Number of Attributes:        32
Date Donated:                1995-11-01
Associated Tasks:            Clustering
Missing Values:              No
Number of Web Hits:          119949

3.6 Ecoli Data Set

This section describes how the benchmark consisting of 43 E.coli sequence datasets

each containing a known motif derived from RegulonDB is created. RegulonDB is a

database on transcription regulation and operon organization in Escherichia coli. We

start with the file 'TF Binding'. For each of the 43 distinct transcription factors, we select

the first target gene of each transcription unit regulated by this transcription factor,

and for each target gene, we select the intergenic region 250 nucleotides upstream and

50 nucleotides downstream of the translation start site (as the transcription start site is

often unknown). This gives for each transcription factor a file in FASTA format with

a list of DNA sequences (separated by '>') for each target gene. Based on the genome

coordinates for all transcription factor binding sites described in 'TF Binding Sites',

we describe the motif model that corresponds to the transcription factor in each

dataset by the relative start and end position, strand and nucleotide description of the

sites in the created sequences dataset. Secondly, we use the positional nucleotide

counts from the file 'Matrices_Alignments' to create a PWM representation of the

motif model [25].

Table 3.2 Data set table of Ecoli

Data Set Characteristics:    Multivariate
Number of Instances:         336
Area:                        Life
Attribute Characteristics:   Real
Number of Attributes:        8
Date Donated:                1996-09-01
Associated Tasks:            Clustering
Missing Values:              No
Number of Web Hits:          31008


CHAPTER 4

PROPOSED WORK

This work evaluates the performance of K-means with VSM and a genetic algorithm on the breast cancer, thyroid and E. coli datasets. The basic idea is to select the initial cluster centers using a genetic algorithm. In the proposed algorithm, we first use a random function to select K data objects as initial cluster centers to form a chromosome, with a total of M chromosomes selected; we then run a K-means operation on each group of cluster centers in the initial population to compute its fitness, select individuals according to the fitness of each chromosome, apply the crossover and mutation operations to high-fitness chromosomes while eliminating low-fitness chromosomes, and finally form the next generation. In this way, within each new generation, the average fitness rises and each cluster center moves closer to the optimal cluster center; finally, the chromosome with the highest fitness is selected as the initial cluster centers. The algorithm is given below.

*********************************************************************

1. Choose a number of clusters K.
2. Initialize the cluster centers based on mode.
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute the cluster centers (mean of the data points in each cluster).
5. Stop when there are no new re-assignments.
6. GA-based refinement:
   a) Construct the initial population (P1)
   b) Calculate the global minimum (Gmin)
   c) For i = 1 to N do
      i.   Perform reproduction
      ii.  Apply the crossover operator between each pair of parents
      iii. Perform mutation and get the new population (P2)
      iv.  Calculate the local minimum (Lmin)
      v.   If Lmin < Gmin then
           a. Gmin = Lmin
           b. P1 = P2
   d) Repeat

*******************************************************************
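A compact Python sketch of this flow is given below, under the following assumptions (simplifications of the steps detailed in the next pages): a chromosome is a set of K data-object indices used as initial centres, fitness is the inverse of the K-means objective (F = 1/J, as in paragraph B below), M is even and K >= 2; the duplicate-centre check and the elitist preservation step are omitted for brevity.
*******************************************************************
import numpy as np

def kmeans_objective(X, centres, iters=20):
    """Run a few Lloyd iterations from the given centres; return the final
    square-error J and the refined centres."""
    for _ in range(iters):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centres[k] for k in range(len(centres))])
    d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    return ((X - centres[labels]) ** 2).sum(), centres

def kvg(X, K, M=50, generations=100, pc=0.8, pm=0.1, seed=0):
    """The GA searches for good initial centres; the fittest chromosome seeds
    the final K-means run (illustrative sketch of the proposed flow)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    pop = [rng.choice(n, size=K, replace=False) for _ in range(M)]   # chromosomes
    fitness = lambda c: 1.0 / (kmeans_objective(X, X[c].astype(float))[0] + 1e-12)
    for _ in range(generations):
        scores = np.array([fitness(c) for c in pop])
        parents = [pop[i].copy() for i in
                   rng.choice(M, size=M, p=scores / scores.sum())]   # roulette wheel
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            if rng.random() < pc:                       # single-point crossover
                cut = int(rng.integers(1, K))
                a[cut:], b[cut:] = b[cut:].copy(), a[cut:].copy()
            for c in (a, b):                            # mutation: replace one centre
                if rng.random() < pm:
                    c[rng.integers(K)] = rng.integers(n)
            nxt += [a, b]
        pop = (nxt + parents)[:M]
    best = max(pop, key=fitness)
    return kmeans_objective(X, X[best].astype(float))   # final clustering
*******************************************************************
The last line is exactly the step described above: the chromosome with the highest fitness supplies the initial cluster centres of the final K-means run.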


A. Chromosome Coding. We use real coding; the length of the chromosome is the number of clusters, and the specific code form is X = (z1, z2, ..., zK), where K is the number of cluster centers in a chromosome.

B. Population Initialization. The range of M is 20-100. The specific operation is as follows: select K cluster centers randomly to form a chromosome; if a randomly selected center already exists in the same chromosome, remove it and reselect until K distinct centers are reached; repeat until the population size reaches M = 100. Fitness function: we use the inverse of the objective function J as the fitness function, that is F = 1/J. The smaller J is, the greater the fitness function becomes and the better the clustering effect is.

C. Genetic Operation. We use a proportional selection operator, a single-point crossover operator and a uniform mutation operator. To avoid the premature or slow convergence that a fixed probability can cause, we use self-adaptive genetic operators, that is, the crossover rate Pc and the mutation rate Pm are adjusted dynamically. In the adaptive scheme, f_avg denotes the average fitness value of each generation, f_max the largest individual fitness value in the group, f' the larger fitness value of the two crossing individuals, and f the fitness value of the mutating individual. The formulas make individuals with high fitness have a lower crossover rate and mutation rate, while individuals with small fitness have a higher crossover rate and mutation rate. This helps protect the best individuals, but also makes individuals with lower fitness cross and mutate at a higher rate, producing excellent new patterns.
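The exact adaptive formulas are not reproduced above. The sketch below implements one common self-adaptive scheme that is consistent with this description and with the constants pc1, pc2, pm1 and pm2 reported in Chapter 6; the functional form itself is an assumption, not a formula fixed by this dissertation.
*******************************************************************
def adaptive_rates(f_cross, f_mut, f_avg, f_max,
                   pc1=0.9, pc2=0.6, pm1=0.5, pm2=0.1):
    """Assumed self-adaptive crossover/mutation rates (one common variant).

    f_cross: larger fitness of the two crossing individuals (f')
    f_mut:   fitness of the individual to be mutated (f)
    f_avg:   average fitness of the current generation
    f_max:   largest fitness in the current generation
    """
    span = max(f_max - f_avg, 1e-12)                 # avoid division by zero
    if f_cross >= f_avg:                             # fitter pairs cross less often
        pc = pc1 - (pc1 - pc2) * (f_cross - f_avg) / span
    else:
        pc = pc1
    if f_mut >= f_avg:                               # fitter individuals mutate less often
        pm = pm1 - (pm1 - pm2) * (f_mut - f_avg) / span
    else:
        pm = pm1
    return pc, pm
*******************************************************************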


D. Loop Termination Condition. In this work, we use a maximum number of generations T as the termination condition of the genetic algorithm: the algorithm stops running after it reaches the specified number of generations and outputs the best individual of the current population as the optimal solution of the problem. T ranges from 100 to 1000.

E. Description of the specific algorithm:
   a. Set the parameters: population size M, maximum number of iterations T, number of clusters K, etc.
   b. Generate M chromosomes randomly to form the initial population; each chromosome represents a set of initial cluster centers.
   c. According to the initial cluster centers encoded by every chromosome, carry out K-means clustering (each chromosome corresponds to one K-means run), then calculate the fitness of each chromosome from its clustering result, and apply the optimal preservation (elitist) strategy.
   d. Apply the selection, crossover and mutation operators to the population to produce a new generation.
   e. Determine whether the genetic termination condition is met; if so, stop the genetic operation and go to step f, otherwise go to step c.
   f. Calculate the fitness of the new generation; compare the fitness of the best individual of the current population with the best fitness found so far to identify the individual with the highest fitness.
   g. Carry out K-means clustering according to the initial cluster centers represented by the chromosome with the highest fitness, and then output the clustering result.


Figure 4.1 Generation of clusters by KVG


4.1 Clustering using GA

Clustering using GA defines a clustering measure μ as

μ = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - z_j \rVert^2

where C_j is the cluster with center z_j. The duty of the GA is to find proper cluster centers z_1, z_2, z_3, ..., z_K in such a way that the clustering measure μ is minimized.

4.2 Displaying series

Each string is a collection of real numbers representing the K cluster centers. In an N-dimensional environment, the length of each chromosome is therefore N*K.

4.2.1 Population initialization

K cluster centers are randomly selected from the available points and inserted into one chromosome. This is repeated for all P chromosomes of the population, where P is the population size.

4.3 Fitness computation

The fitness computation consists of two processes. In the first process, clusters are produced based on the centers encoded in the chromosome: a point x_i is assigned to the cluster C_j with center z_j if

\lVert x_i - z_j \rVert \le \lVert x_i - z_p \rVert, p = 1, 2, ..., K, p \ne j.

After the clustering is done, the cluster centers encoded in the chromosome are replaced by the average of each cluster's points; the new center z_i of cluster C_i is obtained as

z_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j

The clustering criterion is then computed as

F = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - z_j \rVert^2

and the fitness of the chromosome is the reciprocal of this criterion, fitness = 1/F.
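A minimal sketch of this fitness computation, assuming a chromosome stores its K centres as the rows of a matrix:
*******************************************************************
import numpy as np

def chromosome_fitness(centres, X):
    """centres: (K, d) cluster centres encoded by one chromosome; X: (n, d) data."""
    # assign every point x_i to the cluster C_j with the nearest centre z_j
    dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # replace each encoded centre by the mean of the points assigned to it
    for k in range(len(centres)):
        members = X[labels == k]
        if len(members):
            centres[k] = members.mean(axis=0)
    # clustering criterion F and the fitness 1/F
    F = ((X - centres[labels]) ** 2).sum()
    return 1.0 / (F + 1e-12), centres
*******************************************************************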

4.4 Reproduction (Selection)

The selection process selects chromosomes from the mating pool directed by the

survival of the fittest concept of natural genetic systems. In the proportional selection

strategy adopted in this article, a chromosome is assigned a number of copies, which

is proportional to its fitness in the population, that go into the mating pool for further

genetic operations. Roulette wheel selection is one common technique that

implements the proportional selection strategy. These processes ultimately result in the

next generation population of chromosomes that is different from the initial

generation. Generally the average fitness will have increased by this procedure for the

population, since only the best organisms from the first generation are selected for

breeding, along with a small proportion of less fit solutions, for reasons already

mentioned above.
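Roulette wheel selection can be sketched as follows; fitness values are assumed to be positive so that they can be turned into selection probabilities.
*******************************************************************
import numpy as np

def roulette_wheel_select(population, fitnesses, n_select, seed=None):
    """Copy n_select chromosomes into the mating pool with probability
    proportional to their fitness."""
    rng = np.random.default_rng(seed)
    f = np.asarray(fitnesses, dtype=float)
    probs = f / f.sum()                      # each chromosome's slice of the wheel
    picks = rng.choice(len(population), size=n_select, p=probs, replace=True)
    return [population[i] for i in picks]
*******************************************************************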

4.5 Crossover

Crossover is a probabilistic process that exchanges information between two parent

chromosomes for generating two child chromosomes. In this dissertation, single-point crossover with a fixed crossover probability of pc = 0.8 is used. For

chromosomes of length l, a random integer, called the crossover point, is generated in

the range [1, l-1]. The portions of the chromosomes lying to the right of the crossover

point are exchanged to produce two offspring.


4.6 Mutation

Each chromosome undergoes mutation with a fixed probability pm = 0.6. For binary

representation of chromosomes, a bit position (or gene) is mutated by simply flipping

its value. Since we are considering real numbers in this dissertation, a random position is chosen in the chromosome and replaced by a random number between 0 and 9. After the

genetic operators are applied, the local minimum fitness value is calculated and

compared with global minimum. If the local minimum is less than the global

minimum then the global minimum is assigned with the local minimum, and the next

iteration is continued with the new population. The cluster points will be repositioned

corresponding to the chromosome having global minimum. Otherwise, the next

iteration is continued with the same old population. This process is repeated for N iterations. The following chapters show that our refinement algorithm improves the cluster quality.
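The two operators of Sections 4.5 and 4.6 can be sketched as follows for real-coded chromosomes represented as Python lists, keeping pc = 0.8, pm = 0.6 and the "random number between 0 and 9" replacement rule stated above:
*******************************************************************
import random

def single_point_crossover(parent1, parent2, pc=0.8):
    """Exchange the portions of two chromosomes lying to the right of a random
    crossover point generated in the range [1, l-1]."""
    if random.random() >= pc or len(parent1) < 2:
        return parent1[:], parent2[:]
    cut = random.randint(1, len(parent1) - 1)
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]

def mutate(chromosome, pm=0.6):
    """With probability pm, replace one randomly chosen gene by a random
    real number between 0 and 9."""
    child = chromosome[:]
    if random.random() < pm:
        pos = random.randrange(len(child))
        child[pos] = random.uniform(0, 9)
    return child
*******************************************************************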

4.7 Stopping criteria

In this work, the cycle of fitness calculation, crossover and mutation is repeated until a maximum number of iterations is reached.


CHAPTER 5

ENVIRONMENT SETUP

5.1 MATLAB Environment

MATLAB is a high-level technical computing language and interactive environment

for algorithm development, data visualization, data analysis, and numeric

computation. Using the MATLAB product, you can solve technical computing

problems faster than with traditional programming languages, such as C, C++, and

FORTRAN.

You can use MATLAB in a wide range of applications, including signal and image

processing, communications, control design, test and measurement, financial

modeling and analysis, and computational biology. Add-on toolboxes (collections of

special-purpose MATLAB functions, available separately) extend the MATLAB

environment to solve particular classes of problems in these application areas.

MATLAB provides a number of features for documenting and sharing your work.

You can integrate your MATLAB code with other languages and applications, and

distribute your MATLAB algorithms and applications. Features include:

High-level language for technical computing

Development environment for managing code, files, and data

Interactive tools for iterative exploration, design, and problem solving

Mathematical functions for linear algebra, statistics, Fourier analysis, filtering,

optimization, and numerical integration

2-D and 3-D graphics functions for visualizing data

Tools for building custom graphical user interfaces

Functions for integrating MATLAB based algorithms with external

applications and languages, such as C, C++, Fortran, Java™, COM, and

Microsoft® Excel



5.1.1 Introduction to MATLAB

MATLAB (matrix laboratory) is a numerical computing environment and fourth-

generation programming language. Developed by MathWorks, MATLAB allows

matrix manipulations, plotting of functions and data, implementation of algorithms,

creation of user interfaces, and interfacing with programs written in other languages,

including C, C++, Java, and Fortran. Although MATLAB is intended primarily for

numerical computing, an optional toolbox uses the MuPAD symbolic engine,

allowing access to symbolic computing capabilities. An additional package, Simulink,

adds graphical multi-domain simulation and Model-Based Design for dynamic and

embedded systems. In 2004, MATLAB had around one million users across industry

and academia. MATLAB users come from various backgrounds of engineering,

science, and economics. MATLAB is widely used in academic and research

institutions as well as industrial enterprises.

5.1.2 Requirements of MATLAB


5.1.3 How to Use MATLAB


CHAPTER 6

RESULT ANALYSIS

We compare the K-means algorithm based on the genetic algorithm with the original K-means algorithm and two known improved algorithms from the literature (improved algorithm 1 and improved algorithm 2) to verify the effectiveness of selecting the initial cluster centers using a genetic algorithm. In order to exclude the impact of isolated points, we use the method that takes the average value of the subset of objects closest to the center as the new cluster center for the next round, and apply this method to the original K-means algorithm, improved algorithm 1 and improved algorithm 2 as well, so that all of them can be compared comprehensively. The experimental data are groups of data from the UCI repository, including the iris data set. Considering that attributes with large values dominate the distance between samples, we added groups of isolated points to the data sets mentioned above.

The experiment parameter settings are as follows: k = 0.25; pc1 = 0.9; pc2 = 0.6; pm1 = 0.5; pm2 = 0.1; pc = 0.6; pm = 0.1; m (initial population size) = 50; maxgen (maximum number of iterations) = 100. The results are shown in Table 6.1, Table 6.2 and Table 6.3; the initial cluster center values are data object labels. In the experimental results, improved algorithm 1 is not marked with initial centers because it calculates the mean within each separate section as the initial cluster centers.

As can be seen from the data, the traditional K-means algorithm is sensitive to the initial cluster centers: different cluster centers give quite different clustering results, the results are sometimes poor, and the algorithm is unstable. Improved algorithm 1 and improved algorithm 2 select suitable initial cluster centers by calculation, so they have a better clustering effect and a smaller objective function value. The K-means algorithm based on the genetic algorithm proposed here finds the optimal objective function value by searching for initial cluster centers; in the three groups of data, its objective function values are smaller than those of the other two algorithms, indicating that the proposed algorithm has a better clustering effect. In all three experiments the algorithm found the optimal objective function value, indicating that it is relatively stable. As can be seen from the tables, when the data have apparent isolated points, the proposed algorithm reduces the impact of outliers and improves clustering accuracy significantly more than the other two improved methods; when the data have no apparent isolated points, the proposed algorithm still gives more accurate clustering results than the other two improved methods. The experiments therefore show that the proposed method of selecting initial cluster centers is not affected by apparent isolated points in the data, while the other two improved methods have certain limitations.

6.1 Experiment Design:

6.1.1 Experiment for Thyroid Dataset

Figure 6.1 Result of K-mean algorithm


Figure 6.2 Result of Proposed (KVG) methods

6.1.2 Experimental Graph

Figure 6.3 Parameter graph between k-mean and KVG for thyroid dataset


6.1.3 Experimental Table.

Table 6.1 Base Parameter comparison on thyroid dataset

S.N  Clustering Algorithm   Threshold  Iteration  Time     Error Rate
1    K-means Algorithm      0.21       5          2.9328   3.8631
2    KVG Algorithm          0.21       6          4.92963  1.15938

6.2.1 Experiment for breast cancer dataset

Figure 6.4 Result of K-mean algorithm


Figure 6.5 Result of Proposed (KVG) methods

6.2.2 Experimental Graph


Figure 6.6 Parameter graph between k-mean and KVG for breast cancer dataset

6.2.3 Experimental Table

Table 6.2 Base Parameter comparison on breast cancer dataset

S.N  Clustering Algorithm   Threshold  Iteration  Time     Error Rate
1    K-means Algorithm      0.51       4          2.58962  4.3169
2    KVG Algorithm          0.51       5          4.38682  1.34741

6.3.1 Experiment on E. coli dataset


Figure 6.7 Result of K-mean algorithm

Figure 6.8 Result of Proposed (KVG) methods

6.3.2 Experimental Graph between k-mean and KVG for E. coli Dataset:


Figure 6.9 Parameter graph between k-mean and KVG for E. coli dataset

6.3.3 Experimental Table

Table 6.3 Base Parameter comparison on E. coli dataset

S.N  Clustering Algorithm   Threshold  Iteration  Time     Error Rate
1    K-means Algorithm      0.31       9          5.17923  5.16772
2    KVG Algorithm          0.31       10         7.90002  1.51839

6.4 Performance Measures

Figure 6.10 Comparison of Error rate between k-mean and KVG


Figure 6.11 Comparison of Iteration through cluster between k-mean and KVG

6.5 The result of simulation and comparison of algorithms

Before drawing a conclusion, we give some definitions for the kinds of parameters used. As said previously, the objective is the assessment of the K-means clustering method and of the proposed GA-based clustering, and the comparison of their results. For this, we use several collections of different parameters. The first example is the thyroid dataset, which consists of 3 classes: normal, hyperthyroidism and hypothyroidism. The second thyroid dataset consists of 7200 patients with 21 features each, 15 binary (from x2 to x16) and 6 continuous (x1, x17, x18, x19, x20, x21). The class labels are encoded as 1, 0.5 and 0, so the number of classes is 3. Another parameter collection, called Vowel, contains 871 records; each record consists of 3 indexes f1, f2, f3 and is to be assigned to one of 6 classes. A third parameter collection, called Crude Oil, contains 56 records; each record consists of 5 indexes and is to be assigned to one of 3 classes.


GA-Clustering implementation is done with these parameters:

Population size (pop-size): 100

Crossover rate (PC): 0.8

Mutation rate (PM): 0.6

Maximum iteration: 100

K-means clustering is done with these parameters:

Number of clusters K: different for each parameter collection.

Maximum iteration: 100

6.6 Observed Results

The numbers in the tables above are the clustering criterion μ; lower values indicate better clustering. This conclusion was confirmed repeatedly, with different initial populations, on the different parameter collections.

6.7 Obtained from running the two algorithms on the Thyroid Dataset

S.N  Clustering Algorithm   Threshold  Iteration  Time     Error Rate
1    K-means Algorithm      0.21       5          2.9328   3.8631
2    KVG Algorithm          0.21       6          4.92963  1.15938


6.8 Obtained from running the two algorithms on the Breast Cancer Dataset

S.N  Clustering Algorithm   Threshold  Iteration  Time     Error Rate
1    K-means Algorithm      0.51       4          2.58962  4.3169
2    KVG Algorithm          0.51       5          4.38682  1.34741

6.9 Obtained from running the two algorithms on the E. coli Dataset

S.N  Clustering Algorithm   Threshold  Iteration  Time     Error Rate
1    K-means Algorithm      0.31       9          5.17923  5.16772
2    KVG Algorithm          0.31       10         7.90002  1.51839


CHAPTER 7

CONCLUSION AND FUTURE WORK

In this dissertation we surveyed the K-Means algorithm, one of the most popular clustering techniques, and modified it by applying an optimization method, the genetic algorithm, to improve the unsupervised clustering procedure. Genetic algorithms are population-based methods that use genetic operators to process the chromosomes of a population. In this research, we defined a chromosome string representation and combined K-Means and the GA. Simulations over different runs show that K-Means clustering based on the genetic algorithm improves the clustering measurement considerably and is more effective than pure K-Means.

Future Work:

In this dissertation we have observed that the time complexity is greater than that of the previous K-means algorithm. In future research we want to minimize the execution time of the proposed algorithm.

By optimizing and controlling the iterations, the simple k-means algorithm has been improved; however, the process remains complex, so in the future we will attempt to bind VSM with the GA and minimize some of the complex steps.


REFERENCES

1. Yi Lu, Shiyong Lu, Farshad Fotouhi, “FGKA: A Fast Genetic K-means Clustering

Algorithm”,IEEE Third International conference on Symposium on

Bioinformatics and Bioengineering ,ACM ,PP.622-625,2011.

2. Qinghe Zhang, Xiaoyun Chen, “Agglomerative Hierarchical Clustering based on

Affinity Propagation Algorithm” Third International Symposium on Knowledge

Acquisition and Modeling,Vol.22,PP.56-59, 2010.

3. Venkatesh Katari,Suresh Chandra Satapathy,“Hybridized Improved Genetic

Algorithm with Variable Length Chromosome for Image Clustering” IJCSNS

International Journal of Computer Science and Network Security, VOL.7

No.11,PP.17-27,2007

4. Kumar Dhiraj and Santanu Kumar Rath, “Gene Expression Analysis Using

Clustering”, International Journal of Computer and Electrical Engineering, Vol. 1,

No. 2,PP.62-66, 2009

5. Kailash Chander, Dr. Dinesh Kumar, Vijay Kumar,” Enhancing Cluster

Compactness using Genetic Algorithm Initialized K-means” International Journal

of Software Engineering Research & Practices Vol.1, Issue 1, PP.141-145, 2011.

6. O.A. Mohamed Jafar and R. Sivakumar,” Ant-based Clustering Algorithms: A

Brief Survey”, International Journal of Computer Theory and Engineering, Vol. 2,

No. 5, PP.407-411 , 2010.

7. Chengjie GU1, Shunyi ZHANG1, Kai LIU1, He HUANG2, “Fuzzy Kernel K-

Means Clustering Method Based on Immune Genetic Algorithm,” Journal of

Computational Information Systems ,Vol.3,PP.56-59,2011.

8. Wei Song and Soon Cheol Park,” An Improved Genetic Algorithm for Document

Clustering with Semantic Similarity Measure” IEEE Forth International

conference on Natural Computation, vol. 1, pp. 536-540, 2008.

9. Wei Song and Soon Cheol Park, “Analysis of Web Clustering Based on Genetic

Algorithm with Latent Semantic Indexing Technology”,IEEE Sixth International

Conference on Advanced Language Processing and Web Information Technology,

Vol.1, PP. 81-87,2007

10. Istvan Gergely Czibula1, Gabriela Czibula, “Hierarchical Clustering for Adaptive

Refactoring Identification” IEEE International conference on cloud computing,

vol.4,PP.123-126,2007


11. Mingzhen Chen, Yu Song,“Summarization of Text Clustering based Vector Space

Model”,IEEE 10th International Conference, ISSN: 13468030, Vol. 6, PP. 919–

938. 2009.

12. Bashar Al-Shboul, and Sung-Hyon Myaeng, “Initializing K-Means using Genetic

Algorithms”, World Academy of Science, Engineering and Technology,

Knowledge and Data Engineering, Vol.17, No.10, PP.1363-1366,2009.

13. J.M. Pena, J.A. Lozano, P. Larranaga, “An empirical comparison of four

initialization methods for the K-Means algorithm” Department of Computer

Science and Artificial Intelligence, Intelligent Systems Group, Vol. 3242 ,PP.

519-547,1999.

14. Daniel Costa, “An evolutionary tabu search algorithm and the NHL scheduling

problem” Vol. 55, PP.56-59,1994.

15. Grefenstette J., "Incorporating Problem Specific Knowledge into Genetic Algorithms", Genetic Algorithms and Simulated Annealing, Hingham, ed. Davis (Pitman, London and Morgan Kaufmann Publishers, Inc., 1987), PP. 42-60.

16. N. Belkin and W. Croft. Retrieval techniques. In M. Williams, editor, Annual

Review of Information Science and Technology (ARIST), Elsevier Science

Publishers B.V, Vol. 22, chapter 4, pages 109-145, 1987.

17. W. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures &

Algorithms. Prentice Hall, Englewood Cliffs, New Jersey, 1992.

18. http://archive.ics.uci.edu/ml/datasets/Wine

http://www.mathworks.in/index.html

19. http://homes.esat.kuleuven.be/~bioi_marchal/MotifSuite/benchmarktest.php.


PUBLICATION

Amit Dubey, Prof. Anurag Jain and Dr. A.K. Sachan, "A survey: Performance improving of K-mean by Genetic Algorithm", accepted in International Journal of Computational Intelligence and Information Security (IJCIIS), Australia, Vol. 2, PP. 25-29, 2011.