Genetic Algorithm based Clustering
Techniques and Tree based Validation in
Producing and Evaluating Sensible Clusters
Abul Hashem Beg
A thesis submitted in fulfilment of the requirements for the degree of
Doctor of Philosophy
School of Computing and Mathematics
Charles Sturt University
Panorama Avenue, Bathurst, NSW 2795, Australia
September 2017
Declaration of Authorship
I, Abul Hashem Beg, hereby declare that this submission titled “A Novel Genetic Algorithm
based Clustering and Tree based validation in Producing and Evaluating Sensible Clusters” is
my own work and that, to the best of my knowledge and belief, it contains no
material previously published or written by another person, nor material which to a substantial
extent has been accepted for the award of any other degree or diploma at Charles Sturt
University or any other educational institution, except where due acknowledgement is made in
the thesis. Any contribution made to the research by colleagues with whom I have worked at
Charles Sturt University or elsewhere during my candidature is fully acknowledged.
I agree that this thesis be accessible for the purpose of study and research in accordance with
normal conditions established by the Executive Director, Library Services, Charles Sturt
University or nominee, for the care, loan and reproduction of the thesis, subject to confidentiality
provisions as approved by the University.
Signature:
Date: 25/09/2017
Acknowledgement
First and foremost, I would like to thank the almighty Allah for blessing me with the strength,
knowledge, ability and opportunity to complete this research work. This achievement would not
have been possible without his blessings.
I would like to express my deepest gratitude to my principal supervisor Associate Professor
Dr Md Zahidul Islam for his continuous support, discussions, suggestions and valuable time
throughout my PhD. His constant inspiration, valuable guidance and directions made this work
possible. My sincere and cordial appreciation goes to him, because without his supervision I
could not be who I am today.
I am also thankful to my co-supervisors Professor Vladimir Estivill-Castro and Dr Peter
White for their kind support and suggestions during my study. I am also grateful to Charles Sturt
University for providing my scholarship, and to the Centre for Research in Complex Systems
(CRiCS) for providing a pleasant working environment.
My special thanks to my parents, my wife, my brothers and sister, brother-in-law and sisters-
in-law, my nieces and other relatives for their support and encouragement, especially to my elder
brother Associate Professor Dr Md Dalour Hossen Beg for his kind support during my PhD.
I am grateful to my friends Nasim Adnan, Dr Anisur Rahman, Dr Geaur Rahman, Dr Zavid
Parvez, Samuel Fletcher, Michael Siers, Fazley Rabbi, Khubeb Siddiqui, Pallab Podder, Jahid
Reza, Musfequs Salehin and Buyani for their moral support throughout my PhD. I also thank
all faculty and staff members of the School of Computing and Mathematics and
all postgraduate students for being very supportive and friendly to me during my study.
With love and gratitude this thesis is dedicated to
My Parents
Mily, My wife
My Brothers and Sister
My Brother-in-law and Sisters-in-law
My Father-in-law, My Mother-in-law, My Nieces
and All Relatives
for their support and inspiration.
Publications from the Thesis
[1] Beg, A. H., Islam, M. Z., and Estivill-Castro, V. (2016): Application of a Novel GA-based
Clustering and Tree based Validation on a Brain Data Set for Knowledge Discovery,
Information Systems, ELSEVIER. (Status: Under Review). (ERA 2010 Rank A*, SJR
2016 Rank Q1, H Index 64).
[2] Beg, A. H., Islam, M. Z., and Estivill-Castro, V. (2016): Genetic Algorithm with Healthy
Population and Multiple Streams Sharing Information for Clustering, Knowledge-Based
Systems, 114 (2016) 61-78, ELSEVIER. (ABDC 2016 Rank A, SJR 2016 Rank Q1, 5
Year Impact Factor: 3.433, H Index 63).
[3] Beg, A. H. and Islam, M. Z. (2016): A Novel Genetic Algorithm-Based Clustering
Technique and its Suitability for Knowledge Discovery from a Brain Data set, In Proc.
of the IEEE Congress on Evolutionary Computation (IEEE CEC 2016), Vancouver,
Canada, July 24-29, 2016, pp. 948-956. (ERA 2010 Rank A).
[4] Beg, A. H. and Islam, M. Z. (2016): Novel crossover and mutation operation in genetic
algorithm for Clustering, In Proc. of the IEEE Congress on Evolutionary Computation
(IEEE CEC 2016), Vancouver, Canada, July 24-29, 2016, pp. 2114-2121. (ERA 2010
Rank A).
[5] Beg, A. H. and Islam, M. Z. (2016): Branches of Evolutionary Algorithms and their
Effectiveness for Clustering Records, In Proc. of the 11th IEEE Conference on Industrial
Electronics and Applications (ICIEA 2016), Hefei, China, June 5-7, 2016, pp. 2484-
2489. (ERA 2010 Rank A).
[6] Beg, A. H. and Islam, M. Z. (2016): Advantages and Limitations of Genetic Algorithms
for Clustering Records, In Proc. of the 11th IEEE Conference on Industrial Electronics
and Applications (ICIEA 2016), Hefei, China, June 5-7, 2016, pp. 2478-2483. (ERA
2010 Rank A).
[7] Beg, A. H. and Islam, M. Z. (2016): Genetic Algorithm with Novel Crossover, Selection
and Health Check for Clustering, In Proc. of the 24th European Symposium on Artificial
Neural Networks, Computational Intelligence and Machine Learning (ESANN 2016),
Bruges, Belgium, April 27-29, 2016, pp. 575-580. (ERA 2010 Rank B).
[8] Beg, A. H. and Islam, M. Z. (2015): Clustering by Genetic Algorithm - High Quality
Chromosome Selection for Initial Population, In Proc. of the 10th IEEE Conference on
Industrial Electronics and Applications (ICIEA 2015), Auckland, New Zealand, June
15-17, 2015, pp. 129-134. (ERA 2010 Rank A).
ERA: Excellence in Research for Australia
SJR: SCImago Journal Rank
ABDC: Australian Business Deans Council
Abstract
Clustering is an important technique in the area of data mining, which aims to group similar
records in one cluster and dissimilar records in different clusters. Clustering is used in various
fields for knowledge discovery and for facilitating decision-making processes, and many clustering
approaches have been proposed. However, many of them have various limitations, including
the requirement for user input on the number of clusters, the tendency to get stuck at local
optima, and a high complexity of 𝑂(𝑛²). There is room for improvement in the cluster quality
produced by existing methods. We also observe that existing cluster evaluation methods often
produce inappropriate/biased evaluation values. A good cluster evaluation technique is therefore
critical.
In this study, we propose a number of clustering techniques that produce high-quality clusters
through the improvement of various genetic operations, with a low complexity of 𝑂(𝑛) and
no user input required on the number of clusters. We also demonstrate through a graphical
visualization that many existing clustering techniques often do not produce sensible clusters,
and such clusters may not be useful for knowledge discovery from the underlying data sets.
Sometimes they produce a huge number of clusters, and sometimes they derive only two
clusters, where one cluster contains a single record and the other contains all remaining records.
Hence, in this study we propose a clustering technique that produces sensible clustering
solutions. We graphically visualize the clustering results of our proposed technique on a brain
data set and demonstrate its usefulness for knowledge discovery. We also propose an
evaluation technique for clustering results, and we validate the effectiveness of the proposed
evaluation method by applying it to ground-truth clustering results.
Table of Contents
Declaration of Authorship........................................................................................................... i
Acknowledgement ..................................................................................................................... ii
Publications from the Thesis ..................................................................................................... iv
Abstract ..................................................................................................................................... vi
Table of Contents ..................................................................................................................... vii
Principal Notations.................................................................................................................. xvi
List of Figures .......................................................................................................................xviii
List of Tables ........................................................................................................................xxiii
Chapter 1 Introduction ....................................................................................................... 1
Chapter 2 Literature Review ............................................................................................ 10
2.1 Introduction ............................................................................................................... 10
2.2 Data Set with its Notations and Definition................................................................ 11
2.3 Data Mining............................................................................................................... 12
2.4 Machine Learning ..................................................................................................... 12
2.5 Clustering .................................................................................................................. 14
2.5.1 Applications of Clustering ................................................................................. 14
2.5.2 Categories of Clustering Techniques ................................................................. 16
2.5.2.1 Partition-based Clustering Techniques ......................................................... 17
2.5.2.2 Hierarchical Clustering Techniques .............................................................. 21
2.5.2.3 Density-based Clustering Techniques ........................................................... 22
2.5.2.4 Graph-based Clustering Techniques ............................................................. 23
2.5.2.5 Grid-based Clustering Techniques ................................................................ 23
2.5.2.6 Spectral Clustering Techniques .................................................................... 24
2.5.2.7 Model-based Clustering Techniques ............................................................. 25
2.5.2.8 Evolutionary Algorithm-based Clustering Techniques .................................... 26
Ant Colony Algorithm-based Clustering Techniques ........................................... 27
Bee Colony Algorithm-based Clustering Techniques ........................................... 30
Particle Swarm Optimization (PSO) Algorithm-based Clustering Techniques .... 31
Black hole Algorithm-based Clustering Techniques ............................................. 32
Firefly Algorithm-based Clustering Techniques ................................................... 33
Genetic Algorithm-based Clustering Techniques.................................................. 33
2.6 Distance Calculation ................................................................................................. 41
2.6.1 Distance Calculation for Numerical Attributes.................................................. 41
Minkowski Distance .............................................................................................. 42
Manhattan Distance ............................................................................................... 42
Euclidean Distance ................................................................................................ 42
Chebyshev Distance .............................................................................................. 43
Cosine Distance ..................................................................................................... 43
Jaccard distance ..................................................................................................... 43
2.6.2 Distance Calculation for Categorical Attributes ................................................ 44
2.7 Cluster Evaluation Techniques.................................................................................. 46
2.7.1 Internal Cluster Evaluation Techniques ............................................................. 47
Sum of Square Error (SSE) ................................................................................... 47
Davies-Bouldin (DB) Index ................................................................................... 47
Silhouette Coefficient ............................................................................................ 48
Xie-Beni Index ...................................................................................................... 49
Dunn Index ............................................................................................................ 50
2.7.2 External Cluster Evaluation Techniques ............................................................ 50
F-measure .............................................................................................................. 50
Purity ..................................................................................................................... 51
Entropy .................................................................................................................. 52
2.8 Summary ................................................................................................................... 52
Chapter 3 High-Quality Initial Population in a GA for High-Quality Clustering with
Low Complexity ..................................................................................................................... 56
3.1 Introduction ............................................................................................................... 56
3.2 DeRanClust: Deterministic and Random Selection for the Initial Population in a GA-
Based Clustering Technique................................................................................................. 59
Step 1: Normalization .................................................................................................... 59
Step 2: Population Initialization .................................................................................... 60
Step 3: Noise-based Selection Operation ...................................................................... 64
Step 4: Crossover Operation .......................................................................................... 64
Step 5: Twin Removal ................................................................................................... 65
Step 6: Mutation Operation ........................................................................................... 67
Step 7: Elitist Operation ................................................................................................ 67
3.3 Experimental Results and Discussion ....................................................................... 68
3.3.1 Data Sets ............................................................................................................ 68
3.3.2 Evaluation Criteria ............................................................................................. 68
3.3.3 Experimental Results on All Techniques ........................................................... 69
3.3.4 An Analysis of the Impact of Various Components of DeRanClust ................... 70
3.3.4.1 An Analysis of the Impact of the Population Initialization .................... 70
3.3.4.2 An Analysis of the Impact of the Crossover Operation.......................... 71
3.3.4.3 Cluster Quality Comparison between DeRanClust and Modified
AGCUK ................................................................................................................. 72
3.3.5 Complexity Analysis .......................................................................................... 73
3.4 Summary ................................................................................................................... 74
Chapter 4 Extensive Crossover and Mutation in a GA for High-Quality Clustering with
Low Complexity ..................................................................................................................... 76
4.1 Introduction ............................................................................................................... 76
4.2 The Motivation Behind the Proposed Technique ...................................................... 78
4.3 GMC: Genetic Algorithm with Novel Mutation and Crossover for Clustering ........ 79
Step 1: Normalization ............................................................................................... 79
Step 2: Population Initialization ................................................................................ 80
Step 3: Probabilistic Selection .................................................................................. 81
Step 4: Two Phases of Crossover Operation ............................................................. 81
Step 5: Twin Removal ............................................................................................... 84
Step 6: Three Steps of Mutation Operation .............................................................. 84
Step 7: Elitist Operation ............................................................................................ 85
4.4 Experimental Results and Discussion ....................................................................... 85
4.4.1 Data Sets ............................................................................................................ 85
4.4.2 The Parameters used in the Experiments............................................................ 85
4.4.3 The Experimental Setup ..................................................................................... 86
4.4.4 Experimental Results on All Techniques ........................................................... 87
4.4.5 An Analysis of the Impact of Various Properties of GMC ................................ 88
4.4.5.1 An Analysis of the Impact of the Crossover Operation.......................... 88
4.4.5.2 An Analysis of the Impact of the Mutation Operation ........................... 89
4.4.5.3 An Analysis of the Impact of the Probabilistic Selection Operation ...... 89
4.4.5.4 An Analysis of Improvement in Chromosomes over the Iterations ....... 90
4.4.6 Statistical Analysis ............................................................................................. 91
4.5 Summary ................................................................................................................... 93
Chapter 5 High-Quality Clustering through Novel Crossover, Selection and Health
Check with Low Complexity ................................................................................................. 96
5.1 Introduction ............................................................................................................... 96
5.2 The Motivation Behind the Proposed Technique ...................................................... 98
5.3 GCS: GA with Novel Crossover, Health Check and Selection for Clustering ........ 99
Step 1: Normalization ............................................................................................... 99
Step 2: Population Initialization .............................................................................. 101
Step 3: Two Phases of Selection Operation ............................................................ 101
Step 4: Crossover Operation ................................................................................... 101
Step 5: Twin Removal ............................................................................................. 103
Step 6: Mutation Operation ..................................................................................... 103
Step 7: Health Check Operation .............................................................................. 103
Step 8: Elitist Operation .......................................................................................... 103
5.4 Experimental Results and Discussion ..................................................................... 104
5.4.1 Data Sets .......................................................................................................... 105
5.4.2 Evaluation Criteria ........................................................................................... 105
5.4.3 Experimental Results on All Techniques ......................................................... 105
5.4.4 Comparative Results between GCS and GMC ................................................ 107
5.4.5 An Analysis of the Impact of Various Components of GCS............................. 108
5.4.5.1 An Analysis of the Impact of the Health Check Operation .................. 109
5.4.5.2 An Analysis of the Impact of the Crossover Operation........................ 109
5.4.6 An Analysis of the Improvement in Chromosomes over the Iterations........... 110
5.5 Summary ................................................................................................................. 111
Chapter 6 GA with Multiple Streams and Neighbor Information Sharing for Clustering
................................................................................................................................................ 114
6.1 Introduction ............................................................................................................. 114
6.2 HeMI: Healthy Population and Multiple Streams Sharing Information in a GA for
Clustering ........................................................................................................................... 118
6.2.1 Basic Concepts ................................................................................................. 118
6.2.2 Main Steps ....................................................................................................... 120
Component 1: Normalization .................................................................................. 121
Component 2: Multiple Streams ............................................................................. 123
Component 3: Population Initialization .................................................................. 123
Component 4: Noise-based Selection ..................................................................... 124
Component 5: Crossover Operation ........................................................................ 124
Component 6: Twin Removal ................................................................................. 125
Component 7: Three Steps Mutation Operation ..................................................... 125
Component 8: Health Improvement Operation ....................................................... 127
Component 9: Elitist Operation .............................................................................. 129
Component 10: Neighbor Information Sharing ...................................................... 130
Component 11: Global Best Selection .................................................................... 130
6.2.3 The HeMI Algorithm ....................................................................................... 131
6.3 Experimental Results and Discussion ..................................................................... 132
6.3.1 The Data sets and the Evaluation Criteria........................................................ 132
6.3.2 The Parameters used in the Experiments ......................................................... 134
6.3.3 The Experimental Setup ................................................................................... 134
6.3.4 Experimental Results on All Techniques ......................................................... 135
6.3.5 Comparative Results between HeMI and GCS ................................................ 137
6.3.6 Comparative Results among HeMI, GCS, GMC and DeRanClust ................. 138
6.3.7 An Analysis of the Impact of Various Properties of HeMI ............................. 140
6.3.7.1 An Analysis of the Impact of the Multiple Streams that Exchange
Information .............................................................................................................. 140
6.3.7.2 An Analysis of the Impact of the Population Initialization ........................ 143
6.3.7.3 An Analysis of the Impact of the Mutation Operation ................................ 144
6.3.7.4 An Analysis of the Impact of the Health Improvement .............................. 145
6.3.7.5 An Analysis of the Impact of the Interval ................................................... 145
6.3.7.6 An Analysis of the Impact of the number of Streams ................................. 146
6.3.7.7 An Analysis of the Improvement in Chromosomes over the Iterations ...... 147
6.3.8 Statistical Analysis ........................................................................................... 149
6.3.9 An Analysis on the use of K-means++ instead of K-means in HeMI ............. 151
6.3.10 Complexity Analysis ........................................................................................ 151
6.3.11 Comparison between HeMI and Multiple Runs of K-means........................... 152
6.4 Summary ................................................................................................................. 154
Chapter 7 A Novel GA-based Clustering Technique and its Suitability for Knowledge
Discovery from a Brain Data Set ........................................................................................ 156
7.1 Introduction ............................................................................................................. 156
7.2 The Motivation Behind the Proposed Technique .................................................... 159
7.3 CSClust: High-quality Chromosome Selection and Cleansing Operation in a GA for
Clustering ........................................................................................................................... 162
Step 1: Normalization ............................................................................................. 163
Step 2: Population Initialization .............................................................................. 163
Step 3: Sensible Properties Selection ...................................................................... 163
Step 4: Crossover Operation ................................................................................... 163
Step 5: Mutation Operation ..................................................................................... 164
Step 6: Twin Removal Operation ............................................................................ 165
Step 7: Cleansing Operation.................................................................................... 166
Step 8: Cloning Operation ....................................................................................... 166
Step 9: Elitist Operation .......................................................................................... 167
7.4 Experimental Results and Discussion ..................................................................... 167
7.4.1 The Data sets and the Cluster Evaluation Criteria ........................................... 167
7.4.2 The Parameters used in the Experiments........................................................... 168
7.4.3 The Experimental Setup ................................................................................... 168
7.4.4 Brain Data set (CHB-MIT Scalp EEG) Pre-processing ................................... 169
7.4.5 Experimental Results on Brain Data Set .......................................................... 170
7.4.6 Analysis of the Clustering Result obtained by CSClust on the Brain Data set 170
7.4.7 Knowledge from Decision Tree on Brain Data set .......................................... 175
7.4.8 Experimental Results on 10 Real Life Data sets .............................................. 177
7.4.9 An Analysis of the Improvement in Chromosomes over the Iterations........... 178
7.4.10 Statistical Analysis ........................................................................................... 178
7.5 Summary ................................................................................................................. 179
Chapter 8 Application of a Novel GA-based Clustering and Tree based Validation on
a Brain Data Set for Knowledge Discovery ....................................................................... 182
8.1 Introduction ............................................................................................................. 182
8.2 Our Technique ......................................................................................................... 187
8.2.1 Basic Concepts of Our Clustering Technique HeMI++ ................................... 187
8.2.2 Basic Concepts of Our Cluster Evaluation Technique Tree Index .................. 193
8.2.3 Main Components of HeMI++ ........................................................................ 194
Component 1: Normalization .................................................................................. 195
Component 2: Multiple Stream ............................................................................... 195
Component 3: Population Initialization .................................................................. 195
Component 4: Selection of Sensible Properties ...................................................... 195
Component 5: Noise-based Selection ..................................................................... 196
Component 6: Crossover Operation ........................................................................ 198
Component 7: Twin Removal ................................................................................. 198
Component 8: Three Steps Mutation Operation ..................................................... 198
Component 9: Health Improvement Operation ....................................................... 198
Component 10: Cleansing Operation ...................................................................... 198
Component 11: Cloning Operation ......................................................................... 199
Component 12: The Elitist Operation ..................................................................... 199
Component 13: Neighbor Information Sharing ...................................................... 199
Component 14: Global Best Selection .................................................................... 199
8.2.4 The HeMI++ Algorithm ................................................................................... 200
8.2.5 Our Cluster Evaluation Technique (Tree Index) ............................................. 201
8.3 Experimental Results and Discussion ..................................................................... 202
8.3.1 The Data Sets and the Evaluation Criteria ....................................................... 202
8.3.2 The Parameters used in the Experiments ......................................................... 204
8.3.3 The Experimental Setup ................................................................................... 205
8.3.4 Brain Data set Pre-processing .......................................................................... 205
8.3.5 Clustering Quality Comparison between HeMI++ and Other Techniques on the
MIT-Chb01_03 Data Set ................................................................................................ 205
8.3.6 Analysis of the Clustering Result Obtained by HeMI++ from the CHB-MIT
Scalp EEG (chb01-03) Data Set ..................................................................................... 210
8.3.7 Evaluation of HeMI++ and Tree Index on the LD data set ............................. 214
8.3.8 Experimental Results on All Techniques on 21 Real Life Data Sets .............. 217
8.3.9 An Analysis of the Clustering Quality of HeMI++ on Different Data Sets .... 219
8.3.9.1 Performance of HeMI++ compared to Existing Techniques, based on
Number of Records ................................................................................................. 221
8.3.9.2 Performance of HeMI++ compared to Existing Techniques, based on
Number of Attributes .............................................................................................. 222
8.3.9.3 Performance of HeMI++ compared to Existing Techniques, based on Type
of the Majority of Attributes ................................................................................... 223
8.3.10 Knowledge from the Brain Data ...................................................................... 224
8.3.11 Complexity Analysis ........................................................................................ 227
8.3.12 Statistical Friedman Test.................................................................................. 228
8.4 Summary ................................................................................................................. 230
Chapter 9 Discussion ....................................................................................................... 232
9.1 Introduction ............................................................................................................. 232
9.2 Comparison and Discussion of the Proposed Techniques ........................................ 233
9.2.1 DeRanClust ...................................................................................................... 233
9.2.2 GMC ................................................................................................................ 235
9.2.3 GCS .................................................................................................................. 236
9.2.4 HeMI ................................................................................................................ 237
9.2.5 CSClust ............................................................................................................ 240
9.2.6 HeMI++ ........................................................................................................... 241
9.3 Key Contributions of the Thesis.............................................................................. 244
9.4 Complexity Analysis of the Techniques ................................................................. 245
9.4.1 Notations for Complexity Analysis ................................................................. 245
9.4.2 Complexity of DeRanClust .............................................................................. 245
9.4.3 Complexity of GMC ........................................................................................ 248
9.4.4 Complexity of GCS.......................................................................................... 252
9.4.5 Complexity of HeMI ........................................................................................ 255
9.4.6 Complexity of CSClust .................................................................................... 260
9.4.7 Complexity of HeMI++ ................................................................................... 263
9.4.8 Complexity of AGCUK ................................................................................... 266
9.4.9 Complexity of GAGR ...................................................................................... 266
9.4.10 Complexity of GenClust .................................................................................. 268
9.4.11 Complexity of K-means ................................................................................... 270
9.5 Comparison of the Complexities of the Techniques ............................................... 270
9.6 Summary of the Proposed Techniques .................................................................... 271
9.7 Future Research Directions ..................................................................................... 274
Chapter 10 Conclusion .................................................................................................... 275
References .............................................................................................................................. 279
Principal Notations
This is a list of the principal notations used throughout the thesis.

$D$   A data set
$n$   The number of records of a data set
$R_i$   The $i$th record of a data set
$|A|$   Set of attributes
$m$   The number of attributes of a data set
$m_c$   The number of categorical attributes of a data set
$m_r$   The number of numerical attributes of a data set
$d$   The domain size of an attribute
$k$   The number of clusters
$C$   A set of clusters
$S$   A set of seeds
$P_j^i$   The $j$th chromosome in the $i$th iteration
$f_j^i$   The fitness of the $j$th chromosome in the $i$th iteration
$P_d$   Set of chromosomes generated in the deterministic phase
$P_s$   Set of chromosomes
$P_m$   Set of mutated chromosomes
$I$   User-defined number of iterations/generations
$P_r$   Set of random chromosomes
$F_s$   Fitness of a set of chromosomes
$P_o$   Set of offspring chromosomes
$O$   A pair of offspring chromosomes
$P_v$   Set of chromosomes after the division and absorption operations
$H_s$   Set of healthy chromosomes
$T_j$   Crossover probability of the $j$th chromosome
$f_{mean}$   Average fitness value of the chromosomes in the population
$P_{j,d}$   The $j$th chromosome after division
$P_{j,a}$   The $j$th chromosome after absorption
$M_j$   Mutation probability of the $j$th chromosome
$\Pi_{ij}$   Cosine similarity between the $i$th record $R_i$ and the seed $s_j$ of the $j$th cluster
$Ϛ_{ij}$   Cosine distance between the $i$th record $R_i$ and the seed $s_j$ of the $j$th cluster
$s_j$   The seed of the $j$th cluster $c_j$
$s_{j,a}$   The $a$th attribute value of the seed of the $j$th cluster
$Ӻ_{ij}$   Jaccard coefficient between the $i$th record $R_i$ and the seed $s_j$ of the $j$th cluster
$Ԓ_{ij}$   Jaccard distance between the $i$th record $R_i$ and the seed $s_j$ of the $j$th cluster
$SSE$   Sum of square error
$DB$   Davies-Bouldin Index
$XB$   Xie-Beni Index
$DI$   Dunn Index
$FM$   F-measure
$P_T$   Purity
$e_T$   Entropy
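For reference, the cosine and Jaccard notations above follow the conventional formulations; the sketch below assumes the standard definitions (the exact forms used in the thesis are reviewed in Chapter 2), with the Jaccard form treating $R_i$ and $s_j$ as sets of attribute values, as is conventional for categorical data:

$$\Pi_{ij} = \frac{R_i \cdot s_j}{\lVert R_i \rVert \, \lVert s_j \rVert}, \qquad Ϛ_{ij} = 1 - \Pi_{ij}, \qquad Ӻ_{ij} = \frac{|R_i \cap s_j|}{|R_i \cup s_j|}, \qquad Ԓ_{ij} = 1 - Ӻ_{ij}$$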
List of Figures
Fig. 1.1: The three dimensional CHB-MIT Scalp EEG (chb01-03) data set ............................. 4
Fig. 2.1: Basic steps of Genetic Algorithms (GA) ................................................................... 36
Fig. 3.1: The formation of a chromosome through K-means .................................................. 61
Fig. 3.2: Flowchart of the population initialization ................................................................. 62
Fig. 3.3: Single point crossover between a pair of chromosomes ........................................... 65
Fig. 3.4: Comparative results between DeRanClust and other techniques based on Silhouette
Coefficient................................................................................................................................ 69
Fig. 3.5: Comparative result between DeRanClust and other techniques based on DB Index 70
Fig. 4.1: Comparative results between GMC and other techniques based on Silhouette
Coefficient (higher the better) .................................................................................................. 87
Fig. 4.2: Comparative results between GMC and other techniques based on DB Index (lower
the better) ................................................................................................................................. 87
Fig. 4.3: Comparative results between GMC and GMC without Crossover based on Silhouette
Coefficient (higher the better) .................................................................................................. 88
Fig. 4.4: Comparative results between GMC and GMC without Crossover based on DB Index
(lower the better) ...................................................................................................................... 88
Fig. 4.5: Comparative results between GMC and GMC without Mutation based on Silhouette
Coefficient (higher the better) .................................................................................................. 89
Fig. 4.6: Comparative results between GMC and GMC without Mutation based on DB Index
(lower the better) ...................................................................................................................... 89
Fig. 4.7: Comparative results between GMC and GMC without Probabilistic Selection (PS)
based on Silhouette Coefficient (higher the better) ................................................................. 90
Fig. 4.8: Comparative results between GMC and GMC without Probabilistic Selection (PS)
based on DB Index (lower the better) ...................................................................................... 90
Fig. 4.9: Average fitness (best chromosome fitness) versus iterations over the 10 data sets .. 91
Fig. 4.10: Flow chart of sign test ............................................................................................. 92
Fig. 4.11: Sign test of GMC on 10 data sets ............................................................................ 93
Fig. 5.1: Silhouette Coefficient of the techniques on eight data sets ..................................... 105
Fig. 5.2: Silhouette Coefficient of the techniques on seven data sets .................................... 106
Fig. 5.3: DB Index of the techniques on eight data sets ........................................................ 106
Fig. 5.4: DB Index of the techniques on seven data sets ....................................................... 106
Fig. 5.5: Comparative results between GCS and GMC based on Silhouette Coefficient ...... 108
Fig. 5.6: Comparative results between GCS and GMC based on DB Index ......................... 108
Fig. 5.7: Average fitness (best chromosome) versus Iteration of 20 runs on PID data set .... 110
Fig. 5.8: Average fitness (all chromosomes) versus Iterations. Each line represents the average
fitness of 20 runs on PID data set .......................................................................................... 111
Fig. 6.1: Flowchart of HeMI algorithm ................................................................................. 129
Fig. 6.2: (a) Comparative results between HeMI and other techniques on ten data sets based on
Silhouette Coefficient. (b) Comparative results between HeMI and other techniques on ten data
sets based on Silhouette Coefficient ...................................................................................... 135
Fig. 6.3: (a) Comparative results between HeMI and other techniques on ten data sets based on
DB Index. (b) Comparative results between HeMI and other techniques on ten data sets based
on DB Index ........................................................................................................................... 136
Fig. 6.4: Comparative results between HeMI and HeMI with different Intervals ................. 146
Fig. 6.5: Comparative results between HeMI and HeMI with 8 Streams .............................. 147
Fig. 6.6: Average Fitness versus Iteration. Each line represents the average fitness of the best
chromosome of 5 consecutive runs of HeMI on a data set .................................................... 148
Fig. 6.7: Average Fitness (best chromosome) versus Iteration over the 20 data sets ........... 148
Fig. 6.8: Average Fitness (best chromosome) versus Iteration. Each line represents the average
fitness of 5 consecutive runs on PID data set ........................................................................ 149
Fig. 6.9: (a) Sign test of HeMI based on Silhouette Coefficient on ten data sets. (b) Sign test of
HeMI based on Silhouette Coefficient on ten data sets .......................................... 149
Fig. 6.10: (a) Sign test of HeMI based on DB Index on ten data sets. (b) Sign test of HeMI
based on DB Index on ten data sets ....................................................................................... 150
Fig. 6.11: Comparative result between HeMI and K-means ................................................. 153
Fig. 7.1: The three-dimensional CHB-MIT Scalp EEG (chb01-03) data set ......................... 159
Fig. 7.2: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set ... 160
Fig. 7.3: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set....... 160
Fig. 7.4: Clustering result of GenClust using DB Index on CHB-MIT Scalp EEG (chb01-03)
data set ................................................................................................................................... 161
Fig. 7.5: Clustering result of CSClust on CHB-MIT Scalp EEG (chb01-03) data set .......... 171
Fig. 7.6: Channel positions according to the International 10-20 system (Jasper, 1958;
Sharbrough F et al., 1991)...................................................................................................... 173
Fig. 7.7: Seizure records on different channels ...................................................................... 174
Fig. 7.8: EEG signals (10 seconds) of channel-5 during the non-seizure time ...................... 174
Fig. 7.9: EEG signals (10 seconds) of channel-5 during the seizure time ............................. 174
Fig. 7.10: EEG signals (10 seconds) of channel-7 during the seizure time ........................... 175
Fig. 7.11: EEG signals (10 seconds) of channel-9 during the seizure time ........................... 175
Fig. 7.12: EEG signals (10 seconds) of channel-13 during the seizure time ......................... 175
Fig. 7.13: Decision trees on CHB-MIT (chb01-03) data set.................................................. 176
Fig. 7.14: Comparative results between CSClust and other techniques based on Silhouette
Coefficient (higher the better) ................................................................................................ 177
Fig. 7.15: Comparative results between CSClust and other techniques based on DB Index
(lower the better) .................................................................................................................... 177
Fig. 7.16: Grand Average fitness versus iteration over the 10 data sets ................................ 178
Fig. 7.17: Sign test of CSClust on 11 data sets ...................................................................... 179
Fig. 8.1: The three dimensional CHB-MIT Scalp EEG (chb01-03) data set ......................... 188
Fig. 8.2: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set ... 188
Fig. 8.3: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set....... 190
Fig. 8.4: Clustering result of HeMI on the CHB-MIT Scalp EEG (chb01-03) data set ........ 191
Fig. 8.5: A sensible clustering result on the CHB-MIT Scalp EEG (chb01-03) data set ...... 193
Fig. 8.6: Clustering result of HeMI on the CHB-MIT Scalp EEG (chb01-03) data set ........ 206
Fig. 8.7 : Clustering result of AGCUK on the CHB-MIT Scalp EEG (chb01-03) data set ... 206
Fig. 8.8: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set....... 207
Fig. 8.9: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set ... 208
Fig. 8.10: Clustering result of K-means on the CHB-MIT Scalp EEG (chb01-03) data set . 209
Fig. 8.11: Clustering result of K-means++ on the CHB-MIT Scalp EEG (chb01-03) data set
................................................................................................................................................ 209
Fig. 8.12: Clustering result of our proposed technique, HeMI++ on the CHB-MIT Scalp EEG
(chb01-03) data set ................................................................................................................. 210
Fig. 8.13: Seizure records of different channels .................................................................... 212
Fig. 8.14: EEG signals (10 seconds) of Channel-5 during the non-seizure time ................... 213
Fig. 8.15: EEG signals (10 seconds) of Channel-5 during the seizure time .......................... 213
Fig. 8.16: EEG signals (10 seconds) of Channel-7 during the seizure time .......................... 213
Fig. 8.17: EEG signals (10 seconds) of Channel-9 during the seizure time .......................... 213
Fig. 8.18: EEG signals (10 seconds) of Channel-13 during the seizure time ........................ 213
Fig. 8.19: Channel positions according to the International 10-20 system (Jasper, 1958;
Sharbrough F et al., 1991)...................................................................................................... 214
Fig. 8.20: Clustering result of HeMI on the LD data set ....................................................... 215
Fig. 8.21: Clustering result of AGCUK on the LD data set ................................................... 215
Fig. 8.22: Clustering result of GAGR on the LD data set...................................................... 215
Fig. 8.23: Clustering result of GenClust on the LD data set .................................................. 215
Fig. 8.24: Clustering result of K-means on the LD data set .................................................. 215
Fig. 8.25: Clustering result of K-means++ on the LD data set .............................................. 215
Fig. 8.26: Clustering result of HeMI++ on the LD data set ................................................... 216
Fig. 8.27: The three dimensional LD data set ........................................................................ 216
Fig. 8.28: Scores of the techniques on 15 numerical data sets based on Tree Index ............. 219
Fig. 8.29: Decision trees on the CHB-MIT Scalp EEG (chb01-03) data set ......................... 226
List of Tables
Table 2.1: A synthetic data set ................................................................................................. 11
Table 2.2: List of ranked (based on citation reports) evolutionary algorithm-based clustering
techniques from 1995-2015 ..................................................................................................... 28
Table 2.3: List of ranked (based on citation reports) evolutionary algorithm-based clustering
techniques from 1995-2015 ..................................................................................................... 29
Table 2.4: List of ranked (based on citation reports) GA-based clustering techniques from 1995-
2015.......................................................................................................................................... 35
Table 2.5: Advantages and limitations of currently used clustering techniques ...................... 53
Table 3.1: A brief description of the data sets ......................................................................... 69
Table 3.2: Comparative results between AGCUK and Modified AGCUK ............................. 71
Table 3.3: Comparative result between DeRanClust and DeRanClust without Crossover ..... 72
Table 3.4: Comparative results between DeRanClust and Modified AGCUK ....................... 72
Table 4.1: Data sets at a glance ................................................................................................ 86
Table 5.1: Data sets at a glance .............................................................................................. 104
Table 5.2: Comparative result between GCS and GCS without Health Check ..................... 109
Table 5.3: Comparative result between GCS and GCS with Traditional Crossover ............ 110
Table 6.1: A brief description of the data sets ....................................................................... 133
Table 6.2: Comparative results between HeMI and GCS ...................................................... 138
Table 6.3: Comparative Results among HeMI, GCS, GMC and DeRanClust based on
Silhouette Coefficient ............................................................................................................ 139
Table 6.4: Comparative Results among HeMI, GCS, GMC and DeRanClust based on DB Index
................................................................................................................................................ 139
Table 6.5: Comparative results between AGCUK, AGCUK with 40 Population and AGCUK
with 80 Population ................................................................................................................. 140
Table 6.6: Comparative results between AGCUK with 80 Population and AGCUK with
Multiple Streams .................................................................................................................... 141
Table 6.7: Comparative results between AGCUK with Multiple Streams and AGCUK with
Neighbor Exchange ................................................................................................................ 141
Table 6.8: Comparative results between GenClust, GenClust with Multiple Streams and
GenClust with Neighbor Exchange ....................................................................................... 142
Table 6.9: Comparative results between HeMI, AGCUK with Neighbor Exchange and
GenClust with Neighbor Exchange ....................................................................................... 142
Table 6.10: Comparative results between AGCUK and AGCUK with HeMI Population .... 143
Table 6.11: Comparative results between AGCUK and AGCUK with HeMI Mutation ...... 144
Table 6.12: Comparative results between HeMI and HeMI without Mutation ..................... 144
Table 6.13: Comparative results between HeMI and HeMI without Health Improvement
Operation................................................................................................................................ 145
Table 6.14: Comparative results between HeMI and HeMI with K-means++ ...................... 151
Table 7.1: Data sets at a glance .............................................................................................. 168
Table 7.2: Clustering results of all techniques on the CHB-MIT Scalp EEG (chb01-03) data set
................................................................................................................................................ 170
Table 7.3: Channel-wise number of records in Cluster 2 of CSClust on CHB-MIT Scalp EEG
(chb01-03) data set ................................................................................................................. 172
Table 8.1: Some sensible and non-sensible clustering solutions and their evaluation values
based on the existing cluster evaluation metrics .................................................................... 191
Table 8.2: Cluster results of some sensible and non-sensible clustering solutions based on Tree
Index ...................................................................................................................................... 194
Table 8.3: A brief description of the data sets ....................................................................... 203
Table 8.4: Clustering results of HeMI++ and other techniques based on Tree Index ........... 210
Table 8.5: Channel-wise number of records in Cluster 2 of HeMI++ on the CHB-MIT Scalp
EEG (chb01-03) data set ........................................................................................................ 211
Table 8.6: Comparative results of all the techniques on the LD data set based on Tree Index
and other evaluation techniques ............................................................................................. 216
Table 8.7: Comparative results between HeMI++ and other techniques on 15 numerical data
sets based on Tree Index ........................................................................................................ 218
Table 8.8: Clustering results of HeMI++ and other techniques on 6 categorical data sets based
on Tree Index ......................................................................................................................... 219
Table 8.9: Performance of HeMI++ compared to existing techniques, based on number of
records .................................................................................................................................... 222
Table 8.10: Performance of HeMI++ compared to existing techniques, based on number of
attributes ................................................................................................................................. 223
Table 8.11: Performance of HeMI++ compared to existing techniques, based on type of the
majority of attributes .............................................................................................................. 224
Table 8.12: Silhouette Coefficient rank of the techniques based on Friedman Test (Demšar,
2006; Friedman, 1940) ........................................................................................................... 229
Table 9.1: The complexities of the techniques ...................................................................... 271
Table 9.2: Strengths and weaknesses of the proposed techniques ......................................... 272
Chapter 1
Introduction
Nowadays, with the advancement of scientific technology and the growth of information, huge amounts of data can be collected (Bello-Orgaz, Jung, & Camacho, 2016; Kuo, Syu, Chen, & Tien, 2012). It is difficult for a domain expert to infer knowledge manually from such an enormous amount of data. Data mining techniques are therefore required to acquire information from these huge amounts of data and to facilitate the decision-making process.
Clustering is an important and well-known technique in the area of data mining, which aims
to group similar records in one cluster and dissimilar records in other clusters (D.-X. Chang,
Zhang, & Zheng, 2009; D. Chang, Zhao, Zheng, & Zhang, 2012; Han & Kamber, 2006; Kuo et
al., 2012; Y. Liu, Wu, & Shen, 2011; Pang-Ning Tan, Michael Steinbach, 2005; Rahman &
Islam, 2014). Through clustering, hidden information can be extracted from a data set that can
subsequently help in various decision-making processes (Rahman & Islam, 2014).
Clustering has a wide range of applications including machine learning (Gan, 2013;
Mukhopadhyay & Maulik, 2009), image segmentation (Cai, Chen, & Zhang, 2007; B. N. Li,
Chui, Chang, & Ong, 2011; F. Zhao, Fan, & Liu, 2014), medical imaging and object detection
(Bai et al., 2013; Kannan, Ramathilagam, Sathya, & Pandiyarajan, 2010; Kaya, Pehlivanlı,
Sekizkardeş, & Ibrikci, 2017; B. N. Li et al., 2011; Liao, Lin, & Li, 2008; Masulli & Schenone,
1999; Saha, Alok, & Ekbal, 2016; Son & Tuan, 2017; Sonğur & Top, 2016; Stockman &
Shapiro, 2001), business (M.-Y. Chen, 2013; Montani & Leonardi, 2014) and social network
analysis (Girvan & Newman, 2002). It is therefore very important to produce good-quality
clusters from data sets.
Many approaches for clustering have been proposed (Arthur & Vassilvitskii, 2007; D.-X.
Chang et al., 2009; D. Chang et al., 2012; Y. Liu et al., 2011; Lloyd, 1982; Rahman & Islam,
2014). K-means is one of the most popular techniques for clustering. While K-means is popular
for its simplicity, it has a number of well-known drawbacks (D.-X. Chang et al., 2009; Jain,
2010; Mohd, Beg, Herawan, & Rabbi, 2012; Rahman & Islam, 2014). One of the main
disadvantages of K-means is that it requires a user defined number of clusters (𝑘) prior to
clustering. It is difficult for a user (data miner) to estimate the appropriate number of clusters in
advance. The appropriate number of clusters influences the quality of the final clustering
solution (Kuo et al., 2012).
Another drawback of the K-means clustering technique is that it has a tendency to get stuck
at local optima. Moreover, the random selection of the initial seeds is also considered to be a
major weakness as it heavily influences the final clustering quality (Arthur & Vassilvitskii,
2007). A recent technique called K-means++ (Arthur & Vassilvitskii, 2007) addresses the last
drawback of K-means. However, it also suffers from other drawbacks of K-means as listed
above.
The use of a Genetic Algorithm (GA) in clustering can help a data miner to avoid the local
optima issue of K-means (Agustín-Blas et al., 2012; D.-X. Chang et al., 2009; D. Chang et al.,
2012; He & Tan, 2012; Y. Liu et al., 2011; Peng et al., 2014; Rahman & Islam, 2014). Typically,
a genetic algorithm-based technique does not require any user input on the number of clusters 𝑘.
In a GA, a chromosome contains a set of genes, where a gene is a (real or pseudo) record. A gene
is considered to be the center of a cluster. Therefore, a chromosome is considered to be a
clustering solution.
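To make this representation concrete, the following minimal Python sketch (our own illustrative code, not a specific published technique; the SSE-based fitness and all names are our assumptions) encodes a chromosome as a set of cluster centers and scores it so that healthier chromosomes receive higher fitness values:

```python
import numpy as np

def fitness(chromosome, data):
    """Illustrative fitness: negated SSE, so that a chromosome whose
    genes (cluster centers) fit the data well scores higher."""
    centers = np.asarray(chromosome)
    # distance of every record to every center
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return -np.sum(d.min(axis=1) ** 2)   # each record joins its closest center

rng = np.random.default_rng(0)
data = rng.random((100, 3))              # 100 records, 3 numerical attributes
# a chromosome with four genes, each gene a (real) record used as a center
chromosome = data[rng.choice(len(data), size=4, replace=False)]
print(fitness(chromosome, data))
```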
However, GA-based clustering techniques have some limitations. Many existing techniques
(Y. Liu et al., 2011; Maio, Maltoni, & Rizzi, 1995; Maulik & Bandyopadhyay, 2000; Xiao, Yan,
Zhang, & Tang, 2010) randomly generate the number of genes of a chromosome in the population initialization phase. They also randomly choose records as genes, instead of carefully selecting the genes of a chromosome. Careful selection of genes can create an initial population
containing high-quality chromosomes. A high-quality initial population typically increases the
likelihood of obtaining a good clustering solution at the end of the genetic processing (Diaz-
Gomez & Hougen, 2007; Goldberg, Deb, & Clark, 1991; Rahman & Islam, 2014).
One existing GA-based clustering technique, GenClust (Rahman & Islam, 2014), finds a high-quality initial population and thereby obtains a good clustering solution. However, the initial population selection process of GenClust is very complex, with a complexity of 𝑂(𝑛²), where 𝑛 is the number of records in a data set. Moreover, GenClust requires user input on a number of radius values for the clusters during the initial population selection. It can be very difficult for a user to estimate the set of radius values (i.e. radii). Therefore, in this study we aim to produce high-quality initial seeds with low complexity and with no user input. In this thesis, we propose clustering techniques to improve the final clustering results.
We also carefully analyze the results obtained by both our techniques and other existing techniques, to determine whether they are sensible or not. In order to assess the quality of existing clustering techniques, we use as an example a brain data set (CHB-MIT Scalp) (Goldberger et al., 2000), which is available from https://physionet.org/cgi-bin/atm/ATM. We plot the data set so that we
can graphically visualize the clusters (see Fig. 1.1). We know that this data set has two types of
records: seizure and non-seizure. We can also see in the figure that there are clearly two clusters
of records. We then apply the existing clustering techniques on this data set and plot their
clustering results.
We find that some recent and state-of-the-art clustering techniques such as GAGR (D.-X.
Chang et al., 2009), AGCUK (Y. Liu et al., 2011) and GenClust (Rahman & Islam, 2014) do
not produce sensible clusters. Sometimes, they obtain a huge number of clusters and sometimes
they obtain only two clusters, where one cluster contains one record and the other cluster
contains all remaining records. These solutions are typically not useful in knowledge discovery
from underlying data sets. Therefore, a clustering technique that can produce a sensible
clustering solution is highly desirable. Hence, in this study we also propose a clustering
technique which produces sensible clustering solutions.
Fig. 1.1: The three dimensional CHB-MIT Scalp EEG (chb01-03) data set
During the development of the proposed clustering techniques we realize that the existing
cluster evaluation techniques are biased towards either high numbers of clusters or very low
numbers of clusters. Therefore, we also evaluate the existing cluster evaluation techniques by
analyzing them on some ground truth results, which are also graphically visualized. We find that
the existing evaluation techniques produce better evaluation values for non-sensible clustering
solutions (compared to sensible clustering solutions). Therefore, a good evaluation technique is
also highly required in order to evaluate sensible and non-sensible clustering solutions.
Consequently, in this study we also propose a cluster evaluation technique for appraising
sensible and non-sensible clustering solutions.
The main research goals of this study are therefore as follows:
1. To produce parameter-less clustering techniques with high-quality solutions and low
complexity;
2. To produce sensible clustering solutions; and
3. To evaluate sensible and non-sensible clustering solutions.
In Chapter 2 we provide a comprehensive literature review of the existing techniques for
clustering and cluster evaluation. We discuss the advantages and limitations of the existing
techniques.
In Chapter 3 we propose a GA based clustering technique called DeRanClust that produces
high-quality initial seeds through a deterministic phase and a random phase. The basic idea
behind the proposed technique is that a high-quality initial population typically increases the likelihood of obtaining a good clustering solution at the end of the genetic processing (Diaz-
Gomez & Hougen, 2007; Goldberg et al., 1991; Rahman & Islam, 2014).
DeRanClust therefore aims to produce high-quality initial seeds with a low complexity of
𝑂(𝑛). This technique automatically chooses the number of clusters for the chromosomes in the
initial population, and so does not require any user input for the number of clusters 𝑘.
DeRanClust also reduces the chance of getting stuck at local optima by using our proposed
genetic algorithm for high-quality chromosome selection. In Chapter 3, progress is made
towards achieving the first research goal, as presented above.
In Chapter 4 we present a technique titled GMC, which improves on DeRanClust. There is
room for further improvement of cluster quality of DeRanClust by improving other genetic
operations such as crossover and mutation. GMC therefore uses a new selection operation
whereby a chromosome with higher fitness value has a greater chance of being selected for other
genetic operations such as crossover and mutation.
GMC also proposes a new crossover operation where firstly the chromosomes of a population
are classified in one of two groups: Good group and Non-good group; and then different types
of crossover are performed on the two groups. Intuitively, this will increase the possibility of
obtaining good-quality offspring chromosomes from a pair of good-quality parent
chromosomes. GMC also performs different types of mutation operation for the two different
groups, whereby the number of changes to the good chromosomes is reduced, while the number of changes to the bad chromosomes is increased. In Chapter 4, further progress is made
towards achieving research goal 1.
In Chapter 5 we present a GA-based clustering technique called GCS that represents an
improvement on the techniques proposed in the previous two chapters. Typically, genetic
operations such as crossover and mutation tend to improve the health/fitness of a chromosome,
but they can also cause the health of some chromosomes to deteriorate. GCS therefore uses a
health check operation in order to ensure the presence of healthy chromosomes in a population.
GCS also modifies the process by which a pair of chromosomes is selected in a crossover
operation in order to increase the possibility of getting better quality offspring chromosomes.
GMC (proposed in Chapter 4) uses the new crossover operation whereby a chromosome with
low fitness value always pairs with another low-quality chromosome. GCS subsequently
introduces a new crossover operation whereby each chromosome is able to pair with the best
chromosome. GCS also uses a new selection operation in order to ensure the presence of good-
quality chromosomes in a population at the beginning of each generation. Chapter 5 further
refines the techniques proposed in the previous two chapters, and assists us to move closer
towards achieving our research goal 1.
In Chapter 6 we present a novel technique called HeMI which is a further improvement of
the techniques proposed in previous chapters. It is evident from the literature (Pourvaziri &
Naderi, 2014; Straßburg, Gonzàlez-Martel, & Alexandrov, 2012) and through empirical analysis
undertaken in this study that population size has a positive impact on clustering quality. That is,
a big population size is likely to contribute towards a good clustering solution. However, a large
population size also requires high time complexity.
HeMI therefore uses a big population in multiple streams, where each stream contains a
relatively small number of chromosomes, and can thus keep the execution time low, since the streams are suitable for parallel processing when necessary. HeMI also introduces information sharing among the streams at regular intervals in order to take advantage of the multiple
streams. The presence of healthy chromosomes (i.e. chromosomes with high fitness values) in
a population can increase the possibility of good clustering results. Therefore, HeMI replaces
the sick chromosomes (i.e. chromosomes with low fitness) with healthy chromosomes that are
produced through a novel approach. In Chapter 6, research goal 1 is achieved.
In Chapter 7 we present a GA-based Clustering technique called CSClust, with the aim of
producing sensible clusters. In order to achieve our second goal of producing sensible clustering
solutions we carefully analyze the results of some existing techniques. We find that some recent
clustering techniques do not produce sensible clusters and fail to discover knowledge from
underlying data sets. Sometimes, they obtain a huge number of clusters and sometimes they
obtain only two clusters, where one cluster contains one record and the other cluster contains all
remaining records. Therefore, in CSClust we propose a new cleansing and cloning operation
that helps to produce sensible clusters with high fitness values, which are also useful for
knowledge discovery.
In Chapter 8 we propose a new clustering technique, and an evaluation technique. In the
proposed clustering technique, we combine our previous technique called CSClust with HeMI
where we also significantly improve the components of CSClust and HeMI. Therefore, we call
the proposed technique HeMI++. We first explore the quality of HeMI and some existing
clustering techniques. We also explore the quality of existing evaluation techniques. In Chapter
7, we find that some existing techniques do not produce sensible clusters. However, in Chapter
8, we carefully assess the clustering quality of the existing techniques and HeMI through cluster
visualization.
We find that some of the existing clustering techniques do not produce sensible clusters. In
order to overcome this limitation, HeMI++ incorporates a new component titled Selection of
Sensible Properties. Through this component, HeMI++ first learns important properties of
sensible clustering solutions and then applies the information in producing its clustering
solutions. HeMI++ also uses a cleansing operation in order to identify the sick chromosomes.
The sick chromosomes are then replaced through its cloning operation by a pool of healthy
chromosomes found in the initial population.
During the development of the proposed clustering technique we realize that the existing
cluster evaluation techniques are biased towards either a high number of clusters or a very low
number of clusters. Hence, we also propose a novel cluster evaluation technique called Tree
Index. We validate the effectiveness of Tree Index by analyzing it on some ground truth
clustering results, which are also graphically visualized. While existing evaluation techniques
fail to correctly evaluate the cluster quality, Tree Index scores the sensible solutions higher than
non-sensible solutions.
We empirically compare our proposed clustering technique HeMI++ with five existing
techniques using 21 publicly available data sets in terms of Tree Index. The experimental results
on the 21 publicly available data sets clearly indicate the superiority of HeMI++ over the
existing techniques. We also graphically visualize the clustering results of HeMI++ on a brain
data set and find the results to be more sensible than others. Additionally, we discover some
useful knowledge from the clustering results produced by HeMI++ indicating its usefulness in
knowledge discovery. In Chapter 8, research goals 2 and 3 are attained.
In Chapter 9 we present a detailed analysis on the performance of the proposed techniques.
The performance of HeMI++ is analyzed based on some factors including number of records,
number of attributes and types of the majority of attributes in a data set. Contributions of the
thesis and future research directions are also discussed. Chapter 9 also presents a complexity
analysis of the proposed techniques and some existing techniques.
Finally in Chapter 10 we present our concluding remarks.
Chapter 2
Literature Review
2.1 Introduction
In this chapter, a data set with its notations and definitions is presented in Section 2.2, followed
by a short introduction to data mining in Section 2.3. Section 2.4 provides a brief overview of
machine learning, while Section 2.5 examines clustering, applications of clustering, and
categories of clustering. Different types of distance calculations are set out in Section 2.6; and
Section 2.7 introduces cluster evaluation techniques. A summary of the chapter is presented in
Section 2.8.
During the PhD candidature, we have published the following papers based on this chapter.
Beg, A. H. and Islam, M. Z. (2016): Branches of Evolutionary Algorithms and their Effectiveness for
Clustering Records, In Proc. of the 11th IEEE Conference on Industrial Electronics and Applications
(ICIEA 2016), Hefei, China, June 5-7, 2016, pp. 2484-2489. (ERA 2010 Rank A).
Beg, A. H. and Islam, M. Z. (2016): Advantages and Limitations of Genetic Algorithms for Clustering
Records, In Proc. of the 11th IEEE Conference on Industrial Electronics and Applications (ICIEA 2016),
Hefei, China, June 5-7, 2016, pp. 2478-2483. (ERA 2010 Rank A).
2.2 Data Set with its Notations and Definition
In this thesis, a data set 𝐷 is considered to be a two-dimensional matrix/table having 𝑛 records
(i.e. rows) and 𝑚 attributes (i.e. columns). The data set is represented as 𝐷 = {𝑅1, 𝑅2 … … 𝑅𝑛},
where 𝑅𝑖 is the 𝑖𝑡ℎ record. The set of attributes is denoted as 𝐴 = {𝐴1, 𝐴2 … … 𝐴𝑚}, where 𝐴𝑗 is
the 𝑗𝑡ℎ attribute. Each record 𝑅𝑖 has |𝐴| attributes. An attribute can be categorical and/or
numerical (Han & Kamber, 2006; Pang-Ning Tan, Michael Steinbach, 2005).
The domain of a numerical attribute 𝐴𝑖 is characterized as 𝐴𝑖 = [𝑙𝑖, 𝑢𝑖], where 𝑙𝑖 is the lower
limit and 𝑢𝑖 is the upper limit of the domain of 𝐴𝑖. The domain of a categorical attribute 𝐴𝑗 is
represented as 𝐴𝑗 = {𝐴𝑗¹, 𝐴𝑗², … , 𝐴𝑗ˣ}, where 𝐴𝑗ᵏ is the 𝑘𝑡ℎ domain value and 𝑥 is the domain size of 𝐴𝑗.
Table 2.1: A synthetic data set
Record Student Name Course Marks Grade Study Mode
𝑅1 Daniel Advanced Electronics 87 A Full-Time
𝑅2 Andrew Computer Graphics 92 A+ Full-Time
𝑅3 Matthew Compiler Design 86 A Full-Time
𝑅4 Alex Computer Graphics 67 B Full-Time
𝑅5 Melissa Computer Graphics 75 B+ Part-Time
𝑅6 Anita Theory of Computing 82 A Part-Time
𝑅7 Andrew Object Oriented Programming 39 F Full-Time
𝑅8 Emily Data Structure and Algorithms 94 A+ Part-Time
𝑅9 Samuel Compiler Design 88 A Full-Time
𝑅10 Matthew Theory of Computing 93 A+ Full-Time
In Table 2.1, an example data set with ten records and five attributes is presented. Four
attributes “Student Name”, “Course”, “Grade”, and “Study Mode” are categorical, and one
attribute “Marks” is numerical. The domain values of the numerical attribute “Marks” range
from 39 to 94. The domain values for the categorical attribute “Study Mode” are {Full-Time,
Part-Time}. In a similar way, the domain values of all other attributes can be learnt from Table
2.1.
2.3 Data Mining
Due to the advancement of scientific technology and increase of information, huge amounts of
data can be collected (Bello-Orgaz et al., 2016; Kuo et al., 2012). In most cases, these data are collected in an unstructured way. As it would be difficult for a domain expert to gather knowledge manually from such an enormous amount of data, data mining techniques are required to allow information (patterns) to be acquired and decision-making processes to be facilitated.
Data mining is a technique that discovers useful information from collections of data by
representing the data in a structured way (Han & Kamber, 2006; Pang-Ning Tan, Michael
Steinbach, 2005). Many similar terms are used interchangeably, such as knowledge extraction,
knowledge mining from data, data archaeology, data dredging, and data/pattern analysis (Han
& Kamber, 2006). Organizations use data mining to make better decisions. A data mining
technique identifies interesting patterns (such as the discovery of knowledge and predictive
patterns in data), that otherwise could be very difficult to ascertain, especially from large
collections of data (Hulse, Khoshgoftaar, & Huang, 2007; Pyle, 1999; Sumathi & Sivanandam,
2006).
2.4 Machine Learning
Machine learning is a process by which knowledge from previous data is automatically learnt.
Based on the data, a model that produces some knowledge about the data is built. The knowledge
is then used to analyze future data. Automatic development of learning algorithms without
human interference is the main aspect of machine learning. Typically, machine learning can be
divided into two categories as follows (Md Anisur Rahman, 2014; Roiger & Geatz, 2003; J. Y.
Yang & Ersoy, 2003):
Supervised learning; and
Unsupervised learning.
Supervised Learning
In supervised learning, the data set has a special attribute called the class attribute, which
contains the class value/output of a record. Typically, the domain size of the class attribute is equal to or greater than two (B. Liu, 2011; Md Anisur Rahman, 2014). The data sets are also divided into subsets; namely, training data sets and testing data sets. In a training data set, a record contains a class label (i.e. the class value of the record), whereas in a testing data set, the class value of a record needs to be predicted. Based on the training data set a model is
developed which produces logic rules, which are used to predict the class labels for the records
of the testing data set (i.e. the future data). Supervised learning methods include Support Vector
Machine, Decision Tree, Bayesian Network, Neural Network, Regression Analyses, and so on
(Maimon & Rokach, 2010; Md Anisur Rahman, 2014).
Unsupervised Learning
Unsupervised learning is a data-driven approach, which is also known as learning by
observation (Han & Kamber, 2006). In unsupervised learning, the data set does not have a class
attribute. Unsupervised learning analyzes data to discover the inherent structure of the data.
Unsupervised learning is often considered to be a pre-processing step for supervised learning. Unsupervised learning includes Outlier Detection, Clustering, Dimensionality Reduction, and so on (Chapelle, Scholkopf, & Zien, 2006; Ghahramani, 2004; Md Anisur Rahman, 2014).
2.5 Clustering
Clustering is an important and well-known technique in the area of data mining, which aims to
group similar records in one cluster and dissimilar records in other clusters. Through clustering,
hidden information can be extracted from the data that can help in the decision-making process
(D.-X. Chang et al., 2009; D. Chang et al., 2012; Han & Kamber, 2006; Kuo et al., 2012; Y. Liu
et al., 2011; Pang-Ning Tan, Michael Steinbach, 2005; Rahman & Islam, 2014).
Clustering has a wide range of applications such as machine learning (Gan, 2013;
Mukhopadhyay & Maulik, 2009), image segmentation (Cai et al., 2007; B. N. Li et al., 2011; F.
Zhao et al., 2014), business (M.-Y. Chen, 2013; Montani & Leonardi, 2014), social network
analysis (Girvan & Newman, 2002), and medical imaging (Kannan et al., 2010; Masulli &
Schenone, 1999; Stockman & Shapiro, 2001). It is therefore vital that good-quality clusters are
produced.
2.5.1 Applications of Clustering
Clustering has a wide range of applications, with key applications of clustering as follows:
Psychology and Medicine
An illness or condition frequently has a number of variations, and clustering techniques can be used to identify these different subcategories (Pang-Ning Tan, Michael Steinbach, 2005). For
example, clustering techniques have been used to diagnose different types of depression
(Deckersbach et al., 2016; Dipnall et al., 2017; Miller & Cole, 2012; Rivera-Baltanas et al.,
2014; Suzuki et al., 2014; Van Lancker, Beeckman, Verhaeghe, Van Den Noortgate, & Van
Hecke, 2016).
Gene Analysis
In DNA microarray technology, a huge amount of gene expression data are generated and
monitored simultaneously. Detecting useful patterns from the produced data is valuable for
biomedical research, as it can help to diagnose diseases such as cancer and heart attacks (Md
Anisur Rahman, 2014). Clustering is widely used in gene expression data in order to extract
patterns/hidden information (Brameier & Wiuf, 2007; Kerr, Ruskin, Crane, & Doolan, 2008;
Maraziotis, 2012; Pirim, Ekşioğlu, Perkins, & Yüceer, 2012; Szeto, Liew, Yan, & Tang, 2003;
Xu, Damelin, Nadler, & Wunsch, 2010; Zeng & Garcia-Frias, 2006).
Medical Imaging and Object Detection
Clustering is also widely used in segmenting medical images, robotics and object detection (Bai
et al., 2013; Kannan et al., 2010; Kaya et al., 2017; B. N. Li et al., 2011; Liao et al., 2008;
Masulli & Schenone, 1999; Saha et al., 2016; Son & Tuan, 2017; Sonğur & Top, 2016;
Stockman & Shapiro, 2001). Clustering can partition a medical image into different anatomical
structures that can help to detect diseases (Kannan et al., 2010; B. N. Li et al., 2011; Liao et al.,
2008; Md Anisur Rahman, 2014).
Climate
Clustering is also widely used in predicting climate (Bador, Gilleland, Castellà, & Arivelo,
2015; Y. Chen et al., 2017; Gu, Zhang, Singh, Chen, & Shi, 2016; Merz, Nguyen, &
Vorogushyn, 2016; Parente, Pereira, & Tonini, 2016). In order to understand the earth’s climate
it is necessary to establish the existence of patterns in atmospheric pressure and ocean currents.
Clustering approaches have therefore been applied to determine patterns in the atmospheric
pressure of Polar Regions, and in ocean areas that impact significantly on land climate (Pang-
Ning Tan, Michael Steinbach, 2005).
Social Network Analysis
Social Network Analysis (SNA) is crucial in regard to investigating the social activities of
communities, by analyzing cultural activities, daily activities, employment status, and earnings
among people living within a community. Clustering has received major attention in SNA, in
identifying similar groups of people (Bello-Orgaz et al., 2016; Daraganova et al., 2012; de
Arruda, Costa, & Rodrigues, 2012; Firat, Chatterjee, & Yilmaz, 2007; Firestone, Ward,
Christley, & Dhand, 2011; Giebultowicz, Ali, Yunus, & Emch, 2011; Girvan & Newman,
2002; Hsieh & Magee, 2008; Levine & Kurzban, 2006; Mann, Matula, & Olinick, 2008; Md
Anisur Rahman, 2014; Opsahl & Panzarasa, 2009; Qiao, Li, Li, Peng, & Chen, 2012; Traud,
Mucha, & Porter, 2012; P. Zhao & Zhang, 2011; Z. Zhao et al., 2012).
Business
Clustering is used in business to analyze customers’ requirements and expectations. On the stock
market, clustering is used as a decision support system in predicting the price of a product
(Alexander & Peterson, 2007; ap Gwilym & Verousis, 2010; Ashton & Hudson, 2008; Brown,
Chua, & Mitchell, 2002; Chan, Kwong, & Hu, 2012; M.-Y. Chen, 2013; Gunaratne, Nicol,
Seemann, & Török, 2009; Hruschka, Fettes, & Probst, 2004; Md Anisur Rahman, 2014; Mo,
Kiang, Zou, & Li, 2010; Montani & Leonardi, 2014; Nanda, Mahanty, & Tiwari, 2010; Narayan,
Narayan, & Popp, 2011; Narayan, Narayan, Popp, & D’Rosario, 2011; C.-H. Wang, 2009).
2.5.2 Categories of Clustering Techniques
Different types of clustering techniques include partitioning, hierarchical, graph-based, and
evolutionary algorithm-based. In this thesis, the various clustering methods will be divided into
the following categories.
Partition-based Clustering Techniques;
Hierarchical Clustering Techniques;
Density-based Clustering Techniques;
Graph-based Clustering Techniques;
Grid-based Clustering Techniques;
Spectral Clustering Techniques;
Model-based Clustering Techniques; and
Evolutionary Algorithm-based Clustering Techniques.
2.5.2.1 Partition-based Clustering Techniques
The partition-based clustering technique divides a data set into 𝑘 partitions (𝑘 ≤ 𝑛, where 𝑛 is
the number of records in the data set), where each partition represents a cluster (Han & Kamber,
2006). The clusters are formed in such a way that the records within a cluster are more similar
to each other than to the records in other clusters. A record is allocated to the cluster to whose seed/centroid it has the minimum distance. During the clustering process,
the seed of a cluster is updated based on the records allocated to the cluster, and the records of
the data set change the clusters based on their distances to the updated seeds of the clusters
(Han & Kamber, 2006; Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005).
Partition-based clustering techniques have an objective function which is optimized during the
clustering process. Typically, partition-based clustering techniques can be divided into the
following two categories:
Non-Fuzzy Clustering; and
Fuzzy Clustering.
Non-Fuzzy Clustering
Non-fuzzy clustering is also known as hard clustering or exclusive clustering (Md Anisur
Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005). In this form of clustering, the data
set is separated into non-overlapping clusters in such a way that a record belongs to only one
cluster. K-means (Han & Kamber, 2006; Lloyd, 1982; Pang-Ning Tan, Michael Steinbach,
2005) is one of the most popular non-fuzzy clustering techniques. In K-means, a user (data
miner) is required to define the number of clusters (𝑘) in advance. Based on the user defined
number of clusters, K-means then randomly selects 𝑘 records as initial seeds from the data set
and each record of the data set is then allocated to its closest seed in order to form clusters. The
seed of a cluster is then updated based on the records allocated to the cluster. The updated seed
is a (real or pseudo) record where each attribute value of the updated seed is the average of all
values of the attribute for all records belonging to the cluster.
The process of allocating/re-allocating records to the clusters and updating the seeds is considered to be one iteration of K-means. The iterations continue until a termination condition is met. Typically there are two termination conditions: first, the process terminates if the user defined maximum number of iterations is reached; and second, it terminates if the objective function value does not improve between two consecutive iterations by more than a user defined threshold (Arthur & Vassilvitskii, 2007; Lloyd, 1982; Pang-Ning Tan, Michael Steinbach, 2005).
The objective function of K-means is the sum of the squared error (SSE), also known as
scatter. K-means calculates the error of each record, i.e. its Euclidean distance (see Section 2.6.1) to the closest seed, and then calculates the total sum of the squared errors (Pang-Ning Tan, Michael Steinbach, 2005). The main objective of the K-means algorithm is to minimize the objective function described by the equation:
SSE = \sum_{j=1}^{k} \sum_{R_i \in C_j} dist(S_j, R_i)^2        (Eq. 2.1)

where 𝑘 stands for the number of clusters, 𝑆𝑗 is the seed of the 𝑗𝑡ℎ cluster 𝐶𝑗, and 𝑑𝑖𝑠𝑡(𝑆𝑗, 𝑅𝑖) is the Euclidean distance between the record 𝑅𝑖 and the seed 𝑆𝑗 of the 𝑗𝑡ℎ cluster 𝐶𝑗.
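As a concrete illustration of the allocation and seed-update iterations described above, the following minimal Python sketch (our own illustrative code, assuming numerical attributes; a seed-stability test is used here in place of the threshold-based condition) implements K-means with the SSE objective of Eq. 2.1:

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal K-means sketch: random initial seeds, then alternate
    record allocation and seed update until the seeds stop changing."""
    rng = np.random.default_rng(seed)
    seeds = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # allocate each record to its closest seed (Euclidean distance)
        d = np.linalg.norm(data[:, None, :] - seeds[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update each seed as the attribute-wise average of its records
        new_seeds = np.array([data[labels == j].mean(axis=0)
                              if np.any(labels == j) else seeds[j]
                              for j in range(k)])
        if np.allclose(new_seeds, seeds):      # seeds unchanged: terminate
            break
        seeds = new_seeds
    sse = np.sum(d.min(axis=1) ** 2)           # the objective of Eq. 2.1
    return labels, seeds, sse
```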
K-means++ (Arthur & Vassilvitskii, 2007) is another well-known non-fuzzy clustering
technique. K-means++ chooses only the first seed randomly. It then chooses the second seed in
a probabilistic way, so that the record having the greatest distance from the first seed has the highest probability of being chosen as the second seed. While choosing the third seed, the record having the greatest distance from its nearest seed has the highest probability. In a similar way,
it picks the fourth seed and so on; it picks as many seeds as the user defined number of clusters.
All other components of K-means++ are exactly the same as K-means.
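The probabilistic seeding step of K-means++ can be sketched as follows (an illustrative Python fragment; each later seed is sampled with probability proportional to its squared distance to the nearest seed chosen so far, as described above):

```python
import numpy as np

def kmeanspp_seeds(data, k, seed=0):
    """Pick k initial seeds: the first uniformly at random, each later
    seed with probability proportional to the squared distance to the
    nearest seed chosen so far."""
    rng = np.random.default_rng(seed)
    seeds = [data[rng.integers(len(data))]]            # first seed: random
    for _ in range(k - 1):
        # squared distance of every record to its nearest chosen seed
        d2 = np.min([np.sum((data - s) ** 2, axis=1) for s in seeds], axis=0)
        seeds.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(seeds)
```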
Partition-based clustering techniques have a number of limitations. One of the main
disadvantages of partition-based clustering techniques is that they require a user defined number
of clusters prior to clustering. It is difficult for a user (data miner) to estimate the appropriate
number of clusters in advance. Another drawback to these techniques is that they tend to get
stuck at local optima.
Fuzzy Clustering
Fuzzy clustering (also known as soft clustering) is a process of clustering in which each record
of a data set can belong to more than one cluster (“Fuzzy Clustering,” 2017; Pang-Ning Tan,
Michael Steinbach, 2005). In many real-life data sets, the records may sometimes show a fuzzy
nature in the sense that they may have an association with more than one cluster, instead of
completely belonging to only one cluster. For example, a record may have a 70% membership
of one cluster, a 20% membership of another cluster, and a 10% membership of a third cluster
(Kannan et al., 2010; Md Anisur Rahman, 2014).
The fuzzy C-means technique – also known as FCM – explores this fuzzy nature of the records (Bezdek
& C., 1981; Md Anisur Rahman, 2014). A record in a data set belonging to multiple clusters has
different membership degrees. These membership degrees (known as fuzzy membership degrees) indicate the degree to which a record belongs to each cluster (Abonyi János & Feil Balázs, 2007; Md
Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005). For each record of the data
set, FCM techniques allocate a fuzzy membership degree between the record and each cluster, in order to signify the level of attachment between the record and the cluster. The fuzzy membership
degrees of a record with different clusters can vary, as it does not belong to just one cluster. The
summation of the fuzzy membership degrees of a record with all clusters is equal to one (Md
Anisur Rahman, 2014).
Typically, FCM works on data sets with numerical attributes only (Bezdek & C., 1981; Md
Anisur Rahman, 2014). After initialization, FCM computes the seeds/centroids of the clusters
based on the fuzzy membership degrees. Once the seeds of the clusters are obtained, it next
calculates the fuzzy membership degree of each record. The fuzzy membership degree of a
record and the seed of a cluster is inversely proportional to the distance between the seed and
the record. In a similar way to K-means, FCM techniques iteratively recompute the seeds and
fuzzy membership degrees until the seeds/centroids no longer change (Pang-Ning Tan, Michael
Steinbach, 2005).
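The two alternating updates of FCM can be sketched in Python as follows (illustrative only; the fuzzifier value m = 2 is a common but assumed choice, and the records are assumed to be numerical):

```python
import numpy as np

def fcm_step(data, centers, m=2.0, eps=1e-9):
    """One FCM iteration: update the fuzzy membership degrees, then the
    centers. The memberships of each record over all clusters sum to one."""
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2) + eps
    inv = d ** (-2.0 / (m - 1.0))   # membership inversely related to distance
    u = inv / inv.sum(axis=1, keepdims=True)
    w = u ** m                      # new centers: membership-weighted means
    centers = (w.T @ data) / w.sum(axis=0)[:, None]
    return u, centers
```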
Fuzzy clustering techniques have a number of limitations. Many of the existing fuzzy
clustering techniques, including FCM, do not work on data sets with both categorical and
numerical attributes; in fact few fuzzy clustering techniques exist that can process such data
sets. Moreover, many techniques require various user inputs such as the number of clusters, and randomly select the initial fuzzy membership degrees.
2.5.2.2 Hierarchical Clustering Techniques
Hierarchical clustering successively merges smaller clusters into larger clusters, or splits larger clusters into smaller ones (Han & Kamber, 2006; Pang-Ning Tan, Michael Steinbach, 2005). A tree structure called a dendrogram is used to illustrate hierarchical clustering. The two types of
hierarchical clustering include:
Agglomerative Hierarchical Clustering; and
Divisive Hierarchical Clustering.
Agglomerative Hierarchical Clustering
Agglomerative hierarchical clustering is a bottom-up approach where each individual record of a data set is initially considered as a cluster, and the two most similar clusters are then iteratively merged based on a similarity measure. The merging process continues until a single cluster is obtained or a termination condition is satisfied (Han & Kamber, 2006; Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005).
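For illustration, agglomerative clustering is readily available in SciPy; the following sketch (with arbitrary example data and an assumed choice of average linkage) builds the dendrogram linkage and then cuts it into a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.random((50, 3))        # 50 records, 3 numerical attributes

# bottom-up merging; 'average' linkage measures the similarity of two
# clusters as the mean pairwise distance between their records
Z = linkage(data, method='average')
labels = fcluster(Z, t=4, criterion='maxclust')   # cut the dendrogram at 4 clusters
```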
Divisive Hierarchical Clustering
Divisive hierarchical clustering is a top-down approach where the records of a data set are
considered as one large cluster. The large cluster is then split into smaller clusters in such a
way that the most similar records are placed in one cluster. The division process is continued
until each record of the data set forms a separate cluster (Han & Kamber, 2006; Pang-Ning Tan,
Michael Steinbach, 2005). The divisive hierarchical process is, in effect, the reverse of the
agglomerative hierarchical process.
The major problem with many hierarchical clustering techniques is that they require high
computational complexity. In general, the overall complexity of hierarchical clustering is 𝑂(𝑛³)
(Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005).
2.5.2.3 Density-based Clustering Techniques
Density-based clustering techniques discover areas (groups) of high density – in terms of
records in a data set – that are separated from one another by areas of low density. In this
technique, each area of high density (group of records) is considered to be a different cluster
(B. Andreopoulos, An, & Wang, 2007; W. Andreopoulos, 2006; Han & Kamber, 2006; Md
Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005). DBSCAN is a simple and
effective density-based clustering technique, which finds the densest areas in a data set (Ester,
Kriegel, Sander, & Xu, 1996; Han & Kamber, 2006; Md Anisur Rahman, 2014). To find the
densest areas DBSCAN uses a radius (𝑟𝑥) and finds the neighbourhood of a record within its 𝑟𝑥.
The neighbourhood within a radius of a given record is called the neighbourhood of the record.
If the neighbourhood of a record contains a user defined minimum number of records within a
radius, then the record is considered as a core record (Han & Kamber, 2006; Md Anisur Rahman,
2014).
DBSCAN then decreases the number of core records by using a directly density-reachable
function. A record is considered directly density-reachable from a core record if the record is
within the neighbourhood of the core record. DBSCAN iteratively collects directly density-
reachable records from the core records and forms to a few density-reachable clusters. The
process terminates when no record is left to be clustered.
One of the main disadvantages of density-based clustering techniques is that various user
inputs are required, including the radius of a cluster (W. Andreopoulos, 2006; Md Anisur
Rahman, 2014; Omran, Engelbrecht, & Salman, 2007; Sisodia, Singh, Sisodia, & Saxena,
2012).
2.5.2.4 Graph-based Clustering Techniques
In graph-based clustering techniques, records of a data set are represented by vertices, with the
edge (connection) between two records indicating that their similarity is greater than a threshold
value (Han & Kamber, 2006; Md Anisur Rahman, 2014). Graph-based clustering aims to group
the vertices into different clusters based on their similarity. Grouping is performed in such a
way that there should be many edges within each cluster, and relatively few
edges between the clusters (Abonyi János & Feil Balázs, 2007; Z. Chen & Ji, 2010; Md Anisur
Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005).
Graph-based clustering typically uses 𝑘 nearest neighbours to build a graph. Typically, it
uses a minimum spanning tree (MST) to partition the graph into clusters (Galluccio, Michel,
Comon, & Hero, 2012; Md Anisur Rahman, 2014; Zhong, Miao, & Wang, 2010). One limitation of graph-based clustering techniques is that a similarity function must be selected from the wide range of available similarity functions; a similar problem exists in relation to the selection of a similarity graph.
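A common MST-based partitioning scheme can be sketched as follows (our own illustrative Python code, not a specific published technique): build the MST over the pairwise distances and remove the k − 1 heaviest edges, so that the remaining connected components form k clusters:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(data, k):
    """Cut the k-1 heaviest MST edges; the components become clusters."""
    dist = squareform(pdist(data))                   # dense pairwise distances
    mst = minimum_spanning_tree(dist).toarray()      # n-1 edges, stored once each
    heaviest = np.argsort(mst, axis=None)[::-1][:k - 1]
    mst[np.unravel_index(heaviest, mst.shape)] = 0   # remove the heaviest edges
    _, labels = connected_components(mst, directed=False)
    return labels
```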
2.5.2.5 Grid-based Clustering Techniques
The grid-based clustering method uses a multiresolution grid structure. It quantizes the record’s
space into a finite number of cells that form a grid structure on which the clustering operations
are performed. The main advantage of this approach is that it is computationally fast, since its cost depends on the number of cells rather than the number of records (Han & Kamber, 2006; Md Anisur Rahman, 2014; W.
Wang, Yang, & Muntz, 1997). An example of grid-based clustering technique is STING, which
is discussed below.
STING is a grid-based multiresolution clustering technique which divides the spatial area of
the data into rectangular cells (Han & Kamber, 2006; Md Anisur Rahman, 2014). Usually there
are several levels of hierarchy in the rectangular cells. Each cell at a higher level is partitioned into a number of lower-level cells and stores statistical information about attributes, such as the mean, maximum, minimum, standard deviation, and the type of distribution of an attribute. All kinds
of statistical parameters of higher-level cells can easily be calculated based on corresponding
statistical information in lower-level cells.
STING follows the top-down approach. It commences from a predefined layer containing a
small number of cells and removes the irrelevant cells from further consideration (Han &
Kamber, 2006). STING continues this process until it reaches the bottom layer. One of the
key limitations of grid-based clustering techniques is that they require huge amounts of memory
if the number of cells is high (Md Anisur Rahman, 2014; W.M. Ma & Chow, 2004).
2.5.2.6 Spectral Clustering Techniques
The spectral clustering algorithm is one of the most commonly used clustering techniques and
has become a popular research topic (Beauchemin, 2015; X. Hong, Wang, & Qi, 2014; Ma,
Cheng, Liu, & Xie, 2017; Md Anisur Rahman, 2014; Mur, Dormido, Duro, Dormido-Canto, &
Vega, 2016; Nascimento & de Carvalho, 2011; Rafailidis, Constantinou, & Manolopoulos,
2017; Shang, Zhang, Jiao, Wang, & Yang, 2016; von Luxburg, 2007; Y. Yang, Wang, & Xue,
2016). This algorithm is also used for partitioning graphs. It uses a number of mathematical tools, including the similarity matrix and the similarity graph, to partition the records of a data set into different clusters.
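For illustration, the following sketch uses scikit-learn's spectral clustering implementation (the RBF similarity function and the number of clusters are assumed example choices):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
data = rng.random((150, 2))

# affinity='rbf' builds the similarity matrix with a Gaussian kernel; the
# similarity function and the number of clusters remain user decisions
model = SpectralClustering(n_clusters=3, affinity='rbf', random_state=0)
labels = model.fit_predict(data)
```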
One of the main advantages of spectral clustering techniques is that clusters of arbitrary shape can be identified, which partition-based clustering techniques are incapable of detecting (Matthias & Juri, 2009; Md Anisur Rahman, 2014). However, a drawback of spectral
clustering is that it is often hard to select the best similarity function from the wide range of
similarity functions. It is also affected by problems related to the similarity graph, the Laplacian
matrix, and the number of clusters 𝑘.
2.5.2.7 Model-based Clustering Techniques
Model-based clustering methods optimize the fit between the given data and several
mathematical models. Such methods are often based on the assumption that the records of the
data set used for clustering are produced by a mathematical model (Han & Kamber, 2006; Md
Anisur Rahman, 2014). An example of a model-based clustering technique is Expectation
Maximisation (EM), which can be viewed as an extension of the K-means partitioning algorithm. The EM
algorithm is briefly discussed as follows:
Expectation Maximisation
EM assumes that each cluster can be represented by a probability distribution (Han & Kamber,
2006; Md Anisur Rahman, 2014). A data set is the mixture (combination) of these distributions.
Therefore, a mixture model is used to cluster the data, where each distribution represents a
cluster. It is difficult to estimate the parameters of the probability distributions. The EM method
is used for estimating the parameters of the probability distribution. The steps of the algorithm
are described as follows:
Select a set of initial parameters; and
Based on the following two steps iteratively refine the parameters:
Expectation step; and
Maximisation step.
EM first selects the preliminary parameters such as initial seeds and several other parameters.
These parameters are then iteratively updated based on expectation and maximisation. In the
expectation step, EM calculates the probability of cluster membership of each record for each
of the clusters. Based on probabilities from the expectation step, the maximisation step finds the
new approximations of the parameter that maximises the expected likelihood.
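For illustration, the following sketch runs EM through scikit-learn's Gaussian mixture implementation (the number of components and the Gaussian assumption are example choices):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = rng.random((300, 2))

# each of the 3 Gaussian components plays the role of a cluster;
# fit() iterates the expectation and maximisation steps internally
gmm = GaussianMixture(n_components=3, random_state=0).fit(data)
labels = gmm.predict(data)        # hard cluster assignment
probs = gmm.predict_proba(data)   # per-record cluster membership probabilities
```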
However, in a similar way to K-means, EM may get stuck at local optima (Md Anisur Rahman, 2014). One of the downsides of many model-based clustering techniques is that the same type of probability distribution is assumed for each cluster, which may not always be the case (Md Anisur Rahman, 2014; Roy & Parui, 2014).
2.5.2.8 Evolutionary Algorithm-based Clustering Techniques
Partition-based clustering techniques such as K-means have a number of well-known drawbacks.
One of the problems with K-means is that a user defined number of clusters 𝑘 is required. In
reality, it can be difficult for a user (data miner) to estimate the appropriate number of clusters
in advance. Another disadvantage of K-means is its tendency to get stuck at local optima (Arthur
& Vassilvitskii, 2007; Mohd, Beg, Herawan, & Rabbi, 2012; Rahman & Islam, 2014). In order
to overcome these limitations some existing clustering techniques use various evolutionary
algorithms including genetic algorithms, particle swarm optimization, and ant colony
optimization.
In Table 2.2 and Table 2.3, a review of some major evolutionary algorithm-based clustering
techniques from the last twenty years (1995-2015) is presented. In total, 65 ranked evolutionary
algorithm-based clustering approaches are reviewed (in this instance, the term “ranked” is based
on citation reports on 16/04/2016, and Journal Citation Reports/The Computing Research and
Education Association of Australasia ranks). Most of these techniques do not require users to
define the number of clusters in advance, and are used for many real-life applications such as
highway construction projects, gas companies, cellular networks, satellite image segmentations,
and real-world medical problems.
In this thesis, the various evolutionary algorithm-based clustering techniques are identified
as follows:
Ant Colony Algorithm-based Clustering Techniques;
Bee Colony Algorithm-based Clustering Techniques;
Particle Swarm Optimization (PSO) Algorithm-based Clustering Techniques;
Black hole Algorithm-based Clustering Techniques;
Firefly Algorithm-based Clustering Techniques; and
Genetic Algorithm-based Clustering Techniques.
Ant Colony Algorithm-based Clustering Techniques
The ant colony approach produces optimal clustering solutions (C.-L. Huang, Huang, Chang,
Yeh, & Tsai, 2013; İnkaya, Kayalıgil, & Özdemirel, 2015; Korürek & Nizam, 2008; Ramos,
Hatakeyama, Dong, & Hirota, 2009; Shelokar, Jayaraman, & Kulkarni, 2004; Wan, Wang, Li,
& Yang, 2012; L. Zhang & Cao, 2011; L. Zhang, Cao, & Lee, 2013) using a colony of software ants. An ant is an individual representing a clustering solution that contains a number of cluster centers. Typically, a user defined number of records is randomly selected from the data set, and these records collectively form an ant. The quality of
each ant is measured through its fitness value/objective function.
Ant colony optimization features a number of iterations. In each iteration, a user defined number of ants is selected based on their fitness, and a local search operation is applied to the selected ants. In the local search operation, the number of clusters is altered probabilistically.
At the end of each iteration, the ant (clustering solution) is updated using the pheromone trail
matrix. The pheromone trail matrix works as an adaptive memory that contains information
about the previous best solution; this is updated at the end of each iteration (Shelokar et al.,
2004). The algorithm repeatedly carries out the local search and pheromone update procedure
for a maximum number of given iterations.
However, a limitation of many ant colony algorithm-based clustering techniques is that
a user needs to define the number of clusters in advance to form an ant; and it is difficult for a
user to guess the correct number of clusters in advance. Some ant colony algorithm-based
clustering techniques generate the number of clusters randomly to form an ant; however, the
quality of the ant is unlikely to be high due to the random selection process. We argue that
having high-quality ants at the beginning of the iteration can result in better quality final
clustering solutions for a given number of iterations.
Table 2.2: List of ranked (based on citation reports) evolutionary algorithm-based clustering techniques from 1995-2015
Algorithms  Authors  Year of publication  Number of citations  Rank  Initial cluster number selection  Applications
GA Srikanth et al. 1995 88 Q2 Random Real life data sets
GA Maio et al. 1995 24 Q2 Random Map topologies
GA Murthy and Chowdhury 1996 284 Q2 Random Real life data sets
GA Scheunders 1997 205 Q1 Random Image segmentation
GA Tseng and Yang 1997 46 B User defined Real life data sets
GA Tzes et al. 1998 44 Q1 Random DC-motor friction identification
GA Hanagandi and Nikolaou 1998 43 Q1 Random Real life data sets
GA Cucchiara 1998 31 Q2 Random Image segmentation
GA Cowgill et al. 1999 135 Q1 Random Real life data sets
GA Lozano and Larrañaga 1999 47 Q2 Random Real life data sets
GA Demiriz et al. 1999 281 B User defined Real life data sets
GA Maulik and Bandyopadhyay 2000 1011 Q1 Random Synthetic and real life data sets
GA Tseng and Yang 2001 201 Q1 Random Real life data sets
GA Bandyopadhyay and Maulik 2002 301 Q1 Random Satellite image of a part of the city of Mumbai
GA Turgut et al. 2002 129 B Random Mobile ad hoc networks
GA Li and Chiao 2003 16 Q1 User defined Image segmentation
Ant Colony Shelokar et al. 2004 356 Q1 User defined Real life data sets
GA Pakhira et al. 2005 160 Q1 User defined Synthetic and real life data sets
PSO Paterlini and Krink 2006 296 Q3 Random Synthetic and real life data sets
GA Laszlo and Mukherjee 2006 113 Q1 User defined Real life data sets
PSO Jarboui et al. 2007 93 Q1 Random Real life data sets
GA Bandyopadhyay et al. 2007 182 Q1 Random Image Segmentation
Bee Colony Fathian et al. 2007 205 Q1 Random Real life data sets
PSO Das et al. 2008 110 Q2 Random Synthetic and real life data sets
GA Qing et al. 2008 62 Q1 User defined Varied-line-spacing holographic gratings
GA Sheng et al. 2008 31 Q1 Random Hand written signature data
Ant Colony Korürek and Nizam 2008 47 Q1 Random ECG Signals
Table 2.3: List of ranked (based on citation reports) evolutionary algorithm-based clustering techniques from 1995-2015
Algorithms  Authors  Year of publication  Number of citations  Rank  Initial cluster number selection  Applications
PSO Zhao et al. 2009 34 Q2 Random Box–Jenkins gas furnace data set
PSO Yang et al. 2009 100 Q1 Random Synthetic and real life data sets
GA Chang et al. 2009 88 Q1 User defined Real life data sets
Ant Colony Ramos et al. 2009 23 Q1 Random Synthetic and real life data sets
Bee Colony Zhang et al. 2010 208 Q1 Random Real life data sets
PSO Tsai and Kao 2011 51 Q1 Random Synthetic and real life data sets
PSO Kalyani and Swarup 2011 60 Q1 Random Security assessment in power systems
GA Liu et al. 2011 43 Q1 Automatic Real life data sets
GA Yücenur and Demirel 2011 44 Q1 Random Real life data sets
PSO Chuang et al. 2011 46 Q1 Random Real life data sets
Firefly Senthilnath and Mani 2011 172 NA Random Real life data sets
Ant Colony Zhang and Cao 2011 32 Q1 Random Synthetic and real life data sets
Bee Colony Karaboga and Ozturk 2011 493 Q1 Random Real life data sets
Bee Colony Yan et al. 2012 62 Q2 User defined Real life data sets
PSO Sun et al. 2012 42 Q1 Random Synthetic and real life data sets
PSO Cura 2012 41 Q1 Random Synthetic and real life data sets
PSO Kuo et al. 2012 65 Q1 Random Real life data sets
Ant Colony Wan et al. 2012 22 Q1 Random Real life data sets
GA Agustín-Blas et al. 2012 52 Q1 Random Synthetic and real life data sets
Bee Colony Lei et al. 2013 10 Q1 Random MIPS data set
GA Chang et al. 2012 16 Q1 Random Real life data sets
GA Aalaei et al. 2013 2 Q1 User defined Mazandaran Gas Company in Iran
PSO Jiang et al. 2013 24 Q1 Random Real life data sets
Ant Colony Huang et al. 2013 21 Q1 Random Real life data sets
GA Mungle et al. 2013 15 Q1 Random Highway construction project
Ant Colony Zhang et al. 2013 7 Q1 Random Real life data sets
Bee Colony Banharnsakun et al. 2013 12 Q2 Random Real life data sets
Black hole Hatamlou 2013 103 Q1 Random Real life data sets
GA Festa 2013 6 Q3 User defined Real life data sets
Bee Colony Kuo et al. 2014 9 Q1 Random Real life data sets
PSO Cagnina et al. 2014 11 Q1 Random Short-text corpora
GA Wikaisuksakul 2014 13 Q1 Random Synthetic and real life data sets
GA Rahman and Islam 2014 12 Q1 Random Real life data sets
GA Peng et al. 2014 4 Q1 Random Real life data sets
Bee Colony Forsati et al. 2015 3 Q2 Random Real life Data sets
GA Hong et al. 2015 2 Q1 Random Real life data sets
Ant Colony İnkaya et al. 2015 8 Q1 Random Real life data sets
Bee Colony Ozturk et al. 2015 13 Q1 Random Image clustering
Bee Colony Algorithm-based Clustering Techniques
The bee colony approach produces optimal clustering solutions using the concept of a honey
bee swarm (Banharnsakun, Sirinaovakul, & Achalakul, 2013; Fathian, Amiri, & Maroosi, 2007;
Forsati, Keikha, & Shamsfard, 2015; Karaboga & Ozturk, 2011; Kuo, Huang, Lin, Wu, &
Zulvia, 2014; Lei, Tian, Ge, & Zhang, 2013; Menon & Ramakrishnan, 2015; Ozturk, Hancer,
& Karaboga, 2015; Yan, Zhu, Zou, & Wang, 2012; C. Zhang, Ouyang, & Ning, 2010). Honey bee swarms consist of three essential components: food sources, employed bees, and unemployed bees (C. Zhang et al., 2010). The employed bees are associated with a particular food source and share its information with onlooker bees. There are two types of unemployed bees: onlooker bees and scout bees.
Onlooker bees wait in the nest and find a food source based on the information shared by the employed bees. For each food source, there is only one employed bee. If the position of a food source does not improve within a predefined number of iterations, then the employed bee associated with that particular food source becomes a scout bee and starts a new random search.
In the bee colony algorithm, a food source represents a clustering solution. Initially, a number
of food sources are generated randomly. The food source is then improved through employed
bees, onlooker bees, and scout bees in each iteration. Based on fitness value, the best food source
is selected as the final clustering solution.
However, a limitation of many bee colony algorithm-based clustering techniques is that
a user needs to estimate in advance the number of clusters required to form a bee; a difficult
task. Some bee colony algorithm-based clustering techniques randomly generate the number of
clusters required to form a bee. However, the quality of the bee is unlikely to be high due to the
random selection process. We propose that commencing the iteration with high-quality bees can
result in better-quality final clustering solutions for a given number of iterations.
Particle Swarm Optimization (PSO) Algorithm-based Clustering Techniques
Particle swarm optimization (PSO) algorithm-based clustering techniques produce optimal
clustering solutions using the concept of a swarm (Cagnina, Errecalde, Ingaramo, & Rosso,
2014; L.-Y. Chuang, Hsiao, & Yang, 2011; Cura, 2012; Das, Abraham, & Konar, 2008; Jarboui,
Cheikh, Siarry, & Rebai, 2007; Jiang, Wang, & Wang, 2013; Kalyani & Swarup, 2011; Kuo et
al., 2012; Paterlini & Krink, 2006; Sun, Chen, Fang, Wun, & Xu, 2012; Tsai & Kao, 2011; F.
Yang, Sun, & Zhang, 2009; L. Zhao, Yang, & Zeng, 2009). The swarm is a collection of a
number of particles, where a particle is a clustering solution that contains a number of cluster
centers. Typically, a user defined number of records is selected as cluster centers, and these centers collectively form a particle (a clustering solution). The number of particles in a swarm varies from technique to technique, usually from 20 to 40 (Rahman & Islam, 2014).
quality of a particle is measured through its objective function/fitness value.
In PSO, there are a number of iterations that vary from application to application. In each
iteration, a particle moves towards its best previous position and towards the best particle in the
swarm (Cura, 2012; Das et al., 2008; Kuo et al., 2012). Here, a particle's position represents its cluster centers/seeds. Generally, in PSO, each particle has three properties: current position, current
velocity, and personal-best position. The best position of a particle represents cluster centers
that have the best fitness value of the particle. The current position of a particle represents the
cluster centers/seeds of a particle of the current iteration. The velocity of a particle represents
the speed at which a particle changes its current position. Typically, the velocity of a particle
depends on the difference between its current position and its personal-best position, and the
difference between its current position and the best position of any particle in the swarm (the
global best). Usually, each particle has a tendency to move towards both its personal-best
position and the swarm's global-best position. Hence, each particle of the swarm moves towards the best clustering solution over
iterations.
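To make this update rule concrete, a minimal Python sketch of one PSO iteration is given below; the inertia weight w and the acceleration coefficients c1 and c2 are illustrative choices rather than values prescribed by the cited techniques.

import numpy as np

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One PSO iteration: every particle (a flattened array of k cluster
    centers) moves towards its personal-best position and the swarm's
    global-best position."""
    r1 = np.random.rand(*positions.shape)   # random factors in [0, 1)
    r2 = np.random.rand(*positions.shape)
    velocities = (w * velocities
                  + c1 * r1 * (pbest - positions)    # pull towards personal best
                  + c2 * r2 * (gbest - positions))   # pull towards global best
    return positions + velocities, velocities

After each step, the fitness of every particle is re-evaluated and the personal-best and global-best positions are updated accordingly.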
However, a limitation of many PSO-based clustering techniques is that the number of clusters
is randomly generated to form a particle. The quality of the particles is unlikely to be high due
to the random selection process. We contend that the possession of high-quality particles at the
beginning of the iteration can result in better-quality final clustering solutions for a given
number of iterations.
Black hole Algorithm-based Clustering Techniques
The black hole algorithm (Hatamlou, 2013) produces optimal clustering solutions using the
concept of a black hole. A black hole is a region of space that contains so much concentrated
mass that there is no way for a nearby object – including particles and electromagnetic radiation
such as light – to escape its gravitational pull. The black hole algorithm (BH) commences with
an initial population of candidate solutions called stars. In the BH algorithm, at each iteration,
the best candidate solution is selected as the black hole, with the remainder forming
the other stars.
After initializing the population, the black hole starts drawing the stars towards it. If a star
comes into proximity of the black hole, then the star is swallowed by the black hole, and is gone
forever. In such a case, BH then generates a new star randomly, places it within the search space,
and starts a new search. The process is continued until it meets the termination condition (a
maximum number of iterations or a sufficiently good fitness is met).
The limitation of the black hole algorithm is that in each iteration some stars are deleted
based on the event horizon, and in order for the deleted stars to be replaced BH then generates
new stars randomly. We maintain that the quality of the new, randomly generated stars is
unlikely to be better than that of the deleted stars, which had already been improved over the
preceding iterations.
Firefly Algorithm-based Clustering Techniques
A firefly algorithm (FA) is a nature-inspired optimization algorithm that simulates the flashing
behavior of social insects (fireflies) (Abshouri & Bakhtiary, 2012; Hassanzadeh & Meybodi,
2012; Lei, Wang, Wu, Zhang, & Pedrycz, 2016; Mathew & Vijayakumar, 2014; Senthilnath,
Omkar, & Mani, 2011). In FA-based clustering, a firefly is a clustering solution that contains a
number of cluster centers, with records typically selected at random as cluster centers. The
centers of clusters collectively form a firefly/clustering solution. Moreover, the fireflies carry a
luminescence quality known as luciferin that emits light proportional to this value. Each firefly
is attracted to others by the glow or brightness.
Attractiveness decreases as the distance between fireflies increases. A firefly moves
randomly if no other firefly is bright enough to attract it. At the end of each iteration,
FA evaluates the fitness of each firefly and ranks the fireflies based on their fitness value;
selecting the firefly with the highest fitness value as the final clustering solution. The process is
continued until it meets the termination condition (i.e. a maximum number of iterations).
However, a limitation of the firefly algorithm is that it generates the number of clusters
randomly to form a firefly. Due to the random selection processes, the quality of the firefly is
unlikely to be high. We propose that having a high-quality firefly at the beginning of the iteration
can result in a better-quality final clustering solution for a given number of iterations.
Genetic Algorithm-based Clustering Techniques
Genetic algorithms (GA) are randomized search and optimization techniques based on the
concepts of Darwin’s law of evolution “Survival of the fittest in natural selection”, proposed by
John H. Holland (Holland, 1975). This algorithm simulates the biological structure of the
genetic evolution process. GA-based clustering techniques have a number of advantages: they
do not require any user defined "number of clusters" prior to the commencement of the
clustering process, and they address the local optima issue of the partition-based clustering
techniques.
In Table 2.4, an examination of some major GA-based clustering techniques of the last
twenty years is presented. A total of 45 ranked GA-based clustering approaches are reviewed,
which are used for real-life applications such as real-life data sets, highway construction
projects, gas companies, cellular networks, and satellite image segmentations (in this instance,
the term “ranked” is based on citation reports and JCR/CORE rank). Almost two-thirds of the
techniques do not require a user to define the number of clusters (Aalaei, Fazlollahtabar,
Mahdavi, Mahdavi-Amiri, & Yahyanejad, 2013; Abolhassani, Salt, & Dodds, 2004; Agustín-
Blas et al., 2012; S. Bandyopadhyay, Maulik, & Mukhopadhyay, 2007; Sanghamitra
Bandyopadhyay & Maulik, 2002; D.-X. Chang et al., 2009; D. Chang et al., 2012; Cheng, Lee,
& Wong, 2002; Chiou & Lan, 2001; Cowgill, Harvey, & Watson, 1999; Cucchiara, 1998;
Demiriz, Demiriz, Bennett, & Embrechts, 1999; Deng, He, & Xu, 2010; Dimopoulos & Mort,
2001; Festa, 2013; Garai & Chaudhuri, 2004; Hanagandi & Nikolaou, 1998; He & Tan, 2012;
T.-P. Hong, Chen, & Lin, 2015; Y. Hong & Kwong, 2008; M. Laszlo & Mukherjee, 2006;
Michael Laszlo & Mukherjee, 2007; C.-T. Li & Chiao, 2003; Lin Yu Tseng & Shiueng Bien
Yang, 1997; Y. Liu et al., 2011; Lozano & Larrañaga, 1999; Maio et al., 1995; Maulik &
Bandyopadhyay, 2000; Mungle, Benyoucef, Son, & Tiwari, 2013; Murthy & Chowdhury, 1996;
Neto, Meyer, & Jones, 2006; Pakhira, Bandyopadhyay, & Maulik, 2005; Peng et al., 2014; Qing,
Gang, Zaiyue, & Qiuping, 2008; Rahman & Islam, 2014; Scheunders, 1997; Sheng, Howells,
Fairhurst, & Deravi, 2008; Song, Li, & Park, 2009; Srikanth et al., 1995; Tseng & Yang, 2001;
Turgut, Das, Elmasri, & Turgut, 2002; Tzes, Pei-Yuan Peng, & Guthy, 1998; Wikaisuksakul,
2014; Xiao et al., 2010; Yücenur & Demirel, 2011).
Table 2.4: List of ranked (based on citation reports) GA-based clustering techniques from 1995-2015

| Authors | Year | Citations | Rank | Initial cluster number selection | Applications |
| Srikanth et al. | 1995 | 88 | Q2 | Random | Real life data sets |
| Maio et al. | 1995 | 24 | Q2 | Random | Map topologies |
| Murthy and Chowdhury | 1996 | 284 | Q2 | Random | Real life data sets |
| Scheunders | 1997 | 205 | Q1 | Random | Image segmentation |
| Tseng and Yang | 1997 | 46 | B | User defined | Real life data sets |
| Tzes et al. | 1998 | 44 | Q1 | Random | DC-motor friction identification |
| Hanagandi and Nikolaou | 1998 | 43 | Q1 | Random | Real life data sets |
| Cucchiara | 1998 | 31 | Q2 | Random | Image segmentation |
| Cowgill et al. | 1999 | 135 | Q1 | Random | Real life data sets |
| Lozano and Larrañaga | 1999 | 47 | Q2 | Random | Real life data sets |
| Demiriz et al. | 1999 | 281 | B | User defined | Real life data sets |
| Maulik and Bandyopadhyay | 2000 | 1011 | Q1 | Random | Synthetic and real life data sets |
| Chiou and Lan | 2001 | 108 | Q1 | User defined | Synthetic data sets |
| Tseng and Yang | 2001 | 201 | Q1 | Random | Real life data sets |
| Dimopoulos and Mort | 2001 | 75 | Q2 | Random | Cell-formation problems |
| Cheng et al. | 2002 | 99 | Q1 | Random | Data partitioning |
| Bandyopadhyay and Maulik | 2002 | 301 | Q1 | Random | Satellite image of Mumbai |
| Turgut et al. | 2002 | 129 | B | Random | Mobile ad hoc networks |
| Li and Chiao | 2003 | 16 | Q1 | User defined | Image segmentation |
| Garai and Chaudhuri | 2004 | 97 | Q2 | Random | Real life data sets |
| Abolhassani et al. | 2004 | 15 | Q1 | Random | Cellular networks |
| Pakhira et al. | 2005 | 160 | Q1 | User defined | Synthetic and real life data sets |
| Neto et al. | 2006 | 67 | Q2 | Random | Leaf image segmentation |
| Laszlo and Mukherjee | 2006 | 113 | Q1 | User defined | Real life data sets |
| Bandyopadhyay et al. | 2007 | 182 | Q1 | Random | Image segmentation |
| Laszlo and Mukherjee | 2007 | 115 | Q1 | Random | Real life data sets |
| Hong and Kwong | 2008 | 31 | Q1 | Random | Synthetic and real life data sets |
| Qing et al. | 2008 | 62 | Q1 | User defined | Varied-line-spacing holographic gratings |
| Sheng et al. | 2008 | 31 | Q1 | Random | Hand written signature data |
| Chang et al. | 2009 | 88 | Q1 | User defined | Real life data sets |
| Song et al. | 2009 | 75 | Q1 | Random | Real life data sets |
| Deng et al. | 2010 | 31 | Q1 | User defined | Real life data sets |
| Xiao et al. | 2010 | 42 | Q1 | Random | Real life data sets |
| Liu et al. | 2011 | 43 | Q1 | Automatic | Real life data sets |
| Yücenur and Demirel | 2011 | 44 | Q1 | Random | Real life data sets |
| He and Tan | 2012 | 41 | Q2 | Random | Real life data sets |
| Agustín-Blas et al. | 2012 | 52 | Q1 | Random | Synthetic and real life data sets |
| Chang et al. | 2012 | 16 | Q1 | Random | Real life data sets |
| Aalaei et al. | 2013 | 2 | Q1 | User defined | Mazandaran Gas Company |
| Festa | 2013 | 6 | Q3 | User defined | Real life data sets |
| Mungle et al. | 2013 | 15 | Q1 | Random | Highway construction project |
| Wikaisuksakul | 2014 | 13 | Q1 | Random | Synthetic and real life data sets |
| Rahman and Islam | 2014 | 12 | Q1 | Random | Real life data sets |
| Peng et al. | 2014 | 4 | Q1 | Random | Real life data sets |
| Hong et al. | 2015 | 2 | Q1 | Random | Real life data sets |
Steps of Genetic Algorithms
In GA, the roles of the initialization and recombination operators are very well-defined. The
initialization operator identifies the direction of search and the recombination operator generates
new regions for search (Sheikh, Raghuwanshi, & Jaiswal, 2008). GA starts a generation with an
initial population. The initial population is generated with a number of chromosomes, each
made up of a number of genes. For clustering, a chromosome is considered
to be a clustering solution, and a gene of a chromosome is considered to be the center of a
cluster.
Fig. 2.1: Basic steps of Genetic Algorithms (GA)
After initialization, in order to make a selection, an objective/fitness function is applied to
each chromosome to identify the goodness of the chromosome. Biologically inspired
operators (crossover and mutation) are then applied to the population in order to produce a
clustering solution. At the end of each generation, GA applies an elitist operation where the
newly generated populations are compared with the previous population. All these inter-related
parameters and operators influence the performance of a GA (Diaz-Gomez & Hougen, 2007;
Michael Laszlo & Mukherjee, 2007). The processes of selection, crossover, mutation, and elitist
operation are continued for a fixed number of generations or until a termination condition is
satisfied (Sheikh et al., 2008). GA consists of five main phases, namely initial population,
selection, crossover, mutation, and elitist operation (see Fig. 2.1). The main phases of genetic
algorithms will now be briefly explained, as follows.
Step 1: Initial Population
The first step of GA is initial population, the size of which is typically user defined. Each
individual in the initial population is represented as a chromosome. In GA, a chromosome
contains a set of genes, where a gene is a (real or pseudo) record. A gene is regarded as the
center of a cluster; and therefore a chromosome is considered to be a clustering solution. GA
generally contains many iterations/generations. Each generation typically contains a number of
chromosomes that are known as the population of the generation.
Usually, a set of records is randomly selected from a data set to form a chromosome (Md
Anisur Rahman, 2014; Mukhopadhyay & Maulik, 2009; Xiao et al., 2010). The number of
records in a chromosome can vary from 2 to 𝐾∗ + 1, where 𝐾∗ is a soft estimate of the
maximum number of clusters (Md Anisur Rahman, 2014; Mukhopadhyay & Maulik, 2009).
The number of genes in a chromosome can range from 2 to √𝑛, where 𝑛 is the number of records
in a data set (D.-X. Chang et al., 2009; He & Tan, 2012; Y. Liu et al., 2011; Md Anisur Rahman,
2014; Pakhira et al., 2005; Xiao et al., 2010).
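As a concrete illustration of this common initialization scheme, a minimal Python sketch (the function name and the use of numpy are ours) draws a random number of genes between 2 and √𝑛 and selects that many records as cluster centers:

import numpy as np

def random_chromosome(data, rng=None):
    """Form a chromosome: k records drawn from the data set act as genes
    (cluster centers), with k chosen randomly between 2 and sqrt(n)."""
    rng = rng or np.random.default_rng()
    n = len(data)
    k = rng.integers(2, max(3, int(np.sqrt(n)) + 1))  # 2 <= k <= sqrt(n)
    idx = rng.choice(n, size=k, replace=False)        # distinct records as genes
    return data[idx].copy()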
Step 2: Selection
For the next genetic operators of crossover and mutation, GA selects chromosomes based on
their fitness/objective function. Various methods are used to calculate fitness, such as the
Davies-Bouldin (DB) Index (D L Davies & Bouldin, 1979), Sum of the Squared Error (SSE)
(Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005), and Silhouette
Coefficient (Pang-Ning Tan, Michael Steinbach, 2005). These methods will be discussed in
Section 2.7. A proportional selection method [roulette wheel technique (D. Chang et al., 2012;
Maulik & Bandyopadhyay, 2000; Mukhopadhyay & Maulik, 2009)] is used to choose
chromosomes. An existing technique known as AGCUK (Y. Liu et al., 2011) uses noise-based
selection to choose chromosomes.
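A minimal Python sketch of roulette wheel (proportional) selection, assuming strictly positive fitness values:

import numpy as np

def roulette_wheel(population, fitness, rng=None):
    """Pick one chromosome with probability proportional to its fitness."""
    rng = rng or np.random.default_rng()
    p = np.asarray(fitness, dtype=float)
    p = p / p.sum()                       # normalize fitness into probabilities
    i = rng.choice(len(population), p=p)  # spin the wheel
    return population[i]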
Step 3: Crossover
Crossover is an important step in GA, where a pair of chromosomes swaps its segments/genes
and generates a pair of offspring chromosomes. Many types of selection criteria, including the
roulette wheel (D. Chang et al., 2012; Maulik & Bandyopadhyay, 2000; Mukhopadhyay &
Maulik, 2009), rank-based wheel (Agustín-Blas et al., 2012), and random selection (D.-X.
Chang et al., 2009) are used to select a chromosome pair for a crossover operation.
Some GA-based clustering techniques (D. Chang et al., 2012; Maulik & Bandyopadhyay,
2000; Mukhopadhyay & Maulik, 2009) use roulette wheel selection, where the best
chromosome (available in the current population) is chosen as the first chromosome of the pair,
while the second chromosome of the pair is selected using the roulette wheel technique. Agustín-
Blas et al. (2012) use rank-based wheel selection where chromosomes are sorted based on their
quality, and then a pair of chromosomes is chosen based on the rank of a chromosome.
Once the pair of chromosomes are selected, GA then applies a crossover operation on each
pair of chromosomes. There are many approaches for performing crossover on a pair of
chromosomes, such as single-point (Sanghamitra Bandyopadhyay & Maulik, 2002; D. Chang
et al., 2012; Garai & Chaudhuri, 2004; Michael Laszlo & Mukherjee, 2007; Pakhira et al., 2005;
Peng et al., 2014; Rahman & Islam, 2014; Song et al., 2009), multi-point (Agustín-Blas et al.,
2012), arithmetic (Yan et al., 2012), path-based (D.-X. Chang et al., 2009), and heuristic (D.-X.
Chang et al., 2009) crossover.
In single-point crossover, each chromosome of a pair is divided into two parts at a random
point between two genes. The left-hand portion (having one or more genes) of one chromosome
of the pair joins the right-hand portion (having one or more genes) of the other chromosome
to form an offspring chromosome (Rahman & Islam, 2014). In multi-point crossover, each
chromosome of a pair is divided into multiple parts, which are then swapped with each other to
generate new offspring chromosomes. In path-based crossover (D.-X. Chang et al., 2009), two
parent chromosomes create a path between them, from which two points are selected as
offspring chromosomes. Heuristic crossover (D.-X. Chang et al., 2009) uses the fitness values
of the two parents, where the worse parent moves slightly towards the better parent.
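As an illustration, a minimal Python sketch of single-point crossover on a pair of variable-length chromosomes (stored as lists of genes) is given below; drawing an independent random cut point in each parent is an assumption made for illustration.

import random

def single_point_crossover(parent1, parent2):
    """Swap the tails of two chromosomes at random cut points, producing
    two offspring; each parent must have at least two genes."""
    c1 = random.randint(1, len(parent1) - 1)   # cut inside parent 1
    c2 = random.randint(1, len(parent2) - 1)   # cut inside parent 2
    child1 = parent1[:c1] + parent2[c2:]       # left of p1 + right of p2
    child2 = parent2[:c2] + parent1[c1:]       # left of p2 + right of p1
    return child1, child2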
Step 4: Mutation
Mutation operation randomly changes (Agustín-Blas et al., 2012; D.-X. Chang et al., 2009; D.
Chang et al., 2012; Y. Liu et al., 2011; Maulik & Mukhopadhyay, 2010; Rahman & Islam, 2014)
one or more genes (seeds) of a chromosome with a probability equal to the mutation rate. There
are many approaches for mutation such as division and absorption (Agustín-Blas et al., 2012;
D. Chang et al., 2012; Y. Liu et al., 2011), insertion (D. Chang et al., 2012), deletion (D. Chang
et al., 2012), perturbation (D. Chang et al., 2012) and movement (D. Chang et al., 2012). The
division operation divides one cluster of a chromosome into two clusters. The absorption
operation merges two clusters of a chromosome into one cluster.
The perturb mutator (D. Chang et al., 2012) randomly selects a cluster center and changes
the coordinates of the center, while the insert mutator (D. Chang et al., 2012) randomly generates
a center from the data set and inserts it into the chromosome. The delete mutator (D. Chang et
al., 2012) deletes a randomly selected center of the chromosome, while the move mutator (D.
Chang et al., 2012) transfers one record from one cluster to another cluster and re-computes the
cluster center of the chromosome.
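As an illustration, minimal Python sketches of three of these mutators (perturb, insert and delete) are given below, with a chromosome stored as a list of center vectors; the perturbation scale of 0.05 is an illustrative assumption.

import numpy as np

def perturb(chrom, rng, scale=0.05):
    """Perturb mutator: nudge the coordinates of one randomly chosen center."""
    i = rng.integers(len(chrom))
    chrom[i] = chrom[i] + rng.normal(0.0, scale, size=chrom[i].shape)
    return chrom

def insert_gene(chrom, data, rng):
    """Insert mutator: add a randomly chosen record as a new center."""
    chrom.append(data[rng.integers(len(data))].copy())
    return chrom

def delete_gene(chrom, rng):
    """Delete mutator: remove one randomly chosen center (keep at least 2)."""
    if len(chrom) > 2:
        chrom.pop(int(rng.integers(len(chrom))))
    return chrom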
Step 5: Elitist Operation
The elitist operation (D.-X. Chang et al., 2009; D. Chang et al., 2012; Y. Liu et al., 2011;
Rahman & Islam, 2014) preserves the best chromosome obtained so far at any stage (i.e. the
iteration) of the GA, and passes it to the next generation in order to ensure that the best
achievement at that time does not get lost during genetic operations. If the fitness of the worst
chromosome $P_w^i$ of the current ($i$th) generation is less than the fitness of the best
chromosome $P_b$ found so far over all previous generations, then $P_w^i$ is replaced
with $P_b$. Moreover, if the fitness of the best chromosome $P_b^i$ of the $i$th generation is
greater than that of $P_b$, then $P_b$ is replaced by $P_b^i$.
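A minimal Python sketch of this elitist rule, assuming a "higher is better" fitness, is as follows:

def elitist_operation(population, fitness, best_so_far, best_fitness):
    """Replace the worst chromosome of the current generation with the best
    chromosome found so far, then update the best-so-far record."""
    worst = min(range(len(population)), key=lambda i: fitness[i])
    if fitness[worst] < best_fitness:                 # preserve P_b in the population
        population[worst], fitness[worst] = best_so_far, best_fitness
    best = max(range(len(population)), key=lambda i: fitness[i])
    if fitness[best] > best_fitness:                  # update P_b if improved
        best_so_far, best_fitness = population[best], fitness[best]
    return population, fitness, best_so_far, best_fitness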
Termination Condition
Typically, GA-based clustering techniques terminate when they meet the user defined number
of iterations/generations or if there is no improvement in the chromosomes of the current
generation compared to the previous generation (D.-X. Chang et al., 2009; Md Anisur Rahman,
2014). At the end of all generations, GA-based clustering techniques select the best
chromosome as the final clustering solution. The genes of the best chromosome represent the
cluster centers, and records are allocated to their closest seeds to form the final clusters.
However, GA-based clustering techniques have some limitations. Many existing GA-based
clustering techniques (Y. Liu et al., 2011; Maio et al., 1995; Maulik & Bandyopadhyay, 2000;
Xiao et al., 2010) randomly generate the number of genes of a chromosome in the population
initialization phase. They also randomly choose records as genes, instead of carefully choosing
genes of a chromosome; this is significant, given that a carefully selected initial population can
improve final clustering results. However, some existing techniques – such as GenClust – do
carefully select a high-quality initial population, but with a high complexity of 𝑂(𝑛²), where 𝑛 is
the number of records in a data set. Unfortunately, GenClust also requires user input on the
radius values for the clusters in the initial population selection. It can be very difficult
for a user to estimate the set of radius values (i.e. radii).
2.6 Distance Calculation
A data set typically consists of numerical and/or categorical attributes, with different distance
calculations required for each. Therefore, in this thesis, distance calculation is divided into two
categories:
Distance Calculation for numerical attributes; and
Distance Calculation for categorical attributes.
2.6.1 Distance Calculation for Numerical Attributes
Prior to making a distance calculation, the domain values of a numerical attribute are typically
normalized in the range between 0 and 1 in order to weigh each attribute equally, regardless of
domain size. Many different approaches have been proposed to calculate the distance between
two domain values of a numerical attribute. A number of different distance calculation
approaches for numerical attributes are listed as follows:
Minkowski Distance;
Manhattan Distance;
Euclidean Distance;
Chebyshev Distance;
Cosine Distance; and
Jaccard Distance.
Distance calculations for the attributes of a data set are commonly used in various data
mining approaches, including clustering. In this thesis, we use the Euclidean distance metric to
calculate the distance between two domain values of a numerical attribute.
Minkowski Distance
Minkowski distance is a generalized distance metric used in clustering to calculate the distance
between two domain values of a numerical attribute (Han & Kamber, 2006; Md Anisur
Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005; Schulz, 2008; Teknomo, 2015b).
Let us consider that $\tau$ is a positive integer, the number of attributes in a data set is $m$, the $a$th attribute value of the $i$th record is $R_{i,a}$, and $s_{j,a}$ is the $a$th attribute value of the seed of the $j$th cluster. The Minkowski distance ($d_{ij}$) between the $i$th record and the seed of the $j$th cluster can be calculated as follows:

$$d_{ij} = \left( \sum_{a=1}^{m} \left| R_{i,a} - s_{j,a} \right|^{\tau} \right)^{1/\tau} \quad \text{(Eq. 2.2)}$$
Manhattan Distance
Manhattan distance is a special case of the Minkowski distance. It is also called city block
distance, taxicab norm or L1 norm distance (Han & Kamber, 2006; Md Anisur Rahman, 2014;
Pang-Ning Tan, Michael Steinbach, 2005; Schulz, 2008). In the Minkowski distance equation,
setting 𝜏 = 1 gives the Manhattan distance.
Euclidean Distance
Euclidean distance is also a special case of the Minkowski distance. It is frequently used in many
clustering techniques to calculate the distance between two domain values of a numerical
attribute (Han & Kamber, 2006; Md Anisur Rahman, 2014; Pang-Ning Tan, Michael
Steinbach, 2005; Teknomo, 2015b). In the Minkowski distance equation, setting 𝜏 = 2 gives the
Euclidean distance, which is also called the L2 norm distance.
Chebyshev Distance
Chebyshev distance is another special case of the Minkowski distance for numerical attributes
(Han & Kamber, 2006; Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005;
Schulz, 2008; Teknomo, 2015b). In the Minkowski distance equation, letting 𝜏 → ∞ gives the
Chebyshev distance, which is also called the L∞ (maximum) norm distance.
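As the three special cases above are all instances of Eq. 2.2, they can be illustrated with a minimal Python sketch (the sample records are arbitrary normalized values):

import numpy as np

def minkowski(r, s, tau):
    """Minkowski distance of order tau between two normalized records (Eq. 2.2)."""
    return np.sum(np.abs(r - s) ** tau) ** (1.0 / tau)

r = np.array([0.2, 0.4, 0.9])
s = np.array([0.1, 0.7, 0.5])
print(minkowski(r, s, 1))          # tau = 1: Manhattan distance
print(minkowski(r, s, 2))          # tau = 2: Euclidean distance
print(np.max(np.abs(r - s)))       # tau -> infinity: Chebyshev distance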
Cosine Distance
Let us consider, 𝑅𝑖 is the 𝑖𝑡ℎ record of a data set and 𝑠𝑗 is the seed of the 𝑗𝑡ℎ cluster. The cosine
similarity (𝛱) between 𝑅𝑖 and 𝑠𝑗 can be calculated as follows (A. Huang, 2008; Md Anisur
Rahman, 2014):
$$\Pi_{ij} = \frac{R_i \cdot s_j}{|R_i| \times |s_j|} \quad \text{(Eq. 2.3)}$$

The cosine distance ($\varsigma_{ij}$) between $R_i$ and $s_j$ can be calculated as follows:

$$\varsigma_{ij} = 1 - \Pi_{ij} \quad \text{(Eq. 2.4)}$$
Jaccard Distance
Let us consider that $R_i$ is the $i$th record of a data set where $R_i = \{1, 5, 3, 2\}$ and $s_j$ is the seed of the $j$th cluster where $s_j = \{3, 2, 2, 1\}$. The Jaccard coefficient ($F_{ij}$) between $R_i$ and $s_j$ can be calculated as follows (Md Anisur Rahman, 2014; Teknomo, 2015a):

$$F_{ij} = \frac{|R_i \cap s_j|}{|R_i \cup s_j|} = \frac{3}{4} = 0.75 \quad \text{(Eq. 2.5)}$$

The Jaccard distance ($J_{ij}$) between $R_i$ and $s_j$ can be calculated as follows:

$$J_{ij} = 1 - F_{ij} \quad \text{(Eq. 2.6)}$$

The Jaccard distance for the above example is $J_{ij} = 1 - 0.75 = 0.25$.
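A minimal Python sketch of Eqs. 2.3-2.6, reproducing the worked Jaccard example above:

import numpy as np

def cosine_distance(r, s):
    """Eq. 2.4: one minus the cosine similarity of Eq. 2.3."""
    sim = np.dot(r, s) / (np.linalg.norm(r) * np.linalg.norm(s))
    return 1.0 - sim

def jaccard_distance(r, s):
    """Eq. 2.6: one minus the Jaccard coefficient of the two value sets."""
    a, b = set(r), set(s)
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance([1, 5, 3, 2], [3, 2, 2, 1]))   # 1 - 3/4 = 0.25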
2.6.2 Distance Calculation for Categorical Attributes
While it is evident that a number of formulae have been developed for the distance calculation
of numerical attributes, the creation of formulae for the distance calculation of values of
categorical attributes has received somewhat less attention (Md Anisur Rahman, 2014; C. Wang
et al., 2011). Generally, the distance between two domain values of a categorical attribute is
either regarded as zero (if the two values are identical) or one (if the two values are dissimilar)
(Z. Huang, 1997; Ji, Pang, Zhou, Han, & Wang, 2012; Md Anisur Rahman, 2014). However,
considering the distance between two domain values of a categorical attribute as either zero or
one may not be sensible. The distance between two domain values of a categorical attribute can
instead be measured, based on similarity (Islam & Brankovic, 2011; Md Anisur Rahman, 2014).
Typically, the similarity between two domain values of a categorical attribute is calculated based
on their co-appearance (relation) with the domain values of other categorical attributes in the
records of the whole data set (Ganti, Gehrket, & Ramakrishnant, 1999; H. Giggins & Brankovic,
2012; H. P. Giggins, 2009; Md Anisur Rahman, 2014).
An existing technique called VICUS (H. Giggins & Brankovic, 2012; H. P. Giggins, 2009)
calculates the distance between two domain values of a categorical attribute based on their similarity. VICUS
measures the similarity between two domain values of a categorical attribute based on their co-
appearances with the domain values of other categorical attributes. VICUS converts the data set
into a graph where all the domain values of categorical attributes are considered as the vertices
of the graph. It uses the co-appearance of two attribute values for drawing the edges between
the vertices delineating the values.
Let us consider that $S'_{a,b}$ is the similarity between two domain values $a$ and $b$ of a categorical attribute, $d(a)$ is the degree of the attribute value $a$ (i.e. the number of other attribute values that co-appear with $a$ in the whole data set), $e_{ak}$ is the number of edges between two attribute values $a$ and $k$ (i.e. the number of times the two categorical values $a$ and $k$ co-appear in the complete data set), and $l$ is the total number of domain values over all attributes (excluding the values $a$ and $b$). The similarity between the two categorical attribute values $a$ and $b$ can then be computed as follows (H. Giggins & Brankovic, 2012; Md Anisur Rahman, 2014):

$$S'_{a,b} = \frac{\sum_{k=1}^{l} \sqrt{e_{ak} \times e_{bk}}}{\sqrt{d(a) \times d(b)}} \quad \text{(Eq. 2.7)}$$
However, Rahman (2014) advised that if a data set has both the numerical and categorical
attributes, the numerical attribute values can be categorized first, and then the similarity between
two categorical attribute values can be measured based on both the categorical and numerical
(categorized) attribute values.
Similarly, a few other existing techniques (Ahmad & Dey, 2007b; Cost & Salzberg, 1993;
Ganti et al., 1999) calculate the distance between two categorical attribute values based on their co-
appearance with the domain values of other attribute values. However, the similarity between
the domain values of two categorical attributes can be measured, not only based on their co-
appearance, but also on the frequency of the domain values throughout the data set (C. Wang et
al., 2011). The similarity measured based on the frequency of an attribute's values is known as
the intra-coupled attribute value similarity ($\delta_i$), while the similarity measured based on
co-appearance with the domain values of other attributes is known as the inter-coupled attribute
value similarity ($\delta_o$). The overall similarity between two categorical attribute values is
measured as follows:

$$\delta = \delta_i \times \delta_o \quad \text{(Eq. 2.8)}$$
The intra-coupled attribute value similarity (𝛿𝑖) relies on the frequency of an attribute’s
values. If we consider that $a$ and $b$ are the domain values of attribute $A$, and the frequencies of these attribute values are $f_a$ and $f_b$ respectively, then the intra-coupled similarity between $a$ and $b$ can be measured as follows:

$$\delta_i(a, b) = \frac{|f_a| \times |f_b|}{|f_a| + |f_b| + |f_a| \times |f_b|} \quad \text{(Eq. 2.9)}$$
The inter-coupled attribute value similarity (𝛿𝑜) is measured with regard to its co-appearance
with other attribute values. For example, let $\partial$ be the set of domain values of all other attributes, where $X \subseteq \partial$ and $Y = \partial \setminus X$. If $a$ and $b$ are the domain values of attribute $A_1$, and $\tau(X|a)$ denotes the conditional probability of $X$ given $a$, then the inter-coupled similarity between $a$ and $b$ is computed with regard to the other attributes as follows:

$$\delta_o(a, b) = 2 - \min_{X \subseteq \partial} \{2 - \tau(X|a) - \tau(Y|b)\} \quad \text{(Eq. 2.10)}$$
The distance between two domain values of a categorical attribute can then be computed as $1 - \delta$.
2.7 Cluster Evaluation Techniques
To measure the quality of a clustering solution, an evaluation technique is required. Typically,
the quality of a clustering solution is measured using internal cluster evaluation criteria and
external cluster evaluation criteria. Therefore, in this thesis, cluster evaluation techniques are divided
into two categories:
Internal cluster evaluation criteria; and
External cluster evaluation criteria.
2.7.1 Internal Cluster Evaluation Techniques
Internal cluster evaluation criteria are also known as the unsupervised measurement of
clustering quality (Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005), which
allow the goodness of a cluster to be evaluated without any external information such as class
values (labels) of the records. A selection of internal cluster evaluation techniques is listed as
follows:
Sum of Square Error (SSE);
Davies-Bouldin (DB) Index;
Silhouette Coefficient;
Xie-Beni Index; and
Dunn Index.
Sum of Square Error (SSE)
The sum of square error (SSE) measures cluster compactness (Md Anisur Rahman, 2014;
Pang-Ning Tan, Michael Steinbach, 2005). Simple K-means uses SSE as its objective function.
Many GA-based clustering techniques (D.-X. Chang et al., 2009; Michael Laszlo & Mukherjee,
2007) also use SSE as the fitness function. A lower SSE value indicates a better clustering result.
If 𝑘 is the number of clusters, 𝑠𝑗 is the seed of 𝑗𝑡ℎ cluster (𝑐𝑗), and 𝑑𝑖𝑠𝑡 (𝑠𝑗, 𝑥) is the distance
between a record 𝑥 and seed (𝑠𝑗) of cluster 𝑐𝑗, then SSE is calculated as follows:
$$SSE = \sum_{j=1}^{k} \sum_{x \in c_j} dist(s_j, x)^2 \quad \text{(Eq. 2.11)}$$
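A minimal Python (numpy) sketch of Eq. 2.11, where labels[i] gives the cluster index of the i-th record:

import numpy as np

def sse(data, seeds, labels):
    """Eq. 2.11: sum of squared Euclidean distances of each record
    to the seed of its own cluster (lower is better)."""
    diffs = data - seeds[labels]              # each record minus its cluster seed
    return np.sum(np.sum(diffs ** 2, axis=1))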
Davies-Bouldin (DB) Index
The basic premise of the Davies-Bouldin (DB) Index (D. L. Davies & Bouldin, 1979) is to
minimize the intra-cluster distance while maximizing the inter-cluster distance (Agustín-
Blas et al., 2012). The DB index calculates the ratio of the sum of within-cluster scatter to
between-cluster separation (D. L. Davies & Bouldin, 1979; Y. Liu et al., 2011; Md Anisur
Rahman, 2014). Many GA-based clustering techniques use the DB Index (Y. Liu et al., 2011;
Xiao et al., 2010) as a fitness function. If $s_j$ is the seed of the $j$th cluster ($c_j$) then the scatter ($n_{j,q}$) is calculated as follows:

$$n_{j,q} = \left( \frac{1}{|c_j|} \sum_{x \in c_j} \left\| x - s_j \right\|_2^q \right)^{1/q} \quad \text{(Eq. 2.12)}$$
If $s_i$ is the seed of the $i$th cluster ($c_i$) and $s_j$ is the seed of the $j$th cluster ($c_j$), then the distance between them is $d_{ij,t} = \|s_i - s_j\|_t$. The DB Index of $K$ clusters is computed as follows:

$$R_{i,qt} = \max_{j, j \neq i} \left\{ \frac{n_{i,q} + n_{j,q}}{d_{ij,t}} \right\} \quad \text{(Eq. 2.13)}$$

$$DB = \frac{1}{K} \sum_{i=1}^{K} R_{i,qt} \quad \text{(Eq. 2.14)}$$
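A minimal Python sketch of Eqs. 2.12-2.14, using the common choice q = t = 2 (an illustrative assumption); it assumes every cluster has at least one record:

import numpy as np

def db_index(data, seeds, labels):
    """Davies-Bouldin index with q = t = 2: mean over clusters of the worst
    ratio of summed scatters to seed separation (lower is better)."""
    k = len(seeds)
    # Eq. 2.12: per-cluster scatter (root mean squared distance to the seed)
    scatter = np.array([
        np.sqrt(np.mean(np.sum((data[labels == j] - seeds[j]) ** 2, axis=1)))
        for j in range(k)])
    ratios = np.zeros(k)
    for i in range(k):
        for j in range(k):
            if i != j:
                d = np.linalg.norm(seeds[i] - seeds[j])                    # d_ij
                ratios[i] = max(ratios[i], (scatter[i] + scatter[j]) / d)  # Eq. 2.13
    return ratios.mean()                                                   # Eq. 2.14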
Silhouette Coefficient
The Silhouette Coefficient evaluates cluster quality by comparing the distances among the records within a cluster with the distances between those records and the records of other clusters (Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005). Let us consider that $a_i$ is the average distance of the $i$th record of a cluster $c_j$ to all other records belonging to the same cluster $c_j$. For every other cluster $c_{k \neq j}$, the average distance between the $i$th record and all records of that cluster is computed, and $b_i$ is the minimum of these average distances over all other clusters. Then the Silhouette Coefficient ($S_i$) of the $i$th record is calculated as follows:

$$S_i = \frac{b_i - a_i}{\max(a_i, b_i)} \quad \text{(Eq. 2.15)}$$
The Silhouette Coefficient of a cluster 𝑐𝑗 is computed by simply taking the average
coefficients of all records belonging to the cluster 𝑐𝑗. An overall silhouette coefficient of
clustering (i.e. all clusters produced by a technique) can be obtained by computing the average
silhouette coefficient of all clusters 𝑐𝑗 , ∀j. The value of the silhouette coefficient can vary from
-1 to +1. A higher value of silhouette coefficient represents a better quality of clustering.
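For illustration, scikit-learn exposes this measure directly; a minimal usage sketch on a toy data set (the values are arbitrary):

import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0.1, 0.2], [0.15, 0.22], [0.9, 0.8], [0.88, 0.85]])
labels = np.array([0, 0, 1, 1])
print(silhouette_score(X, labels))   # close to +1 for well-separated clusters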
Xie-Beni Index
The Xie-Beni Index (XB) is a function of the ratio of the variation to the separation of the clusters (Maulik & Mukhopadhyay, 2010; Md Anisur Rahman, 2014; Xie & Beni, 1991). A lower XB Index value indicates a better quality of clustering. Let us consider that a data set $D$ has $n$ records, the fuzzy membership degree of the $i$th record in the $j$th cluster is $\mu_{j,i}$, the seed of the $j$th cluster is $s_j$, $R_i$ is the $i$th record, and $\delta(s_j, R_i)$ is the distance between the record $R_i$ and the seed $s_j$ of the $j$th cluster. The variation of the clusters can be calculated as follows:

$$\vartheta = \sum_{j=1}^{k} \sum_{i=1}^{n} \mu_{j,i}^2 \, \delta^2(s_j, R_i) \quad \text{(Eq. 2.16)}$$
If $\delta(s_j, s_l)$ is the distance between the $j$th and $l$th seeds of the clusters, then the separation can be calculated as follows:

$$\varphi = \min_{j \neq l} \{\delta^2(s_j, s_l)\} \quad \text{(Eq. 2.17)}$$

The XB Index of the clusters can be calculated as follows:

$$XB = \frac{\vartheta}{n \varphi} \quad \text{(Eq. 2.18)}$$
Dunn Index
The Dunn Index (DI) is a function of the ratio of the minimal between-cluster (inter-cluster) distance to the maximal within-cluster (intra-cluster) distance (J. C. Dunn, 1974; Y. Liu et al., 2011; Peng et al., 2014). A higher DI value indicates a better clustering result. If $\Delta_{\min}$ is the minimum between-cluster distance and $\Delta_{\max}$ is the maximum within-cluster distance, then DI can be calculated as follows:

$$DI = \frac{\Delta_{\min}}{\Delta_{\max}} \quad \text{(Eq. 2.19)}$$
2.7.2 External Cluster Evaluation Techniques
External cluster evaluation criteria are also called the supervised measures of clusters (Md
Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005); they allow the goodness of
a cluster to be evaluated based on external information such as the class values (labels) of the
records. A selection of external cluster evaluation techniques is listed as follows:
F-measure;
Purity; and
Entropy.
F-measure
F-measure is a combination of precision and recall (K.-T. Chuang & Chen, 2004; Kashef &
Kamel, 2009; Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005). Let us
consider that $R_{i,j}$ is the number of records belonging to a cluster $C_i$ that have the class value $j$, and $R_i$ is the number of records in $C_i$. The precision $\Upsilon(i,j)$ of cluster $C_i$ with regard to the class value $j$ can be computed as follows (Kashef & Kamel, 2009; Md Anisur Rahman, 2014):

$$\Upsilon(i, j) = \frac{R_{i,j}}{R_i} \quad \text{(Eq. 2.20)}$$
If $R_j$ is the number of records having the class value $j$ in the whole data set, then the recall $\delta(i,j)$ of cluster $C_i$ with respect to the class value $j$ can be calculated as follows:

$$\delta(i, j) = \frac{R_{i,j}}{R_j} \quad \text{(Eq. 2.21)}$$
The F-measure $FM(i,j)$ of cluster $C_i$ with respect to the class value $j$ can be computed as follows:

$$FM(i, j) = \frac{(\beta^2 + 1) \times \Upsilon(i,j) \times \delta(i,j)}{\beta^2 \times \Upsilon(i,j) + \delta(i,j)} \quad \text{(Eq. 2.22)}$$

Generally, the value of $\beta$ is 1 (Md Anisur Rahman, 2014). A higher F-measure value represents a better clustering result, with the value of the F-measure ranging between 0 and 1.
Purity
The purity of a cluster is measured to evaluate the correctness of a cluster with respect to the class values. If $j$ is a class value, $\tau_{i,j}$ is the probability that a record of the $i$th cluster has the class value $j$. The purity $\vartheta_i$ of the $i$th cluster $C_i$ can then be calculated as follows (Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005):

$$\vartheta_i = \max_j(\tau_{i,j}) \quad \text{(Eq. 2.23)}$$

If $R_i$ is the number of records in the $i$th cluster, and $R_{i,j}$ is the number of records in the $i$th cluster which have the class value $j$, then the probability $\tau_{i,j}$ can be calculated as follows:

$$\tau_{i,j} = \frac{R_{i,j}}{R_i} \quad \text{(Eq. 2.24)}$$
If the total number of records in the data set is $n$, then the overall purity ($PT$) for $k$ clusters can be computed as follows:

$$PT = \sum_{i=1}^{k} \frac{R_i}{n} \, \vartheta_i \quad \text{(Eq. 2.25)}$$
A higher value of purity represents better clustering results. The value of purity varies
between 0 and 1.
Entropy
In a similar way to purity, the entropy of a cluster is measured to evaluate the correctness of a
cluster with respect to the class value (Md Anisur Rahman, 2014; Pang-Ning Tan, Michael
Steinbach, 2005). The entropy ϱi of the 𝑖𝑡ℎ cluster can be calculated as follows (Md Anisur
Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005):
$$\varrho_i = - \sum_{j=1}^{d} \tau_{i,j} \log_2 \tau_{i,j} \quad \text{(Eq. 2.26)}$$
Here, $\tau_{i,j}$ is calculated using Eq. 2.24 and $d$ is the domain size of the class attribute.
The overall entropy ($eT$) for $k$ clusters can be calculated as follows:

$$eT = \sum_{i=1}^{k} \frac{R_i}{n} \, \varrho_i \quad \text{(Eq. 2.27)}$$
A lower value of entropy represents better clustering quality. The value of entropy varies
between 0 and $\log_2 d$.
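A minimal Python sketch of purity (Eqs. 2.23-2.25) and entropy (Eqs. 2.26-2.27), given the cluster label and the class label of each record:

import numpy as np
from collections import Counter

def purity_and_entropy(clusters, classes):
    """Overall purity (higher is better) and entropy (lower is better)."""
    n, pt, et = len(classes), 0.0, 0.0
    for c in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == c]
        tau = np.array(list(Counter(members).values())) / len(members)   # Eq. 2.24
        pt += len(members) / n * tau.max()                    # Eqs. 2.23 and 2.25
        et += len(members) / n * -(tau * np.log2(tau)).sum()  # Eqs. 2.26 and 2.27
    return pt, et

print(purity_and_entropy([0, 0, 1, 1, 1], ['a', 'a', 'a', 'b', 'b']))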
2.8 Summary
In this chapter, we first introduced a data set with its notations and definitions, and then offered
a short introduction to data mining, machine learning, clustering, applications and categories of
clustering, different types of distance calculations, and cluster evaluation techniques. We also
discussed the strengths and weaknesses of currently used clustering techniques, with Table 2.5
providing a summary of these.
Table 2.5: Advantages and limitations of currently used clustering techniques

Partition-based Clustering Techniques (Ahmad & Dey, 2007a; Han & Kamber, 2006; Pang-Ning Tan, Michael Steinbach, 2005)
Advantages: Time complexity is comparatively low, 𝑂(𝑛). Capable of separating overlapping clusters.
Limitations: Require a user to define various inputs, including the number of clusters 𝑘, in advance. Most techniques select the initial seeds randomly. The objective function tends to get stuck at local optima. Most techniques cannot process data sets having both categorical and numerical attributes.

Hierarchical Clustering Techniques (Han & Kamber, 2006; Pang-Ning Tan, Michael Steinbach, 2005)
Advantages: Do not require any user input on the number of clusters 𝑘. Autonomous to the initial conditions.
Limitations: Time complexity is comparatively high, 𝑂(𝑛³). May fail to separate overlapping clusters.

Density-based Clustering Techniques
Advantages: Select high-quality and densest initial seeds from the data set to produce clusters.
Limitations: Require a user to define various inputs, including the number of radii.

Graph-based Clustering Techniques (Z. Chen & Ji, 2010; Pang-Ning Tan, Michael Steinbach, 2005; Schaeffer, 2007; Zhong et al., 2010)
Advantages: Capable of detecting clusters of arbitrary shape.
Limitations: Require a similarity function to be selected from the wide range of available similarity functions. Require a similarity graph algorithm to be selected.

Grid-based Clustering Techniques (Han & Kamber, 2006; W.M. Ma & Chow, 2004; W. Wang et al., 1997)
Advantages: Time complexity is comparatively low, 𝑂(𝑛).
Limitations: Require huge amounts of memory if the number of cells is high.

Spectral Clustering Techniques (X. Hong et al., 2014; Matthias & Juri, 2009; Nascimento & de Carvalho, 2011; von Luxburg, 2007)
Advantages: Capable of detecting clusters of arbitrary shape.
Limitations: Require a user to define various inputs, including the number of clusters 𝑘, in advance. Require a similarity function to be selected from the wide range of available similarity functions. Require a similarity graph algorithm to be selected.

Model-based Clustering Techniques (Han & Kamber, 2006; Pang-Ning Tan, Michael Steinbach, 2005; Roy & Parui, 2014)
Advantages: Optimise the fit between the data and a mathematical model.
Limitations: Consider the probability distribution to be the same for each cluster. May get stuck at local optima.

Ant Colony Algorithm-based Clustering Techniques (İnkaya et al., 2015; Korürek & Nizam, 2008; Ramos et al., 2009; Shelokar et al., 2004)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process.
Limitations: Some techniques require a user to define various inputs in advance, including the number of clusters 𝑘. Some techniques generate the number of clusters randomly to form an ant.

Bee Colony Algorithm-based Clustering Techniques (Banharnsakun et al., 2013; Karaboga & Ozturk, 2011; Kuo et al., 2014; Yan et al., 2012; C. Zhang et al., 2010)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process.
Limitations: Some techniques require a user to define various inputs in advance, including the number of clusters 𝑘. Some techniques generate the number of clusters randomly to form a bee.

Particle Swarm Optimization (PSO) Algorithm-based Clustering Techniques (Cagnina et al., 2014; L.-Y. Chuang et al., 2011; Cura, 2012; Kuo et al., 2012)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process.
Limitations: Randomly generate the number of clusters to form a particle.

Black hole Algorithm-based Clustering Techniques (Hatamlou, 2013)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process.
Limitations: Randomly generate the number of clusters to form a star. Replacements for the deleted stars are generated randomly.

Firefly Algorithm-based Clustering Techniques (Abshouri & Bakhtiary, 2012; Hassanzadeh & Meybodi, 2012; Senthilnath et al., 2011)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process.
Limitations: Randomly generate the number of clusters to form a firefly.

Genetic Algorithm-based Clustering Techniques (Agustín-Blas et al., 2012; D.-X. Chang et al., 2009; D. Chang et al., 2012; He & Tan, 2012; Y. Liu et al., 2011; Rahman & Islam, 2014)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process. Compared to other evolutionary algorithm-based clustering techniques such as PSO, ant colony, bee colony, black hole and firefly algorithms, GA has more components to improve clustering quality.
Limitations: Most techniques randomly generate the number of genes of a chromosome in population initialization. Records are also randomly chosen as genes. Some existing techniques select a high-quality initial population, but with a high complexity of 𝑂(𝑛²).
Chapter 3
High-Quality Initial Population in a GA for High-
Quality Clustering with Low Complexity
3.1 Introduction
An introduction to different types of clustering techniques and the significance of the clustering
techniques are presented in Chapter 2. It is clear from the literature that clustering is a well-
known and extremely important technique in the area of data mining, and therefore a very
active research area. However, the existing clustering techniques have some limitations and
therefore, there is room for further improvement. The main focus of this chapter is to make some
progress towards achieving our first research goal (see Chapter 1).
There are many approaches for clustering (Arthur & Vassilvitskii, 2007; D.-X. Chang et al.,
2009; D. Chang et al., 2012; Y. Liu et al., 2011; Lloyd, 1982; Rahman & Islam, 2014).

(During the PhD candidature, we have published the following paper based on this chapter: Beg, A. H., and Islam, M. Z. (2015): Clustering by Genetic Algorithm - High Quality Chromosome Selection for Initial Population, In Proc. of the 10th IEEE Conference on Industrial Electronics and Applications (ICIEA 2015), Auckland, New Zealand, 15-17 June, 2015, pp. 129-134. ERA 2010 Rank A.)

K-means is one of the most popular techniques for clustering. K-means requires a user (data miner)
to define the number of clusters (𝑘) in advance (Lloyd, 1982). Based on the user defined number
of clusters 𝑘, it then randomly selects 𝑘 records as initial seeds from the data set and each record
of the data set is then allocated to its closest seed in order to form clusters.
While K-means is popular for its simplicity, it has a number of well-known drawbacks (D.-
X. Chang et al., 2009; Jain, 2010; Mohd et al., 2012; Rahman & Islam, 2014). One of the main
drawbacks of K-means is its requirement of the user defined number of clusters (𝑘) prior to
clustering. The appropriate number of clusters has influence on the quality of a final clustering
solution (Kuo et al., 2012). It is difficult for a user (data miner) to estimate the appropriate
number of clusters in advance. Another drawback of K-means is that it has a tendency to get
stuck at local optima. Moreover, the random selection of the initial seeds is also considered to
be a major drawback since it heavily influences the final clustering quality (Arthur &
Vassilvitskii, 2007). A recent technique, K-means++ (Arthur & Vassilvitskii, 2007), addresses
this last drawback of K-means. However, it still suffers from the other drawbacks of K-means.
The use of a GA in clustering can help to avoid the local optima issue of K-means (Agustín-
Blas et al., 2012; D.-X. Chang et al., 2009; D. Chang et al., 2012; He & Tan, 2012; Y. Liu et al.,
2011; Peng et al., 2014; Rahman & Islam, 2014). Typically, a GA-based technique does not
require any user input on the number of clusters 𝑘.
However, GA-based clustering techniques have some limitations. Many existing techniques
(Y. Liu et al., 2011; Maio et al., 1995; Maulik & Bandyopadhyay, 2000; Xiao et al., 2010)
generate the number of genes of a chromosome randomly in the population initialization phase.
They also randomly choose records as genes, instead of carefully choosing genes of a
chromosome. Careful selection of genes can create an initial population containing high-quality
chromosomes. A high-quality initial population typically increases the possibility of obtaining a
good clustering solution at the end of the genetic processing (Diaz-Gomez & Hougen, 2007;
Goldberg et al., 1991; Rahman & Islam, 2014).
An existing technique called GenClust (Rahman & Islam, 2014) finds a high-quality initial
population and thereby obtains good clustering solutions. However, its initial population
selection process is expensive, with a complexity of 𝑂(𝑛²), where 𝑛 is the number of records
in a data set. Moreover, GenClust requires user input on the radius values for the clusters
during the initial population selection. It can be very difficult for a user to guess the set of
radius values (i.e. radii).
In this chapter, we propose a clustering technique called DeRanClust that produces high-
quality initial seeds through a deterministic phase and a random phase, with a low complexity
of 𝑂(𝑛). DeRanClust chooses the number of clusters automatically for the chromosomes in the
initial population; therefore, it does not require any user input for the number of clusters 𝑘.
DeRanClust also reduces the chance of getting stuck at local optima by using our new genetic
algorithm for high-quality chromosome selection.
We implement DeRanClust and compare its performance with AGCUK (Y. Liu et al., 2011),
GAGR (D.-X. Chang et al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii,
2007) and GenClust (Rahman & Islam, 2014). We compare the performance of the techniques
through two cluster evaluation criteria, namely Silhouette Coefficient (Agustín-Blas et al., 2012;
Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979) using
five real-life data sets that we obtain from the UCI machine learning repository (M. Lichman,
2013). We also carry out a thorough experimentation to investigate the usefulness of the
components used in DeRanClust.
The contributions of the chapter are presented as follows:
Proposing DeRanClust, which produces high-quality clustering solutions with low
complexity and requires no user input.
The evaluation of DeRanClust by comparing it with existing techniques.
The organization of the chapter is as follows: in Section 3.2, we present the main steps of
DeRanClust; the experimental results and discussion are presented in Section 3.3; and the
summary of the chapter is presented in Section 3.4.
3.2 DeRanClust: Deterministic and Random Selection for the Initial Population in a GA-
Based Clustering Technique
We now introduce the main steps of the proposed technique as follows and explain each of them
in detail. Of the following steps, Step 2 (Population Initialization) is the novel contribution
of this chapter.
BEGIN
Step 1: Normalization
Step 2: Population Initialization
DO: t = 1 to I /* I = 50; I is the user defined number of iterations */
Step 3: Noise-Based Selection Operation
Step 4: Crossover Operation
Step 5: Twin Removal
Step 6: Mutation Operation
Step 7: Elitist Operation
END
END
Step 1: Normalization
DeRanClust first normalizes a data set 𝐷 in order to weigh each attribute equally regardless of
their domain sizes (Rahman & Islam, 2014). If 𝐷 has an attribute (such as salary) with huge
domain sizes and an attribute (such as age) with relatively smaller domain sizes, then the
attribute with huge domain sizes has higher impacts (than the attribute with lower domain sizes)
60
on the distance calculations. Therefore, the different domain sizes of the attributes may give
different weights/importance to different attributes.

The normalization avoids this undesirable situation and allows each attribute to have the
same level of impact. It brings the domain range of each numerical attribute between 0 and 1. It
generates a normalized attribute value $X_N = \frac{X_{Max} - \mu}{X_{Max} - X_{Min}}$, where $X_{Max}$ is the maximum, $X_{Min}$ is the minimum, and $\mu$ is the average domain value of the numerical attribute. The distance
between records is computed using the Euclidean distance metric (Han & Kamber, 2006;
Schulz, 2008; Teknomo, 2015a). Hence, the distance between two records for a numerical
attribute can vary between 0 and 1.
Step 2: Population Initialization
This is a new/original contribution of DeRanClust that selects high-quality chromosomes in the
initial population through two phases: a deterministic phase and a random phase.
DeRanClust selects the first 50% of the chromosomes through a deterministic selection phase
and the remaining 50% chromosomes through a random selection phase, for the initial
population. For the deterministic phase, it uses a set of predefined numbers of genes/clusters 𝑘.
The default set of predefined 𝑘 values is {2, 3, …, 10}, where the size of the set is nine. DeRanClust
uses each element of the set as the number of clusters (𝑘) for K-means and thus produces a
clustering solution, i.e. a chromosome. For each element it runs K-means five times and thus
produces five chromosomes. That is, it produces altogether 5 × 9 = 45 chromosomes in the
deterministic phase (see Fig. 3.2). This gives us the opportunity to apply K-means many
times, which also increases the possibility of getting a good-quality chromosome (i.e. clustering
solution) at the beginning (the very first iteration) of the genetic algorithm.
Let us assume that we have a two-dimensional data set having 90 records as shown in Fig.
3.1(a). To generate a chromosome, at first, it takes an element from the predefined set as an
61
input. In the first iteration of K-means, the given number of initial seeds/genes is selected
randomly from the data set. Fig. 3.1(b) shows the initial seeds/genes that are generated
randomly. An initial seed/gene is a record of the data set where each record is a set of attribute
values. Each record is then allocated to its closest seed/gene.
As usual, K-means computes a set of new seeds based on the records allocated to each seed.
The process continues until it settles to a set of seeds/genes that do not change any further. The
process of K-means is explained in Fig. 3.1. Fig. 3.1(e) has three seeds/genes, and the three
genes together form a chromosome.
Due to the use of K-means, DeRanClust expects to get high-quality chromosomes for a given
𝑘 value. Since DeRanClust does not know the actual 𝑘 of a data set, it explores numbers from 2
to 10. Typically, the 𝑘 value for a data set varies between 2 and 10, which we found through an
empirical analysis of the data sets in the UCI machine learning repository (M. Lichman, 2013).
Fig. 3.1: The formation of a chromosome through K-means. Panels: (a) records; (b) initial seeds; (c) iteration 1; (d) iteration 2; (e) final iteration.
Fig. 3.2: Flowchart of the population initialization. For each seed number 𝑖 = 2, 3, …, 10, K-means is applied five times to generate deterministic chromosomes, whose fitness values are calculated and which are inserted into 𝑃𝑑; the chromosomes in 𝑃𝑑 are then sorted by fitness and the top |𝑃|/2 are kept. Next, ten random chromosomes are generated, each with a seed number drawn randomly from the range 2 to √𝑛 (where 𝑛 is the number of records in the data set), and inserted into 𝑃𝑟. Finally, 𝑃𝑠 ← 𝑃𝑑 ∪ 𝑃𝑟, the fitness values of the chromosomes in 𝑃𝑠 are sorted, and the chromosome 𝑃𝑏 with the maximum fitness is found.
In the UCI repository, there are 157 data sets for which the class sizes (i.e. the domain sizes
of the class attributes) have been reported. The domain size of the class attribute of a data set is
indicative of the number of clusters in the data set. The mean and standard deviation of the class
sizes of the data sets are 5.36 and 5.49, respectively. That is, the number of clusters of a data set
typically varies between 2 and 10. Hence, DeRanClust uses the set of 𝑘 values {2, 3, …, 10} in the
deterministic phase.
However, the actual 𝑘 values in some data sets can be more than 10. In order to handle such
a situation DeRanClust uses the random phase where it generates 10 chromosomes (see Fig.
3.2). For each chromosome, it randomly generates the 𝑘 value between 2 and √𝑛 (𝑛 is the
number of records in a data set) and then randomly picks 𝑘 records to form 𝑘 genes of the
chromosome. DeRanClust by default uses 20 chromosomes in the initial population of a
generation. Therefore, it chooses the best 10 chromosomes from the 45 chromosomes generated
in the deterministic phase, and the 10 chromosomes from the random phase. While the use of
K-means helps to get high-quality chromosomes the use of the random approach helps to
explore the solution space through its randomness.
In this chapter, we prepare |𝑃| chromosomes through the population initialization process
(see Fig. 3.2 and Step 1 of Algorithm 3.1). We get |𝑃|/2 chromosomes from the deterministic
phase and |𝑃|/2 chromosomes from the random phase of the selection process. The value of
|𝑃| in the proposed technique is set to 20. Once we get the set of initial chromosomes, we then
compute the fitness of each chromosome and preserve the best chromosome 𝑃𝑏 for the elitist
operation. The proposed technique calculates the fitness of each chromosome using the
Davies-Bouldin (DB) index (D. L. Davies & Bouldin, 1979), where a small DB index value
represents a good clustering result. The fitness of a chromosome is calculated as 1/DB.
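A minimal Python sketch of this two-phase initialization is given below; for brevity it relies on scikit-learn's KMeans and its Davies-Bouldin score, and the function name is ours. It assumes data is a numpy array of normalized records.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def initial_population(data, rng=None):
    """Deterministic phase: K-means for k = 2..10, five runs each (45
    chromosomes); keep the best 10 by fitness = 1/DB. Random phase: 10
    chromosomes with random k in [2, sqrt(n)] and random records as genes."""
    rng = rng or np.random.default_rng()
    deterministic = []
    for k in range(2, 11):
        for _ in range(5):
            km = KMeans(n_clusters=k, init='random', n_init=1).fit(data)
            fit = 1.0 / davies_bouldin_score(data, km.labels_)
            deterministic.append((fit, km.cluster_centers_))
    deterministic.sort(key=lambda t: t[0], reverse=True)
    population = [c for _, c in deterministic[:10]]       # top 10 deterministic
    n = len(data)
    for _ in range(10):                                   # 10 random chromosomes
        k = rng.integers(2, int(np.sqrt(n)) + 1)
        population.append(data[rng.choice(n, size=k, replace=False)].copy())
    return population                                     # |P| = 20 chromosomes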
Step 3: Noise-based Selection Operation
In this chapter, the noise-based selection operation is used in order to select the chromosomes
for the subsequent GA operations by comparing two generations. For example, if we have
twenty chromosomes $P_1^i, P_2^i, \ldots, P_{20}^i$ in the current ($i$th) generation and
twenty chromosomes $P_1^{i-1}, P_2^{i-1}, \ldots, P_{20}^{i-1}$ in the previous ($(i-1)$th)
generation, then to select the chromosomes for the next GA operations, such as crossover and
mutation, a pairwise comparison (i.e. $P_j^i$ and $P_j^{i-1}$ are compared, $\forall j$) is
carried out between the current and previous generations (see Step 2 of Algorithm 3.1).
In the proposed technique, we aim to introduce some randomness that sometimes allows the
worse chromosome of a pair to be selected, instead of always selecting the better chromosome
of the pair, in order to increase the diversity of the population. To achieve this goal, we use the
noising selection approach of AGCUK (Y. Liu et al., 2011). In the noise-based selection
approach, we add some noise to the fitness values of the current generation and compare them
with those of the previous generation. The noise value is a randomly generated real number.
We set the noise value to be high at the beginning of the iterations, and the value decreases as
the iterations progress. As a result, chromosomes with low fitness values get a chance to be
selected in the early iterations, whereas in the later iterations they have less chance of being
selected.
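A minimal Python sketch of this pairwise comparison is given below; the linearly decaying noise schedule is an illustrative assumption rather than the exact schedule of AGCUK.

import numpy as np

def noise_based_selection(prev_pop, prev_fit, cur_pop, cur_fit,
                          t, max_iter, rng=None):
    """Pairwise comparison of generations t-1 and t; the added noise shrinks
    as t grows, so weaker chromosomes survive mostly in early iterations."""
    rng = rng or np.random.default_rng()
    selected, fits = [], []
    scale = 1.0 - t / max_iter                 # noise decays over iterations
    for j in range(len(cur_pop)):
        noise = rng.random() * scale           # random noise in [0, scale)
        if prev_fit[j] - cur_fit[j] + noise > 0:
            selected.append(prev_pop[j]); fits.append(prev_fit[j])
        else:
            selected.append(cur_pop[j]); fits.append(cur_fit[j])
    return selected, fits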
Step 4: Crossover Operation
In the crossover operation, two parent chromosomes swap their segments and generate two new
offspring chromosomes (Agustín-Blas et al., 2012; D.-X. Chang et al., 2009; D. Chang et al.,
2012), as shown in Fig. 3.3. Fig. 3.3 shows that Chromosome 1 has three genes (Gene 11, Gene
12 and Gene 13) and Chromosome 2 has four genes (Gene 21, Gene 22, Gene 23, and Gene 24).
The two parent chromosomes swap their genes and generate two offspring chromosomes:
Offspring 1 and Offspring 2 (see Fig. 3.3). To select the pair of chromosomes, we use the
roulette wheel selection approach (D. Chang et al., 2012; Maulik & Bandyopadhyay, 2000;
Mukhopadhyay & Maulik, 2009b). In this approach, the first chromosome of the pair is the
chromosome with the highest fitness value, and the other chromosome of the pair is selected
using roulette wheel selection. Peng et al. (2014) experimentally demonstrate that single-point
crossover performs better than multi-point crossover. Therefore, in this study we use
single-point crossover to perform a crossover between two parent chromosomes.
Fig. 3.3: Single point crossover between a pair of chromosomes
Step 5: Twin Removal
Two identical genes can somehow be generated in a chromosome (Rahman & Islam, 2014).
Therefore, we use the twin removal approach (Rahman & Islam, 2014) to remove/change the
identical genes. If the length of a chromosome is more than two, then while there are two
identical genes we delete one of them; thus the length of the chromosome decreases by one
each time. If the length of a chromosome is two and both genes are identical, then we
randomly change one of the two identical genes in order to make sure that the genes are not
identical.
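A minimal Python sketch of this twin removal step, with a chromosome stored as a list of gene vectors, is as follows (drawing the replacement gene as a random record from the data set is our illustrative choice):

import numpy as np

def twin_removal(chrom, data, rng=None):
    """Remove duplicated genes; a chromosome of length two with identical
    genes has one gene randomly replaced instead of deleted."""
    rng = rng or np.random.default_rng()
    if len(chrom) == 2 and np.allclose(chrom[0], chrom[1]):
        chrom[1] = data[rng.integers(len(data))].copy()   # re-draw one twin
        return chrom
    unique = []
    for g in chrom:                                       # keep first copy only
        if not any(np.allclose(g, h) for h in unique):
            unique.append(g)
    return unique if len(unique) >= 2 else chrom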
Algorithm 3.1: DeRanClust

Input: A data set D having n records and |A| attributes, where A is the set of attributes
Output: A set of clusters C

Require:
  Ps ← ∅ /* Ps is the set of initial population (20 chromosomes), initially empty */
  Po ← ∅ /* Po is the set of offspring chromosomes, initially empty */
  Pm ← ∅ /* Pm is the set of mutated chromosomes, initially empty */
  I ← 50 /* user defined number of iterations/generations, default value 50 */
  D′ ← Normalize(D) /* normalize each numerical attribute of the data set */
  Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
  Pr ← ∅ /* Pr is the set of random chromosomes (10 chromosomes), initially empty */
end

Step 1: /* Population Initialization */
  Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the numbers of initial seeds are chosen deterministically */
  Pd ← SelectDeterministicChromosomes(Pd) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
  Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the numbers of initial seeds and the seeds themselves are chosen randomly */
  Ps ← Ps ∪ (Pr ∪ Pd) /* insert Pr and Pd into Ps */
  Pb ← FindBestChromosome(Ps) /* Pb is the chromosome with the maximum fitness in Ps */
end

for t = 1 to I do /* default I = 50; t is the iteration counter */
  Step 2: /* Noise-Based Selection Operation */
    if t > 1 then
      Fs ← CalculateFitness(Ps) /* Fs = {F_1^t, F_2^t, …, F_|Ps|^t} is the set of fitness values of the chromosomes in Ps of the t-th generation */
      for j = 1 to |Ps| do
        if F_j^{t-1} − F_j^t + noise > 0 then
          Ps ← Ps ∪ P_j^{t-1} /* select the chromosome P_j of the (t−1)th generation */
        else
          Ps ← Ps ∪ P_j^t /* select the chromosome P_j of the t-th generation */
        end
      end
    end
  end

  Step 3: /* Crossover Operation */
    while |Ps| ≥ 2 do
      p ← SelectChromosomePair(Ps) /* select a pair p = {p1, p2} from Ps using the roulette wheel */
      O ← SinglePointCrossover(p) /* crossover between p1 and p2 generates two offspring O = {O1, O2} */
      Po ← Po ∪ O /* insert the offspring O into Po */
      Ps ← Ps − p /* remove p = {p1, p2} from Ps */
    end
  end

  Step 4: /* Twin Removal */
    Ps ← TwinRemoval(Po) /* if the length of a chromosome is > 2 and two genes are identical, delete one of them; if the length is 2 and the two genes are identical, change one of them */
  end

  Step 5: /* Mutation Operation */
    Fs ← CalculateFitness(Ps) /* calculate the fitness of every chromosome in Ps */
    Pb ← FindChromosomeHavingMaxFitness(Ps) /* Pb is the chromosome with the maximum fitness in Ps */
    Pv ← DivisionAndAbsorptionOperation(Pb) /* perform division and absorption on Pb to get the mutated chromosome Pv */
    Pm ← Pm ∪ Pv /* insert Pv into Pm */
    Ps ← Ps − Pb /* remove Pb from Ps */
    for i = 1 to |Ps| do
      Pv ← DivisionOrAbsorptionOperation(Pi) /* randomly apply either division or absorption on Pi to get a mutated chromosome Pv */
      Pm ← Pm ∪ Pv /* insert Pv into Pm */
    end
  end

  Step 6: /* Elitist Operation */
    Pb ← ElitistOperation(Pm, Pb) /* apply the elitist operation on Pm and Pb and find the best chromosome Pb */
  end
end

C ← C ∪ Pb /* insert Pb into C */
Return C
Let us consider a chromosome $P_j$ that has two genes $g_{ji}$ and $g_{jk}$. The two genes $g_{ji}$ and $g_{jk}$ are identical when the distance between $g_{ji}$ and $g_{jk}$ is zero. To change the identical genes, we randomly select an attribute (say the $A$th attribute) of $g_{ji}$ whose value is $x$, and we replace $x$ with a random number (i.e. a randomly generated real number within the range between 0 and 1) until the genes are no longer identical.
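The twin removal step can be sketched in Python as follows, where a gene is a list of normalized attribute values and two genes are identical when their distance is zero; the structure of the loop is illustrative rather than the exact thesis implementation.

import random

def remove_twins(chromosome):
    # Delete or perturb identical genes (seeds) in a chromosome.
    def identical(g1, g2):
        return all(a == b for a, b in zip(g1, g2))  # zero distance

    genes = [list(g) for g in chromosome]
    i = 0
    while i < len(genes):
        j = i + 1
        while j < len(genes):
            if identical(genes[i], genes[j]):
                if len(genes) > 2:
                    del genes[j]  # drop one twin; the length shrinks by one
                    continue
                # length is exactly two: re-randomize one attribute until the
                # genes are no longer identical (data is normalized to [0, 1])
                while identical(genes[i], genes[j]):
                    a = random.randrange(len(genes[j]))
                    genes[j][a] = random.random()
            else:
                j += 1
        i += 1
    return genes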
Step 6: Mutation Operation
The main objective of the mutation operation is to arbitrarily change the genes of a chromosome in order to explore different solutions. To perform the mutation operation, we use the division and absorption approach of AGCUK (Y. Liu et al., 2011). In this approach, we divide the chromosomes that we obtain from the crossover operation into two parts: the best and the others.
The chromosome that has the highest fitness value is considered the best. For the best chromosome, we apply both the division and absorption operations. For the rest of the chromosomes, we randomly apply either division or absorption (see Step 5 in Algorithm 3.1). For the division operation, we find the sparsest cluster of the selected chromosome and divide it into two clusters by applying K-means with $k = 2$. For the absorption operation, we find the closest clusters of the selected chromosome and merge them into one cluster. The two clusters that have the minimum seed-to-seed distance are considered the closest.
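The division and absorption operations can be sketched as follows. The sketch assumes records are assigned to their nearest seed, that the "sparsest" cluster is the one with the largest average distance of its records to its seed (with at least two records in it), and that the merged cluster is represented by the midpoint of the two closest seeds; these measures and the use of scikit-learn's KMeans are illustrative assumptions rather than the exact thesis implementation.

import numpy as np
from sklearn.cluster import KMeans

def assign(data, seeds):
    # index of the nearest seed for every record
    d = np.linalg.norm(data[:, None, :] - seeds[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def division(data, seeds):
    # Split the sparsest cluster into two using K-means with k = 2.
    labels = assign(data, seeds)
    spread = [np.mean(np.linalg.norm(data[labels == i] - seeds[i], axis=1))
              if np.any(labels == i) else 0.0
              for i in range(len(seeds))]
    s = int(np.argmax(spread))  # sparsest cluster (assumed measure)
    km = KMeans(n_clusters=2, n_init=10).fit(data[labels == s])
    return np.vstack([np.delete(seeds, s, axis=0), km.cluster_centers_])

def absorption(seeds):
    # Merge the two clusters with the minimum seed-to-seed distance.
    k = len(seeds)
    _, i, j = min((np.linalg.norm(seeds[a] - seeds[b]), a, b)
                  for a in range(k) for b in range(a + 1, k))
    merged = (seeds[i] + seeds[j]) / 2.0  # midpoint seed (assumption)
    return np.vstack([np.delete(seeds, [i, j], axis=0), merged])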
Step 7: Elitist Operation
The Elitist operation keeps track of the best chromosome throughout the generations in order to
ensure the continuous improvement of the quality of the best chromosome found so far over the
iterations. The operation is applied on a population at the end of all other operations in a
generation. If the fitness of the worst chromosome $P_w^i$ of the $i$th population (i.e. the current population) is less than the fitness of the best chromosome $P_b$ found so far from all previous generations, then $P_w^i$ is replaced by $P_b$ in the current population. Moreover, if the best chromosome of the current population $P_b^i$ has higher fitness than the fitness of $P_b$, then $P_b^i$ is copied into $P_b$, replacing its old value.
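A minimal Python sketch of this elitist operation is shown below; the list-based population and the fitness callback are illustrative names, not the thesis implementation.

def elitist_operation(population, fitness, best_so_far):
    # Replace the current worst chromosome with the historical best,
    # if the historical best is fitter.
    fits = [fitness(c) for c in population]
    worst = fits.index(min(fits))
    if fits[worst] < fitness(best_so_far):
        population[worst] = best_so_far
    # Update the historical best if the current best beats it.
    curr_best = max(population, key=fitness)
    if fitness(curr_best) > fitness(best_so_far):
        best_so_far = curr_best
    return population, best_so_far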
3.3 Experimental Results and Discussion
We implement our proposed technique DeRanClust and five existing techniques, namely AGCUK (Y. Liu et al., 2011), GAGR (D. Chang et al., 2012), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam, 2014). To implement AGCUK, GenClust and DeRanClust, we set the population size to 20 for each generation/iteration, and the total number of iterations to 50. Following the suggestion of AGCUK, we set the parameters (i.e. the population size and the number of iterations) for AGCUK, GenClust and DeRanClust to the same values for a fair comparison.
The population size for GAGR is set to 30 in each generation, and the total number of generations is set to 50 based on what was suggested in GAGR. Moreover, the cluster number for GAGR is user defined, thus we set the cluster number for GAGR to the same cluster number that we obtain from DeRanClust. We set the threshold value for K-means to 0.05 and the total number of iterations to 50 as suggested in AGCUK (Y. Liu et al., 2011).
3.3.1 Data Sets
We apply the techniques on five (5) real-life data sets as shown in Table 3.1. The data sets are publicly available in the UCI Machine Learning Repository (M. Lichman, 2013). We use data sets having only numerical attributes (no categorical attributes) in the experiments. All the data sets contain a class attribute, which we remove during the clustering process. Moreover, we also normalize each attribute of the data sets in order to get the same level of impact from all attributes. We run each technique 20 times on each data set, and we take the average result.
3.3.2 Evaluation Criteria
To compare our technique with the existing techniques two well-known evaluation criteria
namely Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach,
2005) and DB Index (D. L. Davies & Bouldin, 1979) are used. Note that the smaller value of
DB Index indicates a better clustering result and the higher value of Silhouette Coefficient
represents a better clustering result.
Table 3.1: A brief description of the data sets
Data set | Number of Records | Number of Numerical Attributes | Class Size | Attribute Type
Pima Indian Diabetes (PID) | 768 | 8 | 2 | Integer, Real
Blood Transfusion (BT) | 748 | 4 | 2 | Real
Glass Identification (GI) | 214 | 9 | 6 | Real
Liver Disorder (LD) | 345 | 6 | 2 | Integer, Real
Bank Note Authentication (BN) | 1372 | 4 | 2 | Real
3.3.3 Experimental Results on All Techniques
In this section, we compare the experimental results of the proposed technique with five existing techniques, AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam, 2014), in order to evaluate the usefulness of the proposed technique on 5 data sets, where each technique runs 20 times on each data set.
Fig. 3.4 shows the average Silhouette Coefficient of the clustering results, where DeRanClust achieves better results than all other techniques in 5 out of 5 data sets. That is, in 5 out of 5 data sets the average Silhouette Coefficient of 20 runs of DeRanClust is higher than the average Silhouette Coefficient of 20 runs of AGCUK, GAGR, K-means, K-means++ and GenClust.
Fig. 3.4: Comparative results between DeRanClust and other techniques based on Silhouette Coefficient
As we can see in Fig. 3.5, DeRanClust achieves better clustering results (on average) than all other techniques in 5 out of 5 data sets based on DB Index, for which a lower value indicates a better result.
Fig. 3.5: Comparative result between DeRanClust and other techniques based on DB Index
The rightmost columns of Fig. 3.4 and Fig. 3.5 show the average Silhouette Coefficient and DB Index of all techniques on all data sets. DeRanClust achieves clearly better results on average than all other techniques.
3.3.4 An Analysis of the Impact of Various Components of DeRanClust
In this section, we present some interesting initial results. We carry out the initial experiments
to analyze and evaluate the components used in the proposed technique. In order to evaluate the
effectiveness of various components of the proposed technique, we use the five (5) data sets as
shown in Table 3.1. We run each technique 20 times on each data set and present the average
result.
3.3.4.1 An Analysis of the Impact of the Population Initialization
We incorporate our proposed high-quality chromosome selection with an existing technique
called AGCUK (Y. Liu et al., 2011) for the selection of the initial population. Since our proposed
technique generates the initial population through the high-quality chromosome selection approaches (i.e. the deterministic and random selection processes), we explore the impact of high-quality chromosome selection in the initial population. We generate 20 initial chromosomes
through our high-quality chromosome selection process for AGCUK. We then use these in
AGCUK as the initial population. We call this version of AGCUK Modified AGCUK. We run both AGCUK and Modified AGCUK for
50 iterations on 5 data sets.
As we can see in Table 3.2, Modified AGCUK achieves better clustering result compared to
AGCUK in four (4) out of five (5) data sets according to the Silhouette Coefficient and in five
(5) out of five (5) data sets according to the DB Index. The average clustering result of Modified
AGCUK on five (5) data sets is also better than the AGCUK in terms of the both evaluation
criteria. Note that the shaded cells represent the best clustering results among the techniques.
Table 3.2: Comparative results between AGCUK and Modified AGCUK (DB Index: lower the better; Silhouette Coefficient: higher the better)
Data set | DB Index, AGCUK | DB Index, Modified AGCUK | Silhouette Coefficient, AGCUK | Silhouette Coefficient, Modified AGCUK
PID | 1.4062 | 1.3657 | 0.2728 | 0.2670
BT | 0.5478 | 0.4724 | 0.6498 | 0.6812
GI | 0.6247 | 0.5570 | 0.6573 | 0.7068
Liver | 0.8731 | 0.8560 | 0.4612 | 0.4678
BN | 0.7978 | 0.7596 | 0.4785 | 0.4994
Average | 0.84992 | 0.80214 | 0.50392 | 0.52444
3.3.4.2 An Analysis of the Impact of the Crossover Operation
We also explore the effectiveness of the crossover operation used in the proposed technique. In
order to evaluate the effectiveness of the crossover operation, we compare our proposed
technique with a different version of the proposed technique. We call this version DeRanClust without Crossover. DeRanClust without Crossover is exactly the same as DeRanClust except that it does not have the crossover operation. We run both DeRanClust and DeRanClust
without Crossover for 50 iterations on 5 data sets.
Table 3.3 shows that DeRanClust achieves better clustering results than DeRanClust without Crossover in five (5) out of five (5) data sets based on both the Silhouette Coefficient and DB Index. The average result of DeRanClust on the five (5) data sets is also better than DeRanClust without Crossover in terms of both the Silhouette Coefficient and DB Index.
Table 3.3: Comparative result between DeRanClust and DeRanClust without Crossover
(DB Index: lower the better; Silhouette Coefficient: higher the better)
Data set | DB Index, DeRanClust | DB Index, DeRanClust without Crossover | Silhouette Coefficient, DeRanClust | Silhouette Coefficient, DeRanClust without Crossover
PID | 0.9145 | 1.3317 | 0.4684 | 0.2942
BT | 0.1854 | 0.4931 | 0.8616 | 0.6636
GI | 0.2490 | 0.5491 | 0.8321 | 0.7125
Liver | 0.2843 | 0.7585 | 0.8025 | 0.5066
BN | 0.4234 | 0.9133 | 0.6956 | 0.4290
Average | 0.41132 | 0.80914 | 0.73204 | 0.52118
3.3.4.3 Cluster Quality Comparison between DeRanClust and Modified AGCUK
Since Table 3.2 shows an improvement in AGCUK when the initial population is selected
through our proposed high-quality initial population selection (i.e. Modified AGCUK), in Table
3.4, we compare the cluster quality obtained by the proposed technique with Modified AGCUK.
This gives a fairer comparison between DeRanClust and AGCUK. In Table 3.4, we can see a
clear domination of the proposed technique over Modified AGCUK.
Table 3.4: Comparative results between DeRanClust and Modified AGCUK
(DB Index: lower the better; Silhouette Coefficient: higher the better)
Data set | DB Index, DeRanClust | DB Index, Modified AGCUK | Silhouette Coefficient, DeRanClust | Silhouette Coefficient, Modified AGCUK
PID | 0.9145 | 1.3657 | 0.4684 | 0.2670
BT | 0.1854 | 0.4724 | 0.8616 | 0.6812
GI | 0.2490 | 0.5570 | 0.8321 | 0.7068
Liver | 0.2843 | 0.8560 | 0.8025 | 0.4678
BN | 0.4234 | 0.7596 | 0.6956 | 0.4994
Average | 0.41132 | 0.80214 | 0.73204 | 0.52444
3.3.5 Complexity Analysis
In this section, we present the complexity of DeRanClust and compare it with the complexity of
AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam,
2014) and K-means (Lloyd, 1982). The main factors related to the complexity of DeRanClust
are as follows: in a data set $D$, the number of records is $n$, the number of attributes is $m$, the number of genes in a chromosome is $k$, the number of chromosomes in a population is $z$, the number of iterations in K-means is $N'$ and the number of iterations in DeRanClust is $N$. We realize that, out of these factors, $n$, $m$, $k$ and $z$ can be much bigger than the others. Hence, we consider $n$, $m$, $k$ and $z$ to compute the complexity.
For the initial population, DeRanClust uses K-means to get a number of deterministic chromosomes, the complexity of which is $O(nmkz)$. It also randomly selects some chromosomes, for which the complexity is $O(kz)$. The fitness function is the DB index, which has a complexity of $O(nmkz)$. Once the fitness is computed, the noising selection requires a pairwise comparison, which can be done with $O(z)$ complexity. The crossover operation requires the roulette wheel, for which we need $O(z^2)$ complexity. For the twin removal, we need $O(mk^2z)$ complexity. In the mutation operation, the complexities of the division and absorption are $O(nmkz)$ and $O(mkz)$, respectively. The elitist operation has a complexity of $O(z)$ once the fitness is calculated at a cost of $O(nmkz)$. Hence, the overall complexity of DeRanClust is $O(nmk^2z^2)$. With respect to $n$ and $m$ (the two most significant factors), it has a linear complexity $O(nm)$.
The complexities of AGCUK, K-means, GAGR and GenClust are $O(nm)$ (Y. Liu et al., 2011), $O(nm)$ (Lloyd, 1982), $O(nm)$ (D.-X. Chang et al., 2009) and $O(nm^2 + n^2m)$ (Rahman & Islam, 2014), respectively.
3.4 Summary
In this chapter, we propose a GA-based clustering technique called DeRanClust. The proposed
technique generates high-quality chromosomes in the initial population. It produces high-quality
chromosomes in the initial population through two phases: a deterministic phase and a random
phase.
In the deterministic phase, DeRanClust uses K-means to produce high-quality chromosomes.
The justification for using K-means is that it is known for producing a reasonably good quality
clustering solution in linear time. Due to its light weight and the ability to produce reasonably
good-quality solutions, it is expected to produce a good-quality chromosome with a low
complexity. DeRanClust also uses randomly selected chromosomes in the initial population in
order to maintain both high quality and randomness.
We compare the performance of DeRanClust with AGCUK, GAGR, GenClust, K-means++
and K-means in terms of the two cluster evaluation criteria, namely Silhouette Coefficient and
DB Index. In the experiments, we use five (5) natural data sets that we obtain from the UCI
Machine Learning Repository (M. Lichman, 2013). From the experimental results we find that
DeRanClust performs better than AGCUK, GAGR, GenClust, K-means and K-means++ in 5 out of 5 data sets based on both the Silhouette Coefficient and DB Index. The average clustering result of DeRanClust on the 5 data sets is also better than that of all other techniques.
We also compare the complexity of DeRanClust with the complexity of AGCUK, GAGR, GenClust and K-means. From the complexity analysis, we find that DeRanClust produces its clustering solutions with a low complexity of $O(nm)$, whereas GenClust requires $O(nm^2 + n^2m)$ complexity to produce its clustering solutions. GenClust (Rahman & Islam, 2014) is a recent technique and has been shown to be better than many other high-quality techniques (Ahmad & Dey, 2007a; D.-X. Chang et al., 2009; Lee & Pedrycz, 2009; Y. Liu et al., 2011; Lloyd, 1982).
However, from the experimental results and complexity analyses, we find that DeRanClust
produces better clustering results than GenClust with a low complexity. Therefore, we
empirically demonstrate that through the proposed DeRanClust technique we progress towards
achieving our research goal 1.
We also experimentally evaluate the effectiveness of the proposed component for high-
quality initial population by applying it on AGCUK. The experimental results on 5 data sets
clearly indicate the usefulness of the proposed high-quality initial population selection. We also
explore the usefulness of other genetic operations such as the crossover operation. The
experimental results show that DeRanClust with crossover performs better than DeRanClust
without crossover. The results indicate that there is room for further improvement of clustering
quality by improving other genetic operations such as crossover and mutation.
Therefore, in the next chapter, we propose a new GA-based clustering technique called GMC that introduces a new selection, crossover and mutation operation in order to improve the chromosome quality.
Chapter 4
Extensive Crossover and Mutation in a GA for
High-Quality Clustering with Low Complexity
4.1 Introduction
In this chapter, we propose a GA-based clustering technique called GMC, which is a further
improvement on DeRanClust (as presented in Chapter 3). DeRanClust produces high-quality
clustering solutions (see Fig. 3.4 and Fig. 3.5) with a low complexity of 𝑂(𝑛𝑚) through the
proposed high-quality initial population selection. We believe that there is room for further
improvement of cluster quality of DeRanClust by improving other genetic operations such as
crossover and mutation.
Therefore, GMC proposes a new selection, crossover and mutation operation in order to
improve the cluster quality. In this chapter, we aim to further progress to attain our research goal
1. GMC uses a probabilistic selection where a chromosome with a higher fitness value has a greater chance to be selected for other genetic operations such as crossover and mutation.
During the PhD candidature, we have published the following paper based on this chapter:
Beg, A. H. and Islam, M. Z. (2016): Novel crossover and mutation operation in genetic algorithm for clustering, In Proc. of the IEEE Congress on Evolutionary Computation (IEEE CEC 2016), Vancouver, Canada, July 24-29, 2016, pp. 2114-2121. (ERA 2010 Rank A)
GMC also proposes two phases of crossover operation. In the proposed crossover operation,
it first classifies the chromosomes in a population in one of the two groups: Good group and
Non-good group. It then performs different types of crossover on the two different groups. The
intuition behind this is to increase the possibility of getting good-quality offspring chromosomes
from a pair of good-quality parent chromosomes.
GMC also performs different types of mutation operation for the two different groups. In the
mutation operation, it applies two steps of mutation on the chromosomes of the good group and
three steps of mutation on the chromosomes of the non-good group. The proposed mutation
operation reduces the amount of changes on the good chromosomes, and increases the amount
of changes on bad chromosomes in order to improve their quality.
We implement GMC and compare its performance with AGCUK (Y. Liu et al., 2011),
GAGR (D.-X. Chang et al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii,
2007) and GenClust (Rahman & Islam, 2014). We compare the performance of the techniques
based on two cluster evaluation techniques namely Silhouette Coefficient (Agustín-Blas et al.,
2012; Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979)
using 10 real-life data sets that we obtain from the UCI machine learning repository (M.
Lichman, 2013).
The contributions of the chapter are presented as follows:
Proposing GMC, which produces high-quality clustering solutions with low complexity and requires no user input;
The evaluation of GMC by comparing it with existing techniques.
The organization of the chapter is as follows: in Section 4.2, we discuss the motivation behind
the proposed technique; in Section 4.3, we present our proposed technique; the experimental
results and discussion are presented in Section 4.4, and the summary of the chapter is presented
in Section 4.5.
4.2 The Motivation Behind the Proposed Technique
An existing technique called AGCUK (Y. Liu et al., 2011) uses a noising selection operation in
order to give a better chance for the selection of a chromosome with low fitness value in the
earlier iterations. However, GMC uses a probabilistic selection comparing chromosomes across two generations, where a chromosome with a higher fitness value has a greater chance to be selected. Hence, GMC aims to ensure good-quality chromosomes at the beginning of each generation, before the crossover and mutation operations.
The proposed technique also modifies the process that selects a pair of chromosomes in the crossover operation in order to encourage crossover between two good-quality chromosomes.
There are many selection approaches including the roulette wheel (D. Chang et al., 2012; Maulik
& Bandyopadhyay, 2000; Mukhopadhyay & Maulik, 2009), rank-based wheel (Agustín-Blas et
al., 2012) and random selection (D.-X. Chang et al., 2009). These approaches do not carefully
select a pair of chromosomes for a crossover operation. As a result, a good-quality chromosome
often makes a pair with a bad quality chromosome, and there is a high chance to produce bad
quality offspring chromosomes.
Therefore, for crossover operation we classify the chromosomes in a population into two
different groups: good group and non-good group. In the good group, we apply crossover
operation only on two high-quality chromosomes in order to increase the possibility of getting
good-quality offspring chromosomes. In the good group, we introduce an opportunity so that each chromosome within the group forms a pair with every other chromosome within the group.
For example, if there are three chromosomes in the good group, then Chromosome 1 and
Chromosome 2 are selected as one pair, Chromosome 2 and Chromosome 3 are selected for
another pair, and Chromosome 3 and Chromosome 1 are selected for the other pair.
Similar to the crossover, GMC uses different types of mutation for different groups in the
mutation operation. Hence, the proposed mutation operation reduces the amount of changes on
the good-quality chromosomes, and increases the amount of changes on bad quality
chromosomes in order to improve their quality.
4.3 GMC: Genetic Algorithm with Novel Mutation and Crossover for Clustering
We first mention the main steps of the proposed technique as follows and then explain each of
them in detail. Out of the following steps, Step 3, Step 4 and Step 6 are our novel contributions
of this chapter.
BEGIN
Step 1: Normalization
Step 2: Population Initialization
DO: t = 1 to I /* I = 50; I is the user defined number of iterations */
Step 3: Probabilistic Selection
Step 4: Two Phases of Crossover Operation
Step 5: Twin Removal
Step 6: Three Steps of Mutation Operation
Step 7: Elitist Operation
END
END
Step 1: Normalization
The proposed technique first normalizes the data set 𝐷 in order to consider each attribute equally
regardless of their domain sizes while calculating the fitness of a chromosome. For
normalization, we use the same approach of normalization that we used in DeRanClust (see
Section 3.2 of Chapter 3).
Step 2: Population Initialization
For the population initialization, the proposed technique uses the same approach of population
initialization that we used in DeRanClust (see Section 3.2 of Chapter 3). GMC selects |P|
number of chromosomes in the initial population, |P|/2 from the deterministic phase and |P|/2
from the random phase. In the experiments of this chapter, we use |P| to be 20. GMC uses the Davies-Bouldin (DB) index (D. L. Davies & Bouldin, 1979) to calculate the fitness of each chromosome.
Algorithm 4.1: GMC
Input: A data set D having N records and |A| attributes, where A is the set of attributes
Output: A set of clusters C
Require:
  Ps ← ∅ /* Ps is the set of initial population (20 chromosomes), initially empty */
  Po ← ∅ /* Po is the set of offspring chromosomes, initially empty */
  Pm ← ∅ /* Pm is the set of mutated chromosomes, initially empty */
  I = 50 /* user defined number of iterations/generations; default value for I is 50 */
  D′ ← normalize(D) /* normalize each numerical attribute to obtain the normalized data set D′ */
  Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
  Pr ← ∅ /* Pr is the set of random chromosomes (10 chromosomes), initially empty */
end
Step 1: /* Population Initialization */
  Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the initial seeds for K-means are chosen deterministically */
  Pd ← SelectDeterministicChromosomes(Pd) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
  Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the number of initial seeds and the seeds themselves are chosen randomly */
  Ps ← Ps ∪ (Pr ∪ Pd) /* insert Pr and Pd into Ps */
  Pb ← FindBestChromosome(Ps) /* Pb is the chromosome that has the maximum fitness value in Ps */
end
for (t = 1 to I) do /* default I = 50; t is the counter of I */
  Step 2: /* Probabilistic Selection Operation */
    if (t > 1) then
      Ms ← MergeChromosomes(Ps^t, Ps^(t−1)) /* merge the chromosomes of the t-th generation with the chromosomes of the (t−1)-th generation */
      Ps ← ProbabilisticSelection(Ms) /* probabilistically select |Ps| chromosomes from Ms */
    end
  end
  Step 3: /* Two Phases of Crossover Operation */
    Po ← TwoPhasesOfCrossoverOperation(Ps) /* apply the two phases of crossover on Ps and get a set of offspring chromosomes Po */
  end
  Step 4: /* Twin Removal */
    Ps ← TwinRemoval(Po) /* if the length of a chromosome is > 2 and there are two identical genes, delete one of them; if the length is 2 and the genes are identical, change one of them */
  end
  Step 5: /* Three Steps of Mutation Operation */
    Gs ← SelectGoodGroupChromosomes(Ps) /* classify the chromosomes of the good group */
    Ns ← SelectNonGoodChromosomes(Ps) /* classify the chromosomes of the non-good group */
    Gv ← DivisionAndAbsorptionOperation(Gs) /* perform division and absorption on Gs and get the set of mutated chromosomes Gv */
    Pm ← Pm ∪ Gv /* insert Gv into Pm */
    Nv ← DivisionAndAbsorptionOperation(Ns) /* perform division and absorption on Ns and get the set of mutated chromosomes Nv */
    Nr ← RandomChangeOperation(Nv) /* perform a random change on Nv and get the set of mutated chromosomes Nr */
    Pm ← Pm ∪ Nr /* insert Nr into Pm */
  end
  Step 6: /* Elitist Operation */
    Pb ← ElitistOperation(Pm, Pb) /* apply the elitist operation on Pm and Pb and find the best chromosome Pb */
    C ← C ∪ Pb /* insert Pb into C */
  end
end
Return C
Step 3: Probabilistic Selection
This is an original contribution of GMC that uses a probabilistic selection in order to select
chromosomes for the consequent genetic operations in the next generations. GMC first merges
the chromosomes of the current 𝑖𝑡ℎ and the previous (𝑖 − 1)𝑡ℎgeneration. It then
probabilistically selects a set of chromosomes from the merged chromosomes (see Algorithm
4.1). The chromosome with a higher fitness value has more chances to be selected than the
chromosome with a lower fitness value. The probability of the $j$th chromosome of the $i$th generation is calculated as follows:

$P_j^i = \frac{f_j^i}{\sum_{l=1}^{|P|} f_l^i}$  (Eq. 4.1)

where $f_j^i$ is the fitness of the $j$th chromosome of the $i$th generation and $|P|$ is the number of chromosomes in a population.
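A minimal Python sketch of this fitness-proportional selection (Eq. 4.1) is shown below. Whether the selection is done with or without replacement is not spelled out above, so sampling with replacement here is an illustrative assumption.

import random

def probabilistic_selection(merged, fitness, size):
    # Fitness-proportional selection over the merged generations (Eq. 4.1).
    fits = [fitness(c) for c in merged]
    total = sum(fits)
    weights = [f / total for f in fits]  # P_j^i = f_j^i / sum_l f_l^i
    # fitter chromosomes are drawn more often
    return random.choices(merged, weights=weights, k=size)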
Step 4: Two Phases of Crossover Operation
This is an original contribution of GMC. We perform a crossover operation on a pair of
chromosomes where the chromosomes swap their segments in order to generate a pair of
offspring chromosomes. The proposed technique uses two different phases of crossover: Single
point (Sanghamitra Bandyopadhyay & Maulik, 2002; D. Chang et al., 2012; Garai & Chaudhuri,
2004; Michael Laszlo & Mukherjee, 2007; Pakhira et al., 2005; Peng et al., 2014; Rahman &
Islam, 2014; Song et al., 2009) and Random crossover. Before applying crossover, GMC first
classifies the chromosomes in a population in one of the two groups: Good group and Non-good
group. In order to categorize two different groups, it first identifies the fitness range 𝑅𝑏 as
follows:
$R_b = \frac{f_b}{\sum_{l=1}^{|P|} f_l}$  (Eq. 4.2)

where $f_b$ is the fitness of the best chromosome and $|P|$ is the number of chromosomes in a population. It then separates the chromosomes of a population into two groups by using Eq. 4.3 and Eq. 4.4. When the difference between the fitness ($f_j$) of a chromosome ($P_j$) and the fitness ($f_b$) of the best chromosome ($P_b$) is less than or equal to $R_b$, the chromosome $P_j$ is selected for the good group; otherwise $P_j$ is selected for the non-good group.

$f_b - f_j \le R_b \Rightarrow$ good group  (Eq. 4.3)
$f_b - f_j > R_b \Rightarrow$ non-good group  (Eq. 4.4)
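The grouping rule of Eq. 4.2-4.4 can be sketched in Python as follows (the list-based population and fitness callback are illustrative names):

def split_groups(population, fitness):
    # Split a population into the good and non-good groups (Eq. 4.2-4.4).
    fits = [fitness(c) for c in population]
    f_b = max(fits)
    r_b = f_b / sum(fits)  # fitness range R_b (Eq. 4.2)
    good = [c for c, f in zip(population, fits) if f_b - f <= r_b]     # Eq. 4.3
    non_good = [c for c, f in zip(population, fits) if f_b - f > r_b]  # Eq. 4.4
    return good, non_good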
Hence, the chromosomes of a population are divided into two groups. Once the group selection is complete, GMC then selects pairs of chromosomes for the crossover operation. For the chromosomes in the good group, GMC selects the pairs in such a way that each chromosome within the group forms a pair with every other chromosome within the group. Once the pairs are selected, GMC carries out altogether 10 crossover operations between the chromosomes of each pair. In the first crossover operation it carries out a single-point crossover between the two chromosomes. For each of the nine remaining crossover operations, GMC applies a random crossover between the two chromosomes (see Algorithm 4.2).
In the single-point crossover phase, each chromosome of a pair is divided into two parts at a random point, and the segments of the two chromosomes are then swapped with each other to generate two offspring chromosomes. In the random crossover phase, GMC combines a pair of chromosomes ($P_x$ and $P_y$), and generates a random number ($R_m$) between 0 and the length of the combined chromosomes ($P_x + P_y$). For offspring one, it then randomly selects $R_m$ genes (without replacement) from the combined chromosomes and deletes the $R_m$ genes from the set of ($P_x + P_y$) genes. The remaining genes ($(P_x + P_y) - R_m$) in the combined chromosomes are then selected for offspring two.
Algorithm 4.2: Two Phases of Crossover Operation
Input: A set of chromosomes Ps after the probabilistic selection operation
Output: A set of offspring chromosomes Po
Require:
  Px ← ∅ /* Px is the set of offspring chromosomes obtained from the good-group crossover, initially empty */
  Py ← ∅ /* Py is the set of offspring chromosomes obtained from the non-good-group crossover, initially empty */
  Gs ← ∅ /* Gs is the set of chromosomes in the good group, initially empty */
  Ns ← ∅ /* Ns is the set of chromosomes in the non-good group, initially empty */
end
Step 1: /* Classify the chromosomes */
  Gs ← SelectGoodGroupChromosomes(Ps) /* classify the chromosomes of the good group */
  Ns ← SelectNonGoodChromosomes(Ps) /* classify the chromosomes of the non-good group */
end
Step 2: /* Perform crossover on the good group */
  Gp ← ∅ /* Gp is the set of pairs of chromosomes in the good group, initially empty */
  Gp ← PairSelection(Gs) /* select pairs so that each chromosome forms a pair with every other chromosome in the group */
  Gx ← PerformSinglePointCrossover(Gp) /* perform single-point crossover on each pair of chromosomes in Gp */
  Px ← Px ∪ Gx /* insert offspring chromosomes Gx into Px */
  Gy ← PerformRandomCrossover(Gp) /* perform random crossover on each pair of chromosomes in Gp */
  Px ← Px ∪ Gy /* insert offspring chromosomes Gy into Px */
  Px ← SelectOffspringChromosomes(Px) /* select |Ps|/2 offspring chromosomes from Px based on their fitness */
end
Step 3: /* Perform crossover on the non-good group */
  Np ← ∅ /* Np is the set of pairs of chromosomes in the non-good group, initially empty */
  Np ← RouletteWheelSelection(Ns) /* select pairs of chromosomes using the roulette wheel */
  Nx ← PerformSinglePointCrossover(Np) /* perform single-point crossover on each pair of chromosomes in Np */
  Py ← Py ∪ Nx /* insert offspring chromosomes Nx into Py */
  Ny ← PerformRandomCrossover(Np) /* perform random crossover on each pair of chromosomes in Np */
  Py ← Py ∪ Ny /* insert offspring chromosomes Ny into Py */
  Py ← SelectOffspringChromosomes(Py) /* select |Ps|/2 offspring chromosomes from Py based on their fitness */
end
Step 4: /* Return the offspring */
  Po ← Po ∪ (Px ∪ Py) /* insert Px and Py into Po */
  Return Po
end
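The random crossover phase can be sketched in Python as follows, treating a chromosome as a list of genes. The sketch draws $R_m$ so that both offspring are non-empty, which is an illustrative assumption (the text above allows $R_m$ to range between 0 and the combined length).

import random

def random_crossover(px, py):
    # Redistribute the combined genes of two parents into two offspring.
    combined = list(px) + list(py)
    r_m = random.randint(1, len(combined) - 1)  # genes for offspring one
    picked = set(random.sample(range(len(combined)), r_m))  # without replacement
    offspring1 = [g for i, g in enumerate(combined) if i in picked]
    offspring2 = [g for i, g in enumerate(combined) if i not in picked]
    return offspring1, offspring2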
Once the crossover is complete on this group, it then selects offspring chromosomes based on their fitness values. For example, if the number of chromosomes in the good group is 3, then for the 3 pairs (1-2, 2-3 and 3-1) it altogether generates 30 offspring chromosomes (through the single-point crossover phase and the random crossover phase). GMC then selects the top 3 offspring chromosomes from the 30 offspring chromosomes.
In the non-good group, a pair of chromosomes are selected using the roulette wheel (D.
Chang et al., 2012; Maulik & Bandyopadhyay, 2000; Mukhopadhyay & Maulik, 2009)
selection. In the roulette wheel selection, the best chromosome of this group is selected as the
first chromosome of the first pair. The second chromosome of the pair is selected
probabilistically. The chromosomes of the pair are then excluded from the selection process for
the second pair. The probability of a chromosome is calculated using Eq. 4.1.
Moreover, in the non-good group phase, the chromosomes are selected pair-wise for the crossover operation. If the total number of chromosomes in the non-good group is odd, GMC deletes the worst chromosome from the group. The deleted chromosome is then replaced with an offspring chromosome from the good group. For each pair of
chromosomes, it applies both the random crossover and single point crossover. Similar to the
good group phase, it then selects offspring chromosomes based on their fitness values.
Step 5: Twin Removal
GMC uses the twin removal operation in order to remove/modify twin genes (if any) from each
chromosome. For twin removal, GMC uses the same approach of twin removal of DeRanClust
(see Section 3.2 of Chapter 3).
Step 6: Three Steps of Mutation Operation
The mutation operation of the proposed technique changes each chromosome using three
operations: division (Y. Liu et al., 2011), absorption (Y. Liu et al., 2011) and/or a random
change. Similar to the crossover, in the mutation operation GMC also classifies the chromosomes into one of the
two groups by using Eq. 4.3 and Eq. 4.4. For the good group, it applies division and absorption
operation. For the non-good group, it applies division, absorption, and a random change
operation (see Algorithm 4.1).
In the division operation for a chromosome, it identifies the sparsest cluster 𝐶𝑗 of a
chromosome 𝑃𝑗 and then divides 𝐶𝑗 into two clusters by applying K-means on 𝐶𝑗 using 𝑘 = 2.
The absorption operation finds the two closest clusters of the chromosome and merges them
into one cluster. The clusters that have the minimum seed to seed distance are considered to be
the closest clusters. In the random change operation, one gene of a chromosome is randomly
chosen and an attribute value of the gene is randomly changed to another value within its
domain.
Step 7: Elitist Operation
The elitist operation keeps track of the best chromosome throughout the generations. For finding
the best chromosome, GMC uses the same approach of elitist operation of DeRanClust (see
Section 3.2 of Chapter 3).
4.4 Experimental Results and Discussion
We empirically compare our technique with five existing techniques called AGCUK (Y. Liu et
al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014), K-Means
(Lloyd, 1982) and K-Means++ (Arthur & Vassilvitskii, 2007) on ten (10) natural data sets that
are available in the UCI machine learning repository (M. Lichman, 2013).
4.4.1 Data Sets
Detailed information about the data sets is presented in Table 4.1. All the data sets used in this
chapter have only numerical attributes except the class attribute. We evaluate and compare the
clustering result based on two evaluation criteria namely Silhouette Coefficient (Agustín-Blas
et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin,
1979). A smaller value of DB Index indicates a better clustering result and a higher value of
Silhouette Coefficient represents a better clustering result.
4.4.2 The Parameter used in the Experiments
In the experimentation of AGCUK, GAGR, GenClust, and GMC, we consider the population
size to be 20 and the number of generations/iterations to be 50. We maintain this consistency
for all the techniques in order to maintain a fair comparison among them. The number of iterations in K-means and K-means++ is set to 50. The number of iterations of K-means in GenClust is also set to 50. The cluster numbers in GAGR, K-means and K-means++ are user defined. The numbers of clusters in GAGR, K-means and K-means++ are generated randomly in the range between 2 and $\sqrt{n}$ ($n$ is the number of records in the data set). The values of $r_{max}$ and $r_{min}$ in AGCUK are set to 1 and 0, respectively. The threshold value for K-means is set to 0.005.
Table 4.1: Data sets at a glance
Data set | No. of Records (with missing values) | No. of Records (without missing values) | No. of numerical attributes | No. of categorical attributes | Class size
Glass Identification (GI) | 214 | 214 | 10 | 0 | 7
Vertebral Column (VC) | 310 | 310 | 6 | 0 | 2
Leaf (LF) | 340 | 340 | 16 | 0 | 36
Liver Disorder (LD) | 345 | 345 | 6 | 0 | 2
Dermatology (DT) | 366 | 358 | 34 | 0 | 6
Pima Indian Diabetes (PID) | 768 | 768 | 8 | 0 | 2
Statlog Vehicle Silhouettes (SV) | 846 | 846 | 18 | 0 | 4
Bank Note Authentication (BN) | 1372 | 1372 | 4 | 0 | 2
Yeast (YT) | 1484 | 1484 | 8 | 0 | 10
Image Segmentation (IS) | 2310 | 2310 | 18 | 0 | 7
4.4.3 The Experimental Setup
For each data set, we run GMC 10 times since it can produce different clustering results in
different runs. We then present the average clustering results. We also run all other techniques
AGCUK, GAGR, GenClust, K-means and K-means++ 10 times. We then present the average
clustering result. In order to evaluate the effectiveness of various components of the proposed
technique, we randomly choose five (5) data sets (GI, PID, LF, LD, and DT) as shown in Table
4.1.
4.4.4 Experimental Results on All Techniques
In this section, we experimentally evaluate the performance of the proposed technique by
comparing it with AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust
(Rahman & Islam, 2014), K-Means (Lloyd, 1982) and K-Means++ (Arthur & Vassilvitskii,
2007) on all 10 data sets where each technique runs 10 times on each data set. Fig. 4.1 shows
the average Silhouette Coefficient of the clustering solutions, where GMC achieves better
results than all other techniques in 10 out of 10 data sets.
Fig. 4.1: Comparative results between GMC and other techniques based on Silhouette Coefficient (higher the better)
Fig. 4.2: Comparative results between GMC and other techniques based on DB Index (lower the better)
As we can see in Fig. 4.2, GMC achieves better clustering results (on average) than all other techniques in 10 out of 10 data sets based on DB Index. Moreover, the rightmost columns in Fig. 4.1 and Fig. 4.2 show the average Silhouette Coefficient and DB Index of all techniques on all data sets. GMC achieves better results on average than all other techniques.
4.4.5 An Analysis of the Impact of Various Properties of GMC
We now explore the effectiveness of the proposed components of GMC in the following
subsections.
4.4.5.1 An Analysis of the Impact of the Crossover Operation
We explore the effectiveness of the crossover operation (see Step 4 of Section 4.3). In Fig. 4.3 and Fig. 4.4 we present the experimental results of GMC compared with a different version of GMC called GMC without Crossover, which is exactly the same as GMC except that it does not have the crossover operation. We run both GMC and GMC without Crossover for 50 iterations on 5 data sets. We run both techniques 10 times on each data set and present the average results.
We can see in Fig. 4.3 and Fig. 4.4 that GMC achieves better clustering results than GMC without Crossover based on the Silhouette Coefficient and DB Index. The average clustering result of GMC on the 5 data sets is also better than GMC without Crossover based on the Silhouette Coefficient and DB Index.
Fig. 4.3: Comparative results between GMC and GMC without Crossover based on Silhouette Coefficient (higher the better)
Fig. 4.4: Comparative results between GMC and GMC without Crossover based on DB Index (lower the better)
4.4.5.2 An Analysis of the Impact of the Mutation Operation
In order to explore the effectiveness of the proposed mutation operation, we introduce a new version of GMC from which we remove the mutation operation. We call this version GMC without Mutation. We then compare this version of GMC with the complete GMC. We run both GMC and GMC without Mutation for 50 iterations. We run both techniques 10 times on each data set and present the average results.
Fig. 4.5: Comparative results between GMC and GMC without Mutation based on Silhouette Coefficient (higher the better)
Fig. 4.6: Comparative results between GMC and GMC without Mutation based on DB Index (lower the better)
Fig. 4.5 and Fig. 4.6 indicate that GMC achieves better clustering results compared to GMC without Mutation according to the Silhouette Coefficient and DB Index. The average clustering result of GMC on the 5 data sets is also better than GMC without Mutation based on the Silhouette Coefficient and DB Index.
4.4.5.3 An Analysis of the Impact of the Probabilistic Selection Operation
We also explore the effectiveness of the proposed probabilistic selection operation (see Step 3
of Section 4.3). In Fig. 4.7 and Fig. 4.8, we present the experimental results of GMC compared with a different version of GMC called GMC without Probabilistic Selection (PS), which is exactly the same as GMC except that it does not have the probabilistic selection. We run both GMC and GMC without PS for 50 iterations on 5 data sets. We run both techniques 10 times on each data set and present the average results.
Fig. 4.7: Comparative results between GMC and GMC without Probabilistic Selection (PS) based on Silhouette Coefficient (higher the better)
Fig. 4.8: Comparative results between GMC and GMC without Probabilistic Selection (PS) based on DB Index (lower the better)
Fig. 4.7 and Fig. 4.8 show that GMC achieves better clustering results than GMC without
Probabilistic Selection in 5 out of 5 data sets according to both Silhouette Coefficient and DB
Index. The average clustering result of GMC on 5 data sets is also better than GMC without
Probabilistic Selection based on Silhouette Coefficient and DB Index.
4.4.5.4 An Analysis of Improvement in Chromosomes over the Iterations
In Fig. 4.9, we present the average fitness (in terms of DB Index, where Fitness=1/DB) values
of the best chromosomes of GMC and AGCUK over the 10 runs for 10 data sets. Both GMC
and AGCUK use the same fitness function (DB Index) to calculate the fitness of a chromosome.
As we can see in Fig. 4.9, the fitness of the best chromosome of GMC shows a rapid improvement within the first 5 iterations, and then continues to steadily increase over the 50 iterations. Moreover, the average fitness of the best chromosome of GMC is always higher than
the average fitness of the best chromosome of AGCUK, clearly indicating the effectiveness of
various components of GMC.
Fig. 4.9: Average fitness (best chromosome fitness) versus iterations over the 10 data sets
4.4.6 Statistical Analysis
We now analyze the results by using the statistical sign test (D. Mason, 1998) to evaluate the superiority of the results (Silhouette Coefficient and DB Index) obtained by GMC over the existing techniques. We observe that the results do not follow a normal distribution and thus the conditions for a parametric test are not satisfied. Hence, we carry out a non-parametric sign test on the Silhouette Coefficient and DB Index.
The sign test (Triola, 2001) analyzes the frequencies of the plus and minus signs to determine
whether they are significantly different. For example, suppose that we test the use of a guide book written for students' final exams. If there are 100 students and 51 of them are successful and 49 of them are unsuccessful, common sense suggests that there is not sufficient evidence to say that the guide book is useful, because 51 successful students out of 100 is not significant. But what about 53 successful and 47 unsuccessful students? Or 92 successful and 8 unsuccessful students? The sign test is useful to determine
when such results are significant. Fig. 4.10 summarizes the procedure of the sign test. If the numbers of positive and negative signs are equal, then we fail to reject the null hypothesis and do not proceed with the sign test.
Fig. 4.10: Flow chart of the sign test (Triola, 2001). [The flow chart depicts the following procedure: assign positive and negative signs and reject any zeros; let $n$ be the total number of signs and $x$ the number of the less frequent sign; if $n \le 25$, compare $x$ with the critical value from Table A-7 of the statistics book; otherwise convert $x$ to the test statistic $z = \frac{(x + 0.5) - (n/2)}{\sqrt{n}/2}$ and compare it with the critical $z$ value from Table A-2; reject the null hypothesis if the test statistic is less than or equal to the critical value, and fail to reject it otherwise.]
Fig. 4.11: Sign test of GMC on 10 data sets
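For the large-sample branch of Fig. 4.10 ($n > 25$), the test statistic can be computed as in the following Python sketch; the win/loss counting over the data sets is illustrative.

import math

def sign_test_z(wins, losses):
    # Continuity-corrected sign test statistic for n > 25 (Fig. 4.10):
    # z = ((x + 0.5) - n/2) / (sqrt(n)/2), where x is the less frequent sign.
    n = wins + losses  # ties (zeros) are rejected beforehand
    x = min(wins, losses)
    return ((x + 0.5) - n / 2.0) / (math.sqrt(n) / 2.0)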
In Fig. 4.11, we compare the sign test results of GMC against the existing techniques on 10 data sets in terms of the Silhouette Coefficient and DB Index. The first five bars for the Silhouette Coefficient and DB Index in Fig. 4.11 show the z-values (test statistics); the sixth bar denotes the z-ref value. If the z-value is greater than the z-ref value, then the result obtained by GMC is considered significant. We carry out a right-tailed sign test at z > 1.96, p < 0.025 in terms of the Silhouette Coefficient and DB Index. The statistical sign test results shown in Fig. 4.11 indicate the superiority of GMC over the existing techniques.
4.5 Summary
In this chapter, we propose a GA-based clustering technique called GMC. The proposed technique uses a new selection operation comparing chromosomes across two generations, where a chromosome with a higher fitness value has a greater chance to be selected for other genetic operations such as crossover and mutation.
The proposed technique also modifies the process that selects a pair of chromosomes in the
crossover operation in order to encourage crossover between two good-quality chromosomes.
The proposed crossover operation aims to increase the possibility of getting good-quality
offspring chromosomes. GMC also introduces a new mutation operation which aims to reduce the amount of change on the good chromosomes, and to increase the amount of change on bad chromosomes in order to improve their quality.
We evaluate the proposed technique by comparing its clustering quality with five existing
techniques namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-means
(Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam,
2014) on 10 natural data sets that are publicly available from the UCI machine learning
repository (M. Lichman, 2013) in terms of two well-known evaluation criteria namely
Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005) and
DB Index (D L Davies & Bouldin, 1979). The experimental results indicate a clear superiority
of the proposed technique over the existing techniques. The proposed technique also presents
some interesting results in order to demonstrate the effectiveness of the proposed mutation and
crossover operation. The experimental results clearly indicate the effectiveness of the proposed
mutation and crossover operation.
However, the crossover operation of GMC has a drawback. In the crossover operation, GMC
first classifies the chromosomes in a population into one of two groups: Good group and Non-good group. The good chromosomes are classified into the good group and the bad chromosomes are classified into the non-good group.
groups. In the good group, GMC carries out crossover between all possible pairs of
chromosomes. However, in the non-good group, a bad chromosome makes a pair with another
bad chromosome. Hence, there are fewer chances to obtain good offspring chromosomes from
a pair of bad quality parent chromosomes.
Therefore, in the next chapter we propose a GA-based clustering technique called GCS that
proposes a new crossover operation. In the new crossover operation, GCS introduces an
opportunity for each chromosome to participate in crossover with the best chromosome.
Typically, the genetic operations such as crossover and mutation tend to improve the health/fitness of a chromosome, but they can also cause the health of some chromosomes to deteriorate. Therefore, GCS also introduces a new genetic operation called the health check in order to ensure the presence of healthy chromosomes (chromosomes with good fitness values) in a population.
Chapter 5
High-Quality Clustering through Novel Crossover,
Selection and Health Check with Low Complexity
5.1 Introduction
In this chapter, we propose a GA-based clustering technique called GCS, which is a further
improvement on the techniques proposed in the previous two chapters. In this chapter, we aim to move closer to accomplishing our research goal 1.
We now briefly introduce the novel components/properties of GCS and their logical
justifications as follows. Typically, the chromosomes in a population improve their quality
through some genetic operations such as crossover and mutation. However, the health of some
chromosomes can also deteriorate through the genetic operations. Therefore, GCS introduces a
During the PhD candidature, we have published the following paper based on this chapter
Beg, A. H. and Islam, M. Z. (2016): Genetic Algorithm with Novel Crossover, Selection and Health
Check for Clustering, In Proc. of the 24th European Symposium on Artificial Neural Networks,
Computational Intelligence and Machine Learning (ESANN 2016), Bruges, Belgium, April 27-29, 2016,
pp. 575-580. (ERA 2010 Rank B).
97
new genetic operation called the health check operation in order to ensure the presence of
healthy chromosomes in a population. The proposed technique also uses a new selection
operation in order to ensure the presence of good-quality chromosomes in a population at the
begging of each generation. GCS uses the elitist operation after each genetic operation within a
generation, in order to keep track of the best solution obtained so far.
GCS also modifies the process which selects a pair of chromosomes in a crossover operation
in order to increase the possibility of getting better quality offspring chromosomes. In the crossover operation of GMC (as presented in Chapter 4), a chromosome with a low fitness value always makes a pair with another low-quality chromosome. Therefore, GCS introduces a new crossover operation where each chromosome gets an opportunity to make a pair with the best chromosome.
We implement GCS and compare its performance with AGCUK (Y. Liu et al., 2011), GAGR
(D.-X. Chang et al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007)
and GenClust (Rahman & Islam, 2014). We compare the performance of the techniques through
two cluster evaluation criteria, namely Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-
Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979) using 15
real-life data sets that we obtain from the UCI machine learning repository (M. Lichman, 2013).
The contributions of the chapter are presented as follows:
Presentation of GCS that contains some new genetic operations;
The evaluation of GCS by comparing it with existing techniques.
The organization of the chapter is as follows: in Section 5.2, we discuss the motivation behind
the proposed technique; in Section 5.3, we present our proposed technique; the experimental
results and discussion are presented in Section 5.4, and the summary of the chapter is presented
in Section 5.5.
5.2 The Motivation Behind the Proposed Technique
The presence of good chromosomes in a population increases the possibility of getting good
quality of final clustering solution (Diaz-Gomez & Hougen, 2007; Goldberg et al., 1991;
Rahman & Islam, 2014). Therefore, it is important to ensure the presence of good-quality
chromosomes at the beginning of each generation. Hence, GCS uses a new selection operation in
order to ensure the presence of good-quality chromosomes in a population at the beginning of
each generation.
Moreover, gradual health improvement is also important for a GA to finally find a good-
quality chromosome. In each generation, GA goes through some genetic operations such as
crossover and mutation. The crossover and mutation operation can improve the health of a
chromosome, but they can also deteriorate the health of some chromosomes. Therefore, it is
important to check the chromosomes' health at the end of each generation. Hence, GCS uses a health check operation in order to find sick chromosomes, and replaces them with healthy chromosomes found in the previous generation.
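The idea of the health check can be sketched as follows; pairing the current and previous generations position by position and using a simple fitness comparison to detect "sick" chromosomes are illustrative assumptions, not the exact GCS implementation.

def health_check(curr_gen, prev_gen, fitness):
    # Replace chromosomes whose health deteriorated with their healthier
    # previous-generation counterparts.
    checked = []
    for curr, prev in zip(curr_gen, prev_gen):
        # a chromosome is considered sick if its fitness dropped
        checked.append(prev if fitness(curr) < fitness(prev) else curr)
    return checked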
GCS also modifies the process which selects a pair of chromosomes in a crossover operation
through two phases in order to increase the possibility of getting better quality offspring
chromosomes. In Phase 1, we introduce an opportunity for each chromosome to participate in
crossover with the best chromosome. We increase the opportunity to use the best chromosome
many times, for example: if the population size is twenty (20) then the best chromosome
participates in crossover with other nineteen (19) chromosomes in separate crossover
operations. Due to the use of the best chromosome many times in the crossover operation, this
phase increases the possibility of getting good-quality offspring chromosomes.
In GMC (see Chapter 4) the best chromosome only participates in crossovers with good
chromosomes. However, in GMC the chromosomes in the non-good group do not get a chance to crossover with the best chromosome. Moreover, in GCS, in Phase 2, a crossover
operation on a pair of randomly selected chromosomes through the roulette wheel approach
supports the exploration in a non-deterministic way which is a property and advantage of genetic
algorithms.
5.3 GCS: GA with Novel Crossover, Health Check and Selection for Clustering
We now introduce the main steps of the proposed technique as follows and explain each of them
in detail. Out of the following steps, Step 3, Step 4 and Step 7 are our novel contributions of this
chapter.
BEGIN
Step 1: Normalization
Step 2: Population Initialization
DO: t = 1 to I /* I = 50; I is the user defined number of iterations */
Step 3: Two Phases of Selection Operation
Step 4: Crossover Operation
Step 5: Twin Removal
Step 6: Mutation Operation
Step 7: Health Check Operation
Step 8: The Elitist Operation
END
END
Step 1: Normalization
GCS takes a data set 𝐷 as input. It first normalizes the data set 𝐷 in order to weigh each attribute
equally regardless of their domain sizes. The normalization brings the domain range of each
numerical attribute of the data set between 0 and 1. For normalization, GCS uses the same
approach of normalization that we used in DeRanClust (see Section 3.2 in Chapter 3).
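For illustration, a min-max normalization of this kind can be sketched in Python as follows; the function name normalize and the use of NumPy are our own choices for this sketch and not part of the thesis implementation.

import numpy as np

def normalize(D):
    # Min-max normalize each numerical attribute (column) of D to the range [0, 1].
    D = np.asarray(D, dtype=float)
    mins, maxs = D.min(axis=0), D.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero for constant attributes
    return (D - mins) / ranges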
Algorithm 5.1: GCS
Input: A data set D having N records and |A| attributes, where A is the set of attributes
Output: A set of clusters C
Require:
Ps ← ∅ /* Ps is the initial population (20 chromosomes), initially empty */
Po ← ∅ /* Po is the set of offspring chromosomes, initially empty */
Pm ← ∅ /* Pm is the set of mutated chromosomes, initially empty */
Pa ← ∅ /* Pa is the pool of best chromosomes of the first 20 generations, initially empty */
I = 50 /* user defined number of iterations/generations; the default value for I is 50 */
D′ ← Normalize(D) /* normalize each numerical attribute into the normalized data set D′ */
Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
Pr ← ∅ /* Pr is the set of random chromosomes (10 chromosomes), initially empty */
Hs ← ∅ /* Hs is the set of good chromosomes, initially empty */
end
Step 1: /* Population Initialization */
Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the initial seeds for K-means are chosen deterministically */
Pd ← SelectDeterministicChromosomes(Pd) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the number of initial seeds and the seeds themselves are chosen randomly */
Ps ← Ps ∪ (Pr ∪ Pd) /* insert Pr and Pd into Ps */
Pb ← FindBestChromosome(Ps) /* Pb is the chromosome that has the maximum fitness value in Ps */
end
for t = 1 to I do /* default I = 50; t is the counter of I */
Step 2: /* Two Phases of Selection Operation */
if (t > 1) then
Hs ← SelectTopChromosomes(Ps^t) /* select the top |Ps^t|/2 chromosomes of the current t-th generation based on their fitness */
Ps^t ← Ps^t − Hs /* remove Hs from Ps^t */
Ms ← MergeChromosomes(Ps^t, Ps^(t−1)) /* merge the remaining chromosomes of the t-th generation with all chromosomes of the (t−1)-th generation */
Hs ← ProbabilisticSelection(Ms) /* probabilistically select |Ps^t|/2 chromosomes from Ms */
Ps ← Ps^t ∪ Hs /* insert Hs into Ps^t */
end
Step 3: /* Two Phases of Crossover Operation */
Po ← TwoPhasesOfCrossoverOperation(Ps) /* apply the two phases of crossover on Ps and get the set of offspring chromosomes Po */
Step 4: /* Twin Removal */
Ps ← TwinRemoval(Po) /* if the length of a chromosome is > 2 and it contains two identical genes, delete one of them; if the length is 2 and the two genes are identical, change one of them */
Step 5: /* Mutation Operation */
Pb ← FindChromosomeHavingMaxFitness(Ps) /* Pb is the chromosome having maximum fitness in Ps */
Pv ← DivisionAndAbsorptionOperation(Pb) /* perform division and absorption on chromosome Pb and get the mutated chromosome Pv */
Pm ← Pm ∪ Pv /* insert Pv into Pm */
Ps ← Ps − Pb /* remove Pb from Ps */
for i = 1 to |Ps| do
Pv ← DivisionOrAbsorptionOperation(Pi) /* randomly apply either division or absorption on chromosome Pi and get the mutated chromosome Pv */
Pm ← Pm ∪ Pv /* insert Pv into Pm */
end
Step 6: /* Health Check Operation */
if (t ≤ 20) then
Pa ← Pa ∪ Pb /* store the best chromosome of each of the first 20 generations */
Fd ← CalculateAverageFitness(Pa) /* find the average fitness of the chromosomes in Pa */
else
F ← CalculateFitness(Pm) /* F = {F_1^t, F_2^t, …, F_M^t} is the set of fitness values of every chromosome in Pm of the t-th generation */
for j = 1 to |Pm| do
if (F_j^t > Fd) then
keep P_j^t in Pm /* P_j^t is a healthy chromosome */
else
Ph ← ProbabilisticSelection(Pa) /* probabilistically select a healthy chromosome from Pa */
replace P_j^t with Ph in Pm /* replace the sick chromosome P_j^t with Ph */
end
end
end
Step 7: /* Elitist Operation */
Pb ← ElitistOperation(Pm, Pb) /* apply the elitist operation on Pm and Pb and find the best chromosome Pb */
end
C ← C ∪ Pb /* insert Pb into C */
Return C
Step 2: Population Initialization
For the population initialization, GCS uses the same approach of population initialization that
we used in DeRanClust (see Section 3.2 of Chapter 3). In this chapter, we prepare an initial
population of 2 × |𝑃| chromosomes, |𝑃| chromosomes from the deterministic phase and
|𝑃| chromosomes from the random phase. In the experiments of this chapter, we use |𝑃| to be
10. In the deterministic phase, GCS selects top |𝑃| chromosomes (see Step 1 of Algorithm 5.1).
In the random phase, GCS produces |𝑃| chromosomes. Thus, GCS produces 2 ×
|𝑃| chromosomes from the two phases. It then finds the best chromosome 𝑃𝑏 from the 2 ×
|𝑃| chromosomes and stores it for the elitist operation. The fitness of each chromosome is
calculated using the Davies-Bouldin (DB) Index.
Step 3: Two Phases of Selection Operation
Starting from generation 2, GCS applies the two phases of selection operation in order to get a
new population for the next genetic operations such as crossover and mutation. In Phase 1, GCS
selects the top |𝑃| chromosomes (according to the fitness values) from 2 × |𝑃| chromosomes
of the current population.
In Phase 2, it selects |𝑃| chromosomes probabilistically from a set of 3 × |𝑃| chromosomes,
which is made of the remaining bottom |𝑃| chromosomes of the current population and 2 × |𝑃|
chromosomes from the last population of the immediate previous generation (see Step 2 of
Algorithm 5.1).
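As an illustrative sketch (not the thesis code), these two phases of selection could be implemented as follows in Python; the helper names, the fitness callable and the NumPy random generator rng (e.g. np.random.default_rng()) are assumptions of this sketch.

import numpy as np

def two_phase_selection(current, previous, fitness, rng):
    # current and previous are populations of 2*|P| chromosomes each.
    # Phase 1: keep the top half of the current population by fitness.
    ranked = sorted(current, key=fitness, reverse=True)
    half = len(current) // 2
    top, bottom = ranked[:half], ranked[half:]
    # Phase 2: probabilistically draw the other half from the bottom half of the
    # current population merged with the whole previous population (3*|P| candidates).
    pool = bottom + list(previous)
    weights = np.array([fitness(c) for c in pool])
    idx = rng.choice(len(pool), size=half, replace=False, p=weights / weights.sum())
    return top + [pool[i] for i in idx]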
Step 4: Crossover Operation
GCS performs a crossover operation on a pair of chromosomes, where each chromosome is
divided into two segments, and then the chromosomes swap their segments in order to generate
a pair of offspring chromosomes. GCS uses two phases of crossover operation. In Phase 1, it
selects 2 × |𝑃| − 1 pairs of chromosomes, where in each pair the first chromosome is always
the best chromosome of the population. All other chromosomes are chosen one by one as the
second chromosome of a pair, so each pair has a different second chromosome (see Step 1 of
Algorithm 5.2).
Algorithm 5.2: Two Phases of Crossover Operation
Input: A set of chromosomes Ps after the selection operation
Output: A set of offspring chromosomes Po
Require:
p ← ∅ /* p is a chromosome pair, initially empty */
Px ← ∅ /* Px is the set of offspring chromosomes obtained from the Phase 1 crossover, initially empty */
Py ← ∅ /* Py is the set of offspring chromosomes obtained from the Phase 2 crossover, initially empty */
pb ← FindBestChromosome(Ps) /* pb is the chromosome that has the maximum fitness value in Ps */
PT ← Ps − pb /* PT contains every chromosome of Ps except pb */
end
Step 1: /* Perform Phase 1 of the crossover operation */
while |PT| ≥ 1 do
p ← SelectChromosome(PT) /* select a chromosome from PT for crossover */
for j = 1 to 5 do /* j is the counter of crossovers per pair; the default number of crossovers is 5 */
o ← PerformPhase1Crossover(pb, p) /* after crossover between pb and p, two offspring o = {o1, o2} are generated */
Px ← Px ∪ o /* insert the offspring o = {o1, o2} into Px */
end
PT ← PT − p /* remove p from PT */
end
Px ← SelectPhase1OffspringChromosomes(Px) /* select the top |Ps|/2 offspring chromosomes from Px based on their fitness */
end
Step 2: /* Perform Phase 2 of the crossover operation */
while |Ps| ≥ 2 do
p ← SelectChromosomePair(Ps) /* select a pair of chromosomes p = {p1, p2} from Ps using the roulette wheel */
o ← PerformPhase2Crossover(p) /* after crossover between p1 and p2, two offspring o = {o1, o2} are generated */
Py ← Py ∪ o /* insert the offspring o = {o1, o2} into Py */
Ps ← Ps − p /* remove p = {p1, p2} from Ps */
end
Py ← SelectPhase2OffspringChromosomes(Py) /* select |Ps|/2 offspring chromosomes from Py based on their fitness */
Po ← Po ∪ (Px ∪ Py) /* insert Px and Py into Po */
end
Step 3: /* Return the offspring */
Return Po
end
For extensive exploration, GCS applies the crossover operation five times on each pair and
thereby generates five different pairs of offspring chromosomes. That is, it produces altogether 5 ×
(2 × |𝑃| − 1) × 2 chromosomes, from which it then selects the top |𝑃| chromosomes (see Step
1 in Algorithm 5.2). This phase increases the possibility of getting good-quality offspring
chromosomes. In order to maintain the random exploration ability, in Phase 2 it uses the
traditional roulette wheel (D. Chang et al., 2012; Maulik & Bandyopadhyay, 2000;
Mukhopadhyay & Maulik, 2009) approach for selecting a pair of chromosomes for crossover.
In Phase 2, it selects |𝑃| pairs of chromosomes and applies the traditional single point crossover.
GCS then selects |𝑃| offspring chromosomes from the |𝑃| pairs of offspring chromosomes. Thus,
from the two phases it finally produces 2 × |𝑃| offspring chromosomes.
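A compact sketch of the Phase 1 bookkeeping described above is given below in Python; crossover is assumed to be a callable returning a pair of offspring, and the function name is hypothetical.

def phase1_offspring(population, fitness, crossover, repeats=5):
    # Cross the best chromosome with every other chromosome `repeats` (default 5)
    # times, producing 5 * (2|P| - 1) * 2 offspring for a population of 2|P|.
    best = max(population, key=fitness)
    offspring = []
    for mate in (c for c in population if c is not best):
        for _ in range(repeats):
            offspring.extend(crossover(best, mate))
    # Keep the top |P| offspring, i.e. half the population size.
    return sorted(offspring, key=fitness, reverse=True)[: len(population) // 2]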
Step 5: Twin Removal
GCS uses the twin removal operation in order to remove/modify twin genes (if any) from each
chromosome. For twin removal, GCS uses the same approach of twin removal of DeRanClust
(see Section 3.2 of Chapter 3).
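A simplified Python sketch of the twin removal rule (delete a duplicate gene when the chromosome has more than two genes; perturb one gene when a length-2 chromosome has twins) is shown below; genes are assumed to be tuples of normalized attribute values, and the perturbation strategy is our assumption.

def twin_removal(chromosome, rng):
    # chromosome: list of genes (cluster seeds), each a tuple of values in [0, 1].
    unique = list(dict.fromkeys(chromosome))   # drop duplicate genes, keep order
    if len(unique) >= 2:
        return unique
    # Length-2 chromosome with identical genes: randomly change one twin.
    changed = tuple(rng.random() for _ in chromosome[0])
    return [chromosome[0], changed]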
Step 6: Mutation Operation
GCS uses the same approach of mutation operation which is used in DeRanClust (see Section
3.2 of Chapter 3 and Step 5 of Algorithm 5.1).
Step 7: Health Check Operation
GCS applies the proposed health check operation after the first 20 generations. It prepares a
pool of chromosomes 𝑃𝑎, where it stores the best chromosome of each of the first 20
generations. It then calculates the average fitness 𝐹𝑑 of the chromosomes in 𝑃𝑎. If the fitness
of a chromosome in the current population is less than 𝐹𝑑, then the chromosome is considered
sick. GCS then probabilistically selects a chromosome from 𝑃𝑎 to replace the sick
chromosome (see Step 6 of Algorithm 5.1).
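The following Python sketch mirrors this health check under the stated pool semantics; the names, the fitness callable and the NumPy generator rng are assumptions of the sketch.

import numpy as np

def health_check(population, pool, fitness, rng):
    # pool: best chromosome of each of the first 20 generations.
    avg = sum(fitness(c) for c in pool) / len(pool)
    weights = np.array([fitness(c) for c in pool])
    probs = weights / weights.sum()
    checked = []
    for c in population:
        if fitness(c) >= avg:
            checked.append(c)                                   # healthy: keep
        else:                                                   # sick: replace from the pool
            checked.append(pool[rng.choice(len(pool), p=probs)])
    return checked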
Step 8: Elitist Operation
Generally GA (D.-X. Chang et al., 2009; Y. Liu et al., 2011; Rahman & Islam, 2014) applies
the elitist operation at the end of each generation. However, GCS applies the elitist operation at
the end of each genetic operation within a generation. If the fitness of the worst chromosome
𝑃𝑤𝑖 of the 𝑖-th population (i.e. the current population) is less than the fitness of the best
chromosome 𝑃𝑏 (from all previous generations), then 𝑃𝑤𝑖 is replaced with 𝑃𝑏. Moreover, if the
fitness of the best chromosome 𝑃𝑏𝑖 of the 𝑖-th population is higher than that of 𝑃𝑏, then 𝑃𝑏 is
replaced by 𝑃𝑏𝑖.
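In Python, this variant of the elitist operation can be sketched as follows; this is an illustration under our chromosome-list representation, not the thesis code.

def elitist(population, best_so_far, fitness):
    # Replace the worst chromosome with the best found so far if it is fitter,
    # then update the best-so-far record from the current population.
    worst = min(population, key=fitness)
    if fitness(worst) < fitness(best_so_far):
        population[population.index(worst)] = best_so_far
    current_best = max(population, key=fitness)
    if fitness(current_best) > fitness(best_so_far):
        best_so_far = current_best
    return population, best_so_far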
5.4 Experimental Results and Discussion
We empirically compare the performance of our technique with five existing techniques called
AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-means (Lloyd, 1982), K-
means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam, 2014). For the
experimentation of AGCUK, GAGR, GenClust and GCS, we consider the population size to be
20. The number of generations/iterations for all techniques is set to 50 for a fair comparison.
The cluster numbers in GAGR and AGCUK are generated randomly in the range 2 to √𝑛 (𝑛 is
the number of records in a data set). We run each technique 20 times on each data set, and we
take the average result. We set the threshold value for K-means to be 0.05 and the total number
of iterations to be 50 as suggested in AGCUK.
Table 5.1: Data sets at a glance

Data set | No. of records (with missing values) | No. of records (without missing values) | No. of numerical attributes | No. of categorical attributes | Class size
Glass Identification (GI) | 214 | 214 | 10 | 0 | 7
Vertebral Column (VC) | 310 | 310 | 6 | 0 | 2
Ecoli (EC) | 336 | 336 | 8 | 0 | 8
Leaf (LF) | 340 | 340 | 16 | 0 | 36
Liver Disorder (LD) | 345 | 345 | 6 | 0 | 2
Dermatology (DT) | 366 | 358 | 34 | 0 | 6
Blood Transfusion (BT) | 748 | 748 | 4 | 0 | 2
Pima Indian Diabetes (PID) | 768 | 768 | 8 | 0 | 2
Statlog Vehicle Silhouettes (SV) | 846 | 846 | 18 | 0 | 4
Bank Note Authentication (BN) | 1372 | 1372 | 4 | 0 | 2
Yeast (YT) | 1484 | 1484 | 8 | 0 | 10
Image Segmentation (IS) | 2310 | 2310 | 18 | 0 | 7
Wine Quality (WQ) | 4898 | 4898 | 11 | 0 | 7
Page Blocks Classification (PBC) | 5473 | 5473 | 10 | 0 | 5
MAGIC Gamma Telescope (MGT) | 19020 | 19020 | 11 | 0 | 2
5.4.1 Data Sets
We apply the techniques on 15 real-life data sets as shown in Table 5.1. The data sets are
publicly available in the UCI Machine Learning Repository (M. Lichman, 2013). In the
experiments we consider only the numerical attributes of the data sets and exclude the
categorical attributes. Each data set contains a class attribute, which we remove during the
clustering process.
5.4.2 Evaluation Criteria
To compare our technique with the existing techniques two well-known evaluation criteria
namely Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach,
2005) and DB Index (D L Davies & Bouldin, 1979) are used. A smaller value of the DB Index
indicates a better clustering result, while a higher value of the Silhouette Coefficient indicates a
better clustering result.
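Assuming scikit-learn is available, both criteria can be computed from a data matrix X and a cluster label per record; this is merely one readily available implementation of the two measures, not the code used in the thesis.

from sklearn.metrics import silhouette_score, davies_bouldin_score

def evaluate_clustering(X, labels):
    # Higher Silhouette Coefficient and lower DB Index indicate better clusters.
    return {"silhouette": silhouette_score(X, labels),
            "db_index": davies_bouldin_score(X, labels)}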
5.4.3 Experimental Results on All Techniques
In this section, we compare the experimental result of the proposed technique with five existing
techniques AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-means (Lloyd,
1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam, 2014) in
order to evaluate the usefulness of the proposed technique on 15 data sets where each technique
runs 20 times on each data set.
Fig. 5.1: Silhouette Coefficient of the techniques on eight data sets
Fig. 5.2: Silhouette Coefficient of the techniques on seven data sets
Fig. 5.1 and Fig. 5.2 show the average Silhouette Coefficient of the clustering solutions,
where GCS achieves better results than all other techniques in all 15 data sets. That is, in 15 out
of 15 data sets the average Silhouette Coefficient of 20 runs of GCS is higher than the average
Silhouette Coefficient of 20 runs of AGCUK, GAGR, K-means, K-means++ and GenClust.
Fig. 5.3: DB Index of the techniques on eight data sets
Fig. 5.4: DB Index of the techniques on seven data sets
Moreover, in 14 out of 15 data sets the standard deviations of GCS do not overlap the
standard deviation of GAGR based on Silhouette Coefficient. The standard deviations of GCS
do not overlap the standard deviations of AGCUK in 11 out of 15 data sets based on Silhouette
Coefficient. The standard deviations of GCS do not overlap the standard deviations of K-means,
K-means++ and GenClust on 15 out of 15 data sets. Note that the cases where the standard
deviations of GCS overlap with the standard deviations of other techniques are indicated with
an arrow in Fig. 5.1 and Fig. 5.2. That is, for all cases without an arrow GCS achieves a better
result with no overlap of standard deviations.
As we can see in Fig. 5.3 and Fig. 5.4, GCS achieves better clustering results (on average)
than all other techniques in 15 out of 15 data sets, based on DB Index for which a lower value
indicates a better result. The standard deviations of GCS do not overlap the standard deviations
of all other techniques on 15 out of 15 data sets.
The right most columns in Fig. 5.2 and Fig. 5.4 show the average Silhouette Coefficient and
DB Index of all techniques on all data sets. GCS achieves clearly better results on average
than all other techniques without any overlapping of standard deviations.
5.4.4 Comparative Results between GCS and GMC
In this section, we compare GCS with GMC (as presented in Chapter 4) through two cluster
evaluation criteria, namely the Silhouette Coefficient and the DB Index, using the 10 real-life
data sets from the UCI machine learning repository (M. Lichman, 2013) that we used in
Chapter 4 (see Table 4.1). For each data set, we run GMC and GCS 10 times and present the
average clustering results.
As we can see in Fig. 5.5, GCS achieves better clustering results than GMC in 9 out of 10
data sets, based on Silhouette Coefficient. Fig. 5.6 shows that GCS performs better than GMC
on all 10 data sets based on DB Index. The right most columns in Fig. 5.5 and Fig. 5.6 show the
average Silhouette Coefficient and DB Index of the techniques on all data sets, respectively.
GCS achieves clearly better results on average than GMC, indicating the effectiveness of the
components of GCS.
Fig. 5.5: Comparative results between GCS and GMC based on Silhouette Coefficient
Fig. 5.6: Comparative results between GCS and GMC based on DB Index
5.4.5 An Analysis of the Impact of Various Components of GCS
In this section, we explore the effectiveness of various components of the proposed technique,
including the health check and the crossover operation. In order to evaluate the effectiveness of
various components of the proposed technique, we randomly choose five (5) data sets (PID, LD,
LF, GI and VC) as shown in Table 5.1. We run each technique 20 times on each data set and
present the average result.
5.4.5.1 An Analysis of the Impact of the Health Check Operation
We explore the effectiveness of the proposed health check operation (see Step 7 of Section 5.3).
In order to evaluate the effectiveness of the health check operation, we compare our proposed
technique with a different version of the proposed technique, which we call GCS without Health
Check. GCS without Health Check is exactly the same as GCS except that it does not
have the health check operation. We run both GCS and GCS without Health Check for 50
iterations on 5 data sets.
Table 5.2: Comparative result between GCS and GCS without Health Check

Data set | DB Index (lower the better): GCS | DB Index: GCS without Health Check | Silhouette Coefficient (higher the better): GCS | Silhouette Coefficient: GCS without Health Check
PID | 0.17 | 0.18 | 0.83 | 0.79
LD | 0.27 | 0.29 | 0.81 | 0.79
LF | 0.44 | 0.45 | 0.70 | 0.68
GI | 0.27 | 0.27 | 0.81 | 0.81
VC | 0.25 | 0.25 | 0.83 | 0.83
Average | 0.280 | 0.288 | 0.796 | 0.780
Table 5.2 shows that GCS achieves better clustering results than GCS without Health Check
in three (3) out of five (5) data sets based on both the Silhouette Coefficient and the DB Index.
The average result of GCS on the five (5) data sets is better than that of GCS without Health
Check in terms of both the Silhouette Coefficient and the DB Index.
5.4.5.2 An Analysis of the Impact of the Crossover Operation
We also explore the effectiveness of the proposed crossover operation (see Step 4 of Section
5.3). In order to evaluate the effectiveness of the crossover operation, we compare our proposed
technique with a different version of the proposed technique, which we call GCS with
Traditional Crossover. In GCS with Traditional Crossover, we replace the proposed crossover
operation of GCS with the traditional (single point) crossover. We run
both GCS and GCS with Traditional Crossover for 50 iterations on 5 data sets. We can see in
Table 5.3 that, on average, GCS achieves better clustering results than GCS with Traditional
Crossover based on both the Silhouette Coefficient and the DB Index.
Table 5.3: Comparative result between GCS and GCS with Traditional Crossover

Data set | DB Index (lower the better): GCS | DB Index: GCS with Traditional Crossover | Silhouette Coefficient (higher the better): GCS | Silhouette Coefficient: GCS with Traditional Crossover
PID | 0.17 | 0.18 | 0.83 | 0.86
LD | 0.27 | 0.38 | 0.81 | 0.73
LF | 0.44 | 0.50 | 0.70 | 0.65
GI | 0.27 | 0.30 | 0.81 | 0.82
VC | 0.25 | 0.25 | 0.83 | 0.83
Average | 0.280 | 0.322 | 0.796 | 0.778
5.4.6 An Analysis of the Improvement in Chromosomes over the Iterations
In Fig. 5.7, we present the grand average fitness (in terms of the DB Index, where Fitness =
1/DB Index (David L. Davies & Bouldin, 1979)) of the best chromosome over 20 runs of GCS
on the PID data set. We run GCS 20 times and then present the grand average fitness of the 20
runs. The grand average fitness is plotted against the iterations. Fig. 5.7 shows the gradual
improvement of the best chromosome over the iterations.
Fig. 5.7: Average fitness (best chromosome) versus Iteration of 20 runs on PID data set
In Fig. 5.8, we present the average fitness of all chromosomes (20 chromosomes in a
population) of GCS and AGCUK on PID data set. Both GCS and AGCUK use the same fitness
function (DB Index) to calculate the fitness of a chromosome. Fig. 5.8 shows that the average
fitness over 20 runs of all chromosomes of GCS is always higher than the average fitness over
20 runs of all chromosomes of AGCUK, clearly indicating the effectiveness of various components of
GCS including health check.
Fig. 5.8: Average fitness (all chromosomes) versus Iterations. Each line represents the average fitness of 20 runs on PID data
set
5.5 Summary
In this chapter, we propose a GA-based clustering technique called GCS. The proposed
technique also uses a new selection operation in order to ensure the presence of good-quality
chromosomes in a population at the beginning of each generation. It also modifies the process
which selects a pair of chromosomes in a crossover operation in order to increase the possibility
of getting better quality offspring chromosomes. GCS also uses a health check operation in order
to maintain the chromosome health in a population. It also uses the elitist operation after each
genetic operation within a generation, in order to keep track of the best solution obtained so far.
We evaluate the proposed technique by comparing its performance with the performance of
five existing techniques namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al.,
2009), GenClust (Rahman & Islam, 2014), K-means (Lloyd, 1982) and K-means++ (Arthur &
112
Vassilvitskii, 2007). Two evaluation criteria called Silhouette Coefficient (Pang-Ning Tan,
Michael Steinbach, 2005) and DB index (D L Davies & Bouldin, 1979) are used.
We run each technique 20 times on each data set, and present the average clustering results
of 20 runs and standard deviation of the average clustering result. The experimental results show
that GCS performs better than all other techniques on all 15 data sets based on DB Index without
any overlapping of standard deviations. GCS achieves better results than AGCUK in all 15 data
sets based on Silhouette Coefficient. The standard deviations of GCS do not overlap the standard
deviations of AGCUK in 11 out of 15 data sets based on Silhouette Coefficient. GCS performs
better than GAGR in all 15 data sets based on Silhouette Coefficient. In 14 out of 15 data sets
the standard deviations of GCS do not overlap with the standard deviations of GAGR based on Silhouette
Coefficient. GCS also achieves higher Silhouette Coefficient than K-means, K-means++ and
GenClust in all 15 data sets. In 15 out of 15 data sets the standard deviations of GCS do not
overlap with the standard deviations of any of these techniques.
We also present the average Silhouette Coefficient and DB Index of all techniques on all data
sets. The results show that GCS achieves clearly better results on average than all other
techniques without any overlapping of the standard deviations. We also compare GCS with GMC
(as presented in Chapter 4). The empirical results indicate that GCS achieves clearly better
results than GMC based on two cluster evaluation criteria.
We also empirically evaluate the effectiveness of the proposed components: health check and
crossover operation. The experimental results indicate that the proposed crossover and health
check operations have a positive influence in improving clustering results. However, the health
check operation of GCS has a drawback. GCS applies the health check operation from the 21st
iteration onward. It keeps collecting the best chromosome of each iteration for the first 20
iterations. This pool of 20 chromosomes is then used in the health check operation from the 21st
iteration. Any chromosome having lower fitness than the average fitness of the pool is replaced
by a chromosome probabilistically selected from the pool.
The problem is that the chromosomes in the pool never change during the iterations. The
chance is high that chromosomes in later iterations (such as the 40th iteration or so) have better
fitness than the average fitness of the pool. Hence, the health check operation may become
ineffective. Moreover, even if there is a chromosome with low fitness in a later iteration (say
the 40th iteration) that needs to be replaced, the replacement of the chromosome by a
chromosome from the pool may not be very effective. This is because the quality of the
chromosomes in the pool is unlikely to be good enough to be useful for a later
generation/iteration.
Therefore, in the next chapter we propose a new clustering technique called HeMI that uses
a new health check operation in each iteration in order to find the sick chromosomes and replace
them with healthy chromosomes. HeMI also improves some other components including the
population initialization and mutation in order to improve the clustering quality.
In addition, from the literature (Pourvaziri & Naderi, 2014; Straßburg et al., 2012), we realize
that bigger population size plays a positive role in achieving better clustering solutions.
However, a bigger population size typically requires higher execution time. Therefore, HeMI
uses multiple streams to facilitate the maintenance of a low execution time while using a bigger
population size. Moreover, it uses the bigger population in such a way that it produces better
clustering solutions than simply using a bigger population naively. We discuss these in detail
in the next chapter.
Chapter 6
GA with Multiple Streams and Neighbor Information
Sharing for Clustering
6.1 Introduction
In this chapter, we propose a GA-based clustering technique called HeMI which is a further
improvement on the techniques proposed in the previous chapters. We in this chapter achieve
our first research goal.
We now briefly introduce the novel components/properties of HeMI and their logical
justifications as follows. It is evident from the literature (Pourvaziri & Naderi, 2014; Straßburg
et al., 2012) and through our empirical analysis (carried out in this chapter) that the population
size has a positive impact on the clustering quality. That is, a big population size is likely to
contribute towards a good clustering solution. However, a big population size requires high
execution time. Therefore, HeMI uses a big population in multiple streams, where each stream
contains a relatively small number of chromosomes, and thus facilitates a low execution time
since the streams are suitable for parallel processing when necessary.

During the PhD candidature, we have published the following paper based on this chapter with
the PhD supervisors: Beg, A. H., Islam, M. Z., and Estivill-Castro, V. (2016): Genetic Algorithm
with Healthy Population and Multiple Streams Sharing Information for Clustering,
Knowledge-Based Systems, 114 (2016) 61-78. (ABDC 2016 Rank A, SJR 2016 Rank Q1,
5-Year Impact Factor: 3.433, H Index 63).
Various genetic operations (such as crossover and mutation) can be applied on each stream in
parallel. As a result, HeMI is likely to produce better quality clustering solutions. Moreover, by
splitting the chromosomes into a number of streams and processing the splits separately, HeMI
exhibits a higher ability to explore the solution space than the traditional approach of
processing all chromosomes in a single stream. We present empirical evidence of this
phenomenon where we use a single stream of 20 chromosomes, 40 chromosomes and 80
chromosomes, and four streams of 20 chromosomes.
Note that there are some existing techniques that use parallel genetic algorithms (Kumar,
Mills, Hoffman, & Hargrove, 2011; Y. Y. Liu & Wang, 2015; Moore, 2004; Straßburg et al.,
2012) where they divide the total number of chromosomes into a number of parallel runs,
whereas in our technique we increase the total number of chromosomes. The main goal of these
existing techniques is to reduce the time complexity through the parallelization of the genetic
algorithms, whereas the main goal of HeMI is to improve clustering results. While employing
parallelization these existing techniques do not share information among the parallel streams,
whereas HeMI introduces information sharing among the streams at a regular interval in order
to take advantage of the multiple streams.
For a stream 𝑆𝑖, HeMI first identifies its neighboring streams and then spots out the best
chromosome from all neighboring streams and 𝑆𝑖. It then replaces the worst chromosome of 𝑆𝑖
by the best chromosome. The information sharing is carried out at a regular interval such as at
every 10th iteration.
For Stream 1, Stream 2 and Stream 3 are considered to be neighbors. Similarly for Stream 2,
Stream 3 and Stream 4 are considered to be neighbors. While the sharing of the best
chromosome from the neighbors increases the fitness of the best chromosome, it maintains the
divergence among the streams. That is, had HeMI used/inserted the best chromosome out of all
streams into all streams then they would have the same best chromosome in all streams.
Similar to DeRanClust (Chapter 3), GMC (Chapter 4) and GCS (Chapter 5), HeMI also builds
a high-quality initial population with a low complexity of 𝑂(𝑛) in two phases: a deterministic
phase and a random phase. Including HeMI, all these techniques produce 45 chromosomes in
the deterministic phase and select the top |𝑃|/2 chromosomes for the initial population, where
|𝑃| (set to 20 in our experiments) is the number of chromosomes in a population. However, in
the random phase all these techniques except HeMI generate only |𝑃|/2 (i.e. 10) chromosomes.
We realize that, similar to the deterministic phase, we can also increase the possibility of getting
good-quality chromosomes through the random phase. Therefore, in HeMI we generate the same
number of chromosomes (45) in the random phase and then select the top |𝑃|/2 chromosomes.
The presence of healthy chromosomes (i.e. chromosomes with high fitness values) in a
population can increase the possibility of good clustering results. Hence, HeMI replaces the sick
chromosomes (i.e. chromosomes with low fitness) by healthy chromosomes. GCS (as presented
in Chapter 5) also uses a health check operation that finds sick chromosomes in a population,
and probabilistically replaces them with healthy chromosomes found in the previous 20
generations. GCS applies the health check operation after 20 generations. However, we
empirically find that the chromosomes and best chromosome in a population improve their
quality over the iterations (see Fig. 6.6, Fig. 6.7 and Fig. 6.8). Hence, GCS’s approach of using
the pool of best chromosomes obtained from the first twenty iterations may not be effective for
health improvement in later iterations such as the 40th iteration.
Hence, HeMI uses a new health check operation where some of the healthy chromosomes
are chosen from a pool of healthy chromosomes obtained by the initial population operation,
whereas some of the healthy chromosomes are generated through the crossover operation of the
existing healthy chromosomes of a generation with the hope that the crossover of two healthy
chromosomes may generate new healthy chromosomes.
HeMI uses a three-step mutation operation, which applies a division and an absorption
operation in sequence only if they improve the quality of the clustering solution. Additionally, at the
end of the division and absorption operation, it also applies a random change in chromosomes.
Unlike HeMI, an existing technique (Y. Liu et al., 2011) applies either division or absorption
randomly. Another existing technique (Agustín-Blas et al., 2012) applies division (they call it
splitting) on the largest cluster instead of the sparsest cluster, and absorption on two randomly
chosen clusters. Another existing technique (D. Chang et al., 2012) applies division and
absorption on randomly chosen clusters. Hence, HeMI has a better approach that carefully improves
cluster quality through mutation while exploring unconventional solution space. HeMI also
maintains randomness through the noise-based selection and crossover operations in order to
explore the solution space.
We evaluate our technique by comparing its performance with the performance of five high-
quality techniques namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-
means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman &
Islam, 2014). We conduct experiments for the techniques on twenty (20) real-life data sets that
are available in the UCI machine learning repository (M. Lichman, 2013). The experimental
results clearly indicate that the proposed technique performs significantly better than other
techniques in terms of the evaluation criteria considered in this chapter: Silhouette Coefficient
(Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies
& Bouldin, 1979). We also experimentally evaluate the usefulness of various components of
HeMI.
The main contributions of HeMI are as follows:
The use of multiple streams (see Component 2 in Section 6.2.2).
The three steps mutation operation (see Component 7 in Section 6.2.2).
The Health improvement operation (see Component 8 in Section 6.2.2).
Neighbor information sharing (see Component 10 in Section 6.2.2).
The Global Best Selection operation (see Component 11 in Section 6.2.2).
HeMI works on data sets having numerical and/or categorical attributes.
The rest of the chapter is organized as follows: in Section 6.2 we describe our proposed
technique. We discuss the experimental results in Section 6.3 and in Section 6.4 we present the
summary of the chapter.
6.2 HeMI: Healthy Population and Multiple Streams Sharing Information in a GA for
Clustering
6.2.1 Basic Concepts
One of the most important components of HeMI is multiple streams. It is evident from the
relevant literature (Pourvaziri & Naderi, 2014; Straßburg et al., 2012) that in genetic algorithm
based clustering techniques a bigger population size tends to increase the quality of the final
clustering solution. Therefore, we realize that a population size of 80 chromosomes is more
likely to produce better clustering solutions than a smaller population size such as 20
chromosomes. In this study, we carry out an empirical analysis on this (presented in Section
6.3.7.1) where we can see the improvement of clustering quality with the increase of population
size in the existing genetic algorithm based clustering techniques AGCUK (Y. Liu et al.,
2011) and GenClust (Rahman & Islam, 2014).
An obvious issue related to the increase of population size is the increased execution time.
Therefore, HeMI uses multiple streams, where each stream contains a relatively small number
of chromosomes, and thus can facilitate managing a low execution time since they are suitable
for parallel processing if necessary. Another advantage of using the multiple streams is a better
exploration of clustering solutions. That is, if we run all chromosomes in a single stream (like
traditional techniques) then we get one best chromosome from the whole population, whereas
if we divide the chromosomes in multiple streams and run them independently then we get
multiple best chromosomes; one best chromosome from each stream. We naturally expect this
approach to produce better clustering solutions. The empirical analysis carried out in this study
(presented in Section 6.3.7.1) also supports the expectation.
While running multiple streams independently we also make them help each other in
achieving better clustering solutions. That is, the independent streams can exchange messages at
a regular interval in order to increase the clustering quality of each stream. One way we could
do this is by identifying the best chromosome out of all streams and implanting the chromosome
in each stream. However, in that case, all streams would have the same best chromosome and
would lose the diversity among them.
Therefore, for each stream, we first identify a set of neighboring streams and then identify
the best chromosome within the neighboring streams which is then implanted into the stream.
This way we can ensure that all streams will not have the same best chromosome in them. Our
empirical analysis again shows (presented in Section 6.3.7.1) a clear evidence that the use of
multiple streams with sharing information among the neighboring streams results in better
clustering solutions.
Another interesting idea of HeMI is the continuous health improvement in every generation
in order to ensure the presence of high-quality chromosomes in each population. In each
population, it identifies a number of sick chromosomes and replaces them by healthy
chromosomes. Some of the healthy chromosomes are obtained from the pool of high-quality
chromosomes created for the initial population using K-means/K-means++ many times.
Moreover, some of the healthy chromosomes are created by applying the crossover operation
on pairs of good chromosomes. Again our empirical analysis indicates the effectiveness of the
health improvement as presented in Section 6.3.7.4.
The mutation operation generally changes some chromosomes randomly (Agustín-Blas et al.,
2012; D. Chang et al., 2012; Rahman & Islam, 2014). However, HeMI aims to use the mutation
operation for improving the chromosome health while changing them randomly. The mutation
operation in HeMI has three components: division, absorption, and random change. In the
division operation, it examines whether dividing the sparsest cluster into two separate clusters
can improve the chromosome health. Similarly in the absorption operation, it examines whether
the chromosome health can be improved by merging the two closest clusters. After the division
and absorption operation, it finally makes a slight change randomly. The effectiveness of this
mutation operation has been empirically analysed in Section 6.3.7.3.
6.2.2 Main Steps
In this subsection, we introduce the main components and steps of HeMI before we present the
complete algorithm of HeMI in the next subsection. Out of the following components,
Component 2, Component 7, Component 8, Component 10, and Component 11 are our novel
contributions of this chapter.
BEGIN
Step 1: Normalization
DO: k = 1 to m /* m is the user defined number of streams */
Step 2: Population Initialization
END
DO: j = 1 to G /* G is the user defined number of intervals */
DO: k = 1 to m /* m is the user defined number of streams */
DO: t = 1 to I /* I = 10; I is the user defined number of iterations */
Step 3: Noise-Based Selection
Step 4: Crossover Operation
Step 5: Twin Removal
Step 6: Three Steps Mutation Operation
Step 7: Health Improvement Operation
Step 8: Elitist Operation
END
END
Step 9: Neighbor Information Sharing
END
Step 10: Global Best Selection
END
Component 1: Normalization
Numerical Attributes:
To normalize the numerical attributes, HeMI uses the same approach of normalization that we
used in DeRanClust (see Section 3.2 in Chapter 3).
Categorical Attributes:
For a categorical attribute, HeMI uses an existing technique (H. Giggins & Brankovic, 2012) to
compute the similarity 𝑠 between two values of the categorical attribute. The distance between
two values of a categorical attribute is 𝑑 = 1 − 𝑠. The similarity 𝑠 varies between 0 and 1 and
hence the distance 𝑑 also varies between 0 and 1. As a result, the distance between any two
records varies between 0 and 1 and all attributes have equal weight in the distance calculation.
Algorithm 6.1: HeMI
Input: A data set D having N records and |A| attributes, where A is the set of attributes
Output: A set of clusters C
Require:
Ps ← ∅ /* Ps is the initial population (20 chromosomes), initially empty */
Po ← ∅ /* Po is the set of offspring chromosomes, initially empty */
Pm ← ∅ /* Pm is the set of mutated chromosomes, initially empty */
Pc ← ∅ /* Pc is the set of healthy chromosomes, initially empty */
D′ ← Normalize(D) /* normalize each attribute of the data set into the normalized data set D′ */
Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
Pr ← ∅ /* Pr is the set of random chromosomes (45 chromosomes), initially empty */
end
for k = 1 to m do /* m is the user defined number of streams; default m = 4, k is the counter of m */
Step 1: /* Population Initialization */
Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the initial seeds for K-means are chosen deterministically */
Pd ← SelectDeterministicChromosomes(Pd) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the number of initial seeds and the seeds themselves are chosen randomly */
Pr ← SelectRandomChromosomes(Pr) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
Ps ← Ps ∪ (Pr ∪ Pd) /* insert Pr and Pd into Ps */
Pb ← FindBestChromosome(Ps) /* Pb is the chromosome that has the maximum fitness value in Ps */
end
for j = 1 to G do /* G is the number of intervals of the total number of iterations; default G = 5, j is the counter of G */
for k = 1 to m do /* m is the user defined number of streams; default m = 4, k is the counter of m */
for t = 1 to I do /* I is the user defined number of iterations per interval; default I = 10, t is the counter of I */
Step 2: /* Noise-Based Selection */
if (t > 1) then
Ps ← NoiseBasedSelection(Ps^t, Ps^(t−1)) /* perform noise-based selection between the current (Ps^t) and previous (Ps^(t−1)) generation */
end
Step 3: /* Crossover Operation */
Po ← PerformCrossover(Ps) /* perform single point crossover on Ps and get a set of offspring chromosomes Po */
Step 4: /* Twin Removal */
Po ← TwinRemoval(Po) /* perform twin removal on Po and get a set of chromosomes Po */
Step 5: /* Three Steps Mutation Operation */
for each chromosome P in Po do
Pv ← DivisionOperation(P) /* perform the division operation on P and get chromosome Pv */
Pv ← AbsorptionOperation(Pv) /* perform the absorption operation on Pv and get chromosome Pv */
Pv ← RandomChangeOperation(Pv) /* perform the random change operation on Pv and get chromosome Pv */
Pm ← Pm ∪ Pv /* insert Pv into Pm */
end
Step 6: /* Health Improvement Operation */
Px ← Phase1(Pm) /* select the 10 best chromosomes from Pm based on their fitness */
Py ← Phase2(Pm) /* select the 4 best chromosomes from Pm, perform single point crossover and get the offspring chromosomes Py */
Pz ← Phase3(Pd) /* select the 6 best chromosomes from Pd, perform random mutation and get the chromosomes Pz */
Pc ← Pc ∪ (Px ∪ Py ∪ Pz) /* insert Px, Py and Pz into Pc */
Step 7: /* Elitist Operation */
Pb^k ← ElitistOperation(Pc, Pb) /* apply the elitist operation on Pc and Pb and find the best chromosome Pb^k */
Pg ← Pg ∪ Pb^k /* insert Pb^k into Pg; Pg is the set that contains the best chromosome of each stream */
end
end
Step 8: /* Neighbor Information Sharing */
for k = 1 to m do /* default m = 4 */
Pw^k ← FindWorstChromosome(Pc) /* find the worst chromosome Pw^k in Pc */
Pw^k ← ReplaceWithNeighborBestChromosome(Pg) /* replace Pw^k with the best chromosome of its neighbors */
Pb^k ← FindLocalBestChromosome(Pc) /* find the local best chromosome */
Lb ← Lb ∪ Pb^k /* insert Pb^k into Lb; Lb is the set that contains the best chromosome of each stream k */
end
end
Step 9: /* Global Best Selection */
C ← FindGlobalBestChromosome(Lb) /* find the global best chromosome C from Lb */
Return C
end
Component 2: Multiple Streams
This component is an original/new contribution of HeMI that aims to take advantage of using a
big population through multiple streams where each stream contains a relatively small number
of chromosomes. Generally, in the genetic algorithm based clustering techniques, the bigger
population size tends to increase the quality of final clustering solutions (Pourvaziri & Naderi,
2014; Straßburg et al., 2012). Therefore, HeMI aims to use a big population in order to produce
better clustering solutions. The chromosomes for each stream are generated separately through
the population initialization. Various components such as crossover and mutation are applied
on each stream separately.
Component 3: Population Initialization
HeMI selects high-quality chromosomes in the initial population through two phases: a
deterministic phase and a random phase.
Deterministic Phase
HeMI uses the same approach of deterministic phase that we used in DeRanClust (see Section
3.2 of Chapter 3).
Random Phase
In the random phase, HeMI generates 45 chromosomes. For each chromosome, it randomly
generates the 𝑘 value between 2 and √𝑛 (𝑛 is the number of records in a data set) and then
randomly picks 𝑘 records to form the 𝑘 genes of the chromosome.
In the experiments of this chapter, we use 20 chromosomes in the population of a generation.
Therefore, HeMI chooses the best 10 chromosomes from the 45 chromosomes generated in the
deterministic phase and the best 10 chromosomes from the 45 chromosomes generated in the
random phase. The best chromosome out of the 20 chromosomes of the initial population is
stored separately as the best chromosome which is then used in the elitist operation later on. The
DB Index (D L Davies & Bouldin, 1979) is used by default as the fitness function of the
chromosomes throughout all steps in HeMI.
Component 4: Noise-based Selection
At the beginning of each generation starting from Generation 2, we carry out the Noise Based
Selection (Y. Liu et al., 2011) in order to get a new population for subsequent genetic operations
such as crossover and health improvement. HeMI uses the same approach of noise-based
selection that we used in DeRanClust (see Section 3.2 in Chapter 3).
Component 5: Crossover Operation
HeMI performs a crossover operation on a pair of chromosomes where the chromosomes swap
their segments/genes to each other and generate a pair of offspring (Agustín-Blas et al., 2012;
D. Chang et al., 2012; Rahman & Islam, 2014). Typically, there are many selection criteria,
such as the roulette wheel (D. Chang et al., 2012; Maulik & Bandyopadhyay, 2000; Mukhopadhyay &
Maulik, 2009), rank-based wheel (Agustín-Blas et al., 2012) and random selection (D.-X. Chang
et al., 2009) that are used to select a chromosome pair for a crossover operation.
In HeMI, the best chromosome (which is currently available in the population) is chosen as
the first chromosome of the pair. The second chromosome of the pair is chosen using the roulette
wheel approach, where a chromosome $P_j$ is chosen with probability $T_j = f_j / \sum_{i=1}^{|P|} f_i$.
Here, $f_j$ is the fitness of the chromosome $P_j$ and $|P|$ is the size of the current population. Once a pair of
chromosomes is chosen it is removed from the current population. For the selection of the next
pair, again the new best chromosome is chosen. The second chromosome of the pair is chosen
using the same process described above. The intuition behind the roulette wheel selection is to
take a non-deterministic approach with high probability of choosing a pair of good
chromosomes.
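The pairing scheme just described can be sketched in Python as follows; the fitness callable and the NumPy generator rng are assumptions of the sketch.

import numpy as np

def select_pairs(population, fitness, rng):
    # First member of each pair: the best chromosome still available.
    # Second member: drawn by roulette wheel with probability T_j = f_j / sum_i f_i.
    pool = list(population)
    pairs = []
    while len(pool) >= 2:
        best = max(pool, key=fitness)
        pool.remove(best)
        weights = np.array([fitness(c) for c in pool])
        mate = pool[rng.choice(len(pool), p=weights / weights.sum())]
        pool.remove(mate)
        pairs.append((best, mate))
    return pairs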
There are many approaches to perform crossover between a pair of chromosome such as
single-point (Sanghamitra Bandyopadhyay & Maulik, 2002; D. Chang et al., 2012; Garai &
Chaudhuri, 2004; Michael Laszlo & Mukherjee, 2007; Pakhira et al., 2005; Peng et al., 2014;
Rahman & Islam, 2014; Song et al., 2009), multi-point (Agustín-Blas et al., 2012), arithmetic
(Yan et al., 2012) and path-based crossover (D.-X. Chang et al., 2009). However, in the genetic
algorithm a single-point crossover is very commonly used. There are many existing techniques
(Sanghamitra Bandyopadhyay & Maulik, 2002; D. Chang et al., 2012; Garai & Chaudhuri,
2004; Michael Laszlo & Mukherjee, 2007; Pakhira et al., 2005; Peng et al., 2014; Rahman &
Islam, 2014; Song et al., 2009) that use single-point crossover.
Moreover, Peng et al. (2014) experimentally demonstrate that a single-point crossover
performs better than a multi-point crossover. Therefore, in HeMI a single-point crossover is
used where it randomly generates a crossover point for each chromosome of the pair in order to
divide a chromosome into two segments and then swaps the segments between the
chromosomes.
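For variable-length chromosomes represented as lists of cluster seeds, the single-point crossover can be sketched as follows; this is an illustration under that representation, not the thesis code.

def single_point_crossover(p1, p2, rng):
    # Cut each parent at a random point (keeping at least one gene per segment)
    # and swap the tails between the parents. Assumes each parent has >= 2 genes.
    c1 = rng.integers(1, len(p1))
    c2 = rng.integers(1, len(p2))
    return p1[:c1] + p2[c2:], p2[:c2] + p1[c1:]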
Component 6: Twin Removal
HeMI uses the twin removal operation in order to remove/modify twin genes (if any) from each
chromosome. For twin removal, HeMI uses the same approach of twin removal of DeRanClust
(see Section 3.2 of Chapter 3).
Component 7: Three Steps Mutation Operation
This is another new contribution of HeMI that changes a chromosome using three
steps/operations: division, absorption, and a random change. Note that the division and
absorption operations are also used in some existing techniques (Agustín-Blas et al., 2012; D.
Chang et al., 2012; Y. Liu et al., 2011), but there are differences between them and HeMI as
follows.
Chang et al. (D. Chang et al., 2012) apply both division and absorption on a chromosome
where clusters are chosen randomly for division and absorption, unlike HeMI which carefully
chooses clusters for division and absorption. Blas et al. (Agustín-Blas et al., 2012) also choose
a cluster for division carefully, where the largest cluster is chosen for the division. We argue
that a large cluster can also be compact and may not always require being divided into smaller
clusters. Therefore, HeMI applies division on the sparsest (not the largest) cluster of a chromosome.
Moreover, Blas et al. (Agustín-Blas et al., 2012) choose two random clusters for absorption,
whereas HeMI chooses the two closest clusters for absorption.
Liu et al. (Y. Liu et al., 2011) also choose the sparsest cluster for the division and the two closest
clusters for absorption. However, they randomly apply either division or absorption on a
chromosome, regardless of the improvement of its fitness. In contrast, HeMI applies both division
and absorption on a chromosome. Division/absorption is applied only if it improves the fitness
of a chromosome. Additionally after the division and absorption operation, HeMI also applies
the random change operation on a chromosome based on a mutation probability in order to
support the exploration nature of genetic algorithms.
In the division operation, HeMI identifies the sparsest cluster $C_j$ of a chromosome $P_j$ and
then divides the cluster $C_j$ into two clusters by applying K-means on $C_j$ with the value of
$k$ set to 2. If the fitness of the chromosome after division, $P_{j,d}$, is better than the fitness
of the chromosome $P_j$ then $P_{j,d}$ is selected; otherwise $P_j$ is selected for the absorption
operation. The absorption operation finds the two closest clusters $C_i$ and $C_j$ of the
chromosome $P_j$ or $P_{j,d}$ (whichever is selected from the division operation), and merges
them into one cluster, thus forming a new chromosome $P_{j,a}$. If the fitness of $P_{j,a}$ is
better than the fitness of $P_j$ and $P_{j,d}$ then $P_{j,a}$ is selected; otherwise either $P_j$ or
$P_{j,d}$ (whichever is selected from the division operation) is selected for the random change
operation.
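The fitness-guarded sequence of division then absorption can be sketched in Python as below; all helper callables (sparsest, split_in_two, closest_pair, merge) are assumptions standing in for the operations described above.

def divide_then_absorb(chromosome, fitness, sparsest, split_in_two, closest_pair, merge):
    # Division: split the sparsest cluster in two (e.g. K-means with k = 2);
    # keep the divided chromosome only if its fitness improves.
    divided = split_in_two(chromosome, sparsest(chromosome))
    best = divided if fitness(divided) > fitness(chromosome) else chromosome
    # Absorption: merge the two closest clusters; again keep only on improvement.
    i, j = closest_pair(best)
    absorbed = merge(best, i, j)
    return absorbed if fitness(absorbed) > fitness(best) else best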
Once the division and absorption operations for all chromosomes of a population are
completed, the random change operation is carried out. In the random change operation, the
mutation probability of each chromosome is computed. The mutation probability of a
chromosome $P_j$ is calculated using its fitness $f_j$, and the maximum fitness value $f_{max}$ and
average fitness value $f_{mean}$ of all chromosomes in the current population. The mutation
probability (D.-X. Chang et al., 2009; Srinivas & M. Patnaik, 1994) of the $j$-th chromosome is
calculated as follows, where $k_1$ and $k_2$ are equal to 0.5.

$$M_j = \begin{cases} k_1 \cdot \dfrac{f_{max} - f_j}{f_{max} - f_{mean}} & \text{if } f_j > f_{mean}, \\ k_2 & \text{if } f_j \le f_{mean} \end{cases} \qquad \text{(Eq. 6.1)}$$

The intuition behind this is to reduce the amount of random change on good chromosomes. The
$M_j$ value for the chromosome having the best fitness is zero. The $M_j$ value increases for
chromosomes with lower fitness values, and is 0.5 for all chromosomes whose fitness is at most
the average fitness.
If the mutation probability of a chromosome is greater than a random number (between 0 and
1) then the chromosome is selected for the random change operation, otherwise, it remains
unchanged. In the random change operation, a gene of the chromosome is randomly chosen
where an attribute value of the gene is randomly changed to another value within its domain.
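A direct Python transcription of Eq. 6.1 together with the guarded random change might look as follows; the gene representation (lists of attribute values in [0, 1]) and the NumPy generator rng are assumptions of this sketch.

def mutation_probability(f_j, f_max, f_mean, k1=0.5, k2=0.5):
    # Eq. 6.1: fitter-than-average chromosomes are mutated less often.
    if f_j > f_mean:
        return k1 * (f_max - f_j) / (f_max - f_mean)
    return k2

def maybe_random_change(chromosome, m_j, rng):
    # Apply the random change only if m_j exceeds a uniform random number.
    if m_j > rng.random():
        g = rng.integers(len(chromosome))      # random gene
        a = rng.integers(len(chromosome[g]))   # random attribute of that gene
        chromosome[g][a] = rng.random()        # new value within the normalized domain
    return chromosome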
Component 8: Health Improvement Operation
This is an original contribution of HeMI. The aim of this component is to continuously improve
the health of chromosomes in every generation in order to ensure the presence of high-quality
chromosomes within a population. Crossover and mutation operations are likely to improve
health/fitness of some chromosomes, but they can also decrease health/fitness of some
chromosomes. Therefore, after the crossover and mutation operations, HeMI identifies sick
chromosomes and replaces them through three phases.
In Phase 1, HeMI identifies the healthy and sick chromosomes. It sorts the chromosomes in
descending order of their fitness values and identifies 50% of the chromosomes to be sick. For
example, if there are 20 chromosomes in a population (i.e. population size = 20) then it identifies
the 10 chromosomes with the lowest fitness to be sick and the others to be healthy. The sick chromosomes are
then removed from the population. So the population size temporarily decreases to 50% where
all of them are considered to be healthy. In the following two phases 50% new chromosomes
are added to bring the population size back to 100%.
In Phase 2, HeMI generates 20% new chromosomes i.e. if the original population size is 20
then it generates 4 chromosomes. For this, it first picks the healthiest 20% chromosomes from
the set of healthy chromosomes found in Phase 1. Applying the same approach of Component
5 it then chooses pairs of chromosomes from these 20% healthy chromosomes. It next applies
the crossover operation on each pair in order to generate offspring chromosomes which are then
added into the population. Hence, at this stage the population size is back to 70% of the original
size.
In Phase 3, HeMI adds the remaining 30% chromosomes in the population. These
chromosomes are chosen from the pool of chromosomes that was obtained through the
deterministic phase of Component 3 which are supposed to be healthy chromosomes due to the
use of K-means/K-means++. Moreover, in this phase HeMI chooses the best chromosomes of
the pool. For each of these chromosomes HeMI then randomly changes an attribute value of a
gene within its original domain. These chromosomes are then added into the population to bring
the population size back to 100%.
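Putting the three phases together for a population of size n (e.g. 20) gives the following Python sketch; the crossover and random_change callables follow the description above, while the exact pairing of parents is a simplifying assumption.

def health_improvement(population, init_pool, fitness, crossover, random_change):
    n = len(population)
    ranked = sorted(population, key=fitness, reverse=True)
    healthy = ranked[: n // 2]                       # Phase 1: keep the healthiest 50%
    parents = ranked[: n // 5]                       # Phase 2: cross the healthiest 20%
    offspring = []
    for a, b in zip(parents[0::2], parents[1::2]):
        offspring.extend(crossover(a, b))
    offspring = offspring[: n // 5]
    remaining = n - len(healthy) - len(offspring)    # Phase 3: ~30% from the initial pool
    refreshed = [random_change(c) for c in
                 sorted(init_pool, key=fitness, reverse=True)[:remaining]]
    return healthy + offspring + refreshed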
Fig. 6.1: Flowchart of HeMI algorithm
Component 9: Elitist Operation
The Elitist operation keeps track of the best chromosome throughout the generations in order to
ensure the continuous improvement of the quality of the best chromosome found so far over the
iterations. For finding the best chromosome, HeMI uses the same approach of elitist operation
that we used in DeRanClust (see Section 3.2 of Chapter 3).
Component 10: Neighbor Information Sharing
This is a new contribution of HeMI where neighboring streams share/exchange the best
chromosome among them, at a regular interval such as at every 10th generation. If a stream
somehow suffers from the low quality of its best chromosome then it gets an opportunity to
borrow the best chromosome from its neighboring streams.
For a stream $S_i$, HeMI first identifies its neighboring streams. The streams $S = \{S_1, S_2, \ldots, S_{|S|}\}$
are numbered sequentially, where $|S|$ is the user defined number of streams.
The default number of streams is four in this study. For any stream $S_i$, the two streams
$S_{(i+1) \bmod |S|}$ and $S_{(i+2) \bmod |S|}$ are considered to be the neighboring streams. The MOD operation
ensures that the neighbors are found in a wrap-around fashion, so that for the stream $S_{|S|}$ the
neighboring streams are $S_1$ and $S_2$.
For a stream 𝑆𝑖 , HeMI spots out the best chromosome 𝑃𝑏 out of its neighboring streams
and 𝑆𝑖 . It then replaces the worst chromosome of 𝑆𝑖 by 𝑃𝑏 , if 𝑃𝑏 comes from a neighboring
stream of 𝑆𝑖 . The chromosomes of 𝑆𝑖 are then sorted and the best chromosome is stored as 𝑃𝑏
which is the best chromosome found so far for $S_i$, as explained in Component 9. While the
sharing of the best chromosome from the neighboring streams increases the fitness of the best
chromosome of $S_i$, it maintains the divergence among the streams since the sets of neighboring
streams for any two streams $S_i$ and $S_j$ are different.
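The wrap-around neighborhood and replacement rule can be sketched in Python as follows; streams are lists of chromosomes (0-indexed here), and the names are hypothetical.

def share_with_neighbors(streams, fitness):
    m = len(streams)
    bests = [max(s, key=fitness) for s in streams]          # snapshot before replacement
    for i, stream in enumerate(streams):
        # Neighbors of stream i are streams (i+1) mod m and (i+2) mod m.
        best = max(bests[i], bests[(i + 1) % m], bests[(i + 2) % m], key=fitness)
        if best is not bests[i]:                            # only borrow from a neighbor
            worst = min(stream, key=fitness)
            stream[stream.index(worst)] = best
    return streams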
Component 11: Global Best Selection
This is another contribution of HeMI. At the end of all iterations/generations, each stream has
the best chromosome for the stream. HeMI compares all such best chromosomes from all
streams and then selects the best of the best chromosomes as the final clustering solution. The
genes of the best chromosome represent the cluster centers and records are allocated to their
closest seeds to form the final clusters.
6.2.3 The HeMI Algorithm
After introducing the main components we are now ready to present the overall algorithm of HeMI, which is also explained in Algorithm 6.1 and Fig. 6.1; we use the same notation in both. HeMI takes a data set 𝐷 as input. It first normalizes all numerical attributes separately as explained in Component 1. HeMI uses a user defined number of multiple streams as explained in Component 2; the use of multiple streams aims to improve the clustering results, and the default number of streams in this study is four (4). Each stream contains a subpopulation (see Component 2) in the sense that the total number of chromosomes is divided equally (or as equally as possible) among the streams.
HeMI then generates the initial chromosomes for each stream separately through its proposed Population Initialization component (see Component 3, Step 1 of Algorithm 6.1, and Fig. 6.1). It skips the Noise Based Selection operation in the first iteration, as shown in Step 2 of Algorithm 6.1 and Fig. 6.1; this operation is applied from the second iteration onwards.
The single point crossover, Twin Removal, Mutation, Health Improvement and Elitist operations are then applied sequentially. All of these operations are described above (see Components 5 to 9), and they can also be traced through Steps 3 to 7 of Algorithm 6.1.
In order to take advantage of the multiple streams, HeMI then performs the Neighbor Information Sharing operation at a regular interval, which is by default 10 iterations. This operation is explained in Component 10 and Step 8 of Algorithm 6.1. At the end of all iterations, HeMI applies the Global Best Selection operation in order to find the final clustering solution (see Component 11 and Step 9 of Algorithm 6.1).
6.3 Experimental Results and Discussion
6.3.1 The Data sets and the Evaluation Criteria
We empirically compare our proposed technique called HeMI with five existing techniques
namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-means (Lloyd,
1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam, 2014) on
twenty (20) natural data sets that are available from the UCI machine learning repository (M.
Lichman, 2013). HeMI is compared with these existing techniques because they are recent and
were shown to be better than many other high-quality techniques (Sanghamitra Bandyopadhyay
& Maulik, 2001, 2002; Lai, 2005; Lin, Yang, & Kao, 2005; Murthy & Chowdhury, 1996).
Detailed information on the data sets is provided in Table 6.1. We choose data sets with a wide variety of characteristics. For example, some data sets (such as Glass Identification) have only numerical attributes and some data sets (such as BC) have categorical attributes; the Credit Approval (CA) data set has 6 numerical and 9 categorical attributes. The reason why we choose mostly data sets with only numerical attributes is that all techniques (except HeMI and GenClust) used in this study can handle only numerical attributes.
Some data sets have a low number of attributes, such as Blood Transfusion (BT) with only 4 attributes, and some have a high number of attributes, such as Dermatology (DT) with 34 attributes. Similarly, some data sets have a low number of records, such as Glass Identification with 214 records, and some have a relatively high number of records, such as MGT with 19,020 records. Also, some data sets have a low number of class values (i.e. a low domain size of the class attribute), such as BC with only two class values, while some have a high number of class values, such as LF with 36 class values.
Table 6.1: A brief description of the data sets

Data set | Records (incl. missing) | Records (without missing) | Numerical attributes | Categorical attributes | Class size
Glass Identification (GI) | 214 | 214 | 10 | 0 | 7
Breast Cancer (BC) | 286 | 277 | 0 | 9 | 2
Vertebral Column (VC) | 310 | 310 | 6 | 0 | 2
Ecoli (EC) | 336 | 336 | 8 | 0 | 8
Leaf (LF) | 340 | 340 | 16 | 0 | 36
Liver Disorder (LD) | 345 | 345 | 6 | 0 | 2
Dermatology (DT) | 366 | 358 | 34 | 0 | 6
Credit Approval (CA) | 690 | 653 | 6 | 9 | 2
Breast Cancer Wisconsin Original (WBC) | 699 | 683 | 10 | 0 | 2
Blood Transfusion (BT) | 748 | 748 | 4 | 0 | 2
Pima Indian Diabetes (PID) | 768 | 768 | 8 | 0 | 2
Statlog Vehicle Silhouettes (SV) | 846 | 846 | 18 | 0 | 4
Mammographic Mass (MGM) | 961 | 830 | 5 | 0 | 2
Bank Note Authentication (BN) | 1372 | 1372 | 4 | 0 | 2
Contraceptive Method Choice (CMC) | 1473 | 1473 | 9 | 0 | 3
Yeast (YT) | 1484 | 1484 | 8 | 0 | 10
Image Segmentation (IS) | 2310 | 2310 | 18 | 0 | 7
Wine Quality (WQ) | 4898 | 4898 | 11 | 0 | 7
Page Blocks Classification (PBC) | 5473 | 5473 | 10 | 0 | 5
MAGIC Gamma Telescope (MGT) | 19020 | 19020 | 11 | 0 | 2
Class values are the labels of records, and they reflect an important property of a data set. Typically, clustering algorithms are applied on data sets that do not have any class values. Hence, we delete the class attributes from all data sets prior to any experimentation. Some of the data sets contain missing values, meaning that for some records some attribute values are missing. Column 2 of Table 6.1 shows the total number of records of each data set, including the records that have missing values. We first delete the records having any missing value/s; Column 3 of Table 6.1 shows the number of records without missing value/s. For example, the BC data set has altogether 286 records, but 9 of them have one or more missing values. Hence, after these 9 records are deleted the data set has 277 records without any missing values. In all experiments, we use the data sets without any missing values.
We evaluate and compare the clustering techniques based on two well-known evaluation
criteria namely Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael
Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979).
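For reference, the DB Index used throughout this chapter can be computed as in the following Python/NumPy sketch. This is our illustration of the standard Davies-Bouldin formulation, not the exact thesis implementation; records X, integer cluster labels in 0..k-1 and the centroids are assumed given.

import numpy as np

def db_index(X, labels, centroids):
    # Davies-Bouldin Index: lower values indicate a better clustering.
    k = len(centroids)
    # s[i]: average distance of cluster i's records to its centroid (scatter)
    s = np.array([np.mean(np.linalg.norm(X[labels == i] - centroids[i], axis=1))
                  for i in range(k)])
    total = 0.0
    for i in range(k):
        # similarity of cluster i to its "worst" other cluster
        ratios = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i]
        total += max(ratios)
    return total / k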
6.3.2 The Parameters used in the Experiments
In the experiments on AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014) and HeMI we consider the population size to be 20 and the number of iterations/generations to be 50. We maintain this consistency for all of these techniques in order to ensure a fair comparison among them.
In the experiments, the number of iterations of K-means/K-means++ within HeMI is set to 50, and the number of iterations of K-means within GenClust is also set to 50. The standalone K-means and K-means++ also run for 50 iterations, and the termination threshold value is set to 0.005 as suggested in GenClust. The cluster number in GAGR, K-means and K-means++ is user defined; however, in order to simulate a natural scenario, the cluster number for these techniques is generated randomly between 2 and √𝑛, where 𝑛 is the number of records in a data set. The values of 𝑟𝑚𝑎𝑥 and 𝑟𝑚𝑖𝑛 for AGCUK and HeMI are set to 1 and 0 respectively, as suggested in AGCUK.
6.3.3 The Experimental Setup
For each data set, we run HeMI 20 times since it can produce different clustering results in different runs. We also run all the other techniques, AGCUK, GAGR, K-means, K-means++ and GenClust, 20 times on each of the 20 data sets, and we present the average and standard deviation of the clustering results. Moreover, in order to evaluate the effectiveness of various components of HeMI we use 5 data sets on which we again run the techniques 20 times.
6.3.4 Experimental Results on All Techniques
In this section, we experimentally evaluate the performance of HeMI by comparing it with K-means, K-means++, GAGR, AGCUK and GenClust on all 20 data sets, where each technique runs 20 times on each data set. Since AGCUK, GAGR, K-means and K-means++ cannot handle the 2 data sets with categorical attributes, these techniques are tested on 18 (instead of 20) data sets, whereas HeMI and GenClust are tested on all 20 data sets.
Fig. 6.2: Comparative results between HeMI and the other techniques based on Silhouette Coefficient; panels (a) and (b) each cover ten of the twenty data sets
Fig. 6.2 (a) and Fig. 6.2 (b) show the average and standard deviation of the Silhouette Coefficient of the clustering solutions, where HeMI achieves better results than GenClust in 20 out of 20 data sets. That is, in 20 out of 20 data sets the average Silhouette Coefficient of 20 runs of HeMI is higher than that of 20 runs of GenClust. Moreover, in 17 out of 20 data sets the standard deviations of HeMI do not overlap the standard deviations of GenClust while the average Silhouette Coefficient of HeMI is higher. Note that the cases where the standard deviations of HeMI overlap with the standard deviations of other techniques are indicated by an arrow in Fig. 6.2 (a), Fig. 6.2 (b), Fig. 6.3 (a) and Fig. 6.3 (b). That is, for all cases without an arrow HeMI achieves a better result with no overlap of standard deviations.
HeMI achieves higher Silhouette Coefficient than AGCUK in 18 out of 18 data sets. The
standard deviations of HeMI do not overlap the standard deviations of AGCUK in 17 out of 18
data sets. HeMI also achieves higher Silhouette Coefficient than K-means, K-means++ and
GAGR in all 18 data sets. In 18 out of 18 data sets the standard deviations of HeMI do not
overlap with the standard deviations of any of these techniques.
Fig. 6.3: Comparative results between HeMI and the other techniques based on DB Index; panels (a) and (b) each cover ten of the twenty data sets
All bar graphs in Fig. 6.2 (a), Fig. 6.2 (b), Fig. 6.3 (a) and Fig. 6.3 (b) are in the same sequence: K-means, K-means++, GAGR, AGCUK, GenClust, and HeMI. As we can see in Fig. 6.3 (a) and Fig. 6.3 (b), HeMI achieves better clustering results (on average) than GenClust in 19 out of 20 data sets based on DB Index, for which a lower value indicates a better result. In 18 out of 20 data sets the standard deviations of HeMI do not overlap with the standard deviations of GenClust. Moreover, HeMI performs better than K-means, K-means++, GAGR and AGCUK on all 18 data sets based on DB Index, and in 18 out of 18 data sets the standard deviations of HeMI do not overlap with the standard deviations of these techniques.
The rightmost columns of Fig. 6.2 (b) and Fig. 6.3 (b) show the average Silhouette Coefficient and DB Index of all techniques over all data sets. HeMI achieves clearly better results on average than all other techniques, without any overlap of standard deviations. We believe that this is a very strong result demonstrating the superiority of HeMI over a number of recent and high-quality clustering techniques.
6.3.5 Comparative Results between HeMI and GCS
In this section, we compare HeMI with GCS (as presented in Chapter 5) through two cluster evaluation criteria, namely Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979), using the 15 real-life data sets from the UCI machine learning repository (M. Lichman, 2013) that we used in Chapter 5 (see Table 5.1).
As we can see in Table 6.2, HeMI obtains clustering results inferior to GCS in only 1 out of 15 data sets based on DB Index and 2 out of 15 data sets based on Silhouette Coefficient. The bottom row of Table 6.2 shows the average Silhouette Coefficient and DB Index of the techniques over all data sets. HeMI achieves clearly better results on average than GCS, indicating the effectiveness of the components of HeMI. Note that the results in Table 6.2 are averages of 20 runs for each data set.
Table 6.2: Comparative results between HeMI and GCS
Data set DB Index (lower the better) Silhouette Coefficient (higher the better)
HeMI GCS HeMI GCS
GI 0.24 0.27 0.80 0.81
VC 0.25 0.25 0.83 0.83
EC 0.32 0.32 0.80 0.77
LF 0.43 0.44 0.71 0.70
LD 0.26 0.27 0.81 0.81
DT 0.63 0.71 0.61 0.59
BT 0.17 0.21 0.86 0.81
PID 0.40 0.17 0.73 0.83
SV 0.43 0.53 0.68 0.63
BN 0.39 0.47 0.70 0.66
YT 0.23 0.30 0.80 0.78
IS 0.45 0.50 0.73 0.71
WQ 0.23 0.23 0.84 0.84
PBC 0.10 0.12 0.92 0.90
MGT 0.24 0.51 0.82 0.65
Average 0.31 0.35 0.77 0.75
6.3.6 Comparative Results among HeMI, GCS, GMC and DeRanClust
We now compare HeMI with GCS (presented in Chapter 5), GMC (see Chapter 4) and DeRanClust (see Chapter 3) on the 10 natural data sets (obtained from the UCI machine learning repository (M. Lichman, 2013)) that we used in Chapter 4 (see Table 4.1). We evaluate and compare the clustering techniques based on two well-known evaluation criteria, namely Silhouette Coefficient and DB Index.
Table 6.3 and Table 6.4 show that HeMI achieves better clustering results than DeRanClust and GCS in 8 out of 10 data sets based on Silhouette Coefficient and DB Index, and in 1 out of 10 data sets HeMI performs equally to DeRanClust and GCS based on both criteria.
Table 6.3: Comparative Results among HeMI, GCS, GMC and DeRanClust based on Silhouette Coefficient
Silhouette Coefficient (higher the better)
Data sets DeRanClust GMC GCS HeMI
GI 0.832 0.781 0.789 0.827
VC 0.830 0.461 0.830 0.830
LF 0.664 0.571 0.691 0.699
LD 0.802 0.648 0.812 0.815
DT 0.605 0.508 0.602 0.615
PID 0.468 0.518 0.818 0.788
SV 0.657 0.592 0.638 0.683
BN 0.695 0.600 0.671 0.696
YT 0.786 0.724 0.799 0.837
IS 0.698 0.613 0.719 0.734
Average 0.7037 0.6016 0.7369 0.7524
Table 6.4: Comparative Results among HeMI, GCS, GMC and DeRanClust based on DB Index
DB Index (lower the better)
Data sets DeRanClust GMC GCS HeMI
GI 0.249 0.331 0.280 0.265
VC 0.254 0.822 0.254 0.254
LF 0.509 0.653 0.475 0.440
LD 0.284 0.458 0.270 0.266
DT 0.724 1.005 0.724 0.635
PID 0.914 0.897 0.174 0.409
SV 0.495 0.582 0.525 0.437
BN 0.423 0.589 0.470 0.401
YT 0.272 0.499 0.323 0.233
IS 0.522 0.604 0.493 0.453
Average 0.464 0.644 0.398 0.379
We can see in Table 6.3 and Table 6.4 that HeMI achieves better results than GMC in all 10 data sets based on Silhouette Coefficient and DB Index. Moreover, the bottom rows of Table 6.3 and Table 6.4 show the average Silhouette Coefficient and DB Index of all techniques over all data sets. HeMI achieves clearly better results on average than GCS, GMC and DeRanClust based on both Silhouette Coefficient and DB Index. Note that all results presented in Table 6.3 and Table 6.4 are averages of 10 runs.
6.3.7 An Analysis of the Impact of Various Properties of HeMI
We now explore the effectiveness of some novel properties/components of HeMI in the
following subsections. For the experiments, we add a component of HeMI to an existing
technique called AGCUK, and investigate its impact on AGCUK.
6.3.7.1 An Analysis of the Impact of the Multiple Streams that Exchange Information
An important contribution of HeMI is its multiple streams that share/exchange information at a regular interval. Due to having multiple streams, HeMI can accommodate more chromosomes than existing techniques. Hence, in this section we carry out experiments to first investigate whether a higher number of chromosomes improves the clustering results. We next investigate the impact of exchanging information among the streams at a regular interval. The results justify the usefulness of these components.
Table 6.5 demonstrates that AGCUK achieves better clustering results (in terms of Silhouette Coefficient and DB Index) when it uses 40 chromosomes instead of 20, and 80 chromosomes instead of 40. We run 50 iterations as usual, and the average results of 20 runs are presented in the tables.
Table 6.5: Comparative results between AGCUK, AGCUK with 40 Population and AGCUK with 80 Population
(Columns 2-4: DB Index, lower the better; columns 5-7: Silhouette Coefficient, higher the better)

Data set | AGCUK | AGCUK (40 Pop.) | AGCUK (80 Pop.) | AGCUK | AGCUK (40 Pop.) | AGCUK (80 Pop.)
PID | 1.40 | 1.30 | 1.32 | 0.27 | 0.31 | 0.29
BT | 0.54 | 0.53 | 0.47 | 0.64 | 0.63 | 0.66
GI | 0.62 | 0.53 | 0.54 | 0.65 | 0.70 | 0.69
LD | 0.87 | 0.82 | 0.79 | 0.46 | 0.50 | 0.51
BN | 0.79 | 0.76 | 0.67 | 0.47 | 0.49 | 0.55
Average | 0.84 | 0.78 | 0.75 | 0.49 | 0.52 | 0.54
Table 6.6 presents the clustering results obtained by AGCUK with 80 chromosomes in a single stream and by AGCUK with 80 chromosomes equally divided among 4 streams, called AGCUK with Multiple Streams. In this case the streams do not exchange information, and we pick the best clustering result of the 4 streams at the end of 50 iterations. The table clearly shows that multiple streams help AGCUK to achieve better results.
Table 6.6: Comparative results between AGCUK with 80 Population and AGCUK with Multiple Streams
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | AGCUK (80 Pop.) | AGCUK (Multiple Streams) | AGCUK (80 Pop.) | AGCUK (Multiple Streams)
PID | 1.32 | 1.30 | 0.29 | 0.31
BT | 0.47 | 0.43 | 0.66 | 0.69
GI | 0.54 | 0.40 | 0.69 | 0.78
LD | 0.79 | 0.77 | 0.51 | 0.55
BN | 0.67 | 0.68 | 0.55 | 0.55
Average | 0.75 | 0.71 | 0.54 | 0.57
Table 6.7 compares the clustering results obtained by AGCUK with 4 streams that do not exchange information and by AGCUK with 4 streams, called AGCUK with Neighbor Exchange, that exchange information among neighbors at every 10 iterations. We can clearly see the impact of information exchange on the final clustering results.
Table 6.7: Comparative results between AGCUK with Multiple Streams and AGCUK with Neighbor Exchange
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | AGCUK (Multiple Streams) | AGCUK (Neighbor Exchange) | AGCUK (Multiple Streams) | AGCUK (Neighbor Exchange)
PID | 1.30 | 1.25 | 0.31 | 0.33
BT | 0.43 | 0.44 | 0.69 | 0.70
GI | 0.40 | 0.40 | 0.78 | 0.74
LD | 0.77 | 0.46 | 0.55 | 0.67
BN | 0.68 | 0.53 | 0.55 | 0.63
Average | 0.71 | 0.61 | 0.57 | 0.61
The total number of chromosomes and iterations in all versions of AGCUK is the same, yet AGCUK with multiple streams that exchange information achieves better results than the other versions. This clearly indicates the effectiveness of the component.
Similar experiments are then carried out on another existing technique called GenClust. The results are consistent with those for AGCUK and indicate the effectiveness of the proposed component (see Table 6.8).
Table 6.8: Comparative results between GenClust, GenClust with Multiple Streams and GenClust with Neighbor Exchange
(Columns 2-4: DB Index, lower the better; columns 5-7: Silhouette Coefficient, higher the better)

Data set | GenClust | GenClust (Multiple Streams) | GenClust (Neighbor Exchange) | GenClust | GenClust (Multiple Streams) | GenClust (Neighbor Exchange)
PID | 1.11 | 1.03 | 0.95 | 0.37 | 0.42 | 0.47
BT | 0.89 | 0.80 | 0.77 | 0.41 | 0.44 | 0.47
GI | 0.71 | 0.69 | 0.65 | 0.62 | 0.63 | 0.65
LD | 0.85 | 0.83 | 0.75 | 0.56 | 0.57 | 0.63
BN | 0.85 | 0.82 | 0.76 | 0.41 | 0.42 | 0.46
Average | 0.88 | 0.83 | 0.77 | 0.47 | 0.49 | 0.53
Since we see clear evidence of improvement in AGCUK and GenClust with the inclusion of multiple streams that exchange information, we now compare them with HeMI in order to investigate the effectiveness of the other components of HeMI.
Table 6.9: Comparative results between HeMI, AGCUK with Neighbor Exchange and GenClust with Neighbor Exchange
(Columns 2-4: DB Index, lower the better; columns 5-7: Silhouette Coefficient, higher the better)

Data set | AGCUK (Neighbor Exchange) | GenClust (Neighbor Exchange) | HeMI | AGCUK (Neighbor Exchange) | GenClust (Neighbor Exchange) | HeMI
PID | 1.25 | 0.95 | 0.40 | 0.33 | 0.47 | 0.73
BT | 0.44 | 0.77 | 0.17 | 0.70 | 0.47 | 0.86
GI | 0.40 | 0.65 | 0.26 | 0.74 | 0.65 | 0.82
LD | 0.46 | 0.75 | 0.26 | 0.67 | 0.63 | 0.81
BN | 0.53 | 0.76 | 0.39 | 0.63 | 0.46 | 0.70
Average | 0.61 | 0.77 | 0.29 | 0.61 | 0.53 | 0.78
Table 6.9 shows that HeMI achieves better clustering results than GenClust with multiple streams that exchange information in 5 out of 5 data sets based on Silhouette Coefficient and DB Index. HeMI also performs better than AGCUK with multiple streams that exchange information in all 5 data sets based on both criteria. Moreover, the average clustering results over the 5 data sets show the clear dominance of HeMI over AGCUK and GenClust with multiple streams that exchange information. All results presented in the tables are averages of 20 runs.
6.3.7.2 An Analysis of the Impact of the Population Initialization
In order to evaluate the effectiveness of our proposed population initialization we incorporate this component (see Component 3 in Section 6.2.2) into an existing technique called AGCUK, and then observe how the component impacts AGCUK. We generate 20 initial chromosomes through the proposed component, use these chromosomes in AGCUK as the initial population, and run AGCUK for 50 iterations on 5 data sets. We call this AGCUK with HeMI Population.
Table 6.10: Comparative results between AGCUK and AGCUK with HeMI Population
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | AGCUK | AGCUK with HeMI Population | AGCUK | AGCUK with HeMI Population
PID | 1.40 | 1.29 | 0.27 | 0.30
BT | 0.54 | 0.48 | 0.64 | 0.66
GI | 0.62 | 0.57 | 0.65 | 0.70
LD | 0.87 | 0.87 | 0.46 | 0.45
BN | 0.79 | 0.75 | 0.47 | 0.50
Average | 0.84 | 0.79 | 0.49 | 0.52
Table 6.10 clearly indicates that AGCUK with HeMI Population achieves better clustering results compared to AGCUK according to both the Silhouette Coefficient and the DB Index. The average clustering result of AGCUK with HeMI Population over the 5 data sets is also better than that of AGCUK in terms of both evaluation criteria.
6.3.7.3 An Analysis of the Impact of the Mutation Operation
In order to evaluate the effectiveness of the proposed mutation operation of HeMI we incorporate the component (see Component 7 in Section 6.2.2) into AGCUK, replacing its own mutation operation. We call this version AGCUK with HeMI Mutation. We run both AGCUK and AGCUK with HeMI Mutation for 50 iterations on 5 data sets.
Table 6.11 shows that AGCUK with HeMI Mutation achieves better clustering results compared to AGCUK in 5 out of 5 data sets according to both Silhouette Coefficient and DB Index.
Table 6.11: Comparative results between AGCUK and AGCUK with HeMI Mutation
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | AGCUK | AGCUK with HeMI Mutation | AGCUK | AGCUK with HeMI Mutation
PID | 1.40 | 0.84 | 0.27 | 0.53
BT | 0.54 | 0.23 | 0.64 | 0.83
GI | 0.62 | 0.32 | 0.65 | 0.79
LD | 0.87 | 0.32 | 0.46 | 0.78
BN | 0.79 | 0.60 | 0.47 | 0.52
Average | 0.84 | 0.46 | 0.49 | 0.69
We also extend this analysis by introducing a version of the proposed HeMI where we remove its mutation operation (let us call this version HeMI without Mutation) and then compare it with the complete HeMI.
Table 6.12: Comparative results between HeMI and HeMI without Mutation
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | HeMI | HeMI without Mutation | HeMI | HeMI without Mutation
PID | 0.40 | 0.46 | 0.73 | 0.69
BT | 0.17 | 0.23 | 0.86 | 0.82
GI | 0.26 | 0.28 | 0.82 | 0.80
LD | 0.26 | 0.31 | 0.81 | 0.78
BN | 0.39 | 0.47 | 0.70 | 0.66
Average | 0.29 | 0.35 | 0.78 | 0.75
We run both HeMI and HeMI without mutation for 50 iterations. Table 6.12 shows that HeMI
achieves better clustering results than HeMI without mutation in 5 out of 5 data sets based on
Silhouette Coefficient and DB Index. This clearly indicates the effectiveness of the proposed
mutation operation used in HeMI.
6.3.7.4 An Analysis of the Impact of the Health Improvement
We also explore the effectiveness of the health improvement operation (see Component 8) of HeMI. In Table 6.13 we compare HeMI with a different version of HeMI, called HeMI without Health Improvement Operation, that is exactly the same as HeMI except that it does not include Component 8. We run both versions for 50 iterations on 5 data sets.
Table 6.13: Comparative results between HeMI and HeMI without Health Improvement Operation
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | HeMI | HeMI without Health Improvement | HeMI | HeMI without Health Improvement
PID | 0.40 | 0.80 | 0.73 | 0.50
BT | 0.17 | 0.18 | 0.86 | 0.86
GI | 0.26 | 0.28 | 0.82 | 0.81
LD | 0.26 | 0.27 | 0.81 | 0.80
BN | 0.39 | 0.44 | 0.70 | 0.68
Average | 0.29 | 0.39 | 0.78 | 0.73
We can see in Table 6.13 that HeMI achieves better clustering results than HeMI without Health Improvement Operation based on both Silhouette Coefficient and DB Index.
6.3.7.5 An Analysis of the Impact of the Interval
We also explore the impact of various intervals for the Neighbor Information Sharing component (see Component 10 in Section 6.2.2) of HeMI. In Fig. 6.4 we compare HeMI with two other versions, called HeMI with Interval 5 and HeMI with Interval 15, in which the neighboring streams share/exchange the best chromosome among them at every 5th and every 15th generation, respectively. Note that in HeMI the neighboring streams share/exchange the best chromosome at every 10th generation. We run HeMI, HeMI with Interval 5 and HeMI with Interval 15 for 50 iterations on 5 data sets.
Fig. 6.4: Comparative results between HeMI and HeMI with different Intervals
We can see in Fig. 6.4 that HeMI with Interval 5 achieves better clustering results than HeMI and HeMI with Interval 15 based on Silhouette Coefficient. The results indicate that a shorter sharing interval achieves better clustering results than a longer one; however, a shorter interval also increases the execution time. Therefore, as a heuristic compromise, the interval in HeMI is set to 10.
6.3.7.6 An Analysis of the Impact of the number of Streams
In order to evaluate the impact of different numbers of multiple streams (see Component 2 in
Section 6.2.2), we compare HeMI (where we use 4 streams) with another version of HeMI with
8 streams. We run both HeMI and HeMI with 8 streams for 50 iterations on 5 data sets.
Fig. 6.5 shows that HeMI with 8 streams achieves better clustering results compared to HeMI according to Silhouette Coefficient. The results indicate that HeMI with a higher number of streams performs better than HeMI with a lower number of streams; however, a higher number of streams also increases the execution time. Considering this, we use 4 streams in HeMI in this study, although a user may use more streams as necessary.
Fig. 6.5: Comparative results between HeMI and HeMI with 8 Streams
6.3.7.7 An Analysis of the Improvement in Chromosomes over the Iterations
In Fig. 6.6 we present the average fitness (in terms of DB Index, where Fitness = 1/DB Index) values of the best chromosomes over 5 separate runs of HeMI. That is, we run HeMI 5 times and then present the average fitness of these 5 runs. The average fitness values are plotted against the iterations for all 20 data sets. Most of the data sets achieve a rapid improvement within the first 5 to 10 iterations, and the fitness then continues to increase steadily over the iterations. This is also clear from Fig. 6.7, which presents the grand average fitness of the best chromosomes over all 20 data sets, instead of each data set separately.
Fig. 6.6: Average Fitness versus Iteration. Each line represents the average fitness of the best chromosome of 5 consecutive
runs of HeMI on a data set
Fig. 6.7: Average Fitness (best chromosome) versus Iteration over the 20 data sets
In Fig. 6.8 we present the average fitness of the best chromosome of HeMI, AGCUK and
HeMI with a single stream on the PID data set. Both HeMI and AGCUK use the same fitness
function (DB Index (D L Davies & Bouldin, 1979)) to calculate the fitness of a chromosome.
Average fitness values of the best chromosomes of HeMI are always higher than those of HeMI
with a single stream and AGCUK, clearly indicating the effectiveness of various components of
HeMI including its multiple streams.
Fig. 6.8: Average Fitness (best chromosome) versus Iteration. Each line represents the average fitness of 5 consecutive runs
on PID data set
6.3.8 Statistical Analysis
We now analyze the results by using a statistical sign test (D. Mason, 1998; Triola, 2001) on all 20 data sets over all 20 runs to evaluate the superiority of the results (Silhouette Coefficient and DB Index) obtained by HeMI over the results obtained by the existing techniques. We observe that the results do not follow a normal distribution and thus do not satisfy the conditions for a parametric test.
Fig. 6.9: Sign test of HeMI based on Silhouette Coefficient; panels (a) and (b) each cover ten of the twenty data sets
Hence, we perform a non-parametric sign test on the Silhouette Coefficient and DB Index, as shown in Fig. 6.9 (a), Fig. 6.9 (b), Fig. 6.10 (a) and Fig. 6.10 (b). The first five bars for each data set show the z-values (test statistics) for HeMI against the five existing techniques, while the sixth bar shows the z(ref.) value. If a z-value is greater than the z(ref.) value then the results obtained by HeMI are significantly better than the results of the corresponding existing technique.
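The z-value of the sign test can be computed as in the following sketch, using the standard normal approximation to the one-sided sign test. This is our illustration, not the thesis code; the win count and the treatment of ties are hypothetical inputs, and z(ref.) = 1.96 is taken from the text.

import math

def sign_test_z(wins, n):
    # Normal approximation to the one-sided sign test.
    # wins: number of paired runs where HeMI beats the competing technique
    # n: total number of paired runs (ties assumed discarded beforehand)
    return (wins - n / 2.0) / (math.sqrt(n) / 2.0)

# Hypothetical example: if HeMI wins 18 of 20 paired runs, z = 3.58 > 1.96,
# so the improvement would be significant at p < 0.025 (right-tailed).
print(sign_test_z(18, 20))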
Fig. 6.10: Sign test of HeMI based on DB Index; panels (a) and (b) each cover ten of the twenty data sets
In Fig. 6.9 (a) and Fig. 6.9 (b) we present the sign test of HeMI compared with the existing techniques on the 20 data sets in terms of Silhouette Coefficient, where HeMI performs significantly better than the other techniques on 19 out of 20 data sets. Fig. 6.10 (a) and Fig. 6.10 (b) show the statistical significance of HeMI compared with the other existing techniques based on DB Index, where HeMI performs significantly better than the other techniques on 19 out of 20 data sets. We carry out the statistical analysis as a right-tailed test at p < 0.025, i.e. z(ref.) = 1.96, for both Silhouette Coefficient and DB Index. Note that the cases where the z-value is lower than z(ref.) are indicated with arrows in Fig. 6.9 (a) and Fig. 6.10 (b).
6.3.9 An Analysis on the use of K-means++ instead of K-means in HeMI
The HeMI algorithm allows us to use any lightweight clustering technique for the initial population, including K-means and K-means++. In our experiments so far we used K-means for the initial population, and we see that HeMI clearly outperforms all the other existing techniques used in this study. Table 6.14 indicates that HeMI with K-means++ for the initial population achieves better clustering results than HeMI with K-means; a sketch of the K-means++ seeding is given after Table 6.14. Hence, we are confident that HeMI with K-means++ would win against the other existing techniques even more strongly.
Table 6.14: Comparative results between HeMI and HeMI with K-means++
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | HeMI | HeMI with K-means++ | HeMI | HeMI with K-means++
PID | 0.40 | 0.17 | 0.73 | 0.86
BT | 0.17 | 0.17 | 0.86 | 0.87
GI | 0.26 | 0.24 | 0.82 | 0.83
LD | 0.26 | 0.26 | 0.81 | 0.82
BN | 0.39 | 0.38 | 0.70 | 0.71
Average | 0.29 | 0.24 | 0.78 | 0.81
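For completeness, K-means++ differs from K-means only in how the initial seeds are drawn. The sketch below is our illustration of the standard D² seeding of Arthur and Vassilvitskii (2007), not the exact implementation used in HeMI; X is an n × m NumPy array of normalized records.

import numpy as np

def kmeanspp_seeds(X, k, rng=None):
    # Standard D^2 seeding: each new seed is drawn with probability
    # proportional to the squared distance to the nearest chosen seed.
    rng = rng or np.random.default_rng()
    seeds = [X[rng.integers(len(X))]]          # first seed uniformly at random
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)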
6.3.10 Complexity Analysis
In this section, we estimate the complexity of HeMI and compare it with the complexities of the existing techniques used in this study. The main factors determining the complexity of HeMI are the number of records 𝑛 in a data set 𝐷, the number of attributes 𝑚 in 𝐷, the number of genes 𝑘 in a chromosome, the number of chromosomes 𝑧 in the population of a stream, the number of iterations 𝑁′ of K-means and the number of iterations 𝑁 of HeMI. Out of these factors, 𝑛, 𝑚, 𝑘 and 𝑧 can be much bigger than the others and hence we compute the complexity in terms of them.
For the initial population, HeMI uses K-means to get a number of deterministic chromosomes, the complexity of which is 𝑂(𝑛𝑚𝑘𝑧). It also randomly selects some chromosomes, for which the complexity is 𝑂(𝑘𝑧). The fitness function is the DB Index, which has a complexity of 𝑂(𝑛𝑚𝑘𝑧). Once the fitness is computed, the noise based selection requires a pairwise comparison, which can be done in 𝑂(𝑧) complexity. The crossover operation requires the roulette wheel, for which we need 𝑂(𝑧²) complexity. For the twin removal we need 𝑂(𝑚𝑘²𝑧) complexity. In the mutation operation, the complexities of the division, absorption and random change are 𝑂(𝑛𝑚𝑘𝑧), 𝑂(𝑚𝑘𝑧) and 𝑂(𝑧), respectively. The complexities of Phase 1, Phase 2 and Phase 3 of the Health Improvement component are 𝑂(𝑛𝑚𝑘𝑧), 𝑂(𝑧) and 𝑂(𝑧), respectively.
The elitist operation has a complexity of 𝑂(𝑧) once the fitness is calculated at the cost of 𝑂(𝑛𝑚𝑘𝑧). Information exchange among neighboring streams requires 𝑂(𝑧) complexity. Similarly, the global best selection requires 𝑂(𝑛𝑚𝑘𝑧) + 𝑂(𝑧) complexity. Hence, the overall complexity of HeMI is 𝑂(𝑛𝑚𝑘²𝑧²). With respect to 𝑛 and 𝑚 (the two most significant factors), it has a linear complexity 𝑂(𝑛𝑚). The complexities of K-means, AGCUK, GAGR and GenClust are 𝑂(𝑛𝑚) (Lloyd, 1982), 𝑂(𝑛𝑚) (Y. Liu et al., 2011), 𝑂(𝑛𝑚) (D.-X. Chang et al., 2009) and 𝑂(𝑛𝑚² + 𝑛²𝑚) (Rahman & Islam, 2014), respectively.
6.3.11 Comparison between HeMI and Multiple Runs of K-means
Although the complexities of K-means and HeMI in terms of 𝑛 and 𝑚 are the same, i.e. 𝑂(𝑛𝑚), we realize that HeMI may require a higher execution time than K-means; for example, it executes the distance calculation more frequently than K-means. Therefore, by the time we run HeMI once we can perhaps run K-means multiple times. We compute that K-means executes the distance calculation approximately 𝑛𝑚𝑘𝑖 times, where 𝑖 is the number of iterations and 𝑘 is the number of seeds. Considering 50 iterations (i.e. 𝑖 = 50) for K-means, it requires the distance calculation 50𝑛𝑚𝑘 times.
On the other hand, HeMI executes the distance calculation approximately 𝑛𝑚𝑘𝑧𝑖 times for the initial population plus 8𝑛𝑚𝑘𝑧𝐺 times for the genetic operations, where 𝑧 is the number of chromosomes, 𝑖 is the number of iterations of K-means and 𝐺 is the number of generations. Considering 𝑖 = 50, 𝑧 = 20 and 𝐺 = 50, HeMI requires the distance calculation 1000𝑛𝑚𝑘 + 8000𝑛𝑚𝑘 = 9000𝑛𝑚𝑘 times. That is, HeMI requires 9000𝑛𝑚𝑘 / 50𝑛𝑚𝑘 = 180 times as many distance calculations as K-means. As a result, by the time we can run HeMI once we can run K-means approximately 180 times.
Fig. 6.11: Comparative result between HeMI and K-means
Therefore, in this section we run K-means up to 500 times and pick the best result out of these runs. We also run HeMI 20 times and pick the worst result out of the 20 runs. Finally, in Fig. 6.11 we compare the best result of K-means with the worst result of HeMI on three randomly chosen data sets. The top three lines in Fig. 6.11 show the worst results of HeMI on the three data sets, and the bottom three lines represent the best results of K-means for numbers of runs ranging from 50 to 500 on the same three data sets. The results clearly indicate that K-means cannot beat HeMI (in terms of the Silhouette Coefficient) even if K-means runs 500 times.
6.4 Summary
In this chapter, we propose a clustering technique that, in addition to selecting an initial population with a low complexity of 𝑂(𝑛), uses new components including multiple streams, information exchange between neighboring streams, regular health improvement of the chromosomes and a mutation operation that also aims to improve chromosome health/quality.
We evaluate the proposed technique (HeMI) by comparing its clustering quality with five
existing techniques namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009),
K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman &
Islam, 2014) on twenty (20) natural data sets that are publicly available from the UCI machine
learning repository (M. Lichman, 2013) in terms of two well-known evaluation criteria:
Silhouette Coefficient and DB Index.
We run each technique 20 times on each data set, and we present the average clustering results of the 20 runs together with their standard deviation. The experimental results show that HeMI achieves better results than GenClust in 20 out of 20 data sets, and in 17 out of 20 data sets the standard deviations of HeMI do not overlap the standard deviations of GenClust while the average Silhouette Coefficient of HeMI is higher than that of GenClust.
HeMI achieves a higher Silhouette Coefficient than AGCUK in 18 out of 18 data sets (recall that AGCUK, GAGR, K-means and K-means++ can only be tested on the 18 data sets without categorical attributes). The standard deviations of HeMI do not overlap the standard deviations of AGCUK in 17 out of 18 data sets. HeMI also achieves a higher Silhouette Coefficient than K-means, K-means++ and GAGR in all 18 data sets, and in 18 out of 18 data sets the standard deviations of HeMI do not overlap with the standard deviations of any of these techniques.
Moreover, HeMI achieves better clustering results (on average) than GenClust in 19 out of 20 data sets based on DB Index, for which a lower value indicates a better result. In 18 out of 20 data sets HeMI does not have any overlap of standard deviations with GenClust. Moreover, HeMI performs better than K-means, K-means++, GAGR and AGCUK on all 18 data sets based on DB Index, and in 18 out of 18 data sets HeMI does not have any overlap of standard deviations with these techniques.
We also carry out a thorough investigation to evaluate the major components of HeMI one by one. It is evident that all of these components have a positive impact on the final clustering quality. We also present a complexity analysis, which shows that HeMI has a complexity that is linear in the number of records 𝑛 of a data set.
With this result of HeMI, we achieve our first goal of proposing a parameter-less clustering technique with a high-quality solution and low complexity. However, in order to achieve our second goal of producing sensible clustering solutions we need to carefully analyse the results obtained by HeMI and other existing techniques. We carry out this analysis in the next chapter.
Chapter 7
A Novel GA-based Clustering Technique and its
Suitability for Knowledge Discovery from a Brain
Data Set
7.1 Introduction
In this chapter, we present a GA-based Clustering technique called CSClust which aims to
produce sensible clusters. We realize that some recent clustering techniques do not produce
sensible clusters and fail to discover knowledge from underlying data sets. Sometimes, they
obtain a huge number of clusters and sometimes they obtain only two clusters, where one cluster
contains one record and the other cluster contains all remaining records. Interestingly, these
clustering solutions often achieve high fitness values based on existing evaluation criteria.
During the PhD candidature, we published the following paper based on this chapter.
Beg, A. H. and Islam, M. Z. (2016): A Novel Genetic Algorithm-Based Clustering Technique and its
Suitability for Knowledge Discovery from a Brain Data set, In Proc. of the IEEE Congress on
Evolutionary Computation (IEEE CEC 2016), Vancouver, Canada, July 24-29, 2016, pp. 948-956. (ERA
2010 Rank A).
Therefore, in CSClust we propose a new cleansing and cloning operation that helps to
produce sensible clusters with high fitness values, which are also useful for knowledge
discovery. We now briefly introduce the novel components/properties of CSClust and their
logical justifications as follows.
The central component of CSClust is a cleansing operation applied in each generation in order to ensure that all chromosomes in the population encode a sensible solution. Through our empirical analysis using GenClust and GAGR (see Section 7.2) we find that although they produce clustering solutions with high fitness values, they often end up producing non-sensible clustering results. Therefore, we introduce a cleansing operation that applies two conditions: (i) the number of clusters must be within a range between a minimum and a maximum number of clusters, which is learned by CSClust from some properties of the data set, and (ii) the number of records in each cluster must be greater than a threshold minimum number of records, which is again data-driven (i.e. not user defined). CSClust uses the initial population in order to learn the range between the minimum and maximum number of clusters and the threshold minimum number of records.
Another important component of CSClust is the selection of sensible properties, which makes better use of the initial population. CSClust produces a high-quality initial population (see Step 2 in Section 7.3) through a deterministic phase and a random phase, using the same population initialization approach as in Chapter 6 (see Section 6.2.2). It does not require users to determine the number of clusters and/or the radii of clusters, and it keeps the complexity of the initial population selection as low as 𝑂(𝑛). CSClust selects the top |P| chromosomes (|P| = 20 in this study) from the two phases. It then finds the necessary properties of a sensible clustering solution, which are used in each generation in order to ensure that the chromosomes of the population do not contradict them.
Another interesting idea associated with CSClust is the cloning operation used to replace sick chromosomes in each generation. In each generation, the cleansing operation identifies the sick chromosomes, which are then replaced through the cloning operation from a pool of healthy chromosomes found in the initial population.
We evaluate our technique CSClust by comparing its performance with five existing techniques, namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014), K-means (Lloyd, 1982) and K-means++ (Arthur & Vassilvitskii, 2007). We conduct experiments on 10 real-life data sets that are publicly available in the UCI machine learning repository (M. Lichman, 2013). The experimental results clearly indicate that CSClust performs better than the existing techniques in terms of two evaluation criteria: Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979).
Moreover, we apply all the techniques on a brain data set, the CHB-MIT Scalp data set (see Section 7.4.4), in order to assess their quality in producing sensible clustering solutions. The empirical analysis presented in Section 7.4.5 indicates that CSClust produces better clustering solutions that are suitable for deriving knowledge from a data set, whereas all the other techniques typically fail to generate sensible clustering solutions.
The contributions of the chapter are as follows: (i) proposing CSClust, which aims to produce sensible clusters; and (ii) the evaluation of CSClust by comparing it with existing techniques.
The organization of the chapter is as follows: in Section 7.2 we discuss the motivation behind
the proposed technique; in Section 7.3 we present our proposed technique; the experimental
results and discussion are presented in Section 7.4, and the summary of the chapter is presented
in Section 7.5.
7.2 The Motivation Behind the Proposed Technique
In this section, we discuss the motivation behind our proposed technique. We first explore some
clustering solutions made by existing clustering techniques. We use a brain data set called CHB-
MIT Scalp EEG (Goldberger et al., 2000) as an example which is available from
https://physionet.org/cgi-bin/atm/ATM. We plot the data set so that we can graphically visualize
the clusters (see Fig. 7.1). Fig. 7.1 shows the structure of the data set where it has two clusters:
seizure and non-seizure.
Fig. 7.1: The three-dimensional CHB-MIT Scalp EEG (chb01-03) data set
The original data set (Goldberger et al., 2000) contains 9 non-class attributes and a class attribute having two possible values: seizure and non-seizure. We pick three non-class attributes (max, min and std) so that we can plot the records in three dimensions (see Fig. 7.1). In Fig. 7.1, dots represent non-seizure records and plus signs represent seizure records.
We apply GenClust (Rahman & Islam, 2014) on the CHB-MIT data set. The empirical analysis shows that GenClust generates 477 clusters (see Fig. 7.2), which appears to be a non-sensible clustering solution since the actual number of clusters of this data set is only two (seizure and non-seizure), as shown in Fig. 7.1. Note that while clustering the records only the non-class attributes are used; the class attribute is not used.
Fig. 7.2: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set
GenClust uses the fitness function called COSEC (Rahman & Islam, 2014) to evaluate the fitness of a chromosome. The COSEC value of a chromosome increases when the number of clusters of the chromosome increases. Therefore, due to the use of COSEC, GenClust tends to obtain a clustering solution with a large number of clusters.
Fig. 7.3: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set
We also apply GAGR (D.-X. Chang et al., 2009) on the CHB-MIT Scalp data set. GAGR generates 56 clusters (see Fig. 7.3), which is also not sensible as the original number of clusters of this data set is only two. GAGR uses SSE (Pang-Ning Tan, Michael Steinbach, 2005) as its fitness function, and as in GenClust, the fitness of a chromosome increases when the number of clusters of the chromosome increases. Accordingly, GAGR tends to generate a clustering solution with a high number of clusters.
We then explore DB Index (D L Davies & Bouldin, 1979) as the fitness function in GenClust instead of COSEC. Typically, the DB Index considers the compactness of the clusters and the seed distances between clusters. Therefore, a clustering solution with a low number of clusters, some containing very few records, can obtain a deceptively good DB Index value and hence a high fitness.
Fig. 7.4 shows that using DB Index as the fitness function GenClust generates two clusters, but one cluster contains a single record while all the remaining records belong to the other cluster, which is also not sensible. In order to handle such situations, CSClust introduces the cleansing operation.
Fig. 7.4: Clustering result of GenClust using DB Index on CHB-MIT Scalp EEG (chb01-03) data set
One crucial component of CSClust is the cleansing operation, which aims to ensure that the chromosomes in a population do not contradict the properties of a sensible clustering solution. The sensible chromosomes are selected by applying two conditions: (i) the number of clusters must be within the range between the minimum and maximum number of clusters, and (ii) the number of records in each cluster must be greater than a threshold minimum number of records. The threshold values are learned from the sensible chromosomes selected in the initial population. The chromosomes selected in the initial population are expected to be sensible due to the selection of the top chromosomes from a pool of high-quality chromosomes.
Another interesting idea associated with CSClust is the cloning operation used to replace the sick chromosomes found by the cleansing operation. CSClust replaces the sick chromosomes with chromosomes from the pool of high-quality chromosomes found in the initial population. This pool is expected to be reasonably healthy due to the repeated use of K-means.
7.3 CSClust: High-quality Chromosome Selection and Cleansing Operation in a GA for
Clustering
We first mention the main steps of CSClust as follows and then explain each of them in detail. Out of the following steps, Step 3, Step 7 and Step 8 are the novel contributions of this chapter.
BEGIN
Step 1: Normalization
Step 2: Population Initialization
Step 3: Sensible Properties Selection
DO: j = 1 to t /* t is the user defined number of iterations/generations */
Step 4: Crossover Operation
Step 5: Mutation Operation
Step 6: Twin Removal Operation
Step 7: Cleansing Operation
Step 8: Cloning Operation
Step 9: Elitist Operation
END
END
Step 1: Normalization
CSClust takes a data set 𝐷 as input. It first normalizes the data set 𝐷 in order to weigh each
attribute equally regardless of their domain sizes. For normalization, CSClust uses the same
approach of normalization that we used in DeRanClust (see Section 3.2 in Chapter 3).
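Assuming the min-max style normalization described for DeRanClust (our reading of the approach; the exact Chapter 3 formulation may differ in detail), each numerical attribute can be rescaled into [0, 1] as in the following sketch:

import numpy as np

def normalize(D):
    # Min-max normalize each numerical attribute (column) into [0, 1],
    # so that attributes with large domains do not dominate the distances.
    D = np.asarray(D, dtype=float)
    mins, maxs = D.min(axis=0), D.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # guard constant columns
    return (D - mins) / span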
Step 2: Population Initialization
For the population initialization, CSClust uses the same approach as HeMI (see Section 6.2.2 of Chapter 6). CSClust selects |P| chromosomes in the initial population, |P|/2 from the deterministic phase and |P|/2 from the random phase. In the experiments of this chapter, we set |P| to 20. CSClust uses the Davies-Bouldin (DB) Index (D L Davies & Bouldin, 1979) to calculate the fitness of a chromosome.
Step 3: Sensible Properties Selection
This is an original contribution of CSClust that makes better use of the initial population to find the necessary properties of a sensible clustering solution. In the population initialization step, CSClust selects the |𝑃| top chromosomes from the generated initial chromosomes based on their fitness. CSClust then learns the necessary properties of a sensible clustering solution, namely the minimum (𝑀𝑛) and maximum (𝑀𝑥) number of clusters and the minimum number of records (𝑀𝑟) in a cluster, from these |𝑃| chromosomes (see Step 3 of Algorithm 7.1). The properties of a sensible clustering solution are then used in the cleansing and cloning operations.
Step 4: Crossover Operation
All chromosomes participate in the crossover pair by pair. The best chromosome (available in the current population) is selected as the 1st chromosome of the pair, and the 2nd chromosome is selected probabilistically using the roulette wheel technique (D. Chang et al., 2012; Maulik & Bandyopadhyay, 2000; Mukhopadhyay & Maulik, 2009). The probability of a chromosome 𝑃𝑗 is computed as 𝑇𝑗 = 𝑓𝑗 / (𝑓1 + 𝑓2 + … + 𝑓|𝑃|), where 𝑓𝑗 is the fitness of the chromosome 𝑃𝑗 and |𝑃| is the size of the current population.
Once the pair of chromosomes is selected for crossover, CSClust applies the gene re-arrangement operation (Rahman & Islam, 2014) in order to avoid an inappropriate arrangement of genes. The pair of chromosomes then participates in a conventional single point crossover (Sanghamitra Bandyopadhyay & Maulik, 2002; D. Chang et al., 2012; Garai & Chaudhuri, 2004; Michael Laszlo & Mukherjee, 2007; Pakhira et al., 2005; Peng et al., 2014; Rahman & Islam, 2014; Song et al., 2009). CSClust applies the crossover operation on each pair of chromosomes of the population, producing |𝑃| offspring chromosomes altogether. It then applies the twin removal operation (Rahman & Islam, 2014) in order to delete/modify the twin genes (if any) of a chromosome.
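A sketch of the roulette wheel selection of the second parent follows. This is our illustration under the fitness-proportional probability defined above; the fitness array is a hypothetical input.

import numpy as np

def roulette_wheel_partner(fitness, rng=None):
    # Select the index of the 2nd crossover parent with probability
    # T_j = f_j / sum_i f_i, so fitter chromosomes are chosen more often.
    rng = rng or np.random.default_rng()
    f = np.asarray(fitness, dtype=float)
    return int(rng.choice(len(f), p=f / f.sum()))

# Usage: pair the best chromosome (index 0 after sorting) with a
# probabilistically chosen partner.
fitness = [0.9, 0.5, 0.3, 0.3]
best, partner = 0, roulette_wheel_partner(fitness)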
Step 5: Mutation Operation
The aim of the mutation operation is to randomly change some of the chromosomes in order to explore different solutions. CSClust uses the random change operation (D. Chang et al., 2012; Rahman & Islam, 2014) probabilistically, where a chromosome with low fitness has a high chance of being selected for the random change, and vice versa (D.-X. Chang et al., 2009; Rahman & Islam, 2014). The mutation probability of the 𝑗th chromosome is calculated as follows.
𝑀𝑗 = 𝑘1 × (𝑓𝑚𝑎𝑥 − 𝑓𝑗) / (𝑓𝑚𝑎𝑥 − 𝑓𝑚𝑒𝑎𝑛),  if 𝑓𝑗 > 𝑓𝑚𝑒𝑎𝑛
𝑀𝑗 = 𝑘2,  if 𝑓𝑗 ≤ 𝑓𝑚𝑒𝑎𝑛
Eq. 7.1
where 𝑘1 and 𝑘2 both equal 0.5, 𝑓𝑚𝑎𝑥 is the maximum fitness value of a chromosome in the population, 𝑓𝑚𝑒𝑎𝑛 is the average fitness value of the chromosomes in the population and 𝑓𝑗 is the fitness of the 𝑗th chromosome. Once a chromosome is selected for the random change operation, CSClust randomly selects an attribute of each gene and modifies the attribute value randomly.
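Eq. 7.1 can be implemented directly, as in the following sketch (our illustration, with k1 = k2 = 0.5 as in the text):

def mutation_probability(f_j, f_max, f_mean, k1=0.5, k2=0.5):
    # Eq. 7.1: above-average chromosomes mutate with a probability that
    # shrinks as their fitness approaches f_max; the rest mutate with k2.
    if f_j > f_mean:
        return k1 * (f_max - f_j) / (f_max - f_mean)
    return k2

# For instance, the best chromosome (f_j == f_max) gets probability 0,
# while any below-average chromosome gets 0.5.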
Step 6: Twin Removal Operation
CSClust uses the twin removal operation in order to remove/modify twin genes (if any) from
each chromosome. For twin removal, CSClust uses the same approach of twin removal of
DeRanClust (see Section 3.2 of Chapter 3).
Algorithm 7.1: CSClust
Input: A data set 𝐷 having 𝑛 records and |𝐴| attributes, where 𝐴 is the set of attributes
Output: A set of clusters C
Require:
Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
Pr ← ∅ /* Pr is the set of random chromosomes (45 chromosomes), initially empty */
end
Step 1: /* Normalization */
D′ ← Normalize(D) /* normalize each attribute of the data set into the normalized data set D′ */
end
Step 2: /* Population Initialization */
Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the numbers of initial seeds for K-means are chosen deterministically */
Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the numbers of initial seeds and the seeds themselves are chosen randomly */
Px ← SelectDeterministicChromosomes(Pd) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
Py ← SelectRandomChromosomes(Pr) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
Ps ← Px ∪ Py /* Ps is the initial population (20 chromosomes) */
Pb ← FindBestChromosome(Ps) /* Pb is the chromosome with the maximum fitness value in Ps */
end
Step 3: /* Sensible Properties Selection */
(Mn, Mx, Mr) ← FindSensibleProperties(Ps) /* find the minimum number of clusters (Mn), the maximum number of clusters (Mx) and the minimum number of records in a cluster (Mr) */
end
for t = 1 to I do /* I is the user defined number of iterations (default I = 50) and t is the counter */
Step 4: /* Crossover Operation */
Po ← PerformCrossover(Ps) /* perform single point crossover on Ps and get a set of offspring chromosomes Po */
Po ← TwinRemoval(Po) /* perform twin removal on Po */
end
Step 5: /* Mutation Operation */
Pm ← PerformMutationOperation(Po) /* perform the mutation operation on Po and get a set of mutated chromosomes Pm */
end
Step 6: /* Twin Removal */
Pm ← TwinRemoval(Pm) /* perform twin removal on Pm */
end
Step 7: /* Cleansing Operation */
Sc ← FindSickChromosomes(Pm, Mn, Mx, Mr) /* find the set of sick chromosomes Sc in Pm based on Mn, Mx and Mr */
Pm ← Pm − Sc /* remove the Sc chromosomes from Pm */
end
Step 8: /* Cloning Operation */
while |Sc| > 0 do
Hc ← CloningOperation(Pd) /* replace the sick chromosomes from Pd and get a set of healthy chromosomes Hc */
end
Pm ← Pm ∪ Hc /* insert Hc into Pm */
end
Step 9: /* Elitist Operation */
Pb ← ElitistOperation(Pm, Pb) /* apply the elitist operation on Pm and Pb and find the best chromosome Pb */
C ← C ∪ Pb /* insert Pb into C */
Return C
end
end
Step 7: Cleansing Operation
This is an original contribution of CSClust. The aim of this component is to separate the chromosomes of a population into those with sensible and those with non-sensible solutions. CSClust first learns the necessary properties of a sensible clustering solution [the minimum (𝑀𝑛) and maximum (𝑀𝑥) number of clusters, and the minimum number of records (𝑀𝑟) in a cluster] through Step 3. In different runs CSClust may find different values of 𝑀𝑛, 𝑀𝑥 and 𝑀𝑟. Therefore, it avoids using a rigid set of values for 𝑀𝑛, 𝑀𝑥 and 𝑀𝑟 and relaxes the boundaries: it increases the value of 𝑀𝑥 by 𝑡% and decreases the values of 𝑀𝑛 and 𝑀𝑟 by 𝑡%. The value of 𝑡 in this study is set to 10.
CSClust applies the cleansing operation on each chromosome of the population based on the properties of a sensible clustering solution. If the length of a chromosome (i.e. the number of genes in the chromosome) is greater than or equal to 𝑀𝑛 and less than or equal to 𝑀𝑥, and the number of records in each cluster is greater than 𝑀𝑟, then the chromosome is accepted as a sensible solution; otherwise it is considered a sick chromosome (see Step 7 of Algorithm 7.1). The sick chromosomes are then removed from the population.
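A minimal sketch of the cleansing check follows. This is our illustration only; cluster_sizes(c) is an assumed helper that returns the number of records allocated to each gene/cluster of chromosome c.

def is_sensible(chromosome, Mn, Mx, Mr, cluster_sizes, t=0.10):
    # Cleansing test with the relaxed, data-driven boundaries:
    # Mx is increased by t%, while Mn and Mr are decreased by t%.
    lo, hi = Mn * (1 - t), Mx * (1 + t)
    min_records = Mr * (1 - t)
    k = len(chromosome)            # number of genes = number of clusters
    return lo <= k <= hi and all(s > min_records for s in cluster_sizes(chromosome))

def cleansing(population, Mn, Mx, Mr, cluster_sizes):
    # Split the population into sensible (healthy) and sick chromosomes.
    healthy = [c for c in population if is_sensible(c, Mn, Mx, Mr, cluster_sizes)]
    sick = [c for c in population if c not in healthy]
    return healthy, sick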
Step 8: Cloning Operation
This is another new contribution of CSClust. The cloning operation replaces the sick chromosomes found by the cleansing operation. To replace a sick chromosome, CSClust probabilistically selects a chromosome from the pool of chromosomes obtained through the deterministic phase of the population initialization (see Step 2). The chromosomes in this pool are expected to be generally of good health since they are obtained through multiple runs of K-means. CSClust then randomly changes an attribute value of a gene to another value within the domain of the attribute (see Step 8 of Algorithm 7.1). Thus, the chosen chromosome is slightly modified before being added to the population in place of a sick chromosome.
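The cloning step can then be sketched as follows. This is our illustration under the assumption that the probabilistic selection from the deterministic pool is fitness-proportional; tweak_random_gene is an assumed helper that randomly changes one attribute value of one gene within its domain.

import random

def clone_replacements(num_sick, pool, fitness, tweak_random_gene):
    # Replace each sick chromosome with a slightly mutated copy of a
    # chromosome drawn probabilistically (by fitness) from the K-means pool.
    weights = [fitness(c) for c in pool]
    chosen = random.choices(pool, weights=weights, k=num_sick)
    return [tweak_random_gene(c) for c in chosen]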
Step 9: Elitist Operation
The Elitist operation keeps track of the best chromosome throughout the generations in order to
ensure the continuous improvement of the quality of the best chromosome found so far over the
iterations. For finding the best chromosome, CSClust uses the same approach of elitist operation
that we used in DeRanClust (see Section 3.2 of Chapter 3).
7.4 Experimental Results and Discussion
7.4.1 The Data sets and the Cluster Evaluation Criteria
We empirically compare the performance of our proposed technique CSClust with five existing techniques, namely K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007), AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009) and GenClust (Rahman & Islam, 2014), on a brain data set (the CHB-MIT data set) (Goldberger et al., 2000). We also compare the performance of CSClust with the five existing techniques on 10 other real-life data sets that are available in the UCI machine learning repository (M. Lichman, 2013). Detailed information about the data sets is provided in Table 7.1. All the data sets used in the experiments have only numerical attributes, except the class attribute. We choose data sets with numerical attributes because the techniques (except CSClust and GenClust) used in this study can only handle numerical attributes.
Some of the data sets contain missing values; that is, some attribute values of some records are missing. We delete all records having any missing values. Two well-known evaluation criteria, namely the Silhouette Coefficient (Pang-Ning Tan, Michael Steinbach, 2005) and the DB Index (D. L. Davies & Bouldin, 1979), are used to compare the performance of the techniques. Note that a higher Silhouette Coefficient and a lower DB Index both indicate a better clustering result.
Table 7.1: Data sets at a glance
Data set | No. of records (with missing) | No. of records (without missing) | No. of numerical attributes | No. of categorical attributes | Class size
Glass Identification (GI) | 214 | 214 | 10 | 0 | 7
Vertebral Column (VC) | 310 | 310 | 6 | 0 | 2
Dermatology (DT) | 366 | 358 | 34 | 0 | 6
Blood Transfusion (BT) | 748 | 748 | 4 | 0 | 2
Bank Note Authentication (BN) | 1372 | 1372 | 4 | 0 | 2
Yeast (YT) | 1484 | 1484 | 8 | 0 | 10
Image Segmentation (IS) | 2310 | 2310 | 18 | 0 | 7
Wine Quality (WQ) | 4898 | 4898 | 11 | 0 | 7
Page Blocks Classification (PBC) | 5473 | 5473 | 10 | 0 | 5
MAGIC Gamma Telescope (MGT) | 19020 | 19020 | 11 | 0 | 2
7.4.2 The Parameters used in the Experiments
In the experiments on AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014) and CSClust, the population size is set to 20 and the number of iterations/generations is set to 50. We maintain this consistency for all techniques in order to ensure a fair comparison among them.
The number of iterations in K-means (Lloyd, 1982) and K-means++ (Arthur & Vassilvitskii, 2007) is set to 50, and the number of iterations of K-means within GenClust is also set to 50. The number of clusters k in GAGR, K-means and K-means++ is user defined. However, in order to simulate a natural scenario, the cluster numbers for GAGR, K-means and K-means++ are generated randomly in the range between 2 and √n, where n is the number of records in a data set. The threshold value for K-means is set to 0.005. The values of rmax and rmin in AGCUK are set to 1 and 0, respectively.
7.4.3 The Experimental Setup
On each data set, we run CSClust 10 times, since it can produce different clustering results in different runs. We also run all the other techniques (GenClust, GAGR, AGCUK, K-means and K-means++) 10 times each. We then present the average clustering results of CSClust and all the other techniques.
7.4.4 Brain Data set (CHB-MIT Scalp EEG) Pre-processing
Before experimentation, we first prepare the CHB-MIT Scalp EEG data set (Goldberger et al., 2000). This data set consists of EEG recordings of 22 epileptic patients from different age groups, collected at the Children's Hospital Boston from pediatric subjects with intractable seizures. All the EEG signals were sampled at 256 samples per second with 16-bit resolution. The International 10-20 system of EEG channel positions and nomenclature (Jasper, 1958; Oostenveld & Praamstra, 2001) was used.
The International 10-20 system is usually employed to record spontaneous EEG. In this system, 21 channels are located on the surface of the scalp, and three other channels are placed on each side equidistant from the neighboring points (Jasper, 1958) (see Fig. 7.6). In most recordings of this data set 23 channels were used; in some cases 24 or 26 channels were used.
For each channel, we divide the data into epochs of 10 seconds. We then calculate the Maximum (Max), Minimum (Min), Mean, Standard deviation (Std), Kurtosis, Skewness, Entropy, Line length and Energy of each epoch. Hence, from each channel of one hour of data we get 360 records containing nine attributes: Max, Min, Mean, Std, Kurtosis, Skewness, Entropy, Line Length and Energy.
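The feature extraction can be sketched as follows, assuming a 256 Hz signal held in a NumPy array. The entropy estimator here (Shannon entropy over a 20-bin amplitude histogram) is one plausible choice, since the exact estimator is not spelled out above.

import numpy as np
from scipy.stats import kurtosis, skew

FS = 256           # sampling rate (samples per second)
EPOCH = 10 * FS    # one 10-second epoch = 2560 samples

def epoch_features(x):
    """The nine per-epoch attributes for one channel."""
    hist, _ = np.histogram(x, bins=20)
    p = hist[hist > 0] / hist.sum()
    entropy = -np.sum(p * np.log2(p))          # assumed Shannon entropy
    line_length = np.sum(np.abs(np.diff(x)))   # sum of absolute sample-to-sample differences
    energy = np.sum(x ** 2)
    return [x.max(), x.min(), x.mean(), x.std(),
            kurtosis(x), skew(x), entropy, line_length, energy]

def channel_records(signal):
    """One hour at 256 Hz yields 360 epochs, hence 360 records."""
    n = len(signal) // EPOCH
    return np.array([epoch_features(signal[i * EPOCH:(i + 1) * EPOCH])
                     for i in range(n)])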
For example, we prepare one hour of data of one patient (chb01_03), an 11-year-old girl. This recording has 23 channels. Hence, from all 23 channels altogether, we get 360*23 = 8280 records. In this recording the patient experienced a seizure for around 40 seconds (from the 2996th second to the 3036th second). During this period we get 5 records from each channel. These records are considered seizure records and all other records are considered non-seizure records.
Therefore, from the chb01_03 data set we get 23*5 = 115 seizure records and 8165 non-seizure records altogether.
7.4.5 Experimental Results on Brain Data Set
In this section, we experimentally evaluate the performance of CSClust by comparing it with K-
means, K-means++, AGCUK, GAGR and GenClust on a brain data set (Goldberger et al., 2000).
Columns 2 and 3 of Table 7.2 present the clustering results obtained by CSClust and all other
techniques based on Silhouette Coefficient and DB Index.
Columns 2 and 3 of Table 7.2 indicate that CSClust achieves better clustering results than all the other techniques based on both the Silhouette Coefficient and the DB Index. Moreover, CSClust produces the actual number of clusters (see Column 4 in Table 7.2 and Fig. 7.5), whereas GenClust, AGCUK, GAGR, K-means and K-means++ all fail to generate the actual number of clusters on this data set. The original number of clusters in this data set is two (see Fig. 7.1 of Section 7.2). This result indicates the usefulness of the proposed components, including the cleansing and cloning operations, in producing a sensible clustering solution.
Table 7.2: Clustering results of all techniques on the CHB-MIT Scalp EEG (chb01-03) data set
Clustering Techniques | Silhouette Coefficient (higher the better) | DB Index (lower the better) | Number of Clusters
CSClust | 0.77 | 0.29 | 2
GenClust | 0.50 | 0.74 | 468.40
AGCUK | 0.54 | 0.60 | 2.80
GAGR | 0.17 | 1.06 | 45.5
K-means | 0.23 | 0.96 | 39.3
K-means++ | 0.12 | 1.07 | 35.7
7.4.6 Analysis of the Clustering Result obtained by CSClust on the Brain Data set
We now analyze the clustering results obtained by CSClust on CHB-MIT Scalp EEG (chb01-
03) data set in order to explore knowledge from this data set. We plot the clustering result of CSClust in Fig. 7.5. In Fig. 7.5, dots (black) and circles (red) represent Cluster 1 obtained by CSClust: dots (black) represent the records with the class value non-seizure and circles (red) represent the records with the class value seizure. Moreover, circles (green) and dots (green) represent Cluster 2 obtained by CSClust: circles (green) represent the records with the class value seizure and dots (green) represent the records with the class value non-seizure.
Detailed information of the records in Cluster 2 is presented in Table 7.3. The first column
in Table 7.3 shows the channel number and position of the channel on the surface of the scalp
according to the International 10-20 system. Columns 2 and 3 present the number of seizure and non-seizure records in Cluster 2 for every channel.
The number of seizure records in this data set is 115 across the 23 channels. Through our analysis of the original signals of the 23 channels, we find that around 11 of the 23 channels show a seizure signal during the seizure time. Therefore, we consider 11*5 = 55 records as seizure records. However, only around 20-25 of these 55 seizure records can be clearly identified (see Fig. 7.1); the other records overlap with non-seizure records. Each channel has 5 seizure records.
Fig. 7.5: Clustering result of CSClust on CHB-MIT Scalp EEG (chb01-03) data set
Table 7.3: Channel wise number of records in Cluster 2 of CSClust on the CHB-MIT Scalp EEG (chb01-03) data set
Channel (Position) | Number of seizure records | Number of non-seizure records
Channel 1 (FP1-F7) | 2 | 0
Channel 2 (F7-T7) | 1 | 0
Channel 5 (FP1-F3) | 2 | 0
Channel 6 (F3-C3) | 1 | 0
Channel 9 (FP2-F4) | 3 | 0
Channel 10 (F4-C4) | 3 | 0
Channel 11 (C4-P4) | 1 | 0
Channel 12 (P4-O2) | 1 | 0
Channel 13 (FP2-F8) | 3 | 0
Channel 14 (F8-T8) | 3 | 0
Channel 15 (T8-P8) | 3 | 0
Channel 16 (P8-O2) | 2 | 1
Channel 17 (FZ-CZ) | 2 | 2
Channel 18 (CZ-PZ) | 0 | 2
Channel 21 (FT9-FT10) | 3 | 2
Channel 23 (T8-P8) | 3 | 0
In order to visualize the seizure records we plot the records of Channel-5, Channel-13, Channel-17 and Channel-21 in Fig. 7.7 (a), (b), (c) and (d), respectively. Fig. 7.7 (a) to (d) show that in each channel around 2 to 3 of the 5 seizure records are clearly visible. CSClust finds 40 records in Cluster 2, of which 33 are seizure records. It places some non-seizure records in Cluster 2 because they are very similar to seizure records.
In addition, through the analysis of the original EEG signals of the 23 channels during the seizure time, we find that around 11 of the 23 channels show a seizure signal (see Fig. 7.9, Fig. 7.11 and Fig. 7.12), while the other channels show a non-seizure signal during the seizure time (see Fig. 7.10). Fig. 7.10 shows the signal of Channel-7 during the seizure time, but the amplitude of this channel is low (it varies between 200 uV and -200 uV). Fig. 7.8 shows the signal of Channel-5 during the non-seizure time, where the amplitude also varies between 200 uV and -200 uV. Fig. 7.9, Fig. 7.11 and Fig. 7.12 show the signals of Channel-5, Channel-9 and Channel-13, respectively, during the seizure time, where the signals show high amplitude (between 400 uV and -400 uV).
We also find that all 11 channels showing the seizure signal during the seizure time are located in the frontal and temporal lobes of the scalp (see Fig. 7.6), indicating that the seizure for this patient was a localized seizure that originated in the frontal and temporal lobes. Of these 11 channels, 8 are located in the frontal lobe and the other 3 (Channel-15, Channel-16 and Channel-23) are located in the temporal-parietal lobe. Interestingly, CSClust also finds most of the Cluster 2 records in these 11 channels. This again re-confirms the quality of the clustering results obtained by our proposed technique.
Fig. 7.6: Channel positions according to the International 10-20 system (Jasper, 1958; Sharbrough F et al., 1991)
(A = Ear lobe, C = Central, P = Parietal, F = Frontal, Fp = Frontal polar, O = Occipital; the frontal lobe and the temporal-parietal lobe are marked in the figure)
Fig. 7.7: Seizure records on different channels: (a) Channel-5, (b) Channel-9, (c) Channel-13, (d) Channel-21
Fig. 7.8: EEG signals (10 seconds) of channel-5 during the non-seizure time
Fig. 7.9: EEG signals (10 seconds) of channel-5 during the seizure time
Fig. 7.10: EEG signals (10 seconds) of channel-7 during the seizure time
Fig. 7.11: EEG signals (10 seconds) of channel-9 during the seizure time
Fig. 7.12: EEG signals (10 seconds) of channel-13 during the seizure time
7.4.7 Knowledge from Decision Tree on Brain Data set
In this section, we present a number of decision trees to discover logic rules for seizure and non-
seizure records from the CHB-MIT (chb01-03) data set. We apply an existing technique SysFor
(Islam & Giggins, 2011) on this data set. This technique requires the class values of a data set
in order to build a decision forest. We labeled the data set based on the clustering result of
CSClust. CSClust produces two clusters: Cluster 1 and Cluster 2. During the labeling, we
consider the records in Cluster 1 as non-seizure records and records in Cluster 2 as seizure
records. Thus, we get two class values, seizure and non-seizure, in the labeled data set.
SysFor generates four decision trees on the data set, shown in Fig. 7.13 (a) to (d). Typically, a decision tree consists of nodes and leaves. In Fig. 7.13 (a) to (d) the nodes are denoted by rectangles and the leaves by ovals. A path from the root node to a leaf forms a logic rule that represents the relationship between the class attribute and the non-class attributes (Adnan & Islam, 2014).
Fig. 7.13: Decision trees on the CHB-MIT (chb01-03) data set: (a) decision tree 1, (b) decision tree 2, (c) decision tree 3, (d) decision tree 4
Fig. 7.13 (a) shows decision tree 1, where the logic rules for leaf 1 and leaf 2 are "if Std <= 102.44 → Records = Non-seizure" and "if Std > 102.44 → Records = Seizure", respectively. In decision tree 2, shown in Fig. 7.13 (b), the logic rule for leaf 1 is "if Max <= 239.51 → Records = Non-seizure", the logic rule for leaf 2 is "if Max > 239.51 AND Std <= 102.44 → Records = Non-seizure" and the logic rule for leaf 3 is "if Max > 239.51 AND Std > 102.44 → Records = Seizure".
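Written out as code, the rules of decision tree 2 amount to a pair of nested threshold tests. The thresholds are the ones read off the tree above; the function itself is only an illustration.

def classify(record):
    """Apply the logic rules of decision tree 2 (Fig. 7.13 (b))."""
    if record["Max"] <= 239.51:
        return "Non-seizure"   # leaf 1
    if record["Std"] <= 102.44:
        return "Non-seizure"   # leaf 2
    return "Seizure"           # leaf 3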
Moreover, from all the decision trees shown in Fig. 7.13, it can be seen that the standard deviation attribute has the most influence in categorizing seizure and non-seizure records. In addition, the standard deviation of seizure signals is higher than that of non-seizure signals. This matches the logic rule "if Std > 102.44 → Seizure" (see Fig. 7.13) perfectly. Moreover, the data set is labeled based on the clustering results obtained by our technique. If the clustering results were inaccurate, the decision trees produced from the data set would not be
so accurate; here, 12 out of 13 leaves in the trees (see Fig. 7.13) give 100% accurate classification. This re-confirms the sensible clustering solution obtained by our technique.
7.4.8 Experimental Results on 10 Real Life Data sets
In Section 7.4.5, we empirically compare our proposed technique CSClust with K-means
(Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007), AGCUK (Y. Liu et al., 2011), GAGR
(D.-X. Chang et al., 2009) and GenClust (Rahman & Islam, 2014) on a brain data set (CHB-MIT Scalp EEG (Goldberger et al., 2000)). In this section, we experimentally evaluate the
performance of CSClust by comparing it with the five existing techniques on 10 real-life data
sets.
Fig. 7.14: Comparative results between CSClust and other techniques based on Silhouette Coefficient (higher the better)
Fig. 7.15: Comparative results between CSClust and other techniques based on DB Index (lower the better)
Fig. 7.14 shows the average Silhouette Coefficient of the clustering solutions, where CSClust achieves better clustering results than AGCUK in 9 out of 10 data sets based on the Silhouette Coefficient. Moreover, in 10 out of 10 data sets the average DB Index over 10 runs of CSClust is lower (i.e. better) than those of K-means, K-means++, GAGR and GenClust (see Fig. 7.15). The rightmost columns of Fig. 7.14 and Fig. 7.15 show the average Silhouette Coefficient and DB Index of all techniques over all data sets. On average, CSClust achieves a clearly better result than all the other techniques.
7.4.9 An Analysis of the Improvement in Chromosomes over the Iterations
In Fig. 7.16, we present the grand average fitness (in terms of the DB Index, where Fitness = 1/DB) of the best chromosomes of CSClust over the 10 data sets for 10 runs. The grand average fitness is plotted against the iterations. As Fig. 7.16 shows, the fitness of the best chromosome increases steadily over the iterations.
Fig. 7.16: Grand Average fitness versus iteration over the 10 data sets
7.4.10 Statistical Analysis
We now carry out a statistical sign test (D. Mason, 1998; Triola, 2001) in order to evaluate the superiority of the results (Silhouette Coefficient and DB Index) obtained by CSClust over the results obtained by the existing techniques. We observe that the results do not follow a normal distribution, so the conditions for a parametric test are not satisfied. Hence, we perform a non-parametric sign test on the Silhouette Coefficient and DB Index as shown in Fig. 7.17. The first five bars for the Silhouette Coefficient and DB Index in Fig. 7.17 show the z-values (test statistic values) for CSClust against the other techniques, while the sixth bar shows the z-ref value. If a z-value is greater than the z-ref value, then the result obtained by CSClust can be considered significantly better than the results obtained by the existing techniques.
Fig. 7.17: Sign test of CSClust on 11 data sets
We carry out the right-tailed sign test at z > 1.96, p < 0.025 in terms of the Silhouette Coefficient and DB Index. Fig. 7.17 shows that the results of CSClust are significantly better than those of the five existing techniques based on both the Silhouette Coefficient and the DB Index.
7.5 Summary
We realize that many existing clustering techniques do not produce sensible clustering solutions,
although their solutions achieve high fitness values based on existing evaluation criteria. These
solutions are typically not useful in knowledge discovery from underlying data sets. Therefore,
in this chapter, we propose a novel clustering technique called CSClust that first learns important
properties of sensible clustering solutions and then applies the information in producing its
clustering solutions.
We apply some existing techniques on a brain data set and find that their clustering solutions either have far too many clusters or have only two clusters where one cluster contains a single record and all other records belong to the other cluster. The proposed clustering technique overcomes this problem and produces the right number of clusters with the right records in the clusters. From the brain data set, it captures 40 seizure records in one cluster and the non-seizure records in another cluster.
While preparing the data set we consider all records from all channels within the seizure time
period to be labelled as seizure records. Thus, we get 115 seizure records in the prepared brain
data set. However, since it was a localized seizure some channels do not capture the seizure
signals even during the seizure period as evident in Fig. 7.10. Hence, although these records are
labelled as seizure records many of them actually do not exhibit any seizure properties and hence
behave like non-seizure records. This is also evident in Fig. 7.7 where we can see many seizure
records are overlapped with non-seizure records.
Interestingly, CSClust captures only the real seizure records in one cluster and does not capture the fake seizure records in that cluster. This demonstrates the suitability of CSClust for producing sensible clustering solutions for knowledge discovery. We then re-label the records based on our clustering solutions and produce a number of decision trees to discover logic rules for seizure and non-seizure records. The logic rules (such as "if Std > 102.44 → Seizure") obtained by the forest appear to be sensible, further confirming the accuracy of the clustering results obtained by the proposed clustering technique.
The data set from which the decision forest is built is labelled based on the clustering results obtained by our technique. If the clustering results were inaccurate, the decision trees produced from the data set would not be so accurate; here, 12 out of 13 leaves in the trees (see Fig. 7.13) give 100% accurate classification. Moreover, the logic rules obtained by the trees make perfect sense. For example, it is clear from the signals of a non-seizure period (see Fig. 7.8) and a seizure period (see Fig. 7.9) that the standard deviation of seizure signals is higher than that of non-seizure signals. This matches the logic rule "if Std > 102.44 → Seizure" (see Fig. 7.13) perfectly. This re-confirms the high quality of the clustering results obtained by our technique.
We then compare our clustering technique with five existing clustering techniques on 10 data
sets and demonstrate the statistically significant superiority of the proposed technique over the
existing techniques in terms of two evaluation criteria namely Silhouette Coefficient and DB
Index.
However, CSClust also has a number of limitations. CSClust learns the properties of sensible clustering solutions by applying the DB Index to the initial population (see Step 2 in Section 7.3). This approach can be problematic since the selection can be biased by the limitations of the DB Index. Therefore, in the next chapter we propose a clustering technique called HeMI++ as a further improvement of HeMI and CSClust. HeMI++ identifies the properties of a sensible clustering solution without any influence of the DB Index.
In addition, we find that while the existing clustering techniques produce non-sensible clusterings, they achieve better evaluation values based on the existing evaluation criteria. Therefore, a good evaluation technique is also highly desirable in order to distinguish sensible from non-sensible clustering solutions. In the next chapter, we also propose a new cluster evaluation technique which is suitable for evaluating both sensible and non-sensible clustering solutions.
Chapter 8
Application of a Novel GA-based Clustering and Tree
based Validation on a Brain Data Set for Knowledge
Discovery
8.1 Introduction
In this chapter we propose a new clustering technique and an evaluation technique. In the
proposed clustering technique, we combine our previous technique called CSClust (see Chapter
7) with HeMI (see Chapter 6), where we also significantly improve the components of CSClust and HeMI. Therefore, we call the proposed technique HeMI++. In this chapter, we achieve our second and third research goals (producing sensible clustering solutions, and a cluster evaluation technique for the better evaluation of sensible and non-sensible clustering results).
During the PhD candidature, we have published the following paper based on this chapter with PhD
supervisors.
Beg, A. H., Islam, M. Z., and Estivill-Castro. V. (2016): Application of a Novel GA-based Clustering
and Tree based Validation on a Brain Data Set for Knowledge Discovery, Information Systems, (Status:
Under Review). (ERA 2010 Rank A*, SJR 2016 Rank Q1, H Index 64).
We first explore the quality of HeMI and some existing clustering techniques. We also
explore the quality of existing evaluation techniques. In Chapter 7, we find that some existing
techniques do not produce sensible clusters. However, in this chapter, we carefully assess the
clustering quality of the existing techniques and HeMI through cluster visualization.
In order to assess the quality of the existing clustering techniques and cluster evaluation
techniques, we use a brain data set (CHB-MIT Scalp) (Goldberger et al., 2000) as an example
which is available from https://physionet.org/cgi-bin/atm/ATM. We plot the data set so that we
can graphically visualize the clusters (see Fig. 8.1). We know that this data set has two types of
records: seizure and non-seizure. We can also see in the figure (Fig. 8.1) that there are clearly
two clusters of records. We then apply the existing clustering techniques on this data set and
plot their clustering results.
We find that some recent, state-of-the-art clustering techniques such as GAGR (D.-X. Chang et al., 2009) and GenClust (Rahman & Islam, 2014) do not produce sensible clusters. We also find that our technique HeMI (as presented in Chapter 6) does not produce sensible clusters. GenClust produces 447 clusters (see Fig. 8.2), which is not sensible as the actual number of clusters in this data set is supposed to be only two. GAGR produces 56 clusters, as shown in Fig. 8.3. HeMI produces two clusters (see Fig. 8.4), where one cluster contains one record and the other cluster contains all the remaining records. Therefore, a clustering technique that can produce a sensible clustering solution is highly desirable.
We also evaluate the clustering quality of the existing techniques based on the internal and
external evaluation criteria (see Table 8.1). While the existing clustering techniques produce non-sensible clusterings, they achieve better evaluation values (compared to a sensible clustering solution) based on the existing evaluation criteria. Therefore, a good evaluation technique is also highly desirable in order to distinguish sensible from non-sensible clustering solutions.
In the proposed clustering technique HeMI++, we introduce a new cleansing and cloning operation that helps to produce a sensible clustering solution. We now briefly introduce the novel components/properties of HeMI++ and their logical justifications as follows.
The central component of HeMI++ is a cleansing operation in each generation that ensures all chromosomes in a population carry a sensible solution. Through our empirical analysis of the existing techniques (see Section 8.2.1) we find that although they produce clustering solutions with better fitness values, they often end up producing non-sensible clustering results.
Therefore, we introduce a cleansing operation that applies two conditions: (i) the number of clusters must be within the range of a minimum and a maximum number of clusters, which are learned by HeMI++ from some properties of a data set, and (ii) the number of records in each cluster must be greater than a threshold minimum number of records, which is again data driven (i.e. not user defined). HeMI++ uses the initial population in order to learn the range of a minimum and a maximum number of clusters and the threshold minimum number of records.
Another important component of HeMI++ is the initial population selection, targeting high-quality chromosomes and better use of the initial population. It produces a high-quality initial population using the same approach as HeMI (see Section 6.2.2 in Chapter 6). HeMI++ stores the information of all the chromosomes that it generates in the initial population. It then learns the necessary properties of a sensible clustering solution for a data set from this initial population, without requiring any user input.
Another interesting idea associated with HeMI++ is the cloning operation that replaces sick chromosomes in each generation/population. In each population, the cleansing operation identifies the sick chromosomes, which are then replaced by chromosomes drawn from a pool of healthy chromosomes found in the initial population. The pool of high-quality chromosomes created for the initial population is expected to contain reasonably healthy chromosomes due to the repeated use of K-means.
Through our empirical analysis of the existing cluster evaluation techniques (see Section 8.2.1) we also observe that they produce inaccurate evaluation values. Sometimes they produce higher evaluation values for non-sensible clustering solutions and lower evaluation values for sensible clustering solutions. Sometimes they produce high evaluation values for both sensible and non-sensible clustering solutions, which is not useful for measuring clustering quality.
Therefore, we also propose a new evaluation technique called Tree Index where we first label
a data set based on the clustering solutions and produce a decision tree (Quinlan, 1993, 1996).
We then calculate the entropy (Pang-Ning Tan, Michael Steinbach, 2005) for each leaf (i.e. the
entropy of the distribution of class values within the leaf) and learn the depth of the leaf in the
tree.
Based on the entropy and depth of a leaf, for all leaves, we then compute an evaluation value
that represents the clustering quality. The basic idea here is the fact that if a clustering result is
good then the labels assigned to the records based on the clustering result should lead to a
decision tree having homogeneous leaves (i.e. low entropy) with small depth in general. On the
other hand if the clustering result is bad, then the resulting tree should struggle to find a pattern
which will be reflected by heterogeneous leaves (i.e. high entropy) with high depth overall.
Imagine an extreme example where the labels are assigned completely randomly (i.e. a very bad
quality clustering) then it will be almost impossible for a tree to find any suitable pattern and
the leaves are likely to be very heterogeneous and perhaps deep.
We evaluate our technique HeMI++ by comparing its performance with the performance of
five existing techniques, namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014), K-means (Lloyd, 1982) and K-means++ (Arthur & Vassilvitskii, 2007). The existing techniques are recent, high-quality and better than many other
techniques as shown in the literature (Sanghamitra Bandyopadhyay & Maulik, 2001, 2002; Lai,
2005; Lin et al., 2005; Murthy & Chowdhury, 1996).
We also compare HeMI++ with HeMI (see Chapter 6). We conduct experiments with the
techniques on 21 real-life data sets that are publicly available in the UCI machine learning
repository (M. Lichman, 2013). The experimental results clearly indicate that HeMI++ performs
better than the existing techniques in terms of our new cluster evaluation technique. The validity
of our new cluster evaluation technique is assessed by applying it on the clustering results
(obtained by various techniques), some sensible and some non-sensible (see Section 8.2.2 and
Section 8.3.5).
We first apply all the techniques on a brain data set (CHB-MIT Scalp data set, see Section
8.3.4), in order to test their suitability in discovering knowledge from a data set. The empirical
analysis presented in Section 8.3.5 indicates that HeMI++ produces a sensible clustering solution which is suitable for deriving knowledge from a data set, whereas all the other techniques fail to generate sensible clustering solutions, making their results unsuitable for discovering knowledge from a data set.
The main contributions of this chapter can be summarized as follows.
• Selection of important properties of sensible clustering solutions (see Component 4 of Section 8.2.3);
• Application of the sensible properties in producing clustering solutions (see Components 10 and 11 of Section 8.2.3);
• A new cluster evaluation technique called Tree Index (see Section 8.2.5);
• Validation of the effectiveness of Tree Index by analyzing it on some ground truth clustering results, which are also graphically visualized (see Section 8.2.1, Section 8.2.2 and Section 8.3.5);
• Demonstration of the effectiveness of the proposed technique through knowledge discovery from a brain data set (see Section 8.3.6 and Section 8.3.8);
• Extensive evaluation on 21 publicly available data sets (see Section 8.3.7).
The rest of the chapter is organized as follows. The proposed technique is described in Section 8.2. In Section 8.3, we discuss the experimental results, and the summary of the chapter is presented in Section 8.4.
8.2 Our Technique
8.2.1 Basic Concepts of Our Clustering Technique HeMI++
In this section, we discuss the basic concepts behind our proposed clustering technique,
HeMI++. We first explore some sensible and non-sensible clustering solutions, and their
evaluations made by various existing evaluation techniques/metrics. We use a brain data set
called CHB-MIT Scalp EEG (Goldberger et al., 2000). Fig. 8.1 shows the structure of the data
set where it has two clusters: seizure and non-seizure.
The original data set (Goldberger et al., 2000) contains 9 non-class attributes and a class
attribute having two possible values: seizure and non-seizure. We pick three non-class attributes (max, min and standard deviation) so that the records can be plotted in three dimensions (see Fig. 8.1). Throughout the chapter we use these three attributes for this data set. The class values of the records are indicated by the shapes: dots represent non-seizure records and plus signs represent seizure records. From Fig. 8.1, we can clearly see the existence of two clusters, one containing mainly the seizure records and the other mainly the non-seizure records.
Fig. 8.1: The three dimensional CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.2: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.2 shows a non-sensible clustering result produced by GenClust on the brain data set. It appears to be a non-sensible clustering solution since it contains 477 clusters, whereas the actual number of clusters in this data set is only two (seizure and non-seizure), as shown in Fig. 8.1. The clusters are plotted with 477 shapes such as dots, plus signs and triangles (see Fig. 8.2). Note that while clustering the records only the non-class attributes are used; the class attribute is not used.
GenClust uses a fitness function called COSEC (see Eq. 8.3) to evaluate the fitness of a chromosome. It calculates the compactness Comp_j (see Eq. 8.1) and the separation Sep_j (see Eq. 8.2) of a cluster C_j, where |C_j| is the number of records belonging to the cluster C_j, s_j is the seed of C_j and x_a is a record of cluster C_j.

\mathrm{Comp}_j = \frac{1}{|C_j|} \sum_{x_a \in C_j} dist(x_a, s_j)    (Eq. 8.1)

\mathrm{Sep}_j = \min_{k \neq j} \{ d(s_j, s_k) \}    (Eq. 8.2)

\mathrm{Fitness} = \sum_{\forall j} \left( \mathrm{Sep}_j - \mathrm{Comp}_j \right)    (Eq. 8.3)
The COSEC value of a chromosome increases when the number of clusters in the chromosome increases. Therefore, due to its use of COSEC, GenClust tends to obtain clustering solutions with a large number of clusters.
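A small sketch of this fitness function follows, assuming the records, labels and seeds are NumPy arrays and that there are at least two non-empty clusters; the names are illustrative.

import numpy as np

def cosec_fitness(records, labels, seeds):
    """COSEC (Eqs. 8.1-8.3): sum over clusters of separation minus compactness.
    Assumes k >= 2 and every cluster non-empty."""
    k = len(seeds)
    fitness = 0.0
    for j in range(k):
        members = records[labels == j]
        comp = np.linalg.norm(members - seeds[j], axis=1).mean()      # Eq. 8.1
        sep = min(np.linalg.norm(seeds[j] - seeds[i])                 # Eq. 8.2
                  for i in range(k) if i != j)
        fitness += sep - comp                                         # Eq. 8.3
    return fitness

Because every additional cluster contributes another positive (separation − compactness) term to the sum, the value tends to grow with the number of clusters, which is the bias noted above.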
Although GenClust produces a non-sensible clustering solution, it surprisingly achieves
higher evaluation values based on the existing cluster evaluation techniques/metrics namely F-
measure, Purity, Silhouette Coefficient, XB Index and DB Index as shown in Table 8.1. Shaded
cells represent the best evaluation values among the techniques.
Fig. 8.3 shows another non-sensible clustering result, which is obtained by GAGR on the brain data set. GAGR generates 56 clusters, which is also not sensible as the original number of clusters in this data set is only two. GAGR uses SSE (see Eq. 8.4) as its fitness function, where k stands for the number of clusters and dist(s_j, x) denotes the distance between a record x and the seed s_j of cluster C_j. In GAGR, the fitness of a chromosome is computed as 1/SSE.
Fig. 8.3: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set
SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(s_j, x)^2    (Eq. 8.4)
In GAGR, as in GenClust, the fitness of a chromosome increases when the number of clusters in the chromosome increases. Accordingly, GAGR tends to generate clustering solutions with a high number of clusters. Interestingly, GAGR also achieves good evaluation values based on the existing cluster evaluation techniques, as shown in Table 8.1.
Fig. 8.4 shows another non-sensible clustering result, which is obtained by HeMI on the brain data set. HeMI generates two clusters, but one cluster contains only one record and all other records belong to the other cluster, which is also not sensible. However, it too achieves good evaluation values based on F-measure, Purity, Silhouette Coefficient, XB Index and DB Index, as shown in Table 8.1.
Table 8.1: Some sensible and non-sensible clustering solutions and their evaluation values based on the existing cluster evaluation metrics
Clustering | F-measure (higher the better) | Purity (higher the better) | Silhouette Coefficient (higher the better) | XB Index (lower the better) | SSE (lower the better) | DB Index (lower the better)
Non-sensible: GenClust | 0.99 | 0.99 | 0.50 | 0.25 | 65.68 | 0.78
Non-sensible: HeMI | 0.83 | 0.71 | 0.89 | 0.27 | 2441.59 | 0.13
Non-sensible: GAGR | 0.99 | 0.98 | 0.13 | 1.03 | 345.66 | 1.55
Sensible Clustering | 0.99 | 0.98 | 0.74 | 0.26 | 1949.67 | 0.33
Fig. 8.4: Clustering result of HeMI on the CHB-MIT Scalp EEG (chb01-03) data set
In Fig. 8.4, dots and plus signs represent Cluster 1 obtained by HeMI: dots represent the records with the class value non-seizure, and plus signs represent the records with the class value seizure. The circles represent Cluster 2 obtained by HeMI: circles represent the records with the class value seizure, and triangles would represent records with the class value non-seizure. There are no triangles in this figure, meaning that Cluster 2 contains no non-seizure records.
HeMI uses the DB Index (see Eq. 8.8) to calculate the fitness of a chromosome. The DB Index is a function of the ratio of the sum of within-cluster scatter to between-cluster separation. If s_j is the seed of the j-th cluster c_j, then the scatter n_{j,q} is calculated as follows.

n_{j,q} = \left( \frac{1}{|c_j|} \sum_{x \in c_j} \| x - s_j \|_2^{\,q} \right)^{1/q}    (Eq. 8.5)

If s_i is the seed of the i-th cluster c_i and s_j is the seed of the j-th cluster c_j, then the distance between them is

d_{ij,t} = \| s_i - s_j \|_t    (Eq. 8.6)

The Davies-Bouldin (DB) Index of K clusters is computed as follows.

R_{i,qt} = \max_{j,\, j \neq i} \left\{ \frac{n_{i,q} + n_{j,q}}{d_{ij,t}} \right\}    (Eq. 8.7)

DB = \frac{1}{K} \sum_{i=1}^{K} R_{i,qt}    (Eq. 8.8)
Fig. 8.5 shows an example of a sensible clustering solution with two clusters. Cluster 1 contains mostly the non-seizure records and Cluster 2 contains mostly the seizure records.
Therefore, our proposed clustering technique HeMI++ aims to produce a clustering solution like the sensible one shown in Fig. 8.5. In doing so, we first observe that HeMI produces the actual number of clusters, as shown in Fig. 8.4, but one cluster contains only one record and all other records belong to the other cluster. If we can handle this problem, then we will be able to produce a sensible clustering result using HeMI. Therefore, we introduce a cleansing and cloning operation aiming to ensure that all chromosomes in a population have a sensible solution.
Fig. 8.5: A sensible clustering result on the CHB-MIT Scalp EEG (chb01-03) data set
We apply a cleansing operation at the end of each iteration; based on the threshold values, it identifies the sick chromosomes that produce bad clustering solutions. We then replace the sick chromosomes, through our cloning operation, with chromosomes drawn from a pool of high-quality chromosomes found in the initial population.
8.2.2 Basic Concepts of Our Cluster Evaluation Technique Tree Index
From our empirical analysis in the previous section, we observe that the existing evaluation
techniques fail to produce accurate evaluation values to identify sensible and non-sensible
clustering solutions (see Table 8.1). Therefore, we realize that a good evaluation technique is
required.
Therefore, we propose a new cluster evaluation technique called Tree Index that can identify sensible and non-sensible clustering solutions. Table 8.2 shows the evaluation of some sensible and non-sensible clustering solutions based on Tree Index. From the evaluation results shown in Table 8.2 it is clear that the proposed evaluation technique is able to distinguish sensible from non-sensible clustering solutions. It produces a good evaluation value for the sensible clustering solution (shown in Fig. 8.5) and bad evaluation values for the non-sensible clustering solutions.
Table 8.2: Evaluation values of some sensible and non-sensible clustering solutions based on Tree Index
Clustering Techniques | Tree Index (lower the better)
GenClust | 5.36
HeMI | ∞
GAGR | 15.36
Sensible Clustering | 0.14
8.2.3 Main Components of HeMI++
We first mention the main steps of HeMI++ as follows and then explain each of them in detail.
Note that Components 4 and 10 are new contributions of HeMI++.
BEGIN
Step 1: Normalization
DO: k = 1 to m /* m is the user defined number of streams */
    Step 2: Population Initialization
END
Step 3: Selection of Sensible Properties
DO: j = 1 to G /* G is the user defined number of intervals */
    DO: k = 1 to m /* m is the user defined number of streams */
        DO: t = 1 to I /* I = 10; I is the user defined number of iterations */
            Step 4: Noise-Based Selection
            Step 5: Crossover Operation
            Step 6: Twin Removal
            Step 7: Three Steps Mutation Operation
            Step 8: Health Improvement Operation
            Step 9: Cleansing Operation
            Step 10: Cloning Operation
            Step 11: The Elitist Operation
        END
    END
    Step 12: Neighbor Information Sharing
END
Step 13: Global Best Selection
END
Component 1: Normalization
HeMI++ takes a data set 𝐷 as input. It first normalizes the data set 𝐷 in order to weigh each
attribute equally regardless of their domain sizes. For normalization, HeMI++ uses the same
approach of normalization of HeMI (see Section 6.2.2 of Chapter 6).
Component 2: Multiple Stream
HeMI++ uses the multiple stream approach of HeMI (see Section 6.2.2 of Chapter 6) in order to take advantage of a big population through multiple streams, where each stream contains a relatively small number of chromosomes.
Component 3: Population Initialization
For the population initialization, HeMI++ uses the same approach that we used in HeMI (see Section 6.2.2 of Chapter 6). HeMI++ selects |P| chromosomes for the initial population: |P|/2 from the deterministic phase and |P|/2 from the random phase. In the experiments of this chapter, we use |P| = 20. HeMI++ uses the Davies-Bouldin (DB) Index (D. L. Davies & Bouldin, 1979) to calculate the fitness of a chromosome.
Component 4: Selection of Sensible Properties
This is an original contribution of HeMI++ that makes better use of the initial population for finding the necessary properties of a sensible clustering solution. HeMI++ selects the |P| top chromosomes from the generated initial chromosomes based on their fitness (DB value). However, the DB Index is biased towards a low number of clusters and a low number of records in a cluster (see Section 8.2.1). Therefore, HeMI++ finds the necessary properties of a sensible clustering solution from all the generated initial chromosomes rather than only the top ones. The properties of the sensible clustering solution are then used in each generation in order to ensure that the chromosomes in a population do not contradict the properties.
In the initial population phase, HeMI++ produces 90×4 = 360 chromosomes, as each of its 4 streams generates 90 chromosomes. HeMI++ finds the minimum number of records in a cluster for each of the 360 chromosomes. It then sorts these numbers in descending order and calculates their median. The median value is then used as the property (of a sensible clustering solution) relating to the minimum number of records in a cluster. HeMI++ similarly finds the minimum and maximum number of clusters in a clustering solution based on the 360 chromosomes. These values are then used in the cleansing operation (see Component 10) in order to identify a sensible clustering solution.
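A sketch of this property selection follows. The median rule for Mr is as stated above; how exactly "similarly" applies to Mn and Mx is not spelled out, so the sketch simply takes the extremes of the observed cluster counts, which should be read as an assumption.

import statistics

def sensible_properties(cluster_sizes_per_chromosome):
    """cluster_sizes_per_chromosome: one list of cluster sizes per initial
    chromosome, e.g. 360 lists such as [120, 95, 40]."""
    Mr = statistics.median(min(sizes) for sizes in cluster_sizes_per_chromosome)
    counts = [len(sizes) for sizes in cluster_sizes_per_chromosome]
    Mn, Mx = min(counts), max(counts)   # assumed reading of "similarly"
    return Mn, Mx, Mr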
Note that CSClust (see Step 3 of Section 7.3 in Chapter 7) also uses a similar component. However, there are some significant differences. First, CSClust does not use multiple streams, and hence it finds the properties based on the 90 chromosomes of its single stream. Second, in order to find the minimum number of records in a cluster, CSClust identifies the best 20 chromosomes according to their DB Index values and then picks the minimum number of records in a cluster out of all the clusters in these 20 chromosomes. Since the best 20 chromosomes are selected based on their DB Index values, CSClust suffers from the drawbacks of the DB Index in identifying the properties of a sensible clustering solution.
Component 5: Noise-based Selection
At the beginning of each generation starting from Generation 2, we carry out the Noise Based
Selection (Y. Liu et al., 2011) in order to get a new population for subsequent genetic operations
such as crossover and health improvement. HeMI++ uses the same approach of noise-based
selection that we used in DeRanClust (see Section 3.2 in Chapter 3).
Algorithm 8.1: HeMI++
Input: A data set D having n records and |A| attributes, where A is the set of attributes
Output: A set of clusters C
Require:
Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
Pr ← ∅ /* Pr is the set of random chromosomes (45 chromosomes), initially empty */
Step 1: /* Normalization */
D′ ← Normalize(D) /* normalize each attribute of the data set D into the normalized data set D′ */
for k = 1 to m do /* default m = 4; m is the user defined number of streams and k is the counter of m */
    Step 2: /* Population Initialization */
    Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the initial seeds for K-means are chosen deterministically */
    Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the number of initial seeds and the seeds themselves are chosen randomly */
    Px ← SelectDeterministicChromosomes(Pd) /* select 10 chromosomes (50% of the initial population) based on fitness */
    Py ← SelectRandomChromosomes(Pr) /* select 10 chromosomes (50% of the initial population) based on fitness */
    Ps ← Ps ∪ (Px ∪ Py) /* Ps is the initial population (20 chromosomes) */
    Pb ← FindBestChromosome(Ps) /* Pb is the chromosome with the maximum fitness value in Ps */
end
Step 3: /* Selection of Sensible Properties */
(Mn, Mx, Mr) ← FindSensibleProperties(Pr, Pd) /* find the minimum number of clusters (Mn), the maximum number of clusters (Mx) and the minimum number of records in a cluster (Mr) */
for j = 1 to G do /* default G = 5; G is the number of intervals of the total number of iterations and j is the counter of G */
    for k = 1 to m do /* default m = 4; m is the user defined number of streams and k is the counter of m */
        for t = 1 to I do /* default I = 10; I is the user defined number of iterations per interval and t is the counter of I */
            Step 4: /* Noise-Based Selection */
            if t > 1 then
                Ps ← NoiseBasedSelection(Ps^t, Ps^(t+1)) /* perform noise based selection between the current (Ps^(t+1)) and previous (Ps^t) generation */
            end
            Step 5: /* Crossover Operation */
            Po ← PerformCrossover(Ps) /* perform single point crossover on Ps and get a set of offspring chromosomes Po */
            Step 6: /* Twin Removal */
            Po ← TwinRemoval(Po) /* perform twin removal on Po and get a set of chromosomes Po */
            Step 7: /* Three Steps Mutation Operation */
            Pm ← PerformMutationOperation(Po) /* perform the three steps mutation operation on Po and get a set of mutated chromosomes Pm */
            Step 8: /* Health Improvement Operation */
            Pc ← PerformHealthImprovementOperation(Pm) /* perform the three steps health improvement operation on Pm and get healthy chromosomes Pc */
            Step 9: /* Cleansing Operation */
            Sc ← FindSickChromosomes(Pc, Mn, Mx, Mr) /* find sick chromosomes Sc in Pc based on Mn, Mx and Mr */
            Pc ← Pc − Sc /* remove the Sc chromosomes from Pc */
            Step 10: /* Cloning Operation */
            while |Sc| > 0 do
                Hc ← CloningOperation(Pd) /* replace a sick chromosome from Pd and get a set of healthy chromosomes Hc */
            end
            Pc ← Pc ∪ Hc /* insert Hc into Pc */
            Step 11: /* Elitist Operation */
            Pb^k ← ElitistOperation(Pc, Pb) /* apply the elitist operation on Pc and Pb and find the best chromosome Pb^k */
            Pg ← Pg ∪ {Pb^k} /* insert Pb^k into Pg; Pg is the set of chromosomes containing the best chromosome of each stream */
        end
    end
    Step 12: /* Neighbor Information Sharing */
    for k = 1 to m do
        Pk ← NeighborInformationSharing(Pc, Pg) /* apply neighbor information sharing on Pc and get updated chromosomes Pk */
        Pb^k ← FindStreamBestChromosome(Pk) /* find the best chromosome of the stream */
        Sb ← Sb ∪ {Pb^k} /* insert Pb^k into Sb; Sb is the set of chromosomes containing the best chromosome of each stream */
    end
end
Step 13: /* Global Best Selection */
C ← FindGlobalBestChromosome(Sb) /* find the global best chromosome C from Sb */
Return C
Component 6: Crossover Operation
All chromosomes in a population participate in crossover pair by pair. To perform crossover,
HeMI++ uses the same approach of crossover that we used in HeMI (see Section 6.2.2 of
Chapter 6).
Component 7: Twin Removal
HeMI++ uses the twin removal operation in order to remove/modify twin genes (if any) from
each chromosome. For twin removal, HeMI++ uses the same approach of twin removal of
DeRanClust (see Section 3.2 of Chapter 3).
Component 8: Three Steps Mutation Operation
HeMI++ uses three steps of mutation operations: division, absorption and a random change.
The mutation operation of HeMI++ is the same as the mutation operation of HeMI (see Section
6.2.2 of Chapter 6).
Component 9: Health Improvement Operation
This component aims to continuously improve the health of chromosomes within a population
in order to ensure the presence of healthy (high-quality) chromosomes in each generation.
HeMI++ uses the same approach of health improvement operation of HeMI (see Section 6.2.2
of Chapter 6).
Component 10: Cleansing Operation
This is an original contribution of HeMI++. The aim of this component is to identify the chromosomes in a population with sensible and non-sensible solutions. HeMI++ first learns the necessary properties [the minimum (Mn) and maximum (Mx) number of clusters, and the minimum number of records (Mr) in a cluster] of a sensible clustering solution through Component 4. HeMI++ then applies the cleansing operation to each chromosome using the same approach of cleansing operation that we used in CSClust (see Section 7.3 of Chapter 7).
Component 11: Cloning Operation
The cloning operation replaces the sick chromosomes found in the cleansing operation. To
replace a sick chromosome, HeMI++ uses the same approach of cloning operation that we used
in CSClust (see Section 7.3 in Chapter 7).
Component 12: The Elitist Operation
The Elitist operation keeps track of the best chromosome throughout the generations in order to
ensure the continuous improvement of the quality of the best chromosome found so far over the
iterations. For finding the best chromosome, HeMI++ uses the same approach of elitist operation
that we used in DeRanClust (see Section 3.2 of Chapter 3).
Component 13: Neighbor Information Sharing
HeMI++ uses the Neighbor Information Sharing component of HeMI (see Section 6.2.2 of Chapter 6) in order to share/exchange the best chromosome among neighboring streams at a regular interval, such as every 10th iteration. The main idea of this operation is to give a stream a chance to borrow a good chromosome from its neighboring streams after every 10 iterations. Thus, streams can share their good chromosomes and help each other.
Component 14: Global Best Selection
HeMI++ uses this component in order to find the global best chromosome among the multiple streams. At the end of all iterations, each stream holds its best chromosome. HeMI++ compares the best chromosomes of all streams and selects the best of them as the final clustering solution. Each gene of the best chromosome represents the seed/centroid of a cluster, and records are allocated to their closest seeds to form the final clusters.
8.2.4 The HeMI++ Algorithm
We now present the HeMI++ algorithm, which is shown in Algorithm 8.1. HeMI++ first takes
a data set 𝐷 as an input and normalizes all attributes separately as explained in Component 1. It
then takes the user defined number of multiple streams as explained in Component 2. The default
number of multiple streams in HeMI++ is set to 4.
HeMI++ then produces initial chromosomes for each stream separately through the
Population Initialization as explained in Component 3 (see Step 2 of Algorithm 8.1). It then applies its proposed component, the Selection of Sensible Properties (see Component 4 and Step 3 of Algorithm 8.1), in order to find the necessary properties of a sensible clustering solution.
HeMI++ applies the noise-based selection operation from the 2nd iteration as explained in
Component 5 (see Step 4 of Algorithm 8.1).
HeMI++ then sequentially applies the Crossover, Twin Removal, Mutation and Health Improvement operations. All these operations are explained above (see Component 6 to Component 9, and Step 5 to Step 8 in Algorithm 8.1). HeMI++ applies the cleansing and cloning operations (Components 10 and 11, and Steps 9 and 10 of Algorithm 8.1) in order to increase the chance that all chromosomes in a population do not contradict the properties of a sensible solution.
It then performs the Elitist operation (see Component 12 and Step 11 of Algorithm 8.1) to
find the best chromosome. In order to take advantage of the multiple streams, HeMI++ then
applies the neighbor information sharing component (see Component 13 and Step 12 of
Algorithm 8.1) at a regular interval. In this study, the default value of the interval is 10 iterations.
At the end of all iterations, HeMI++ applies the Global Best Selection (see Component 14 and
see Step 13 of Algorithm 8.1) operation in order to find the final clustering solution.
8.2.5 Our Cluster Evaluation Technique (Tree Index)
We now propose a new cluster evaluation technique called Tree Index which is able to better
evaluate clustering solutions than conventional cluster evaluation metrics. The steps of the
proposed cluster evaluation technique are as follows.
Step 1. The proposed cluster evaluation technique first labels a data set based on the clustering result that it wants to evaluate. For example, if a clustering technique generates a clustering result with three clusters, then Tree Index labels the data set considering the three clusters as three different class values.
Step 2. It then builds a decision tree on the labelled data set to classify the records based on
their labels. It can use any existing decision tree algorithm. In this study we have used C4.5
(Quinlan, 1993, 1996).
Step 3. Tree Index then finds the entropy (Pang-Ning Tan, Michael Steinbach, 2005) of each leaf of the tree. The entropy is a well-known measure of the level of uncertainty in a distribution.
Step 4. It then finds the depth of each leaf of the tree. Typically, a tree having a lower depth
represents a higher agreement between the class labels and corresponding records of a data set
than a tree with a higher depth.
Step 5. It then computes the evaluation value E as follows.

E = \frac{\sum_{i=1}^{l} E_i \times k_i}{|c|}, \qquad k_i = \begin{cases} d_i & \text{if } d_i > 0 \\ \infty & \text{if } d_i = 0 \end{cases}    (Eq. 8.9)

where E_i is the entropy of the i-th leaf, l is the number of leaves, |c| is the number of possible class values (which is the same as the number of clusters), and d_i is the depth of the i-th leaf. The value of k_i is d_i when d_i is greater than 0, and ∞ when d_i is 0. A depth d_i = 0 means that the tree has a single leaf with depth zero; that is, the root node itself is the only leaf. This means that a tree has not been built, indicating that there is no strong pattern in the data set. This can happen when the records are labelled incorrectly, meaning that the clustering results are of poor quality. On the other hand, a good clustering will result in a good labeling of the records, which will then build a shallow tree with homogeneous leaves (zero entropy). This will obtain a very low E value in Eq. 8.9.
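Eq. 8.9 reduces to a few lines of code. The sketch below assumes the tree's leaves have already been extracted as (entropy, depth) pairs; this representation is illustrative.

def tree_index(leaves, num_classes):
    """Tree Index (Eq. 8.9). leaves: list of (entropy, depth) pairs;
    num_classes: |c|, the number of class values (= number of clusters)."""
    total = 0.0
    for entropy, depth in leaves:
        if depth == 0:                 # root is the only leaf: no pattern found
            return float("inf")
        total += entropy * depth       # E_i * k_i
    return total / num_classes

A single-leaf tree (depth 0) yields ∞, matching the HeMI entry in Table 8.2, while shallow homogeneous leaves drive the value towards zero, as for the sensible solution (0.14).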
8.3 Experimental Results and Discussion
8.3.1 The Data Sets and the Evaluation Criteria
We empirically compare the performance of our proposed technique HeMI++ with five existing
techniques namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust
(Rahman & Islam, 2014), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007),
and our technique HeMI (see Chapter 6) on a brain data set (CHB-MIT data set) (Goldberger et
al., 2000). We also compare the performance of HeMI++ with the existing techniques on 21
natural data sets that are available in the UCI machine learning repository (M. Lichman, 2013).
Detailed information about the data sets is presented in Table 8.3. We choose a wide variety of data sets. For example, some data sets (such as Glass Identification) have only numerical attributes and some data sets (such as Credit Approval) have both numerical and categorical attributes. We mostly choose data sets with numerical attributes because the techniques (except HeMI, HeMI++ and GenClust) that we use in this study can only handle numerical attributes.
Some data sets have a low number of attributes and some have a high number of attributes. For example, the Blood Transfusion (BT) data set has only 4 attributes while the Hepatitis (HT) data set has 19 attributes. Similarly, some data sets have a low number of records, such as the Zoo data set with only 101 records, and some have a relatively high number of records, such as the Chess (King-Rook vs. King) (CKRK) data set with 28,056 records. In addition, some data sets have a low number of class values, such as Vertebral Column (VC) with only two class values, and some have a high number of class values, such as Leaf (LF) with 36 class values.
Table 8.3: A brief description of the data sets
Data set No. of Records
with missing
No. of Records
without missing
No. of numerical
attributes
No. of categorical
attributes
Class size
Zoo 101 101 1 18 7
Hepatitis (HT) 155 80 6 13 2
Glass Identification (GI) 214 214 10 0 7
Statlog Heart (STH) 270 270 6 7 2
Vertebral Column (VC) 310 310 6 0 2
Ecoli (EC) 336 336 8 0 8
Leaf (LF) 340 340 16 0 36
Liver Disorder (LD) 345 345 6 0 2
Credit Approval (CA) 690 653 6 9 2
Breast Cancer Wisconsin Original (WBC) 699 683 10 0 2
Blood Transfusion (BT) 748 748 4 0 2
Pima Indian Diabetes (PID) 768 768 8 0 2
Statlog Vehicle Silhouettes (SV) 846 846 18 0 4
Bank Note Authentication (BN) 1,372 1,372 4 0 2
Contraceptive Method Choice (CMC) 1,473 1,473 2 7 3
Yeast (YT) 1,484 1,484 8 0 10
Image Segmentation (IS) 2,310 2,310 18 0 7
Wine Quality (WQ) 4,898 4,898 11 0 7
Page Blocks Classification (PBC) 5,473 5,473 10 0 5
MAGIC Gamma Telescope (MGT) 19,020 19,020 11 0 2
Chess (King-Rook vs. King) (CKRK) 28,056 28,056 3 3 18
Class values are the labels of records that represent an important property of a data set.
Typically the data set, for which a clustering technique is used, does not have class values for
its records. Therefore, before applying a clustering technique on a data set we remove the class
attribute.
Some of the data sets contain missing values; that is, some attribute values of some records are missing. We delete all records having any missing values. For example, the Credit Approval (CA) data set has altogether 690 records, but 37 of them have one or more missing values. Hence, after deleting these 37 records the data set has 653 records without any missing values. We evaluate and compare the techniques based on the proposed evaluation technique Tree Index, where a lower evaluation value represents a better clustering result.
8.3.2 The Parameters used in the Experiments
In the experiments on AGCUK, GAGR, GenClust, HeMI and HeMI++ the population size is
set to 20 and the number of iterations/generations is set to 50. In order to ensure a fair
comparison among the techniques we maintain this consistency. The number of iterations in K-means and K-means++ is set to 50, and the number of iterations of K-means within GenClust is also set to 50. The cluster number 𝑘 in GAGR, K-means and K-means++ is user defined. Hence, to simulate a user-defined 𝑘, in this study we generate 𝑘 randomly (for GAGR, K-means and K-means++) in the range 2 to √𝑛, where 𝑛 is the number of records in a data set. The threshold value for K-means is set to 0.005.
The value of 𝑟𝑚𝑎𝑥 and 𝑟𝑚𝑖𝑛 in AGCUK and HeMI are set to 1 and 0 respectively. For our
proposed cluster evaluation technique we need to build a decision tree from a data set where
records are labelled based on the clustering result that is being evaluated. While building the decision tree we need to assign a minimum number of records for each leaf. In this study we assign 1% of the records of a data set, as long as it stays within the range 2 to 15. If 1% of the records is less than 2 then we assign 2, and if 1% of the records is more than 15 then we assign 15.
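This clamping rule is simple enough to state as a one-liner; a small sketch follows (the function name is ours).

```python
def min_leaf_records(n):
    """Minimum number of records per leaf: 1% of the data set,
    clamped to the range [2, 15] as described above."""
    return min(15, max(2, round(0.01 * n)))

print(min_leaf_records(101))    # small data set (Zoo)  -> 2
print(min_leaf_records(28056))  # large data set (CKRK) -> 15
```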
8.3.3 The Experimental Setup
On each data set, we run HeMI++ 20 times, since it can give different clustering results in
different runs. We also run all the other techniques (GenClust, GAGR, AGCUK, K-means, K-means++ and HeMI) 20 times each. We then present the average clustering results. We understand
that clustering techniques are generally applied on unlabelled data sets that do not have any class
attribute. Hence, the class attributes are removed before clustering them. They are again used
for an external evaluation of the clustering results.
8.3.4 Brain Data set Pre-processing
Before experimentation, we first prepare the CHB-MIT Scalp EEG data set (Goldberger et al.,
2000). For the data pre-processing HeMI++ uses the same approach of data pre-processing that
we used in CSClust (see Section 7.4.4 of Chapter 7). In HeMI++, we prepare one hour of data of one patient (chb01_03), an 11-year-old girl. This data set has the recordings of 23 channels, each channel contributing 360 records; hence, from all 23 channels altogether we get 360×23 = 8280 records. In this data set the patient experienced a seizure for around 40 seconds (from the 2996th second to the 3036th second). During this period we get 5 records from each channel. These records are considered as seizure records and all other records are considered as non-seizure records. Therefore, from the chb01_03 data set altogether we get 23×5 = 115 seizure records and 8165 non-seizure records.
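These counts follow directly from the epoch length: one hour of signal divided into 10-second epochs (cf. Fig. 8.14 to Fig. 8.18) gives 3600/10 = 360 records per channel. A quick sketch of the bookkeeping, purely illustrative:

```python
channels = 23
epoch_seconds = 10          # 10-second epochs, cf. Fig. 8.14 - Fig. 8.18
hour_seconds = 3600

records_per_channel = hour_seconds // epoch_seconds    # 360
total_records = channels * records_per_channel         # 8280

# the ~40 s seizure (seconds 2996-3036) overlaps five 10 s epochs
seizure_per_channel = 5
seizure_records = channels * seizure_per_channel       # 115
non_seizure_records = total_records - seizure_records  # 8165
print(total_records, seizure_records, non_seizure_records)
```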
8.3.5 Clustering Quality Comparison between HeMI++ and Other Techniques on the
MIT-Chb01_03 Data Set
In this section, we empirically compare HeMI++ with AGCUK (Y. Liu et al., 2011), GAGR
(D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014), K-means (Lloyd, 1982), K-
means++ (Arthur & Vassilvitskii, 2007) and our technique HeMI (see Chapter 6) on a brain data
set (MIT-chb01_03) through visual analysis of clustering results. We also compare all the
techniques based on our proposed cluster evaluation technique. In this section, we use three
attributes (max, min and std) of the data set in order to plot the records so that we can see the
records and their orientations. Such plots also help us to see clustering results and their
appropriateness.
Fig. 8.6: Clustering result of HeMI on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.7 : Clustering result of AGCUK on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.6 shows the clustering result of HeMI on the CHB-MIT Scalp EEG (chb01-03) data
set. HeMI generates two clusters but one cluster contains only one record and all other records
belong to the other cluster. Clearly, this does not appear to be a sensible clustering. From Table
8.4 we can see that according to our proposed cluster evaluation technique HeMI receives a poor
evaluation result which is ∞. Therefore, the evaluation made by our proposed evaluation
technique matches with the manual evaluation through the visual analysis of the plotted records.
Fig. 8.8: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.7 shows the clustering result of AGCUK where it generates two clusters: seizure and
non-seizure. Mainly, the non-seizure records found in Cluster 1 and mixed of seizure and non-
seizure records are found in Cluster 2. Cluster 1 has 2836 non-seizure records (dots in Fig. 8.7)
and 0 seizure records (plus signs in Fig. 8.7), while Cluster 2 has 5389 non-seizure records
(triangles in Fig. 8.7) and 55 seizure records (circles in Fig. 8.7). We can clearly see that while
the clustering result is more sensible than the clustering result of HeMI (see Fig. 8.6), it is still
not a good clustering result. Our proposed cluster evaluation technique also identifies this: as we can see in Table 8.4, AGCUK scores better than HeMI. This again re-confirms the effectiveness
of our proposed evaluation technique.
Fig. 8.8, Fig. 8.9, Fig. 8.10 and Fig. 8.11 show the clustering results of GAGR, GenClust, K-
means and K-means++ where GAGR, GenClust, K-means and K-means++ produce 56, 477, 28
and 13 clusters, respectively. Considering that the data set has only two types of records, seizure and non-seizure, these clustering results with so many clusters also do not make sense.
This is also identified by our proposed evaluation technique as shown in Table 8.4.
Fig. 8.9: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set
As we can see in Fig. 8.12, HeMI++ produces a sensible clustering solution as it matches
with the original orientation of records in the data set. It produces two clusters: Cluster 1 and
Cluster 2. Cluster 1 contains 8219 non-seizure records and 38 seizure records, while Cluster 2
contains 6 non-seizure records and 17 seizure records. As a result, HeMI++ also achieves a good
evaluation value based on our proposed evaluation technique as shown in Table 8.4. This re-
confirms that the proposed evaluation technique produces better evaluation values for better clustering solutions.
Fig. 8.10: Clustering result of K-means on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.11: Clustering result of K-means++ on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.12: Clustering result of our proposed technique, HeMI++ on the CHB-MIT Scalp EEG (chb01-03) data set
Table 8.4: Clustering results of HeMI++ and other techniques based on Tree Index
Clustering Techniques Tree Index (lower the better)
HeMI++ 0.55
HeMI ∞
GenClust 5.27
GAGR 19.89
AGCUK 18.19
K-means 27.41
K-means++ 31.01
8.3.6 Analysis of the Clustering Result Obtained by HeMI++ from the CHB-MIT Scalp
EEG (chb01-03) Data Set
We now analyze the clustering results obtained by HeMI++ on the CHB-MIT Scalp EEG
(chb01-03) data set in order to explore knowledge from this data set. Detailed information of
the records in Cluster 2 is presented in Table 8.5. The first column in Table 8.5 shows the
channel number and position of the channel on the surface of the scalp according to the
International 10-20 system. Columns 2 and 3 present the number of seizure and non-seizure
records in Cluster 2 for every channel.
Table 8.5: Channel wise number of records in Cluster 2 of HeMI++ on the CHB-MIT Scalp EEG (chb01-03) data set
Channel (Position) | Number of seizure records | Number of non-seizure records
Channel 1 (FP1-F7) 2 0
Channel 5 (FP1-F3) 2 0
Channel 9 (FP2-F4) 2 0
Channel 10 (F4-C4) 1 0
Channel 13 (FP2-F8) 2 0
Channel 14 (F8-T8) 2 0
Channel 15 (T8-P8) 2 1
Channel 16 (P8-O2) 0 3
Channel 17 (FZ-CZ) 0 1
Channel 21 (FT9-FT10) 2 0
Channel 23 (T8-P8) 2 1
The number of seizure records in this data set is 115, considering that each of the 23 channels
has 5 seizure records. However, we have access to the original brain signal for all 23 channels.
10-second epochs of the signal for some of the channels are shown in Fig. 8.14 - Fig. 8.18.
Through our analysis of the signal for all 23 channels we realize that only 11 out of 23 channels
actually show a seizure pattern during the seizure time. Therefore, there are actually only 11×5 = 55 seizure records. When we plot these records in Fig. 8.1, around 25 of them are clearly visible and the others overlap with other records.
In order to visualize the seizure records we plot the records of Channel-5, Channel-13,
Channel-17 and Channel-21 as shown in Fig. 8.13 (a), (b), (c) and (d), respectively. Fig. 8.13
(a) to Fig. 8.13 (d) show that in each channel, around two or three out of the 5 seizure records are clearly
visible. Therefore, HeMI++ finds 23 records in Cluster 2, where 17 of them are seizure records.
It finds some non-seizure records in Cluster 2 because they are very similar to seizure records.
(a) Channel-5
(b) Channel-13
(c) Channel-17
(d) Channel-21
Fig. 8.13: Seizure records of different channels
Fig. 8.14 shows the signals of Channel-5 during the non-seizure time where the amplitude
varies between 200 uV and -200 uV. Fig. 8.15, Fig. 8.17 and Fig. 8.18 show the signals of
Channel-5, Channel-9 and Channel-13, respectively during the seizure time, where the signals
show high amplitude (between 400 uV and -400 uV). Hence, we realize that the seizure time signal generally displays high amplitude whereas the non-seizure time signal displays low amplitude. However, Fig. 8.16 displays overall low amplitude (much like any non-seizure time signal) although it is the signal of Channel-7 during the seizure time.
Thus, we realize that Channel-7 does not experience a seizure signal even during the seizure
time.
Fig. 8.14: EEG signals (10 seconds) of Channel-5 during the non-seizure time
Fig. 8.15: EEG signals (10 seconds) of Channel-5 during the seizure time
Fig. 8.16: EEG signals (10 seconds) of Channel-7 during the seizure time
Fig. 8.17: EEG signals (10 seconds) of Channel-9 during the seizure time
Fig. 8.18: EEG signals (10 seconds) of Channel-13 during the seizure time
We also find that all 11 channels showing the seizure signal during the seizure time are
located in the frontal lobe and temporal-parietal lobe of the scalp (see Fig. 8.19) indicating that
the seizure for this patient was a localized seizure that originated in the frontal lobe and temporal-parietal lobe. Out of these 11 channels, 8 are located in the frontal lobe and the 3 others (Channel-15, Channel-16 and Channel-23) are located in the temporal-parietal lobe.
Interestingly, most of the records that HeMI++ places in Cluster 2 also come from these 11 channels.
This again re-confirms the quality of the clustering results obtained by our proposed technique.
Fig. 8.19: Channel positions according to the International 10-20 system (Jasper, 1958; Sharbrough F et al., 1991). (A = Ear lobe, C = Central, P = Parietal, F = Frontal, Fp = Frontal polar, O = Occipital; the Frontal lobe and Temporal-Parietal lobe regions are marked.)
8.3.7 Evaluation of HeMI++ and Tree Index on the LD data set
In order to further evaluate Tree Index and HeMI++, in this section we empirically compare the clustering results of all the techniques on the LD data set based on Tree Index. We also graphically visualize the clustering results in order to validate the correctness of the Tree Index evaluation. In this section, we use three attributes of the data set, mcv (mean corpuscular volume), alkphos (alkaline phosphatase) and sgpt (alanine aminotransferase), in order to plot the records so that we can see the records and their orientations. Such plots of clustering results help us to evaluate the correctness of the Tree Index evaluation.
Fig. 8.20: Clustering result of HeMI on the LD data set
Fig. 8.21: Clustering result of AGCUK on the LD data set
Fig. 8.22: Clustering result of GAGR on the LD data set
Fig. 8.23: Clustering result of GenClust on the LD data set
Fig. 8.24: Clustering result of K-means on the LD data set
Fig. 8.25: Clustering result of K-means++ on the LD data set
Fig. 8.26: Clustering result of HeMI++ on the LD data set
Fig. 8.27: The three dimensional LD data set
Table 8.6: Comparative results of all the techniques on the LD data set based on Tree Index and other evaluation techniques
Techniques | F-measure (higher the better) | Entropy (lower the better) | Purity (higher the better) | Silhouette Coefficient (higher the better) | XB Index (lower the better) | SSE (lower the better) | DB Index (lower the better) | Tree Index (lower the better)
GenClust 0.82 0.43 0.82 0.67 0.19 2.64 0.50 3.53
HeMI 0.73 0.97 0.57 0.73 0.05 100.96 0.35 ∞
GAGR 0.63 0.92 0.60 0.09 1.04 59.56 1.98 5.62
HeMI++ 0.73 0.98 0.57 0.43 0.63 95.85 0.96 0.39
K-Means 0.60 0.94 0.60 0.21 0.34 16.60 1.10 6.21
K-Means++ 0.65 0.96 0.59 0.12 0.23 57.27 1.13 9.62
AGCUK 0.73 0.97 0.57 0.75 0.20 33.73 0.36 ∞
Fig. 8.20 to Fig. 8.26 show the clustering results of HeMI, AGCUK, GAGR, GenClust, K-means, K-means++ and HeMI++, respectively. Fig. 8.27 shows the original structure of the LD data set.
As we can see in Fig. 8.26, HeMI++ produces a sensible clustering solution as it closely matches the original orientation of records in the data set (Fig. 8.27). Therefore, HeMI++ achieves a good evaluation value based on the Tree Index evaluation as shown in Table 8.6. However, it achieves bad evaluation values based on Entropy, Purity, Silhouette Coefficient, XB Index and DB Index as shown in Table 8.6. Fig. 8.20 shows that HeMI produces a non-sensible clustering result. It produces two clusters where one cluster contains one record and the other cluster contains all the remaining records. Yet it achieves good evaluation values based on F-measure, Silhouette Coefficient, XB Index and DB Index as shown in Table 8.6. Similar to HeMI, AGCUK also produces a non-sensible clustering solution (see Fig. 8.21) and achieves good evaluation values based on F-measure, Silhouette Coefficient, XB Index and DB Index as shown in Table 8.6. However, Tree Index produces bad evaluation values for both HeMI and AGCUK as shown in Table 8.6. Similarly, Tree Index produces bad evaluation values for the other non-sensible clustering results produced by GAGR, GenClust, K-means and K-means++ (see column 8 of Table 8.6). This again re-confirms that Tree Index produces good evaluation values for good clustering solutions and bad evaluation values for bad clustering results.
8.3.8 Experimental Results on All Techniques on 21 Real Life Data Sets
In Section 8.3.5 we empirically compare our proposed technique HeMI++ with AGCUK (Y.
Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014), K-
means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and our technique HeMI (see
Chapter 6) on a brain data set (CHB-MIT Scalp EEG). We now experimentally evaluate the
performance of HeMI++ by comparing it with K-means, K-means++, GAGR, AGCUK,
GenClust and HeMI on 21 other real-life data sets. For each data set, we run each technique 20
times. Since there are 6 data sets with categorical attributes, which K-means, K-means++, AGCUK and GAGR cannot handle, these techniques are tested on 15 (instead of 21) data sets. However, HeMI++, HeMI and GenClust are tested on all 21 data sets.
In Section 8.3.5 we demonstrate the effectiveness of our proposed cluster evaluation
technique. We find that our cluster evaluation technique matches the expectation when we
investigate the clustering results manually through visual analysis. Besides, in Table 8.1 we have
presented the limitations of the existing cluster evaluation metrics. Therefore, in Table 8.7 we
present the average evaluation values based on our proposed evaluation technique. HeMI++ achieves better results than GenClust in 13 out of 15 numerical data sets. HeMI++ achieves better evaluation values than K-means, K-means++, AGCUK, GAGR and HeMI in all 15 data sets.
Table 8.7: Comparative results between HeMI++ and other techniques on 15 numerical data sets based on Tree Index
Data set Tree Index (lower the better)
K-means K-means++ GAGR AGCUK GenClust HeMI HeMI++
GI 2.11 1.87 3.47 0.92 2.47 ∞ 0.31
VC 4.94 4.11 5.37 ∞ 3.55 ∞ 1.53
EC ∞ ∞ 6.12 ∞ ∞ ∞ 2.94
LF 2.64 3.05 3.53 1.32 3.71 ∞ 0.95
LD 6.31 4.85 7.52 1.21 4.28 0.46 0.24
WBC 5.92 7.07 8.24 3.21 5.58 2.38 1.28
BT 5.81 5.73 0.47 ∞ ∞ ∞ 0.27
PID 13.40 14.19 6.20 ∞ 3.72 ∞ 6.19
SV 5.18 3.25 4.46 1.2 3.05 ∞ 0.00
BN 4.25 5.59 4.64 1.89 2.51 ∞ 0.77
YT 13.78 12.98 ∞ ∞ 4.87 ∞ 2.44
IS 3.12 2.48 5.19 2.11 ∞ ∞ 1.53
WQ 32.64 47.09 15.37 ∞ 7.98 ∞ 13.26
PBC 13.15 14.18 10.21 ∞ 4.77 ∞ 0.44
MGT 62.06 128.61 100.09 72.92 30.67 ∞ 18.89
Fig. 8.28 shows the total score of all techniques on 15 numerical data sets based on Tree
Index. In this scoring system, the technique with the best clustering result gets 7 points and the
technique with the worst result gets 1 point, for each data set. Fig. 8.28 shows the total scores
of a technique over all data sets. The bar graph shows that HeMI++ achieves a higher score than
all other techniques.
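The scoring can be reproduced with a simple rank-based tally. Below is a sketch, assuming per-data-set Tree Index values in a nested dictionary (lower is better; ties, including multiple ∞ values, are left to the sort order in this simplified version):

```python
def total_scores(results):
    """results: {data_set: {technique: tree_index}} with lower Tree
    Index being better. Per data set, the best technique gets k points
    (k = number of techniques) and the worst gets 1."""
    scores = {}
    for per_dataset in results.values():
        ranked = sorted(per_dataset, key=lambda t: per_dataset[t])
        k = len(ranked)
        for points, tech in zip(range(k, 0, -1), ranked):
            scores[tech] = scores.get(tech, 0) + points
    return scores

example = {"GI": {"HeMI++": 0.31, "AGCUK": 0.92, "K-means": 2.11}}
print(total_scores(example))  # {'HeMI++': 3, 'AGCUK': 2, 'K-means': 1}
```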
We compare HeMI++ with GenClust and HeMI on 6 categorical data sets (data sets that have
only categorical attributes or both the categorical and numerical attributes). For each data set we
run each technique 20 times (except on the CKRK data set) and present the average clustering result. We run each technique on the CKRK data set 5 times and present the average result. Table 8.8
shows that HeMI++ achieves better results in 5 out of 6 data sets than GenClust. HeMI++
performs better than HeMI in 6 out of 6 categorical data sets.
Table 8.8: Clustering results of HeMI ++ and other techniques on 6 categorical data sets based on Tree Index
Tree Index (lower the better)
Data set GenClust HeMI HeMI++
Zoo 0.87 0.86 0.08
HT 1.94 0.59 0.46
STH 1.75 ∞ 1.38
CA ∞ ∞ 1.57
CMC 0.78 ∞ 2.51
CKRK 55.79 1.76 0.64
Fig. 8.28: Scores of the techniques on 15 numerical data sets based on Tree Index
8.3.9 An Analysis of the Clustering Quality of HeMI++ on Different Data Sets
In the preceding sections, we present the detailed experimental results of HeMI++ on 21 real-life data sets and compare it with five other techniques. We now reanalyze the findings in order to investigate some of the factors that may have influenced the clustering quality of HeMI++. The principal elements are as follows:
Number of records (𝑛) in a data set;
Number of attributes (𝑚) in a data set; and
Types of the majority of the attributes in a data set.
We divide the data sets into six groups as follows:
Group A: This group contains the data sets having a number of records fewer than 5000.
From the 21 data sets (see Table 8.3), 18 data sets fall within this group.
Group B: We consider the data sets having a number of records 5000 or more in this
group. Therefore, from our analysis the PBC, MGT and CKRK data sets of Table 8.3
are part of this group.
Group C: This group contains the data set having a number of attributes fewer than ten.
Therefore, in our analysis nine data sets of Table 8.3 belong to this group.
Group D: We consider the data sets having a number of attributes 10 or more in this
group. From the 21 data sets (see Table 8.3), 12 data sets are in this group.
Group E: This group contains the data sets having a higher number of numerical
attributes than the categorical attributes. Therefore, in our analysis 15 data sets of Table
8.3 are in this group.
Group F: In this group, we consider the data sets having a higher number of categorical
attributes than numerical attributes. Five data sets (namely Zoo, Hepatitis, Statlog
Heart, Credit Approval and CMC) of Table 8.3 fall within this group.
We analyze the performance of HeMI++ by comparing its performance with that of five
existing techniques, namely K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii,
2007), AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), and GenClust (Rahman
& Islam, 2014) in terms of the number of wins, losses, and draws. We define these as follows.
Win: This means that HeMI++ performs better than an existing technique.
Loss: The term “Loss” means that HeMI++ does not perform better than an existing
technique.
Draw: This indicates that the performance of HeMI++ is the same as the performance
of an existing technique.
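Counting wins, losses and draws from the Tree Index values is straightforward; a minimal sketch under our own naming:

```python
def win_loss_draw(hemi_pp, other):
    """hemi_pp, other: {data_set: tree_index} (lower is better).
    Returns (wins, losses, draws) of HeMI++ against the other technique,
    skipping data sets the other technique cannot handle."""
    wins = losses = draws = 0
    for ds, value in hemi_pp.items():
        if ds not in other:
            continue
        if value < other[ds]:
            wins += 1
        elif value > other[ds]:
            losses += 1
        else:
            draws += 1
    return wins, losses, draws
```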
We now analyze the performance of HeMI++ based on the above factors and groups in the
following subsections.
8.3.9.1 Performance of HeMI++ compared to Existing Techniques, based on Number of
Records
In this subsection we analyze the results of HeMI++ in order to explore the influence of the
number of records on the performance of HeMI++. In Table 8.9 we present the number of wins,
losses, and draws for HeMI++ compared to AGCUK, GAGR, GenClust, K-means, and K-
means++ based on the cluster evaluation technique Tree Index (see Section 8.2.5 of Chapter 8)
for Group A and Group B. In the table, the percentage of wins, losses, and draws for HeMI++
compared to the existing techniques are also presented. As there are five data sets with
categorical attributes in Group A (see Table 8.3), AGCUK, GAGR, K-means and K-means++
cannot handle these data sets, and therefore, for Group A, HeMI++ is tested against these
techniques on 13 (instead of 18) data sets. However, HeMI++ and GenClust are tested on all 18
data sets.
From Table 8.9, it is evident that for Group A data sets, HeMI++ wins more than 80% of
cases (see Column 8 of the table) against GenClust. HeMI++ wins 100% of cases compared to
AGCUK, GAGR, K-means, and K-means++ for Group A data sets. HeMI++ loses in 18.75% of cases against GenClust for the data sets of Group A. However, HeMI++ achieves a 100% win record (see Column 11 of Table 8.9) against all the existing techniques, including GenClust, for large data sets with 5000 or more records (i.e. Group B).
Table 8.9: Performance of HeMI++ compared to existing techniques, based on number of records
Against techniques | Scores in numbers | Scores in percentage (%)
Group A Group B Group A Group B
No. of data sets = 18 No. of data sets = 3 No. of data sets = 18 No. of data sets = 3
Win Loss Draw Win Loss Draw Win Loss Draw Win Loss Draw
AGCUK 13 0 0 3 0 0 100.00 0.00 0.00 100.00 0.00 0.00
GAGR 13 0 0 3 0 0 100.00 0.00 0.00 100.00 0.00 0.00
GenClust 16 2 0 3 0 0 81.25 18.75 0.00 100.00 0.00 0.00
K-means 13 0 0 3 0 0 100.00 0.00 0.00 100.00 0.00 0.00
K-Means++ 13 0 0 3 0 0 100.00 0.00 0.00 100.00 0.00 0.00
8.3.9.2 Performance of HeMI++ compared to Existing Techniques, based on Number of
Attributes
The results comparing HeMI++ to the existing techniques based on the number of attributes will
now be analyzed. Table 8.10 shows the number of wins, losses, and draws for HeMI++ against
AGCUK, GAGR, GenClust, K-means, and K-means++ based on Tree Index (see Section 8.2.5
in Chapter 8) for Group C and Group D. As there are two data sets with categorical attributes in
Group C (see Table 8.3), AGCUK, GAGR, K-means, and K-means++ cannot process these data
sets, and therefore, for Group C, HeMI++ is tested against AGCUK, GAGR, K-means, and K-
means++ on 8 (instead of 9) data sets. However, HeMI++ and GenClust are tested on all nine
data sets. Similarly, in Group D there are four data sets with categorical attributes. Therefore,
for Group D, HeMI++ is tested against AGCUK, GAGR, K-means, and K-means++ on eight
(instead of twelve) data sets. HeMI++ and GenClust are tested on all twelve data sets.
Table 8.10 shows the results of HeMI++ alongside the existing techniques for Group C and
Group D data sets. HeMI++ achieves a 100% win rate compared to AGCUK, GAGR, K-means,
and K-means++ for the data sets of Group C and Group D. For Group C, HeMI++ suffers a loss
of 28.57% of cases against GenClust. However, HeMI++ performs better (i.e. only a 9.09% loss) against GenClust for the data sets having ten or more attributes (i.e. Group D), compared to the data sets having fewer than ten attributes (i.e. Group C), based on Tree Index.
Table 8.10: Performance of HeMI++ compared to existing techniques, based on number of attributes
Against techniques | Scores in numbers | Scores in percentage (%)
Group C Group D Group C Group D
No. of data sets = 9 No. of data sets = 12 No. of data sets = 9 No. of data sets = 12
Win Loss Draw Win Loss Draw Win Loss Draw Win Loss Draw
AGCUK 7 0 0 8 0 0 100.00 0.00 0.00 100.00 0.00 0.00
GAGR 7 0 0 8 0 0 100.00 0.00 0.00 100.00 0.00 0.00
GenClust 7 2 0 11 1 0 71.43 28.57 0.00 90.91 9.09 0.00
K-means 7 0 0 8 0 0 100.00 0.00 0.00 100.00 0.00 0.00
K-Means++ 7 0 0 8 0 0 100.00 0.00 0.00 100.00 0.00 0.00
8.3.9.3 Performance of HeMI++ compared to Existing Techniques, based on Type of the
Majority of Attributes
We now analyze the results of HeMI++ over AGCUK, GAGR, GenClust, K-means and K-
means++ based on the majority attribute types. Table 8.11 presents the performance of HeMI++ over the existing techniques based on Tree Index. Note that Group F contains 5 data sets, all of which have categorical attributes. Since AGCUK, GAGR, K-means and K-means++ cannot handle data sets with categorical attributes, for Group F HeMI++ is not tested against these techniques. However, HeMI++ is tested against all the techniques on all 15 data sets in Group E.
From Table 8.11 it appears that HeMI++ achieves a 100% win rate against AGCUK, GAGR, K-means, and K-means++ for the data sets that have a higher number of numerical attributes than categorical attributes (i.e. Group E). HeMI++ also performs better against GenClust for the data sets with a higher number of numerical attributes (i.e. Group E, a 15.38% loss), compared to the data sets with a higher number of categorical attributes (i.e. Group F, a 25.00% loss), based on Tree Index.
Table 8.11: Performance of HeMI++ compared to existing techniques, based on type of the majority of attributes
Against techniques | Scores in numbers | Scores in percentage (%)
Group E Group F Group E Group F
No. of data sets = 15 No. of data sets = 5 No. of data sets = 15 No. of data sets = 5
Win Loss Draw Win Loss Draw Win Loss Draw Win Loss Draw
AGCUK 15 0 0 0 0 0 100.00 0.00 0.00 0.00 0.00 0.00
GAGR 15 0 0 0 0 0 100.00 0.00 0.00 0.00 0.00 0.00
GenClust 13 2 0 4 1 0 84.62 15.38 0.00 75.00 25.00 0.00
K-means 15 0 0 0 0 0 100.00 0.00 0.00 0.00 0.00 0.00
K-Means++ 15 0 0 0 0 0 100.00 0.00 0.00 0.00 0.00 0.00
8.3.10 Knowledge from the Brain Data
The CHB-MIT (chb01-03) data set has altogether 8280 records. According to the data provider
the numbers of seizure and non-seizure records are as follows.
Seizure records: 23×5 = 115; non-seizure records: 8165.
However, from our previous discussion we know that actually there are 11 × 5 = 55 seizure
records, as this was a localized seizure and not all channels experience a seizure signal even
during the seizure time. HeMI++ groups the records in two clusters. Cluster 1 has mostly seizure
records and Cluster 2 has mostly non-seizure records. Hence, according to HeMI++ the numbers
of seizure and non-seizure records are as follows.
Seizure records (Cluster 1): 23; non-seizure records (Cluster 2): 8257.
We now make two variants of the Brain data set: 𝐷1 and 𝐷2. In 𝐷1 all 8280 records are
labelled as either seizure or non-seizure according to the HeMI++ clustering result. In 𝐷2 all
records are labelled as either seizure or non-seizure according to the information supplied by
the data provider.
We build a number of decision trees from 𝐷1 using two existing decision forest algorithms,
SysFor (Islam & Giggins, 2011) and Forest CERN (Adnan & Islam, 2016) as shown in Fig. 8.29
(see Fig. 8.29 (a) to Fig. 8.29 (e)). We also build a number of decision trees from 𝐷2 by using
Forest CERN. In Fig. 8.29 (f) we show one representative tree from the trees built from 𝐷2.
We can see from these figures that the trees built from 𝐷1 are clearly more accurate than the
tree built from 𝐷2. For example, the tree in Fig. 8.29 (a) has only 7 misclassified records compared to 83 in the tree in Fig. 8.29 (f). Moreover, the trees built from 𝐷1 are also shallower
than the tree from 𝐷2. This clearly indicates that the labeling in 𝐷1 is better than the labeling in
𝐷2, which in turn means the clustering results obtained by HeMI++ are more sensible than the
grouping of the records based on the knowledge/observation of the seizure time without
considering the localized seizure impact. We thus realize that HeMI++ correctly identifies the
seizure records and non-seizure records.
We now study the trees in Fig. 8.29 (a) to Fig. 8.29 (e) closely to discover knowledge. We
find that if the standard deviation of the signal is high (higher than 125 or so) then it generally represents a seizure signal, and otherwise a non-seizure signal. This matches our current understanding, suggesting that an erratic or abrupt signal (see Fig. 8.15 and Fig. 8.17) represents seizure.
We also find that if the max amplitude of the signal is high (say higher than 400 uV or so) then it represents a seizure signal, which again matches our existing knowledge. Although the decision trees discover knowledge that is already known or understood, they play a useful role in verifying it. Moreover, they also give a concrete figure (such as Std > 125) for seizure detection/prediction, as this figure may vary from patient to patient. Thus, we can clearly see the value of knowledge discovery from our clustering results. We aim to carry out a thorough knowledge discovery from the records in the Brain data set.
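Trees like those in Fig. 8.29 can be grown, in spirit, with any off-the-shelf decision tree learner. Below is a minimal scikit-learn sketch on synthetic stand-in data (the real experiments use SysFor and Forest CERN on the labelled EEG records); the feature names and the 125 threshold are only the kind of pattern we expect the tree to recover, not a guaranteed output:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the labelled EEG records: features are
# [max, min, std] of an epoch; label 1 = seizure, 0 = non-seizure.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [100.0, 100.0, 40.0] + [0.0, 0.0, 80.0]
y = (X[:, 2] > 125).astype(int)   # stand-in labels, not real clusters

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=2)
tree.fit(X, y)
print(export_text(tree, feature_names=["max", "min", "std"]))
# on such data the tree recovers a rule of the form: std > ~125 -> 1
```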
(a) Decision Tree on the CHB-MIT Scalp EEG (chb01-03) data
set (labelled by clustering result of HeMI++)
(b) Decision Tree on the CHB-MIT Scalp EEG (chb01-03) data set
(labelled by clustering result of HeMI++)
(c) Decision Tree on the CHB-MIT Scalp EEG (chb01-03) data
set (labelled by clustering result of HeMI++)
(d) Decision Tree on the CHB-MIT Scalp EEG (chb01-03) data set (labelled by clustering result of HeMI++)
(e) Decision Tree on the CHB-MIT Scalp EEG (chb01-03) data
set (labelled by clustering result of HeMI++)
(f) Decision Tree on the original Chb_MIT_01_03 data set
Fig. 8.29: Decision trees on the CHB-MIT Scalp EEG (chb01-03) data set
8.3.11 Complexity Analysis
In this section, we present the complexity of HeMI++ and compare it with the complexity of
AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam,
2014), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and HeMI (see
Chapter 6). The main factors related to the complexity of HeMI++ are as follows: in a data set
𝐷 the number of records is 𝑛, the number of attributes is 𝑚, the number of genes in a chromosome is 𝑘, the number of chromosomes in a population is 𝑧, the number of iterations in K-means is 𝑁′ and the number
of iterations in HeMI++ is 𝑁. We realize that out of these factors 𝑛, 𝑚, 𝑘 and 𝑧 can be much
bigger than others. Hence, we consider 𝑛, 𝑚, 𝑘 and 𝑧 to compute the complexity.
In the initial population, HeMI++ uses a deterministic phase and a random phase to generate
a number of chromosomes. In the deterministic phase, it uses K-means to generate a number of
chromosomes, the complexity of which is 𝑂(𝑛𝑚𝑘𝑧). In the random phase, HeMI++ generates
initial chromosomes randomly. The complexity for this phase is 𝑂(𝑘𝑧). HeMI++ uses DB Index
as its fitness function to compute the fitness of a chromosome. The complexity of DB Index
is 𝑂(𝑛𝑚𝑘𝑧). HeMI++ selects the necessary properties for a sensible solution which can be done
in 𝑂(𝑛𝑚𝑘𝑧) complexity. The noising selection requires pairwise comparison, for which we
need 𝑂(𝑧) complexity.
The crossover operation uses roulette wheel for which the complexity is 𝑂(𝑧2). For twin
removal, it requires 𝑂(𝑚𝑘²𝑧) complexity. In the mutation operation, HeMI++ uses division, absorption and random change steps. The complexities for the division, absorption and random change are 𝑂(𝑛𝑚𝑘𝑧), 𝑂(𝑚𝑘𝑧) and 𝑂(𝑧), respectively. The complexities of the steps in the health improvement operation are 𝑂(𝑛𝑚𝑘𝑧), 𝑂(𝑧) and 𝑂(𝑧), respectively.
HeMI++ performs a cleansing operation with the complexity of 𝑂(𝑛𝑚𝑘𝑧). Similarly, the
cloning operation requires 𝑂(𝑧) complexity. The complexity of elitist operation is 𝑂(𝑧) once
the fitness is computed with the cost of 𝑂(𝑛𝑚𝑘𝑧). The neighbor information sharing requires
𝑂(𝑧) complexity. Similarly, the complexity of global best selection is 𝑂(𝑛𝑚𝑘𝑧). Hence, the
overall complexity of HeMI++ is 𝑂(𝑛𝑚𝑘²𝑧²). With respect to the two most significant factors (𝑛 and 𝑚), it has a linear complexity 𝑂(𝑛𝑚). The complexities of AGCUK, K-means, GAGR, GenClust and HeMI are 𝑂(𝑛𝑚) (Y. Liu et al., 2011), 𝑂(𝑛𝑚) (Lloyd, 1982), 𝑂(𝑛𝑚) (D.-X. Chang et al., 2009), 𝑂(𝑛𝑚² + 𝑛²𝑚) (Rahman & Islam, 2014) and 𝑂(𝑛𝑚) (see Chapter 6),
respectively.
8.3.12 Statistical Friedman Test
We now carry out the statistical Friedman test (Demšar, 2006; Friedman, 1940) in order to evaluate the superiority of the Tree Index results obtained by HeMI++ over the results obtained by the existing techniques and our proposed technique HeMI. We rank the algorithms on each data set according to the rank-ordering used in the Friedman test (Demšar, 2006; Friedman, 1940). Among the 7 competing algorithms, the one providing the best Tree Index value is assigned Rank 1, the second best Rank 2, and so on (hence, the lower the average rank the better the result). Ties are resolved by assigning the average of the sequential ranks they would have received. The rank of each competing algorithm for each data set is presented within parentheses. The bottom row of Table 8.12 presents (within parentheses) the average rank (in short, Rank) of each competing algorithm over all data sets considered.
From Table 8.12, we can see that K-means provides the best Tree Index value for 0 data sets (Rank: 4.26), K-means++ for 0 data sets (Rank: 4.40), GAGR for 0 data sets (Rank: 4.60), AGCUK for 0 data sets (Rank: 4.20), GenClust for 2 data sets (Rank: 3.50) and HeMI for 0 data sets (Rank: 5.90), whereas HeMI++ achieves the best Tree Index value in 13 out of 15 data sets (Rank: 1.13). We now conduct a statistical significance test (Demšar, 2006) in order to assess the superiority of HeMI++ over the existing techniques.
Table 8.12: Tree Index rank of the techniques based on the Friedman test (Demšar, 2006; Friedman, 1940)
Data set Tree Index (lower the better)
K-means K-means++ GAGR AGCUK GenClust HeMI HeMI++
GI 2.11 (4) 1.87 (3) 3.47 (6) 0.92 (2) 2.47 (5) ∞ (7) 0.31 (1)
VC 4.94 (4) 4.11 (3) 5.37 (5) ∞ (6.5) 3.55 (2) ∞ (6.5) 1.53 (1)
EC ∞ (5) ∞ (5) 6.12 (2) ∞ (5) ∞ (5) ∞ (5) 2.94 (1)
LF 2.64 (3) 3.05 (4) 3.53 (5) 1.32 (2) 3.71 (6) ∞ (7) 0.95 (1)
LD 6.31 (6) 4.85 (5) 7.52 (7) 1.21 (3) 4.28 (4) 0.46 (2) 0.24 (1)
WBC 5.92 (5) 7.07 (6) 8.24 (7) 3.21 (3) 5.58 (4) 2.38 (2) 1.28 (1)
BT 5.81 (4) 5.73 (3) 0.47 (2) ∞ (6) ∞ (6) ∞ (6) 0.27 (1)
PID 13.40 (4) 14.19 (5) 6.20 (3) ∞ (6.5) 3.72 (1) ∞ (6.5) 6.19 (2)
SV 5.18 (6) 3.25 (4) 4.46 (5) 1.2 (2) 3.05 (3) ∞ (7) 0.00 (1)
BN 4.25 (4) 5.59 (6) 4.64 (5) 1.89 (2) 2.51 (3) ∞ (7) 0.77 (1)
YT 13.78 (4) 12.98 (3) ∞ (6) ∞ (6) 4.87 (2) ∞ (6) 2.44 (1)
IS 3.12 (4) 2.48 (3) 5.19 (5) 2.11 (2) ∞ (6.5) ∞ (6.5) 1.53 (1)
WQ 32.64 (4) 47.09 (5) 15.37 (3) ∞ (6.5) 7.98 (1) ∞ (6.5) 13.26 (2)
PBC 13.15 (4) 14.18 (5) 10.21 (3) ∞ (6.5) 4.77 (2) ∞ (6.5) 0.44 (1)
MGT 62.06 (3) 128.61 (6) 100.09 (5) 72.92 (4) 30.67 (2) ∞ (7) 18.89 (1)
Average rank (4.26) (4.40) (4.60) (4.20) (3.50) (5.90) (1.13)
The Friedman (Friedman, 1940) test is a non-parametric test used to compare multiple
algorithms on multiple data sets. Using the Friedman test the null hypothesis is that all
algorithms are equivalent. If the null hypothesis is rejected, we can proceed with a post-hoc test
such as the Bonferroni-Dunn test (Demšar, 2006; O. J. Dunn, 1961). The Friedman statistic is distributed according to $\chi_F^2$ with $(k-1)$ degrees of freedom when $k$ (the number of competing algorithms) and $N$ (the number of data sets) are big enough (as a rule of thumb, $k > 5$ and $N > 10$) (Demšar, 2006). Iman and Davenport (Iman & Davenport, 1980) demonstrated that Friedman's $\chi_F^2$ is undesirably conservative and derived a better statistic:

$$F_F = \frac{(N-1)\chi_F^2}{N(k-1)-\chi_F^2}$$

With 7 algorithms and 15 data sets, the value of $F_F$ is calculated to be 11.64, which far exceeds the critical value of the F distribution at α = 0.05, so the null hypothesis is rejected. The critical difference (CD) of the post-hoc Bonferroni-Dunn test at α = 0.05 is calculated to be 2.13.
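The statistics quoted above can be re-derived from the average ranks in Table 8.12. The following Python sketch uses the standard Friedman and Iman-Davenport formulas (Demšar, 2006); since the table ranks are rounded, it only approximately reproduces the quoted values:

```python
import math

N, k = 15, 7                          # data sets, algorithms
avg_ranks = [4.26, 4.40, 4.60, 4.20, 3.50, 5.90, 1.13]  # Table 8.12

chi2_F = (12 * N / (k * (k + 1))) * (
    sum(r ** 2 for r in avg_ranks) - k * (k + 1) ** 2 / 4)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)

# Bonferroni-Dunn critical difference at alpha = 0.05;
# q_0.05 = 2.690 for 7 classifiers (Demsar, 2006)
CD = 2.690 * math.sqrt(k * (k + 1) / (6 * N))
print(round(chi2_F, 2), round(F_F, 2), round(CD, 2))  # ~40.6 ~11.5 ~2.12
```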
We can see that the critical difference remains lower than the pairwise differences of average ranks between the control clustering algorithm (HeMI++) and all other contending algorithms (K-means vs HeMI++: 3.13, K-means++ vs HeMI++: 3.26, GAGR vs HeMI++: 3.46, AGCUK vs HeMI++: 3.06, GenClust vs HeMI++: 2.36 and HeMI vs HeMI++: 4.76), indicating that HeMI++ performs significantly better (in terms of Tree Index) than all the other algorithms on the 15 real-life data sets.
8.4 Summary
We realize that some recent clustering techniques do not produce sensible clustering solutions
although their solutions achieve high fitness values based on existing evaluation criteria. These
solutions are therefore unlikely to be useful in knowledge discovery from underlying data sets.
We apply some existing techniques on a brain data set and realize that their clustering solutions
either have way too many clusters or only two clusters where one cluster contains only one
record and all other records are stored in the other cluster.
Hence, in this chapter we propose a new clustering technique HeMI++ that first learns
important properties of sensible clustering solutions and then applies the information in
producing its clustering solutions. When we apply HeMI++ on a brain data set we find that the
proposed clustering technique overcomes the existing problem and produces the right number
of clusters with the right records in the clusters.
During the development of the proposed clustering technique we realize that the existing
cluster evaluation techniques are biased towards either high number of clusters or very low
number of clusters. Hence, in this chapter we also propose a novel cluster evaluation technique
called Tree Index.
Tree Index first labels the records based on the clustering results that it wants to evaluate. It
then builds a decision tree from the data set with the labels. The basic idea here is that if the
labeling is good (i.e. sensible) then the produced tree is likely to classify the training records
more accurately and be shallow. Based on this basic concept Tree Index computes an evaluation
value of a clustering solution. Different clustering solutions can be compared based on their
Tree Index values.
In this chapter we graphically visualize two types of clustering solutions: sensible solutions
and non-sensible solutions (either having too many clusters or having one record in one cluster
and all other records in the other cluster) as shown in Fig. 8.2 to Fig. 8.5. While existing
evaluation techniques fail to correctly evaluate the cluster quality, Tree Index scores the sensible solutions better (i.e. lower) than the non-sensible solutions.
We then empirically compare our proposed clustering technique (HeMI++) with five existing
techniques on 21 publicly available data sets in terms of our Tree Index evaluation technique.
We find that HeMI++ achieves the best clustering solutions in 18 out of 21 data sets. Moreover,
we graphically visualize the clustering results of HeMI++ on a brain data set and find the results
to be more sensible than others. Additionally, we discover some useful knowledge from the
clustering results produced by HeMI++ indicating its usefulness in knowledge discovery.
Chapter 9
Discussion
9.1 Introduction
The main goals of this study are to propose a novel clustering technique with the ability to
produce sensible clusters, and to put forward a cluster evaluation technique suitable for
evaluating sensible and non-sensible clusters. A number of clustering techniques have been
outlined in the literature (Arthur & Vassilvitskii, 2007; D.-X. Chang et al., 2009; D. Chang et
al., 2012; He & Tan, 2012; Y. Liu et al., 2011; Lloyd, 1982; Rahman & Islam, 2014). However,
there are limitations to the existing clustering techniques, indicating there is potential for
improvement. Hence, in this study we propose a number of clustering techniques with sequential
quality improvement, developed by addressing the limitations of existing techniques.
Moreover, during the development of the proposed clustering techniques we observe that the
existing cluster evaluation techniques (Agustín-Blas et al., 2012; D L Davies & Bouldin, 1979;
Pang-Ning Tan, Michael Steinbach, 2005; Rahman & Islam, 2014) are biased towards either
high numbers of clusters or very low numbers of clusters. Consequently, in this study we also
propose a novel cluster evaluation technique which is demonstrated to be effective in evaluating
sensible and non-sensible clustering solutions.
In this chapter, we will present an overall discussion and comparison of the proposed
techniques and investigate their performances. The main contributions of the thesis will then be
discussed, followed by complexity analyses of the proposed techniques along with comparison
with existing techniques. Finally, comment is made on future research directions.
The structure of Chapter 9 is as follows: Section 9.2 features a comparison and discussion of
the proposed techniques; Section 9.3 assesses the main contributions of the thesis. Complexity
analyses of the proposed techniques are presented in Section 9.4, and comparison of
complexities is described in Section 9.5; while Section 9.6 summarises the proposed techniques;
and finally, future research directions are proposed in Section 9.7.
9.2 Comparison and Discussion of the Proposed Techniques
9.2.1 DeRanClust
In Chapter 3, we present a GA-based clustering technique called DeRanClust that produces
high-quality chromosomes in the initial population. The use of GA in clustering techniques can
help to avoid the local optima issue of K-means (Agustín-Blas et al., 2012; D.-X. Chang et al.,
2009; D. Chang et al., 2012; He & Tan, 2012; Y. Liu et al., 2011; Peng et al., 2014; Rahman &
Islam, 2014). Typically, a genetic algorithm-based technique does not require any user input for the number of clusters 𝑘.
However, many existing techniques (Y. Liu et al., 2011; Maio et al., 1995; Maulik &
Bandyopadhyay, 2000; Xiao et al., 2010) generate the number of genes of a chromosome
randomly, in population initialization. These techniques may also randomly choose records as
genes, instead of carefully choosing genes of a chromosome. Careful selection of genes can
create an initial population containing high-quality chromosomes, and a high-quality initial population typically increases the likelihood of obtaining a good clustering solution at the
completion of genetic processing (Diaz-Gomez & Hougen, 2007; Goldberg et al., 1991;
Rahman & Islam, 2014).
An existing technique known as GenClust (Rahman & Islam, 2014) finds a high-quality
initial population and thereby obtains good clustering solutions. However, its initial population
selection process is very complex; with a complexity of 𝑂(𝑛2), where 𝑛 is the number of records
in a data set. Moreover, GenClust requires user input in regard to the number of radius values
for the clusters in the initial population selection. It can be very difficult for a user to estimate
the set of radius values (i.e. radii).
Therefore, we propose DeRanClust to enable a high-quality initial population with a low
complexity of 𝑂(𝑛) to be produced. This technique automatically chooses the number of
clusters for the chromosomes in the initial population. Therefore, no user input is required for the
number of clusters 𝑘. The proposed population initialization approach uses chromosomes
obtained both deterministically and randomly. The effectiveness of this method of population
initialization is illustrated in Table 3.2 using five data sets. The table indicates that an existing
technique called AGCUK performs better when it uses our proposed population initialization
approach rather than using traditional population initialization techniques.
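The two-phase initialization can be outlined as follows. This is an illustrative sketch under our own naming, not DeRanClust's exact pseudocode; `kmeans` is assumed to be a user-supplied routine returning k cluster centres, and k is drawn from the 2 to √n range used elsewhere in this thesis.

```python
import random

def init_population(records, pop_size, kmeans):
    """Two-phase initialization: half the chromosomes come from a
    deterministic pass (K-means seeds), half from random record
    selection. `kmeans(records, k)` is a user-supplied routine that
    returns k cluster centres."""
    n = len(records)
    population = []
    for _ in range(pop_size // 2):               # deterministic phase
        k = random.randint(2, max(2, int(n ** 0.5)))
        population.append(kmeans(records, k))
    for _ in range(pop_size - len(population)):  # random phase
        k = random.randint(2, max(2, int(n ** 0.5)))
        population.append(random.sample(records, k))
    return population
```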
Lastly, we compare DeRanClust with AGCUK (Y. Liu et al., 2011), GAGR (D. Chang et al.,
2012), K-Means (Lloyd, 1982), and GenClust (Rahman & Islam, 2014) on five data sets in terms
of two well-known evaluation criteria: Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-
Ning Tan, Michael Steinbach, 2005) and DB Index (D. L. Davies & Bouldin, 1979).
DeRanClust performs significantly better than the other techniques for all data sets and for both
evaluation criteria. Using the proposed DeRanClust technique, we make progress towards
achieving our first research goal.
9.2.2 GMC
Building on this, we contend that the cluster quality of DeRanClust (see Chapter 3) can be improved
by enhancing other genetic operations such as crossover and mutation operations. Consequently,
in Chapter 4, we propose a novel clustering technique titled GMC that uses new selection,
crossover and mutation operations in order to improve cluster quality. Chapter 4 involved
further progress towards achieving our first research goal.
The proposed crossover operation first classifies the chromosomes in a population into one of two groups: a Good group and a Non-good group. It then performs different types of crossover on the two different groups. Intuitively, this is to increase the possibility of obtaining
good-quality offspring chromosomes from a pair of good-quality parent chromosomes. Fig. 4.3
and Fig. 4.4 show the impact of the proposed crossover operation in producing better clustering
results.
Mutation is another important component in GA, in the quest to improve chromosome
quality. Therefore, in a similar manner to crossover, we also introduce a new mutation operation
in GMC with the aim of improving chromosome quality. The proposed mutation operation
reduces the number of changes on the good-quality chromosomes, and increases the number of
changes on the bad-quality chromosomes, in order to improve their overall quality. Fig. 4.5 and
Fig. 4.6 show the effectiveness of the proposed mutation operation. GMC also uses a new
selection operation comparing chromosomes with two generations, whereby a chromosome
with higher fitness value has a greater likelihood of being selected for other genetic operations,
such as crossover and mutation. Fig. 4.7 and Fig. 4.8 show the effectiveness of the proposed
selection operation.
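The idea of applying fewer changes to good chromosomes and more to bad ones can be pictured as a fitness-adaptive mutation rate. Below is a schematic sketch, assuming higher fitness is better; this is our simplification, not GMC's exact operation:

```python
import random

def adaptive_mutation(chromosome, fitness, best_fitness, mutate_gene):
    """Mutate each gene with a probability that shrinks as the
    chromosome's fitness approaches the best fitness seen so far
    (higher fitness assumed to be better)."""
    rate = 1.0 - fitness / best_fitness if best_fitness else 1.0
    return [mutate_gene(g) if random.random() < rate else g
            for g in chromosome]
```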
We evaluate GMC by comparing it with AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang
et al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007), and GenClust
(Rahman & Islam, 2014) on ten natural data sets available from the UCI Machine Learning
Repository (M. Lichman, 2013). GMC achieves significantly better clustering results (according
to the sign test analysis) than all the existing techniques on all ten data sets in terms of two
evaluation criteria (see Fig. 4.11). Fig. 4.1 and Fig. 4.2 demonstrate that GMC clearly achieves
better average results than all other techniques.
9.2.3 GCS
Although genetic operations such as crossover and mutation tend to improve the health/fitness of a chromosome, they can also deteriorate the health of some chromosomes. GCS (see Chapter 5) therefore introduces a new genetic operation known as the health check in order
to ensure the health of chromosomes in a population. The usefulness of the health check
operation is presented in Table 5.2. The work conducted in Chapter 5 allows us to move closer
to achieving research goal 1.
GCS also modifies the process by which a pair of chromosomes is selected for the crossover
operation through two phases in order to increase the potential for obtaining better quality
offspring chromosomes. GMC (as presented in Chapter 4) also uses a new crossover operation
where a chromosome with low fitness value always makes a pair with another low-quality
chromosome. Therefore, GCS introduces a new crossover operation where each chromosome
gets an opportunity to make a pair with the best chromosome. Table 5.3 shows that GCS
achieves better clustering results when it uses the proposed crossover operation rather than the
conventional crossover operation.
The proposed technique also uses a new selection operation in order to increase the quality
of chromosomes in a population. GCS uses the elitist operation after each genetic operation
within a generation, in order to keep track of the best solution obtained thus far. Fig. 5.7 shows
the gradual improvement of the best chromosome over the iterations. In Fig. 5.8, we can see that
all chromosomes in a population improve over the iterations, indicating the usefulness of
components including the health check and selection operation.
We empirically compare GCS with AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et
al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007), and GenClust
(Rahman & Islam, 2014) on 15 natural data sets available from the UCI Machine Learning
Repository (M. Lichman, 2013). GCS achieves significantly better clustering results than all the
existing techniques on all 15 data sets in terms of two evaluation criteria (see Fig. 5.1 to Fig.
5.4). Fig. 5.2 and Fig. 5.4 indicate that GCS clearly achieves better average results than all other
techniques, without any overlapping of the standard deviation.
9.2.4 HeMI
It is evident from the literature (Pourvaziri & Naderi, 2014; Straßburg et al., 2012) that the
population size has a positive impact on clustering quality. That is, a big population size is likely
to contribute towards a good clustering solution. However, a big population size also incurs a high time complexity.
Therefore, in Chapter 6, we propose a novel clustering method titled HeMI that uses a big
population through multiple streams where each stream contains a relatively small number of
chromosomes; this facilitates a low execution time, as the streams are suitable for
parallel processing when necessary. The effectiveness of the use of multiple streams is presented
in Table 6.6. Various genetic operations (such as crossover and mutation) are applied to each
stream in parallel. As a result, HeMI is likely to produce better quality clustering solutions.
Moreover, due to splitting chromosomes into a number of streams and processing the splits
separately, HeMI exhibits a greater ability to explore the solution space compared to the traditional
approach of processing all chromosomes in a single stream.
Note that some existing techniques use parallel genetic algorithms (Kumar et al., 2011; Y.
Y. Liu & Wang, 2015; Moore, 2004; Straßburg et al., 2012) where the total number of
chromosomes is divided into a number of parallel runs. In our technique, however, the total
number of chromosomes is increased. The main goal of the existing techniques is to reduce time
complexity through the parallelization of the genetic algorithms, whereas the main goal of HeMI
is to improve clustering results. The parallelization employed by these existing techniques does
not share information between the parallel streams, whereas HeMI introduces information
sharing across the streams at regular intervals in order to take advantage of the multiple streams
(see Table 6.6, Table 6.7, and Table 6.8). The impact of the number of streams is presented in
Fig. 6.5. HeMI also demonstrates the significance of the use of intervals on information sharing
in Fig. 6.4.
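The multi-stream structure can be pictured as a small loop: each stream evolves independently and, at a regular interval, the globally best chromosome is shared with every stream. Below is a schematic sketch (the function names and the replace-the-worst sharing policy are our assumptions, with higher fitness taken to be better):

```python
def evolve_multi_stream(streams, generations, interval,
                        evolve_one_generation, fitness):
    """streams: list of chromosome lists. Each stream evolves
    independently; every `interval` generations the globally best
    chromosome replaces the weakest member of every stream
    (higher fitness assumed to be better)."""
    for g in range(1, generations + 1):
        streams = [evolve_one_generation(s) for s in streams]
        if g % interval == 0:
            best = max((c for s in streams for c in s), key=fitness)
            for s in streams:
                s.sort(key=fitness)   # weakest chromosome first
                s[0] = best           # information sharing
    return streams
```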
In a similar way to DeRanClust (Chapter 3), GMC (Chapter 4), and GCS (Chapter 5), HeMI
also uses a high-quality initial population built in two phases: a deterministic phase and a random phase. However, in the random phase all these techniques except HeMI generate |P|/2 chromosomes, where |P| (set to 20 in our experiments) is the number of chromosomes in a population. We realize that, in a comparable way to the deterministic phase, good-quality chromosomes can also be produced through the random phase. Therefore, in HeMI we generate the same number of chromosomes (45 chromosomes) in the random phase and then select the top |P|/2 (i.e. 10) chromosomes from the random phase. The effectiveness of the high-quality initial
population selection is presented in Table 6.10.
The presence of healthy chromosomes (i.e. chromosomes with high fitness values) in a
population can increase the possibility of good clustering results. Hence, HeMI replaces the sick
chromosomes (i.e. chromosomes with low fitness) with healthy chromosomes. GCS (as
presented in Chapter 5) also uses a health check operation to select sick chromosomes in a
population, and probabilistically replaces them with healthy chromosomes found in the previous
20 generations. GCS applies the health check operation after 20 generations. However, we
empirically find that the chromosomes and best chromosome in a population improve their
quality over the iterations (see Fig. 6.6, Fig. 6.7 and Fig. 6.8). Hence, GCS’s approach of using
the pool of best chromosomes obtained from the first 20 iterations may not be effective in the
health improvement of later iterations, such as the 40th iteration.
Consequently, HeMI uses a new health check operation whereby some of the healthy
chromosomes are chosen from a pool of healthy chromosomes obtained from the initial
population, while other healthy chromosomes are generated through the crossover operation of
the existing healthy chromosomes of a generation, with the hope that the crossover of two
healthy chromosomes may generate new healthy chromosomes. Table 6.13 demonstrates the
impact of the health improvement operation.
HeMI uses a three-step mutation operation, which applies division and absorption operations in sequence if they improve the quality of the clustering solutions. Additionally, at the end of the division and absorption operations, it also applies a random change to the chromosomes. The usefulness of the proposed mutation operation is presented in Table 6.11 and Table 6.12. HeMI also maintains randomness through the noising selection and crossover operations, in order to explore the solution space.
Finally, we compare HeMI with AGCUK, GAGR, K-means, K-means++, and GenClust on
20 data sets in relation to two well-known evaluation criteria; Silhouette Coefficient and DB
Index. HeMI achieves significantly better clustering results (according to the sign test analysis)
than all existing techniques on all 20 data sets in terms of the two evaluation criteria (see Fig.
6.9 and Fig. 6.10). Fig. 6.2 and Fig. 6.3 demonstrate that HeMI clearly achieves better results
on average than all other techniques based on the two evaluation criteria, without any
overlapping of the standard deviation. In Chapter 6 we achieve research goal 1 (producing
parameter-less clustering techniques with high-quality solutions and low complexity).
9.2.5 CSClust
With the result of HeMI (as presented in Chapter 6), we achieve our first goal of proposing a
parameter-less clustering technique with a high-quality solution and low complexity. However,
in order to achieve our second goal of producing sensible clustering solutions we need to
carefully analyse the results obtained by HeMI and other existing techniques. This analysis is
undertaken in Chapter 7. From the empirical analysis (see Section 7.2 in Chapter 7) we realise
that many existing clustering techniques (AGCUK, GAGR, GenClust, K-means and K-
means++) do not produce sensible clustering solutions, although their solutions achieve high
fitness values based on existing evaluation criteria. These solutions are typically not useful in
knowledge discovery from underlying data sets (see Fig. 7.2, Fig. 7.3, and Fig. 7.4). Therefore,
in Chapter 7, we propose a novel clustering technique known as CSClust that first learns
important properties of sensible clustering solutions and then applies this information in
producing its clustering solutions.
We apply the existing techniques on a brain data set and realize that their clustering solutions
either have far too many clusters or only two clusters; where one cluster contains only one record
and all other records are stored in the other cluster (see Fig. 7.2, Fig. 7.3, and Fig. 7.4). CSClust
overcomes this problem and produces the right number of clusters, with the right records in the
clusters (see Table 7.3). From the brain data set, it captures 40 seizure records in one cluster and
non-seizure records in another cluster (see Fig. 7.5). To evaluate the clustering quality of
CSClust, we relabel the records based on the clustering solutions of CSClust and produce a
number of decision trees to discover logic rules for seizure and non-seizure records. The logic
rules (such as "if Std > 102.44 then Seizure") obtained by the forest (see Fig. 7.13) appear to be
sensible; further confirming the accuracy of clustering results obtained by the proposed
clustering technique.
We compare CSClust with AGCUK, GAGR, GenClust, K-means, and K-means++ using the
brain data set. Table 7.2 demonstrates that CSClust achieves better clustering results than the
existing techniques, based on the two evaluation criteria. The empirical results on the brain data
set also indicate that CSClust produces an appropriate number of clusters (see Column 4 of
Table 7.2). We also evaluate CSClust against existing clustering techniques using ten natural
data sets obtained from the UCI machine learning repository. Fig. 7.14 and Fig. 7.15 show that
CSClust clearly achieves better results than all other techniques, based on the two evaluation
criteria.
9.2.6 HeMI++
In Chapter 8, we propose a new clustering technique and an evaluation technique. In the
proposed clustering technique, we combine our previous technique – known as CSClust – with
HeMI, and also significantly improve the components of CSClust (see Chapter 7) and HeMI
(see Chapter 6). Therefore, we call the proposed technique HeMI++. In Chapter 8, we achieve
our second and third research goals (producing high-quality and sensible clustering solutions,
and a cluster evaluation technique for better evaluating sensible and non-sensible clustering
solutions).
First, we explore the quality of HeMI and several existing clustering techniques. We also
assess the quality of existing evaluation techniques. In Chapter 7, we find that some existing
techniques do not produce sensible clusters. However, in Chapter 8, we carefully assess the
clustering quality of the existing techniques and HeMI through cluster visualization. In order to
assess the quality of the existing clustering techniques and cluster evaluation techniques, we
plot the data set so that we can graphically visualize the clusters (see Fig. 8.1). We know that this
data set has two types of records: seizure and non-seizure. Fig. 8.1 also clearly demonstrates that
there are two clusters of records. We then apply the existing clustering techniques on this data
set and plot their clustering results so that we can graphically visualize the clusters.
Through this, we find that some existing clustering techniques – such as GAGR and GenClust
– do not produce sensible clusters. We also find that our technique HeMI (as presented in
Chapter 6) does not produce sensible clusters. GenClust produces 447 clusters (see Fig. 8.2)
which is not sensible as the actual number of clusters in this data set is supposed to be only two.
As shown in Fig. 8.3, GAGR produces 56 clusters, while HeMI produces two clusters (see Fig.
8.4) where one cluster contains one record, and the other cluster contains all remaining records.
In order to handle such a situation, in HeMI++ we propose a new component named Selection
of Sensible Properties (see Section 8.2.3 of Chapter 8). Through this component, HeMI++ first
learns important properties of sensible clustering solutions and then applies the information in
producing its clustering solutions.
Note that CSClust also learns the important properties of sensible clustering solutions.
However, CSClust does this by using the DB Index on the initial population (see Step 2 in
Section 7.3 in Chapter 7). This approach can be problematic, as the selection can be biased by
the limitations of the DB Index. HeMI++ therefore learns the properties of a sensible clustering
solution through a new approach, not via the DB Index (see Section 8.2.3 of Chapter 8). The
necessary properties of a sensible clustering solution for a data set is learned by HeMI++ from
the initial population, which is generated in the initial population through multiple streams.
The central component of HeMI++ is a cleansing operation (see Section 8.2.3 in Chapter 8),
applied in each generation in order to ensure that all chromosomes in a population have a sensible
solution. It applies the cleansing operation to each chromosome of a population by applying two
conditions: (i) the number of clusters must be within the range of a maximum and minimum
number of clusters, which is learned by HeMI++ from some of the properties of a data set, and
(ii) the minimum number of records in a cluster must be greater than a threshold minimum
number of records, which again is data-driven (i.e. not user defined). HeMI++ uses the initial
population in order to learn the range of a maximum and minimum number of clusters and the
threshold minimum number of records.
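As a minimal sketch, the cleansing check reduces to a simple predicate over the cluster sizes of a chromosome's solution. The threshold values in the example below are hypothetical; in HeMI++ they are learned from the initial population.

def is_sensible(cluster_sizes, min_k, max_k, min_records):
    # A solution is kept only if (i) its number of clusters lies within the
    # learned [min_k, max_k] range and (ii) its smallest cluster holds more
    # than min_records records.
    k = len(cluster_sizes)
    return min_k <= k <= max_k and min(cluster_sizes) > min_records

# For example, a 40-vs-460 split passes for min_k=2, max_k=10, min_records=5,
# whereas a 1-vs-499 split does not.
print(is_sensible([40, 460], 2, 10, 5))  # True
print(is_sensible([1, 499], 2, 10, 5))   # False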
Another interesting idea associated with HeMI++ is the cloning operation (see Section 8.2.3
in Chapter 8) that replaces sick chromosomes in each generation/population. In each population,
the cleansing operation identifies the sick chromosomes, which are then replaced by chromosomes
from a pool of healthy chromosomes found in the initial population. The pool of high-quality chromosomes
created for the initial population is expected to be reasonably healthy, due to the repeated use of
K-means.
During the development of the proposed clustering technique, we realize that the existing
cluster evaluation techniques (see Section 8.2.1) produce inaccurate evaluation results.
Sometimes they produce higher evaluation values for non-sensible clustering solutions and
lower evaluation values for sensible clustering solutions. Sometimes they produce higher
evaluation values both for the sensible and non-sensible clustering solutions, which is not as
useful for measuring clustering quality. Therefore, we propose a new evaluation technique titled
Tree Index (see Section 8.2.5).
Tree Index first labels the records based on the clustering results to be evaluated. It then
builds a decision tree from the data set with the labels. The premise is that if the labeling is
good (i.e. sensible) then the produced tree is likely to classify the training records more
accurately and to be shallow. Using this basic concept, Tree Index computes an evaluation
value of a clustering solution. Different clustering solutions can be compared based on their
Tree Index values.
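The premise can be illustrated with scikit-learn as follows. The exact scoring formula of Tree Index is defined in Section 8.2.5 of Chapter 8; the accuracy-to-depth ratio used in this sketch is only an illustrative stand-in.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_index_sketch(X, cluster_labels):
    # Label the records by their cluster, train a decision tree on those
    # labels, and reward solutions whose tree is both accurate and shallow.
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X, cluster_labels)
    accuracy = tree.score(X, cluster_labels)  # training accuracy
    return accuracy / (1 + tree.get_depth())  # illustrative: higher is better

# Compare two candidate labellings of the same data set.
X = np.random.rand(200, 4)
sensible = (X[:, 0] > 0.5).astype(int)         # separable labelling
arbitrary = np.random.randint(0, 2, size=200)  # noisy labelling
print(tree_index_sketch(X, sensible) > tree_index_sketch(X, arbitrary))  # typically True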
We graphically visualize two types of clustering solutions; sensible solutions and non-
sensible solutions (either having too many clusters or having one record in one cluster and all
other records in the other cluster) as shown in Fig. 8.2, Fig. 8.3, Fig. 8.4 and Fig. 8.5. While
existing evaluation techniques fail to correctly evaluate the cluster quality, Tree Index correctly
scores the sensible solutions higher than the non-sensible ones.
Our proposed clustering technique (HeMI++) is then empirically compared with five existing
techniques using 21 publicly available data sets, in terms of our Tree Index evaluation technique.
We find that HeMI++ achieves the best clustering solutions in 18 out of 21 data sets. Moreover,
we graphically visualize the clustering results of HeMI++ on a brain data set and find the results
to be more sensible than others. Additionally, we discover some useful knowledge from the
clustering results produced by HeMI++, indicating its practicality in knowledge discovery.
Therefore, we empirically demonstrate, through the proposed HeMI++ and Tree Index techniques,
the achievement of our second and third research goals: producing high-quality and sensible
clusters (superior to the other techniques used in this study) without requiring any user input,
and a cluster evaluation technique for better evaluation of sensible and non-sensible clustering
results (more accurate than the existing evaluation techniques used in this study).
9.3 Key Contributions of the Thesis
In this study, we present a number of clustering techniques for producing high-quality and
sensible clustering solutions, and a cluster evaluation technique for better evaluating sensible
and non-sensible clustering results. The main contributions of the thesis are listed as follows.
All proposed clustering techniques produce high-quality clusters with a low complexity
of 𝑂(𝑛);
All proposed clustering techniques do not require any user input;
All proposed clustering techniques avoid local optima while clustering the records;
We propose clustering techniques with the ability to process data sets with categorical
and/or numerical attributes;
We propose clustering techniques that generate appropriate cluster numbers through a
data-driven approach;
We propose clustering techniques that produce sensible clustering solutions, appropriate
for knowledge discovery;
We propose a cluster evaluation technique suitable for evaluating sensible and non-
sensible clustering solutions.
9.4 Complexity Analysis of the Techniques
In this section, we present a detailed complexity analysis of our proposed techniques and some
existing techniques.
9.4.1 Notations for Complexity Analysis
We use the following notations in order to analyze the complexity of the techniques. We consider
a data set 𝐷 with 𝑛 records and 𝑚 attributes; the maximum domain size of a categorical attribute is 𝑑;
the minimum number of records in a cluster is 𝑅; the number of genes in a chromosome is 𝑘; the number
of chromosomes in a population is 𝑃; the number of iterations in K-means is 𝑁′; and the number of
generations is 𝑁.
9.4.2 Complexity of DeRanClust
In Chapter 3, we present a novel clustering technique known as DeRanClust. The step-by-step
detailed complexity analysis of DeRanClust is as follows:
Step 1: Normalization
We apply DeRanClust on the data sets with numerical attributes only. To normalize the values
of a numerical attribute of a data set, we find the minimum and maximum domain values of the
attribute. The complexity of finding the minimum domain value of a numerical attribute is
𝑂(𝑛). Similarly, the complexity of finding the maximum domain value of a numerical attribute is
𝑂(𝑛). Therefore, the overall complexity for normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
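A minimal sketch of this min-max normalization, assuming the data set is held as a numerical matrix, is as follows.

import numpy as np

def normalize(data):
    # One pass finds each attribute's minimum and maximum (O(n) per
    # attribute), so normalizing m attributes into [0, 1] costs O(nm).
    data = np.asarray(data, dtype=float)
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant attributes
    return (data - lo) / span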
Step 2: Population Initialization
DeRanClust produces its initial chromosomes through a deterministic phase and a random
phase. In the deterministic phase, it applies K-means to produce each chromosome. If there are
𝑃 number of chromosomes, then the complexity for deterministic chromosomes is 𝑂(𝑛𝑚𝑘𝑃𝑁´),
where the number of attributes is 𝑚, total number of records is 𝑛, maximum number of genes
of a chromosome is 𝑘 and the total number of iterations in K-means is 𝑁´. The complexity of
𝑃 number of random chromosomes is 𝑂(𝑘𝑃), where 𝑘 is the maximum number of genes in
each chromosome.
The fitness of each chromosome is calculated using the DB Index, which estimates the
distance between all pairs of seeds. If there are 𝑃 chromosomes, then the complexity
of the fitness calculation is 𝑂(𝑛𝑚𝑘𝑃). Once the fitness values of the chromosomes are computed,
we need to sort them in descending order in order to select the 𝑃 best chromosomes. The
complexity of sorting is 𝑂(𝑃2). Therefore, the total complexity of population initialization
is 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2) = 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
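A minimal sketch of the two-phase initialization follows, under the assumptions that a chromosome is simply an array of cluster seeds, that the deterministic phase varies the number of clusters across K-means runs, and that random chromosomes draw their seeds from the records; scikit-learn's KMeans and davies_bouldin_score stand in for the thesis implementations.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def initial_population(data, pop_size, k_max, rng):
    # Deterministic phase: one chromosome per K-means run, varying k.
    candidates = []
    for i in range(pop_size):
        k = 2 + i % (k_max - 1)
        km = KMeans(n_clusters=k, n_init=1, random_state=i).fit(data)
        candidates.append(km.cluster_centers_)
    # Random phase: seeds drawn at random from the records.
    for _ in range(pop_size):
        k = int(rng.integers(2, k_max + 1))
        candidates.append(data[rng.choice(len(data), size=k, replace=False)])
    # Rank all candidates by DB Index (lower is better) and keep the best.
    def db_index(seeds):
        labels = ((data[:, None] - seeds[None]) ** 2).sum(-1).argmin(axis=1)
        return davies_bouldin_score(data, labels) if len(set(labels)) > 1 else np.inf
    candidates.sort(key=db_index)
    return candidates[:pop_size]

pop = initial_population(np.random.rand(300, 4), pop_size=10, k_max=6,
                         rng=np.random.default_rng(0))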
Step 3: Noise-based Selection
In noise-based selection, DeRanClust pairwise compares the fitness of the chromosomes of the
current 𝑖𝑡ℎ generation with the fitness of the chromosomes of the previous (𝑖 − 1)𝑡ℎ generation.
Therefore, for 𝑃 chromosomes the complexity is 𝑂(𝑃).
Step 4: Crossover Operation
For the crossover operation, pairs of chromosomes are selected from 𝑧 chromosomes. The best
chromosome (currently available in the population) is chosen as the first chromosome of the
pair while the second chromosome of the pair is chosen using the roulette approach. In the
roulette wheel technique, we need to calculate the probability of each chromosome in order to
select the second chromosome of the pair. There are 𝑃/2 crossover operations altogether.
Therefore, the overall complexity of the crossover operation is 𝑂(𝑃2).
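The roulette wheel step can be sketched as follows, assuming non-negative fitness values where higher is better (with the DB Index, which is minimized, an inverted fitness would be used in practice).

import random

def roulette_select(population, fitnesses):
    # Pick one chromosome with probability proportional to its fitness.
    # Computing every chromosome's selection probability is O(P); repeating
    # the selection for the P/2 pairs yields the O(P^2) cost quoted above.
    total = sum(fitnesses)
    r = random.uniform(0, total)
    acc = 0.0
    for chrom, fit in zip(population, fitnesses):
        acc += fit
        if acc >= r:
            return chrom
    return population[-1]  # numerical safety net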
Step 5: Twin Removal
The twin removal approach removes/changes the identical genes of a chromosome. If a
chromosome has 𝑘 genes then the complexity of finding the identical genes is 𝑂(𝑚𝑘2).
Therefore, the complexity of all 𝑃 chromosomes is 𝑂(𝑚𝑘2𝑃).
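A minimal sketch of twin removal, where a gene is one cluster seed (a tuple of 𝑚 attribute values) and random_gene is a hypothetical callable supplying a replacement seed; a set-based scan is shown here for brevity, whereas a naive pairwise comparison of the 𝑘 genes gives the 𝑂(𝑚𝑘2) bound quoted above.

def remove_twins(chromosome, random_gene):
    # Replace duplicate genes (identical seeds) within one chromosome.
    seen = set()
    cleaned = []
    for gene in chromosome:
        key = tuple(gene)
        if key in seen:
            cleaned.append(random_gene())  # twin found: replace it
        else:
            seen.add(key)
            cleaned.append(gene)
    return cleaned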
Step 6: Mutation Operation
Division
DeRanClust first calculates the fitness of 𝑃 chromosomes with a complexity of 𝑂(𝑛𝑚𝑘𝑃). It
then splits the sparse cluster using K-means, where the value of 𝑘 is 2. If there are 𝑃
chromosomes, then the complexity of splitting all 𝑃 chromosomes is 𝑂(𝑘𝑃). The overall
complexity of division operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑘𝑃)= 𝑂(𝑛𝑚𝑘𝑃).
Absorption
After the division operation, DeRanClust calculates the fitness of 𝑃 chromosomes with the
complexity of 𝑂(𝑛𝑚𝑘𝑃). It then compares the fitness of all 𝑃 chromosomes (that it obtains after
division operation) with all 𝑃 chromosomes (that it obtains after crossover operation) in order
to select the chromosomes for the absorption operation. The complexity of this is 𝑂(𝑃). In the
absorption operation, it identifies two closest clusters in a chromosome and then merges the two
closest clusters with a complexity of 𝑂(𝑘). The complexity for 𝑃 chromosomes is 𝑂(𝑘𝑃). The
overall complexity of the absorption operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃 + 𝑘𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
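The two mutation steps can be sketched as follows, assuming the seeds of a chromosome are held as a NumPy array and labels assign each record to its nearest seed. How the cluster to split is chosen and how clusters are merged may differ in detail from the thesis; this sketch splits the smallest cluster via 2-means and merges the Euclidean-closest pair of seeds.

import numpy as np
from sklearn.cluster import KMeans

def divide(seeds, data, labels):
    # Division: split the sparsest cluster into two by running 2-means on
    # its records, replacing one seed with two.
    sizes = np.bincount(labels, minlength=len(seeds))
    sparse = int(np.argmin(sizes))
    members = data[labels == sparse]
    if len(members) < 2:
        return seeds
    halves = KMeans(n_clusters=2, n_init=1, random_state=0).fit(members)
    return np.vstack([np.delete(seeds, sparse, axis=0), halves.cluster_centers_])

def absorb(seeds):
    # Absorption: merge the two closest seeds into their midpoint.
    dists = ((seeds[:, None] - seeds[None]) ** 2).sum(-1)
    np.fill_diagonal(dists, np.inf)
    i, j = np.unravel_index(np.argmin(dists), dists.shape)
    merged = (seeds[i] + seeds[j]) / 2.0
    return np.vstack([np.delete(seeds, [i, j], axis=0), merged])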
Step 7: Elitist Operation
After the mutation operation, DeRanClust calculates the fitness of 𝑃 chromosomes again with a
complexity of 𝑂(𝑛𝑚𝑘𝑃). In the elitist operation, it identifies the best and worst chromosome
of a generation. The complexity of this is 𝑂(𝑃) for all 𝑃 chromosomes. The overall complexity
of the elitist operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
If there are 𝑁 iterations, then Step 3 to Step 7 will be repeated 𝑁 times while the
normalization and population initialization will occur only once. Therefore, the total complexity
of the steps is 𝑂(𝑛𝑚 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃 + 𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 +
𝑛𝑚𝑘𝑃))= 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃)). If 𝑛 ≫ 𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑘,
𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of DeRanClust is 𝑂(𝑛𝑚).
9.4.3 Complexity of GMC
In Chapter 4, we present a novel clustering technique known as GMC. The step-by-step detailed
complexity analysis of GMC is as follows:
Step 1: Normalization
In GMC, the complexity of normalizing the values of numerical attributes of a data set is the
same as the complexity of normalizing the attribute values of DeRanClust. Therefore, the
complexity for normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
Step 2: Population Initialization
The complexity of population initialization of GMC is the same as the complexity of population
initialization of DeRanClust. Therefore, the complexity of population initialization is
𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 3: Probabilistic Selection
In the probabilistic selection, GMC first merges the chromosomes of the current 𝑖𝑡ℎ and the
previous (𝑖 − 1)𝑡ℎ generation. The complexity of this is 𝑂(𝑃). It then calculates the fitness of
each chromosome. If there are 𝑃 number of chromosomes, then the complexity of the fitness
calculation is 𝑂(𝑛𝑚𝑘𝑃). It then probabilistically selects a set of chromosomes from the merged
chromosomes based on their fitness value. The complexity of this is 𝑂(𝑃).
Therefore, the overall complexity of the probabilistic selection is 𝑂(𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃) =
𝑂(𝑛𝑚𝑘𝑃).
Step 4: Two Phases of Crossover Operation
Before applying crossover, GMC first divides the chromosomes in a population into two
groups: the good group and the non-good group. The complexity of this is 𝑂(𝑃).
Good group
In the good group, each chromosome makes a pair with every other chromosome. The complexity
of this is 𝑂(𝑃2).
Single-point crossover
For each pair of chromosomes in the good group, GMC applies single-point crossover with a
complexity of 𝑂(𝑃2).
Random crossover
In the random crossover phase, GMC combines a pair of chromosomes (𝑃𝑥 and 𝑃𝑦), and
generates a random number (𝑅𝑚) between 0 and the length of the combined chromosomes (𝑃𝑥 +
𝑃𝑦). If there are 𝑃 number of chromosomes, then the complexity of this is 𝑂(𝑘𝑃). For offspring
one, it then randomly selects 𝑅𝑚 genes from the combined chromosomes and deletes 𝑅𝑚 genes
from (𝑃𝑥 + 𝑃𝑦). The remaining genes ((𝑃𝑥 + 𝑃𝑦) − 𝑅𝑚) in the combined chromosomes are then
selected for offspring two. The complexity of this is 𝑂(𝑘). If there are 𝑃 number of
chromosomes, then complexity is 𝑂(𝑘𝑃).
Once the crossover operation is complete in this group, GMC then calculates the fitness of the
offspring chromosomes with a complexity of 𝑂(𝑛𝑚𝑘𝑃). It then sorts the offspring
chromosomes in descending order based on their fitness values. The complexity of this is
𝑂(𝑃2). It then selects 𝑃/2 offspring chromosomes with a complexity of 𝑂(𝑃). The overall
complexity of the random crossover is 𝑂(𝑘𝑃 + 𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑃) = 𝑂(𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2).
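A minimal sketch of this random crossover follows; the text allows 𝑅𝑚 anywhere between 0 and the combined length, whereas the sketch keeps 𝑅𝑚 strictly between 1 and the pool size minus one so that both offspring are non-empty, which is an assumption.

import random

def random_crossover(parent_x, parent_y, rng=random):
    # Pool the genes of both parents, draw a random count Rm, give Rm
    # randomly chosen genes to offspring one and the rest to offspring two.
    pool = list(parent_x) + list(parent_y)
    rm = rng.randint(1, len(pool) - 1)
    picked = set(rng.sample(range(len(pool)), rm))
    child_one = [g for i, g in enumerate(pool) if i in picked]
    child_two = [g for i, g in enumerate(pool) if i not in picked]
    return child_one, child_two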
Non-good group
In the non-good group, pairs of chromosomes are selected using the roulette wheel. The best
chromosome of this group is selected as the first chromosome of each pair, while the
second chromosome of the pair is selected using the roulette wheel. The complexity of this is 𝑂(𝑃2).
Single-point crossover
The complexity of the single point crossover for the non-good group is the same as the
complexity of the single-point crossover for the good group. Therefore, the complexity of the
single-point crossover of the non-good group is 𝑂(𝑃2).
Random crossover
The complexity of the random crossover for the non-good group is the same as the complexity
of the random crossover of the good group. Therefore, the complexity of the random crossover
of the non-good group is 𝑂(𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2).
The overall complexity of the crossover operation is 𝑂(𝑃 + 𝑃2 + 𝑃2 + 𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2 +
𝑃2 + 𝑃2 + 𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2) = 𝑂(𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 5: Twin Removal
In GMC, the complexity of twin removal is the same as the complexity of twin removal of
DeRanClust. Therefore, the complexity of twin removal is 𝑂(𝑚𝑘2𝑃).
Step 6: Three Steps of Mutation Operation
Before applying mutation, GMC first divides the chromosomes in a population into two
groups: the good group and the non-good group. The complexity of this is 𝑂(𝑃).
Good group
For the good group, GMC applies the division and absorption operations.
Division
The complexity of division operation is the same as the complexity of division operation of
DeRanClust. Therefore, the complexity of division operation is 𝑂(𝑛𝑚𝑘𝑃).
Absorption
The complexity of the absorption operation is the same as the complexity of the absorption
operation of DeRanClust. Therefore, the complexity of absorption operation is 𝑂(𝑛𝑚𝑘𝑃).
Non-good group
For the non-good group, division, absorption, and a random change operation are applied.
Division
The complexity of the division operation is the same as the complexity of the division operation
of DeRanClust. Therefore, the complexity of division operation is 𝑂(𝑛𝑚𝑘𝑃).
Absorption
The complexity of the absorption operation is the same as the complexity of the absorption
operation of DeRanClust. Therefore, the complexity of the absorption operation is 𝑂(𝑛𝑚𝑘𝑃).
Random Change
GMC changes one attribute value (randomly chosen) of a gene of the chromosome. The
complexity of this is 𝑂(𝑃), if there are 𝑃 number of chromosomes.
The overall complexity of mutation operation is 𝑂(𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 +
𝑛𝑚𝑘𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Step 7: Elitist Operation
In GMC, the complexity of the elitist operation is the same as the complexity of the elitist
operation of DeRanClust. Therefore, the complexity of elitist operation is 𝑂(𝑛𝑚𝑘𝑃).
If there are 𝑁 iterations then Steps 3 to 7 will be repeated 𝑁 times while the
normalization and population initialization will occur only once. Therefore, the total complexity
of the steps will be 𝑂(𝑛𝑚 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑛𝑚𝑘𝑃 + 𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2 +
𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃)) = 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑛𝑚𝑘𝑃 + 𝑚𝑘2𝑃 + 𝑃2)). If 𝑛 ≫
𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of GMC
is 𝑂(𝑛𝑚).
9.4.4 Complexity of GCS
In Chapter 5, we present a novel clustering technique called GCS. The step-by-step detailed
complexity analysis of GCS is as follows:
Step 1: Normalization
In GCS, the complexity of normalizing the values of numerical attributes of a data set is the
same as the complexity of normalizing the attributes values of DeRanClust. Therefore, the
complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
Step 2: Population Initialization
The complexity of population initialization of GCS is the same as the complexity of population
initialization of DeRanClust. Therefore, the complexity of population initialization is
𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 3: Two Phases of Selection
Phase 1
In Phase 1, GCS selects the top |𝑃| chromosomes (according to the fitness values) from 2 × |𝑃|
chromosomes of the current population. The complexity of this is 𝑂(𝑃).
Phase 2
In Phase 2, GCS selects |𝑃| chromosomes probabilistically from a set of 3 × |𝑃| chromosomes,
which is made of the remaining bottom |𝑃| chromosomes of the current population and 2 × |𝑃|
chromosomes from the last population of the immediate previous generation. The complexity
of this is 𝑂(𝑃).
The overall complexity of Step 3 is 𝑂(𝑃 + 𝑃) = 𝑂(𝑃).
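The two phases can be sketched as follows, assuming the current population holds 2 × |𝑃| chromosomes, the previous generation contributes its 2 × |𝑃| chromosomes, and higher fitness is better; random.choices merely stands in for the probabilistic selection.

import random

def gcs_selection(current, previous, fitness):
    # Phase 1: keep the top |P| of the 2|P| current chromosomes by fitness.
    p = len(current) // 2
    ranked = sorted(current, key=fitness, reverse=True)
    phase1 = ranked[:p]
    # Phase 2: draw |P| probabilistically from the bottom |P| of the current
    # population plus the 2|P| chromosomes of the previous generation.
    pool = ranked[p:] + list(previous)
    weights = [fitness(c) for c in pool]
    phase2 = random.choices(pool, weights=weights, k=p)
    return phase1 + phase2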
Step 4: Crossover Operation
Phase 1
In Phase 1, GCS selects 2 × |𝑃| − 1 pairs of chromosomes, where in each pair the first
chromosome is always the best chromosome of the population. All other chromosomes are
chosen one by one as the second chromosome of a pair, so each pair has a different second
chromosome. Therefore, the total complexity of Phase 1 is 𝑂(𝑃2).
Phase 2
In Phase 2, pairs of chromosomes are selected from 𝑃 chromosomes. The best chromosome
(currently available in the population) is chosen as the first chromosome of the pair. The second
chromosome of the pair is chosen using the roulette approach. In the roulette wheel technique,
we need to calculate the probability of each chromosome in order to select the second
chromosome of the pair. Therefore, the overall complexity of Phase 2 is 𝑂(𝑃2).
The overall complexity of the crossover operation is 𝑂(𝑃2 + 𝑃2) = 𝑂(𝑃2).
Step 5: Twin Removal
In GCS, the complexity of twin removal is the same as the complexity of twin removal of
DeRanClust. Therefore, the complexity of twin removal is 𝑂(𝑚𝑘2𝑃).
Step 6: Mutation Operation
GCS applies the division and absorption mutation operations.
Division
The complexity of division operation is the same as the complexity of division operation of
DeRanClust. Therefore, the complexity of division operation is 𝑂(𝑛𝑚𝑘𝑃).
Absorption
The complexity of absorption operation is the same as the complexity of absorption operation
of DeRanClust. Therefore, the complexity of absorption operation is 𝑂(𝑛𝑚𝑘𝑃).
The overall complexity of the mutation operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Step 7: Health Check Operation
Phase 1
GCS prepares a set of chromosomes 𝑃, where it stores the best chromosome of each generation
for the first 𝐼 generations. It first calculates the fitness of the 𝑃 chromosomes with a complexity
of 𝑂(𝑛𝑚𝑘𝑃). It then calculates the average fitness 𝐹𝑑 of the 𝑃 chromosomes with a complexity
of 𝑂(𝑃).
Phase 2
GCS calculates the fitness of the chromosomes of the current population. If there are 𝑃
chromosomes in the population then the complexity of this is 𝑂(𝑛𝑚𝑘𝑃). It then compares the
fitness of the chromosomes with 𝐹𝑑 in order to find the sick chromosomes. The complexity of this
is 𝑂(𝑃). GCS then probabilistically selects a chromosome from the 𝑃 chromosomes (i.e. those
prepared in Phase 1) to replace a sick chromosome with a complexity of 𝑂(𝑃).
The overall complexity of the health check operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃 +
𝑃)= 𝑂(𝑛𝑚𝑘𝑃).
Step 8: Elitist Operation
The complexity of the elitist operation of GCS is the same as the complexity of the elitist
operation of DeRanClust. Therefore, the complexity of the elitist operation is 𝑂(𝑛𝑚𝑘𝑃).
If there are 𝑁 iterations then Steps 3 to 8 will be repeated 𝑁 times while the
normalization and population initialization will occur only once. Therefore, the total complexity
of the steps will be 𝑂(𝑛𝑚 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃 + 𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 +
𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃)) = 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃)). If 𝑛 ≫ 𝑃, 𝑚 ≫
𝑃, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of GCS is 𝑂(𝑛𝑚).
9.4.5 Complexity of HeMI
In Chapter 6, we present a novel clustering technique known as HeMI. The step-by-step
detailed complexity analysis of HeMI is as follows:
Step 1: Normalization
Numerical Attributes
In HeMI, the complexity of normalizing the values of numerical attributes of a data set is the
same as the complexity of normalizing the attribute values of DeRanClust. Therefore, the
complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
Categorical Attributes
To calculate the distance between two categorical values, we find their similarity following
the approach of VICUS (H. Giggins & Brankovic, 2012). The similarity between two
categorical attribute values (𝑎 and 𝑏) is calculated as follows (H. Giggins & Brankovic, 2012;
Md Anisur Rahman, 2014):
𝑆′𝑎,𝑏 = (∑𝑐=1..𝑙 √(𝑒𝑎𝑐 × 𝑒𝑏𝑐)) / √(𝑑(𝑎) × 𝑑(𝑏))        Eq. 9.1
where 𝑑(𝑎) is the degree of the attribute value 𝑎 (i.e. the number of other attribute values co-
appearing with the attribute value 𝑎 in the whole data set), 𝑒𝑎𝑐 is the number of edges between
two attribute values 𝑎 and 𝑐 (i.e. number of times the two categorical values 𝑎 and 𝑐 co-appear
in the whole data set) and 𝑙 is the total number of domain values of all attributes, excluding
the values 𝑎 and 𝑏.
Let us consider that the domain size of the largest categorical attribute (i.e. the attribute that
has the largest number of domain values) is 𝑑. Then the complexity of calculating the degrees
of the values of a categorical attribute is 𝑂((𝑚 − 1)𝑛). The complexity of calculating the
degrees of the values of all categorical attributes is 𝑂((𝑚 − 1)𝑛) + 𝑂((𝑚 − 2)𝑛) + … +
𝑂((𝑚 − (𝑚 − 1))𝑛) which is 𝑂(𝑛𝑚2). The complexity of calculating the edges between two
values 𝑎 and 𝑏; ∀ 𝑎, 𝑏 is 𝑂(𝑛𝑚2). If the domain size of an attribute is 𝑑, then the complexity
of calculating the similarity for all value pairs of the attribute is 𝑂(𝑑2). Therefore, the
complexity of calculating the similarity of all value pairs of 𝑚 attributes is 𝑂(𝑚𝑑2). The overall
complexity of normalizing the categorical attributes is 𝑂(𝑛𝑚2 + 𝑚𝑑2).
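A minimal sketch of Eq. 9.1 on a toy data set follows; note that the degree 𝑑(𝑎) is read here as the total number of co-appearances of 𝑎 with other values (rather than the number of distinct co-appearing values), which is an assumption about the definition above.

from collections import Counter
from itertools import combinations
from math import sqrt

def vicus_similarity(records, a, b):
    # Count co-appearances: edges[(x, c)] is how often values x and c appear
    # together in a record; degree[x] is the total co-appearance count of x.
    edges = Counter()
    for rec in records:
        for x, y in combinations(rec, 2):
            edges[(x, y)] += 1
            edges[(y, x)] += 1
    degree = Counter()
    for (x, _), cnt in edges.items():
        degree[x] += cnt
    others = {v for rec in records for v in rec} - {a, b}
    num = sum(sqrt(edges[(a, c)] * edges[(b, c)]) for c in others)
    return num / sqrt(degree[a] * degree[b]) if degree[a] and degree[b] else 0.0

# Toy data set: each record is the tuple of categorical values of one row.
rows = [("red", "small"), ("red", "small"), ("blue", "large"), ("red", "large")]
print(vicus_similarity(rows, "red", "blue"))  # 1 / sqrt(3) ≈ 0.577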
Step 2: Population Initialization
In HeMI, the complexity of population initialization is the same as the complexity of population
initialization of DeRanClust, GCS and GMC. Therefore, the complexity of population
initialization is 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 3: Noise-based Selection
The complexity of noise-based selection of HeMI is the same as the complexity of noise-based
selection of DeRanClust. Therefore, the complexity of noise-based selection is 𝑂(𝑃).
Step 4: Crossover Operation
In HeMI, the complexity of the crossover operation is the same as the complexity of the
crossover operation of DeRanClust. Therefore, the complexity of the crossover operation is
𝑂(𝑃2).
Step 5: Twin Removal
The complexity of twin removal of HeMI is the same as the complexity of twin removal of
DeRanClust. Therefore, the complexity of twin removal is 𝑂(𝑚𝑘2𝑃).
Step 6: Mutation Operation
Division
In HeMI, the complexity of division operation is the same as the complexity of division
operation of DeRanClust. Therefore, the complexity of division operation is 𝑂(𝑛𝑚𝑘𝑃).
Absorption
The complexity of the absorption operation of HeMI is the same as the complexity of absorption
operation of DeRanClust. Therefore, the complexity of the absorption operation is 𝑂(𝑛𝑚𝑘𝑃).
Random Change
After the absorption operation, HeMI calculates the fitness of 𝑃 chromosomes with a
complexity of 𝑂(𝑛𝑚𝑘𝑃). It then compares the fitness of all 𝑃 chromosomes (that it obtains after
the absorption operation) with all 𝑃 chromosomes (that it obtains after the division operation) in
order to select the chromosomes for the random change operation. The complexity of this
is 𝑂(𝑃). HeMI also calculates the mutation probability for all 𝑃 chromosomes with a complexity
of 𝑂(𝑃). If a chromosome is chosen for a random change operation, it then changes one attribute
value (randomly chosen) of a gene of the chromosome. The complexity of this is 𝑂(𝑃) if there
are 𝑃 number of chromosomes. The overall complexity of the random change operation is
𝑂(𝑛𝑚𝑘𝑃 + 𝑃 + 𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
The overall complexity of the mutation operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Step 7: Health Improvement Operation
Phase 1
HeMI first calculates the fitness of 𝑃 chromosomes with a complexity of 𝑂(𝑛𝑚𝑘𝑃). It then
identifies the healthy chromosomes (50% of the population size) with a complexity of 𝑂(𝑃). The
overall complexity of Phase 1 is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Phase 2
HeMI identifies healthy chromosomes (20% of the population size) with a complexity of
𝑂(𝑃). Applying the same approach to that of Component 5 (see Section 6.2.2 of Chapter 6), it
then chooses pairs of chromosomes from these 20% of healthy chromosomes. The complexity
of this is 𝑂(𝑃). The overall complexity of Phase 2 is 𝑂(𝑃 + 𝑃) = 𝑂(𝑃).
Phase 3
HeMI identifies chromosomes (30% of the population size) from the pool of chromosomes
obtained through the deterministic phase of Component 3 (see Section 6.2.2 of Chapter 6) with
a complexity of 𝑂(𝑃). For each of these chromosomes HeMI then randomly changes an attribute
value of a gene within its original domain. The complexity of this is 𝑂(𝑃). The overall
complexity of Phase 3 is 𝑂(𝑃 + 𝑃)= 𝑂(𝑃).
The overall complexity of the health improvement operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃 +
𝑃)= 𝑂(𝑛𝑚𝑘𝑃).
Step 8: Elitist Operation
The complexity of the elitist operation of HeMI is the same as the complexity of the elitist
operation of DeRanClust. Therefore, the complexity of the elitist operation is 𝑂(𝑛𝑚𝑘𝑃).
Step 9: Neighbor Information Sharing
For the information exchange among neighboring streams, HeMI finds the best and worst
chromosome of each stream. The complexity of this is 𝑂(𝑃).
Step 10: Global Best Selection
HeMI compares all the best chromosomes of all streams and then selects the best of the best
chromosomes as the final clustering solution. The complexity of this is 𝑂(𝑃).
If there are 𝑁 iterations then Steps 3 to 9 will be repeated 𝑁 times while the
normalization, population initialization and global best selection will occur only once.
Therefore, the total complexity of the steps will be 𝑂(𝑛𝑚2 + 𝑚𝑑2 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 +
𝑃2 + 𝑃 + 𝑁(𝑃 + 𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃)) = 𝑂(𝑛𝑚2 + 𝑚𝑑2 +
𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃)). If 𝑛 ≫ 𝑑, 𝑚 ≫ 𝑑, 𝑛 ≫ 𝑃, 𝑚 ≫
𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of HeMI is 𝑂(𝑛𝑚).
9.4.6 Complexity of CSClust
In Chapter 7, we present a novel clustering technique titled CSClust. The step-by-step detailed
complexity analysis of CSClust is as follows:
Step 1: Normalization
Numerical Attributes
In CSClust, the complexity of normalizing the values of numerical attributes of a data set is the
same as the complexity of normalizing the attribute values of DeRanClust, GCS, GMC, and
HeMI. Therefore, the complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
Categorical Attributes
In CSClust, the complexity of normalizing the values of categorical attributes of a data set is
the same as the complexity of normalizing the categorical attributes values of HeMI. Therefore,
the complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚2 + 𝑚𝑑2).
The overall complexity of normalization is 𝑂(𝑛𝑚 + 𝑛𝑚2 + 𝑚𝑑2) = 𝑂(𝑛𝑚2 + 𝑚𝑑2).
Step 2: Population Initialization
In CSClust, the complexity of population initialization is the same as the complexity of
population initialization of HeMI. Therefore, the complexity of population initialization is
𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 3: Selection of Sensible Properties
CSClust selects the necessary properties of a sensible clustering solution: the minimum (𝑀𝑛)
and maximum (𝑀𝑥) number of clusters, and the minimum number of records (𝑀𝑟) in a cluster. To
find the minimum number of records in a cluster, CSClust calculates the distance between genes
and records to form clusters. The complexity of this is 𝑂(𝑛𝑚𝑘), where 𝑘 is the maximum
number of genes in each chromosome. If there are 𝑃 number of chromosomes, then the
complexity of selecting a minimum number of records in a cluster is 𝑂(𝑛𝑚𝑘𝑃).
Once the clusters are formed for each chromosome, it then finds the minimum and maximum
number of clusters from 𝑃 chromosomes. The complexity of this is 𝑂(𝑃). Therefore, the total
complexity of selecting sensible properties is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Step 4: Crossover Operation
The complexity of the crossover operation of CSClust is the same as the complexity of the
crossover operation of DeRanClust and HeMI. Therefore, the complexity of Step 4 is 𝑂(𝑃2).
Step 5: Mutation Operation
CSClust first calculates the fitness of 𝑃 chromosomes with a complexity of 𝑂(𝑛𝑚𝑘𝑃). It then
finds the maximum and minimum fitness of the chromosomes for calculating the mutation
probability. The complexity of this is 𝑂(𝑃). If a chromosome is chosen for mutation then
CSClust changes an attribute value (randomly chosen) of each and every gene of the
chromosome. The complexity of this is 𝑂(𝑘), if the number of genes in a chromosome is 𝑘.
Therefore, the overall complexity of the mutation operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃 +
𝑘𝑃)= 𝑂(𝑛𝑚𝑘𝑃).
Step 6: Twin Removal
In CSClust, the complexity of twin removal is the same as the complexity of twin removal of
DeRanClust. Therefore, the complexity of twin removal is 𝑂(𝑚𝑘2𝑃).
Step 7: Cleansing Operation
CSClust applies the cleansing operation to each chromosome in a population based on the
properties of a sensible clustering solution. In this operation, it finds the minimum (𝑀𝑛) and
maximum (𝑀𝑥) number of clusters, and the minimum number of records (𝑀𝑟) in a cluster, for
each chromosome. In order to find the minimum number of records in a cluster, CSClust
calculates the distance between genes and records to form clusters. The complexity of this is
𝑂(𝑛𝑚𝑘), where 𝑘 is the maximum number of genes in each chromosome. If there are 𝑃 number
of chromosomes, then the complexity of selecting a minimum number of records in a cluster
is 𝑂(𝑛𝑚𝑘𝑃).
Once the clusters are formed for each chromosome, it then finds the minimum and maximum
numbers of clusters from 𝑃 chromosomes. The complexity of this is 𝑂(𝑃). Therefore, the total
complexity of the cleansing operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Step 8: Cloning Operation
In the cloning operation, CSClust replaces the sick chromosomes. To replace a sick
chromosome, another chromosome is probabilistically selected from the pool of chromosomes.
If there are 𝑃 chromosomes in the pool then the complexity of the probabilistic
selection is 𝑂(𝑃). CSClust then randomly changes an attribute value of a gene to another value
within the domain of the attribute. The complexity of this is 𝑂(𝑃). Therefore, the overall
complexity of Step 8 is 𝑂(𝑃 + 𝑃) = 𝑂(𝑃).
Step 9: Elitist Operation
In CSClust, the complexity of the elitist operation is the same as the complexity of the elitist
operation of DeRanClust. Therefore, the complexity of the elitist operation is 𝑂(𝑛𝑚𝑘𝑃).
If there are 𝑁 iterations, then Steps 3 to 9 will be repeated 𝑁 times while the
normalization, population initialization and selection of sensible properties will occur only once.
Therefore, the total complexity of the steps will be 𝑂(𝑛𝑚2 + 𝑚𝑑2 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 +
𝑃2 + 𝑛𝑚𝑘𝑃 + 𝑁(𝑃2 + 𝑛𝑚𝑘𝑃 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃 + 𝑛𝑚𝑘𝑃)) = 𝑂(𝑛𝑚2 + 𝑚𝑑2 +
𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃)). If 𝑛 ≫ 𝑑, 𝑚 ≫ 𝑑, 𝑛 ≫ 𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫
𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of CSClust is 𝑂(𝑛𝑚).
9.4.7 Complexity of HeMI++
In Chapter 8, we present a novel clustering technique titled HeMI++. The step-by-step detailed
complexity analysis of HeMI++ is as follows:
Step 1: Normalization
Numerical Attributes
In HeMI++, the complexity of normalizing the values of numerical attributes of a data set is the
same as the complexity of normalizing the attribute values of DeRanClust. Therefore, the
complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
Categorical Attributes
In HeMI++, the complexity of normalizing the values of categorical attributes of a data set is
the same as the complexity of normalizing the attribute values of HeMI. Therefore, the
complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚2 + 𝑚𝑑2).
The overall complexity of normalization is 𝑂(𝑛𝑚 + 𝑛𝑚2 + 𝑚𝑑2) = 𝑂(𝑛𝑚2 + 𝑚𝑑2).
Step 2: Population Initialization
The complexity of population initialization of HeMI++ is the same as the complexity of
population initialization of HeMI. Therefore, the complexity of population initialization is
𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 3: Selection of Sensible Properties
In HeMI++, the complexity of the selection of sensible properties is the same as the complexity
of selection of sensible properties of CSClust. Therefore, the complexity of the selection of
sensible properties is 𝑂(𝑛𝑚𝑘𝑃).
Step 4: Noise-based Selection
The complexity of noise-based selection of HeMI++ is the same as the complexity of noise-
based selection of DeRanClust and HeMI. Therefore, the complexity of noise-based selection is
𝑂(𝑃).
Step 5: Crossover Operation
In HeMI++, the complexity of the crossover operation is the same as the complexity of the
crossover operation of DeRanClust and HeMI. Therefore, the complexity of the crossover
operation is 𝑂(𝑃2).
Step 6: Twin Removal
The complexity of twin removal is the same as the complexity of twin removal of DeRanClust.
Therefore, the complexity of twin removal is 𝑂(𝑚𝑘2𝑃).
Step 7: Three Steps Mutation Operation
In HeMI++, the complexity of the three step mutation is the same as the complexity of the three
step mutation of HeMI. Therefore, the complexity of the three step mutation is 𝑂(𝑛𝑚𝑘𝑃).
Step 8: Health Improvement Operation
The complexity of the health improvement operation of HeMI++ is the same as the complexity
of the health improvement operation of HeMI. Therefore, the complexity of the health
improvement operation is 𝑂(𝑛𝑚𝑘𝑃).
Step 9: Cleansing Operation
In HeMI++, the complexity of the cleansing operation is the same as the complexity of the
cleansing operation of CSClust. Therefore, the complexity of the cleansing operation is
𝑂(𝑛𝑚𝑘𝑃).
Step 10: Cloning Operation
In HeMI++, the complexity of the cloning operation is the same as the complexity of the cloning
operation of CSClust. Therefore, the complexity of the cloning operation is 𝑂(𝑃).
Step 11: Elitist Operation
In HeMI++, the complexity of the elitist operation is the same as the complexity of the elitist
operation of HeMI. Therefore, the complexity of the elitist operation is 𝑂(𝑛𝑚𝑘𝑃).
Step 12: Neighbor Information Sharing
In HeMI++, the complexity of neighbor information sharing is the same as the complexity of
neighbor information sharing of HeMI. Therefore, the complexity for neighbor information
sharing is 𝑂(𝑃).
Step 13: Global Best Selection
The complexity of global best selection of HeMI++ is the same as the complexity of global best
selection of HeMI. Therefore, the complexity of global best selection is 𝑂(𝑃).
If there are 𝑁 iterations, then Steps 4 to 12 will be repeated 𝑁 times while the
normalization, population initialization, selection of sensible properties and global best selection
will occur only once. Therefore, the total complexity of the steps will be 𝑂(𝑛𝑚2 + 𝑚𝑑2 +
𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑛𝑚𝑘𝑃 + 𝑃 + 𝑁(𝑃 + 𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 +
𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃)) = 𝑂(𝑛𝑚2 + 𝑚𝑑2 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃)). If
𝑛 ≫ 𝑑, 𝑚 ≫ 𝑑, 𝑛 ≫ 𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the
complexity of HeMI++ is 𝑂(𝑛𝑚).
9.4.8 Complexity of AGCUK
The complexity of AGCUK is taken from the AGCUK paper (Y. Liu et al., 2011) and the thesis
(Md Anisur Rahman, 2014). The complexity of each generation/iteration is 𝑂(𝑛𝑚𝑘𝑃). If there
are 𝑁 iterations, then the total complexity of AGCUK for 𝑁 iterations is 𝑂(𝑛𝑚𝑘𝑃𝑁).
If 𝑛 ≫ 𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of AGCUK
is 𝑂(𝑛𝑚).
9.4.9 Complexity of GAGR
The complexity of GAGR (D.-X. Chang et al., 2009) is derived from the thesis (Md Anisur
Rahman, 2014). The detailed complexity of GAGR is as follows:
Step 1:
In Step 1, GAGR generates 𝑃 number of random chromosomes with a complexity of 𝑂(𝑚𝑘𝑃).
Step 2:
In Step 2, the fitness of 𝑃 chromosomes is calculated using the sum of square error (SSE). The
complexity of the SSE calculation is 𝑂(𝑚𝑘𝑅𝑃), if there are 𝑃 number of chromosomes and the
minimum number of records in a cluster is 𝑅. The complexity of fitness calculation of 𝑃 number
of chromosomes is 𝑂(𝑃). GAGR then finds the best chromosome with a complexity of 𝑂(𝑃).
The best chromosome is then stored in a separate location. The complexity of this is 𝑂(𝑚𝑘).
The total complexity of Step 2 is 𝑂(𝑚𝑘𝑅𝑃 + 𝑃 + 𝑃 + 𝑚𝑘) = 𝑂(𝑚𝑘𝑅𝑃).
Step 3:
In Step 3, GAGR selects the best chromosome as the best cluster result with a complexity
of 𝑂(𝑚𝑘).
Step 4:
In Step 4, it selects the chromosomes for crossover and mutation operation with a complexity
of 𝑂(𝑚𝑘𝑃).
Step 5:
In Step 5, GAGR applies crossover operation. Before the crossover, it applies gene re-
arrangement on each chromosome (except the best chromosome) with a complexity of 𝑂(𝑚𝑘2).
The complexity of the gene rearrangement of (𝑃 − 1) chromosomes is 𝑂((𝑃 − 1)𝑘2𝑚)
= 𝑂(𝑚𝑘2𝑃). The complexity of the crossover of a pair of chromosomes is 𝑂(𝑘𝑚). The
complexity of the crossover of 𝑃/2 pairs of chromosomes is 𝑂((𝑃/2)𝑘𝑚). The overall complexity
of Step 5 is 𝑂(𝑚𝑘2𝑃) + 𝑂((𝑃/2)𝑘𝑚) = 𝑂(𝑚𝑘2𝑃).
Step 6:
In Step 6, GAGR applies mutation operation. It calculates the mutation probability of 𝑃
chromosomes with a complexity of 𝑂(𝑃). The complexity of performing a mutation operation
on a chromosome is 𝑂(𝑚𝑘). The complexity of inserting a mutated/un-mutated chromosome
is 𝑂(𝑚𝑘). The overall complexity of Step 6 is 𝑂(𝑃 + 𝑚𝑘 + 𝑚𝑘) = 𝑂(𝑃 + 𝑚𝑘).
Step 7:
In Step 7, GAGR calculates the fitness of the newly generated chromosomes with a complexity of 𝑂(𝑚𝑘𝑅𝑃).
Step 8:
In Step 8, it compares the worst chromosome in the new population with the best chromosome
(i.e. the best chromosome from all previous generations) in terms of their fitness value. The
complexity of this is 𝑂(𝑚𝑘).
Step 9:
In Step 9, GAGR finds the best chromosome in the new population and replaces the best
chromosome (i.e. the best chromosome from all previous generations). The complexity of this
is 𝑂(𝑚𝑘).
Step 10:
In Step 10, the best chromosome is selected as a reference for the gene re-arrangement. The
complexity of this is 𝑂(𝑚𝑘).
If there are 𝑁 iterations, then the total complexity of GAGR is 𝑂(𝑚𝑘𝑅𝑃𝑁 + 𝑚𝑘2𝑃𝑁).
Moreover, GAGR applies K-means on the best clustering solution. If there are 𝑁′ number of
iterations in K-means then the complexity of K-Means is 𝑂(𝑛𝑚𝑘𝑁′) (Md Anisur Rahman,
2014). Therefore, the total complexity of GAGR is 𝑂(𝑚𝑘𝑅𝑃𝑁 + 𝑚𝑘2𝑃𝑁 + 𝑛𝑚𝑘𝑁′). If 𝑛 ≫
𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑅, 𝑚 ≫ 𝑅, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the
complexity of GAGR is 𝑂(𝑛𝑚).
9.4.10 Complexity of GenClust
The complexity of GenClust is taken from the GenClust paper (Rahman & Islam, 2014). The
detailed complexity of GenClust is as follows:
Step 1: Population Initialization
GenClust produces its initial population through two phases: deterministic and random. The
complexity of the deterministic phase is 𝑂(𝑃(𝑛𝑚2 + 𝑚𝑑2 + 𝑛2𝑚)), where 𝑃 is the number of
chromosomes. The complexity of the random phase is 𝑂(𝑘𝑃). The total complexity of the
population initialization is 𝑂(𝑃(𝑛𝑚2 + 𝑚𝑑2 + 𝑛2𝑚) + 𝑘𝑃).
Step 2: Selection Operation
GenClust uses COSEC to calculate the fitness of each chromosome. The complexity of
calculating the fitness of each chromosome is 𝑂(𝑛𝑚𝑘 + 𝑚𝑘2). If there are 𝑃 chromosomes
then the complexity of the fitness calculation for 𝑃 chromosomes is 𝑂(𝑛𝑚𝑘𝑃 +
𝑚𝑘2𝑃). Chromosomes are then sorted in descending order with a complexity of 𝑂(𝑃2) in order
to find the top 𝑃/2 chromosomes. The total complexity of the selection operation is 𝑂(𝑛𝑚𝑘𝑃 +
𝑚𝑘2𝑃 + 𝑃2).
Step 3: Crossover Operation
Before the crossover, GenClust applies gene re-arrangement on each chromosome (except the
best chromosome) with the complexity of 𝑂(𝑚𝑘2). The complexity of gene rearrangement of
(𝑃 − 1) chromosomes is 𝑂((𝑃 − 1)𝑘2𝑚) = 𝑂(𝑚𝑘2𝑃). GenClust uses the roulette wheel approach
to select a pair of chromosomes. For 𝑃 chromosomes there are 𝑃/2 crossovers
altogether. Therefore, the complexity of the crossover operation is 𝑂(𝑃2). After the crossover
operation, GenClust applies the twin removal operation with a complexity of 𝑂(𝑘2𝑚).
Therefore, the total complexity of this step is 𝑂(𝑃2 + 𝑚𝑘2𝑃 + 𝑚𝑘2𝑃) = 𝑂(𝑃2 + 𝑚𝑘2𝑃).
Step 4: Elitism Operation
GenClust calculates the fitness of 𝑃 chromosomes with a complexity of 𝑂(𝑛𝑚𝑘𝑃 + 𝑚𝑘2𝑃). It
then finds the best and worst chromosome with a complexity of 𝑂(𝑃). The total complexity of
the elitist operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑚𝑘2𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃 + 𝑚𝑘2𝑃).
Step 5: Mutation Operation
GenClust probabilistically selects a chromosome for the mutation operation. The complexity of
the probabilistic selection is 𝑂(𝑃). It then randomly changes an attribute value of each and every
gene of the selected chromosomes. The complexity of this is 𝑂(𝑘𝑃). GenClust again applies the
twin removal operation with a complexity of 𝑂(𝑚𝑘2𝑃). Therefore, the total complexity of this
step is 𝑂(𝑃 + 𝑘𝑃 + 𝑚𝑘2𝑃) = 𝑂(𝑘𝑃 + 𝑚𝑘2𝑃).
Step 6: K-Means
GenClust applies K-means to the best clustering result with a complexity of 𝑂(𝑛𝑚𝑘𝑁′).
The overall complexity of GenClust is 𝑂(𝑛𝑚2𝑃 + 𝑚𝑑2𝑃 + 𝑛2𝑚𝑃 + 𝑘𝑃 + 𝑁(𝑛𝑚𝑘𝑃 +
𝑚𝑘2𝑃 + 𝑃2) + 𝑛𝑚𝑘𝑁′). If 𝑛 ≫ 𝑑, 𝑚 ≫ 𝑑, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑁, 𝑚 ≫ 𝑁,
𝑛 ≫ 𝑁′and 𝑚 ≫ 𝑁′ then the complexity of GenClust is 𝑂(𝑛𝑚2 + 𝑛2𝑚).
9.4.11 Complexity of K-means
The complexity of K-means is 𝑂(𝑛𝑚𝑘𝑁′) (Kolen & Hutcheson, 2002; Md Anisur Rahman,
2014; Xu & Wunsch, 2005). If 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′ and 𝑚 ≫ 𝑁′ then the complexity of K-
Means is 𝑂(𝑛𝑚).
The complexities of some other existing GA-based clustering techniques such as
CLUSTERING (Tseng & Yang, 2001), GCA (Garai & Chaudhuri, 2004) and TGCA (He &
Tan, 2012) are 𝑂(𝑛2 + 𝑚2) (Tseng & Yang, 2001), 𝑂(𝑛2) (Garai & Chaudhuri, 2004) and
𝑂(𝑛𝑚2) (He & Tan, 2012), respectively.
9.5 Comparison of the Complexities of the Techniques
In Table 9.1, we present the complexities of our proposed technique. We also present the
complexities of some existing techniques in Table 9.1.
From Table 9.1, it is evident that the complexity of each of our proposed techniques is 𝑂(𝑛𝑚),
which is lower than the complexity of GenClust. The complexity of GenClust is 𝑂(𝑛𝑚2 +
𝑛2𝑚). Besides, the complexity of HeMI++ is 𝑂(𝑛𝑚) which is the same as the complexity of
AGCUK, GAGR and K-means. Although the complexity of AGCUK, GAGR and K-means is
equal to the complexity of HeMI++, the clustering quality of HeMI++ is better than the
clustering quality of AGCUK, GAGR, and K-means.
Table 9.1: The complexities of the techniques
Techniques Complexity
DeRanClust 𝑂(𝑛𝑚)
GCS 𝑂(𝑛𝑚)
GMC 𝑂(𝑛𝑚)
HeMI 𝑂(𝑛𝑚)
CSClust 𝑂(𝑛𝑚)
HeMI++ 𝑂(𝑛𝑚)
AGCUK 𝑂(𝑛𝑚)
GAGR 𝑂(𝑛𝑚)
GenClust 𝑂(𝑛𝑚2 + 𝑛2𝑚)
K-means 𝑂(𝑛𝑚)
CLUSTERING 𝑂(𝑛2 + 𝑚2)
GCA 𝑂(𝑛2)
TGCA 𝑂(𝑛𝑚2)
9.6 Summary of the Proposed Techniques
In this section, a summary (see Table 9.2) of the strengths and weaknesses of the proposed
clustering techniques is provided, in order to assist the reader in selecting a clustering
technique.
However, HeMI++ (presented in Chapter 8) is the most appropriate technique among our
proposed methods because it is suitable for knowledge discovery. CSClust (presented in Chapter
7) is another suitable technique for knowledge discovery. However, CSClust is affected by the
limitations of the DB Index, whereas HeMI++ is the more advanced version and does not suffer from
these limitations. HeMI++ is applicable to data sets with numerical and/or
categorical attributes. It generates the number of clusters automatically through the clustering
process. It learns the sensible clustering property from a data set and applies that to produce a
sensible clustering solution.
Table 9.2: Strengths and weaknesses of the proposed techniques

DeRanClust (presented in Chapter 3)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. The number of clusters is generated automatically through the clustering process.
Weaknesses: Inappropriate for knowledge discovery. Cannot handle data sets with categorical attributes. Affected by the limitations of the DB Index.

GMC (presented in Chapter 4)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. Produces good-quality offspring chromosomes through two phases of crossover operation. Improves chromosome quality through three steps of mutation operation. The number of clusters is generated automatically through the clustering process.
Weaknesses: Inappropriate for knowledge discovery. Cannot handle data sets with categorical attributes. Affected by the limitations of the DB Index.

GCS (presented in Chapter 5)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. Ensures the presence of good-quality chromosomes in a population at the beginning of each generation through a selection operation. Ensures the presence of healthy chromosomes in each population through a health check operation. The number of clusters is generated automatically through the clustering process.
Weaknesses: Inappropriate for knowledge discovery. Cannot handle data sets with categorical attributes. Affected by the limitations of the DB Index.

HeMI (presented in Chapter 6)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. Uses a big population through multiple streams. Improves chromosome quality through three steps of mutation operation. Ensures the presence of good-quality chromosomes in each population through a health check operation. The number of clusters is generated automatically through the clustering process. Applicable to data sets with numerical and/or categorical attributes.
Weaknesses: Inappropriate for knowledge discovery. Affected by the limitations of the DB Index.

CSClust (presented in Chapter 7)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. Learns the sensible clustering property from a data set and applies that to produce a sensible clustering solution. The number of clusters is generated automatically through the clustering process. Suitable for knowledge discovery. Applicable to data sets with numerical and/or categorical attributes.
Weaknesses: Affected by the limitations of the DB Index.

HeMI++ (presented in Chapter 8)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. Learns the sensible clustering property from a data set and applies that to produce a sensible clustering solution. Uses a big population through multiple streams. Improves chromosome quality through three steps of mutation operation. Ensures the presence of good-quality chromosomes in each population through a health check operation. The number of clusters is generated automatically through the clustering process. Suitable for knowledge discovery. Applicable to data sets with numerical and/or categorical attributes.
Weaknesses: NA.

Tree Index (presented in Chapter 8)
Strengths: Suitable for evaluating sensible and non-sensible clustering solutions.
Weaknesses: NA.
9.7 Future Research Directions
A future research direction could involve making further modifications to the
proposed technique HeMI++. HeMI++ learns the necessary properties (the minimum and maximum
number of clusters, and the minimum number of records in a cluster) of sensible clustering
solutions based on the chromosomes that it generates in the initial population through multiple
streams. For the minimum and maximum number of clusters, HeMI++ finds the
minimum and maximum number of clusters of the chromosomes that it generates through
multiple streams. We could explore finding the appropriate number of clusters without using any range
(minimum and maximum number of clusters), although the number of clusters varies from
data set to data set.
Similarly, we can also explore the minimum number of records in a cluster. HeMI++ finds
the minimum number of records in a cluster for each of the chromosomes generated through
multiple streams. It then sorts these numbers in descending order and calculates the median of
these numbers. The median value is then used as the property, the minimum number of records
in a cluster. We plan to further explore the correctness of using the median value as the minimum
number of records in a cluster. Another future research direction could entail making further
modification of our cluster evaluation technique Tree Index.
Chapter 10
Conclusion
Clustering is an important and well-known technique in the area of data mining which has a
wide range of applications such as machine learning (Gan, 2013; Mukhopadhyay & Maulik,
2009), image segmentation (Cai et al., 2007; B. N. Li et al., 2011; F. Zhao et al., 2014), medical
imaging (Bai et al., 2013; Kannan et al., 2010; Kaya et al., 2017; Saha et al., 2016) and social
network analysis (Girvan & Newman, 2002).
Therefore, it is crucial to improve clustering techniques in order to obtain better quality
clusters from data sets. There are many approaches for clustering as presented in the literature.
However, the existing clustering techniques have some limitations and therefore, there is room
for further improvement. In this study we present a number of clustering techniques that produce
better quality clustering solutions than a number of recent existing techniques. In Chapter 9 we
present a detailed analysis and discussions based on the basic concepts and advantages of our
proposed techniques. We also present the main contributions of this study in Chapter 9 (see
Section 9.3).
Many existing techniques have some limitations such as the requirement for user input on
the number of clusters, the tendency of getting stuck at local optima while clustering the records,
and selection of high-quality initial seeds with a high complexity of 𝑂(𝑛2). Therefore, in this
study we present a number of clustering techniques that sequentially improve the cluster quality
and produce high-quality clustering solutions with low complexity, requiring no user input.
We propose DeRanClust (see Chapter 3) that does not require any user input, is less likely
to get stuck at local optima and explores high-quality initial seeds with a low complexity
of 𝑂(𝑛). In Chapter 3, we progress towards achieving our first research goal. We realize that
there is room for further improvement of cluster quality of DeRanClust by improving other
genetic operations such as crossover and mutation operations. Therefore, we propose GMC (see
Chapter 4) that uses a new selection, crossover and mutation operation in order to improve the
chromosome quality. Chapter 4 involved further progress towards achieving our research goal
1.
Typically, genetic operations such as the crossover and mutation operations tend to improve
the health of a chromosome, but they can also cause the health of some chromosomes to
deteriorate. Therefore, we propose GCS (see Chapter 5) that uses a health check operation in
order to ensure the presence of healthy chromosomes in a population. GCS also uses a new
crossover and selection operation. Chapter 5 further refines the techniques proposed in the
previous two chapters, and allows us to move closer to achieving research goal 1.
In addition, from the literature (Pourvaziri & Naderi, 2014; Straßburg et al., 2012), and
through our empirical analysis (carried out in Chapter 6) we find that the population size has a
positive impact on the clustering quality. That is, a big population size is likely to contribute
towards a good clustering solution. However, a big population size requires a high execution time.
Therefore, we propose HeMI (see Chapter 6) that uses a big population in multiple streams,
where each stream contains a relatively small number of chromosomes and thus facilitates
a low execution time, since the streams are suitable for parallel processing when necessary.
HeMI also introduces information sharing among the streams at a regular interval in order to
take advantage of the multiple streams. HeMI also uses a new health improvement operation in
order to ensure healthy chromosomes in a population. We compare HeMI with five
existing techniques on 20 publicly available data sets in terms of two well-known evaluation
criteria (see Section 6.3.4 of Chapter 6). We also carry out thorough experimentation to
investigate the usefulness of the new components of HeMI (see Section 6.3.7 of Chapter 6). In
Chapter 6 we achieve our first goal of proposing a parameter-less clustering technique with high-
quality solutions and low complexity.
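The multi-stream idea with periodic information sharing can be sketched as follows. The evolve function (one generation of selection, crossover and mutation on a single stream) and the fitness function are assumed to be supplied, and the sharing rule shown (broadcast the globally best chromosome, replacing each stream's worst member) is an illustrative choice rather than HeMI's exact operation.

def multi_stream_ga(streams, evolve, fitness, generations, share_every=5):
    for generation in range(1, generations + 1):
        # Each stream evolves independently, so this loop is trivially
        # parallelisable when needed.
        streams = [evolve(stream) for stream in streams]
        if generation % share_every == 0:
            best = max((c for stream in streams for c in stream), key=fitness)
            for stream in streams:
                worst = min(range(len(stream)), key=lambda i: fitness(stream[i]))
                stream[worst] = best  # inject the shared chromosome
    return streams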
However, in order to achieve our second goal of producing sensible clustering solutions, we
carefully assess the results obtained by HeMI and other existing techniques. We find that some
recent clustering techniques do not produce sensible clusters and fail to discover knowledge
from the underlying data sets. Therefore, we propose CSClust (see Chapter 7) that uses a new
cleansing and cloning operation which helps to produce sensible clusters with high fitness
values.
Finally, we propose HeMI++, which combines our previous technique CSClust with HeMI
and significantly improves the components of both. In HeMI++, we first explore the quality
of HeMI and some existing clustering techniques. We observe that some existing clustering
techniques do not produce sensible clusters (see Fig. 8.2 and Fig. 8.3). We find that our
technique HeMI also does not produce sensible clusters (see Fig. 8.4). Sometimes these
techniques obtain a huge number of clusters, and sometimes they obtain only two clusters,
where one cluster contains one record and the other cluster contains all remaining records.
In order to handle such a situation, HeMI++ uses a new component called Selection of
Sensible Properties. Through this component HeMI++ first learns important properties of
sensible clustering solutions and then applies this information in producing its clustering
solutions.
HeMI++ also proposes a cleansing and cloning operation that helps to produce sensible
clusters. HeMI++ learns the necessary properties of a sensible clustering solution for a data set from
a high-quality initial population without requiring any user input. It then disqualifies the
chromosomes that do not satisfy the properties through its cleansing operation. In the cloning
operation, the disqualified chromosomes are then replaced by high-quality chromosomes found
in the initial population.
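A minimal sketch of the cleansing and cloning idea follows, assuming a satisfies predicate that tests the learned properties (such as the minimum number of records in a cluster) and a fitness function for ranking chromosomes; neither name reflects the thesis's actual implementation.

def cleanse_and_clone(population, initial_population, satisfies, fitness):
    # Cleansing: disqualify chromosomes violating the learned properties.
    survivors = [c for c in population if satisfies(c)]
    # Cloning: refill with the best chromosomes of the initial population.
    donors = sorted(initial_population, key=fitness, reverse=True)
    for donor in donors:
        if len(survivors) >= len(population):
            break
        survivors.append(donor)
    return survivors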
During the development of the proposed clustering techniques, we realize that the existing
cluster evaluation techniques are biased towards either a high number of clusters or a very low
number of clusters. Therefore, we also evaluate the existing cluster evaluation techniques by
analyzing them on some ground-truth results, which we also visualize graphically (see Section
8.2.1 of Chapter 8). We find that the existing evaluation techniques produce better evaluation
values for non-sensible clustering solutions than for sensible clustering solutions.
Hence, in this study we propose Tree Index, which scores sensible solutions higher than
non-sensible solutions (see Sections 8.2.1, 8.2.2 and 8.3.5 of Chapter 8).
We then empirically compare our proposed clustering technique (HeMI++) with five existing
techniques on 21 publicly available data sets in terms of our Tree Index. We find that HeMI++
achieves the best clustering solutions in 18 out of 21 data sets (see Section 8.3.8 of Chapter 8).
Moreover, we graphically visualize the clustering results of HeMI++ on a brain data set and find
the results to be more sensible than those of other techniques. Additionally, we discover some
useful knowledge from the clustering results produced by HeMI++, indicating its usefulness in
knowledge discovery. In Chapter 8 we achieve our second and third research goals: producing
high-quality and sensible clusters with no user input, and providing a cluster evaluation
technique that better distinguishes sensible from non-sensible clustering results. Future
research directions for the proposed techniques are presented in Chapter 9 (see Section 9.7).
References
Aalaei, A., Fazlollahtabar, H., Mahdavi, I., Mahdavi-Amiri, N., & Yahyanejad, M. H. (2013).
A genetic algorithm for a creativity matrix cubic space clustering: A case study in
Mazandaran Gas Company. Applied Soft Computing, 13(4), 1661–1673.
http://doi.org/10.1016/j.asoc.2012.12.011
Abolhassani, B., Salt, J. E., & Dodds, D. E. (2004). A Two-Phase Genetic K-Means Algorithm
for Placement of Radioports in Cellular Networks. IEEE Transactions on Systems, Man
and Cybernetics, Part B (Cybernetics), 34(1), 533–538.
http://doi.org/10.1109/TSMCB.2003.817073
Abonyi, J., & Feil, B. (2007). Cluster Analysis for Data Mining and System
Identification. Basel: Birkhäuser. http://doi.org/10.1007/978-3-7643-7988-9
Abshouri, A. A., & Bakhtiary, A. (2012). A new clustering method based on Firefly and KHM.
Journal of Communication and Computer, 9, 387–391.
Adnan, M. N., & Islam, M. Z. (2014). ComboSplit: Combining Various Splitting Criteria for
Building a Single Decision Tree. In International Conference on Artificial Intelligence
and Pattern Recognition (pp. 1–8).
Adnan, M. N., & Islam, M. Z. (2016). Forest CERN: A New Decision Forest Building
Technique (pp. 304–315). Springer, Cham. http://doi.org/10.1007/978-3-319-31753-3_25
Agustín-Blas, L. E., Salcedo-Sanz, S., Jiménez-Fernández, S., Carro-Calvo, L., Del Ser, J., &
Portilla-Figueras, J. A. (2012). A new grouping genetic algorithm for clustering problems.
Expert Systems with Applications, 39(10), 9695–9703.
http://doi.org/10.1016/j.eswa.2012.02.149
Ahmad, A., & Dey, L. (2007a). A k-mean clustering algorithm for mixed numeric and
categorical data. Data & Knowledge Engineering, 63(2), 503–527.
http://doi.org/10.1016/j.datak.2007.03.016
Ahmad, A., & Dey, L. (2007b). A method to compute distance between two categorical values
of same attribute in unsupervised learning for categorical data set. Pattern Recognition
Letters (Vol. 28). http://doi.org/10.1016/j.patrec.2006.06.006
Alexander, G. J., & Peterson, M. A. (2007). An analysis of trade-size clustering and its relation
to stealth trading. Journal of Financial Economics, 84(2), 435–471.
http://doi.org/10.1016/j.jfineco.2006.02.005
Andreopoulos, B., An, A., & Wang, X. (2007). Hierarchical Density-Based Clustering of
Categorical Data and a Simplification. In Advances in Knowledge Discovery and Data
Mining (pp. 11–22). Berlin, Heidelberg: Springer Berlin Heidelberg.
http://doi.org/10.1007/978-3-540-71701-0_5
Andreopoulos, W. (2006). Clustering Algorithms for Categorical Data. York University,
Toronto, Ontario.
ap Gwilym, O., & Verousis, T. (2010). Price clustering and underpricing in the IPO
aftermarket. International Review of Financial Analysis, 19(2), 89–97.
http://doi.org/10.1016/j.irfa.2010.01.007
Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms,
SODA 2007 (pp. 1027–1035). New Orleans, Louisiana, USA.
Ashton, J. K., & Hudson, R. S. (2008). Interest rate clustering in UK financial services markets.
Journal of Banking & Finance, 32(7), 1393–1403.
http://doi.org/10.1016/j.jbankfin.2007.11.002
Bador, M., Gilleland, E., Castellà, M., & Arivelo, T. (2015). Spatial clustering of summer
temperature maxima from the CNRM-CM5 climate model ensembles & E-OBS over
Europe. Weather and Climate Extremes, 9, 17–24.
http://doi.org/10.1016/j.wace.2015.05.003
Bai, P. R., Liu, Q. Y., Li, L., Teng, S. H., Li, J., & Cao, M. Y. (2013). A novel region-based
level set method initialized with mean shift clustering for automated medical image
segmentation. Computers in Biology and Medicine, 43(11), 1827–1832.
http://doi.org/10.1016/j.compbiomed.2013.08.024
Bandyopadhyay, S., & Maulik, U. (2001). Nonparametric genetic clustering: comparison of
validity indices. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 31(1),
120–125.
Bandyopadhyay, S., & Maulik, U. (2002). An evolutionary technique based on K-Means
algorithm for optimal clustering in RN. Information Sciences, 146(1), 221–237.
http://doi.org/10.1016/S0020-0255(02)00208-6
Bandyopadhyay, S., Maulik, U., & Mukhopadhyay, A. (2007). Multiobjective Genetic
Clustering for Pixel Classification in Remote Sensing Imagery. IEEE Transactions on
Geoscience and Remote Sensing, 45(5), 1506–1511.
http://doi.org/10.1109/TGRS.2007.892604
Banharnsakun, A., Sirinaovakul, B., & Achalakul, T. (2013). The best-so-far ABC with
multiple patrilines for clustering problems. Neurocomputing, 116, 355–366.
http://doi.org/10.1016/j.neucom.2012.02.047
Beauchemin, M. (2015). On affinity matrix normalization for graph cuts and spectral
clustering. Pattern Recognition Letters (Vol. 68).
http://doi.org/10.1016/j.patrec.2015.08.020
Bello-Orgaz, G., Jung, J. J., & Camacho, D. (2016). Social big data: Recent achievements and
new challenges. Information Fusion, 28, 45–59.
http://doi.org/10.1016/j.inffus.2015.08.005
Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms.
Plenum Press.
Brameier, M., & Wiuf, C. (2007). Co-clustering and visualization of gene expression data and
gene ontology terms for Saccharomyces cerevisiae using self-organizing maps. Journal
of Biomedical Informatics, 40(2), 160–173. http://doi.org/10.1016/j.jbi.2006.05.001
Brown, P., Chua, A., & Mitchell, J. (2002). The influence of cultural factors on price clustering:
Evidence from Asia–Pacific stock markets. Pacific-Basin Finance Journal, 10(3), 307–
332. http://doi.org/10.1016/S0927-538X(02)00049-5
Cagnina, L., Errecalde, M., Ingaramo, D., & Rosso, P. (2014). An efficient Particle Swarm
Optimization approach to cluster short texts. Information Sciences, 265, 36–49.
http://doi.org/10.1016/j.ins.2013.12.010
Cai, W., Chen, S., & Zhang, D. (2007). Fast and robust fuzzy c-means clustering algorithms
incorporating local information for image segmentation. Pattern Recognition, 40(3), 825–
838. http://doi.org/10.1016/j.patcog.2006.07.011
Chan, K. Y., Kwong, C. K., & Hu, B. Q. (2012). Market segmentation and ideal point
identification for new product design using fuzzy data compression and fuzzy clustering
methods. Applied Soft Computing, 12(4), 1371–1378.
http://doi.org/10.1016/j.asoc.2011.11.026
Chang, D.-X., Zhang, X.-D., & Zheng, C.-W. (2009). A genetic algorithm with gene
rearrangement for K-means clustering. Pattern Recognition, 42(7), 1210–1222.
http://doi.org/10.1016/j.patcog.2008.11.006
Chang, D., Zhao, Y., Zheng, C., & Zhang, X. (2012). A genetic clustering algorithm using a
message-based similarity measure. Expert Systems with Applications, 39(2), 2194–2202.
http://doi.org/10.1016/j.eswa.2011.07.009
Chapelle, O., Scholkopf, B., & Zien, A. (Eds.). (2006). Semi-Supervised Learning. The MIT
Press. http://doi.org/10.7551/mitpress/9780262033589.001.0001
Chen, M.-Y. (2013). A hybrid ANFIS model for business failure prediction utilizing particle
swarm optimization and subtractive clustering. Information Sciences, 220, 180–195.
http://doi.org/10.1016/j.ins.2011.09.013
Chen, Y., Wang, L., Li, F., Du, B., Choo, K.-K. R., Hassan, H., & Qin, W. (2017). Air quality
data clustering using EPLS method. Information Fusion, 36, 225–232.
http://doi.org/10.1016/j.inffus.2016.11.015
Chen, Z., & Ji, H. (2010). Graph-based Clustering for Computational Linguistics: A Survey.
In Proceedings of the 2010 Workshop on Graph-based Methods for Natural Language
Processing, ACL 2010 (pp. 1–9). Uppsala, Sweden.
Cheng, C. H., Lee, W. K., & Wong, K. F. (2002). A genetic algorithm-based clustering
approach for database partitioning. IEEE Transactions on Systems, Man and Cybernetics
Part C: Applications and Reviews, 32(3), 215–230.
http://doi.org/10.1109/TSMCC.2002.804444
Chiou, Y. C., & Lan, L. W. (2001). Genetic clustering algorithms. European Journal of
Operational Research, 135(2), 413–427. http://doi.org/10.1016/S0377-2217(00)00320-9
Chuang, K.-T., & Chen, M.-S. (2004). Clustering Categorical Data by Utilizing the Correlated-
Force Ensemble. In M. W. Berry, U. Dayal, C. Kamath, & D. Skillicorn (Eds.),
Proceedings of the 2004 SIAM International Conference on Data Mining. Philadelphia,
PA: Society for Industrial and Applied Mathematics.
http://doi.org/10.1137/1.9781611972740
Chuang, L.-Y., Hsiao, C.-J., & Yang, C.-H. (2011). Chaotic particle swarm optimization for
data clustering. Expert Systems with Applications, 38(12), 14555–14563.
http://doi.org/10.1016/j.eswa.2011.05.027
Cost, S., & Salzberg, S. (1993). A Weighted Nearest Neighbor Algorithm for Learning with
Symbolic Features. Machine Learning, 10(1), 57–78.
http://doi.org/10.1023/A:1022664626993
Cowgill, M. C., Harvey, R. J., & Watson, L. T. (1999). A genetic algorithm approach to cluster
analysis. Computers & Mathematics with Applications, 37(7), 99–108.
http://doi.org/10.1016/S0898-1221(99)00090-5
Cucchiara, R. (1998). Genetic algorithms for clustering in machine vision. Machine Vision and
Applications, 11(1), 1–6. http://doi.org/10.1007/s001380050084
Cura, T. (2012). A particle swarm optimization approach to clustering. Expert Systems with
Applications, 39(1), 1582–1588. http://doi.org/10.1016/j.eswa.2011.07.123
Mason, R. D. (1998). Statistics: An Introduction (5th ed.). Brooks/Cole Publishing Company.
Daraganova, G., Pattison, P., Koskinen, J., Mitchell, B., Bill, A., Watts, M., & Baum, S. (2012).
Networks and geography: Modelling community network structures as the outcome of
both spatial and network processes. Social Networks, 34(1), 6–17.
http://doi.org/10.1016/j.socnet.2010.12.001
Das, S., Abraham, A., & Konar, A. (2008). Automatic kernel clustering with a Multi-Elitist
Particle Swarm Optimization Algorithm. Pattern Recognition Letters (Vol. 29).
http://doi.org/10.1016/j.patrec.2007.12.002
Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on
Pattern Analysis and Machine Intelligence, PAMI-1(2), 224–227.
http://doi.org/10.1109/TPAMI.1979.4766909
de Arruda, G. F., Costa, L. da F., & Rodrigues, F. A. (2012). A complex networks approach
for data clustering. Physica A: Statistical Mechanics and Its Applications, 391(23), 6174–
6183. http://doi.org/10.1016/j.physa.2012.07.007
Deckersbach, T., Peters, A. T., Sylvia, L. G., Gold, A. K., da Silva Magalhaes, P. V., Henry,
D. B., … Miklowitz, D. J. (2016). A cluster analytic approach to identifying predictors
and moderators of psychosocial treatment for bipolar depression: Results from STEP-BD.
Journal of Affective Disorders, 203, 152–157. http://doi.org/10.1016/j.jad.2016.03.064
Demiriz, A., Bennett, K., & Embrechts, M. J. (1999). Semi-Supervised Clustering Using
Genetic Algorithms. In Artificial Neural Networks in Engineering (ANNIE-99) (pp.
809–814). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.3696
Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of
Machine Learning Research, 7, 1–30.
Deng, S., He, Z., & Xu, X. (2010). G-ANMI: A mutual information based genetic clustering
algorithm for categorical data. Knowledge-Based Systems, 23(2), 144–149.
http://doi.org/10.1016/j.knosys.2009.11.001
Diaz-Gomez, P., & Hougen, D. (2007). Initial Population for Genetic Algorithms: A Metric
Approach. In Proceedings of the 2007 International Conference on Genetic and
Evolutionary Methods (pp. 43–49). Las Vegas, Nevada, USA.
Dimopoulos, C., & Mort, N. (2001). A hierarchical clustering methodology based on genetic
programming for the solution of simple cell-formation problems. International Journal of
Production Research, 39(1), 1–19. http://doi.org/10.1080/00207540150208835
Dipnall, J. F., Pasco, J. A., Berk, M., Williams, L. J., Dodd, S., Jacka, F. N., & Meyer, D.
(2017). Why so GLUMM? Detecting depression clusters through graphing lifestyle-
environs using machine-learning methods (GLUMM). European Psychiatry, 39, 40–50.
http://doi.org/10.1016/j.eurpsy.2016.06.003
Dunn, J. C. (1974). Well-Separated Clusters and Optimal Fuzzy Partitions. Journal of
Cybernetics, 4(1), 95–104. http://doi.org/10.1080/01969727408546059
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical
Association, 56, 52–64.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density Based Notion of Clusters in
Large Spatial Databases with Noise. In 2nd International Conference on Knowledge
Discovery and Data Mining (KDD-96).
Fathian, M., Amiri, B., & Maroosi, A. (2007). Application of honey-bee mating optimization
algorithm on clustering. Applied Mathematics and Computation, 190(2), 1502–1513.
http://doi.org/10.1016/j.amc.2007.02.029
Festa, P. (2013). A biased random-key genetic algorithm for data clustering. Mathematical
Biosciences, 245(1), 76–85. http://doi.org/10.1016/j.mbs.2013.07.011
Firat, A., Chatterjee, S., & Yilmaz, M. (2007). Genetic clustering of social networks using
random walks. Computational Statistics & Data Analysis, 51, 6285–6294.
http://doi.org/10.1016/j.csda.2007.01.010
Firestone, S. M., Ward, M. P., Christley, R. M., & Dhand, N. K. (2011). The importance of
location in contact networks: Describing early epidemic spread using spatial social
network analysis. Preventive Veterinary Medicine, 102(3), 185–195.
http://doi.org/10.1016/j.prevetmed.2011.07.006
Forsati, R., Keikha, A., & Shamsfard, M. (2015). An improved bee colony optimization
algorithm with an application to document clustering. Neurocomputing, 159, 9–26.
http://doi.org/10.1016/j.neucom.2015.02.048
Friedman, M. (1940). A Comparison of Alternative Tests of Significance for the Problem of m
Rankings. Source: The Annals of Mathematical Statistics, 11(1), 86–92. Retrieved from
http://www.jstor.org/stable/2235971
Fuzzy Clustering. (2017). Retrieved February 27, 2017, from
http://reference.wolfram.com/legacy/applications/fuzzylogic/Manual/12.html
Galluccio, L., Michel, O., Comon, P., & Hero, A. O. (2012). Graph based k-means clustering.
Signal Processing, 92(9), 1970–1984. http://doi.org/10.1016/j.sigpro.2011.12.009
Gan, G. (2013). Application of data clustering and machine learning in variable annuity
valuation. Insurance: Mathematics and Economics, 53(3), 795–801.
http://doi.org/10.1016/j.insmatheco.2013.09.021
Ganti, V., Gehrke, J., & Ramakrishnan, R. (1999). CACTUS-Clustering Categorical Data
Using Summaries. In KDD-99 (pp. 73–83). San Diego, CA, USA.
Garai, G., & Chaudhuri, B. B. (2004). A novel genetic algorithm for automatic clustering.
Pattern Recognition Letters (Vol. 25). http://doi.org/10.1016/j.patrec.2003.09.012
Ghahramani, Z. (2004). Unsupervised Learning (pp. 72–112). Springer Berlin Heidelberg.
http://doi.org/10.1007/978-3-540-28650-9_5
Giebultowicz, S., Ali, M., Yunus, M., & Emch, M. (2011). A comparison of spatial and social
clustering of cholera in Matlab, Bangladesh. Health & Place, 17(2), 490–497.
http://doi.org/10.1016/j.healthplace.2010.12.004
Giggins, H., & Brankovic, L. (2012). VICUS: a noise addition technique for categorical data.
Proceedings of the Tenth Australasian Data Mining Conference - Volume 134, 139–148.
Giggins, H. P. (2009). Security of genetic databases. University of Newcastle,
Newcastle, NSW, Australia. Retrieved from
http://trove.nla.gov.au/work/31926869?selectedversion=NBD44558520
Girvan, M., & Newman, M. E. J. (2002). Community structure in social and biological
networks. Proceedings of the National Academy of Sciences, 99(12), 7821–7826.
http://doi.org/10.1073/pnas.122653799
Goldberg, D. E., Deb, K., & Clark, J. H. (1991). Genetic Algorithms, Noise, and the Sizing of
Populations. Complex Systems, 6, 333–362.
Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G.,
… Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet. Circulation,
101(23).
Gu, X., Zhang, Q., Singh, V. P., Chen, Y. D., & Shi, P. (2016). Temporal clustering of floods
and impacts of climate indices in the Tarim River basin, China. Global and Planetary
Change, 147, 12–24. http://doi.org/10.1016/j.gloplacha.2016.10.011
Gunaratne, G. H., Nicol, M., Seemann, L., & Török, A. (2009). Clustering of volatility in
variable diffusion processes. Physica A: Statistical Mechanics and Its Applications,
388(20), 4424–4430. http://doi.org/10.1016/j.physa.2009.06.050
Han, J., & Kamber, M. (2006). Data Mining Concepts and Techniques. San Francisco: Morgan
Kaufmann.
Hanagandi, V., & Nikolaou, M. (1998). A hybrid approach to global optimization using a
clustering algorithm in a genetic search framework. Computers & Chemical Engineering,
22(12), 1913–1925. http://doi.org/10.1016/S0098-1354(98)00251-8
Hassanzadeh, T., & Meybodi, M. R. (2012). A new hybrid approach for data clustering using
Firefly algorithm and k-means. In The 16th CSI International Symposium on Artificial
Intelligence and Signal Processing (AISP 2012) (pp. 7–11). IEEE.
http://doi.org/10.1109/AISP.2012.6313708
Hatamlou, A. (2013). Black hole: A new heuristic optimization approach for data clustering.
Information Sciences, 222, 175–184. http://doi.org/10.1016/j.ins.2012.08.023
He, H., & Tan, Y. (2012). A two-stage genetic algorithm for automatic clustering.
Neurocomputing, 81, 49–59. http://doi.org/10.1016/j.neucom.2011.11.001
Holland, J. H. (1975). Adaptation in natural and artificial systems: An introductory
analysis with applications to biology, control, and artificial intelligence. University of
Michigan Press.
Hong, T.-P., Chen, C.-H., & Lin, F.-S. (2015). Using group genetic algorithm to improve
performance of attribute clustering. Applied Soft Computing, 29, 371–378.
http://doi.org/10.1016/j.asoc.2015.01.001
Hong, X., Wang, J., & Qi, G. (2014). Comparison of spectral clustering, K-clustering and
hierarchical clustering on e-nose datasets: Application to the recognition of material
freshness, adulteration levels and pretreatment approaches for tomato juices.
Chemometrics and Intelligent Laboratory Systems, 133, 17–24.
http://doi.org/10.1016/j.chemolab.2014.01.017
Hong, Y., & Kwong, S. (2008). To combine steady-state genetic algorithm and ensemble
learning for data clustering. Pattern Recognition Letters, 29(9), 1416–1423.
http://doi.org/10.1016/j.patrec.2008.02.017
Hruschka, H., Fettes, W., & Probst, M. (2004). Market segmentation by maximum likelihood
clustering using choice elasticities. European Journal of Operational Research, 154(3),
779–786. http://doi.org/10.1016/S0377-2217(02)00807-X
Hsieh, M.-H., & Magee, C. L. (2008). An algorithm and metric for network decomposition
from similarity matrices: Application to positional analysis. Social Networks, 30(2), 146–
158. http://doi.org/10.1016/j.socnet.2007.11.002
Huang, A. (2008). Similarity measures for text document clustering. In Sixth New Zealand
Computer Science Research Student Conference.
Huang, C.-L., Huang, W.-C., Chang, H.-Y., Yeh, Y.-C., & Tsai, C.-Y. (2013). Hybridization
strategies for continuous ant colony optimization and particle swarm optimization applied
to data clustering. Applied Soft Computing, 13(9), 3864–3872.
http://doi.org/10.1016/j.asoc.2013.05.003
Huang, Z. (1997). Clustering large data sets with mixed numeric and categorical values. In
The First Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 21–34).
Hulse, J. D. Van, Khoshgoftaar, T. M., & Huang, H. (2007). The pairwise attribute noise
detection algorithm. Knowl Inf Syst, 11(2), 171–190. http://doi.org/10.1007/s10115-006-
0022-x
Iman, R. L., & Davenport, J. M. (1980). Approximations of the critical region of the Friedman
statistic. Communications in Statistics - Theory and Methods, 9(6), 571–595.
http://doi.org/10.1080/03610928008827904
İnkaya, T., Kayalıgil, S., & Özdemirel, N. E. (2015). Ant Colony Optimization based clustering
methodology. Applied Soft Computing, 28, 301–311.
http://doi.org/10.1016/j.asoc.2014.11.060
Islam, M. Z., & Brankovic, L. (2011). Privacy preserving data mining: A noise addition
framework using a novel clustering technique. Knowledge-Based Systems, 24(8), 1214–
1223. http://doi.org/10.1016/j.knosys.2011.05.011
Islam, M. Z., & Giggins, H. (2011). Knowledge Discovery through SysFor -a Systematically
Developed Forest of Multiple Decision Trees. 9th Australian Data Mining Conference,
195–204.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters,
31(8), 651–666. http://doi.org/10.1016/j.patrec.2009.09.011
Jarboui, B., Cheikh, M., Siarry, P., & Rebai, A. (2007). Combinatorial particle swarm
optimization (CPSO) for partitional clustering problem. Applied Mathematics and
Computation, 192(2), 337–345. http://doi.org/10.1016/j.amc.2007.03.010
Jasper, H. H. (1958). Report of the committee on methods of clinical examination in
electroencephalography: 1957. Electroencephalography and Clinical Neurophysiology.
http://doi.org/10.1016/0013-4694(58)90053-1
Ji, J., Pang, W., Zhou, C., Han, X., & Wang, Z. (2012). A fuzzy k-prototype clustering
algorithm for mixed numeric and categorical data. Knowledge-Based Systems, 30, 129–
135. http://doi.org/10.1016/j.knosys.2012.01.006
Jiang, B., Wang, N., & Wang, L. (2013). Particle swarm optimization with age-group topology
for multimodal functions and data clustering. Communications in Nonlinear Science and
Numerical Simulation, 18(11), 3134–3145. http://doi.org/10.1016/j.cnsns.2013.03.011
Kalyani, S., & Swarup, K. S. (2011). Particle swarm optimization based K-means clustering
approach for security assessment in power systems. Expert Systems with Applications,
38(9), 10839–10846. http://doi.org/10.1016/j.eswa.2011.02.086
Kannan, S. R., Ramathilagam, S., Sathya, A., & Pandiyarajan, R. (2010). Effective fuzzy c-
means based kernel function in segmenting medical images. Computers in Biology and
Medicine, 40(6), 572–579. http://doi.org/10.1016/j.compbiomed.2010.04.001
Karaboga, D., & Ozturk, C. (2011). A novel clustering approach: Artificial Bee Colony (ABC)
algorithm. Applied Soft Computing, 11(1), 652–657.
http://doi.org/10.1016/j.asoc.2009.12.025
Kashef, R., & Kamel, M. S. (2009). Enhanced bisecting k-means clustering using intermediate
cooperation. Pattern Recognition, 42(11), 2557–2569.
http://doi.org/10.1016/j.patcog.2009.03.011
Kaya, I. E., Pehlivanlı, A. Ç., Sekizkardeş, E. G., & Ibrikci, T. (2017). PCA based clustering
for brain tumor segmentation of T1w MRI images. Computer Methods and Programs in
Biomedicine, 140, 19–28. http://doi.org/10.1016/j.cmpb.2016.11.011
Kerr, G., Ruskin, H. J., Crane, M., & Doolan, P. (2008). Techniques for clustering gene
expression data. Computers in Biology and Medicine.
http://doi.org/10.1016/j.compbiomed.2007.11.001
Kolen, J. F., & Hutcheson, T. (2002). Reducing the Time Complexity of the Fuzzy C-Means
Algorithm. IEEE Transactions on Fuzzy Systems, 10(2), 263–267.
Korürek, M., & Nizam, A. (2008). A new arrhythmia clustering technique based on Ant Colony
Optimization. Journal of Biomedical Informatics, 41(6), 874–881.
http://doi.org/10.1016/j.jbi.2008.01.014
Kumar, J., Mills, R. T., Hoffman, F. M., & Hargrove, W. W. (2011). Parallel k-Means
Clustering for Quantitative Ecoregion Delineation Using Large Data Sets. Procedia
Computer Science, 4, 1602–1611. http://doi.org/10.1016/j.procs.2011.04.173
Kuo, R. J., Huang, Y. D., Lin, C.-C., Wu, Y.-H., & Zulvia, F. E. (2014). Automatic kernel
clustering with bee colony optimization algorithm. Information Sciences, 283, 107–122.
http://doi.org/10.1016/j.ins.2014.06.019
Kuo, R. J., Syu, Y. J., Chen, Z.-Y., & Tien, F. C. (2012). Integration of particle swarm
optimization and genetic algorithm for dynamic clustering. Information Sciences, 195,
124–140. http://doi.org/10.1016/j.ins.2012.01.021
Lai, C.-C. (2005). A novel clustering approach using hierarchical genetic algorithms.
Intelligent Automation and Soft Computing, 11(3), 143–153.
Laszlo, M., & Mukherjee, S. (2006). A genetic algorithm using hyper-quadtrees for low-
dimensional k-means clustering. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 28(4), 533–543. http://doi.org/10.1109/TPAMI.2006.66
Laszlo, M., & Mukherjee, S. (2007). A genetic algorithm that exchanges neighboring centers
for k-means clustering. Pattern Recognition Letters (Vol. 28).
http://doi.org/10.1016/j.patrec.2007.08.006
Lee, M., & Pedrycz, W. (2009). The fuzzy C-means algorithm with fuzzy P-mode prototypes
for clustering objects having mixed features. Fuzzy Sets and Systems, 160(24), 3590–
3600. http://doi.org/10.1016/j.fss.2009.06.015
Lei, X., Tian, J., Ge, L., & Zhang, A. (2013). The clustering model and algorithm of PPI
network based on propagating mechanism of artificial bee colony. Information Sciences,
247, 21–39. http://doi.org/10.1016/j.ins.2013.05.027
Lei, X., Wang, F., Wu, F. X., Zhang, A., & Pedrycz, W. (2016). Protein complex identification
through Markov clustering with firefly algorithm on dynamic protein-protein interaction
networks. Information Sciences, 329, 303–316. http://doi.org/10.1016/j.ins.2015.09.028
Levine, S. S., & Kurzban, R. (2006). Explaining clustering in social networks: towards an
evolutionary theory of cascading benefits. Managerial and Decision Economics, 27(2–3),
173–187. http://doi.org/10.1002/mde.1291
Li, B. N., Chui, C. K., Chang, S., & Ong, S. H. (2011). Integrating spatial fuzzy clustering with
level set methods for automated medical image segmentation. Computers in Biology and
Medicine, 41(1), 1–10. http://doi.org/10.1016/j.compbiomed.2010.10.007
Li, C.-T., & Chiao, R. (2003). Multiresolution genetic clustering algorithm for texture
segmentation. Image and Vision Computing, 21(11), 955–966.
http://doi.org/10.1016/S0262-8856(03)00120-3
Liao, L., Lin, T., & Li, B. (2008). MRI brain image segmentation and bias field correction
based on fast spatially constrained kernel clustering approach. Pattern Recognition
Letters (Vol. 29). http://doi.org/10.1016/j.patrec.2008.03.012
Lin, H.-J., Yang, F.-W., & Kao, Y.-T. (2005). An Efficient GA-based Clustering Technique.
Tamkang Journal of Science and Engineering, 8(2), 113–122.
Tseng, L. Y., & Yang, S. B. (1997). Genetic algorithms for clustering, feature
selection and classification. In Proceedings of International Conference on Neural
Networks (ICNN’97) (Vol. 3, pp. 1612–1616). IEEE.
http://doi.org/10.1109/ICNN.1997.614135
Liu, B. (2011). Supervised Learning. In Web Data Mining: Exploring Hyperlinks, Contents, and
Usage Data (2nd ed.). Springer-Verlag Berlin Heidelberg.
Liu, Y., Wu, X., & Shen, Y. (2011). Automatic clustering using genetic algorithms. Applied
Mathematics and Computation, 218(4), 1267–1279.
http://doi.org/10.1016/j.amc.2011.06.007
Liu, Y. Y., & Wang, S. (2015). A scalable parallel genetic algorithm for the Generalized
Assignment Problem. Parallel Computing, 46, 98–119.
http://doi.org/10.1016/j.parco.2014.04.008
Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information
Theory, 28(2), 129–137.
Lozano, J. A., & Larrañaga, P. (1999). Applying genetic algorithms to search for the best
hierarchical clustering of a dataset. Pattern Recognition Letters, 20(9), 911–918.
http://doi.org/10.1016/S0167-8655(99)00057-4
Lichman, M. (2013). UCI Machine Learning Repository. Retrieved June 22, 2013, from
http://archive.ics.uci.edu/ml/
Ma, Y., Cheng, G., Liu, Z., & Xie, F. (2017). Fuzzy nodes recognition based on spectral
clustering in complex networks. Physica A: Statistical Mechanics and Its Applications,
465, 792–797. http://doi.org/10.1016/j.physa.2016.08.022
Maimon, O., & Rokach, L. (2010). Data mining and knowledge discovery handbook. New
York: Springer.
Maio, D., Maltoni, D., & Rizzi, S. (1995). Topological clustering of maps using a genetic
algorithm. Pattern Recognition Letters, 16(1), 89–96. http://doi.org/10.1016/0167-
8655(94)00069-F
Mann, C. F., Matula, D. W., & Olinick, E. V. (2008). The use of sparsest cuts to reveal the
hierarchical community structure of social networks. Social Networks, 30(3), 223–234.
http://doi.org/10.1016/j.socnet.2008.03.004
Maraziotis, I. A. (2012). A semi-supervised fuzzy clustering algorithm applied to gene
expression data. Pattern Recognition, 45(1), 637–648.
http://doi.org/10.1016/j.patcog.2011.05.007
Masulli, F., & Schenone, A. (1999). A fuzzy clustering based segmentation system as support
to diagnosis in medical imaging. Artificial Intelligence in Medicine, 16(2), 129–147.
http://doi.org/10.1016/S0933-3657(98)00069-4
Mathew, J., & Vijayakumar, R. (2014). Scalable parallel clustering approach for large data
using parallel K means and firefly algorithms. In 2014 International Conference on High
Performance Computing and Applications (ICHPCA) (pp. 1–8). IEEE.
http://doi.org/10.1109/ICHPCA.2014.7045322
Matthias, B., & Juri, S. (2009). Spectral Clustering.
Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique.
Pattern Recognition, 33(9), 1455–1465. http://doi.org/10.1016/S0031-3203(99)00137-5
Maulik, U., & Mukhopadhyay, A. (2010). Simulated annealing based automatic fuzzy
clustering combined with ANN classification for analyzing microarray data. Computers
& Operations Research, 37(8), 1369–1380. http://doi.org/10.1016/j.cor.2009.02.025
Rahman, M. A. (2014). Automatic Selection of High Quality Initial Seeds for Generating
High Quality Clusters without requiring any User Inputs. Charles Sturt University.
Menon, N., & Ramakrishnan, R. (2015). Brain Tumor Segmentation in MRI Images Using
Unsupervised Artificial Bee Colony Algorithm and FCM Clustering. In International
Conference on Communications and Signal Processing, ICCSP 2015 (pp. 0006–0009).
Merz, B., Nguyen, V. D., & Vorogushyn, S. (2016). Temporal clustering of floods in Germany:
Do flood-rich and flood-poor periods exist? Journal of Hydrology, 541, 824–838.
http://doi.org/10.1016/j.jhydrol.2016.07.041
Miller, G. E., & Cole, S. W. (2012). Clustering of Depression and Inflammation in Adolescents
Previously Exposed to Childhood Adversity. Biological Psychiatry, 72(1), 34–40.
http://doi.org/10.1016/j.biopsych.2012.02.034
Mo, J., Kiang, M. Y., Zou, P., & Li, Y. (2010). A two-stage clustering approach for multi-
region segmentation. Expert Systems with Applications, 37(10), 7120–7131.
http://doi.org/10.1016/j.eswa.2010.03.003
Mohd, W. M. B. W., Beg, A. H., Herawan, T., & Rabbi, K. F. (2012). An Improved Parameter
less Data Clustering Technique based on Maximum Distance of Data and Lloyd k-means
Algorithm. Procedia Technology, 1, 367–371.
http://doi.org/10.1016/j.protcy.2012.02.076
Montani, S., & Leonardi, G. (2014). Retrieval and clustering for supporting business process
adjustment and analysis. Information Systems, 40, 128–141.
http://doi.org/10.1016/j.is.2012.11.006
Moore, M. (2004). An accurate parallel genetic algorithm to schedule tasks on a cluster.
Parallel Computing, 30(5), 567–583. http://doi.org/10.1016/j.parco.2003.12.005
Mukhopadhyay, A., & Maulik, U. (2009). Towards improving fuzzy clustering using support
vector machine: Application to gene expression data. Pattern Recognition, 42(11), 2744–
2763. http://doi.org/10.1016/j.patcog.2009.04.018
Mungle, S., Benyoucef, L., Son, Y. J., & Tiwari, M. K. (2013). A fuzzy clustering-based
genetic algorithm approach for time-cost-quality trade-off problems: A case study of
highway construction project. Engineering Applications of Artificial Intelligence, 26(8),
1953–1966. http://doi.org/10.1016/j.engappai.2013.05.006
Mur, A., Dormido, R., Duro, N., Dormido-Canto, S., & Vega, J. (2016). Determination of the
optimal number of clusters using a spectral clustering optimization. Expert Systems with
Applications, 65, 304–314. http://doi.org/10.1016/j.eswa.2016.08.059
Murthy, C. A., & Chowdhury, N. (1996). In search of optimal clusters using genetic algorithms.
Pattern Recognition Letters, 17(8), 825–832. http://doi.org/10.1016/0167-
8655(96)00043-8
Nanda, S. R., Mahanty, B., & Tiwari, M. K. (2010). Clustering Indian stock market data for
portfolio management. Expert Systems with Applications, 37(12), 8793–8798.
http://doi.org/10.1016/j.eswa.2010.06.026
Narayan, P. K., Narayan, S., & Popp, S. (2011). Investigating price clustering in the oil futures
market. Applied Energy (Vol. 88). http://doi.org/10.1016/j.apenergy.2010.07.034
Narayan, P. K., Narayan, S., Popp, S., & D’Rosario, M. (2011). Share price clustering in
Mexico. International Review of Financial Analysis (Vol. 20).
http://doi.org/10.1016/j.irfa.2011.02.003
Nascimento, M. C. V., & de Carvalho, A. C. P. L. F. (2011). Spectral methods for graph
clustering – A survey. European Journal of Operational Research, 211(2), 221–231.
http://doi.org/10.1016/j.ejor.2010.08.012
Neto, J. C., Meyer, G. E., & Jones, D. D. (2006). Individual leaf extractions from young canopy
images using Gustafson-Kessel clustering and a genetic algorithm. Computers and
Electronics in Agriculture, 51(1–2), 66–85. http://doi.org/10.1016/j.compag.2005.11.002
Omran, M. G. H., Engelbrecht, A. P., & Salman, A. (2007). An overview of clustering methods.
Intelligent Data Analysis, 11(6), 583–605.
Oostenveld, R., & Praamstra, P. (2001). The five percent electrode system for high-resolution
EEG and ERP measurements. Clinical Neurophysiology, 112(4), 713–719.
http://doi.org/10.1016/S1388-2457(00)00527-7
Opsahl, T., & Panzarasa, P. (2009). Clustering in weighted networks. Social Networks, 31(2),
155–163. http://doi.org/10.1016/j.socnet.2009.02.002
Ozturk, C., Hancer, E., & Karaboga, D. (2015). Dynamic clustering with improved binary
artificial bee colony algorithm. Applied Soft Computing Journal, 28, 69–80.
http://doi.org/10.1016/j.asoc.2014.11.040
Pakhira, M. K., Bandyopadhyay, S., & Maulik, U. (2005). A study of some fuzzy cluster
validity indices, genetic clustering and application to pixel classification. Fuzzy Sets and
Systems, 155(2), 191–214. http://doi.org/10.1016/j.fss.2005.04.009
Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining (1st ed.).
Pearson Addison Wesley.
Parente, J., Pereira, M. G., & Tonini, M. (2016). Space-time clustering analysis of wildfires:
The influence of dataset characteristics, fire prevention policy decisions, weather and
climate. Science of The Total Environment, 559, 151–165.
http://doi.org/10.1016/j.scitotenv.2016.03.129
Paterlini, S., & Krink, T. (2006). Differential evolution and particle swarm optimisation in
partitional clustering. Computational Statistics & Data Analysis, 50(5), 1220–1247.
http://doi.org/10.1016/j.csda.2004.12.004
Peng, P., Addam, O., Elzohbi, M., Özyer, S. T., Elhajj, A., Gao, S., … Alhajj, R. (2014).
Reporting and analyzing alternative clustering solutions by employing multi-objective
genetic algorithm and conducting experiments on cancer data. Knowledge-Based Systems,
56, 108–122. http://doi.org/10.1016/j.knosys.2013.11.003
Pirim, H., Ekşioğlu, B., Perkins, A. D., & Yüceer, Ç. (2012). Clustering of high throughput
gene expression data. Computers & Operations Research, 39(12), 3046–3061.
http://doi.org/10.1016/j.cor.2012.03.008
Pourvaziri, H., & Naderi, B. (2014). A hybrid multi-population genetic algorithm for the
dynamic facility layout problem. Applied Soft Computing, 24, 457–469.
http://doi.org/10.1016/j.asoc.2014.06.051
Pyle, D. (1999). Data preparation for data mining. San Francisco: Morgan Kaufmann
Publishers.
Qiao, S., Li, T., Li, H., Peng, J., & Chen, H. (2012). A new blockmodeling based hierarchical
clustering algorithm for web social networks. Engineering Applications of Artificial
Intelligence, 25(3), 640–647. http://doi.org/10.1016/j.engappai.2012.01.003
Qing, L., Gang, W., Zaiyue, Y., & Qiuping, W. (2008). Crowding clustering genetic algorithm
for multimodal function optimization. Applied Soft Computing, 8(1), 88–95.
http://doi.org/10.1016/j.asoc.2006.10.014
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, USA: Morgan
Kaufmann Publishers.
Quinlan, J. R. (1996). Improved Use of Continuous Attributes in C4.5. Journal of Artificial
Intelligence Research, 4, 77–90.
Rafailidis, D., Constantinou, E., & Manolopoulos, Y. (2017). Landmark selection for spectral
clustering based on Weighted PageRank. Future Generation Computer Systems, 68, 465–
472. http://doi.org/10.1016/j.future.2016.03.006
Rahman, M. A., & Islam, M. Z. (2014). A hybrid clustering technique combining a novel
genetic algorithm with K-Means. Knowledge-Based Systems, 71, 345–365.
http://doi.org/10.1016/j.knosys.2014.08.011
Ramos, G. N., Hatakeyama, Y., Dong, F., & Hirota, K. (2009). Hyperbox clustering with Ant
Colony Optimization (HACO) method and its application to medical risk profile
recognition. Applied Soft Computing, 9(2), 632–640.
http://doi.org/10.1016/j.asoc.2008.09.004
Rivera-Baltanas, T., Olivares, J. M., Martinez-Villamarin, J. R., Fenton, E. Y., Kalynchuk,
L. E., & Caruncho, H. J. (2014). Serotonin 2A receptor clustering in peripheral lymphocytes
is altered in major depression and may be a biomarker of therapeutic efficacy. Journal of
Affective Disorders, 163, 47–55. http://doi.org/10.1016/j.jad.2014.03.011
Roiger, R. J., & Geatz, M. (2003). Data mining : A tutorial-based primer. Addison Wesley.
Roy, A., & Parui, S. K. (2014). Pair-copula based mixture models and their application in
clustering. Pattern Recognition, 47(4), 1689–1697.
http://doi.org/10.1016/j.patcog.2013.10.004
Saha, S., Alok, A. K., & Ekbal, A. (2016). Brain image segmentation using semi-supervised
clustering. Expert Systems with Applications, 52, 50–63.
http://doi.org/10.1016/j.eswa.2016.01.005
Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1(1), 27–64.
http://doi.org/10.1016/j.cosrev.2007.05.001
Scheunders, P. (1997). A genetic c-Means clustering algorithm applied to color image
quantization. Pattern Recognition, 30(6), 859–866. http://doi.org/10.1016/S0031-
3203(96)00131-8
Schulz, J. (2008). Minkowski distance. Retrieved May 15, 2013, from
http://www.code10.info/index.php?option=com_content&view=article&id=61:articlemi
nkowski-distance&catid=38:cat_coding_algorithms_data-similarity&Itemid=57
Senthilnath, J., Omkar, S. N., & Mani, V. (2011). Clustering using firefly algorithm:
Performance study. Swarm and Evolutionary Computation, 1(3), 164–171.
http://doi.org/10.1016/j.swevo.2011.06.003
Shang, R., Zhang, Z., Jiao, L., Wang, W., & Yang, S. (2016). Global discriminative-based
nonnegative spectral clustering. Pattern Recognition, 55, 172–182.
http://doi.org/10.1016/j.patcog.2016.01.035
Sharbrough, F., Chatrian, G.-E., Lesser, R. P., Lüders, H., Nuwer, M., & Picton, T. W. (1991).
American Electroencephalographic Society guidelines for standard electrode position
nomenclature. Journal of Clinical Neurophysiology, 8(2), 200–202. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/2050819
Sheikh, R. H., Raghuwanshi, M. M., & Jaiswal, A. N. (2008). Genetic Algorithm Based
Clustering: A Survey. In 2008 First International Conference on Emerging Trends in
Engineering and Technology (pp. 314–319). IEEE.
http://doi.org/10.1109/ICETET.2008.48
Shelokar, P. S., Jayaraman, V. K., & Kulkarni, B. D. (2004). An ant colony approach for clustering.
Analytica Chimica Acta, 509(2), 187–195. http://doi.org/10.1016/j.aca.2003.12.032
Sheng, W., Howells, G., Fairhurst, M., & Deravi, F. (2008). Template-Free Biometric-Key
Generation by Means of Fuzzy Genetic Clustering. IEEE Transactions on Information
Forensics and Security, 3(2), 183–191.
Sisodia, D., Singh, L., Sisodia, S., & Saxena, K. (2012). Clustering Techniques: A Brief Survey
of Different Clustering Algorithms. International Journal of Latest Trends in Engineering
and Technology (IJLTET), 1(3), 82–87.
Son, L. H., & Tuan, T. M. (2017). Dental segmentation from X-ray images using semi-
supervised fuzzy clustering with spatial constraints. Engineering Applications of Artificial
Intelligence, 59, 186–195. http://doi.org/10.1016/j.engappai.2017.01.003
Song, W., Li, C. H., & Park, S. C. (2009). Genetic algorithm for text clustering using ontology
and evaluating the validity of various semantic similarity measures. Expert Systems with
Applications, 36(5), 9095–9104. http://doi.org/10.1016/j.eswa.2008.12.046
Sonğur, C., & Top, M. (2016). Regional clustering of medical imaging technologies.
Computers in Human Behavior, 61, 333–343. http://doi.org/10.1016/j.chb.2016.03.056
Srikanth, R., George, R., Warsi, N., Prabhu, D., Petry, F. E., & Buckles, B. P. (1995). A
variable-length genetic algorithm for clustering and classification. Pattern Recognition
Letters, 16(8), 789–800. http://doi.org/10.1016/0167-8655(95)00043-G
Srinivas, M., & Patnaik, L. M. (1994). Adaptive probabilities of crossover and mutation in
genetic algorithms. IEEE Transactions on Systems, Man and Cybernetics, 24(4), 656–667.
Stockman, G., & Shapiro, L. G. (2001). Computer Vision. New Jersey: Prentice-Hall.
Straßburg, J., Gonzàlez-Martel, C., & Alexandrov, V. (2012). Parallel genetic algorithms for
stock market trading rules. Procedia Computer Science, 9, 1306–1313.
http://doi.org/10.1016/j.procs.2012.04.143
Sumathi, S., & Sivanandam, S. N. (2006). Introduction to data mining and its applications.
Springer.
Sun, J., Chen, W., Fang, W., Wu, X., & Xu, W. (2012). Gene expression data analysis with
the clustering method based on an improved quantum-behaved Particle Swarm
Optimization. Engineering Applications of Artificial Intelligence, 25(2), 376–391.
http://doi.org/10.1016/j.engappai.2011.09.017
Suzuki, T., Shiga, T., Kuwahara, K., Kobayashi, S., Suzuki, S., Nishimura, K., … Hagiwara,
N. (2014). Impact of clustered depression and anxiety on mortality and rehospitalization
in patients with heart failure. Journal of Cardiology, 64(6), 456–462.
http://doi.org/10.1016/j.jjcc.2014.02.031
Szeto, L. K., Liew, A. W.-C., Yan, H., & Tang, S. (2003). Gene expression data clustering and
visualization based on a binary hierarchical clustering framework. Journal of Visual
Languages & Computing, 14(4), 341–362. http://doi.org/10.1016/S1045-
926X(03)00033-8
Teknomo, K. (2015a). Jaccard’s Coefficient. Retrieved May 15, 2013, from
http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
Teknomo, K. (2015b). Minkowski Distance. Retrieved January 2, 2017, from
http://people.revoledu.com/kardi/tutorial/Similarity/MinkowskiDistance.html
Traud, A. L., Mucha, P. J., & Porter, M. A. (2012). Social structure of Facebook networks.
Physica A: Statistical Mechanics and Its Applications, 391(16), 4165–4180.
http://doi.org/10.1016/j.physa.2011.12.021
Triola, M. F. (2001). Elementary Statistics (8th ed.). Boston San Francisco New York: Addison
Wesley Longman, Inc.
Tsai, C.-Y., & Kao, I.-W. (2011). Particle swarm optimization with selective particle
regeneration for data clustering. Expert Systems with Applications, 38(6), 6565–6576.
http://doi.org/10.1016/j.eswa.2010.11.082
Tseng, L. Y., & Yang, S. B. (2001). A genetic approach to the automatic clustering problem.
Pattern Recognition, 34(2), 415–424.
Turgut, D., Das, S. K., Elmasri, R., & Turgut, B. (2002). Optimizing clustering algorithm in
mobile ad hoc networks using genetic algorithmic approach. In Global
Telecommunications Conference, 2002. GLOBECOM ’02. IEEE (Vol. 1, pp. 62–66).
IEEE. http://doi.org/10.1109/GLOCOM.2002.1188042
Tzes, A., Peng, P.-Y., & Guthy, J. (1998). Genetic-based fuzzy clustering for DC-motor
friction identification and compensation. IEEE Transactions on Control Systems
Technology, 6(4), 462–472. http://doi.org/10.1109/87.701338
Van Lancker, A., Beeckman, D., Verhaeghe, S., Van Den Noortgate, N., & Van Hecke, A.
(2016). Symptom clustering in hospitalised older palliative cancer patients: A cross-
sectional study. International Journal of Nursing Studies, 61, 72–81.
http://doi.org/10.1016/j.ijnurstu.2016.05.010
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4),
395–416. http://doi.org/10.1007/s11222-007-9033-z
Ma, E. W. M., & Chow, T. W. S. (2004). A new shifting grid clustering algorithm. Pattern
Recognition, 37(3), 503–514. http://doi.org/10.1016/j.patcog.2003.08.014
Wan, M., Wang, C., Li, L., & Yang, Y. (2012). Chaotic ant swarm approach for data clustering.
Applied Soft Computing, 12(8), 2387–2393. http://doi.org/10.1016/j.asoc.2012.03.037
Wang, C.-H. (2009). Outlier identification and market segmentation using kernel-based
clustering techniques. Expert Systems with Applications, 36(2), 3744–3750.
http://doi.org/10.1016/j.eswa.2008.02.037
Wang, C., Cao, L., Li, J., Wei, W., Ou, Y., & Wang, M. (2011). Coupled Nominal Similarity
in Unsupervised Learning.
Wang, W., Yang, J., & Muntz, R. (1997). STING : A Statistical Information Grid Approach to
Spatial Data Mining. In The 23rd VLDB Conference Athens (pp. 186–195). Greece.
Wikaisuksakul, S. (2014). A multi-objective genetic algorithm with fuzzy c-means for
automatic data clustering. Applied Soft Computing, 24, 679–691.
http://doi.org/10.1016/j.asoc.2014.08.036
Xiao, J., Yan, Y., Zhang, J., & Tang, Y. (2010). A quantum-inspired genetic algorithm for k-
means clustering. Expert Systems with Applications, 37(7), 4966–4973.
http://doi.org/10.1016/j.eswa.2009.12.017
Xie, X. L., & Beni, G. (1991). A validity measure for fuzzy clustering. IEEE Transactions on
Pattern Analysis and Machine Intelligence, PAMI, 13(8), 841–847.
Xu, R., Damelin, S., Nadler, B., & Wunsch, D. C. (2010). Clustering of high-dimensional gene
expression data with feature filtering methods and diffusion maps. Artificial Intelligence
in Medicine, 48(2), 91–98. http://doi.org/10.1016/j.artmed.2009.06.001
Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural
Networks, 16(3), 645–678.
Yan, X., Zhu, Y., Zou, W., & Wang, L. (2012). A new approach for data clustering using hybrid
artificial bee colony algorithm. Neurocomputing, 97, 241–250.
http://doi.org/10.1016/j.neucom.2012.04.025
Yang, F., Sun, T., & Zhang, C. (2009). An efficient hybrid data clustering method based on K-
harmonic means and Particle Swarm Optimization. Expert Systems with Applications,
36(6), 9847–9852. http://doi.org/10.1016/j.eswa.2009.02.003
Yang, J. Y., & Ersoy, O. K. (2003). Combined Supervised and Unsupervised Learning in
Genomic Data Mining. Electrical and Computer Engineering, Purdue University.
Yang, Y., Wang, Y., & Xue, X. (2016). A novel spectral clustering method with superpixels
for image segmentation. Optik - International Journal for Light and Electron Optics,
127(1), 161–167. http://doi.org/10.1016/j.ijleo.2015.10.053
Yücenur, G. N., & Demirel, N. Ç. (2011). A new geometric shape-based genetic clustering
algorithm for the multi-depot vehicle routing problem. Expert Systems with Applications,
38(9), 11859–11865. http://doi.org/10.1016/j.eswa.2011.03.077
Zeng, Y., & Garcia-Frias, J. (2006). A novel HMM-based clustering algorithm for the analysis
of gene expression time-course data. Computational Statistics & Data Analysis, 50(9),
2472–2494. http://doi.org/10.1016/j.csda.2005.07.007
Zhang, C., Ouyang, D., & Ning, J. (2010). An artificial bee colony approach for clustering.
Expert Systems with Applications, 37(7), 4761–4767.
http://doi.org/10.1016/j.eswa.2009.11.003
Zhang, L., & Cao, Q. (2011). A novel ant-based clustering algorithm using the kernel method.
Information Sciences, 181(20), 4658–4672. http://doi.org/10.1016/j.ins.2010.11.005
Zhang, L., Cao, Q., & Lee, J. (2013). A novel ant-based clustering algorithm using Renyi
entropy. Applied Soft Computing, 13(5), 2643–2657.
http://doi.org/10.1016/j.asoc.2012.11.022
Zhao, F., Fan, J., & Liu, H. (2014). Optimal-selection-based suppressed fuzzy c-means
clustering algorithm with self-tuning non local spatial information for image
segmentation. Expert Systems with Applications, 41(9), 4083–4093.
http://doi.org/10.1016/j.eswa.2014.01.003
Zhao, L., Yang, Y., & Zeng, Y. (2009). Eliciting compact T-S fuzzy models using subtractive
clustering and coevolutionary particle swarm optimization. Neurocomputing, 72(10–12),
2569–2575. http://doi.org/10.1016/j.neucom.2008.11.001
Zhao, P., & Zhang, C.-Q. (2011). A new clustering method and its application in social
networks. Pattern Recognition Letters, 32(15), 2109–2118.
http://doi.org/10.1016/j.patrec.2011.06.008
Zhao, Z., Feng, S., Wang, Q., Huang, J. Z., Williams, G. J., & Fan, J. (2012). Topic oriented
community detection through social objects and link analysis in social networks.
Knowledge-Based Systems, 26, 164–173. http://doi.org/10.1016/j.knosys.2011.07.017
Zhong, C., Miao, D., & Wang, R. (2010). A graph-theoretical clustering method based on two
rounds of minimum spanning trees. Pattern Recognition, 43(3), 752–766.
http://doi.org/10.1016/j.patcog.2009.07.010