Genetic Algorithm based Clustering
Techniques and Tree based Validation in
Producing and Evaluating Sensible Clusters
Abul Hashem Beg
A thesis submitted in fulfilment of the requirements for the degree of
Doctor of Philosophy
School of Computing and Mathematics
Charles Sturt University
Panorama Avenue, Bathurst, NSW 2795, Australia
September 2017
Declaration of Authorship
I, Abul Hashem Beg, hereby declare that this submission titled “A Novel Genetic Algorithm
based Clustering and Tree based validation in Producing and Evaluating Sensible Clusters” is
my own work and that, to the best of my knowledge and belief, it contains no
material previously published or written by another person, nor material which to a substantial
extent has been accepted for the award of any other degree or diploma at Charles Sturt
University or any other educational institution, except where due acknowledgement is made in
the thesis. Any contribution made to the research by colleagues with whom I have worked at
Charles Sturt University or elsewhere during my candidature is fully acknowledged.
I agree that this thesis be accessible for the purpose of study and research in accordance with
normal conditions established by the Executive Director, Library Services, Charles Sturt
University or nominee, for the care, loan and reproduction of the thesis, subject to confidentiality
provisions as approved by the University.
Signature:
Date: 25/09/2017
Acknowledgement
First and foremost, I would like to thank the almighty Allah for blessing me with the strength,
knowledge, ability and opportunity to complete this research work. This achievement would not
have been possible without his blessings.
I would like to express my deepest gratitude to my principal supervisor Associate Professor
Dr Md Zahidul Islam for his continuous support, discussions, suggestions and valuable time
throughout my PhD. His constant inspiration, valuable guidance and directions made this work
possible. My sincere and cordial appreciation goes to him, because without his supervision I
could not be who I am today.
I am also thankful to my co-supervisors Professor Vladimir Estivill-Castro and Dr Peter
White for their kind support and suggestions during my study. I am also grateful to Charles Sturt
University for providing my scholarship, and to the Centre for Research in Complex Systems
(CRiCS) for providing a pleasant working environment.
My special thanks to my parents, my wife, my brothers and sister, brother-in-law and sisters-
in-law, my nieces and other relatives for their support and encouragement, especially to my elder
brother Associate Professor Dr Md Dalour Hossen Beg for his kind support during my PhD.
I am grateful to my friends Nasim Adnan, Dr Anisur Rahman, Dr Geaur Rahman, Dr Zavid
Parvez, Samuel Fletcher, Michael Siers, Fazley Rabbi, Khubeb Siddiqui, Pallab Podder, Jahid
Reza, Musfequs Salehin and Buyani for their moral support throughout my PhD. I also thank
all faculty and staff members of the School of Computing and Mathematics and
all postgraduate students for being very supportive and friendly to me during my study.
With love and gratitude this thesis is dedicated to
My Parents
Mily, My wife
My Brothers and Sister
My Brother-in-law and Sisters-in-law
My Father-in-law, My Mother-in-law, My Nieces
and All Relatives
for their support and inspiration.
Publications from the Thesis
[1] Beg, A. H., Islam, M. Z., and Estivill-Castro, V. (2016): Application of a Novel GA-based
Clustering and Tree based Validation on a Brain Data Set for Knowledge Discovery,
Information Systems, ELSEVIER. (Status: Under Review). (ERA 2010 Rank A*, SJR
2016 Rank Q1, H Index 64).
[2] Beg, A. H., Islam, M. Z., and Estivill-Castro, V. (2016): Genetic Algorithm with Healthy
Population and Multiple Streams Sharing Information for Clustering, Knowledge-Based
Systems, 114 (2016) 61-78, ELSEVIER. (ABDC 2016 Rank A, SJR 2016 Rank Q1, 5
Year Impact Factor: 3.433, H Index 63).
[3] Beg, A. H. and Islam, M. Z. (2016): A Novel Genetic Algorithm-Based Clustering
Technique and its Suitability for Knowledge Discovery from a Brain Data set, In Proc.
of the IEEE Congress on Evolutionary Computation (IEEE CEC 2016), Vancouver,
Canada, July 24-29, 2016, pp. 948-956. (ERA 2010 Rank A).
[4] Beg, A. H. and Islam, M. Z. (2016): Novel crossover and mutation operation in genetic
algorithm for Clustering, In Proc. of the IEEE Congress on Evolutionary Computation
(IEEE CEC 2016), Vancouver, Canada, July 24-29, 2016, pp. 2114-2121. (ERA 2010
Rank A).
[5] Beg, A. H. and Islam, M. Z. (2016): Branches of Evolutionary Algorithms and their
Effectiveness for Clustering Records, In Proc. of the 11th IEEE Conference on Industrial
Electronics and Applications (ICIEA 2016), Hefei, China, June 5-7, 2016, pp. 2484-
2489. (ERA 2010 Rank A).
[6] Beg, A. H. and Islam, M. Z. (2016): Advantages and Limitations of Genetic Algorithms
for Clustering Records, In Proc. of the 11th IEEE Conference on Industrial Electronics
and Applications (ICIEA 2016), Hefei, China, June 5-7, 2016, pp. 2478-2483. (ERA
2010 Rank A).
[7] Beg, A. H. and Islam, M. Z. (2016): Genetic Algorithm with Novel Crossover, Selection
and Health Check for Clustering, In Proc. of the 24th European Symposium on Artificial
Neural Networks, Computational Intelligence and Machine Learning (ESANN 2016),
Bruges, Belgium, April 27-29, 2016, pp. 575-580. (ERA 2010 Rank B).
[8] Beg, A. H. and Islam, M. Z. (2015): Clustering by Genetic Algorithm - High Quality
Chromosome Selection for Initial Population, In Proc. of the 10th IEEE Conference on
Industrial Electronics and Applications (ICIEA 2015), Auckland, New Zealand, June
15-17, 2015, pp. 129-134. (ERA 2010 Rank A).
ERA: Excellence in Research for Australia
SJR: SCImago Journal Rank
ABDC: Australian Business Deans Council
Abstract
Clustering is an important technique in the area of data mining, which aims to group similar
records in one cluster and dissimilar records in different clusters. Clustering is used in various
fields for knowledge discovery and for facilitating decision-making processes, and many clustering
approaches have been proposed. However, many of them have various limitations, including
the requirement for user input on the number of clusters, the tendency to get stuck at local
optima, and a high complexity of 𝑂(𝑛²). There is room for improvement in the cluster quality
produced by existing methods. We also observe that existing cluster evaluation methods often
produce inappropriate/biased evaluation values. A good cluster evaluation technique is therefore
critical.
In this study, we propose a number of clustering techniques that produce high-quality clusters
through the improvement of various genetic operations, with a low complexity of 𝑂(𝑛) and
no user input required on the number of clusters. We also demonstrate through a graphical
visualization that many existing clustering techniques often do not produce sensible clusters,
and such clusters may not be useful for knowledge discovery from the underlying data sets.
Sometimes they produce a huge number of clusters, and sometimes they derive only two
clusters, where one cluster contains a single record and the other contains all remaining records.
Hence, in this study we propose a clustering technique that produces sensible clustering
solutions. We graphically visualize the clustering results of our proposed technique on a brain
data set and demonstrate its usefulness for knowledge discovery. We also propose an
evaluation technique for clustering results, and we validate the effectiveness of the proposed
evaluation method by applying it to ground-truth clustering results.
Table of Contents
Declaration of Authorship........................................................................................................... i
Acknowledgement ..................................................................................................................... ii
Publications from the Thesis ..................................................................................................... iv
Abstract ..................................................................................................................................... vi
Table of Contents ..................................................................................................................... vii
Principal Notations.................................................................................................................. xvi
List of Figures .......................................................................................................................xviii
List of Tables ........................................................................................................................xxiii
Chapter 1 Introduction ....................................................................................................... 1
Chapter 2 Literature Review ............................................................................................ 10
2.1 Introduction ............................................................................................................... 10
2.2 Data Set with its Notations and Definition................................................................ 11
2.3 Data Mining............................................................................................................... 12
2.4 Machine Learning ..................................................................................................... 12
2.5 Clustering .................................................................................................................. 14
2.5.1 Applications of Clustering ................................................................................. 14
2.5.2 Categories of Clustering Techniques ................................................................. 16
2.5.2.1 Partition-based Clustering Techniques ......................................................... 17
2.5.2.2 Hierarchical Clustering Techniques .............................................................. 21
2.5.2.3 Density-based Clustering Techniques ........................................................... 22
2.5.2.4 Graph-based Clustering Techniques ............................................................. 23
2.5.2.5 Grid-based Clustering Techniques ................................................................ 23
2.5.2.6 Spectral Clustering Techniques .................................................................... 24
2.5.2.7 Model-based Clustering Techniques ............................................................. 25
2.5.2.8 Evolutionary Algorithm-based Clustering Techniques .................................... 26
Ant Colony Algorithm-based Clustering Techniques ........................................... 27
Bee Colony Algorithm-based Clustering Techniques ........................................... 30
Particle Swarm Optimization (PSO) Algorithm-based Clustering Techniques .... 31
Black hole Algorithm-based Clustering Techniques ............................................. 32
Firefly Algorithm-based Clustering Techniques ................................................... 33
Genetic Algorithm-based Clustering Techniques.................................................. 33
2.6 Distance Calculation ................................................................................................. 41
2.6.1 Distance Calculation for Numerical Attributes.................................................. 41
Minkowski Distance .............................................................................................. 42
Manhattan Distance ............................................................................................... 42
Euclidean Distance ................................................................................................ 42
Chebyshev Distance .............................................................................................. 43
Cosine Distance ..................................................................................................... 43
Jaccard distance ..................................................................................................... 43
2.6.2 Distance Calculation for Categorical Attributes ................................................ 44
2.7 Cluster Evaluation Techniques.................................................................................. 46
2.7.1 Internal Cluster Evaluation Techniques ............................................................. 47
Sum of Square Error (SSE) ................................................................................... 47
Davies-Bouldin (DB) Index ................................................................................... 47
Silhouette Coefficient ............................................................................................ 48
Xie-Beni Index ...................................................................................................... 49
Dunn Index ............................................................................................................ 50
2.7.2 External Cluster Evaluation Techniques ............................................................ 50
F-measure .............................................................................................................. 50
Purity ..................................................................................................................... 51
Entropy .................................................................................................................. 52
2.8 Summary ................................................................................................................... 52
Chapter 3 High-Quality Initial Population in a GA for High-Quality Clustering with
Low Complexity ..................................................................................................................... 56
3.1 Introduction ............................................................................................................... 56
3.2 DeRanClust: Deterministic and Random Selection for the Initial Population in a GA-
Based Clustering Technique................................................................................................. 59
Step 1: Normalization .................................................................................................... 59
Step 2: Population Initialization .................................................................................... 60
Step 3: Noise-based Selection Operation ...................................................................... 64
Step 4: Crossover Operation .......................................................................................... 64
Step 5: Twin Removal ................................................................................................... 65
Step 6: Mutation Operation ........................................................................................... 67
Step 7: Elitist Operation ................................................................................................ 67
3.3 Experimental Results and Discussion ....................................................................... 68
3.3.1 Data Sets ............................................................................................................ 68
3.3.2 Evaluation Criteria ............................................................................................. 68
3.3.3 Experimental Results on All Techniques ........................................................... 69
3.3.4 An Analysis of the Impact of Various Components of DeRanClust ................... 70
3.3.4.1 An Analysis of the Impact of the Population Initialization .................... 70
3.3.4.2 An Analysis of the Impact of the Crossover Operation.......................... 71
3.3.4.3 Cluster Quality Comparison between DeRanClust and Modified
AGCUK ................................................................................................................. 72
3.3.5 Complexity Analysis .......................................................................................... 73
3.4 Summary ................................................................................................................... 74
Chapter 4 Extensive Crossover and Mutation in a GA for High-Quality Clustering with
Low Complexity ..................................................................................................................... 76
4.1 Introduction ............................................................................................................... 76
4.2 The Motivation Behind the Proposed Technique ...................................................... 78
4.3 GMC: Genetic Algorithm with Novel Mutation and Crossover for Clustering ........ 79
Step 1: Normalization ............................................................................................... 79
Step 2: Population Initialization ................................................................................ 80
Step 3: Probabilistic Selection .................................................................................. 81
Step 4: Two Phases of Crossover Operation ............................................................. 81
Step 5: Twin Removal ............................................................................................... 84
Step 6: Three Steps of Mutation Operation .............................................................. 84
Step 7: Elitist Operation ............................................................................................ 85
4.4 Experimental Results and Discussion ....................................................................... 85
4.4.1 Data Sets ............................................................................................................ 85
4.4.2 The Parameters used in the Experiments............................................................ 85
4.4.3 The Experimental Setup ..................................................................................... 86
4.4.4 Experimental Results on All Techniques ........................................................... 87
4.4.5 An Analysis of the Impact of Various Properties of GMC ................................ 88
4.4.5.1 An Analysis of the Impact of the Crossover Operation.......................... 88
4.4.5.2 An Analysis of the Impact of the Mutation Operation ........................... 89
4.4.5.3 An Analysis of the Impact of the Probabilistic Selection Operation ...... 89
4.4.5.4 An Analysis of Improvement in Chromosomes over the Iterations ....... 90
4.4.6 Statistical Analysis ............................................................................................. 91
4.5 Summary ................................................................................................................... 93
Chapter 5 High-Quality Clustering through Novel Crossover, Selection and Health
Check with Low Complexity ................................................................................................. 96
5.1 Introduction ............................................................................................................... 96
5.2 The Motivation Behind the Proposed Technique ...................................................... 98
5.3 GCS: GA with Novel Crossover, Health Check and Selection for Clustering ........ 99
Step 1: Normalization ............................................................................................... 99
Step 2: Population Initialization .............................................................................. 101
Step 3: Two Phases of Selection Operation ............................................................ 101
Step 4: Crossover Operation ................................................................................... 101
Step 5: Twin Removal ............................................................................................. 103
Step 6: Mutation Operation ..................................................................................... 103
Step 7: Health Check Operation .............................................................................. 103
Step 8: Elitist Operation .......................................................................................... 103
5.4 Experimental Results and Discussion ..................................................................... 104
5.4.1 Data Sets .......................................................................................................... 105
5.4.2 Evaluation Criteria ........................................................................................... 105
5.4.3 Experimental Results on All Techniques ......................................................... 105
5.4.4 Comparative Results between GCS and GMC ................................................ 107
5.4.5 An Analysis of the Impact of Various Components of GCS............................. 108
5.4.5.1 An Analysis of the Impact of the Health Check Operation .................. 109
5.4.5.2 An Analysis of the Impact of the Crossover Operation........................ 109
5.4.6 An Analysis of the Improvement in Chromosomes over the Iterations........... 110
5.5 Summary ................................................................................................................. 111
Chapter 6 GA with Multiple Streams and Neighbor Information Sharing for Clustering
................................................................................................................................................ 114
6.1 Introduction ............................................................................................................. 114
6.2 HeMI: Healthy Population and Multiple Streams Sharing Information in a GA for
Clustering ........................................................................................................................... 118
6.2.1 Basic Concepts ................................................................................................. 118
6.2.2 Main Steps ....................................................................................................... 120
Component 1: Normalization .................................................................................. 121
Component 2: Multiple Streams ............................................................................. 123
Component 3: Population Initialization .................................................................. 123
Component 4: Noise-based Selection ..................................................................... 124
Component 5: Crossover Operation ........................................................................ 124
Component 6: Twin Removal ................................................................................. 125
Component 7: Three Steps Mutation Operation ..................................................... 125
Component 8: Health Improvement Operation ....................................................... 127
Component 9: Elitist Operation .............................................................................. 129
Component 10: Neighbor Information Sharing ...................................................... 130
Component 11: Global Best Selection .................................................................... 130
6.2.3 The HeMI Algorithm ....................................................................................... 131
6.3 Experimental Results and Discussion ..................................................................... 132
6.3.1 The Data sets and the Evaluation Criteria........................................................ 132
6.3.2 The Parameters used in the Experiments ......................................................... 134
6.3.3 The Experimental Setup ................................................................................... 134
6.3.4 Experimental Results on All Techniques ......................................................... 135
6.3.5 Comparative Results between HeMI and GCS ................................................ 137
6.3.6 Comparative Results among HeMI, GCS, GMC and DeRanClust ................. 138
6.3.7 An Analysis of the Impact of Various Properties of HeMI ............................. 140
6.3.7.1 An Analysis of the Impact of the Multiple Streams that Exchange
Information .............................................................................................................. 140
6.3.7.2 An Analysis of the Impact of the Population Initialization ........................ 143
6.3.7.3 An Analysis of the Impact of the Mutation Operation ................................ 144
6.3.7.4 An Analysis of the Impact of the Health Improvement .............................. 145
6.3.7.5 An Analysis of the Impact of the Interval ................................................... 145
6.3.7.6 An Analysis of the Impact of the number of Streams ................................. 146
6.3.7.7 An Analysis of the Improvement in Chromosomes over the Iterations ...... 147
6.3.8 Statistical Analysis ........................................................................................... 149
6.3.9 An Analysis on the use of K-means++ instead of K-means in HeMI ............. 151
6.3.10 Complexity Analysis ........................................................................................ 151
6.3.11 Comparison between HeMI and Multiple Runs of K-means........................... 152
6.4 Summary ................................................................................................................. 154
Chapter 7 A Novel GA-based Clustering Technique and its Suitability for Knowledge
Discovery from a Brain Data Set ........................................................................................ 156
7.1 Introduction ............................................................................................................. 156
7.2 The Motivation Behind the Proposed Technique .................................................... 159
7.3 CSClust: High-quality Chromosome Selection and Cleansing Operation in a GA for
Clustering ........................................................................................................................... 162
Step 1: Normalization ............................................................................................. 163
Step 2: Population Initialization .............................................................................. 163
Step 3: Sensible Properties Selection ...................................................................... 163
Step 4: Crossover Operation ................................................................................... 163
Step 5: Mutation Operation ..................................................................................... 164
Step 6: Twin Removal Operation ............................................................................ 165
Step 7: Cleansing Operation.................................................................................... 166
Step 8: Cloning Operation ....................................................................................... 166
Step 9: Elitist Operation .......................................................................................... 167
7.4 Experimental Results and Discussion ..................................................................... 167
7.4.1 The Data sets and the Cluster Evaluation Criteria ........................................... 167
7.4.2 The Parameters used in the Experiments........................................................... 168
7.4.3 The Experimental Setup ................................................................................... 168
7.4.4 Brain Data set (CHB-MIT Scalp EEG) Pre-processing ................................... 169
7.4.5 Experimental Results on Brain Data Set .......................................................... 170
7.4.6 Analysis of the Clustering Result obtained by CSClust on the Brain Data set 170
7.4.7 Knowledge from Decision Tree on Brain Data set .......................................... 175
7.4.8 Experimental Results on 10 Real Life Data sets .............................................. 177
7.4.9 An Analysis of the Improvement in Chromosomes over the Iterations........... 178
7.4.10 Statistical Analysis ........................................................................................... 178
7.5 Summary ................................................................................................................. 179
Chapter 8 Application of a Novel GA-based Clustering and Tree based Validation on
a Brain Data Set for Knowledge Discovery ....................................................................... 182
8.1 Introduction ............................................................................................................. 182
8.2 Our Technique ......................................................................................................... 187
8.2.1 Basic Concepts of Our Clustering Technique HeMI++ ................................... 187
8.2.2 Basic Concepts of Our Cluster Evaluation Technique Tree Index .................. 193
8.2.3 Main Components of HeMI++ ........................................................................ 194
Component 1: Normalization .................................................................................. 195
Component 2: Multiple Stream ............................................................................... 195
Component 3: Population Initialization .................................................................. 195
Component 4: Selection of Sensible Properties ...................................................... 195
Component 5: Noise-based Selection ..................................................................... 196
Component 6: Crossover Operation ........................................................................ 198
Component 7: Twin Removal ................................................................................. 198
Component 8: Three Steps Mutation Operation ..................................................... 198
Component 9: Health Improvement Operation ....................................................... 198
Component 10: Cleansing Operation ...................................................................... 198
Component 11: Cloning Operation ......................................................................... 199
Component 12: The Elitist Operation ..................................................................... 199
Component 13: Neighbor Information Sharing ...................................................... 199
Component 14: Global Best Selection .................................................................... 199
8.2.4 The HeMI++ Algorithm ................................................................................... 200
8.2.5 Our Cluster Evaluation Technique (Tree Index) ............................................. 201
8.3 Experimental Results and Discussion ..................................................................... 202
8.3.1 The Data Sets and the Evaluation Criteria ....................................................... 202
8.3.2 The Parameters used in the Experiments ......................................................... 204
8.3.3 The Experimental Setup ................................................................................... 205
8.3.4 Brain Data set Pre-processing .......................................................................... 205
8.3.5 Clustering Quality Comparison between HeMI++ and Other Techniques on the
MIT-Chb01_03 Data Set ................................................................................................ 205
8.3.6 Analysis of the Clustering Result Obtained by HeMI++ from the CHB-MIT
Scalp EEG (chb01-03) Data Set ..................................................................................... 210
8.3.7 Evaluation of HeMI++ and Tree Index on the LD data set ............................. 214
8.3.8 Experimental Results on All Techniques on 21 Real Life Data Sets .............. 217
8.3.9 An Analysis of the Clustering Quality of HeMI++ on Different Data Sets .... 219
8.3.9.1 Performance of HeMI++ compared to Existing Techniques, based on
Number of Records ................................................................................................. 221
8.3.9.2 Performance of HeMI++ compared to Existing Techniques, based on
Number of Attributes .............................................................................................. 222
8.3.9.3 Performance of HeMI++ compared to Existing Techniques, based on Type
of the Majority of Attributes ................................................................................... 223
8.3.10 Knowledge from the Brain Data ...................................................................... 224
8.3.11 Complexity Analysis ........................................................................................ 227
8.3.12 Statistical Friedman Test.................................................................................. 228
8.4 Summary ................................................................................................................. 230
Chapter 9 Discussion ....................................................................................................... 232
9.1 Introduction ............................................................................................................. 232
9.2 Comparison and Discussion of the Proposed Techniques ........................................ 233
9.2.1 DeRanClust ...................................................................................................... 233
9.2.2 GMC ................................................................................................................ 235
9.2.3 GCS .................................................................................................................. 236
9.2.4 HeMI ................................................................................................................ 237
9.2.5 CSClust ............................................................................................................ 240
9.2.6 HeMI++ ........................................................................................................... 241
9.3 Key Contributions of the Thesis.............................................................................. 244
9.4 Complexity Analysis of the Techniques ................................................................. 245
9.4.1 Notations for Complexity Analysis ................................................................. 245
9.4.2 Complexity of DeRanClust .............................................................................. 245
9.4.3 Complexity of GMC ........................................................................................ 248
9.4.4 Complexity of GCS.......................................................................................... 252
9.4.5 Complexity of HeMI ........................................................................................ 255
9.4.6 Complexity of CSClust .................................................................................... 260
9.4.7 Complexity of HeMI++ ................................................................................... 263
9.4.8 Complexity of AGCUK ................................................................................... 266
9.4.9 Complexity of GAGR ...................................................................................... 266
9.4.10 Complexity of GenClust .................................................................................. 268
9.4.11 Complexity of K-means ................................................................................... 270
9.5 Comparison of the Complexities of the Techniques ............................................... 270
9.6 Summary of the Proposed Techniques .................................................................... 271
9.7 Future Research Directions ..................................................................................... 274
Chapter 10 Conclusion .................................................................................................... 275
References .............................................................................................................................. 279
Principal Notations
This is a list of the principal notations used throughout the thesis.

$D$   A data set
$n$   The number of records of a data set
$R_i$   The $i$th record of a data set
$|A|$   Set of attributes
$m$   The number of attributes of a data set
$m_c$   The number of categorical attributes of a data set
$m_r$   The number of numerical attributes of a data set
$d$   The domain size of an attribute
$k$   The number of clusters
$C$   A set of clusters
$S$   A set of seeds
$P_j^i$   The $j$th chromosome in the $i$th iteration
$f_j^i$   The fitness of the $j$th chromosome in the $i$th iteration
$P_d$   Set of chromosomes generated in the deterministic phase
$P_s$   Set of chromosomes
$P_m$   Set of mutated chromosomes
$I$   User-defined number of iterations/generations
$P_r$   Set of random chromosomes
$F_s$   Fitness of a set of chromosomes
$P_o$   Set of offspring chromosomes
$O$   A pair of offspring chromosomes
$P_v$   Set of chromosomes after the division and absorption operations
$H_s$   Set of healthy chromosomes
$T_j$   Crossover probability of the $j$th chromosome
$f_{mean}$   Average fitness value of the chromosomes in the population
$P_{j,d}$   The $j$th chromosome after division
$P_{j,a}$   The $j$th chromosome after absorption
$M_j$   Mutation probability of the $j$th chromosome
$\Pi_{ij}$   Cosine similarity between the $i$th record $R_i$ and the seed $s_j$ of the $j$th cluster
$Ϛ_{ij}$   Cosine distance between the $i$th record $R_i$ and the seed $s_j$ of the $j$th cluster
$s_j$   The seed of the $j$th cluster $c_j$
$s_{j,a}$   The $a$th attribute value of the seed of the $j$th cluster
$Ӻ_{ij}$   Jaccard coefficient between the $i$th record $R_i$ and the seed $s_j$ of the $j$th cluster
$Ԓ_{ij}$   Jaccard distance between the $i$th record $R_i$ and the seed $s_j$ of the $j$th cluster
$SSE$   Sum of square error
$DB$   Davies-Bouldin Index
$XB$   Xie-Beni Index
$DI$   Dunn Index
$FM$   F-measure
$P_T$   Purity
$e_T$   Entropy
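For reference, the cosine and Jaccard notations above follow the conventional formulations; the sketch below assumes the standard definitions (the exact forms used in the thesis are reviewed in Chapter 2), with the Jaccard form treating $R_i$ and $s_j$ as sets of attribute values, as is conventional for categorical data:

$$\Pi_{ij} = \frac{R_i \cdot s_j}{\lVert R_i \rVert \, \lVert s_j \rVert}, \qquad Ϛ_{ij} = 1 - \Pi_{ij}, \qquad Ӻ_{ij} = \frac{|R_i \cap s_j|}{|R_i \cup s_j|}, \qquad Ԓ_{ij} = 1 - Ӻ_{ij}$$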
List of Figures
Fig. 1.1: The three dimensional CHB-MIT Scalp EEG (chb01-03) data set ............................. 4
Fig. 2.1: Basic steps of Genetic Algorithms (GA) ................................................................... 36
Fig. 3.1: The formation of a chromosome through K-means .................................................. 61
Fig. 3.2: Flowchart of the population initialization ................................................................. 62
Fig. 3.3: Single point crossover between a pair of chromosomes ........................................... 65
Fig. 3.4: Comparative results between DeRanClust and other techniques based on Silhouette
Coefficient................................................................................................................................ 69
Fig. 3.5: Comparative result between DeRanClust and other techniques based on DB Index 70
Fig. 4.1: Comparative results between GMC and other techniques based on Silhouette
Coefficient (higher the better) .................................................................................................. 87
Fig. 4.2: Comparative results between GMC and other techniques based on DB Index (lower
the better) ................................................................................................................................. 87
Fig. 4.3: Comparative results between GMC and GMC without Crossover based on Silhouette
Coefficient (higher the better) .................................................................................................. 88
Fig. 4.4: Comparative results between GMC and GMC without Crossover based on DB Index
(lower the better) ...................................................................................................................... 88
Fig. 4.5: Comparative results between GMC and GMC without Mutation based on Silhouette
Coefficient (higher the better) .................................................................................................. 89
Fig. 4.6: Comparative results between GMC and GMC without Mutation based on DB Index
(lower the better) ...................................................................................................................... 89
Fig. 4.7: Comparative results between GMC and GMC without Probabilistic Selection (PS)
based on Silhouette Coefficient (higher the better) ................................................................. 90
Fig. 4.8: Comparative results between GMC and GMC without Probabilistic Selection (PS)
based on DB Index (lower the better) ...................................................................................... 90
Fig. 4.9: Average fitness (best chromosome fitness) versus iterations over the 10 data sets .. 91
Fig. 4.10: Flow chart of sign test ............................................................................................. 92
Fig. 4.11: Sign test of GMC on 10 data sets ............................................................................ 93
Fig. 5.1: Silhouette Coefficient of the techniques on eight data sets ..................................... 105
Fig. 5.2: Silhouette Coefficient of the techniques on seven data sets .................................... 106
Fig. 5.3: DB Index of the techniques on eight data sets ........................................................ 106
Fig. 5.4: DB Index of the techniques on seven data sets ....................................................... 106
Fig. 5.5: Comparative results between GCS and GMC based on Silhouette Coefficient ...... 108
Fig. 5.6: Comparative results between GCS and GMC based on DB Index ......................... 108
Fig. 5.7: Average fitness (best chromosome) versus Iteration of 20 runs on PID data set .... 110
Fig. 5.8: Average fitness (all chromosomes) versus Iterations. Each line represents the average
fitness of 20 runs on PID data set .......................................................................................... 111
Fig. 6.1: Flowchart of HeMI algorithm ................................................................................. 129
Fig. 6.2: (a) Comparative results between HeMI and other techniques on ten data sets based on
Silhouette Coefficient. (b) Comparative results between HeMI and other techniques on ten data
sets based on Silhouette Coefficient ...................................................................................... 135
Fig. 6.3: (a) Comparative results between HeMI and other techniques on ten data sets based on
DB Index. (b) Comparative results between HeMI and other techniques on ten data sets based
on DB Index ........................................................................................................................... 136
Fig. 6.4: Comparative results between HeMI and HeMI with different Intervals ................. 146
Fig. 6.5: Comparative results between HeMI and HeMI with 8 Streams .............................. 147
Fig. 6.6: Average Fitness versus Iteration. Each line represents the average fitness of the best
chromosome of 5 consecutive runs of HeMI on a data set .................................................... 148
Fig. 6.7: Average Fitness (best chromosome) versus Iteration over the 20 data sets ........... 148
Fig. 6.8: Average Fitness (best chromosome) versus Iteration. Each line represents the average
fitness of 5 consecutive runs on PID data set ........................................................................ 149
Fig. 6.9: (a) Sign test of HeMI based on Silhouette Coefficient on ten data sets. (b) Sign test of
HeMI based on Silhouette Coefficient on ten data sets .......................................... 149
Fig. 6.10: (a) Sign test of HeMI based on DB Index on ten data sets. (b) Sign test of HeMI
based on DB Index on ten data sets ....................................................................................... 150
Fig. 6.11: Comparative result between HeMI and K-means ................................................. 153
Fig. 7.1: The three-dimensional CHB-MIT Scalp EEG (chb01-03) data set ......................... 159
Fig. 7.2: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set ... 160
Fig. 7.3: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set....... 160
Fig. 7.4: Clustering result of GenClust using DB Index on CHB-MIT Scalp EEG (chb01-03)
data set ................................................................................................................................... 161
Fig. 7.5: Clustering result of CSClust on CHB-MIT Scalp EEG (chb01-03) data set .......... 171
Fig. 7.6: Channel positions according to the International 10-20 system (Jasper, 1958;
Sharbrough F et al., 1991)...................................................................................................... 173
Fig. 7.7: Seizure records on different channels ...................................................................... 174
Fig. 7.8: EEG signals (10 seconds) of channel-5 during the non-seizure time ...................... 174
Fig. 7.9: EEG signals (10 seconds) of channel-5 during the seizure time ............................. 174
Fig. 7.10: EEG signals (10 seconds) of channel-7 during the seizure time ........................... 175
Fig. 7.11: EEG signals (10 seconds) of channel-9 during the seizure time ........................... 175
Fig. 7.12: EEG signals (10 seconds) of channel-13 during the seizure time ......................... 175
Fig. 7.13: Decision trees on CHB-MIT (chb01-03) data set.................................................. 176
Fig. 7.14: Comparative results between CSClust and other techniques based on Silhouette
Coefficient (higher the better) ................................................................................................ 177
Fig. 7.15: Comparative results between CSClust and other techniques based on DB Index
(lower the better) .................................................................................................................... 177
Fig. 7.16: Grand Average fitness versus iteration over the 10 data sets ................................ 178
Fig. 7.17: Sign test of CSClust on 11 data sets ...................................................................... 179
Fig. 8.1: The three dimensional CHB-MIT Scalp EEG (chb01-03) data set ......................... 188
Fig. 8.2: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set ... 188
Fig. 8.3: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set....... 190
Fig. 8.4: Clustering result of HeMI on the CHB-MIT Scalp EEG (chb01-03) data set ........ 191
Fig. 8.5: A sensible clustering result on the CHB-MIT Scalp EEG (chb01-03) data set ...... 193
Fig. 8.6: Clustering result of HeMI on the CHB-MIT Scalp EEG (chb01-03) data set ........ 206
Fig. 8.7 : Clustering result of AGCUK on the CHB-MIT Scalp EEG (chb01-03) data set ... 206
Fig. 8.8: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set....... 207
Fig. 8.9: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set ... 208
Fig. 8.10: Clustering result of K-means on the CHB-MIT Scalp EEG (chb01-03) data set . 209
Fig. 8.11: Clustering result of K-means++ on the CHB-MIT Scalp EEG (chb01-03) data set
................................................................................................................................................ 209
Fig. 8.12: Clustering result of our proposed technique, HeMI++ on the CHB-MIT Scalp EEG
(chb01-03) data set ................................................................................................................. 210
Fig. 8.13: Seizure records of different channels .................................................................... 212
Fig. 8.14: EEG signals (10 seconds) of Channel-5 during the non-seizure time ................... 213
Fig. 8.15: EEG signals (10 seconds) of Channel-5 during the seizure time .......................... 213
Fig. 8.16: EEG signals (10 seconds) of Channel-7 during the seizure time .......................... 213
Fig. 8.17: EEG signals (10 seconds) of Channel-9 during the seizure time .......................... 213
Fig. 8.18: EEG signals (10 seconds) of Channel-13 during the seizure time ........................ 213
Fig. 8.19: Channel positions according to the International 10-20 system (Jasper, 1958;
Sharbrough F et al., 1991)...................................................................................................... 214
Fig. 8.20: Clustering result of HeMI on the LD data set ....................................................... 215
Fig. 8.21: Clustering result of AGCUK on the LD data set ................................................... 215
Fig. 8.22: Clustering result of GAGR on the LD data set...................................................... 215
Fig. 8.23: Clustering result of GenClust on the LD data set .................................................. 215
Fig. 8.24: Clustering result of K-means on the LD data set .................................................. 215
Fig. 8.25: Clustering result of K-means++ on the LD data set .............................................. 215
Fig. 8.26: Clustering result of HeMI++ on the LD data set ................................................... 216
Fig. 8.27: The three dimensional LD data set ........................................................................ 216
Fig. 8.28: Scores of the techniques on 15 numerical data sets based on Tree Index ............. 219
Fig. 8.29: Decision trees on the CHB-MIT Scalp EEG (chb01-03) data set ......................... 226
List of Tables
Table 2.1: A synthetic data set ................................................................................................. 11
Table 2.2: List of ranked (based on citation reports) evolutionary algorithm-based clustering
techniques from 1995-2015 ..................................................................................................... 28
Table 2.3: List of ranked (based on citation reports) evolutionary algorithm-based clustering
techniques from 1995-2015 ..................................................................................................... 29
Table 2.4: List of ranked (based on citation reports) GA-based clustering techniques from 1995-
2015.......................................................................................................................................... 35
Table 2.5: Advantages and limitations of currently used clustering techniques ...................... 53
Table 3.1: A brief description of the data sets ......................................................................... 69
Table 3.2: Comparative results between AGCUK and Modified AGCUK ............................. 71
Table 3.3: Comparative result between DeRanClust and DeRanClust without Crossover ..... 72
Table 3.4: Comparative results between DeRanClust and Modified AGCUK ....................... 72
Table 4.1: Data sets at a glance ................................................................................................ 86
Table 5.1: Data sets at a glance .............................................................................................. 104
Table 5.2: Comparative result between GCS and GCS without Health Check ..................... 109
Table 5.3: Comparative result between GCS and GCS with Traditional Crossover ............ 110
Table 6.1: A brief description of the data sets ....................................................................... 133
Table 6.2: Comparative results between HeMI and GCS ...................................................... 138
Table 6.3: Comparative Results among HeMI, GCS, GMC and DeRanClust based on
Silhouette Coefficient ............................................................................................................ 139
Table 6.4: Comparative Results among HeMI, GCS, GMC and DeRanClust based on DB Index
................................................................................................................................................ 139
Table 6.5: Comparative results between AGCUK, AGCUK with 40 Population and AGCUK
with 80 Population ................................................................................................................. 140
Table 6.6: Comparative results between AGCUK with 80 Population and AGCUK with
Multiple Streams .................................................................................................................... 141
Table 6.7: Comparative results between AGCUK with Multiple Streams and AGCUK with
Neighbor Exchange ................................................................................................................ 141
Table 6.8: Comparative results between GenClust, GenClust with Multiple Streams and
GenClust with Neighbor Exchange ....................................................................................... 142
Table 6.9: Comparative results between HeMI, AGCUK with Neighbor Exchange and
GenClust with Neighbor Exchange ....................................................................................... 142
Table 6.10: Comparative results between AGCUK and AGCUK with HeMI Population .... 143
Table 6.11: Comparative results between AGCUK and AGCUK with HeMI Mutation ...... 144
Table 6.12: Comparative results between HeMI and HeMI without Mutation ..................... 144
Table 6.13: Comparative results between HeMI and HeMI without Health Improvement
Operation................................................................................................................................ 145
Table 6.14: Comparative results between HeMI and HeMI with K-means++ ...................... 151
Table 7.1: Data sets at a glance .............................................................................................. 168
Table 7.2: Clustering results of all techniques on the CHB-MIT Scalp EEG (chb01-03) data set
................................................................................................................................................ 170
Table 7.3: Channel-wise number of records in Cluster 2 of CSClust on CHB-MIT Scalp EEG
(chb01-03) data set ................................................................................................................. 172
Table 8.1: Some sensible and non-sensible clustering solutions and their evaluation values
based on the existing cluster evaluation metrics .................................................................... 191
Table 8.2: Cluster results of some sensible and non-sensible clustering solutions based on Tree
Index ...................................................................................................................................... 194
Table 8.3: A brief description of the data sets ....................................................................... 203
Table 8.4: Clustering results of HeMI++ and other techniques based on Tree Index ........... 210
Table 8.5: Channel-wise number of records in Cluster 2 of HeMI++ on the CHB-MIT Scalp
EEG (chb01-03) data set ........................................................................................................ 211
Table 8.6: Comparative results of all the techniques on the LD data set based on Tree Index
and other evaluation techniques ............................................................................................. 216
Table 8.7: Comparative results between HeMI++ and other techniques on 15 numerical data
sets based on Tree Index ........................................................................................................ 218
Table 8.8: Clustering results of HeMI++ and other techniques on 6 categorical data sets based
on Tree Index ......................................................................................................................... 219
Table 8.9: Performance of HeMI++ compared to existing techniques, based on number of
records .................................................................................................................................... 222
Table 8.10: Performance of HeMI++ compared to existing techniques, based on number of
attributes ................................................................................................................................. 223
Table 8.11: Performance of HeMI++ compared to existing techniques, based on type of the
majority of attributes .............................................................................................................. 224
Table 8.12: Silhouette Coefficient rank of the techniques based on Friedman Test (Demšar,
2006; Friedman, 1940) ........................................................................................................... 229
Table 9.1: The complexities of the techniques ...................................................................... 271
Table 9.2: Strengths and weaknesses of the proposed techniques ......................................... 272
Chapter 1
Introduction
Nowadays, with the advancement of scientific technology and the growth of information, huge amounts of data can be collected (Bello-Orgaz, Jung, & Camacho, 2016; Kuo, Syu, Chen, & Tien, 2012). It is difficult for a domain expert to infer knowledge manually from such an enormous amount of data. Data mining techniques are therefore required to acquire information from these huge amounts of data and to facilitate the decision-making process.
Clustering is an important and well-known technique in the area of data mining, which aims
to group similar records in one cluster and dissimilar records in other clusters (D.-X. Chang,
Zhang, & Zheng, 2009; D. Chang, Zhao, Zheng, & Zhang, 2012; Han & Kamber, 2006; Kuo et
al., 2012; Y. Liu, Wu, & Shen, 2011; Pang-Ning Tan, Michael Steinbach, 2005; Rahman &
Islam, 2014). Through clustering, hidden information can be extracted from a data set that can
subsequently help in various decision-making processes (Rahman & Islam, 2014).
Clustering has a wide range of applications including machine learning (Gan, 2013;
Mukhopadhyay & Maulik, 2009), image segmentation (Cai, Chen, & Zhang, 2007; B. N. Li,
Chui, Chang, & Ong, 2011; F. Zhao, Fan, & Liu, 2014), medical imaging and object detection
(Bai et al., 2013; Kannan, Ramathilagam, Sathya, & Pandiyarajan, 2010; Kaya, Pehlivanlı,
Sekizkardeş, & Ibrikci, 2017; B. N. Li et al., 2011; Liao, Lin, & Li, 2008; Masulli & Schenone,
1999; Saha, Alok, & Ekbal, 2016; Son & Tuan, 2017; Sonğur & Top, 2016; Stockman &
Shapiro, 2001), business (M.-Y. Chen, 2013; Montani & Leonardi, 2014) and social network
analysis (Girvan & Newman, 2002). It is therefore very important to produce good-quality
clusters from data sets.
Many approaches for clustering have been proposed (Arthur & Vassilvitskii, 2007; D.-X.
Chang et al., 2009; D. Chang et al., 2012; Y. Liu et al., 2011; Lloyd, 1982; Rahman & Islam,
2014). K-means is one of the most popular techniques for clustering. While K-means is popular
for its simplicity, it has a number of well-known drawbacks (D.-X. Chang et al., 2009; Jain,
2010; Mohd, Beg, Herawan, & Rabbi, 2012; Rahman & Islam, 2014). One of the main
disadvantages of K-means is that it requires a user defined number of clusters (𝑘) prior to
clustering. It is difficult for a user (data miner) to estimate the appropriate number of clusters in
advance. The appropriate number of clusters influences the quality of the final clustering
solution (Kuo et al., 2012).
Another drawback of the K-means clustering technique is that it has a tendency to get stuck
at local optima. Moreover, the random selection of the initial seeds is also considered to be a
major weakness as it heavily influences the final clustering quality (Arthur & Vassilvitskii,
2007). A recent technique called K-means++ (Arthur & Vassilvitskii, 2007) addresses the last
drawback of K-means. However, it also suffers from other drawbacks of K-means as listed
above.
The use of a Genetic Algorithm (GA) in clustering can help a data miner to avoid the local
optima issue of K-means (Agustín-Blas et al., 2012; D.-X. Chang et al., 2009; D. Chang et al.,
2012; He & Tan, 2012; Y. Liu et al., 2011; Peng et al., 2014; Rahman & Islam, 2014). Typically,
a genetic algorithm-based technique does not require any user input on the number of clusters 𝑘.
In a GA, a chromosome contains a set of genes, where a gene is a (real or pseudo) record. A gene
is considered to be the center of a cluster. Therefore, a chromosome is considered to be a
clustering solution.
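To make this representation concrete, the following minimal Python sketch (our own illustrative code, not a specific published technique; the SSE-based fitness and all names are our assumptions) encodes a chromosome as a set of cluster centers and scores it so that healthier chromosomes receive higher fitness values:

```python
import numpy as np

def fitness(chromosome, data):
    """Illustrative fitness: negated SSE, so that a chromosome whose
    genes (cluster centers) fit the data well scores higher."""
    centers = np.asarray(chromosome)
    # distance of every record to every center
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return -np.sum(d.min(axis=1) ** 2)   # each record joins its closest center

rng = np.random.default_rng(0)
data = rng.random((100, 3))              # 100 records, 3 numerical attributes
# a chromosome with four genes, each gene a (real) record used as a center
chromosome = data[rng.choice(len(data), size=4, replace=False)]
print(fitness(chromosome, data))
```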
However, GA-based clustering techniques have some limitations. Many existing techniques
(Y. Liu et al., 2011; Maio, Maltoni, & Rizzi, 1995; Maulik & Bandyopadhyay, 2000; Xiao, Yan,
Zhang, & Tang, 2010) randomly generate the number of genes of a chromosome in the population initialization phase. They also randomly choose records as genes, instead of carefully selecting the genes of a chromosome. Careful selection of genes can create an initial population
containing high-quality chromosomes. A high-quality initial population typically increases the
likelihood of obtaining a good clustering solution at the end of the genetic processing (Diaz-
Gomez & Hougen, 2007; Goldberg, Deb, & Clark, 1991; Rahman & Islam, 2014).
One existing GA-based clustering technique, GenClust (Rahman & Islam, 2014), finds a high-quality initial population and thereby obtains a good clustering solution. However, the initial population selection process of GenClust is very complex, with a complexity of 𝑂(𝑛²), where 𝑛 is the number of records in a data set. Moreover, GenClust requires user input on a number of radius values for the clusters during the initial population selection. It can be very difficult for a user to estimate the set of radius values (i.e. radii). Therefore, in this study we aim to produce high-quality initial seeds with low complexity and with no user input. In this thesis, we propose clustering techniques to improve the final clustering results.
We also carefully analyze the results obtained by both our techniques and other existing techniques, to determine whether they are sensible or not. In order to assess the quality of existing clustering techniques, we use as an example a brain data set (CHB-MIT Scalp) (Goldberger et al., 2000), which is available from https://physionet.org/cgi-bin/atm/ATM. We plot the data set so that we
can graphically visualize the clusters (see Fig. 1.1). We know that this data set has two types of
records: seizure and non-seizure. We can also see in the figure that there are clearly two clusters
of records. We then apply the existing clustering techniques on this data set and plot their
clustering results.
We find that some recent and state-of-the-art clustering techniques such as GAGR (D.-X.
Chang et al., 2009), AGCUK (Y. Liu et al., 2011) and GenClust (Rahman & Islam, 2014) do
not produce sensible clusters. Sometimes, they obtain a huge number of clusters and sometimes
they obtain only two clusters, where one cluster contains one record and the other cluster
contains all remaining records. These solutions are typically not useful in knowledge discovery
from underlying data sets. Therefore, a clustering technique that can produce a sensible
clustering solution is highly desirable. Hence, in this study we also propose a clustering
technique which produces sensible clustering solutions.
Fig. 1.1: The three dimensional CHB-MIT Scalp EEG (chb01-03) data set
During the development of the proposed clustering techniques we realize that the existing
cluster evaluation techniques are biased towards either high numbers of clusters or very low
numbers of clusters. Therefore, we also evaluate the existing cluster evaluation techniques by
analyzing them on some ground truth results, which are also graphically visualized. We find that
the existing evaluation techniques produce better evaluation values for non-sensible clustering
solutions (compared to sensible clustering solutions). Therefore, a good evaluation technique is
also highly required in order to evaluate sensible and non-sensible clustering solutions.
Consequently, in this study we also propose a cluster evaluation technique for appraising
sensible and non-sensible clustering solutions.
The main research goals of this study are therefore as follows:
1. To produce parameter-less clustering techniques with high-quality solutions and low
complexity;
2. To produce sensible clustering solutions; and
3. To evaluate sensible and non-sensible clustering solutions.
In Chapter 2 we provide a comprehensive literature review of the existing techniques for
clustering and cluster evaluation. We discuss the advantages and limitations of the existing
techniques.
In Chapter 3 we propose a GA based clustering technique called DeRanClust that produces
high-quality initial seeds through a deterministic phase and a random phase. The basic idea
behind the proposed technique is that a high-quality initial population typically increases the likelihood of obtaining a good clustering solution at the end of the genetic processing (Diaz-
Gomez & Hougen, 2007; Goldberg et al., 1991; Rahman & Islam, 2014).
DeRanClust therefore aims to produce high-quality initial seeds with a low complexity of
𝑂(𝑛). This technique automatically chooses the number of clusters for the chromosomes in the
initial population, and so does not require any user input for the number of clusters 𝑘.
DeRanClust also reduces the chance of getting stuck at local optima by using our proposed
genetic algorithm for high-quality chromosome selection. In Chapter 3, progress is made
towards achieving the first research goal, as presented above.
In Chapter 4 we present a technique titled GMC, which improves on DeRanClust. There is
room for further improvement of cluster quality of DeRanClust by improving other genetic
operations such as crossover and mutation. GMC therefore uses a new selection operation
whereby a chromosome with higher fitness value has a greater chance of being selected for other
genetic operations such as crossover and mutation.
GMC also proposes a new crossover operation where firstly the chromosomes of a population
are classified in one of two groups: Good group and Non-good group; and then different types
of crossover are performed on the two groups. Intuitively, this will increase the possibility of
obtaining good-quality offspring chromosomes from a pair of good-quality parent
chromosomes. GMC also performs different types of mutation operation for the two different
groups, whereby the number of changes to the good chromosomes is reduced, while the number of changes to the bad chromosomes is increased. In Chapter 4, further progress is made
towards achieving research goal 1.
In Chapter 5 we present a GA-based clustering technique called GCS that represents an
improvement on the techniques proposed in the previous two chapters. Typically, genetic
operations such as crossover and mutation tend to improve the health/fitness of a chromosome,
but they can also cause the health of some chromosomes to deteriorate. GCS therefore uses a
health check operation in order to ensure the presence of healthy chromosomes in a population.
GCS also modifies the process by which a pair of chromosomes is selected in a crossover
operation in order to increase the possibility of getting better quality offspring chromosomes.
GMC (proposed in Chapter 4) uses the new crossover operation whereby a chromosome with
low fitness value always pairs with another low-quality chromosome. GCS subsequently
introduces a new crossover operation whereby each chromosome is able to pair with the best
chromosome. GCS also uses a new selection operation in order to ensure the presence of good-
quality chromosomes in a population at the beginning of each generation. Chapter 5 further
refines the techniques proposed in the previous two chapters, and assists us to move closer
towards achieving our research goal 1.
In Chapter 6 we present a novel technique called HeMI which is a further improvement of
the techniques proposed in previous chapters. It is evident from the literature (Pourvaziri &
Naderi, 2014; Straßburg, Gonzàlez-Martel, & Alexandrov, 2012) and through empirical analysis
undertaken in this study that population size has a positive impact on clustering quality. That is,
a big population size is likely to contribute towards a good clustering solution. However, a large
population size also requires high time complexity.
HeMI therefore uses a big population in multiple streams, where each stream contains a
relatively small number of chromosomes, and can thus keep the execution time low, since the streams are suitable for parallel processing when necessary. HeMI also introduces information sharing among the streams at regular intervals in order to take advantage of the multiple
streams. The presence of healthy chromosomes (i.e. chromosomes with high fitness values) in
a population can increase the possibility of good clustering results. Therefore, HeMI replaces
the sick chromosomes (i.e. chromosomes with low fitness) with healthy chromosomes that are
produced through a novel approach. In Chapter 6, research goal 1 is achieved.
In Chapter 7 we present a GA-based Clustering technique called CSClust, with the aim of
producing sensible clusters. In order to achieve our second goal of producing sensible clustering
solutions we carefully analyze the results of some existing techniques. We find that some recent
clustering techniques do not produce sensible clusters and fail to discover knowledge from
underlying data sets. Sometimes, they obtain a huge number of clusters and sometimes they
obtain only two clusters, where one cluster contains one record and the other cluster contains all
remaining records. Therefore, in CSClust we propose a new cleansing and cloning operation
that helps to produce sensible clusters with high fitness values, which are also useful for
knowledge discovery.
In Chapter 8 we propose a new clustering technique, and an evaluation technique. In the
proposed clustering technique, we combine our previous technique called CSClust with HeMI
where we also significantly improve the components of CSClust and HeMI. Therefore, we call
the proposed technique HeMI++. We first explore the quality of HeMI and some existing
clustering techniques. We also explore the quality of existing evaluation techniques. In Chapter
7, we find that some existing techniques do not produce sensible clusters. However, in Chapter
8, we carefully assess the clustering quality of the existing techniques and HeMI through cluster
visualization.
We find that some of the existing clustering techniques do not produce sensible clusters. In
order to overcome this limitation, HeMI++ incorporates a new component titled Selection of
Sensible Properties. Through this component, HeMI++ first learns important properties of
sensible clustering solutions and then applies the information in producing its clustering
solutions. HeMI++ also uses a cleansing operation in order to identify the sick chromosomes.
The sick chromosomes are then replaced through its cloning operation by a pool of healthy
chromosomes found in the initial population.
During the development of the proposed clustering technique we realize that the existing
cluster evaluation techniques are biased towards either a high number of clusters or a very low
number of clusters. Hence, we also propose a novel cluster evaluation technique called Tree
Index. We validate the effectiveness of Tree Index by analyzing it on some ground truth
clustering results, which are also graphically visualized. While existing evaluation techniques
fail to correctly evaluate the cluster quality, Tree Index scores the sensible solutions higher than
non-sensible solutions.
We empirically compare our proposed clustering technique HeMI++ with five existing
techniques using 21 publicly available data sets in terms of Tree Index. The experimental results
on the 21 publicly available data sets clearly indicate the superiority of HeMI++ over the
existing techniques. We also graphically visualize the clustering results of HeMI++ on a brain
data set and find the results to be more sensible than others. Additionally, we discover some
useful knowledge from the clustering results produced by HeMI++ indicating its usefulness in
knowledge discovery. In Chapter 8, research goals 2 and 3 are attained.
In Chapter 9 we present a detailed analysis on the performance of the proposed techniques.
The performance of HeMI++ is analyzed based on some factors including number of records,
number of attributes and types of the majority of attributes in a data set. Contributions of the
thesis and future research directions are also discussed. Chapter 9 also presents a complexity
analysis of the proposed techniques and some existing techniques.
Finally in Chapter 10 we present our concluding remarks.
Chapter 2
Literature Review
2.1 Introduction
In this chapter, a data set with its notations and definitions is presented in Section 2.2, followed
by a short introduction to data mining in Section 2.3. Section 2.4 provides a brief overview of
machine learning, while Section 2.5 examines clustering, applications of clustering, and
categories of clustering. Different types of distance calculations are set out in Section 2.6; and
Section 2.7 introduces cluster evaluation techniques. A summary of the chapter is presented in
Section 2.8.
During the PhD candidature, we have published the following papers based on this chapter.
Beg, A. H. and Islam, M. Z. (2016): Branches of Evolutionary Algorithms and their Effectiveness for
Clustering Records, In Proc. of the 11th IEEE Conference on Industrial Electronics and Applications
(ICIEA 2016), Hefei, China, June 5-7, 2016, pp. 2484-2489. (ERA 2010 Rank A).
Beg, A. H. and Islam, M. Z. (2016): Advantages and Limitations of Genetic Algorithms for Clustering
Records, In Proc. of the 11th IEEE Conference on Industrial Electronics and Applications (ICIEA 2016),
Hefei, China, June 5-7, 2016, pp. 2478-2483. (ERA 2010 Rank A).
2.2 Data Set with its Notations and Definition
In this thesis, a data set 𝐷 is considered to be a two-dimensional matrix/table having 𝑛 records
(i.e. rows) and 𝑚 attributes (i.e. columns). The data set is represented as 𝐷 = {𝑅1, 𝑅2 … … 𝑅𝑛},
where 𝑅𝑖 is the 𝑖𝑡ℎ record. The set of attributes is denoted as 𝐴 = {𝐴1, 𝐴2 … … 𝐴𝑚}, where 𝐴𝑗 is
the 𝑗𝑡ℎ attribute. Each record 𝑅𝑖 has |𝐴| attributes. An attribute can be categorical and/or
numerical (Han & Kamber, 2006; Pang-Ning Tan, Michael Steinbach, 2005).
The domain of a numerical attribute 𝐴𝑖 is characterized as 𝐴𝑖 = [𝑙𝑖, 𝑢𝑖], where 𝑙𝑖 is the lower
limit and 𝑢𝑖 is the upper limit of the domain of 𝐴𝑖. The domain of a categorical attribute 𝐴𝑗 is
represented as 𝐴𝑗 = {𝐴𝑗¹, 𝐴𝑗², … , 𝐴𝑗ˣ}, where 𝐴𝑗ᵏ is the 𝑘𝑡ℎ domain value and 𝑥 is the domain size of 𝐴𝑗.
Table 2.1: A synthetic data set
Record Student Name Course Marks Grade Study Mode
𝑅1 Daniel Advanced Electronics 87 A Full-Time
𝑅2 Andrew Computer Graphics 92 A+ Full-Time
𝑅3 Matthew Compiler Design 86 A Full-Time
𝑅4 Alex Computer Graphics 67 B Full-Time
𝑅5 Melissa Computer Graphics 75 B+ Part-Time
𝑅6 Anita Theory of Computing 82 A Part-Time
𝑅7 Andrew Object Oriented Programming 39 F Full-Time
𝑅8 Emily Data Structure and Algorithms 94 A+ Part-Time
𝑅9 Samuel Compiler Design 88 A Full-Time
𝑅10 Matthew Theory of Computing 93 A+ Full-Time
In Table 2.1, an example data set with ten records and five attributes is presented. Four
attributes “Student Name”, “Course”, “Grade”, and “Study Mode” are categorical, and one
attribute “Marks” is numerical. The domain values of the numerical attribute “Marks” range
from 39 to 94. The domain values for the categorical attribute “Study Mode” are {Full-Time,
Part-Time}. In a similar way, the domain values of all other attributes can be learnt from Table
2.1.
2.3 Data Mining
Due to the advancement of scientific technology and increase of information, huge amounts of
data can be collected (Bello-Orgaz et al., 2016; Kuo et al., 2012). In most cases, these data are collected in an unstructured way. As it would be difficult for a domain expert to gather knowledge manually from such an enormous amount of data, data mining techniques are required to allow information (patterns) to be acquired and decision-making processes to be facilitated.
Data mining is a technique that discovers useful information from collections of data by
representing the data in a structured way (Han & Kamber, 2006; Pang-Ning Tan, Michael
Steinbach, 2005). Many similar terms are used interchangeably, such as knowledge extraction,
knowledge mining from data, data archaeology, data dredging, and data/pattern analysis (Han
& Kamber, 2006). Organizations use data mining to make better decisions. A data mining
technique identifies interesting patterns (such as the discovery of knowledge and predictive
patterns in data), that otherwise could be very difficult to ascertain, especially from large
collections of data (Hulse, Khoshgoftaar, & Huang, 2007; Pyle, 1999; Sumathi & Sivanandam,
2006).
2.4 Machine Learning
Machine learning is a process by which knowledge from previous data is automatically learnt.
Based on the data, a model that produces some knowledge about the data is built. The knowledge
is then used to analyze future data. Automatic development of learning algorithms without
human interference is the main aspect of machine learning. Typically, machine learning can be
divided into two categories as follows (Md Anisur Rahman, 2014; Roiger & Geatz, 2003; J. Y.
Yang & Ersoy, 2003):
Supervised learning; and
Unsupervised learning.
Supervised Learning
In supervised learning, the data set has a special attribute called the class attribute, which
contains the class value/output of a record. Typically, the domain size of the class attribute is equal to or greater than two (B. Liu, 2011; Md Anisur Rahman, 2014). The data sets are also divided into subsets; namely, training data sets and testing data sets. In a training data set, a record contains a class label (i.e. the class value of the record), whereas in a testing data set, the class value of a record needs to be predicted. Based on the training data set a model is
developed which produces logic rules, which are used to predict the class labels for the records
of the testing data set (i.e. the future data). Supervised learning methods include Support Vector
Machine, Decision Tree, Bayesian Network, Neural Network, Regression Analyses, and so on
(Maimon & Rokach, 2010; Md Anisur Rahman, 2014).
Unsupervised Learning
Unsupervised learning is a data-driven approach, which is also known as learning by
observation (Han & Kamber, 2006). In unsupervised learning, the data set does not have a class
attribute. Unsupervised learning analyzes data to discover the inherent structure of the data.
Unsupervised learning is often considered to be a pre-processing step for supervised learning. Unsupervised learning includes Outlier Detection, Clustering, Dimensionality Reduction, and so on (Chapelle, Scholkopf, & Zien, 2006; Ghahramani, 2004; Md Anisur Rahman, 2014).
2.5 Clustering
Clustering is an important and well-known technique in the area of data mining, which aims to
group similar records in one cluster and dissimilar records in other clusters. Through clustering,
hidden information can be extracted from the data that can help in the decision-making process
(D.-X. Chang et al., 2009; D. Chang et al., 2012; Han & Kamber, 2006; Kuo et al., 2012; Y. Liu
et al., 2011; Pang-Ning Tan, Michael Steinbach, 2005; Rahman & Islam, 2014).
Clustering has a wide range of applications such as machine learning (Gan, 2013;
Mukhopadhyay & Maulik, 2009), image segmentation (Cai et al., 2007; B. N. Li et al., 2011; F.
Zhao et al., 2014), business (M.-Y. Chen, 2013; Montani & Leonardi, 2014), social network
analysis (Girvan & Newman, 2002), and medical imaging (Kannan et al., 2010; Masulli &
Schenone, 1999; Stockman & Shapiro, 2001). It is therefore vital that good-quality clusters are
produced.
2.5.1 Applications of Clustering
Clustering has a wide range of applications, with key applications of clustering as follows:
Psychology and Medicine
An illness or condition frequently has a number of variations, and clustering techniques can be used to identify these different subcategories (Pang-Ning Tan, Michael Steinbach, 2005). For
example, clustering techniques have been used to diagnose different types of depression
(Deckersbach et al., 2016; Dipnall et al., 2017; Miller & Cole, 2012; Rivera-Baltanas et al.,
2014; Suzuki et al., 2014; Van Lancker, Beeckman, Verhaeghe, Van Den Noortgate, & Van
Hecke, 2016).
Gene Analysis
In DNA microarray technology, a huge amount of gene expression data are generated and
monitored simultaneously. Detecting useful patterns from the produced data is valuable for
biomedical research, as it can help to diagnose diseases such as cancer and heart attacks (Md
Anisur Rahman, 2014). Clustering is widely used in gene expression data in order to extract
patterns/hidden information (Brameier & Wiuf, 2007; Kerr, Ruskin, Crane, & Doolan, 2008;
Maraziotis, 2012; Pirim, Ekşioğlu, Perkins, & Yüceer, 2012; Szeto, Liew, Yan, & Tang, 2003;
Xu, Damelin, Nadler, & Wunsch, 2010; Zeng & Garcia-Frias, 2006).
Medical Imaging and Object Detection
Clustering is also widely used in segmenting medical images, robotics and object detection (Bai
et al., 2013; Kannan et al., 2010; Kaya et al., 2017; B. N. Li et al., 2011; Liao et al., 2008;
Masulli & Schenone, 1999; Saha et al., 2016; Son & Tuan, 2017; Sonğur & Top, 2016;
Stockman & Shapiro, 2001). Clustering can partition a medical image into different anatomical
structures that can help to detect diseases (Kannan et al., 2010; B. N. Li et al., 2011; Liao et al.,
2008; Md Anisur Rahman, 2014).
Climate
Clustering is also widely used in predicting climate (Bador, Gilleland, Castellà, & Arivelo,
2015; Y. Chen et al., 2017; Gu, Zhang, Singh, Chen, & Shi, 2016; Merz, Nguyen, &
Vorogushyn, 2016; Parente, Pereira, & Tonini, 2016). In order to understand the earth’s climate
it is necessary to establish the existence of patterns in atmospheric pressure and ocean currents.
Clustering approaches have therefore been applied to determine patterns in the atmospheric
pressure of Polar Regions, and in ocean areas that impact significantly on land climate (Pang-
Ning Tan, Michael Steinbach, 2005).
Social Network Analysis
Social Network Analysis (SNA) is crucial in regard to investigating the social activities of
communities, by analyzing cultural activities, daily activities, employment status, and earnings
among people living within a community. Clustering has received major attention in SNA, in
identifying similar groups of people (Bello-Orgaz et al., 2016; Daraganova et al., 2012; de
Arruda, Costa, & Rodrigues, 2012; Firat, Chatterjee, & Yilmaz, 2007; Firestone, Ward,
Christley, & Dhand, 2011; Giebultowicz, Ali, Yunus, & Emch, 2011; Girvan & Newman,
2002; Hsieh & Magee, 2008; Levine & Kurzban, 2006; Mann, Matula, & Olinick, 2008; Md
Anisur Rahman, 2014; Opsahl & Panzarasa, 2009; Qiao, Li, Li, Peng, & Chen, 2012; Traud,
Mucha, & Porter, 2012; P. Zhao & Zhang, 2011; Z. Zhao et al., 2012).
Business
Clustering is used in business to analyze customers’ requirements and expectations. On the stock
market, clustering is used as a decision support system in predicting the price of a product
(Alexander & Peterson, 2007; ap Gwilym & Verousis, 2010; Ashton & Hudson, 2008; Brown,
Chua, & Mitchell, 2002; Chan, Kwong, & Hu, 2012; M.-Y. Chen, 2013; Gunaratne, Nicol,
Seemann, & Török, 2009; Hruschka, Fettes, & Probst, 2004; Md Anisur Rahman, 2014; Mo,
Kiang, Zou, & Li, 2010; Montani & Leonardi, 2014; Nanda, Mahanty, & Tiwari, 2010; Narayan,
Narayan, & Popp, 2011; Narayan, Narayan, Popp, & D’Rosario, 2011; C.-H. Wang, 2009).
2.5.2 Categories of Clustering Techniques
Different types of clustering techniques include partitioning, hierarchical, graph-based, and
evolutionary algorithm-based. In this thesis, the various clustering methods will be divided into
the following categories.
Partition-based Clustering Techniques;
Hierarchical Clustering Techniques;
Density-based Clustering Techniques;
Graph-based Clustering Techniques;
Grid-based Clustering Techniques;
Spectral Clustering Techniques;
Model-based Clustering Techniques; and
Evolutionary Algorithm-based Clustering Techniques.
2.5.2.1 Partition-based Clustering Techniques
The partition-based clustering technique divides a data set into 𝑘 partitions (𝑘 ≤ 𝑛, where 𝑛 is
the number of records in the data set), where each partition represents a cluster (Han & Kamber,
2006). The clusters are formed in such a way that the records within a cluster are more similar
to each other than to the records in other clusters. A record is allocated to the cluster to whose seed/centroid it has the minimum distance. During the clustering process,
the seed of a cluster is updated based on the records allocated to the cluster, and the records of
the data set change the clusters based on their distances to the updated seeds of the clusters
(Han & Kamber, 2006; Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005).
Partition-based clustering techniques have an objective function which is optimized during the
clustering process. Typically, partition-based clustering techniques can be divided into the
following two categories:
Non-Fuzzy Clustering; and
Fuzzy Clustering.
Non-Fuzzy Clustering
Non-fuzzy clustering is also known as hard clustering or exclusive clustering (Md Anisur
Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005). In this form of clustering, the data
set is separated into non-overlapping clusters in such a way that a record belongs to only one
cluster. K-means (Han & Kamber, 2006; Lloyd, 1982; Pang-Ning Tan, Michael Steinbach,
2005) is one of the most popular non-fuzzy clustering techniques. In K-means, a user (data
miner) is required to define the number of clusters (𝑘) in advance. Based on the user defined
number of clusters, K-means then randomly selects 𝑘 records as initial seeds from the data set
and each record of the data set is then allocated to its closest seed in order to form clusters. The
seed of a cluster is then updated based on the records allocated to the cluster. The updated seed
is a (real or pseudo) record where each attribute value of the updated seed is the average of all
values of the attribute for all records belonging to the cluster.
The process of allocating/re-allocating records to the clusters and updating the seeds is considered to be one iteration of K-means. The iterations continue until a termination condition is met. Typically there are two termination conditions: first, the process terminates if the user defined maximum number of iterations is reached; and second, it terminates if the objective function value does not improve between two consecutive iterations by more than a user defined threshold (Arthur & Vassilvitskii, 2007; Lloyd, 1982; Pang-Ning Tan, Michael Steinbach, 2005).
The objective function of K-means is the sum of the squared error (SSE), also known as
scatter. K-means calculates the error of each record, i.e. its Euclidean distance (see Section 2.6.1) to the closest seed, and then calculates the total sum of the squared errors (Pang-Ning Tan, Michael Steinbach, 2005). The main objective of the K-means algorithm is to minimize the objective function described by the equation:
SSE = \sum_{j=1}^{k} \sum_{R_i \in C_j} dist(S_j, R_i)^2        (Eq. 2.1)

where 𝑘 stands for the number of clusters, 𝑆𝑗 is the seed of the 𝑗𝑡ℎ cluster 𝐶𝑗, and 𝑑𝑖𝑠𝑡(𝑆𝑗, 𝑅𝑖) is the Euclidean distance between the record 𝑅𝑖 and the seed 𝑆𝑗 of the 𝑗𝑡ℎ cluster 𝐶𝑗.
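As a concrete illustration of the allocation and seed-update iterations described above, the following minimal Python sketch (our own illustrative code, assuming numerical attributes; a seed-stability test is used here in place of the threshold-based condition) implements K-means with the SSE objective of Eq. 2.1:

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal K-means sketch: random initial seeds, then alternate
    record allocation and seed update until the seeds stop changing."""
    rng = np.random.default_rng(seed)
    seeds = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # allocate each record to its closest seed (Euclidean distance)
        d = np.linalg.norm(data[:, None, :] - seeds[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update each seed as the attribute-wise average of its records
        new_seeds = np.array([data[labels == j].mean(axis=0)
                              if np.any(labels == j) else seeds[j]
                              for j in range(k)])
        if np.allclose(new_seeds, seeds):      # seeds unchanged: terminate
            break
        seeds = new_seeds
    sse = np.sum(d.min(axis=1) ** 2)           # the objective of Eq. 2.1
    return labels, seeds, sse
```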
K-means++ (Arthur & Vassilvitskii, 2007) is another well-known non-fuzzy clustering
technique. K-means++ chooses only the first seed randomly. It then chooses the second seed in
a probabilistic way, so that the record having the greatest distance from the first seed has the highest probability of being chosen as the second seed. While choosing the third seed, the record having the greatest distance from its nearest seed has the highest probability. In a similar way,
it picks the fourth seed and so on; it picks as many seeds as the user defined number of clusters.
All other components of K-means++ are exactly the same as K-means.
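The probabilistic seeding step of K-means++ can be sketched as follows (an illustrative Python fragment; each later seed is sampled with probability proportional to its squared distance to the nearest seed chosen so far, as described above):

```python
import numpy as np

def kmeanspp_seeds(data, k, seed=0):
    """Pick k initial seeds: the first uniformly at random, each later
    seed with probability proportional to the squared distance to the
    nearest seed chosen so far."""
    rng = np.random.default_rng(seed)
    seeds = [data[rng.integers(len(data))]]            # first seed: random
    for _ in range(k - 1):
        # squared distance of every record to its nearest chosen seed
        d2 = np.min([np.sum((data - s) ** 2, axis=1) for s in seeds], axis=0)
        seeds.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(seeds)
```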
Partition-based clustering techniques have a number of limitations. One of the main
disadvantages of partition-based clustering techniques is that they require a user defined number
of clusters prior to clustering. It is difficult for a user (data miner) to estimate the appropriate
number of clusters in advance. Another drawback to these techniques is that they tend to get
stuck at local optima.
Fuzzy Clustering
Fuzzy clustering (also known as soft clustering) is a process of clustering in which each record
of a data set can belong to more than one cluster (“Fuzzy Clustering,” 2017; Pang-Ning Tan,
Michael Steinbach, 2005). In many real-life data sets, the records may sometimes show a fuzzy
nature in the sense that they may have an association with more than one cluster, instead of
completely belonging to only one cluster. For example, a record may have a 70% membership
of one cluster, a 20% membership of another cluster, and a 10% membership of a third cluster
(Kannan et al., 2010; Md Anisur Rahman, 2014).
The fuzzy C-means technique – also known as FCM – explores this fuzzy nature of the records (Bezdek
& C., 1981; Md Anisur Rahman, 2014). A record in a data set belonging to multiple clusters has
different membership degrees. These membership degrees (known as fuzzy membership degrees) indicate the degree to which a record belongs to each cluster (Abonyi János & Feil Balázs, 2007; Md
Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005). For each record of the data
set, FCM techniques allocate a fuzzy membership degree between the record and each cluster, in order to signify the level of attachment between the record and the cluster. The fuzzy membership
degrees of a record with different clusters can vary, as it does not belong to just one cluster. The
summation of the fuzzy membership degrees of a record with all clusters is equal to one (Md
Anisur Rahman, 2014).
Typically, FCM works on data sets with numerical attributes only (Bezdek & C., 1981; Md
Anisur Rahman, 2014). After initialization, FCM computes the seeds/centroids of the clusters
based on the fuzzy membership degrees. Once the seeds of the clusters are obtained, it next
calculates the fuzzy membership degree of each record. The fuzzy membership degree of a
record and the seed of a cluster is inversely proportional to the distance between the seed and
the record. In a similar way to K-means, FCM techniques iteratively recompute the seeds and
fuzzy membership degrees until the seeds/centroids no longer change (Pang-Ning Tan, Michael
Steinbach, 2005).
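The two alternating updates of FCM can be sketched in Python as follows (illustrative only; the fuzzifier value m = 2 is a common but assumed choice, and the records are assumed to be numerical):

```python
import numpy as np

def fcm_step(data, centers, m=2.0, eps=1e-9):
    """One FCM iteration: update the fuzzy membership degrees, then the
    centers. The memberships of each record over all clusters sum to one."""
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2) + eps
    inv = d ** (-2.0 / (m - 1.0))   # membership inversely related to distance
    u = inv / inv.sum(axis=1, keepdims=True)
    w = u ** m                      # new centers: membership-weighted means
    centers = (w.T @ data) / w.sum(axis=0)[:, None]
    return u, centers
```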
Fuzzy clustering techniques have a number of limitations. Many of the existing fuzzy
clustering techniques, including FCM, do not work on data sets with both categorical and
numerical attributes; in fact few fuzzy clustering techniques exist that can process such data
sets. Moreover, many techniques require various user inputs such as the number of clusters, and randomly select the initial fuzzy membership degrees.
2.5.2.2 Hierarchical Clustering Techniques
Hierarchical clustering successively merges smaller clusters into larger clusters, or splits larger clusters into smaller ones (Han & Kamber, 2006; Pang-Ning Tan, Michael Steinbach, 2005). A tree structure called a dendrogram is used to illustrate hierarchical clustering. The two types of
hierarchical clustering include:
Agglomerative Hierarchical Clustering; and
Divisive Hierarchical Clustering.
Agglomerative Hierarchical Clustering
Agglomerative hierarchical clustering is a bottom-up approach where each individual record of a data set is initially considered as a cluster, and the two most similar clusters are then iteratively merged based on a similarity measure. The merging process continues until a single cluster is obtained or a termination condition is satisfied (Han & Kamber, 2006; Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005).
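For illustration, agglomerative clustering is readily available in SciPy; the following sketch (with arbitrary example data and an assumed choice of average linkage) builds the dendrogram linkage and then cuts it into a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.random((50, 3))        # 50 records, 3 numerical attributes

# bottom-up merging; 'average' linkage measures the similarity of two
# clusters as the mean pairwise distance between their records
Z = linkage(data, method='average')
labels = fcluster(Z, t=4, criterion='maxclust')   # cut the dendrogram at 4 clusters
```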
Divisive Hierarchical Clustering
Divisive hierarchical clustering is a top-down approach where the records of a data set are
considered as one large cluster. The large cluster is then split into smaller clusters in such a
way that the most similar records are placed in one cluster. The division process is continued
until each record of the data set forms a separate cluster (Han & Kamber, 2006; Pang-Ning Tan,
Michael Steinbach, 2005). The divisive hierarchical process is, in effect, the reverse of the
agglomerative hierarchical process.
The major problem with many hierarchical clustering techniques is that they require high
computational complexity. In general, the overall complexity of hierarchical clustering is 𝑂(𝑛³)
(Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005).
2.5.2.3 Density-based Clustering Techniques
Density-based clustering techniques discover areas (groups) of high density – in terms of
records in a data set – that are separated from one another by areas of low density. In this
technique, each area of high density (group of records) is considered to be a different cluster
(B. Andreopoulos, An, & Wang, 2007; W. Andreopoulos, 2006; Han & Kamber, 2006; Md
Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005). DBSCAN is a simple and
effective density-based clustering technique, which finds the densest areas in a data set (Ester,
Kriegel, Sander, & Xu, 1996; Han & Kamber, 2006; Md Anisur Rahman, 2014). To find the
densest areas DBSCAN uses a radius (𝑟𝑥) and finds the neighbourhood of a record within its 𝑟𝑥.
The neighbourhood within a radius of a given record is called the neighbourhood of the record.
If the neighbourhood of a record contains a user defined minimum number of records within a
radius, then the record is considered as a core record (Han & Kamber, 2006; Md Anisur Rahman,
2014).
DBSCAN then decreases the number of core records by using a directly density-reachable
function. A record is considered directly density-reachable from a core record if the record is
within the neighbourhood of the core record. DBSCAN iteratively collects directly density-
reachable records from the core records and forms to a few density-reachable clusters. The
process terminates when no record is left to be clustered.
One of the main disadvantages of density-based clustering techniques is that various user
inputs are required, including the radius of a cluster (W. Andreopoulos, 2006; Md Anisur
Rahman, 2014; Omran, Engelbrecht, & Salman, 2007; Sisodia, Singh, Sisodia, & Saxena,
2012).
2.5.2.4 Graph-based Clustering Techniques
In graph-based clustering techniques, records of a data set are represented by vertices, with the
edge (connection) between two records indicating that their similarity is greater than a threshold
value (Han & Kamber, 2006; Md Anisur Rahman, 2014). Graph-based clustering aims to group
the vertices into different clusters based on their similarity. Grouping is performed in such a
way that there should be many edges within each cluster, and relatively few
edges between the clusters (Abonyi János & Feil Balázs, 2007; Z. Chen & Ji, 2010; Md Anisur
Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005).
Graph-based clustering typically uses 𝑘 nearest neighbours to build a graph. Typically, it
uses a minimum spanning tree (MST) to partition the graph into clusters (Galluccio, Michel,
Comon, & Hero, 2012; Md Anisur Rahman, 2014; Zhong, Miao, & Wang, 2010). One limitation of graph-based clustering techniques is that a similarity function must be selected from the wide range of available similarity functions; a similar problem exists in relation to the selection of a similarity graph.
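A common MST-based partitioning scheme can be sketched as follows (our own illustrative Python code, not a specific published technique): build the MST over the pairwise distances and remove the k − 1 heaviest edges, so that the remaining connected components form k clusters:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(data, k):
    """Cut the k-1 heaviest MST edges; the components become clusters."""
    dist = squareform(pdist(data))                   # dense pairwise distances
    mst = minimum_spanning_tree(dist).toarray()      # n-1 edges, stored once each
    heaviest = np.argsort(mst, axis=None)[::-1][:k - 1]
    mst[np.unravel_index(heaviest, mst.shape)] = 0   # remove the heaviest edges
    _, labels = connected_components(mst, directed=False)
    return labels
```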
2.5.2.5 Grid-based Clustering Techniques
The grid-based clustering method uses a multiresolution grid structure. It quantizes the record’s
space into a finite number of cells that form a grid structure on which the clustering operations
are performed. The main advantage of this approach is that it is computationally fast, since its cost depends on the number of cells rather than the number of records (Han & Kamber, 2006; Md Anisur Rahman, 2014; W.
Wang, Yang, & Muntz, 1997). An example of grid-based clustering technique is STING, which
is discussed below.
STING is a grid-based multiresolution clustering technique which divides the spatial area of
the data into rectangular cells (Han & Kamber, 2006; Md Anisur Rahman, 2014). Usually there
are several levels of hierarchy in the rectangular cells. Each cell at a higher level is partitioned into a number of lower-level cells and stores statistical information about attributes, such as the mean, maximum, minimum, standard deviation, and the type of distribution of an attribute. All kinds
of statistical parameters of higher-level cells can easily be calculated based on corresponding
statistical information in lower-level cells.
STING follows the top-down approach. It commences from a predefined layer containing a
small number of cells and removes the irrelevant cells from further consideration (Han &
Kamber, 2006). STING continues this process until it reaches the bottom layer. One of the
key limitations of grid-based clustering techniques is that they require huge amounts of memory
if the number of cells is high (Md Anisur Rahman, 2014; W.M. Ma & Chow, 2004).
2.5.2.6 Spectral Clustering Techniques
The spectral clustering algorithm is one of the most commonly used clustering techniques and
has become a popular research topic (Beauchemin, 2015; X. Hong, Wang, & Qi, 2014; Ma,
Cheng, Liu, & Xie, 2017; Md Anisur Rahman, 2014; Mur, Dormido, Duro, Dormido-Canto, &
Vega, 2016; Nascimento & de Carvalho, 2011; Rafailidis, Constantinou, & Manolopoulos,
2017; Shang, Zhang, Jiao, Wang, & Yang, 2016; von Luxburg, 2007; Y. Yang, Wang, & Xue,
2016). This algorithm is also used for partitioning graphs. It uses a number of mathematical tools, including the similarity matrix and the similarity graph, to partition the records of a data set into different clusters.
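For illustration, the following sketch uses scikit-learn's spectral clustering implementation (the RBF similarity function and the number of clusters are assumed example choices):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
data = rng.random((150, 2))

# affinity='rbf' builds the similarity matrix with a Gaussian kernel; the
# similarity function and the number of clusters remain user decisions
model = SpectralClustering(n_clusters=3, affinity='rbf', random_state=0)
labels = model.fit_predict(data)
```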
One of the main advantages of spectral clustering techniques is that clusters of arbitrary shape can be identified, which partition-based clustering techniques are incapable of detecting (Matthias & Juri, 2009; Md Anisur Rahman, 2014). However, a drawback of spectral
clustering is that it is often hard to select the best similarity function from the wide range of
similarity functions. It is also affected by problems related to the similarity graph, the Laplacian
matrix, and the number of clusters 𝑘.
2.5.2.7 Model-based Clustering Techniques
Model-based clustering methods optimize the fit between the given data and several
mathematical models. Such methods are often based on the assumption that the records of the
data set used for clustering are produced by a mathematical model (Han & Kamber, 2006; Md
Anisur Rahman, 2014). An example of a model-based clustering technique is Expectation
Maximisation (EM), which can be viewed as an extension of the K-means partitioning algorithm. The EM
algorithm is briefly discussed as follows:
Expectation Maximisation
EM assumes that each cluster can be represented by a probability distribution (Han & Kamber,
2006; Md Anisur Rahman, 2014). A data set is the mixture (combination) of these distributions.
Therefore, a mixture model is used to cluster the data, where each distribution represents a
cluster. It is difficult to estimate the parameters of the probability distributions. The EM method
is used for estimating the parameters of the probability distribution. The steps of the algorithm
are described as follows:
Select a set of initial parameters; and
Based on the following two steps iteratively refine the parameters:
Expectation step; and
Maximisation step.
EM first selects the preliminary parameters such as initial seeds and several other parameters.
These parameters are then iteratively updated based on expectation and maximisation. In the
expectation step, EM calculates the probability of cluster membership of each record for each
of the clusters. Based on probabilities from the expectation step, the maximisation step finds the
new approximations of the parameter that maximises the expected likelihood.
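For illustration, the following sketch runs EM through scikit-learn's Gaussian mixture implementation (the number of components and the Gaussian assumption are example choices):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = rng.random((300, 2))

# each of the 3 Gaussian components plays the role of a cluster;
# fit() iterates the expectation and maximisation steps internally
gmm = GaussianMixture(n_components=3, random_state=0).fit(data)
labels = gmm.predict(data)        # hard cluster assignment
probs = gmm.predict_proba(data)   # per-record cluster membership probabilities
```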
However, in a similar way to K-means, EM may get stuck at local optima (Md Anisur Rahman, 2014). One of the downsides of many model-based clustering techniques is that the same type of probability distribution is assumed for each cluster, which may not always be the case (Md Anisur Rahman, 2014; Roy & Parui, 2014).
2.5.2.8 Evolutionary Algorithm-based Clustering Techniques
Partition-based clustering techniques such as K-means have a number of well-known drawbacks.
One of the problems with K-means is that a user defined number of clusters 𝑘 is required. In
reality, it can be difficult for a user (data miner) to estimate the appropriate number of clusters
in advance. Another disadvantage of K-means is its tendency to get stuck at local optima (Arthur
& Vassilvitskii, 2007; Mohd, Beg, Herawan, & Rabbi, 2012; Rahman & Islam, 2014). In order
to overcome these limitations some existing clustering techniques use various evolutionary
algorithms including genetic algorithms, particle swarm optimization, and ant colony
optimization.
In Table 2.2 and Table 2.3, a review of some major evolutionary algorithm-based clustering
techniques from the last twenty years (1995-2015) is presented. In total, 65 ranked evolutionary
algorithm-based clustering approaches are reviewed (in this instance, the term “ranked” is based
on citation reports on 16/04/2016, and Journal Citation Reports/The Computing Research and
Education Association of Australasia ranks). Most of these techniques do not require users to
define the number of clusters in advance, and are used for many real-life applications such as
highway construction projects, gas companies, cellular networks, satellite image segmentations,
and real-world medical problems.
In this thesis, the various evolutionary algorithm-based clustering techniques are identified
as follows:
Ant Colony Algorithm-based Clustering Techniques;
Bee Colony Algorithm-based Clustering Techniques;
Particle Swarm Optimization (PSO) Algorithm-based Clustering Techniques;
Black hole Algorithm-based Clustering Techniques;
Firefly Algorithm-based Clustering Techniques; and
Genetic Algorithm-based Clustering Techniques.
Ant Colony Algorithm-based Clustering Techniques
The ant colony approach produces optimal clustering solutions (C.-L. Huang, Huang, Chang,
Yeh, & Tsai, 2013; İnkaya, Kayalıgil, & Özdemirel, 2015; Korürek & Nizam, 2008; Ramos,
Hatakeyama, Dong, & Hirota, 2009; Shelokar, Jayaraman, & Kulkarni, 2004; Wan, Wang, Li,
& Yang, 2012; L. Zhang & Cao, 2011; L. Zhang, Cao, & Lee, 2013) using a colony of software ants. An ant is an individual representing a clustering solution that contains a number of cluster centers. Typically, a user defined number of records is randomly selected from the data set, and these records collectively form an ant. The quality of
each ant is measured through its fitness value/objective function.
Ant colony optimization features a number of iterations. In each iteration, a user defined number of ants is selected based on their fitness, and a local search operation is applied to the selected ants. In the local search operation, the number of clusters is altered probabilistically.
At the end of each iteration, the ant (clustering solution) is updated using the pheromone trail
matrix. The pheromone trail matrix works as an adaptive memory that contains information
about the previous best solution; this is updated at the end of each iteration (Shelokar et al.,
2004). The algorithm repeatedly carries out the local search and pheromone update procedure
for a maximum number of given iterations.
However, a limitation of many ant colony algorithm-based clustering techniques is that
a user needs to define the number of clusters in advance to form an ant; and it is difficult for a
user to guess the correct number of clusters in advance. Some ant colony algorithm-based
clustering techniques generate the number of clusters randomly to form an ant; however, the
quality of the ant is unlikely to be high due to the random selection process. We argue that
having high-quality ants at the beginning of the iteration can result in better quality final
clustering solutions for a given number of iterations.
Table 2.2: List of ranked (based on citation reports) evolutionary algorithm-based clustering techniques from 1995-2015
Algorithms  Authors  Year of publication  Number of citations  Rank  Initial cluster number selection  Applications
GA Srikanth et al. 1995 88 Q2 Random Real life data sets
GA Maio et al. 1995 24 Q2 Random Map topologies
GA Murthy and Chowdhury 1996 284 Q2 Random Real life data sets
GA Scheunders 1997 205 Q1 Random Image segmentation
GA Tseng and Yang 1997 46 B User defined Real life data sets
GA Tzes et al. 1998 44 Q1 Random DC-motor friction identification
GA Hanagandi and Nikolaou 1998 43 Q1 Random Real life data sets
GA Cucchiara 1998 31 Q2 Random Image segmentation
GA Cowgill et al. 1999 135 Q1 Random Real life data sets
GA Lozano and Larrañaga 1999 47 Q2 Random Real life data sets
GA Demiriz et al. 1999 281 B User defined Real life data sets
GA Maulik and Bandyopadhyay 2000 1011 Q1 Random Synthetic and real life data sets
GA Tseng and Yang 2001 201 Q1 Random Real life data sets
GA Bandyopadhyay and Maulik 2002 301 Q1 Random Satellite image of a part of the city of Mumbai
GA Turgut et al. 2002 129 B Random Mobile ad hoc networks
GA Li and Chiao 2003 16 Q1 User defined Image segmentation
Ant Colony Shelokar et al. 2004 356 Q1 User defined Real life data sets
GA Pakhira et al. 2005 160 Q1 User defined Synthetic and real life data sets
PSO Paterlini and Krink 2006 296 Q3 Random Synthetic and real life data sets
GA Laszlo and Mukherjee 2006 113 Q1 User defined Real life data sets
PSO Jarboui et al. 2007 93 Q1 Random Real life data sets
GA Bandyopadhyay et al. 2007 182 Q1 Random Image Segmentation
Bee Colony Fathian et al. 2007 205 Q1 Random Real life data sets
PSO Das et al. 2008 110 Q2 Random Synthetic and real life data sets
GA Qing et al. 2008 62 Q1 User defined Varied-line-spacing holographic gratings
GA Sheng et al. 2008 31 Q1 Random Hand written signature data
Ant Colony Korürek and Nizam 2008 47 Q1 Random ECG Signals
Table 2.3: List of ranked (based on citation reports) evolutionary algorithm-based clustering techniques from 1995-2015
Algorithms  Authors  Year of publication  Number of citations  Rank  Initial cluster number selection  Applications
PSO Zhao et al. 2009 34 Q2 Random Box–Jenkins gas furnace data set
PSO Yang et al. 2009 100 Q1 Random Synthetic and real life data sets
GA Chang et al. 2009 88 Q1 User defined Real life data sets
Ant Colony Ramos et al. 2009 23 Q1 Random Synthetic and real life data sets
Bee Colony Zhang et al. 2010 208 Q1 Random Real life data sets
PSO Tsai and Kao 2011 51 Q1 Random Synthetic and real life data sets
PSO Kalyani and Swarup 2011 60 Q1 Random Security assessment in power systems
GA Liu et al. 2011 43 Q1 Automatic Real life data sets
GA Yücenur and Demirel 2011 44 Q1 Random Real life data sets
PSO Chuang et al. 2011 46 Q1 Random Real life data sets
Firefly Senthilnath and Mani 2011 172 NA Random Real life data sets
Ant Colony Zhang and Cao 2011 32 Q1 Random Synthetic and real life data sets
Bee Colony Karaboga and Ozturk 2011 493 Q1 Random Real life data sets
Bee Colony Yan et al. 2012 62 Q2 User defined Real life data sets
PSO Sun et al. 2012 42 Q1 Random Synthetic and real life data sets
PSO Cura 2012 41 Q1 Random Synthetic and real life data sets
PSO Kuo et al. 2012 65 Q1 Random Real life data sets
Ant Colony Wan et al. 2012 22 Q1 Random Real life data sets
GA Agustín-Blas et al. 2012 52 Q1 Random Synthetic and real life data sets
Bee Colony Lei et al. 2013 10 Q1 Random MIPS data set
GA Chang et al. 2012 16 Q1 Random Real life data sets
GA Aalaei et al. 2013 2 Q1 User defined Mazandaran Gas Company in Iran
PSO Jiang et al. 2013 24 Q1 Random Real life data sets
Ant Colony Huang et al. 2013 21 Q1 Random Real life data sets
GA Mungle et al. 2013 15 Q1 Random Highway construction project
Ant Colony Zhang et al. 2013 7 Q1 Random Real life data sets
Bee Colony Banharnsakun et al. 2013 12 Q2 Random Real life data sets
Black hole Hatamlou 2013 103 Q1 Random Real life data sets
GA Festa 2013 6 Q3 User defined Real life data sets
Bee Colony Kuo et al. 2014 9 Q1 Random Real life data sets
PSO Cagnina et al. 2014 11 Q1 Random Short-text corpora
GA Wikaisuksakul 2014 13 Q1 Random Synthetic and real life data sets
GA Rahman and Islam 2014 12 Q1 Random Real life data sets
GA Peng et al. 2014 4 Q1 Random Real life data sets
Bee Colony Forsati et al. 2015 3 Q2 Random Real life Data sets
GA Hong et al. 2015 2 Q1 Random Real life data sets
Ant Colony İnkaya et al. 2015 8 Q1 Random Real life data sets
Bee Colony Ozturk et al. 2015 13 Q1 Random Image clustering
Bee Colony Algorithm-based Clustering Techniques
The bee colony approach produces optimal clustering solutions using the concept of a honey
bee swarm (Banharnsakun, Sirinaovakul, & Achalakul, 2013; Fathian, Amiri, & Maroosi, 2007;
Forsati, Keikha, & Shamsfard, 2015; Karaboga & Ozturk, 2011; Kuo, Huang, Lin, Wu, &
Zulvia, 2014; Lei, Tian, Ge, & Zhang, 2013; Menon & Ramakrishnan, 2015; Ozturk, Hancer,
& Karaboga, 2015; Yan, Zhu, Zou, & Wang, 2012; C. Zhang, Ouyang, & Ning, 2010). Honey bee swarms consist of three essential components: food sources, employed bees, and unemployed bees (C. Zhang et al., 2010). The employed bees are associated with a particular food source and share its information with onlooker bees. There are two types of unemployed bees: onlooker bees and scout bees.
Onlooker bees wait in the nest and find a food source based on the information shared by the employed bees. For each food source, there is only one employed bee. If the position of a food source does not improve within a predefined number of iterations, then the employed bee associated with that particular food source becomes a scout bee and starts a new random search.
In the bee colony algorithm, a food source represents a clustering solution. Initially, a number
of food sources are generated randomly. The food source is then improved through employed
bees, onlooker bees, and scout bees in each iteration. Based on fitness value, the best food source
is selected as the final clustering solution.
However, a limitation of many bee colony algorithm-based clustering techniques is that
a user needs to estimate in advance the number of clusters required to form a bee; a difficult
task. Some bee colony algorithm-based clustering techniques randomly generate the number of
clusters required to form a bee. However, the quality of the bee is unlikely to be high due to the
random selection process. We propose that commencing the iteration with high-quality bees can
result in better-quality final clustering solutions for a given number of iterations.
Particle Swarm Optimization (PSO) Algorithm-based Clustering Techniques
Particle swarm optimization (PSO) algorithm-based clustering techniques produce optimal
clustering solutions using the concept of a swarm (Cagnina, Errecalde, Ingaramo, & Rosso,
2014; L.-Y. Chuang, Hsiao, & Yang, 2011; Cura, 2012; Das, Abraham, & Konar, 2008; Jarboui,
Cheikh, Siarry, & Rebai, 2007; Jiang, Wang, & Wang, 2013; Kalyani & Swarup, 2011; Kuo et
al., 2012; Paterlini & Krink, 2006; Sun, Chen, Fang, Wun, & Xu, 2012; Tsai & Kao, 2011; F.
Yang, Sun, & Zhang, 2009; L. Zhao, Yang, & Zeng, 2009). The swarm is a collection of a
number of particles, where a particle is a clustering solution that contains a number of cluster
centers. Typically, a user defined number of records is selected as cluster centers, and these centers collectively form a particle (a clustering solution). The number of particles in a swarm varies from technique to technique, usually from 20 to 40 (Rahman & Islam, 2014).
quality of a particle is measured through its objective function/fitness value.
In PSO, there are a number of iterations that vary from application to application. In each
iteration, a particle moves towards its best previous position and towards the best particle in the
swarm (Cura, 2012; Das et al., 2008; Kuo et al., 2012). Here, a particle's position represents its cluster centers/seeds. Generally, in PSO, each particle has three properties: current position, current
velocity, and personal-best position. The best position of a particle represents cluster centers
that have the best fitness value of the particle. The current position of a particle represents the
cluster centers/seeds of a particle of the current iteration. The velocity of a particle represents
the speed at which a particle changes its current position. Typically, the velocity of a particle
depends on the difference between its current position and its personal-best position, and the
difference between its current position and the best position of any particle in the swarm (the
global best). Usually, each particle has a tendency to move towards both its personal-best
position and the swarm's global-best position. Hence, each particle of the swarm moves towards the best clustering solution over
iterations.
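To make this update rule concrete, a minimal Python sketch of one PSO iteration is given below; the inertia weight w and the acceleration coefficients c1 and c2 are illustrative choices rather than values prescribed by the cited techniques.

import numpy as np

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One PSO iteration: every particle (a flattened array of k cluster
    centers) moves towards its personal-best position and the swarm's
    global-best position."""
    r1 = np.random.rand(*positions.shape)   # random factors in [0, 1)
    r2 = np.random.rand(*positions.shape)
    velocities = (w * velocities
                  + c1 * r1 * (pbest - positions)    # pull towards personal best
                  + c2 * r2 * (gbest - positions))   # pull towards global best
    return positions + velocities, velocities

After each step, the fitness of every particle is re-evaluated and the personal-best and global-best positions are updated accordingly.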
However, a limitation of many PSO-based clustering techniques is that the number of clusters
is randomly generated to form a particle. The quality of the particles is unlikely to be high due
to the random selection process. We contend that the possession of high-quality particles at the
beginning of the iteration can result in better-quality final clustering solutions for a given
number of iterations.
Black hole Algorithm-based Clustering Techniques
The black hole algorithm (Hatamlou, 2013) produces optimal clustering solutions using the
concept of a black hole. A black hole is a region of space that contains so much concentrated
mass that there is no way for a nearby object – including particles and electromagnetic radiation
such as light – to escape its gravitational pull. The black hole algorithm (BH) commences with
an initial population of candidate solutions called stars. In the BH algorithm, at each iteration,
the best candidate solution is selected as the black hole, with the remainder forming
the other stars.
After initializing the population, the black hole starts drawing the stars towards it. If a star
comes into proximity of the black hole, then the star is swallowed by the black hole, and is gone
forever. In such a case, BH then generates a new star randomly, places it within the search space,
and starts a new search. The process is continued until it meets the termination condition (a
maximum number of iterations or a sufficiently good fitness is met).
The limitation of the black hole algorithm is that in each iteration some stars are deleted
based on the event horizon, and in order for the deleted stars to be replaced BH then generates
new stars randomly. We maintain that the quality of the new, randomly generated stars is
unlikely to be better than that of the deleted stars, which had already been improved over the
preceding iterations.
Firefly Algorithm-based Clustering Techniques
A firefly algorithm (FA) is a nature-inspired optimization algorithm that simulates the flashing
behavior of social insects (fireflies) (Abshouri & Bakhtiary, 2012; Hassanzadeh & Meybodi,
2012; Lei, Wang, Wu, Zhang, & Pedrycz, 2016; Mathew & Vijayakumar, 2014; Senthilnath,
Omkar, & Mani, 2011). In FA-based clustering, a firefly is a clustering solution that contains a
number of cluster centers, with records typically selected at random as cluster centers. The
centers of clusters collectively form a firefly/clustering solution. Moreover, the fireflies carry a
luminescence quality known as luciferin that emits light proportional to this value. Each firefly
is attracted to others by the glow or brightness.
Attractiveness decreases as the distance between fireflies increases. A firefly moves
randomly if no other firefly is bright enough to attract it. At the end of each iteration,
FA evaluates the fitness of each firefly and ranks the fireflies based on their fitness value;
selecting the firefly with the highest fitness value as the final clustering solution. The process is
continued until it meets the termination condition (i.e. a maximum number of iterations).
However, a limitation of the firefly algorithm is that it generates the number of clusters
randomly to form a firefly. Due to the random selection processes, the quality of the firefly is
unlikely to be high. We propose that having a high-quality firefly at the beginning of the iteration
can result in a better-quality final clustering solution for a given number of iterations.
Genetic Algorithm-based Clustering Techniques
Genetic algorithms (GA) are randomized search and optimization techniques based on the
concepts of Darwin’s law of evolution “Survival of the fittest in natural selection”, proposed by
John H. Holland (Holland, 1975). This algorithm simulates the biological structure of the
genetic evolution process. GA-based clustering techniques have a number of advantages: they
do not require any user defined "number of clusters" prior to the commencement of the
clustering process, and they address the local optima issue of the partition-based clustering
techniques.
In Table 2.4, an examination of some major GA-based clustering techniques of the last
twenty years is presented. A total of 45 ranked GA-based clustering approaches are reviewed,
which are used for real-life applications such as real-life data sets, highway construction
projects, gas companies, cellular networks, and satellite image segmentations (in this instance,
the term “ranked” is based on citation reports and JCR/CORE rank). Almost two-thirds of the
techniques do not require a user to define the number of clusters (Aalaei, Fazlollahtabar,
Mahdavi, Mahdavi-Amiri, & Yahyanejad, 2013; Abolhassani, Salt, & Dodds, 2004; Agustín-
Blas et al., 2012; S. Bandyopadhyay, Maulik, & Mukhopadhyay, 2007; Sanghamitra
Bandyopadhyay & Maulik, 2002; D.-X. Chang et al., 2009; D. Chang et al., 2012; Cheng, Lee,
& Wong, 2002; Chiou & Lan, 2001; Cowgill, Harvey, & Watson, 1999; Cucchiara, 1998;
Demiriz, Demiriz, Bennett, & Embrechts, 1999; Deng, He, & Xu, 2010; Dimopoulos & Mort,
2001; Festa, 2013; Garai & Chaudhuri, 2004; Hanagandi & Nikolaou, 1998; He & Tan, 2012;
T.-P. Hong, Chen, & Lin, 2015; Y. Hong & Kwong, 2008; M. Laszlo & Mukherjee, 2006;
Michael Laszlo & Mukherjee, 2007; C.-T. Li & Chiao, 2003; Lin Yu Tseng & Shiueng Bien
Yang, 1997; Y. Liu et al., 2011; Lozano & Larrañaga, 1999; Maio et al., 1995; Maulik &
Bandyopadhyay, 2000; Mungle, Benyoucef, Son, & Tiwari, 2013; Murthy & Chowdhury, 1996;
Neto, Meyer, & Jones, 2006; Pakhira, Bandyopadhyay, & Maulik, 2005; Peng et al., 2014; Qing,
Gang, Zaiyue, & Qiuping, 2008; Rahman & Islam, 2014; Scheunders, 1997; Sheng, Howells,
Fairhurst, & Deravi, 2008; Song, Li, & Park, 2009; Srikanth et al., 1995; Tseng & Yang, 2001;
Turgut, Das, Elmasri, & Turgut, 2002; Tzes, Pei-Yuan Peng, & Guthy, 1998; Wikaisuksakul,
2014; Xiao et al., 2010; Yücenur & Demirel, 2011).
Table 2.4: List of ranked (based on citation reports) GA-based clustering techniques from 1995-2015

| Authors | Year | Citations | Rank | Initial cluster number selection | Applications |
| Srikanth et al. | 1995 | 88 | Q2 | Random | Real life data sets |
| Maio et al. | 1995 | 24 | Q2 | Random | Map topologies |
| Murthy and Chowdhury | 1996 | 284 | Q2 | Random | Real life data sets |
| Scheunders | 1997 | 205 | Q1 | Random | Image segmentation |
| Tseng and Yang | 1997 | 46 | B | User defined | Real life data sets |
| Tzes et al. | 1998 | 44 | Q1 | Random | DC-motor friction identification |
| Hanagandi and Nikolaou | 1998 | 43 | Q1 | Random | Real life data sets |
| Cucchiara | 1998 | 31 | Q2 | Random | Image segmentation |
| Cowgill et al. | 1999 | 135 | Q1 | Random | Real life data sets |
| Lozano and Larrañaga | 1999 | 47 | Q2 | Random | Real life data sets |
| Demiriz et al. | 1999 | 281 | B | User defined | Real life data sets |
| Maulik and Bandyopadhyay | 2000 | 1011 | Q1 | Random | Synthetic and real life data sets |
| Chiou and Lan | 2001 | 108 | Q1 | User defined | Synthetic data sets |
| Tseng and Yang | 2001 | 201 | Q1 | Random | Real life data sets |
| Dimopoulos and Mort | 2001 | 75 | Q2 | Random | Cell-formation problems |
| Cheng et al. | 2002 | 99 | Q1 | Random | Data partitioning |
| Bandyopadhyay and Maulik | 2002 | 301 | Q1 | Random | Satellite image of Mumbai |
| Turgut et al. | 2002 | 129 | B | Random | Mobile ad hoc networks |
| Li and Chiao | 2003 | 16 | Q1 | User defined | Image segmentation |
| Garai and Chaudhuri | 2004 | 97 | Q2 | Random | Real life data sets |
| Abolhassani et al. | 2004 | 15 | Q1 | Random | Cellular networks |
| Pakhira et al. | 2005 | 160 | Q1 | User defined | Synthetic and real life data sets |
| Neto et al. | 2006 | 67 | Q2 | Random | Leaf image segmentation |
| Laszlo and Mukherjee | 2006 | 113 | Q1 | User defined | Real life data sets |
| Bandyopadhyay et al. | 2007 | 182 | Q1 | Random | Image segmentation |
| Laszlo and Mukherjee | 2007 | 115 | Q1 | Random | Real life data sets |
| Hong and Kwong | 2008 | 31 | Q1 | Random | Synthetic and real life data sets |
| Qing et al. | 2008 | 62 | Q1 | User defined | Varied-line-spacing holographic gratings |
| Sheng et al. | 2008 | 31 | Q1 | Random | Hand written signature data |
| Chang et al. | 2009 | 88 | Q1 | User defined | Real life data sets |
| Song et al. | 2009 | 75 | Q1 | Random | Real life data sets |
| Deng et al. | 2010 | 31 | Q1 | User defined | Real life data sets |
| Xiao et al. | 2010 | 42 | Q1 | Random | Real life data sets |
| Liu et al. | 2011 | 43 | Q1 | Automatic | Real life data sets |
| Yücenur and Demirel | 2011 | 44 | Q1 | Random | Real life data sets |
| He and Tan | 2012 | 41 | Q2 | Random | Real life data sets |
| Agustín-Blas et al. | 2012 | 52 | Q1 | Random | Synthetic and real life data sets |
| Chang et al. | 2012 | 16 | Q1 | Random | Real life data sets |
| Aalaei et al. | 2013 | 2 | Q1 | User defined | Mazandaran Gas Company |
| Festa | 2013 | 6 | Q3 | User defined | Real life data sets |
| Mungle et al. | 2013 | 15 | Q1 | Random | Highway construction project |
| Wikaisuksakul | 2014 | 13 | Q1 | Random | Synthetic and real life data sets |
| Rahman and Islam | 2014 | 12 | Q1 | Random | Real life data sets |
| Peng et al. | 2014 | 4 | Q1 | Random | Real life data sets |
| Hong et al. | 2015 | 2 | Q1 | Random | Real life data sets |
Steps of Genetic Algorithms
In GA, the roles of the initialization and recombination operators are very well-defined. The
initialization operator identifies the direction of search and the recombination operator generates
new regions for search (Sheikh, Raghuwanshi, & Jaiswal, 2008). GA starts a generation with an
initial population. The initial population is generated with a number of chromosomes, each
made up of a number of genes. For clustering, a chromosome is considered
to be a clustering solution, and a gene of a chromosome is considered to be the center of a
cluster.
Fig. 2.1: Basic steps of Genetic Algorithms (GA)
After initialization, in order to make a selection, an objective/fitness function is applied to
each chromosome to identify the goodness of the chromosome. Biologically inspired
operators (crossover and mutation) are then applied to the population in order to produce a
clustering solution. At the end of each generation, GA applies an elitist operation where the
newly generated populations are compared with the previous population. All these inter-related
parameters and operators influence the performance of a GA (Diaz-Gomez & Hougen, 2007;
Michael Laszlo & Mukherjee, 2007). The processes of selection, crossover, mutation, and elitist
operation are continued for a fixed number of generations or until a termination condition is
satisfied (Sheikh et al., 2008). GA consists of five main phases, namely initial population,
selection, crossover, mutation, and elitist operation (see Fig. 2.1). The main phases of genetic
algorithms will now be briefly explained, as follows.
Step 1: Initial Population
The first step of GA is initial population, the size of which is typically user defined. Each
individual in the initial population is represented as a chromosome. In GA, a chromosome
contains a set of genes, where a gene is a (real or pseudo) record. A gene is regarded as the
center of a cluster; and therefore a chromosome is considered to be a clustering solution. GA
generally contains many iterations/generations. Each generation typically contains a number of
chromosomes that are known as the population of the generation.
Usually, a set of records is randomly selected from a data set to form a chromosome (Md
Anisur Rahman, 2014; Mukhopadhyay & Maulik, 2009; Xiao et al., 2010). The number of
records in a chromosome can vary from 2 to 𝐾∗ + 1, where 𝐾∗ is a soft estimate of the
maximum number of clusters (Md Anisur Rahman, 2014; Mukhopadhyay & Maulik, 2009).
The number of genes in a chromosome can range from 2 to √𝑛, where 𝑛 is the number of records
in a data set (D.-X. Chang et al., 2009; He & Tan, 2012; Y. Liu et al., 2011; Md Anisur Rahman,
2014; Pakhira et al., 2005; Xiao et al., 2010).
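As a concrete illustration of this common initialization scheme, a minimal Python sketch (the function name and the use of numpy are ours) draws a random number of genes between 2 and √𝑛 and selects that many records as cluster centers:

import numpy as np

def random_chromosome(data, rng=None):
    """Form a chromosome: k records drawn from the data set act as genes
    (cluster centers), with k chosen randomly between 2 and sqrt(n)."""
    rng = rng or np.random.default_rng()
    n = len(data)
    k = rng.integers(2, max(3, int(np.sqrt(n)) + 1))  # 2 <= k <= sqrt(n)
    idx = rng.choice(n, size=k, replace=False)        # distinct records as genes
    return data[idx].copy()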
Step 2: Selection
For the next genetic operators of crossover and mutation, GA selects chromosomes based on
their fitness/objective function. Various methods are used to calculate fitness, such as the
Davies-Bouldin (DB) Index (D L Davies & Bouldin, 1979), Sum of the Squared Error (SSE)
(Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005), and Silhouette
Coefficient (Pang-Ning Tan, Michael Steinbach, 2005). These methods will be discussed in
Section 2.7. A proportional selection method [roulette wheel technique (D. Chang et al., 2012;
Maulik & Bandyopadhyay, 2000; Mukhopadhyay & Maulik, 2009)] is used to choose
chromosomes. An existing technique known as AGCUK (Y. Liu et al., 2011) uses noise-based
selection to choose chromosomes.
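A minimal Python sketch of roulette wheel (proportional) selection, assuming strictly positive fitness values:

import numpy as np

def roulette_wheel(population, fitness, rng=None):
    """Pick one chromosome with probability proportional to its fitness."""
    rng = rng or np.random.default_rng()
    p = np.asarray(fitness, dtype=float)
    p = p / p.sum()                       # normalize fitness into probabilities
    i = rng.choice(len(population), p=p)  # spin the wheel
    return population[i]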
Step 3: Crossover
Crossover is an important step in GA, where a pair of chromosomes swaps its segments/genes
and generates a pair of offspring chromosomes. Many types of selection criteria, including the
roulette wheel (D. Chang et al., 2012; Maulik & Bandyopadhyay, 2000; Mukhopadhyay &
Maulik, 2009), rank-based wheel (Agustín-Blas et al., 2012), and random selection (D.-X.
Chang et al., 2009) are used to select a chromosome pair for a crossover operation.
Some GA-based clustering techniques (D. Chang et al., 2012; Maulik & Bandyopadhyay,
2000; Mukhopadhyay & Maulik, 2009) use roulette wheel selection, where the best
chromosome (available in the current population) is chosen as the first chromosome of the pair,
while the second chromosome of the pair is selected using the roulette wheel technique. Agustín-
Blas et al. (2012) use rank-based wheel selection where chromosomes are sorted based on their
quality, and then a pair of chromosomes is chosen based on the rank of a chromosome.
Once the pair of chromosomes are selected, GA then applies a crossover operation on each
pair of chromosomes. There are many approaches for performing crossover on a pair of
chromosomes, such as single-point (Sanghamitra Bandyopadhyay & Maulik, 2002; D. Chang
et al., 2012; Garai & Chaudhuri, 2004; Michael Laszlo & Mukherjee, 2007; Pakhira et al., 2005;
Peng et al., 2014; Rahman & Islam, 2014; Song et al., 2009), multi-point (Agustín-Blas et al.,
2012), arithmetic (Yan et al., 2012), path-based (D.-X. Chang et al., 2009), and heuristic (D.-X.
Chang et al., 2009) crossover.
In single-point crossover, each chromosome of a pair is divided into two parts at a random
point between two genes. The left-hand portion (having one or more genes) of one chromosome
of the pair joins the right-hand portion (having one or more genes) of the other chromosome
to form an offspring chromosome (Rahman & Islam, 2014). In multi-point crossover, each
chromosome of a pair is divided into multiple parts, which are then swapped with each other to
generate new offspring chromosomes. In path-based crossover (D.-X. Chang et al., 2009), two
parent chromosomes create a path between them, from which two points are selected as
offspring chromosomes. Heuristic crossover (D.-X. Chang et al., 2009) uses the fitness values
of the two parents, where the worse parent moves slightly towards the better parent.
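As an illustration, a minimal Python sketch of single-point crossover on a pair of variable-length chromosomes (stored as lists of genes) is given below; drawing an independent random cut point in each parent is an assumption made for illustration.

import random

def single_point_crossover(parent1, parent2):
    """Swap the tails of two chromosomes at random cut points, producing
    two offspring; each parent must have at least two genes."""
    c1 = random.randint(1, len(parent1) - 1)   # cut inside parent 1
    c2 = random.randint(1, len(parent2) - 1)   # cut inside parent 2
    child1 = parent1[:c1] + parent2[c2:]       # left of p1 + right of p2
    child2 = parent2[:c2] + parent1[c1:]       # left of p2 + right of p1
    return child1, child2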
Step 4: Mutation
Mutation operation randomly changes (Agustín-Blas et al., 2012; D.-X. Chang et al., 2009; D.
Chang et al., 2012; Y. Liu et al., 2011; Maulik & Mukhopadhyay, 2010; Rahman & Islam, 2014)
one or more genes (seeds) of a chromosome with a probability equal to the mutation rate. There
are many approaches for mutation such as division and absorption (Agustín-Blas et al., 2012;
D. Chang et al., 2012; Y. Liu et al., 2011), insertion (D. Chang et al., 2012), deletion (D. Chang
et al., 2012), perturbation (D. Chang et al., 2012) and movement (D. Chang et al., 2012). The
division operation divides one cluster of a chromosome into two clusters. The absorption
operation merges two clusters of a chromosome into one cluster.
The perturb mutator (D. Chang et al., 2012) randomly selects a cluster center and changes
the coordinates of the center, while the insert mutator (D. Chang et al., 2012) randomly generates
a center from the data set and inserts it into the chromosome. The delete mutator (D. Chang et
al., 2012) deletes a randomly selected center of the chromosome, while the move mutator (D.
Chang et al., 2012) transfers one record from one cluster to another cluster and re-computes the
cluster center of the chromosome.
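As an illustration, minimal Python sketches of three of these mutators (perturb, insert and delete) are given below, with a chromosome stored as a list of center vectors; the perturbation scale of 0.05 is an illustrative assumption.

import numpy as np

def perturb(chrom, rng, scale=0.05):
    """Perturb mutator: nudge the coordinates of one randomly chosen center."""
    i = rng.integers(len(chrom))
    chrom[i] = chrom[i] + rng.normal(0.0, scale, size=chrom[i].shape)
    return chrom

def insert_gene(chrom, data, rng):
    """Insert mutator: add a randomly chosen record as a new center."""
    chrom.append(data[rng.integers(len(data))].copy())
    return chrom

def delete_gene(chrom, rng):
    """Delete mutator: remove one randomly chosen center (keep at least 2)."""
    if len(chrom) > 2:
        chrom.pop(int(rng.integers(len(chrom))))
    return chrom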
Step 5: Elitist Operation
The elitist operation (D.-X. Chang et al., 2009; D. Chang et al., 2012; Y. Liu et al., 2011;
Rahman & Islam, 2014) preserves the best chromosome obtained so far at any stage (i.e. the
iteration) of the GA, and passes it to the next generation in order to ensure that the best
achievement at that time does not get lost during genetic operations. If the fitness of the worst
chromosome $P_w^i$ of the current ($i$th) generation is less than the fitness of the best
chromosome $P_b$ found so far over all previous generations, then $P_w^i$ is replaced
with $P_b$. Moreover, if the fitness of the best chromosome $P_b^i$ of the $i$th generation is
greater than that of $P_b$, then $P_b$ is replaced by $P_b^i$.
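A minimal Python sketch of this elitist rule, assuming a "higher is better" fitness, is as follows:

def elitist_operation(population, fitness, best_so_far, best_fitness):
    """Replace the worst chromosome of the current generation with the best
    chromosome found so far, then update the best-so-far record."""
    worst = min(range(len(population)), key=lambda i: fitness[i])
    if fitness[worst] < best_fitness:                 # preserve P_b in the population
        population[worst], fitness[worst] = best_so_far, best_fitness
    best = max(range(len(population)), key=lambda i: fitness[i])
    if fitness[best] > best_fitness:                  # update P_b if improved
        best_so_far, best_fitness = population[best], fitness[best]
    return population, fitness, best_so_far, best_fitness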
Termination Condition
Typically, GA-based clustering techniques terminate when they meet the user defined number
of iterations/generations or if there is no improvement in the chromosomes of the current
generation compared to the previous generation (D.-X. Chang et al., 2009; Md Anisur Rahman,
2014). At the end of all generations, GA-based clustering techniques select the best
chromosome as the final clustering solution. The genes of the best chromosome represent the
cluster centers, and records are allocated to their closest seeds to form the final clusters.
However, GA-based clustering techniques have some limitations. Many existing GA-based
clustering techniques (Y. Liu et al., 2011; Maio et al., 1995; Maulik & Bandyopadhyay, 2000;
Xiao et al., 2010) randomly generate the number of genes of a chromosome in the population
initialization phase. They also randomly choose records as genes, instead of carefully choosing
genes of a chromosome; this is significant, given that a carefully selected initial population can
improve final clustering results. However, some existing techniques – such as GenClust – do
carefully select a high-quality initial population, but with a high complexity of 𝑂(𝑛²), where 𝑛 is
the number of records in a data set. Unfortunately, GenClust also requires user input on the
radius values for the clusters in the initial population selection. It can be very difficult
for a user to estimate the set of radius values (i.e. radii).
2.6 Distance Calculation
A data set typically consists of numerical and/or categorical attributes, with different distance
calculations required for each. Therefore, in this thesis, distance calculation is divided into two
categories:
Distance Calculation for numerical attributes; and
Distance Calculation for categorical attributes.
2.6.1 Distance Calculation for Numerical Attributes
Prior to making a distance calculation, the domain values of a numerical attribute are typically
normalized in the range between 0 and 1 in order to weigh each attribute equally, regardless of
domain size. Many different approaches have been proposed to calculate the distance between
two domain values of a numerical attribute. A number of different distance calculation
approaches for numerical attributes are listed as follows:
Minkowski Distance;
Manhattan Distance;
Euclidean Distance;
Chebyshev Distance;
Cosine Distance; and
Jaccard Distance.
Distance calculations for the attributes of a data set are commonly used in various data
mining approaches, including clustering. In this thesis, we use the Euclidean distance metric to
calculate the distance between two domain values of a numerical attribute.
Minkowski Distance
Minkowski distance is a generalized distance metric used in clustering to calculate the distance
between two domain values of a numerical attribute (Han & Kamber, 2006; Md Anisur
Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005; Schulz, 2008; Teknomo, 2015b).
Let us consider that $\tau$ is a positive integer, the number of attributes in a data set is $m$, the $a$th attribute value of the $i$th record is $R_{i,a}$, and $s_{j,a}$ is the $a$th attribute value of the seed of the $j$th cluster. The Minkowski distance ($d_{ij}$) between the $i$th record and the seed of the $j$th cluster can be calculated as follows:

$$d_{ij} = \left( \sum_{a=1}^{m} \left| R_{i,a} - s_{j,a} \right|^{\tau} \right)^{1/\tau} \quad \text{(Eq. 2.2)}$$
Manhattan Distance
Manhattan distance is a special case of the Minkowski distance. It is also called city block
distance, taxicab norm or L1 norm distance (Han & Kamber, 2006; Md Anisur Rahman, 2014;
Pang-Ning Tan, Michael Steinbach, 2005; Schulz, 2008). In the Minkowski distance equation,
setting 𝜏 = 1 gives the Manhattan distance.
Euclidean Distance
Euclidean distance is also a special case of the Minkowski distance. It is frequently used in many
clustering techniques to calculate the distance between two domain values of a numerical
attribute (Han & Kamber, 2006; Md Anisur Rahman, 2014; Pang-Ning Tan, Michael
Steinbach, 2005; Teknomo, 2015b). In the Minkowski distance equation, setting 𝜏 = 2 gives the
Euclidean distance, which is also called the L2 norm distance.
Chebyshev Distance
Chebyshev distance is another special case of the Minkowski distance for numerical attributes
(Han & Kamber, 2006; Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005;
Schulz, 2008; Teknomo, 2015b). In the Minkowski distance equation, letting 𝜏 → ∞ gives the
Chebyshev distance, which is also called the L∞ (maximum) norm distance.
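As the three special cases above are all instances of Eq. 2.2, they can be illustrated with a minimal Python sketch (the sample records are arbitrary normalized values):

import numpy as np

def minkowski(r, s, tau):
    """Minkowski distance of order tau between two normalized records (Eq. 2.2)."""
    return np.sum(np.abs(r - s) ** tau) ** (1.0 / tau)

r = np.array([0.2, 0.4, 0.9])
s = np.array([0.1, 0.7, 0.5])
print(minkowski(r, s, 1))          # tau = 1: Manhattan distance
print(minkowski(r, s, 2))          # tau = 2: Euclidean distance
print(np.max(np.abs(r - s)))       # tau -> infinity: Chebyshev distance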
Cosine Distance
Let us consider, 𝑅𝑖 is the 𝑖𝑡ℎ record of a data set and 𝑠𝑗 is the seed of the 𝑗𝑡ℎ cluster. The cosine
similarity (𝛱) between 𝑅𝑖 and 𝑠𝑗 can be calculated as follows (A. Huang, 2008; Md Anisur
Rahman, 2014):
$$\Pi_{ij} = \frac{R_i \cdot s_j}{|R_i| \times |s_j|} \quad \text{(Eq. 2.3)}$$

The cosine distance ($\varsigma_{ij}$) between $R_i$ and $s_j$ can be calculated as follows:

$$\varsigma_{ij} = 1 - \Pi_{ij} \quad \text{(Eq. 2.4)}$$
Jaccard Distance
Let us consider that $R_i$ is the $i$th record of a data set where $R_i = \{1, 5, 3, 2\}$ and $s_j$ is the seed of the $j$th cluster where $s_j = \{3, 2, 2, 1\}$. The Jaccard coefficient ($F_{ij}$) between $R_i$ and $s_j$ can be calculated as follows (Md Anisur Rahman, 2014; Teknomo, 2015a):

$$F_{ij} = \frac{|R_i \cap s_j|}{|R_i \cup s_j|} = \frac{3}{4} = 0.75 \quad \text{(Eq. 2.5)}$$

The Jaccard distance ($J_{ij}$) between $R_i$ and $s_j$ can be calculated as follows:

$$J_{ij} = 1 - F_{ij} \quad \text{(Eq. 2.6)}$$

The Jaccard distance for the above example is $J_{ij} = 1 - 0.75 = 0.25$.
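A minimal Python sketch of Eqs. 2.3-2.6, reproducing the worked Jaccard example above:

import numpy as np

def cosine_distance(r, s):
    """Eq. 2.4: one minus the cosine similarity of Eq. 2.3."""
    sim = np.dot(r, s) / (np.linalg.norm(r) * np.linalg.norm(s))
    return 1.0 - sim

def jaccard_distance(r, s):
    """Eq. 2.6: one minus the Jaccard coefficient of the two value sets."""
    a, b = set(r), set(s)
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance([1, 5, 3, 2], [3, 2, 2, 1]))   # 1 - 3/4 = 0.25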
2.6.2 Distance Calculation for Categorical Attributes
While it is evident that a number of formulae have been developed for the distance calculation
of numerical attributes, the creation of formulae for the distance calculation of values of
categorical attributes has received somewhat less attention (Md Anisur Rahman, 2014; C. Wang
et al., 2011). Generally, the distance between two domain values of a categorical attribute is
either regarded as zero (if the two values are identical) or one (if the two values are dissimilar)
(Z. Huang, 1997; Ji, Pang, Zhou, Han, & Wang, 2012; Md Anisur Rahman, 2014). However,
considering the distance between two domain values of a categorical attribute as either zero or
one may not be sensible. The distance between two domain values of a categorical attribute can
instead be measured, based on similarity (Islam & Brankovic, 2011; Md Anisur Rahman, 2014).
Typically, the similarity between two domain values of a categorical attribute is calculated based
on their co-appearance (relation) with the domain values of other categorical attributes in the
records of the whole data set (Ganti, Gehrket, & Ramakrishnant, 1999; H. Giggins & Brankovic,
2012; H. P. Giggins, 2009; Md Anisur Rahman, 2014).
An existing technique called VICUS (H. Giggins & Brankovic, 2012; H. P. Giggins, 2009)
calculates the distance between two domain values of a categorical attribute based on their similarity. VICUS
measures the similarity between two domain values of a categorical attribute based on their co-
appearances with the domain values of other categorical attributes. VICUS converts the data set
into a graph where all the domain values of categorical attributes are considered as the vertices
of the graph. It uses the co-appearance of two attribute values for drawing the edges between
the vertices delineating the values.
Let us consider that $S'_{a,b}$ is the similarity between two domain values $a$ and $b$ of a categorical attribute, $d(a)$ is the degree of the attribute value $a$ (i.e. the number of other attribute values that co-appear with $a$ in the whole data set), $e_{ak}$ is the number of edges between two attribute values $a$ and $k$ (i.e. the number of times the two categorical values $a$ and $k$ co-appear in the complete data set), and $l$ is the total number of domain values over all attributes (excluding the values $a$ and $b$). The similarity between the two categorical attribute values $a$ and $b$ can then be computed as follows (H. Giggins & Brankovic, 2012; Md Anisur Rahman, 2014):

$$S'_{a,b} = \frac{\sum_{k=1}^{l} \sqrt{e_{ak} \times e_{bk}}}{\sqrt{d(a) \times d(b)}} \quad \text{(Eq. 2.7)}$$
However, Rahman (2014) advised that if a data set has both the numerical and categorical
attributes, the numerical attribute values can be categorized first, and then the similarity between
two categorical attribute values can be measured based on both the categorical and numerical
(categorized) attribute values.
Similarly, a few other existing techniques (Ahmad & Dey, 2007b; Cost & Salzberg, 1993;
Ganti et al., 1999) calculate the distance between two categorical attribute values based on their co-
appearance with the domain values of other attribute values. However, the similarity between
the domain values of two categorical attributes can be measured, not only based on their co-
appearance, but also on the frequency of the domain values throughout the data set (C. Wang et
al., 2011). The similarity measured based on the frequency of an attribute's values is known as
the intra-coupled attribute value similarity ($\delta_i$), while the similarity measured based on
co-appearance with the domain values of other attributes is known as the inter-coupled attribute
value similarity ($\delta_o$). The overall similarity between two categorical attribute values is
measured as follows:

$$\delta = \delta_i \times \delta_o \quad \text{(Eq. 2.8)}$$
The intra-coupled attribute value similarity (𝛿𝑖) relies on the frequency of an attribute’s
values. If we consider that $a$ and $b$ are the domain values of attribute $A$, and the frequencies of these attribute values are $f_a$ and $f_b$ respectively, then the intra-coupled similarity between $a$ and $b$ can be measured as follows:

$$\delta_i(a, b) = \frac{|f_a| \times |f_b|}{|f_a| + |f_b| + |f_a| \times |f_b|} \quad \text{(Eq. 2.9)}$$
The inter-coupled attribute value similarity (𝛿𝑜) is measured with regard to its co-appearance
with other attribute values. For example, let $\partial$ be the set of domain values of all other attributes, where $X \subseteq \partial$ and $Y = \partial \setminus X$. If $a$ and $b$ are the domain values of attribute $A_1$, and $\tau(X|a)$ denotes the conditional probability of $X$ given $a$, then the inter-coupled similarity between $a$ and $b$ is computed with regard to the other attributes as follows:

$$\delta_o(a, b) = 2 - \min_{X \subseteq \partial} \{2 - \tau(X|a) - \tau(Y|b)\} \quad \text{(Eq. 2.10)}$$
The distance between two domain values of a categorical attribute can then be computed as $1 - \delta$.
2.7 Cluster Evaluation Techniques
To measure the quality of a clustering solution, an evaluation technique is required. Typically,
the quality of a clustering solution is measured using internal cluster evaluation criteria and
external cluster evaluation criteria. Therefore, in this thesis, cluster evaluation techniques are divided
into two categories:
Internal cluster evaluation criteria; and
External cluster evaluation criteria.
2.7.1 Internal Cluster Evaluation Techniques
Internal cluster evaluation criteria are also known as the unsupervised measurement of
clustering quality (Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005), which
allow the goodness of a cluster to be evaluated without any external information such as class
values (labels) of the records. A selection of internal cluster evaluation techniques is listed as
follows:
Sum of Square Error (SSE);
Davies-Bouldin (DB) Index;
Silhouette Coefficient;
Xie-Beni Index; and
Dunn Index.
Sum of Square Error (SSE)
The sum of square error (SSE) measures cluster compactness (Md Anisur Rahman, 2014;
Pang-Ning Tan, Michael Steinbach, 2005). Simple K-means uses SSE as its objective function.
Many GA-based clustering techniques (D.-X. Chang et al., 2009; Michael Laszlo & Mukherjee,
2007) also use SSE as the fitness function. A lower SSE value indicates a better clustering result.
If 𝑘 is the number of clusters, 𝑠𝑗 is the seed of 𝑗𝑡ℎ cluster (𝑐𝑗), and 𝑑𝑖𝑠𝑡 (𝑠𝑗, 𝑥) is the distance
between a record 𝑥 and seed (𝑠𝑗) of cluster 𝑐𝑗, then SSE is calculated as follows:
$$SSE = \sum_{j=1}^{k} \sum_{x \in c_j} dist(s_j, x)^2 \quad \text{(Eq. 2.11)}$$
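A minimal Python (numpy) sketch of Eq. 2.11, where labels[i] gives the cluster index of the i-th record:

import numpy as np

def sse(data, seeds, labels):
    """Eq. 2.11: sum of squared Euclidean distances of each record
    to the seed of its own cluster (lower is better)."""
    diffs = data - seeds[labels]              # each record minus its cluster seed
    return np.sum(np.sum(diffs ** 2, axis=1))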
Davies-Bouldin (DB) Index
The basic premise of the Davies-Bouldin (DB) Index (D. L. Davies & Bouldin, 1979) is to
minimize the intra-cluster distance while maximizing the inter-cluster distance (Agustín-
Blas et al., 2012). The DB index calculates the ratio of the sum of within-cluster scatter to
between-cluster separation (D. L. Davies & Bouldin, 1979; Y. Liu et al., 2011; Md Anisur
Rahman, 2014). Many GA-based clustering techniques use the DB Index (Y. Liu et al., 2011;
Xiao et al., 2010) as a fitness function. If $s_j$ is the seed of the $j$th cluster ($c_j$) then the scatter ($n_{j,q}$) is calculated as follows:

$$n_{j,q} = \left( \frac{1}{|c_j|} \sum_{x \in c_j} \left\| x - s_j \right\|_2^q \right)^{1/q} \quad \text{(Eq. 2.12)}$$
If $s_i$ is the seed of the $i$th cluster ($c_i$) and $s_j$ is the seed of the $j$th cluster ($c_j$), then the distance between them is $d_{ij,t} = \|s_i - s_j\|_t$. The DB Index of $K$ clusters is computed as follows:

$$R_{i,qt} = \max_{j, j \neq i} \left\{ \frac{n_{i,q} + n_{j,q}}{d_{ij,t}} \right\} \quad \text{(Eq. 2.13)}$$

$$DB = \frac{1}{K} \sum_{i=1}^{K} R_{i,qt} \quad \text{(Eq. 2.14)}$$
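A minimal Python sketch of Eqs. 2.12-2.14, using the common choice q = t = 2 (an illustrative assumption); it assumes every cluster has at least one record:

import numpy as np

def db_index(data, seeds, labels):
    """Davies-Bouldin index with q = t = 2: mean over clusters of the worst
    ratio of summed scatters to seed separation (lower is better)."""
    k = len(seeds)
    # Eq. 2.12: per-cluster scatter (root mean squared distance to the seed)
    scatter = np.array([
        np.sqrt(np.mean(np.sum((data[labels == j] - seeds[j]) ** 2, axis=1)))
        for j in range(k)])
    ratios = np.zeros(k)
    for i in range(k):
        for j in range(k):
            if i != j:
                d = np.linalg.norm(seeds[i] - seeds[j])                    # d_ij
                ratios[i] = max(ratios[i], (scatter[i] + scatter[j]) / d)  # Eq. 2.13
    return ratios.mean()                                                   # Eq. 2.14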
Silhouette Coefficient
The Silhouette Coefficient evaluates cluster quality by comparing the distances among the records within a cluster with the distances between those records and the records of other clusters (Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005). Let us consider that $a_i$ is the average distance of the $i$th record of a cluster $c_j$ to all other records belonging to the same cluster $c_j$. For every other cluster $c_{k \neq j}$, the average distance between the $i$th record and all records of that cluster is computed, and $b_i$ is the minimum of these average distances over all other clusters. Then the Silhouette Coefficient ($S_i$) of the $i$th record is calculated as follows:

$$S_i = \frac{b_i - a_i}{\max(a_i, b_i)} \quad \text{(Eq. 2.15)}$$
The Silhouette Coefficient of a cluster 𝑐𝑗 is computed by simply taking the average
coefficients of all records belonging to the cluster 𝑐𝑗. An overall silhouette coefficient of
clustering (i.e. all clusters produced by a technique) can be obtained by computing the average
silhouette coefficient of all clusters 𝑐𝑗 , ∀j. The value of the silhouette coefficient can vary from
-1 to +1. A higher value of silhouette coefficient represents a better quality of clustering.
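For illustration, scikit-learn exposes this measure directly; a minimal usage sketch on a toy data set (the values are arbitrary):

import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0.1, 0.2], [0.15, 0.22], [0.9, 0.8], [0.88, 0.85]])
labels = np.array([0, 0, 1, 1])
print(silhouette_score(X, labels))   # close to +1 for well-separated clusters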
Xie-Beni Index
The Xie-Beni Index (XB) is a function of the ratio of the variation to the separation of the clusters (Maulik & Mukhopadhyay, 2010; Md Anisur Rahman, 2014; Xie & Beni, 1991). A lower XB Index value indicates a better quality of clustering. Let us consider that a data set $D$ has $n$ records, the fuzzy membership degree of the $i$th record in the $j$th cluster is $\mu_{j,i}$, the seed of the $j$th cluster is $s_j$, $R_i$ is the $i$th record, and $\delta(s_j, R_i)$ is the distance between the record $R_i$ and the seed $s_j$ of the $j$th cluster. The variation of the clusters can be calculated as follows:

$$\vartheta = \sum_{j=1}^{k} \sum_{i=1}^{n} \mu_{j,i}^2 \, \delta^2(s_j, R_i) \quad \text{(Eq. 2.16)}$$
If $\delta(s_j, s_l)$ is the distance between the $j$th and $l$th seeds of the clusters, then the separation can be calculated as follows:

$$\varphi = \min_{j \neq l} \{\delta^2(s_j, s_l)\} \quad \text{(Eq. 2.17)}$$

The XB Index of the clusters can be calculated as follows:

$$XB = \frac{\vartheta}{n \varphi} \quad \text{(Eq. 2.18)}$$
Dunn Index
The Dunn Index (DI) is a function of the ratio of the minimal between-cluster (inter-cluster) distance to the maximal within-cluster (intra-cluster) distance (J. C. Dunn, 1974; Y. Liu et al., 2011; Peng et al., 2014). A higher DI value indicates a better clustering result. If $\Delta_{\min}$ is the minimum between-cluster distance and $\Delta_{\max}$ is the maximum within-cluster distance, then DI can be calculated as follows:

$$DI = \frac{\Delta_{\min}}{\Delta_{\max}} \quad \text{(Eq. 2.19)}$$
2.7.2 External Cluster Evaluation Techniques
External cluster evaluation criteria are also called the supervised measures of clusters (Md
Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005); they allow the goodness of
a cluster to be evaluated based on external information such as the class values (labels) of the
records. A selection of external cluster evaluation techniques is listed as follows:
F-measure;
Purity; and
Entropy.
F-measure
F-measure is a combination of precision and recall (K.-T. Chuang & Chen, 2004; Kashef &
Kamel, 2009; Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005). Let us
consider that $R_{i,j}$ is the number of records belonging to a cluster $C_i$ that have the class value $j$, and $R_i$ is the number of records in $C_i$. The precision $\Upsilon(i,j)$ of cluster $C_i$ with regard to the class value $j$ can be computed as follows (Kashef & Kamel, 2009; Md Anisur Rahman, 2014):

$$\Upsilon(i, j) = \frac{R_{i,j}}{R_i} \quad \text{(Eq. 2.20)}$$
If $R_j$ is the number of records having the class value $j$ in the whole data set, then the recall $\delta(i,j)$ of cluster $C_i$ with respect to the class value $j$ can be calculated as follows:

$$\delta(i, j) = \frac{R_{i,j}}{R_j} \quad \text{(Eq. 2.21)}$$
The F-measure $FM(i,j)$ of cluster $C_i$ with respect to the class value $j$ can be computed as follows:

$$FM(i, j) = \frac{(\beta^2 + 1) \times \Upsilon(i,j) \times \delta(i,j)}{\beta^2 \times \Upsilon(i,j) + \delta(i,j)} \quad \text{(Eq. 2.22)}$$

Generally, the value of $\beta$ is 1 (Md Anisur Rahman, 2014). A higher F-measure value represents a better clustering result, with the value of the F-measure ranging between 0 and 1.
Purity
The purity of a cluster is measured to evaluate the correctness of a cluster with respect to the class values. If $j$ is a class value, $\tau_{i,j}$ is the probability that a record of the $i$th cluster has the class value $j$. The purity $\vartheta_i$ of the $i$th cluster $C_i$ can then be calculated as follows (Md Anisur Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005):

$$\vartheta_i = \max_j(\tau_{i,j}) \quad \text{(Eq. 2.23)}$$

If $R_i$ is the number of records in the $i$th cluster, and $R_{i,j}$ is the number of records in the $i$th cluster which have the class value $j$, then the probability $\tau_{i,j}$ can be calculated as follows:

$$\tau_{i,j} = \frac{R_{i,j}}{R_i} \quad \text{(Eq. 2.24)}$$
If the total number of records in the data set is $n$, then the overall purity ($PT$) for $k$ clusters can be computed as follows:

$$PT = \sum_{i=1}^{k} \frac{R_i}{n} \, \vartheta_i \quad \text{(Eq. 2.25)}$$
A higher value of purity represents better clustering results. The value of purity varies
between 0 and 1.
Entropy
In a similar way to purity, the entropy of a cluster is measured to evaluate the correctness of a
cluster with respect to the class value (Md Anisur Rahman, 2014; Pang-Ning Tan, Michael
Steinbach, 2005). The entropy ϱi of the 𝑖𝑡ℎ cluster can be calculated as follows (Md Anisur
Rahman, 2014; Pang-Ning Tan, Michael Steinbach, 2005):
$$\varrho_i = - \sum_{j=1}^{d} \tau_{i,j} \log_2 \tau_{i,j} \quad \text{(Eq. 2.26)}$$
Here, $\tau_{i,j}$ is calculated using Eq. 2.24 and $d$ is the domain size of the class attribute.
The overall entropy ($eT$) for $k$ clusters can be calculated as follows:

$$eT = \sum_{i=1}^{k} \frac{R_i}{n} \, \varrho_i \quad \text{(Eq. 2.27)}$$
A lower value of entropy represents better clustering quality. The value of entropy varies
between 0 and $\log_2 d$.
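A minimal Python sketch of purity (Eqs. 2.23-2.25) and entropy (Eqs. 2.26-2.27), given the cluster label and the class label of each record:

import numpy as np
from collections import Counter

def purity_and_entropy(clusters, classes):
    """Overall purity (higher is better) and entropy (lower is better)."""
    n, pt, et = len(classes), 0.0, 0.0
    for c in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == c]
        tau = np.array(list(Counter(members).values())) / len(members)   # Eq. 2.24
        pt += len(members) / n * tau.max()                    # Eqs. 2.23 and 2.25
        et += len(members) / n * -(tau * np.log2(tau)).sum()  # Eqs. 2.26 and 2.27
    return pt, et

print(purity_and_entropy([0, 0, 1, 1, 1], ['a', 'a', 'a', 'b', 'b']))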
2.8 Summary
In this chapter, we first introduced a data set with its notations and definitions, and then offered
a short introduction to data mining, machine learning, clustering, applications and categories of
clustering, different types of distance calculations, and cluster evaluation techniques. We also
discussed the strengths and weaknesses of currently used clustering techniques, with Table 2.5
providing a summary of these.
Table 2.5: Advantages and limitations of currently used clustering techniques

Partition-based Clustering Techniques (Ahmad & Dey, 2007a; Han & Kamber, 2006; Pang-Ning Tan, Michael Steinbach, 2005)
Advantages: Time complexity is comparatively low, 𝑂(𝑛). Capable of separating overlapping clusters.
Limitations: Require a user to define various inputs, including the number of clusters 𝑘, in advance. Most techniques select the initial seeds randomly. The objective function tends to get stuck at local optima. Most techniques cannot process data sets having both categorical and numerical attributes.

Hierarchical Clustering Techniques (Han & Kamber, 2006; Pang-Ning Tan, Michael Steinbach, 2005)
Advantages: Do not require any user input on the number of clusters 𝑘. Autonomous to the initial conditions.
Limitations: Time complexity is comparatively high, 𝑂(𝑛³). May fail to separate overlapping clusters.

Density-based Clustering Techniques
Advantages: Select high-quality and densest initial seeds from the data set to produce clusters.
Limitations: Require a user to define various inputs, including the number of radii.

Graph-based Clustering Techniques (Z. Chen & Ji, 2010; Pang-Ning Tan, Michael Steinbach, 2005; Schaeffer, 2007; Zhong et al., 2010)
Advantages: Capable of detecting clusters of arbitrary shape.
Limitations: Require a similarity function to be selected from the wide range of available similarity functions. Require a similarity graph algorithm to be selected.

Grid-based Clustering Techniques (Han & Kamber, 2006; W.M. Ma & Chow, 2004; W. Wang et al., 1997)
Advantages: Time complexity is comparatively low, 𝑂(𝑛).
Limitations: Require huge amounts of memory if the number of cells is high.

Spectral Clustering Techniques (X. Hong et al., 2014; Matthias & Juri, 2009; Nascimento & de Carvalho, 2011; von Luxburg, 2007)
Advantages: Capable of detecting clusters of arbitrary shape.
Limitations: Require a user to define various inputs, including the number of clusters 𝑘, in advance. Require a similarity function to be selected from the wide range of available similarity functions. Require a similarity graph algorithm to be selected.

Model-based Clustering Techniques (Han & Kamber, 2006; Pang-Ning Tan, Michael Steinbach, 2005; Roy & Parui, 2014)
Advantages: Optimise the fit between the data and a mathematical model.
Limitations: Consider the probability distribution to be the same for each cluster. May get stuck at local optima.

Ant Colony Algorithm-based Clustering Techniques (İnkaya et al., 2015; Korürek & Nizam, 2008; Ramos et al., 2009; Shelokar et al., 2004)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process.
Limitations: Some techniques require a user to define various inputs in advance, including the number of clusters 𝑘. Some techniques generate the number of clusters randomly to form an ant.

Bee Colony Algorithm-based Clustering Techniques (Banharnsakun et al., 2013; Karaboga & Ozturk, 2011; Kuo et al., 2014; Yan et al., 2012; C. Zhang et al., 2010)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process.
Limitations: Some techniques require a user to define various inputs in advance, including the number of clusters 𝑘. Some techniques generate the number of clusters randomly to form a bee.

Particle Swarm Optimization (PSO) Algorithm-based Clustering Techniques (Cagnina et al., 2014; L.-Y. Chuang et al., 2011; Cura, 2012; Kuo et al., 2012)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process.
Limitations: Randomly generate the number of clusters to form a particle.

Black hole Algorithm-based Clustering Techniques (Hatamlou, 2013)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process.
Limitations: Randomly generate the number of clusters to form a star. Replacements for the deleted stars are generated randomly.

Firefly Algorithm-based Clustering Techniques (Abshouri & Bakhtiary, 2012; Hassanzadeh & Meybodi, 2012; Senthilnath et al., 2011)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process.
Limitations: Randomly generate the number of clusters to form a firefly.

Genetic Algorithm-based Clustering Techniques (Agustín-Blas et al., 2012; D.-X. Chang et al., 2009; D. Chang et al., 2012; He & Tan, 2012; Y. Liu et al., 2011; Rahman & Islam, 2014)
Advantages: Can avoid the local optima issue of many partition-based clustering techniques. Generate the number of clusters through the clustering process. Compared to other evolutionary algorithm-based clustering techniques such as PSO, ant colony, bee colony, black hole and firefly algorithms, GA has more components to improve clustering quality.
Limitations: Most techniques randomly generate the number of genes of a chromosome in population initialization. Records are also randomly chosen as genes. Some existing techniques select a high-quality initial population, but with a high complexity of 𝑂(𝑛²).
Chapter 3
High-Quality Initial Population in a GA for High-
Quality Clustering with Low Complexity
3.1 Introduction
An introduction to different types of clustering techniques and the significance of the clustering
techniques are presented in Chapter 2. It is clear from the literature that clustering is a well-
known and extremely important technique in the area of data mining, and therefore a very
active research area. However, the existing clustering techniques have some limitations and
therefore, there is room for further improvement. The main focus of this chapter is to make some
progress towards achieving our first research goal (see Chapter 1).
There are many approaches for clustering (Arthur & Vassilvitskii, 2007; D.-X. Chang et al.,
2009; D. Chang et al., 2012; Y. Liu et al., 2011; Lloyd, 1982; Rahman & Islam, 2014).

(During the PhD candidature, we have published the following paper based on this chapter: Beg, A. H., and Islam, M. Z. (2015): Clustering by Genetic Algorithm - High Quality Chromosome Selection for Initial Population, In Proc. of the 10th IEEE Conference on Industrial Electronics and Applications (ICIEA 2015), Auckland, New Zealand, 15-17 June, 2015, pp. 129-134. ERA 2010 Rank A.)

K-means is one of the most popular techniques for clustering. K-means requires a user (data miner)
to define the number of clusters (𝑘) in advance (Lloyd, 1982). Based on the user defined number
of clusters 𝑘, it then randomly selects 𝑘 records as initial seeds from the data set and each record
of the data set is then allocated to its closest seed in order to form clusters.
While K-means is popular for its simplicity, it has a number of well-known drawbacks (D.-
X. Chang et al., 2009; Jain, 2010; Mohd et al., 2012; Rahman & Islam, 2014). One of the main
drawbacks of K-means is its requirement of the user defined number of clusters (𝑘) prior to
clustering. The appropriate number of clusters has influence on the quality of a final clustering
solution (Kuo et al., 2012). It is difficult for a user (data miner) to estimate the appropriate
number of clusters in advance. Another drawback of K-means is that it has a tendency to get
stuck at local optima. Moreover, the random selection of the initial seeds is also considered to
be a major drawback since it heavily influences the final clustering quality (Arthur &
Vassilvitskii, 2007). A recent technique, K-means++ (Arthur & Vassilvitskii, 2007), addresses
this last drawback of K-means. However, it still suffers from the other drawbacks of K-means.
The use of a GA in clustering can help to avoid the local optima issue of K-means (Agustín-
Blas et al., 2012; D.-X. Chang et al., 2009; D. Chang et al., 2012; He & Tan, 2012; Y. Liu et al.,
2011; Peng et al., 2014; Rahman & Islam, 2014). Typically, a GA-based technique does not
require any user input on the number of clusters 𝑘.
However, GA-based clustering techniques have some limitations. Many existing techniques
(Y. Liu et al., 2011; Maio et al., 1995; Maulik & Bandyopadhyay, 2000; Xiao et al., 2010)
generate the number of genes of a chromosome randomly in the population initialization phase.
They also randomly choose records as genes, instead of carefully choosing genes of a
chromosome. Careful selection of genes can create an initial population containing high-quality
chromosomes. A high-quality initial population typically increases the possibility of obtaining a
good clustering solution at the end of the genetic processing (Diaz-Gomez & Hougen, 2007;
Goldberg et al., 1991; Rahman & Islam, 2014).
An existing technique called GenClust (Rahman & Islam, 2014) finds a high-quality initial
population and thereby obtains good clustering solutions. However, its initial population
selection process is expensive, with a complexity of 𝑂(𝑛²), where 𝑛 is the number of records
in a data set. Moreover, GenClust requires user input on the radius values for the clusters
during the initial population selection. It can be very difficult for a user to guess the set of
radius values (i.e. radii).
In this chapter, we propose a clustering technique called DeRanClust that produces high-
quality initial seeds through a deterministic phase and a random phase, with a low complexity
of 𝑂(𝑛). DeRanClust chooses the number of clusters automatically for the chromosomes in the
initial population; therefore, it does not require any user input for the number of clusters 𝑘.
DeRanClust also reduces the chance of getting stuck at local optima by using our new genetic
algorithm for high-quality chromosome selection.
We implement DeRanClust and compare its performance with AGCUK (Y. Liu et al., 2011),
GAGR (D.-X. Chang et al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii,
2007) and GenClust (Rahman & Islam, 2014). We compare the performance of the techniques
through two cluster evaluation criteria, namely Silhouette Coefficient (Agustín-Blas et al., 2012;
Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979) using
five real-life data sets that we obtain from the UCI machine learning repository (M. Lichman,
2013). We also carry out a thorough experimentation to investigate the usefulness of the
components used in DeRanClust.
The contributions of the chapter are presented as follows:
Proposing DeRanClust, which produces high-quality clustering solutions with low
complexity and requires no user input.
The evaluation of DeRanClust by comparing it with existing techniques.
The organization of the chapter is as follows: in Section 3.2, we present the main steps of
DeRanClust; the experimental results and discussion are presented in Section 3.3; and the
summary of the chapter is presented in Section 3.4.
3.2 DeRanClust: Deterministic and Random Selection for the Initial Population in a GA-
Based Clustering Technique
We now introduce the main steps of the proposed technique as follows and explain each of them
in detail. Of the following steps, Step 2 (Population Initialization) is the novel contribution
of this chapter.
BEGIN
Step 1: Normalization
Step 2: Population Initialization
DO: t = 1 to I /* I = 50; I is the user defined number of iterations */
Step 3: Noise-Based Selection Operation
Step 4: Crossover Operation
Step 5: Twin Removal
Step 6: Mutation Operation
Step 7: Elitist Operation
END
END
Step 1: Normalization
DeRanClust first normalizes a data set 𝐷 in order to weigh each attribute equally regardless of
their domain sizes (Rahman & Islam, 2014). If 𝐷 has an attribute (such as salary) with huge
domain sizes and an attribute (such as age) with relatively smaller domain sizes, then the
attribute with huge domain sizes has higher impacts (than the attribute with lower domain sizes)
60
on the distance calculations. Therefore, the different domain sizes of the attributes may give
different weights/importance to different attributes.

The normalization avoids this undesirable situation and allows each attribute to have the
same level of impact. It brings the domain range of each numerical attribute between 0 and 1. It
generates a normalized attribute value $X_N = \frac{X_{Max} - \mu}{X_{Max} - X_{Min}}$, where $X_{Max}$ is the maximum, $X_{Min}$ is the minimum, and $\mu$ is the average domain value of the numerical attribute. The distance
between records is computed using the Euclidean distance metric (Han & Kamber, 2006;
Schulz, 2008; Teknomo, 2015a). Hence, the distance between two records for a numerical
attribute can vary between 0 and 1.
Step 2: Population Initialization
This is a new/original contribution of DeRanClust that selects high-quality chromosomes in the
initial population through two phases: a deterministic phase and a random phase.
DeRanClust selects the first 50% of the chromosomes through a deterministic selection phase
and the remaining 50% chromosomes through a random selection phase, for the initial
population. For the deterministic phase, it uses a set of predefined numbers of genes/clusters 𝑘.
The default set of predefined 𝑘 values is {2, 3, …, 10}, where the size of the set is nine. DeRanClust
uses each element of the set as the number of clusters (𝑘) for K-means and thus produces a
clustering solution, i.e. a chromosome. For each element it runs K-means five times and thus
produces five chromosomes. That is, it produces altogether 5 × 9 = 45 chromosomes in the
deterministic phase (see Fig. 3.2). This gives us the opportunity to apply K-means many
times, which also increases the possibility of getting a good-quality chromosome (i.e. clustering
solution) at the beginning (the very first iteration) of the genetic algorithm.
Let us assume that we have a two-dimensional data set having 90 records as shown in Fig.
3.1(a). To generate a chromosome, at first, it takes an element from the predefined set as an
61
input. In the first iteration of K-means, the given number of initial seeds/genes is selected
randomly from the data set. Fig. 3.1(b) shows the initial seeds/genes that are generated
randomly. An initial seed/gene is a record of the data set where each record is a set of attribute
values. Each record is then allocated to its closest seed/gene.
As usual, K-means computes a set of new seeds based on the records allocated to each seed.
The process continues until it settles to a set of seeds/genes that do not change any further. The
process of K-means is explained in Fig. 3.1. Fig. 3.1(e) has three seeds/genes, and the three
genes together form a chromosome.
Due to the use of K-means, DeRanClust expects to get high-quality chromosomes for a given
𝑘 value. Since DeRanClust does not know the actual 𝑘 of a data set, it explores numbers from 2
to 10. Typically, the 𝑘 value for a data set varies between 2 and 10, which we found through an
empirical analysis of the data sets in the UCI machine learning repository (M. Lichman, 2013).
Fig. 3.1: The formation of a chromosome through K-means. Panels: (a) records; (b) initial seeds; (c) iteration 1; (d) iteration 2; (e) final iteration.
Fig. 3.2: Flowchart of the population initialization. For each seed number 𝑖 = 2, 3, …, 10, K-means is applied five times to generate deterministic chromosomes, whose fitness values are calculated and which are inserted into 𝑃𝑑; the chromosomes in 𝑃𝑑 are then sorted by fitness and the top |𝑃|/2 are kept. Next, ten random chromosomes are generated, each with a seed number drawn randomly from the range 2 to √𝑛 (where 𝑛 is the number of records in the data set), and inserted into 𝑃𝑟. Finally, 𝑃𝑠 ← 𝑃𝑑 ∪ 𝑃𝑟, the fitness values of the chromosomes in 𝑃𝑠 are sorted, and the chromosome 𝑃𝑏 with the maximum fitness is found.
In the UCI repository, there are 157 data sets for which the class sizes (i.e. the domain sizes
of the class attributes) have been reported. The domain size of the class attribute of a data set is
indicative of the number of clusters in the data set. The mean and standard deviation of the class
sizes of the data sets are 5.36 and 5.49, respectively. That is, the number of clusters of a data set
typically varies between 2 and 10. Hence, DeRanClust uses the set of 𝑘 values {2, 3, …, 10} in the
deterministic phase.
However, the actual 𝑘 values in some data sets can be more than 10. In order to handle such
a situation DeRanClust uses the random phase where it generates 10 chromosomes (see Fig.
3.2). For each chromosome, it randomly generates the 𝑘 value between 2 and √𝑛 (𝑛 is the
number of records in a data set) and then randomly picks 𝑘 records to form 𝑘 genes of the
chromosome. DeRanClust by default uses 20 chromosomes in the initial population of a
generation. Therefore, it chooses the best 10 chromosomes from the 45 chromosomes generated
in the deterministic phase, and the 10 chromosomes from the random phase. While the use of
K-means helps to get high-quality chromosomes the use of the random approach helps to
explore the solution space through its randomness.
In this chapter, we prepare |𝑃| chromosomes through the population initialization process
(see Fig. 3.2 and Step 1 of Algorithm 3.1). We get |𝑃|/2 chromosomes from the deterministic
phase and |𝑃|/2 chromosomes from the random phase of the selection process. The value of
|𝑃| in the proposed technique is set to 20. Once we get the set of initial chromosomes, we then
compute the fitness of each chromosome and preserve the best chromosome 𝑃𝑏 for the elitist
operation. The proposed technique calculates the fitness of each chromosome using the
Davies-Bouldin (DB) index (D. L. Davies & Bouldin, 1979), where a small DB index value
represents a good clustering result. The fitness of a chromosome is calculated as 1/DB.
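A minimal Python sketch of this two-phase initialization is given below; for brevity it relies on scikit-learn's KMeans and its Davies-Bouldin score, and the function name is ours. It assumes data is a numpy array of normalized records.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def initial_population(data, rng=None):
    """Deterministic phase: K-means for k = 2..10, five runs each (45
    chromosomes); keep the best 10 by fitness = 1/DB. Random phase: 10
    chromosomes with random k in [2, sqrt(n)] and random records as genes."""
    rng = rng or np.random.default_rng()
    deterministic = []
    for k in range(2, 11):
        for _ in range(5):
            km = KMeans(n_clusters=k, init='random', n_init=1).fit(data)
            fit = 1.0 / davies_bouldin_score(data, km.labels_)
            deterministic.append((fit, km.cluster_centers_))
    deterministic.sort(key=lambda t: t[0], reverse=True)
    population = [c for _, c in deterministic[:10]]       # top 10 deterministic
    n = len(data)
    for _ in range(10):                                   # 10 random chromosomes
        k = rng.integers(2, int(np.sqrt(n)) + 1)
        population.append(data[rng.choice(n, size=k, replace=False)].copy())
    return population                                     # |P| = 20 chromosomes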
Step 3: Noise-based Selection Operation
In this chapter, the noise-based selection operation is used in order to select the chromosomes
for the subsequent GA operations by comparing two generations. For example, if we have
twenty chromosomes $P_1^i, P_2^i, \ldots, P_{20}^i$ in the current ($i$th) generation and
twenty chromosomes $P_1^{i-1}, P_2^{i-1}, \ldots, P_{20}^{i-1}$ in the previous ($(i-1)$th)
generation, then to select the chromosomes for the next GA operations, such as crossover and
mutation, a pairwise comparison (i.e. $P_j^i$ and $P_j^{i-1}$ are compared, $\forall j$) is
carried out between the current and previous generations (see Step 2 of Algorithm 3.1).
In the proposed technique, we aim to introduce some randomness that sometimes allows the
worse chromosome of a pair to be selected, instead of always selecting the better chromosome
of the pair, in order to increase the diversity of the population. To achieve this goal, we use the
noising selection approach of AGCUK (Y. Liu et al., 2011). In the noise-based selection
approach, we add some noise to the fitness values of the current generation and compare them
with those of the previous generation. The noise value is a randomly generated real number.
We set the noise value to be high at the beginning of the iterations, and the value decreases as
the iterations progress. As a result, chromosomes with low fitness values get a chance to be
selected in the early iterations, whereas in the later iterations they have less chance of being
selected.
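A minimal Python sketch of this pairwise comparison is given below; the linearly decaying noise schedule is an illustrative assumption rather than the exact schedule of AGCUK.

import numpy as np

def noise_based_selection(prev_pop, prev_fit, cur_pop, cur_fit,
                          t, max_iter, rng=None):
    """Pairwise comparison of generations t-1 and t; the added noise shrinks
    as t grows, so weaker chromosomes survive mostly in early iterations."""
    rng = rng or np.random.default_rng()
    selected, fits = [], []
    scale = 1.0 - t / max_iter                 # noise decays over iterations
    for j in range(len(cur_pop)):
        noise = rng.random() * scale           # random noise in [0, scale)
        if prev_fit[j] - cur_fit[j] + noise > 0:
            selected.append(prev_pop[j]); fits.append(prev_fit[j])
        else:
            selected.append(cur_pop[j]); fits.append(cur_fit[j])
    return selected, fits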
Step 4: Crossover Operation
In the crossover operation, two parent chromosomes swap their segments and generate two new
offspring chromosomes (Agustín-Blas et al., 2012; D.-X. Chang et al., 2009; D. Chang et al.,
2012), as shown in Fig. 3.3. Fig. 3.3 shows that Chromosome 1 has three genes (Gene 11, Gene
12 and Gene 13) and Chromosome 2 has four genes (Gene 21, Gene 22, Gene 23, and Gene 24).
The two parent chromosomes swap their genes and generate two offspring chromosomes:
Offspring 1 and Offspring 2 (see Fig. 3.3). To select the pair of chromosomes, we use the
roulette wheel selection approach (D. Chang et al., 2012; Maulik & Bandyopadhyay, 2000;
Mukhopadhyay & Maulik, 2009b). In this approach, the first chromosome of the pair is the
chromosome with the highest fitness value, and the other chromosome of the pair is selected
using roulette wheel selection. Peng et al. (2014) experimentally demonstrate that single-point
crossover performs better than multi-point crossover. Therefore, in this study we use
single-point crossover to perform a crossover between two parent chromosomes.
Fig. 3.3: Single point crossover between a pair of chromosomes
Step 5: Twin Removal
Two identical genes can somehow be generated in a chromosome (Rahman & Islam, 2014).
Therefore, we use the twin removal approach (Rahman & Islam, 2014) to remove/change the
identical genes. If the length of a chromosome is more than two, then while there are two
identical genes we delete one of them; thus the length of the chromosome decreases by one
each time. If the length of a chromosome is two and both genes are identical, then we
randomly change one of the two identical genes in order to make sure that the genes are not
identical.
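A minimal Python sketch of this twin removal step, with a chromosome stored as a list of gene vectors, is as follows (drawing the replacement gene as a random record from the data set is our illustrative choice):

import numpy as np

def twin_removal(chrom, data, rng=None):
    """Remove duplicated genes; a chromosome of length two with identical
    genes has one gene randomly replaced instead of deleted."""
    rng = rng or np.random.default_rng()
    if len(chrom) == 2 and np.allclose(chrom[0], chrom[1]):
        chrom[1] = data[rng.integers(len(data))].copy()   # re-draw one twin
        return chrom
    unique = []
    for g in chrom:                                       # keep first copy only
        if not any(np.allclose(g, h) for h in unique):
            unique.append(g)
    return unique if len(unique) >= 2 else chrom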
Algorithm 3.1: DeRanClust

Input: A data set D having n records and |A| attributes, where A is the set of attributes
Output: A set of clusters C

Require:
  Ps ← ∅ /* Ps is the set of initial population (20 chromosomes), initially empty */
  Po ← ∅ /* Po is the set of offspring chromosomes, initially empty */
  Pm ← ∅ /* Pm is the set of mutated chromosomes, initially empty */
  I ← 50 /* user defined number of iterations/generations, default value 50 */
  D′ ← Normalize(D) /* normalize each numerical attribute of the data set */
  Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
  Pr ← ∅ /* Pr is the set of random chromosomes (10 chromosomes), initially empty */
end

Step 1: /* Population Initialization */
  Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the numbers of initial seeds are chosen deterministically */
  Pd ← SelectDeterministicChromosomes(Pd) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
  Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the numbers of initial seeds and the seeds themselves are chosen randomly */
  Ps ← Ps ∪ (Pr ∪ Pd) /* insert Pr and Pd into Ps */
  Pb ← FindBestChromosome(Ps) /* Pb is the chromosome with the maximum fitness in Ps */
end

for t = 1 to I do /* default I = 50; t is the iteration counter */
  Step 2: /* Noise-Based Selection Operation */
    if t > 1 then
      Fs ← CalculateFitness(Ps) /* Fs = {F_1^t, F_2^t, …, F_|Ps|^t} is the set of fitness values of the chromosomes in Ps of the t-th generation */
      for j = 1 to |Ps| do
        if F_j^{t-1} − F_j^t + noise > 0 then
          Ps ← Ps ∪ P_j^{t-1} /* select the chromosome P_j of the (t−1)th generation */
        else
          Ps ← Ps ∪ P_j^t /* select the chromosome P_j of the t-th generation */
        end
      end
    end
  end

  Step 3: /* Crossover Operation */
    while |Ps| ≥ 2 do
      p ← SelectChromosomePair(Ps) /* select a pair p = {p1, p2} from Ps using the roulette wheel */
      O ← SinglePointCrossover(p) /* crossover between p1 and p2 generates two offspring O = {O1, O2} */
      Po ← Po ∪ O /* insert the offspring O into Po */
      Ps ← Ps − p /* remove p = {p1, p2} from Ps */
    end
  end

  Step 4: /* Twin Removal */
    Ps ← TwinRemoval(Po) /* if the length of a chromosome is > 2 and two genes are identical, delete one of them; if the length is 2 and the two genes are identical, change one of them */
  end

  Step 5: /* Mutation Operation */
    Fs ← CalculateFitness(Ps) /* calculate the fitness of every chromosome in Ps */
    Pb ← FindChromosomeHavingMaxFitness(Ps) /* Pb is the chromosome with the maximum fitness in Ps */
    Pv ← DivisionAndAbsorptionOperation(Pb) /* perform division and absorption on Pb to get the mutated chromosome Pv */
    Pm ← Pm ∪ Pv /* insert Pv into Pm */
    Ps ← Ps − Pb /* remove Pb from Ps */
    for i = 1 to |Ps| do
      Pv ← DivisionOrAbsorptionOperation(Pi) /* randomly apply either division or absorption on Pi to get a mutated chromosome Pv */
      Pm ← Pm ∪ Pv /* insert Pv into Pm */
    end
  end

  Step 6: /* Elitist Operation */
    Pb ← ElitistOperation(Pm, Pb) /* apply the elitist operation on Pm and Pb and find the best chromosome Pb */
  end
end

C ← C ∪ Pb /* insert Pb into C */
Return C
Let us consider a chromosome $P_j$ that has two genes $g_{ji}$ and $g_{jk}$. The two genes $g_{ji}$ and $g_{jk}$ are identical when the distance between $g_{ji}$ and $g_{jk}$ is zero. To change the identical genes, we randomly select an attribute (say the $A$th attribute) of $g_{ji}$ whose value is $x$, and we replace $x$ with a random number (i.e. a randomly generated real number within the range between 0 and 1) until the genes are no longer identical.
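The twin removal step can be sketched in Python as follows, where a gene is a list of normalized attribute values and two genes are identical when their distance is zero; the structure of the loop is illustrative rather than the exact thesis implementation.

import random

def remove_twins(chromosome):
    # Delete or perturb identical genes (seeds) in a chromosome.
    def identical(g1, g2):
        return all(a == b for a, b in zip(g1, g2))  # zero distance

    genes = [list(g) for g in chromosome]
    i = 0
    while i < len(genes):
        j = i + 1
        while j < len(genes):
            if identical(genes[i], genes[j]):
                if len(genes) > 2:
                    del genes[j]  # drop one twin; the length shrinks by one
                    continue
                # length is exactly two: re-randomize one attribute until the
                # genes are no longer identical (data is normalized to [0, 1])
                while identical(genes[i], genes[j]):
                    a = random.randrange(len(genes[j]))
                    genes[j][a] = random.random()
            else:
                j += 1
        i += 1
    return genes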
Step 6: Mutation Operation
The main objective of the mutation operation is to arbitrarily change the genes of a chromosome in order to explore different solutions. To perform the mutation operation, we use the division and absorption approach of AGCUK (Y. Liu et al., 2011). In this approach, we divide the chromosomes that we obtain from the crossover operation into two parts: the best and the others.
The chromosome that has the highest fitness value is considered the best. For the best chromosome, we apply both the division and absorption operations. For the rest of the chromosomes, we randomly apply either division or absorption (see Step 5 in Algorithm 3.1). For the division operation, we find the sparsest cluster of the selected chromosome and divide it into two clusters by applying K-means with $k = 2$. For the absorption operation, we find the closest clusters of the selected chromosome and merge them into one cluster. The two clusters that have the minimum seed-to-seed distance are considered the closest.
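The division and absorption operations can be sketched as follows. The sketch assumes records are assigned to their nearest seed, that the "sparsest" cluster is the one with the largest average distance of its records to its seed (with at least two records in it), and that the merged cluster is represented by the midpoint of the two closest seeds; these measures and the use of scikit-learn's KMeans are illustrative assumptions rather than the exact thesis implementation.

import numpy as np
from sklearn.cluster import KMeans

def assign(data, seeds):
    # index of the nearest seed for every record
    d = np.linalg.norm(data[:, None, :] - seeds[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def division(data, seeds):
    # Split the sparsest cluster into two using K-means with k = 2.
    labels = assign(data, seeds)
    spread = [np.mean(np.linalg.norm(data[labels == i] - seeds[i], axis=1))
              if np.any(labels == i) else 0.0
              for i in range(len(seeds))]
    s = int(np.argmax(spread))  # sparsest cluster (assumed measure)
    km = KMeans(n_clusters=2, n_init=10).fit(data[labels == s])
    return np.vstack([np.delete(seeds, s, axis=0), km.cluster_centers_])

def absorption(seeds):
    # Merge the two clusters with the minimum seed-to-seed distance.
    k = len(seeds)
    _, i, j = min((np.linalg.norm(seeds[a] - seeds[b]), a, b)
                  for a in range(k) for b in range(a + 1, k))
    merged = (seeds[i] + seeds[j]) / 2.0  # midpoint seed (assumption)
    return np.vstack([np.delete(seeds, [i, j], axis=0), merged])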
Step 7: Elitist Operation
The Elitist operation keeps track of the best chromosome throughout the generations in order to
ensure the continuous improvement of the quality of the best chromosome found so far over the
iterations. The operation is applied on a population at the end of all other operations in a
generation. If the fitness of the worst chromosome $P_w^i$ of the $i$th population (i.e. the current population) is less than the fitness of the best chromosome $P_b$ found so far from all previous generations, then $P_w^i$ is replaced by $P_b$ in the current population. Moreover, if the best chromosome of the current population $P_b^i$ has higher fitness than the fitness of $P_b$, then $P_b^i$ is copied into $P_b$, replacing its old value.
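A minimal Python sketch of this elitist operation is shown below; the list-based population and the fitness callback are illustrative names, not the thesis implementation.

def elitist_operation(population, fitness, best_so_far):
    # Replace the current worst chromosome with the historical best,
    # if the historical best is fitter.
    fits = [fitness(c) for c in population]
    worst = fits.index(min(fits))
    if fits[worst] < fitness(best_so_far):
        population[worst] = best_so_far
    # Update the historical best if the current best beats it.
    curr_best = max(population, key=fitness)
    if fitness(curr_best) > fitness(best_so_far):
        best_so_far = curr_best
    return population, best_so_far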
3.3 Experimental Results and Discussion
We implement our proposed technique DeRanClust and five existing techniques, namely AGCUK (Y. Liu et al., 2011), GAGR (D. Chang et al., 2012), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam, 2014). To implement AGCUK, GenClust and DeRanClust, we set the population size to 20 for each generation/iteration, and the total number of iterations to 50. Following the suggestion of AGCUK, we set the parameters (i.e. the population size and the number of iterations) for AGCUK, GenClust and DeRanClust to the same values for a fair comparison.
The population size for GAGR is set to 30 in each generation, and the total number of generations is set to 50 based on what was suggested in GAGR. Moreover, the cluster number for GAGR is user defined, thus we set the cluster number for GAGR to the same cluster number that we obtain from DeRanClust. We set the threshold value for K-means to 0.05 and the total number of iterations to 50 as suggested in AGCUK (Y. Liu et al., 2011).
3.3.1 Data Sets
We apply the techniques on five (5) real-life data sets as shown in Table 3.1. The data sets are publicly available in the UCI Machine Learning Repository (M. Lichman, 2013). We use data sets having only numerical attributes (no categorical attributes) in the experiments. All the data sets contain a class attribute, which we remove during the clustering process. Moreover, we also normalize each attribute of the data sets in order to get the same level of impact from all attributes. We run each technique 20 times on each data set, and we take the average result.
3.3.2 Evaluation Criteria
To compare our technique with the existing techniques two well-known evaluation criteria
namely Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach,
2005) and DB Index (D. L. Davies & Bouldin, 1979) are used. Note that the smaller value of
DB Index indicates a better clustering result and the higher value of Silhouette Coefficient
represents a better clustering result.
Table 3.1: A brief description of the data sets
Data set | Number of Records | Number of Numerical Attributes | Class Size | Attribute Type
Pima Indian Diabetes (PID) | 768 | 8 | 2 | Integer, Real
Blood Transfusion (BT) | 748 | 4 | 2 | Real
Glass Identification (GI) | 214 | 9 | 6 | Real
Liver Disorder (LD) | 345 | 6 | 2 | Integer, Real
Bank Note Authentication (BN) | 1372 | 4 | 2 | Real
3.3.3 Experimental Results on All Techniques
In this section, we compare the experimental results of the proposed technique with five existing techniques, AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam, 2014), in order to evaluate the usefulness of the proposed technique on 5 data sets, where each technique runs 20 times on each data set.
Fig. 3.4 shows the average Silhouette Coefficient of the clustering results, where DeRanClust achieves better results than all other techniques in 5 out of 5 data sets. That is, in 5 out of 5 data sets the average Silhouette Coefficient of 20 runs of DeRanClust is higher than the average Silhouette Coefficient of 20 runs of AGCUK, GAGR, K-means, K-means++ and GenClust.
Fig. 3.4: Comparative results between DeRanClust and other techniques based on Silhouette Coefficient
As we can see in Fig. 3.5, DeRanClust achieves better clustering results (on average) than all other techniques in 5 out of 5 data sets based on DB Index, for which a lower value indicates a better result.
Fig. 3.5: Comparative result between DeRanClust and other techniques based on DB Index
The rightmost columns of Fig. 3.4 and Fig. 3.5 show the average Silhouette Coefficient and DB Index of all techniques on all data sets. DeRanClust achieves clearly better results on average than all other techniques.
3.3.4 An Analysis of the Impact of Various Components of DeRanClust
In this section, we present some interesting initial results. We carry out the initial experiments
to analyze and evaluate the components used in the proposed technique. In order to evaluate the
effectiveness of various components of the proposed technique, we use the five (5) data sets as
shown in Table 3.1. We run each technique 20 times on each data set and present the average
result.
3.3.4.1 An Analysis of the Impact of the Population Initialization
We incorporate our proposed high-quality chromosome selection with an existing technique
called AGCUK (Y. Liu et al., 2011) for the selection of the initial population. Since our proposed
technique generates the initial population through the high-quality chromosome selection approaches (i.e. the deterministic and random selection processes), we explore the impact of high-quality chromosome selection in the initial population. We generate 20 initial chromosomes
through our high-quality chromosome selection process for AGCUK. We then use these in
AGCUK as the initial population. We call this version of AGCUK Modified AGCUK. We run both AGCUK and Modified AGCUK for
50 iterations on 5 data sets.
As we can see in Table 3.2, Modified AGCUK achieves better clustering result compared to
AGCUK in four (4) out of five (5) data sets according to the Silhouette Coefficient and in five
(5) out of five (5) data sets according to the DB Index. The average clustering result of Modified
AGCUK on five (5) data sets is also better than the AGCUK in terms of the both evaluation
criteria. Note that the shaded cells represent the best clustering results among the techniques.
Table 3.2: Comparative results between AGCUK and Modified AGCUK (DB Index: lower the better; Silhouette Coefficient: higher the better)
Data set | DB Index, AGCUK | DB Index, Modified AGCUK | Silhouette Coefficient, AGCUK | Silhouette Coefficient, Modified AGCUK
PID | 1.4062 | 1.3657 | 0.2728 | 0.2670
BT | 0.5478 | 0.4724 | 0.6498 | 0.6812
GI | 0.6247 | 0.5570 | 0.6573 | 0.7068
Liver | 0.8731 | 0.8560 | 0.4612 | 0.4678
BN | 0.7978 | 0.7596 | 0.4785 | 0.4994
Average | 0.84992 | 0.80214 | 0.50392 | 0.52444
3.3.4.2 An Analysis of the Impact of the Crossover Operation
We also explore the effectiveness of the crossover operation used in the proposed technique. In
order to evaluate the effectiveness of the crossover operation, we compare our proposed
technique with a different version of the proposed technique. We call this version DeRanClust without Crossover. DeRanClust without Crossover is exactly the same as DeRanClust except that it does not have the crossover operation. We run both DeRanClust and DeRanClust
without Crossover for 50 iterations on 5 data sets.
Table 3.3 shows that DeRanClust achieves better clustering results than DeRanClust without Crossover in five (5) out of five (5) data sets based on both the Silhouette Coefficient and DB Index. The average result of DeRanClust on the five (5) data sets is also better than DeRanClust without Crossover in terms of both the Silhouette Coefficient and DB Index.
Table 3.3: Comparative result between DeRanClust and DeRanClust without Crossover
(DB Index: lower the better; Silhouette Coefficient: higher the better)
Data set | DB Index, DeRanClust | DB Index, DeRanClust without Crossover | Silhouette Coefficient, DeRanClust | Silhouette Coefficient, DeRanClust without Crossover
PID | 0.9145 | 1.3317 | 0.4684 | 0.2942
BT | 0.1854 | 0.4931 | 0.8616 | 0.6636
GI | 0.2490 | 0.5491 | 0.8321 | 0.7125
Liver | 0.2843 | 0.7585 | 0.8025 | 0.5066
BN | 0.4234 | 0.9133 | 0.6956 | 0.4290
Average | 0.41132 | 0.80914 | 0.73204 | 0.52118
3.3.4.3 Cluster Quality Comparison between DeRanClust and Modified AGCUK
Since Table 3.2 shows an improvement in AGCUK when the initial population is selected
through our proposed high-quality initial population selection (i.e. Modified AGCUK), in Table
3.4, we compare the cluster quality obtained by the proposed technique with Modified AGCUK.
This gives a fairer comparison between DeRanClust and AGCUK. In Table 3.4, we can see a
clear domination of the proposed technique over Modified AGCUK.
Table 3.4: Comparative results between DeRanClust and Modified AGCUK
(DB Index: lower the better; Silhouette Coefficient: higher the better)
Data set | DB Index, DeRanClust | DB Index, Modified AGCUK | Silhouette Coefficient, DeRanClust | Silhouette Coefficient, Modified AGCUK
PID | 0.9145 | 1.3657 | 0.4684 | 0.2670
BT | 0.1854 | 0.4724 | 0.8616 | 0.6812
GI | 0.2490 | 0.5570 | 0.8321 | 0.7068
Liver | 0.2843 | 0.8560 | 0.8025 | 0.4678
BN | 0.4234 | 0.7596 | 0.6956 | 0.4994
Average | 0.41132 | 0.80214 | 0.73204 | 0.52444
3.3.5 Complexity Analysis
In this section, we present the complexity of DeRanClust and compare it with the complexity of
AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam,
2014) and K-means (Lloyd, 1982). The main factors related to the complexity of DeRanClust
are as follows: in a data set $D$, the number of records is $n$, the number of attributes is $m$, the number of genes in a chromosome is $k$, the number of chromosomes in a population is $z$, the number of iterations in K-means is $N'$ and the number of iterations in DeRanClust is $N$. We realize that, out of these factors, $n$, $m$, $k$ and $z$ can be much bigger than the others. Hence, we consider $n$, $m$, $k$ and $z$ to compute the complexity.
For the initial population, DeRanClust uses K-means to get a number of deterministic chromosomes, the complexity of which is $O(nmkz)$. It also randomly selects some chromosomes, for which the complexity is $O(kz)$. The fitness function is the DB index, which has a complexity of $O(nmkz)$. Once the fitness is computed, the noising selection requires a pairwise comparison, which can be done with $O(z)$ complexity. The crossover operation requires the roulette wheel, for which we need $O(z^2)$ complexity. For the twin removal, we need $O(mk^2z)$ complexity. In the mutation operation, the complexities of the division and absorption are $O(nmkz)$ and $O(mkz)$, respectively. The elitist operation has a complexity of $O(z)$ once the fitness is calculated at a cost of $O(nmkz)$. Hence, the overall complexity of DeRanClust is $O(nmk^2z^2)$. With respect to $n$ and $m$ (the two most significant factors), it has a linear complexity $O(nm)$.
The complexities of AGCUK, K-means, GAGR and GenClust are $O(nm)$ (Y. Liu et al., 2011), $O(nm)$ (Lloyd, 1982), $O(nm)$ (D.-X. Chang et al., 2009) and $O(nm^2 + n^2m)$ (Rahman & Islam, 2014), respectively.
3.4 Summary
In this chapter, we propose a GA-based clustering technique called DeRanClust. The proposed
technique generates high-quality chromosomes in the initial population. It produces high-quality
chromosomes in the initial population through two phases: a deterministic phase and a random
phase.
In the deterministic phase, DeRanClust uses K-means to produce high-quality chromosomes.
The justification for using K-means is that it is known for producing a reasonably good quality
clustering solution in linear time. Due to its light weight and the ability to produce reasonably
good-quality solutions, it is expected to produce a good-quality chromosome with a low
complexity. DeRanClust also uses randomly selected chromosomes in the initial population in
order to maintain both high quality and randomness.
We compare the performance of DeRanClust with AGCUK, GAGR, GenClust, K-means++
and K-means in terms of the two cluster evaluation criteria, namely Silhouette Coefficient and
DB Index. In the experiments, we use five (5) natural data sets that we obtain from the UCI
Machine Learning Repository (M. Lichman, 2013). From the experimental results we find that
DeRanClust performs better than AGCUK, GAGR, GenClust, K-means and K-means++ in 5 out of 5 data sets based on both the Silhouette Coefficient and DB Index. The average clustering result of DeRanClust on the 5 data sets is also better than that of all other techniques.
We also compare the complexity of DeRanClust with the complexity of AGCUK, GAGR, GenClust and K-means. From the complexity analysis, we find that DeRanClust produces its clustering solutions with a low complexity of $O(nm)$, whereas GenClust requires $O(nm^2 + n^2m)$ complexity to produce its clustering solutions. GenClust (Rahman & Islam, 2014) is a recent technique and has been shown to be better than many other high-quality techniques (Ahmad & Dey, 2007a; D.-X. Chang et al., 2009; Lee & Pedrycz, 2009; Y. Liu et al., 2011; Lloyd, 1982).
However, from the experimental results and complexity analyses, we find that DeRanClust
produces better clustering results than GenClust with a low complexity. Therefore, we
empirically demonstrate that through the proposed DeRanClust technique we progress towards
achieving our research goal 1.
We also experimentally evaluate the effectiveness of the proposed component for high-
quality initial population by applying it on AGCUK. The experimental results on 5 data sets
clearly indicate the usefulness of the proposed high-quality initial population selection. We also
explore the usefulness of other genetic operations such as the crossover operation. The
experimental results show that DeRanClust with crossover performs better than DeRanClust
without crossover. The results indicate that there is room for further improvement of clustering
quality by improving other genetic operations such as crossover and mutation.
Therefore, in the next chapter, we propose a new GA-based clustering technique called GMC that introduces a new selection, crossover and mutation operation in order to improve the chromosome quality.
Chapter 4
Extensive Crossover and Mutation in a GA for
High-Quality Clustering with Low Complexity
4.1 Introduction
In this chapter, we propose a GA-based clustering technique called GMC, which is a further
improvement on DeRanClust (as presented in Chapter 3). DeRanClust produces high-quality
clustering solutions (see Fig. 3.4 and Fig. 3.5) with a low complexity of 𝑂(𝑛𝑚) through the
proposed high-quality initial population selection. We believe that there is room for further
improvement of cluster quality of DeRanClust by improving other genetic operations such as
crossover and mutation.
Therefore, GMC proposes a new selection, crossover and mutation operation in order to
improve the cluster quality. In this chapter, we aim to further progress to attain our research goal
1. GMC uses a probabilistic selection where a chromosome with a higher fitness value has a greater chance to be selected for other genetic operations such as crossover and mutation.
During the PhD candidature, we have published the following paper based on this chapter:
Beg, A. H. and Islam, M. Z. (2016): Novel crossover and mutation operation in genetic algorithm for clustering, In Proc. of the IEEE Congress on Evolutionary Computation (IEEE CEC 2016), Vancouver, Canada, July 24-29, 2016, pp. 2114-2121. (ERA 2010 Rank A)
GMC also proposes two phases of crossover operation. In the proposed crossover operation,
it first classifies the chromosomes in a population in one of the two groups: Good group and
Non-good group. It then performs different types of crossover on the two different groups. The
intuition behind this is to increase the possibility of getting good-quality offspring chromosomes
from a pair of good-quality parent chromosomes.
GMC also performs different types of mutation operation for the two different groups. In the
mutation operation, it applies two steps of mutation on the chromosomes of the good group and
three steps of mutation on the chromosomes of the non-good group. The proposed mutation
operation reduces the amount of changes on the good chromosomes, and increases the amount
of changes on bad chromosomes in order to improve their quality.
We implement GMC and compare its performance with AGCUK (Y. Liu et al., 2011),
GAGR (D.-X. Chang et al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii,
2007) and GenClust (Rahman & Islam, 2014). We compare the performance of the techniques
based on two cluster evaluation techniques namely Silhouette Coefficient (Agustín-Blas et al.,
2012; Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979)
using 10 real-life data sets that we obtain from the UCI machine learning repository (M.
Lichman, 2013).
The contributions of the chapter are presented as follows:
Proposing GMC, which produces high-quality clustering solutions with low complexity and requires no user input;
The evaluation of GMC by comparing it with existing techniques.
The organization of the chapter is as follows: in Section 4.2, we discuss the motivation behind
the proposed technique; in Section 4.3, we present our proposed technique; the experimental
results and discussion are presented in Section 4.4, and the summary of the chapter is presented
in Section 4.5.
4.2 The Motivation Behind the Proposed Technique
An existing technique called AGCUK (Y. Liu et al., 2011) uses a noising selection operation in
order to give a better chance for the selection of a chromosome with low fitness value in the
earlier iterations. However, GMC uses a probabilistic selection comparing chromosomes across two generations, where a chromosome with a higher fitness value has a greater chance to be selected. Hence, GMC aims to ensure good-quality chromosomes at the beginning of each generation, before the crossover and mutation operations.
The proposed technique also modifies the process that selects a pair of chromosomes in the crossover operation in order to encourage crossover between two good-quality chromosomes.
There are many selection approaches including the roulette wheel (D. Chang et al., 2012; Maulik
& Bandyopadhyay, 2000; Mukhopadhyay & Maulik, 2009), rank-based wheel (Agustín-Blas et
al., 2012) and random selection (D.-X. Chang et al., 2009). These approaches do not carefully
select a pair of chromosomes for a crossover operation. As a result, a good-quality chromosome
often makes a pair with a bad quality chromosome, and there is a high chance to produce bad
quality offspring chromosomes.
Therefore, for crossover operation we classify the chromosomes in a population into two
different groups: good group and non-good group. In the good group, we apply crossover
operation only on two high-quality chromosomes in order to increase the possibility of getting
good-quality offspring chromosomes. In the good group, we introduce an opportunity so that each chromosome within the group forms a pair with every other chromosome within the group.
For example, if there are three chromosomes in the good group, then Chromosome 1 and
Chromosome 2 are selected as one pair, Chromosome 2 and Chromosome 3 are selected for
another pair, and Chromosome 3 and Chromosome 1 are selected for the other pair.
Similar to the crossover, GMC uses different types of mutation for different groups in the
mutation operation. Hence, the proposed mutation operation reduces the amount of changes on
the good-quality chromosomes, and increases the amount of changes on bad quality
chromosomes in order to improve their quality.
4.3 GMC: Genetic Algorithm with Novel Mutation and Crossover for Clustering
We first mention the main steps of the proposed technique as follows and then explain each of
them in detail. Out of the following steps, Step 3, Step 4 and Step 6 are our novel contributions
of this chapter.
BEGIN
Step 1: Normalization
Step 2: Population Initialization
DO: t = 1 to I /* I = 50; I is the user defined number of iterations */
Step 3: Probabilistic Selection
Step 4: Two Phases of Crossover Operation
Step 5: Twin Removal
Step 6: Three Steps of Mutation Operation
Step 7: Elitist Operation
END
END
Step 1: Normalization
The proposed technique first normalizes the data set 𝐷 in order to consider each attribute equally
regardless of their domain sizes while calculating the fitness of a chromosome. For
normalization, we use the same approach of normalization that we used in DeRanClust (see
Section 3.2 of Chapter 3).
Step 2: Population Initialization
For the population initialization, the proposed technique uses the same approach of population
initialization that we used in DeRanClust (see Section 3.2 of Chapter 3). GMC selects |P|
number of chromosomes in the initial population, |P|/2 from the deterministic phase and |P|/2
from the random phase. In the experiments of this chapter, we use |P| to be 20. GMC uses the Davies-Bouldin (DB) index (D. L. Davies & Bouldin, 1979) to calculate the fitness of each chromosome.
Algorithm 4.1: GMC
Input: A data set D having N records and |A| attributes, where A is the set of attributes
Output: A set of clusters C
Require:
  Ps ← ∅ /* Ps is the set of initial population (20 chromosomes), initially empty */
  Po ← ∅ /* Po is the set of offspring chromosomes, initially empty */
  Pm ← ∅ /* Pm is the set of mutated chromosomes, initially empty */
  I = 50 /* user defined number of iterations/generations; default value for I is 50 */
  D′ ← normalize(D) /* normalize each numerical attribute to obtain the normalized data set D′ */
  Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
  Pr ← ∅ /* Pr is the set of random chromosomes (10 chromosomes), initially empty */
end
Step 1: /* Population Initialization */
  Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the initial seeds for K-means are chosen deterministically */
  Pd ← SelectDeterministicChromosomes(Pd) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
  Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the number of initial seeds and the seeds themselves are chosen randomly */
  Ps ← Ps ∪ (Pr ∪ Pd) /* insert Pr and Pd into Ps */
  Pb ← FindBestChromosome(Ps) /* Pb is the chromosome that has the maximum fitness value in Ps */
end
for (t = 1 to I) do /* default I = 50; t is the counter of I */
  Step 2: /* Probabilistic Selection Operation */
    if (t > 1) then
      Ms ← MergeChromosomes(Ps^t, Ps^(t−1)) /* merge the chromosomes of the t-th generation with the chromosomes of the (t−1)-th generation */
      Ps ← ProbabilisticSelection(Ms) /* probabilistically select |Ps| chromosomes from Ms */
    end
  end
  Step 3: /* Two Phases of Crossover Operation */
    Po ← TwoPhasesOfCrossoverOperation(Ps) /* apply the two phases of crossover on Ps and get a set of offspring chromosomes Po */
  end
  Step 4: /* Twin Removal */
    Ps ← TwinRemoval(Po) /* if the length of a chromosome is > 2 and there are two identical genes, delete one of them; if the length is 2 and the genes are identical, change one of them */
  end
  Step 5: /* Three Steps of Mutation Operation */
    Gs ← SelectGoodGroupChromosomes(Ps) /* classify the chromosomes of the good group */
    Ns ← SelectNonGoodChromosomes(Ps) /* classify the chromosomes of the non-good group */
    Gv ← DivisionAndAbsorptionOperation(Gs) /* perform division and absorption on Gs and get the set of mutated chromosomes Gv */
    Pm ← Pm ∪ Gv /* insert Gv into Pm */
    Nv ← DivisionAndAbsorptionOperation(Ns) /* perform division and absorption on Ns and get the set of mutated chromosomes Nv */
    Nr ← RandomChangeOperation(Nv) /* perform a random change on Nv and get the set of mutated chromosomes Nr */
    Pm ← Pm ∪ Nr /* insert Nr into Pm */
  end
  Step 6: /* Elitist Operation */
    Pb ← ElitistOperation(Pm, Pb) /* apply the elitist operation on Pm and Pb and find the best chromosome Pb */
    C ← C ∪ Pb /* insert Pb into C */
  end
end
Return C
Step 3: Probabilistic Selection
This is an original contribution of GMC that uses a probabilistic selection in order to select
chromosomes for the consequent genetic operations in the next generations. GMC first merges
the chromosomes of the current 𝑖𝑡ℎ and the previous (𝑖 − 1)𝑡ℎgeneration. It then
probabilistically selects a set of chromosomes from the merged chromosomes (see Algorithm
4.1). The chromosome with a higher fitness value has more chances to be selected than the
chromosome with a lower fitness value. The probability of the $j$th chromosome of the $i$th generation is calculated as follows:

$P_j^i = \frac{f_j^i}{\sum_{l=1}^{|P|} f_l^i}$  (Eq. 4.1)

where $f_j^i$ is the fitness of the $j$th chromosome of the $i$th generation and $|P|$ is the number of chromosomes in a population.
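A minimal Python sketch of this fitness-proportional selection (Eq. 4.1) is shown below. Whether the selection is done with or without replacement is not spelled out above, so sampling with replacement here is an illustrative assumption.

import random

def probabilistic_selection(merged, fitness, size):
    # Fitness-proportional selection over the merged generations (Eq. 4.1).
    fits = [fitness(c) for c in merged]
    total = sum(fits)
    weights = [f / total for f in fits]  # P_j^i = f_j^i / sum_l f_l^i
    # fitter chromosomes are drawn more often
    return random.choices(merged, weights=weights, k=size)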
Step 4: Two Phases of Crossover Operation
This is an original contribution of GMC. We perform a crossover operation on a pair of
chromosomes where the chromosomes swap their segments in order to generate a pair of
offspring chromosomes. The proposed technique uses two different phases of crossover: Single
point (Sanghamitra Bandyopadhyay & Maulik, 2002; D. Chang et al., 2012; Garai & Chaudhuri,
2004; Michael Laszlo & Mukherjee, 2007; Pakhira et al., 2005; Peng et al., 2014; Rahman &
Islam, 2014; Song et al., 2009) and Random crossover. Before applying crossover, GMC first
classifies the chromosomes in a population in one of the two groups: Good group and Non-good
group. In order to categorize two different groups, it first identifies the fitness range 𝑅𝑏 as
follows:
$R_b = \frac{f_b}{\sum_{l=1}^{|P|} f_l}$  (Eq. 4.2)

where $f_b$ is the fitness of the best chromosome and $|P|$ is the number of chromosomes in a population. It then separates the chromosomes of a population into two groups by using Eq. 4.3 and Eq. 4.4. When the difference between the fitness ($f_j$) of a chromosome ($P_j$) and the fitness ($f_b$) of the best chromosome ($P_b$) is less than or equal to $R_b$, the chromosome $P_j$ is selected for the good group; otherwise $P_j$ is selected for the non-good group.

$f_b - f_j \le R_b \Rightarrow$ good group  (Eq. 4.3)
$f_b - f_j > R_b \Rightarrow$ non-good group  (Eq. 4.4)
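The grouping rule of Eq. 4.2-4.4 can be sketched in Python as follows (the list-based population and fitness callback are illustrative names):

def split_groups(population, fitness):
    # Split a population into the good and non-good groups (Eq. 4.2-4.4).
    fits = [fitness(c) for c in population]
    f_b = max(fits)
    r_b = f_b / sum(fits)  # fitness range R_b (Eq. 4.2)
    good = [c for c, f in zip(population, fits) if f_b - f <= r_b]     # Eq. 4.3
    non_good = [c for c, f in zip(population, fits) if f_b - f > r_b]  # Eq. 4.4
    return good, non_good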
Hence, the chromosomes of a population are divided into two groups. Once the group selection is complete, GMC then selects pairs of chromosomes for the crossover operation. For the chromosomes in the good group, GMC selects the pairs in such a way that each chromosome within the group forms a pair with every other chromosome within the group. Once the pairs are selected, GMC carries out altogether 10 crossover operations between the chromosomes of each pair. In the first crossover operation it carries out a single-point crossover between the two chromosomes. For each of the nine remaining crossover operations, GMC applies a random crossover between the two chromosomes (see Algorithm 4.2).
In the single-point crossover phase, each chromosome of a pair is divided into two parts at a random point, and the segments of the two chromosomes are then swapped with each other to generate two offspring chromosomes. In the random crossover phase, GMC combines a pair of chromosomes ($P_x$ and $P_y$), and generates a random number ($R_m$) between 0 and the length of the combined chromosomes ($P_x + P_y$). For offspring one, it then randomly selects $R_m$ genes (without replacement) from the combined chromosomes and deletes the $R_m$ genes from the set of ($P_x + P_y$) genes. The remaining genes ($(P_x + P_y) - R_m$) in the combined chromosomes are then selected for offspring two.
Algorithm 4.2: Two Phases of Crossover Operation
Input: A set of chromosomes Ps after the probabilistic selection operation
Output: A set of offspring chromosomes Po
Require:
  Px ← ∅ /* Px is the set of offspring chromosomes obtained from the good-group crossover, initially empty */
  Py ← ∅ /* Py is the set of offspring chromosomes obtained from the non-good-group crossover, initially empty */
  Gs ← ∅ /* Gs is the set of chromosomes in the good group, initially empty */
  Ns ← ∅ /* Ns is the set of chromosomes in the non-good group, initially empty */
end
Step 1: /* Classify the chromosomes */
  Gs ← SelectGoodGroupChromosomes(Ps) /* classify the chromosomes of the good group */
  Ns ← SelectNonGoodChromosomes(Ps) /* classify the chromosomes of the non-good group */
end
Step 2: /* Perform crossover on the good group */
  Gp ← ∅ /* Gp is the set of pairs of chromosomes in the good group, initially empty */
  Gp ← PairSelection(Gs) /* select pairs so that each chromosome forms a pair with every other chromosome in the group */
  Gx ← PerformSinglePointCrossover(Gp) /* perform single-point crossover on each pair of chromosomes in Gp */
  Px ← Px ∪ Gx /* insert offspring chromosomes Gx into Px */
  Gy ← PerformRandomCrossover(Gp) /* perform random crossover on each pair of chromosomes in Gp */
  Px ← Px ∪ Gy /* insert offspring chromosomes Gy into Px */
  Px ← SelectOffspringChromosomes(Px) /* select |Ps|/2 offspring chromosomes from Px based on their fitness */
end
Step 3: /* Perform crossover on the non-good group */
  Np ← ∅ /* Np is the set of pairs of chromosomes in the non-good group, initially empty */
  Np ← RouletteWheelSelection(Ns) /* select pairs of chromosomes using the roulette wheel */
  Nx ← PerformSinglePointCrossover(Np) /* perform single-point crossover on each pair of chromosomes in Np */
  Py ← Py ∪ Nx /* insert offspring chromosomes Nx into Py */
  Ny ← PerformRandomCrossover(Np) /* perform random crossover on each pair of chromosomes in Np */
  Py ← Py ∪ Ny /* insert offspring chromosomes Ny into Py */
  Py ← SelectOffspringChromosomes(Py) /* select |Ps|/2 offspring chromosomes from Py based on their fitness */
end
Step 4: /* Return the offspring */
  Po ← Po ∪ (Px ∪ Py) /* insert Px and Py into Po */
  Return Po
end
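The random crossover phase can be sketched in Python as follows, treating a chromosome as a list of genes. The sketch draws $R_m$ so that both offspring are non-empty, which is an illustrative assumption (the text above allows $R_m$ to range between 0 and the combined length).

import random

def random_crossover(px, py):
    # Redistribute the combined genes of two parents into two offspring.
    combined = list(px) + list(py)
    r_m = random.randint(1, len(combined) - 1)  # genes for offspring one
    picked = set(random.sample(range(len(combined)), r_m))  # without replacement
    offspring1 = [g for i, g in enumerate(combined) if i in picked]
    offspring2 = [g for i, g in enumerate(combined) if i not in picked]
    return offspring1, offspring2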
Once the crossover is complete on this group, it then selects offspring chromosomes based on their fitness values. For example, if the number of chromosomes in the good group is 3, then for the 3 pairs (1-2, 2-3 and 3-1) it altogether generates 30 offspring chromosomes (through the single-point crossover phase and the random crossover phase). GMC then selects the top 3 offspring chromosomes from the 30 offspring chromosomes.
In the non-good group, a pair of chromosomes are selected using the roulette wheel (D.
Chang et al., 2012; Maulik & Bandyopadhyay, 2000; Mukhopadhyay & Maulik, 2009)
selection. In the roulette wheel selection, the best chromosome of this group is selected as the
first chromosome of the first pair. The second chromosome of the pair is selected
probabilistically. The chromosomes of the pair are then excluded from the selection process for
the second pair. The probability of a chromosome is calculated using Eq. 4.1.
Moreover, in the non-good group phase, the chromosomes are selected pair-wise for the crossover operation. If the total number of chromosomes in the non-good group is odd, GMC deletes the worst chromosome from the group. The deleted chromosome is then replaced with an offspring chromosome from the good group. For each pair of
chromosomes, it applies both the random crossover and single point crossover. Similar to the
good group phase, it then selects offspring chromosomes based on their fitness values.
Step 5: Twin Removal
GMC uses the twin removal operation in order to remove/modify twin genes (if any) from each
chromosome. For twin removal, GMC uses the same approach of twin removal of DeRanClust
(see Section 3.2 of Chapter 3).
Step 6: Three Steps of Mutation Operation
The mutation operation of the proposed technique changes each chromosome using three
operations: division (Y. Liu et al., 2011), absorption (Y. Liu et al., 2011) and/or a random
change. Similar to the crossover, in the mutation operation GMC also classifies the chromosomes into one of the
two groups by using Eq. 4.3 and Eq. 4.4. For the good group, it applies division and absorption
operation. For the non-good group, it applies division, absorption, and a random change
operation (see Algorithm 4.1).
In the division operation for a chromosome, it identifies the sparsest cluster 𝐶𝑗 of a
chromosome 𝑃𝑗 and then divides 𝐶𝑗 into two clusters by applying K-means on 𝐶𝑗 using 𝑘 = 2.
The absorption operation finds the two closest clusters of the chromosome and merges them
into one cluster. The clusters that have the minimum seed to seed distance are considered to be
the closest clusters. In the random change operation, one gene of a chromosome is randomly
chosen and an attribute value of the gene is randomly changed to another value within its
domain.
Step 7: Elitist Operation
The elitist operation keeps track of the best chromosome throughout the generations. For finding
the best chromosome, GMC uses the same approach of elitist operation of DeRanClust (see
Section 3.2 of Chapter 3).
4.4 Experimental Results and Discussion
We empirically compare our technique with five existing techniques called AGCUK (Y. Liu et
al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014), K-Means
(Lloyd, 1982) and K-Means++ (Arthur & Vassilvitskii, 2007) on ten (10) natural data sets that
are available in the UCI machine learning repository (M. Lichman, 2013).
4.4.1 Data Sets
Detailed information about the data sets is presented in Table 4.1. All the data sets used in this
chapter have only numerical attributes except the class attribute. We evaluate and compare the
clustering result based on two evaluation criteria namely Silhouette Coefficient (Agustín-Blas
et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin,
1979). A smaller value of DB Index indicates a better clustering result and a higher value of
Silhouette Coefficient represents a better clustering result.
4.4.2 The Parameter used in the Experiments
In the experimentation of AGCUK, GAGR, GenClust, and GMC, we consider the population
size to be 20 and the number of generations/iterations to be 50. We maintain this consistency
for all the techniques in order to maintain a fair comparison among them. The number of iterations in K-means and K-means++ is set to 50. The number of iterations of K-means in GenClust is also set to 50. The cluster numbers in GAGR, K-means and K-means++ are user defined. The numbers of clusters in GAGR, K-means and K-means++ are generated randomly in the range between 2 and $\sqrt{n}$ ($n$ is the number of records in the data set). The values of $r_{max}$ and $r_{min}$ in AGCUK are set to 1 and 0, respectively. The threshold value for K-means is set to 0.005.
Table 4.1: Data sets at a glance
Data set | No. of Records (with missing values) | No. of Records (without missing values) | No. of numerical attributes | No. of categorical attributes | Class size
Glass Identification (GI) | 214 | 214 | 10 | 0 | 7
Vertebral Column (VC) | 310 | 310 | 6 | 0 | 2
Leaf (LF) | 340 | 340 | 16 | 0 | 36
Liver Disorder (LD) | 345 | 345 | 6 | 0 | 2
Dermatology (DT) | 366 | 358 | 34 | 0 | 6
Pima Indian Diabetes (PID) | 768 | 768 | 8 | 0 | 2
Statlog Vehicle Silhouettes (SV) | 846 | 846 | 18 | 0 | 4
Bank Note Authentication (BN) | 1372 | 1372 | 4 | 0 | 2
Yeast (YT) | 1484 | 1484 | 8 | 0 | 10
Image Segmentation (IS) | 2310 | 2310 | 18 | 0 | 7
4.4.3 The Experimental Setup
For each data set, we run GMC 10 times since it can produce different clustering results in
different runs. We then present the average clustering results. We also run all other techniques
AGCUK, GAGR, GenClust, K-means and K-means++ 10 times. We then present the average
clustering result. In order to evaluate the effectiveness of various components of the proposed
technique, we randomly choose five (5) data sets (GI, PID, LF, LD, and DT) as shown in Table
4.1.
4.4.4 Experimental Results on All Techniques
In this section, we experimentally evaluate the performance of the proposed technique by
comparing it with AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust
(Rahman & Islam, 2014), K-Means (Lloyd, 1982) and K-Means++ (Arthur & Vassilvitskii,
2007) on all 10 data sets where each technique runs 10 times on each data set. Fig. 4.1 shows
the average Silhouette Coefficient of the clustering solutions, where GMC achieves better
results than all other techniques in 10 out of 10 data sets.
Fig. 4.1: Comparative results between GMC and other techniques based on Silhouette Coefficient (higher the better)
Fig. 4.2: Comparative results between GMC and other techniques based on DB Index (lower the better)
As we can see in Fig. 4.2, GMC achieves better clustering results (on average) than all other techniques in 10 out of 10 data sets based on DB Index. Moreover, the rightmost columns in Fig. 4.1 and Fig. 4.2 show the average Silhouette Coefficient and DB Index of all techniques on all data sets. GMC achieves better results on average than all other techniques.
4.4.5 An Analysis of the Impact of Various Properties of GMC
We now explore the effectiveness of the proposed components of GMC in the following
subsections.
4.4.5.1 An Analysis of the Impact of the Crossover Operation
We explore the effectiveness of the crossover operation (see Step 4 of Section 4.3). In Fig. 4.3 and Fig. 4.4 we present the experimental results of GMC compared with a different version of GMC called GMC without Crossover, which is exactly the same as GMC except that it does not have the crossover operation. We run both GMC and GMC without Crossover for 50 iterations on 5 data sets. We run both techniques 10 times on each data set and present the average results.
We can see in Fig. 4.3 and Fig. 4.4 that GMC achieves better clustering results than GMC without Crossover based on the Silhouette Coefficient and DB Index. The average clustering result of GMC on the 5 data sets is also better than GMC without Crossover based on the Silhouette Coefficient and DB Index.
Fig. 4.3: Comparative results between GMC and GMC without Crossover based on Silhouette Coefficient (higher the better)
Fig. 4.4: Comparative results between GMC and GMC without Crossover based on DB Index (lower the better)
4.4.5.2 An Analysis of the Impact of the Mutation Operation
In order to explore the effectiveness of the proposed mutation operation, we introduce a new version of GMC from which we remove the mutation operation. We call this version GMC without Mutation. We then compare this version of GMC with the complete GMC. We run both GMC and GMC without Mutation for 50 iterations. We run both techniques 10 times on each data set and present the average results.
Fig. 4.5: Comparative results between GMC and GMC without Mutation based on Silhouette Coefficient (higher the better)
Fig. 4.6: Comparative results between GMC and GMC without Mutation based on DB Index (lower the better)
Fig. 4.5 and Fig. 4.6 indicate that GMC achieves better clustering results compared to GMC without Mutation according to the Silhouette Coefficient and DB Index. The average clustering result of GMC on the 5 data sets is also better than GMC without Mutation based on the Silhouette Coefficient and DB Index.
4.4.5.3 An Analysis of the Impact of the Probabilistic Selection Operation
We also explore the effectiveness of the proposed probabilistic selection operation (see Step 3
of Section 4.3). In Fig. 4.7 and Fig. 4.8, we present the experimental results of GMC compared with a different version of GMC called GMC without Probabilistic Selection (PS), which is exactly the same as GMC except that it does not have the probabilistic selection. We run both GMC and GMC without PS for 50 iterations on 5 data sets. We run both techniques 10 times on each data set and present the average results.
Fig. 4.7: Comparative results between GMC and GMC without Probabilistic Selection (PS) based on Silhouette Coefficient (higher the better)
Fig. 4.8: Comparative results between GMC and GMC without Probabilistic Selection (PS) based on DB Index (lower the better)
Fig. 4.7 and Fig. 4.8 show that GMC achieves better clustering results than GMC without
Probabilistic Selection in 5 out of 5 data sets according to both Silhouette Coefficient and DB
Index. The average clustering result of GMC on 5 data sets is also better than GMC without
Probabilistic Selection based on Silhouette Coefficient and DB Index.
4.4.5.4 An Analysis of Improvement in Chromosomes over the Iterations
In Fig. 4.9, we present the average fitness (in terms of DB Index, where Fitness=1/DB) values
of the best chromosomes of GMC and AGCUK over the 10 runs for 10 data sets. Both GMC
and AGCUK use the same fitness function (DB Index) to calculate the fitness of a chromosome.
As we can see in Fig. 4.9, the fitness of the best chromosome of GMC shows a rapid improvement within the first 5 iterations, and then continues to steadily increase over the 50 iterations. Moreover, the average fitness of the best chromosome of GMC is always higher than
the average fitness of the best chromosome of AGCUK, clearly indicating the effectiveness of
various components of GMC.
Fig. 4.9: Average fitness (best chromosome fitness) versus iterations over the 10 data sets
4.4.6 Statistical Analysis
We now analyze the results by using the statistical sign test (D. Mason, 1998) to evaluate the superiority of the results (Silhouette Coefficient and DB Index) obtained by GMC over the existing techniques. We observe that the results do not follow a normal distribution and thus the conditions for a parametric test are not satisfied. Hence, we carry out a non-parametric sign test on the Silhouette Coefficient and DB Index.
The sign test (Triola, 2001) analyzes the frequencies of the plus and minus signs to determine
whether they are significantly different. For example, suppose that we test the use of a guide book written for students' final exams. If there are 100 students and 51 of them are successful and 49 of them are unsuccessful, common sense suggests that there is not sufficient evidence to say that the guide book is useful, because 51 successful students out of 100 is not significant. But what about 53 successful and 47 unsuccessful students? Or 92 successful and 8 unsuccessful students? The sign test is useful to determine
when such results are significant. Fig. 4.10 summarizes the procedure of the sign test. If the numbers of positive and negative signs are equal, then we fail to reject the null hypothesis and do not proceed with the sign test.
Fig. 4.10: Flow chart of the sign test (Triola, 2001). [The flow chart depicts the following procedure: assign positive and negative signs and reject any zeros; let $n$ be the total number of signs and $x$ the number of the less frequent sign; if $n \le 25$, compare $x$ with the critical value from Table A-7 of the statistics book; otherwise convert $x$ to the test statistic $z = \frac{(x + 0.5) - (n/2)}{\sqrt{n}/2}$ and compare it with the critical $z$ value from Table A-2; reject the null hypothesis if the test statistic is less than or equal to the critical value, and fail to reject it otherwise.]
Fig. 4.11: Sign test of GMC on 10 data sets
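For the large-sample branch of Fig. 4.10 ($n > 25$), the test statistic can be computed as in the following Python sketch; the win/loss counting over the data sets is illustrative.

import math

def sign_test_z(wins, losses):
    # Continuity-corrected sign test statistic for n > 25 (Fig. 4.10):
    # z = ((x + 0.5) - n/2) / (sqrt(n)/2), where x is the less frequent sign.
    n = wins + losses  # ties (zeros) are rejected beforehand
    x = min(wins, losses)
    return ((x + 0.5) - n / 2.0) / (math.sqrt(n) / 2.0)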
In Fig. 4.11, we compare the sign test results of GMC against the existing techniques on 10 data sets in terms of the Silhouette Coefficient and DB Index. The first five bars for the Silhouette Coefficient and DB Index in Fig. 4.11 show the z-values (test statistics); the sixth bar denotes the z-ref value. If the z-value is greater than the z-ref value, then the result obtained by GMC is considered significant. We carry out a right-tailed sign test at z > 1.96, p < 0.025 in terms of the Silhouette Coefficient and DB Index. The statistical sign test results shown in Fig. 4.11 indicate the superiority of GMC over the existing techniques.
4.5 Summary
In this chapter, we propose a GA-based clustering technique called GMC. The proposed technique uses a new selection operation comparing chromosomes across two generations, where a chromosome with a higher fitness value has a greater chance to be selected for other genetic operations such as crossover and mutation.
The proposed technique also modifies the process that selects a pair of chromosomes in the
crossover operation in order to encourage crossover between two good-quality chromosomes.
The proposed crossover operation aims to increase the possibility of getting good-quality
offspring chromosomes. GMC also introduces a new mutation operation which aims to reduce the amount of change on the good chromosomes, and to increase the amount of change on bad chromosomes in order to improve their quality.
We evaluate the proposed technique by comparing its clustering quality with five existing
techniques namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-means
(Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam,
2014) on 10 natural data sets that are publicly available from the UCI machine learning
repository (M. Lichman, 2013) in terms of two well-known evaluation criteria namely
Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005) and
DB Index (D L Davies & Bouldin, 1979). The experimental results indicate a clear superiority
of the proposed technique over the existing techniques. The proposed technique also presents
some interesting results in order to demonstrate the effectiveness of the proposed mutation and
crossover operation. The experimental results clearly indicate the effectiveness of the proposed
mutation and crossover operation.
However, the crossover operation of GMC has a drawback. In the crossover operation, GMC
first classifies the chromosomes in a population into one of two groups: Good group and Non-good group. The good chromosomes are classified into the good group and the bad chromosomes are classified into the non-good group.
groups. In the good group, GMC carries out crossover between all possible pairs of
chromosomes. However, in the non-good group, a bad chromosome makes a pair with another
bad chromosome. Hence, there are fewer chances to obtain good offspring chromosomes from
a pair of bad quality parent chromosomes.
Therefore, in the next chapter we propose a GA-based clustering technique called GCS that
proposes a new crossover operation. In the new crossover operation, GCS introduces an
opportunity for each chromosome to participate in crossover with the best chromosome.
Typically, the genetic operations such as crossover and mutation tend to improve the health/fitness of a chromosome, but they can also cause the health of some chromosomes to deteriorate. Therefore, GCS also introduces a new genetic operation called the health check in order to ensure the presence of healthy chromosomes (chromosomes with good fitness values) in a population.
Chapter 5
High-Quality Clustering through Novel Crossover,
Selection and Health Check with Low Complexity
5.1 Introduction
In this chapter, we propose a GA-based clustering technique called GCS, which is a further
improvement on the techniques proposed in the previous two chapters. In this chapter, we aim to move closer to accomplishing our research goal 1.
We now briefly introduce the novel components/properties of GCS and their logical
justifications as follows. Typically, the chromosomes in a population improve their quality
through some genetic operations such as crossover and mutation. However, the health of some
chromosomes can also deteriorate through the genetic operations. Therefore, GCS introduces a
During the PhD candidature, we have published the following paper based on this chapter
Beg, A. H. and Islam, M. Z. (2016): Genetic Algorithm with Novel Crossover, Selection and Health
Check for Clustering, In Proc. of the 24th European Symposium on Artificial Neural Networks,
Computational Intelligence and Machine Learning (ESANN 2016), Bruges, Belgium, April 27-29, 2016,
pp. 575-580. (ERA 2010 Rank B).
97
new genetic operation called the health check operation in order to ensure the presence of
healthy chromosomes in a population. The proposed technique also uses a new selection
operation in order to ensure the presence of good-quality chromosomes in a population at the
begging of each generation. GCS uses the elitist operation after each genetic operation within a
generation, in order to keep track of the best solution obtained so far.
GCS also modifies the process which selects a pair of chromosomes in a crossover operation
in order to increase the possibility of getting better quality offspring chromosomes. In the crossover operation of GMC (as presented in Chapter 4), a chromosome with a low fitness value always makes a pair with another low-quality chromosome. Therefore, GCS introduces a new crossover operation where each chromosome gets an opportunity to make a pair with the best chromosome.
We implement GCS and compare its performance with AGCUK (Y. Liu et al., 2011), GAGR
(D.-X. Chang et al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007)
and GenClust (Rahman & Islam, 2014). We compare the performance of the techniques through
two cluster evaluation criteria, namely Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-
Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979) using 15
real-life data sets that we obtain from the UCI machine learning repository (M. Lichman, 2013).
The contributions of the chapter are presented as follows:
Presentation of GCS that contains some new genetic operations;
The evaluation of GCS by comparing it with existing techniques.
The organization of the chapter is as follows: in Section 5.2, we discuss the motivation behind
the proposed technique; in Section 5.3, we present our proposed technique; the experimental
results and discussion are presented in Section 5.4, and the summary of the chapter is presented
in Section 5.5.
5.2 The Motivation Behind the Proposed Technique
The presence of good chromosomes in a population increases the possibility of getting good
quality of final clustering solution (Diaz-Gomez & Hougen, 2007; Goldberg et al., 1991;
Rahman & Islam, 2014). Therefore, it is important to ensure the presence of good-quality
chromosomes at the beginning of each generation. Hence, GCS uses a new selection operation in
order to ensure the presence of good-quality chromosomes in a population at the beginning of
each generation.
Moreover, gradual health improvement is also important for a GA to finally find a good-
quality chromosome. In each generation, GA goes through some genetic operations such as
crossover and mutation. The crossover and mutation operation can improve the health of a
chromosome, but they can also deteriorate the health of some chromosomes. Therefore, it is
important to check the chromosomes' health at the end of each generation. Hence, GCS uses a health check operation in order to find sick chromosomes, and replaces them with healthy chromosomes found in the previous generation.
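The idea of the health check can be sketched as follows; pairing the current and previous generations position by position and using a simple fitness comparison to detect "sick" chromosomes are illustrative assumptions, not the exact GCS implementation.

def health_check(curr_gen, prev_gen, fitness):
    # Replace chromosomes whose health deteriorated with their healthier
    # previous-generation counterparts.
    checked = []
    for curr, prev in zip(curr_gen, prev_gen):
        # a chromosome is considered sick if its fitness dropped
        checked.append(prev if fitness(curr) < fitness(prev) else curr)
    return checked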
GCS also modifies the process which selects a pair of chromosomes in a crossover operation
through two phases in order to increase the possibility of getting better quality offspring
chromosomes. In Phase 1, we introduce an opportunity for each chromosome to participate in
crossover with the best chromosome. We increase the opportunity to use the best chromosome
many times, for example: if the population size is twenty (20) then the best chromosome
participates in crossover with other nineteen (19) chromosomes in separate crossover
operations. Due to the use of the best chromosome many times in the crossover operation, this
phase increases the possibility of getting good-quality offspring chromosomes.
In GMC (see Chapter 4) the best chromosome only participates in crossovers with good
chromosomes. However, in GMC the chromosomes in the non-good group do not get a chance to crossover with the best chromosome. Moreover, in GCS, in Phase 2, a crossover
operation on a pair of randomly selected chromosomes through the roulette wheel approach
supports the exploration in a non-deterministic way which is a property and advantage of genetic
algorithms.
5.3 GCS: GA with Novel Crossover, Health Check and Selection for Clustering
We now introduce the main steps of the proposed technique as follows and explain each of them
in detail. Out of the following steps, Step 3, Step 4 and Step 7 are our novel contributions of this
chapter.
BEGIN
Step 1: Normalization
Step 2: Population Initialization
DO: t = 1 to I /* I = 50; I is the user defined number of iterations */
Step 3: Two Phases of Selection Operation
Step 4: Crossover Operation
Step 5: Twin Removal
Step 6: Mutation Operation
Step 7: Health Check Operation
Step 8: The Elitist Operation
END
END
Step 1: Normalization
GCS takes a data set 𝐷 as input. It first normalizes the data set 𝐷 in order to weigh each attribute
equally regardless of their domain sizes. The normalization brings the domain range of each
numerical attribute of the data set between 0 and 1. For normalization, GCS uses the same
approach of normalization that we used in DeRanClust (see Section 3.2 in Chapter 3).
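For illustration, a min-max normalization of this kind can be sketched in Python as follows; the function name normalize and the use of NumPy are our own choices for this sketch and not part of the thesis implementation.

import numpy as np

def normalize(D):
    # Min-max normalize each numerical attribute (column) of D to the range [0, 1].
    D = np.asarray(D, dtype=float)
    mins, maxs = D.min(axis=0), D.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero for constant attributes
    return (D - mins) / ranges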
Algorithm 5.1: GCS
Input: A data set D having N records and |A| attributes, where A is the set of attributes
Output: A set of clusters C
Require:
Ps ← ∅ /* Ps is the initial population (20 chromosomes), initially empty */
Po ← ∅ /* Po is the set of offspring chromosomes, initially empty */
Pm ← ∅ /* Pm is the set of mutated chromosomes, initially empty */
Pa ← ∅ /* Pa is the pool of best chromosomes of the first 20 generations, initially empty */
I = 50 /* user defined number of iterations/generations; the default value for I is 50 */
D′ ← Normalize(D) /* normalize each numerical attribute into the normalized data set D′ */
Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
Pr ← ∅ /* Pr is the set of random chromosomes (10 chromosomes), initially empty */
Hs ← ∅ /* Hs is the set of good chromosomes, initially empty */
end
Step 1: /* Population Initialization */
Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the initial seeds for K-means are chosen deterministically */
Pd ← SelectDeterministicChromosomes(Pd) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the number of initial seeds and the seeds themselves are chosen randomly */
Ps ← Ps ∪ (Pr ∪ Pd) /* insert Pr and Pd into Ps */
Pb ← FindBestChromosome(Ps) /* Pb is the chromosome that has the maximum fitness value in Ps */
end
for t = 1 to I do /* default I = 50; t is the counter of I */
Step 2: /* Two Phases of Selection Operation */
if (t > 1) then
Hs ← SelectTopChromosomes(Ps^t) /* select the top |Ps^t|/2 chromosomes of the current t-th generation based on their fitness */
Ps^t ← Ps^t − Hs /* remove Hs from Ps^t */
Ms ← MergeChromosomes(Ps^t, Ps^(t−1)) /* merge the remaining chromosomes of the t-th generation with all chromosomes of the (t−1)-th generation */
Hs ← ProbabilisticSelection(Ms) /* probabilistically select |Ps^t|/2 chromosomes from Ms */
Ps ← Ps^t ∪ Hs /* insert Hs into Ps^t */
end
Step 3: /* Two Phases of Crossover Operation */
Po ← TwoPhasesOfCrossoverOperation(Ps) /* apply the two phases of crossover on Ps and get the set of offspring chromosomes Po */
Step 4: /* Twin Removal */
Ps ← TwinRemoval(Po) /* if the length of a chromosome is > 2 and it contains two identical genes, delete one of them; if the length is 2 and the two genes are identical, change one of them */
Step 5: /* Mutation Operation */
Pb ← FindChromosomeHavingMaxFitness(Ps) /* Pb is the chromosome having maximum fitness in Ps */
Pv ← DivisionAndAbsorptionOperation(Pb) /* perform division and absorption on chromosome Pb and get the mutated chromosome Pv */
Pm ← Pm ∪ Pv /* insert Pv into Pm */
Ps ← Ps − Pb /* remove Pb from Ps */
for i = 1 to |Ps| do
Pv ← DivisionOrAbsorptionOperation(Pi) /* randomly apply either division or absorption on chromosome Pi and get the mutated chromosome Pv */
Pm ← Pm ∪ Pv /* insert Pv into Pm */
end
Step 6: /* Health Check Operation */
if (t ≤ 20) then
Pa ← Pa ∪ Pb /* store the best chromosome of each of the first 20 generations */
Fd ← CalculateAverageFitness(Pa) /* find the average fitness of the chromosomes in Pa */
else
F ← CalculateFitness(Pm) /* F = {F_1^t, F_2^t, …, F_M^t} is the set of fitness values of every chromosome in Pm of the t-th generation */
for j = 1 to |Pm| do
if (F_j^t > Fd) then
keep P_j^t in Pm /* P_j^t is a healthy chromosome */
else
Ph ← ProbabilisticSelection(Pa) /* probabilistically select a healthy chromosome from Pa */
replace P_j^t with Ph in Pm /* replace the sick chromosome P_j^t with Ph */
end
end
end
Step 7: /* Elitist Operation */
Pb ← ElitistOperation(Pm, Pb) /* apply the elitist operation on Pm and Pb and find the best chromosome Pb */
end
C ← C ∪ Pb /* insert Pb into C */
Return C
Step 2: Population Initialization
For the population initialization, GCS uses the same approach of population initialization that
we used in DeRanClust (see Section 3.2 of Chapter 3). In this chapter, we prepare an initial
population of 2 × |𝑃| chromosomes, |𝑃| chromosomes from the deterministic phase and
|𝑃| chromosomes from the random phase. In the experiments of this chapter, we use |𝑃| to be
10. In the deterministic phase, GCS selects top |𝑃| chromosomes (see Step 1 of Algorithm 5.1).
In the random phase, GCS produces |𝑃| chromosomes. Thus, GCS produces 2 ×
|𝑃| chromosomes from the two phases. It then finds the best chromosome 𝑃𝑏 from the 2 ×
|𝑃| chromosomes and stores it for the elitist operation. The fitness of each chromosome is
calculated using the Davies-Bouldin (DB) Index.
Step 3: Two Phases of Selection Operation
Starting from generation 2, GCS applies the two phases of selection operation in order to get a
new population for the next genetic operations such as crossover and mutation. In Phase 1, GCS
selects the top |𝑃| chromosomes (according to the fitness values) from 2 × |𝑃| chromosomes
of the current population.
In Phase 2, it selects |𝑃| chromosomes probabilistically from a set of 3 × |𝑃| chromosomes,
which is made of the remaining bottom |𝑃| chromosomes of the current population and 2 × |𝑃|
chromosomes from the last population of the immediate previous generation (see Step 2 of
Algorithm 5.1).
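As an illustrative sketch (not the thesis code), these two phases of selection could be implemented as follows in Python; the helper names, the fitness callable and the NumPy random generator rng (e.g. np.random.default_rng()) are assumptions of this sketch.

import numpy as np

def two_phase_selection(current, previous, fitness, rng):
    # current and previous are populations of 2*|P| chromosomes each.
    # Phase 1: keep the top half of the current population by fitness.
    ranked = sorted(current, key=fitness, reverse=True)
    half = len(current) // 2
    top, bottom = ranked[:half], ranked[half:]
    # Phase 2: probabilistically draw the other half from the bottom half of the
    # current population merged with the whole previous population (3*|P| candidates).
    pool = bottom + list(previous)
    weights = np.array([fitness(c) for c in pool])
    idx = rng.choice(len(pool), size=half, replace=False, p=weights / weights.sum())
    return top + [pool[i] for i in idx]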
Step 4: Crossover Operation
GCS performs a crossover operation on a pair of chromosomes, where each chromosome is
divided into two segments, and then the chromosomes swap their segments in order to generate
a pair of offspring chromosomes. GCS uses two phases of crossover operation. In Phase 1, it
selects 2 × |𝑃| − 1 pairs of chromosomes, where in each pair the first chromosome is always
the best chromosome of the population. All other chromosomes are chosen one by one as the
second chromosome of a pair, so each pair has a different second chromosome (see Step 1 of
Algorithm 5.2).
Algorithm 5.2: Two Phases of Crossover Operation
Input: A set of chromosomes Ps after the selection operation
Output: A set of offspring chromosomes Po
Require:
p ← ∅ /* p is a chromosome pair, initially empty */
Px ← ∅ /* Px is the set of offspring chromosomes obtained from the Phase 1 crossover, initially empty */
Py ← ∅ /* Py is the set of offspring chromosomes obtained from the Phase 2 crossover, initially empty */
pb ← FindBestChromosome(Ps) /* pb is the chromosome that has the maximum fitness value in Ps */
PT ← Ps − pb /* PT contains every chromosome of Ps except pb */
end
Step 1: /* Perform Phase 1 of the crossover operation */
while |PT| ≥ 1 do
p ← SelectChromosome(PT) /* select a chromosome from PT for crossover */
for j = 1 to 5 do /* j is the counter of crossovers per pair; the default number of crossovers is 5 */
o ← PerformPhase1Crossover(pb, p) /* after crossover between pb and p, two offspring o = {o1, o2} are generated */
Px ← Px ∪ o /* insert the offspring o = {o1, o2} into Px */
end
PT ← PT − p /* remove p from PT */
end
Px ← SelectPhase1OffspringChromosomes(Px) /* select the top |Ps|/2 offspring chromosomes from Px based on their fitness */
end
Step 2: /* Perform Phase 2 of the crossover operation */
while |Ps| ≥ 2 do
p ← SelectChromosomePair(Ps) /* select a pair of chromosomes p = {p1, p2} from Ps using the roulette wheel */
o ← PerformPhase2Crossover(p) /* after crossover between p1 and p2, two offspring o = {o1, o2} are generated */
Py ← Py ∪ o /* insert the offspring o = {o1, o2} into Py */
Ps ← Ps − p /* remove p = {p1, p2} from Ps */
end
Py ← SelectPhase2OffspringChromosomes(Py) /* select |Ps|/2 offspring chromosomes from Py based on their fitness */
Po ← Po ∪ (Px ∪ Py) /* insert Px and Py into Po */
end
Step 3: /* Return the offspring */
Return Po
end
For extensive exploration, GCS applies the crossover operation five times on each pair and
thereby generates five different pairs of offspring chromosomes. That is, it produces altogether 5 ×
(2 × |𝑃| − 1) × 2 chromosomes, from which it then selects the top |𝑃| chromosomes (see Step
1 in Algorithm 5.2). This phase increases the possibility of getting good-quality offspring
chromosomes. In order to maintain the random exploration ability, in Phase 2 it uses the
traditional roulette wheel (D. Chang et al., 2012; Maulik & Bandyopadhyay, 2000;
Mukhopadhyay & Maulik, 2009) approach for selecting a pair of chromosomes for crossover.
In Phase 2, it selects |𝑃| pairs of chromosomes and applies the traditional single point crossover.
GCS then selects |𝑃| offspring chromosomes from the |𝑃| pairs of offspring chromosomes. Thus,
from the two phases it finally produces 2 × |𝑃| offspring chromosomes.
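A compact sketch of the Phase 1 bookkeeping described above is given below in Python; crossover is assumed to be a callable returning a pair of offspring, and the function name is hypothetical.

def phase1_offspring(population, fitness, crossover, repeats=5):
    # Cross the best chromosome with every other chromosome `repeats` (default 5)
    # times, producing 5 * (2|P| - 1) * 2 offspring for a population of 2|P|.
    best = max(population, key=fitness)
    offspring = []
    for mate in (c for c in population if c is not best):
        for _ in range(repeats):
            offspring.extend(crossover(best, mate))
    # Keep the top |P| offspring, i.e. half the population size.
    return sorted(offspring, key=fitness, reverse=True)[: len(population) // 2]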
Step 5: Twin Removal
GCS uses the twin removal operation in order to remove/modify twin genes (if any) from each
chromosome. For twin removal, GCS uses the same approach of twin removal of DeRanClust
(see Section 3.2 of Chapter 3).
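A simplified Python sketch of the twin removal rule (delete a duplicate gene when the chromosome has more than two genes; perturb one gene when a length-2 chromosome has twins) is shown below; genes are assumed to be tuples of normalized attribute values, and the perturbation strategy is our assumption.

def twin_removal(chromosome, rng):
    # chromosome: list of genes (cluster seeds), each a tuple of values in [0, 1].
    unique = list(dict.fromkeys(chromosome))   # drop duplicate genes, keep order
    if len(unique) >= 2:
        return unique
    # Length-2 chromosome with identical genes: randomly change one twin.
    changed = tuple(rng.random() for _ in chromosome[0])
    return [chromosome[0], changed]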
Step 6: Mutation Operation
GCS uses the same approach of mutation operation which is used in DeRanClust (see Section
3.2 of Chapter 3 and Step 5 of Algorithm 5.1).
Step 7: Health Check Operation
GCS applies the proposed health check operation after the first 20 generations. It prepares a
pool of chromosomes 𝑃𝑎, where it stores the best chromosome of each of the first 20
generations. It then calculates the average fitness 𝐹𝑑 of the chromosomes in 𝑃𝑎. If the fitness
of a chromosome in the current population is less than 𝐹𝑑, then the chromosome is considered
sick. GCS then probabilistically selects a chromosome from 𝑃𝑎 to replace the sick
chromosome (see Step 6 of Algorithm 5.1).
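The following Python sketch mirrors this health check under the stated pool semantics; the names, the fitness callable and the NumPy generator rng are assumptions of the sketch.

import numpy as np

def health_check(population, pool, fitness, rng):
    # pool: best chromosome of each of the first 20 generations.
    avg = sum(fitness(c) for c in pool) / len(pool)
    weights = np.array([fitness(c) for c in pool])
    probs = weights / weights.sum()
    checked = []
    for c in population:
        if fitness(c) >= avg:
            checked.append(c)                                   # healthy: keep
        else:                                                   # sick: replace from the pool
            checked.append(pool[rng.choice(len(pool), p=probs)])
    return checked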
Step 8: Elitist Operation
Generally GA (D.-X. Chang et al., 2009; Y. Liu et al., 2011; Rahman & Islam, 2014) applies
the elitist operation at the end of each generation. However, GCS applies the elitist operation at
the end of each genetic operation within a generation. If the fitness of the worst chromosome
𝑃𝑤𝑖 of the 𝑖-th population (i.e. the current population) is less than the fitness of the best
chromosome 𝑃𝑏 (from all previous generations), then 𝑃𝑤𝑖 is replaced with 𝑃𝑏. Moreover, if the
fitness of the best chromosome 𝑃𝑏𝑖 of the 𝑖-th population is higher than that of 𝑃𝑏, then 𝑃𝑏 is
replaced by 𝑃𝑏𝑖.
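In Python, this variant of the elitist operation can be sketched as follows; this is an illustration under our chromosome-list representation, not the thesis code.

def elitist(population, best_so_far, fitness):
    # Replace the worst chromosome with the best found so far if it is fitter,
    # then update the best-so-far record from the current population.
    worst = min(population, key=fitness)
    if fitness(worst) < fitness(best_so_far):
        population[population.index(worst)] = best_so_far
    current_best = max(population, key=fitness)
    if fitness(current_best) > fitness(best_so_far):
        best_so_far = current_best
    return population, best_so_far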
5.4 Experimental Results and Discussion
We empirically compare the performance of our technique with five existing techniques called
AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-means (Lloyd, 1982), K-
means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam, 2014). For the
experimentation of AGCUK, GAGR, GenClust and GCS, we consider the population size to be
20. The number of generations/iterations for all techniques is set to 50 for a fair comparison.
The cluster numbers in GAGR and AGCUK are generated randomly in the range 2 to √𝑛 (𝑛 is
the number of records in a data set). We run each technique 20 times on each data set, and we
take the average result. We set the threshold value for K-means to be 0.05 and the total number
of iterations to be 50 as suggested in AGCUK.
Table 5.1: Data sets at a glance

Data set | No. of records (with missing values) | No. of records (without missing values) | No. of numerical attributes | No. of categorical attributes | Class size
Glass Identification (GI) | 214 | 214 | 10 | 0 | 7
Vertebral Column (VC) | 310 | 310 | 6 | 0 | 2
Ecoli (EC) | 336 | 336 | 8 | 0 | 8
Leaf (LF) | 340 | 340 | 16 | 0 | 36
Liver Disorder (LD) | 345 | 345 | 6 | 0 | 2
Dermatology (DT) | 366 | 358 | 34 | 0 | 6
Blood Transfusion (BT) | 748 | 748 | 4 | 0 | 2
Pima Indian Diabetes (PID) | 768 | 768 | 8 | 0 | 2
Statlog Vehicle Silhouettes (SV) | 846 | 846 | 18 | 0 | 4
Bank Note Authentication (BN) | 1372 | 1372 | 4 | 0 | 2
Yeast (YT) | 1484 | 1484 | 8 | 0 | 10
Image Segmentation (IS) | 2310 | 2310 | 18 | 0 | 7
Wine Quality (WQ) | 4898 | 4898 | 11 | 0 | 7
Page Blocks Classification (PBC) | 5473 | 5473 | 10 | 0 | 5
MAGIC Gamma Telescope (MGT) | 19020 | 19020 | 11 | 0 | 2
5.4.1 Data Sets
We apply the techniques on 15 real-life data sets as shown in Table 5.1. The data sets are
publicly available in the UCI Machine Learning Repository (M. Lichman, 2013). In the
experiments we consider only the numerical attributes of the data sets and exclude the
categorical attributes. Each data set contains a class attribute, which we remove during the
clustering process.
5.4.2 Evaluation Criteria
To compare our technique with the existing techniques two well-known evaluation criteria
namely Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach,
2005) and DB Index (D L Davies & Bouldin, 1979) are used. A smaller value of the DB Index
indicates a better clustering result, while a higher value of the Silhouette Coefficient indicates a
better clustering result.
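Assuming scikit-learn is available, both criteria can be computed from a data matrix X and a cluster label per record; this is merely one readily available implementation of the two measures, not the code used in the thesis.

from sklearn.metrics import silhouette_score, davies_bouldin_score

def evaluate_clustering(X, labels):
    # Higher Silhouette Coefficient and lower DB Index indicate better clusters.
    return {"silhouette": silhouette_score(X, labels),
            "db_index": davies_bouldin_score(X, labels)}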
5.4.3 Experimental Results on All Techniques
In this section, we compare the experimental result of the proposed technique with five existing
techniques AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-means (Lloyd,
1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam, 2014) in
order to evaluate the usefulness of the proposed technique on 15 data sets where each technique
runs 20 times on each data set.
Fig. 5.1: Silhouette Coefficient of the techniques on eight data sets
Fig. 5.2: Silhouette Coefficient of the techniques on seven data sets
Fig. 5.1 and Fig. 5.2 show the average Silhouette Coefficient of the clustering solutions,
where GCS achieves better results than all other techniques in all 15 data sets. That is, in 15 out
of 15 data sets the average Silhouette Coefficient of 20 runs of GCS is higher than the average
Silhouette Coefficient of 20 runs of AGCUK, GAGR, K-means, K-means++ and GenClust.
Fig. 5.3: DB Index of the techniques on eight data sets
Fig. 5.4: DB Index of the techniques on seven data sets
Moreover, in 14 out of 15 data sets the standard deviations of GCS do not overlap the
standard deviation of GAGR based on Silhouette Coefficient. The standard deviations of GCS
do not overlap the standard deviations of AGCUK in 11 out of 15 data sets based on Silhouette
Coefficient. The standard deviations of GCS do not overlap the standard deviations of K-means,
K-means++ and GenClust on 15 out of 15 data sets. Note that the cases where the standard
deviations of GCS overlap with the standard deviations of other techniques are indicated with
an arrow in Fig. 5.1 and Fig. 5.2. That is, for all cases without an arrow GCS achieves a better
result with no overlap of standard deviations.
As we can see in Fig. 5.3 and Fig. 5.4, GCS achieves better clustering results (on average)
than all other techniques in 15 out of 15 data sets, based on DB Index for which a lower value
indicates a better result. The standard deviations of GCS do not overlap the standard deviations
of all other techniques on 15 out of 15 data sets.
The right most columns in Fig. 5.2 and Fig. 5.4 show the average Silhouette Coefficient and
DB Index of all techniques on all data sets. GCS achieves clearly better results on average
than all other techniques without any overlapping of standard deviations.
5.4.4 Comparative Results between GCS and GMC
In this section, we compare GCS with GMC (as presented in Chapter 4) through two cluster
evaluation criteria, namely the Silhouette Coefficient and the DB Index, using the 10 real-life
data sets from the UCI machine learning repository (M. Lichman, 2013) that we used in
Chapter 4 (see Table 4.1). For each data set, we run GMC and GCS 10 times and present the
average clustering results.
As we can see in Fig. 5.5, GCS achieves better clustering results than GMC in 9 out of 10
data sets, based on Silhouette Coefficient. Fig. 5.6 shows that GCS performs better than GMC
on all 10 data sets based on DB Index. The right most columns in Fig. 5.5 and Fig. 5.6 show the
average Silhouette Coefficient and DB Index of the techniques on all data sets, respectively.
GCS achieves clearly better results on average than GMC, indicating the effectiveness of the
components of GCS.
Fig. 5.5: Comparative results between GCS and GMC based on Silhouette Coefficient
Fig. 5.6: Comparative results between GCS and GMC based on DB Index
5.4.5 An Analysis of the Impact of Various Components of GCS
In this section, we explore the effectiveness of various components of the proposed technique,
including the health check and the crossover operation. In order to evaluate the effectiveness of
various components of the proposed technique, we randomly choose five (5) data sets (PID, LD,
LF, GI and VC) as shown in Table 5.1. We run each technique 20 times on each data set and
present the average result.
5.4.5.1 An Analysis of the Impact of the Health Check Operation
We explore the effectiveness of the proposed health check operation (see Step 7 of Section 5.3).
In order to evaluate the effectiveness of the health check operation, we compare our proposed
technique with a different version of the proposed technique, which we call GCS without Health
Check. GCS without Health Check is exactly the same as GCS except that it does not
have the health check operation. We run both GCS and GCS without Health Check for 50
iterations on 5 data sets.
Table 5.2: Comparative result between GCS and GCS without Health Check

Data set | DB Index (lower the better): GCS | DB Index: GCS without Health Check | Silhouette Coefficient (higher the better): GCS | Silhouette Coefficient: GCS without Health Check
PID | 0.17 | 0.18 | 0.83 | 0.79
LD | 0.27 | 0.29 | 0.81 | 0.79
LF | 0.44 | 0.45 | 0.70 | 0.68
GI | 0.27 | 0.27 | 0.81 | 0.81
VC | 0.25 | 0.25 | 0.83 | 0.83
Average | 0.280 | 0.288 | 0.796 | 0.780
Table 5.2 shows that GCS achieves better clustering results than GCS without Health Check
in three (3) out of five (5) data sets based on both the Silhouette Coefficient and the DB Index.
The average result of GCS on the five (5) data sets is better than that of GCS without Health
Check in terms of both the Silhouette Coefficient and the DB Index.
5.4.5.2 An Analysis of the Impact of the Crossover Operation
We also explore the effectiveness of the proposed crossover operation (see Step 4 of Section
5.3). In order to evaluate the effectiveness of the crossover operation, we compare our proposed
technique with a different version of the proposed technique, which we call GCS with
Traditional Crossover. In GCS with Traditional Crossover, we replace the proposed crossover
operation of GCS with the traditional (single point) crossover. We run
both GCS and GCS with Traditional Crossover for 50 iterations on 5 data sets. We can see in
Table 5.3 that, on average, GCS achieves better clustering results than GCS with Traditional
Crossover based on both the Silhouette Coefficient and the DB Index.
Table 5.3: Comparative result between GCS and GCS with Traditional Crossover

Data set | DB Index (lower the better): GCS | DB Index: GCS with Traditional Crossover | Silhouette Coefficient (higher the better): GCS | Silhouette Coefficient: GCS with Traditional Crossover
PID | 0.17 | 0.18 | 0.83 | 0.86
LD | 0.27 | 0.38 | 0.81 | 0.73
LF | 0.44 | 0.50 | 0.70 | 0.65
GI | 0.27 | 0.30 | 0.81 | 0.82
VC | 0.25 | 0.25 | 0.83 | 0.83
Average | 0.280 | 0.322 | 0.796 | 0.778
5.4.6 An Analysis of the Improvement in Chromosomes over the Iterations
In Fig. 5.7, we present the grand average fitness (in terms of the DB Index, where Fitness =
1/DB Index (David L. Davies & Bouldin, 1979)) of the best chromosome over 20 runs of GCS
on the PID data set. We run GCS 20 times and then present the grand average fitness of the 20
runs. The grand average fitness is plotted against the iterations. Fig. 5.7 shows the gradual
improvement of the best chromosome over the iterations.
Fig. 5.7: Average fitness (best chromosome) versus Iteration of 20 runs on PID data set
In Fig. 5.8, we present the average fitness of all chromosomes (20 chromosomes in a
population) of GCS and AGCUK on PID data set. Both GCS and AGCUK use the same fitness
function (DB Index) to calculate the fitness of a chromosome. Fig. 5.8 shows that the average
fitness over 20 runs of all chromosomes of GCS is always higher than the average fitness over
20 runs of all chromosomes of AGCUK, clearly indicating the effectiveness of various components of
GCS including health check.
Fig. 5.8: Average fitness (all chromosomes) versus Iterations. Each line represents the average fitness of 20 runs on PID data
set
5.5 Summary
In this chapter, we propose a GA-based clustering technique called GCS. The proposed
technique also uses a new selection operation in order to ensure the presence of good-quality
chromosomes in a population at the beginning of each generation. It also modifies the process
which selects a pair of chromosomes in a crossover operation in order to increase the possibility
of getting better quality offspring chromosomes. GCS also uses a health check operation in order
to maintain the chromosome health in a population. It also uses the elitist operation after each
genetic operation within a generation, in order to keep track of the best solution obtained so far.
We evaluate the proposed technique by comparing its performance with the performance of
five existing techniques namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al.,
2009), GenClust (Rahman & Islam, 2014), K-means (Lloyd, 1982) and K-means++ (Arthur &
112
Vassilvitskii, 2007). Two evaluation criteria called Silhouette Coefficient (Pang-Ning Tan,
Michael Steinbach, 2005) and DB index (D L Davies & Bouldin, 1979) are used.
We run each technique 20 times on each data set, and present the average clustering results
of 20 runs and standard deviation of the average clustering result. The experimental results show
that GCS performs better than all other techniques on all 15 data sets based on DB Index without
any overlapping of standard deviations. GCS achieves better results than AGCUK in all 15 data
sets based on Silhouette Coefficient. The standard deviations of GCS do not overlap the standard
deviations of AGCUK in 11 out of 15 data sets based on Silhouette Coefficient. GCS performs
better than GAGR in all 15 data sets based on Silhouette Coefficient. In 14 out of 15 data sets
the standard deviations of GCS do not overlap with the standard deviations of GAGR based on Silhouette
Coefficient. GCS also achieves higher Silhouette Coefficient than K-means, K-means++ and
GenClust in all 15 data sets. In 15 out of 15 data sets the standard deviations of GCS do not
overlap with the standard deviations of any of these techniques.
We also present the average Silhouette Coefficient and DB Index of all techniques on all data
sets. The results show that GCS achieves clearly better results on average than all other
techniques without any overlapping of the standard deviations. We also compare GCS with GMC
(as presented in Chapter 4). The empirical results indicate that GCS achieves clearly better
results than GMC based on two cluster evaluation criteria.
We also empirically evaluate the effectiveness of the proposed components: health check and
crossover operation. The experimental results indicate that the proposed crossover and health
check operations have a positive influence in improving clustering results. However, the health
check operation of GCS has a drawback. GCS applies the health check operation from the 21st
iteration onward. It keeps collecting the best chromosome of each iteration for the first 20
iterations. This pool of 20 chromosomes is then used in the health check operation from the 21st
iteration. Any chromosome having lower fitness than the average fitness of the pool is replaced
by a chromosome probabilistically selected from the pool.
The problem is that the chromosomes in the pool never change during the iterations. The
chance is high that chromosomes in later iterations (such as the 40th iteration or so) have better
fitness than the average fitness of the pool. Hence, the health check operation may become
ineffective. Moreover, even if there is a chromosome with low fitness in a later iteration (say
the 40th iteration) that needs to be replaced, the replacement of the chromosome by a
chromosome from the pool may not be very effective. This is because the quality of the
chromosomes in the pool is unlikely to be good enough to be useful for a later
generation/iteration.
Therefore, in the next chapter we propose a new clustering technique called HeMI that uses
a new health check operation in each iteration in order to find the sick chromosomes and replace
them with healthy chromosomes. HeMI also improves some other components including the
population initialization and mutation in order to improve the clustering quality.
In addition, from the literature (Pourvaziri & Naderi, 2014; Straßburg et al., 2012), we realize
that bigger population size plays a positive role in achieving better clustering solutions.
However, a bigger population size typically requires higher execution time. Therefore, HeMI
uses multiple streams to facilitate the maintenance of a low execution time while using a bigger
population size. Moreover, it uses the bigger population in such a way that it produces better
clustering solutions than simply using a bigger population naively. We discuss these in detail
in the next chapter.
Chapter 6
GA with Multiple Streams and Neighbor Information
Sharing for Clustering
6.1 Introduction
In this chapter, we propose a GA-based clustering technique called HeMI which is a further
improvement on the techniques proposed in the previous chapters. We in this chapter achieve
our first research goal.
We now briefly introduce the novel components/properties of HeMI and their logical
justifications as follows. It is evident from the literature (Pourvaziri & Naderi, 2014; Straßburg
et al., 2012) and through our empirical analysis (carried out in this chapter) that the population
size has a positive impact on the clustering quality. That is, a big population size is likely to
contribute towards a good clustering solution. However, a big population size requires high
execution time. Therefore, HeMI uses a big population in multiple streams, where each stream
contains a relatively small number of chromosomes, and thus facilitates a low execution time
since the streams are suitable for parallel processing when necessary.

During the PhD candidature, we have published the following paper based on this chapter with
the PhD supervisors: Beg, A. H., Islam, M. Z., and Estivill-Castro, V. (2016): Genetic Algorithm
with Healthy Population and Multiple Streams Sharing Information for Clustering,
Knowledge-Based Systems, 114 (2016) 61-78. (ABDC 2016 Rank A, SJR 2016 Rank Q1,
5-Year Impact Factor: 3.433, H Index 63).
Various genetic operations (such as crossover and mutation) can be applied on each stream in
parallel. As a result, HeMI is likely to produce better quality clustering solutions. Moreover, by
splitting the chromosomes into a number of streams and processing the splits separately, HeMI
exhibits a higher ability to explore the solution space than the traditional approach of
processing all chromosomes in a single stream. We present empirical evidence of this
phenomenon where we use a single stream of 20 chromosomes, 40 chromosomes and 80
chromosomes, and four streams of 20 chromosomes.
Note that there are some existing techniques that use parallel genetic algorithms (Kumar,
Mills, Hoffman, & Hargrove, 2011; Y. Y. Liu & Wang, 2015; Moore, 2004; Straßburg et al.,
2012) where they divide the total number of chromosomes into a number of parallel runs,
whereas in our technique we increase the total number of chromosomes. The main goal of these
existing techniques is to reduce the time complexity through the parallelization of the genetic
algorithms, whereas the main goal of HeMI is to improve clustering results. While employing
parallelization these existing techniques do not share information among the parallel streams,
whereas HeMI introduces information sharing among the streams at a regular interval in order
to take advantage of the multiple streams.
For a stream 𝑆𝑖, HeMI first identifies its neighboring streams and then spots out the best
chromosome from all neighboring streams and 𝑆𝑖. It then replaces the worst chromosome of 𝑆𝑖
by the best chromosome. The information sharing is carried out at a regular interval such as at
every 10th iteration.
For Stream 1, Stream 2 and Stream 3 are considered to be neighbors. Similarly for Stream 2,
Stream 3 and Stream 4 are considered to be neighbors. While the sharing of the best
chromosome from the neighbors increases the fitness of the best chromosome, it maintains the
divergence among the streams. That is, had HeMI used/inserted the best chromosome out of all
streams into all streams then they would have the same best chromosome in all streams.
Similar to DeRanClust (Chapter 3), GMC (Chapter 4) and GCS (Chapter 5), HeMI also builds
a high-quality initial population with a low complexity of 𝑂(𝑛) in two phases: a deterministic
phase and a random phase. Including HeMI, all these techniques produce 45 chromosomes in
the deterministic phase and select the top |𝑃|/2 chromosomes for the initial population, where
|𝑃| (set to 20 in our experiments) is the number of chromosomes in a population. However, in
the random phase all these techniques except HeMI generate only |𝑃|/2 (i.e. 10) chromosomes.
We realize that, similar to the deterministic phase, we can also increase the possibility of getting
good-quality chromosomes through the random phase. Therefore, in HeMI we generate the same
number of chromosomes (45) in the random phase and then select the top |𝑃|/2 chromosomes.
The presence of healthy chromosomes (i.e. chromosomes with high fitness values) in a
population can increase the possibility of good clustering results. Hence, HeMI replaces the sick
chromosomes (i.e. chromosomes with low fitness) by healthy chromosomes. GCS (as presented
in Chapter 5) also uses a health check operation that finds sick chromosomes in a population,
and probabilistically replaces them with healthy chromosomes found in the previous 20
generations. GCS applies the health check operation after 20 generations. However, we
empirically find that the chromosomes and best chromosome in a population improve their
quality over the iterations (see Fig. 6.6, Fig. 6.7 and Fig. 6.8). Hence, GCS’s approach of using
the pool of best chromosomes obtained from the first twenty iterations may not be effective for
health improvement in later iterations such as the 40th iteration.
Hence, HeMI uses a new health check operation where some of the healthy chromosomes
are chosen from a pool of healthy chromosomes obtained by the initial population operation,
whereas some of the healthy chromosomes are generated through the crossover operation of the
existing healthy chromosomes of a generation with the hope that the crossover of two healthy
chromosomes may generate new healthy chromosomes.
HeMI uses a three-step mutation operation, which applies a division and an absorption
operation in sequence only if they improve the quality of the clustering solution. Additionally, at the
end of the division and absorption operation, it also applies a random change in chromosomes.
Unlike HeMI, an existing technique (Y. Liu et al., 2011) applies either division or absorption
randomly. Another existing technique (Agustín-Blas et al., 2012) applies division (they call it
splitting) on the largest cluster instead of the sparsest cluster, and absorption on two randomly
chosen clusters. Another existing technique (D. Chang et al., 2012) applies division and
absorption on randomly chosen clusters. Hence, HeMI has a better approach that carefully improves
cluster quality through mutation while exploring unconventional solution space. HeMI also
maintains randomness through the noise-based selection and crossover operations in order to
explore the solution space.
We evaluate our technique by comparing its performance with the performance of five high-
quality techniques namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-
means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman &
Islam, 2014). We conduct experiments for the techniques on twenty (20) real-life data sets that
are available in the UCI machine learning repository (M. Lichman, 2013). The experimental
results clearly indicate that the proposed technique performs significantly better than other
techniques in terms of the evaluation criteria considered in this chapter: Silhouette Coefficient
(Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies
& Bouldin, 1979). We also experimentally evaluate the usefulness of various components of
HeMI.
The main contributions of HeMI are as follows:
The use of multiple streams (see Component 2 in Section 6.2.2).
The three steps mutation operation (see Component 7 in Section 6.2.2).
The Health improvement operation (see Component 8 in Section 6.2.2).
Neighbor information sharing (see Component 10 in Section 6.2.2).
The Global Best Selection operation (see Component 11 in Section 6.2.2).
HeMI works on data sets having numerical and/or categorical attributes.
The rest of the chapter is organized as follows: in Section 6.2 we describe our proposed
technique. We discuss the experimental results in Section 6.3 and in Section 6.4 we present the
summary of the chapter.
6.2 HeMI: Healthy Population and Multiple Streams Sharing Information in a GA for
Clustering
6.2.1 Basic Concepts
One of the most important components of HeMI is multiple streams. It is evident from the
relevant literature (Pourvaziri & Naderi, 2014; Straßburg et al., 2012) that in genetic algorithm
based clustering techniques a bigger population size tends to increase the quality of the final
clustering solution. Therefore, we realize that a population size of 80 chromosomes is more
likely to produce better clustering solutions than a smaller population size such as 20
chromosomes. In this study, we carry out an empirical analysis on this (presented in Section
6.3.7.1) where we can see the improvement of clustering quality with the increase of population
size in the existing genetic algorithm based clustering techniques AGCUK (Y. Liu et al.,
2011) and GenClust (Rahman & Islam, 2014).
An obvious issue related to the increase of population size is the increased execution time.
Therefore, HeMI uses multiple streams, where each stream contains a relatively small number
of chromosomes, and thus can facilitate managing a low execution time since they are suitable
for parallel processing if necessary. Another advantage of using the multiple streams is a better
exploration of clustering solutions. That is, if we run all chromosomes in a single stream (like
traditional techniques) then we get one best chromosome from the whole population, whereas
if we divide the chromosomes in multiple streams and run them independently then we get
multiple best chromosomes; one best chromosome from each stream. We naturally expect this
approach to produce better clustering solutions. The empirical analysis carried out in this study
(presented in Section 6.3.7.1) also supports the expectation.
While running multiple streams independently we also make them help each other in
achieving better clustering solutions. That is, the independent streams can exchange messages at
a regular interval in order to increase the clustering quality of each stream. One way we could
do this is by identifying the best chromosome out of all streams and implanting the chromosome
in each stream. However, in that case, all streams would have the same best chromosome and
would lose the diversity among them.
Therefore, for each stream, we first identify a set of neighboring streams and then identify
the best chromosome within the neighboring streams which is then implanted into the stream.
This way we can ensure that all streams will not have the same best chromosome in them. Our
empirical analysis again shows (presented in Section 6.3.7.1) a clear evidence that the use of
multiple streams with sharing information among the neighboring streams results in better
clustering solutions.
Another interesting idea of HeMI is the continuous health improvement in every generation
in order to ensure the presence of high-quality chromosomes in each population. In each
population, it identifies a number of sick chromosomes and replaces them by healthy
chromosomes. Some of the healthy chromosomes are obtained from the pool of high-quality
chromosomes created for the initial population using K-means/K-means++ many times.
Moreover, some of the healthy chromosomes are created by applying the crossover operation
on pairs of good chromosomes. Again our empirical analysis indicates the effectiveness of the
health improvement as presented in Section 6.3.7.4.
The mutation operation generally changes some chromosomes randomly (Agustín-Blas et al.,
2012; D. Chang et al., 2012; Rahman & Islam, 2014). However, HeMI aims to use the mutation
operation for improving the chromosome health while changing them randomly. The mutation
operation in HeMI has three components: division, absorption, and random change. In the
division operation, it examines whether dividing the sparsest cluster into two separate clusters
can improve the chromosome health. Similarly in the absorption operation, it examines whether
the chromosome health can be improved by merging the two closest clusters. After the division
and absorption operation, it finally makes a slight change randomly. The effectiveness of this
mutation operation has been empirically analysed in Section 6.3.7.3.
6.2.2 Main Steps
In this subsection, we introduce the main components and steps of HeMI before we present the
complete algorithm of HeMI in the next subsection. Out of the following components,
Component 2, Component 7, Component 8, Component 10, and Component 11 are our novel
contributions of this chapter.
BEGIN
Step 1: Normalization
DO: k = 1 to m /* m is the user defined number of streams */
Step 2: Population Initialization
END
DO: j = 1 to G /* G is the user defined number of intervals */
DO: k = 1 to m /* m is the user defined number of streams */
DO: t = 1 to I /* I = 10; I is the user defined number of iterations */
Step 3: Noise-Based Selection
Step 4: Crossover Operation
Step 5: Twin Removal
Step 6: Three Steps Mutation Operation
Step 7: Health Improvement Operation
Step 8: Elitist Operation
END
END
Step 9: Neighbor Information Sharing
END
Step 10: Global Best Selection
END
Component 1: Normalization
Numerical Attributes:
To normalize the numerical attributes, HeMI uses the same approach of normalization that we
used in DeRanClust (see Section 3.2 in Chapter 3).
Categorical Attributes:
For a categorical attribute, HeMI uses an existing technique (H. Giggins & Brankovic, 2012) to
compute the similarity 𝑠 between two values of the categorical attribute. The distance between
two values of a categorical attribute is 𝑑 = 1 − 𝑠. The similarity 𝑠 varies between 0 and 1 and
hence the distance 𝑑 also varies between 0 and 1. As a result, the distance between any two
records varies between 0 and 1 and all attributes have equal weight in the distance calculation.
Algorithm 6.1: HeMI
Input: A data set D having N records and |A| attributes, where A is the set of attributes
Output: A set of clusters C
Require:
Ps ← ∅ /* Ps is the initial population (20 chromosomes), initially empty */
Po ← ∅ /* Po is the set of offspring chromosomes, initially empty */
Pm ← ∅ /* Pm is the set of mutated chromosomes, initially empty */
Pc ← ∅ /* Pc is the set of healthy chromosomes, initially empty */
D′ ← Normalize(D) /* normalize each attribute of the data set into the normalized data set D′ */
Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
Pr ← ∅ /* Pr is the set of random chromosomes (45 chromosomes), initially empty */
end
for k = 1 to m do /* m is the user defined number of streams; default m = 4, k is the counter of m */
Step 1: /* Population Initialization */
Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the initial seeds for K-means are chosen deterministically */
Pd ← SelectDeterministicChromosomes(Pd) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the number of initial seeds and the seeds themselves are chosen randomly */
Pr ← SelectRandomChromosomes(Pr) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
Ps ← Ps ∪ (Pr ∪ Pd) /* insert Pr and Pd into Ps */
Pb ← FindBestChromosome(Ps) /* Pb is the chromosome that has the maximum fitness value in Ps */
end
for j = 1 to G do /* G is the number of intervals of the total number of iterations; default G = 5, j is the counter of G */
for k = 1 to m do /* m is the user defined number of streams; default m = 4, k is the counter of m */
for t = 1 to I do /* I is the user defined number of iterations per interval; default I = 10, t is the counter of I */
Step 2: /* Noise-Based Selection */
if (t > 1) then
Ps ← NoiseBasedSelection(Ps^t, Ps^(t−1)) /* perform noise-based selection between the current (Ps^t) and previous (Ps^(t−1)) generation */
end
Step 3: /* Crossover Operation */
Po ← PerformCrossover(Ps) /* perform single point crossover on Ps and get a set of offspring chromosomes Po */
Step 4: /* Twin Removal */
Po ← TwinRemoval(Po) /* perform twin removal on Po and get a set of chromosomes Po */
Step 5: /* Three Steps Mutation Operation */
for each chromosome P in Po do
Pv ← DivisionOperation(P) /* perform the division operation on P and get chromosome Pv */
Pv ← AbsorptionOperation(Pv) /* perform the absorption operation on Pv and get chromosome Pv */
Pv ← RandomChangeOperation(Pv) /* perform the random change operation on Pv and get chromosome Pv */
Pm ← Pm ∪ Pv /* insert Pv into Pm */
end
Step 6: /* Health Improvement Operation */
Px ← Phase1(Pm) /* select the 10 best chromosomes from Pm based on their fitness */
Py ← Phase2(Pm) /* select the 4 best chromosomes from Pm, perform single point crossover and get the offspring chromosomes Py */
Pz ← Phase3(Pd) /* select the 6 best chromosomes from Pd, perform random mutation and get the chromosomes Pz */
Pc ← Pc ∪ (Px ∪ Py ∪ Pz) /* insert Px, Py and Pz into Pc */
Step 7: /* Elitist Operation */
Pb^k ← ElitistOperation(Pc, Pb) /* apply the elitist operation on Pc and Pb and find the best chromosome Pb^k */
Pg ← Pg ∪ Pb^k /* insert Pb^k into Pg; Pg is the set that contains the best chromosome of each stream */
end
end
Step 8: /* Neighbor Information Sharing */
for k = 1 to m do /* default m = 4 */
Pw^k ← FindWorstChromosome(Pc) /* find the worst chromosome Pw^k in Pc */
Pw^k ← ReplaceWithNeighborBestChromosome(Pg) /* replace Pw^k with the best chromosome of its neighbors */
Pb^k ← FindLocalBestChromosome(Pc) /* find the local best chromosome */
Lb ← Lb ∪ Pb^k /* insert Pb^k into Lb; Lb is the set that contains the best chromosome of each stream k */
end
end
Step 9: /* Global Best Selection */
C ← FindGlobalBestChromosome(Lb) /* find the global best chromosome C from Lb */
Return C
end
Component 2: Multiple Streams
This component is an original/new contribution of HeMI that aims to take advantage of using a
big population through multiple streams where each stream contains a relatively small number
of chromosomes. Generally, in the genetic algorithm based clustering techniques, the bigger
population size tends to increase the quality of final clustering solutions (Pourvaziri & Naderi,
2014; Straßburg et al., 2012). Therefore, HeMI aims to use a big population in order to produce
better clustering solutions. The chromosomes for each stream are generated separately through
the population initialization. Various components such as crossover and mutation are applied
on each stream separately.
Component 3: Population Initialization
HeMI selects high-quality chromosomes in the initial population through two phases: a
deterministic phase and a random phase.
Deterministic Phase
HeMI uses the same approach of deterministic phase that we used in DeRanClust (see Section
3.2 of Chapter 3).
Random Phase
In the random phase, HeMI generates 45 chromosomes. For each chromosome, it randomly
generates the 𝑘 value between 2 and √𝑛 (𝑛 is the number of records in a data set) and then
randomly picks 𝑘 records to form the 𝑘 genes of the chromosome.
In the experiments of this chapter, we use 20 chromosomes in the population of a generation.
Therefore, HeMI chooses the best 10 chromosomes from the 45 chromosomes generated in the
deterministic phase and the best 10 chromosomes from the 45 chromosomes generated in the
random phase. The best chromosome out of the 20 chromosomes of the initial population is
stored separately as the best chromosome which is then used in the elitist operation later on. The
DB Index (D L Davies & Bouldin, 1979) is used by default as the fitness function of the
chromosomes throughout all steps in HeMI.
Component 4: Noise-based Selection
At the beginning of each generation starting from Generation 2, we carry out the Noise Based
Selection (Y. Liu et al., 2011) in order to get a new population for subsequent genetic operations
such as crossover and health improvement. HeMI uses the same approach of noise-based
selection that we used in DeRanClust (see Section 3.2 in Chapter 3).
Component 5: Crossover Operation
HeMI performs a crossover operation on a pair of chromosomes where the chromosomes swap
their segments/genes to each other and generate a pair of offspring (Agustín-Blas et al., 2012;
D. Chang et al., 2012; Rahman & Islam, 2014). Typically, there are many selection criteria,
such as the roulette wheel (D. Chang et al., 2012; Maulik & Bandyopadhyay, 2000; Mukhopadhyay &
Maulik, 2009), rank-based wheel (Agustín-Blas et al., 2012) and random selection (D.-X. Chang
et al., 2009) that are used to select a chromosome pair for a crossover operation.
In HeMI, the best chromosome (which is currently available in the population) is chosen as
the first chromosome of the pair. The second chromosome of the pair is chosen using the roulette
wheel approach, where a chromosome $P_j$ is chosen with probability $T_j = f_j / \sum_{i=1}^{|P|} f_i$.
Here, $f_j$ is the fitness of the chromosome $P_j$ and $|P|$ is the size of the current population. Once a pair of
chromosomes is chosen it is removed from the current population. For the selection of the next
pair, again the new best chromosome is chosen. The second chromosome of the pair is chosen
using the same process described above. The intuition behind the roulette wheel selection is to
take a non-deterministic approach with high probability of choosing a pair of good
chromosomes.
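The pairing scheme just described can be sketched in Python as follows; the fitness callable and the NumPy generator rng are assumptions of the sketch.

import numpy as np

def select_pairs(population, fitness, rng):
    # First member of each pair: the best chromosome still available.
    # Second member: drawn by roulette wheel with probability T_j = f_j / sum_i f_i.
    pool = list(population)
    pairs = []
    while len(pool) >= 2:
        best = max(pool, key=fitness)
        pool.remove(best)
        weights = np.array([fitness(c) for c in pool])
        mate = pool[rng.choice(len(pool), p=weights / weights.sum())]
        pool.remove(mate)
        pairs.append((best, mate))
    return pairs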
There are many approaches to perform crossover between a pair of chromosome such as
single-point (Sanghamitra Bandyopadhyay & Maulik, 2002; D. Chang et al., 2012; Garai &
Chaudhuri, 2004; Michael Laszlo & Mukherjee, 2007; Pakhira et al., 2005; Peng et al., 2014;
Rahman & Islam, 2014; Song et al., 2009), multi-point (Agustín-Blas et al., 2012), arithmetic
(Yan et al., 2012) and path-based crossover (D.-X. Chang et al., 2009). However, in the genetic
algorithm a single-point crossover is very commonly used. There are many existing techniques
(Sanghamitra Bandyopadhyay & Maulik, 2002; D. Chang et al., 2012; Garai & Chaudhuri,
2004; Michael Laszlo & Mukherjee, 2007; Pakhira et al., 2005; Peng et al., 2014; Rahman &
Islam, 2014; Song et al., 2009) that use single-point crossover.
Moreover, Peng et al. (2014) experimentally demonstrate that a single-point crossover
performs better than a multi-point crossover. Therefore, in HeMI a single-point crossover is
used where it randomly generates a crossover point for each chromosome of the pair in order to
divide a chromosome into two segments and then swaps the segments between the
chromosomes.
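For variable-length chromosomes represented as lists of cluster seeds, the single-point crossover can be sketched as follows; this is an illustration under that representation, not the thesis code.

def single_point_crossover(p1, p2, rng):
    # Cut each parent at a random point (keeping at least one gene per segment)
    # and swap the tails between the parents. Assumes each parent has >= 2 genes.
    c1 = rng.integers(1, len(p1))
    c2 = rng.integers(1, len(p2))
    return p1[:c1] + p2[c2:], p2[:c2] + p1[c1:]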
Component 6: Twin Removal
HeMI uses the twin removal operation in order to remove/modify twin genes (if any) from each
chromosome. For twin removal, HeMI uses the same approach of twin removal of DeRanClust
(see Section 3.2 of Chapter 3).
Component 7: Three Steps Mutation Operation
This is another new contribution of HeMI that changes a chromosome using three
steps/operations: division, absorption, and a random change. Note that the division and
absorption operations are also used in some existing techniques (Agustín-Blas et al., 2012; D.
Chang et al., 2012; Y. Liu et al., 2011), but there are differences between them and HeMI as
follows.
Chang et al. (D. Chang et al., 2012) apply both division and absorption on a chromosome
where clusters are chosen randomly for division and absorption, unlike HeMI which carefully
chooses clusters for division and absorption. Blas et al. (Agustín-Blas et al., 2012) also choose
a cluster for division carefully, where the largest cluster is chosen for the division. We argue
that a large cluster can also be compact and may not always require being divided into smaller
clusters. Therefore, HeMI applies division on the sparsest (not the largest) cluster of a chromosome.
Moreover, Blas et al. (Agustín-Blas et al., 2012) choose two random clusters for absorption,
whereas HeMI chooses the two closest clusters for absorption.
Liu et al. (Y. Liu et al., 2011) also choose the sparsest cluster for the division and the two closest
clusters for absorption. However, they randomly apply either division or absorption on a
chromosome, regardless of the improvement of its fitness. In contrast, HeMI applies both division
and absorption on a chromosome. Division/absorption is applied only if it improves the fitness
of a chromosome. Additionally after the division and absorption operation, HeMI also applies
the random change operation on a chromosome based on a mutation probability in order to
support the exploration nature of genetic algorithms.
In the division operation, HeMI identifies the sparsest cluster $C_j$ of a chromosome $P_j$ and
then divides the cluster $C_j$ into two clusters by applying K-means on $C_j$ with the value of
$k$ set to 2. If the fitness of the chromosome after division, $P_{j,d}$, is better than the fitness
of the chromosome $P_j$ then $P_{j,d}$ is selected; otherwise $P_j$ is selected for the absorption
operation. The absorption operation finds the two closest clusters $C_i$ and $C_j$ of the
chromosome $P_j$ or $P_{j,d}$ (whichever is selected from the division operation), and merges
them into one cluster, thus forming a new chromosome $P_{j,a}$. If the fitness of $P_{j,a}$ is
better than the fitness of $P_j$ and $P_{j,d}$ then $P_{j,a}$ is selected; otherwise either $P_j$ or
$P_{j,d}$ (whichever is selected from the division operation) is selected for the random change
operation.
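The fitness-guarded sequence of division then absorption can be sketched in Python as below; all helper callables (sparsest, split_in_two, closest_pair, merge) are assumptions standing in for the operations described above.

def divide_then_absorb(chromosome, fitness, sparsest, split_in_two, closest_pair, merge):
    # Division: split the sparsest cluster in two (e.g. K-means with k = 2);
    # keep the divided chromosome only if its fitness improves.
    divided = split_in_two(chromosome, sparsest(chromosome))
    best = divided if fitness(divided) > fitness(chromosome) else chromosome
    # Absorption: merge the two closest clusters; again keep only on improvement.
    i, j = closest_pair(best)
    absorbed = merge(best, i, j)
    return absorbed if fitness(absorbed) > fitness(best) else best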
Once the division and absorption operations for all chromosomes of a population are
completed, the random change operation is carried out. In the random change operation, the
mutation probability of each chromosome is computed. The mutation probability of a
chromosome $P_j$ is calculated using its fitness $f_j$, and the maximum fitness value $f_{max}$ and
average fitness value $f_{mean}$ of all chromosomes in the current population. The mutation
probability (D.-X. Chang et al., 2009; Srinivas & M. Patnaik, 1994) of the $j$-th chromosome is
calculated as follows, where $k_1$ and $k_2$ are equal to 0.5.

$$M_j = \begin{cases} k_1 \cdot \dfrac{f_{max} - f_j}{f_{max} - f_{mean}} & \text{if } f_j > f_{mean}, \\ k_2 & \text{if } f_j \le f_{mean} \end{cases} \qquad \text{(Eq. 6.1)}$$

The intuition behind this is to reduce the amount of random change on good chromosomes. The
$M_j$ value for the chromosome having the best fitness is zero. The $M_j$ value increases for
chromosomes with lower fitness values, and is 0.5 for all chromosomes whose fitness is at most
the average fitness.
If the mutation probability of a chromosome is greater than a random number (between 0 and
1) then the chromosome is selected for the random change operation, otherwise, it remains
unchanged. In the random change operation, a gene of the chromosome is randomly chosen
where an attribute value of the gene is randomly changed to another value within its domain.
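A direct Python transcription of Eq. 6.1 together with the guarded random change might look as follows; the gene representation (lists of attribute values in [0, 1]) and the NumPy generator rng are assumptions of this sketch.

def mutation_probability(f_j, f_max, f_mean, k1=0.5, k2=0.5):
    # Eq. 6.1: fitter-than-average chromosomes are mutated less often.
    if f_j > f_mean:
        return k1 * (f_max - f_j) / (f_max - f_mean)
    return k2

def maybe_random_change(chromosome, m_j, rng):
    # Apply the random change only if m_j exceeds a uniform random number.
    if m_j > rng.random():
        g = rng.integers(len(chromosome))      # random gene
        a = rng.integers(len(chromosome[g]))   # random attribute of that gene
        chromosome[g][a] = rng.random()        # new value within the normalized domain
    return chromosome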
Component 8: Health Improvement Operation
This is an original contribution of HeMI. The aim of this component is to continuously improve
the health of chromosomes in every generation in order to ensure the presence of high-quality
chromosomes within a population. Crossover and mutation operations are likely to improve
health/fitness of some chromosomes, but they can also decrease health/fitness of some
chromosomes. Therefore, after the crossover and mutation operations, HeMI identifies sick
chromosomes and replaces them through three phases.
In Phase 1, HeMI identifies the healthy and sick chromosomes. It sorts the chromosomes in
descending order of their fitness values and identifies 50% of the chromosomes to be sick. For
example, if there are 20 chromosomes in a population (i.e. population size = 20) then it identifies
the 10 chromosomes with the lowest fitness to be sick and the others to be healthy. The sick chromosomes are
then removed from the population. So the population size temporarily decreases to 50% where
all of them are considered to be healthy. In the following two phases 50% new chromosomes
are added to bring the population size back to 100%.
In Phase 2, HeMI generates 20% new chromosomes i.e. if the original population size is 20
then it generates 4 chromosomes. For this, it first picks the healthiest 20% chromosomes from
the set of healthy chromosomes found in Phase 1. Applying the same approach of Component
5 it then chooses pairs of chromosomes from these 20% healthy chromosomes. It next applies
the crossover operation on each pair in order to generate offspring chromosomes which are then
added into the population. Hence, at this stage the population size is back to 70% of the original
size.
In Phase 3, HeMI adds the remaining 30% chromosomes in the population. These
chromosomes are chosen from the pool of chromosomes that was obtained through the
deterministic phase of Component 3 which are supposed to be healthy chromosomes due to the
use of K-means/K-means++. Moreover, in this phase HeMI chooses the best chromosomes of
the pool. For each of these chromosomes HeMI then randomly changes an attribute value of a
gene within its original domain. These chromosomes are then added into the population to bring
the population size back to 100%.
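Putting the three phases together for a population of size n (e.g. 20) gives the following Python sketch; the crossover and random_change callables follow the description above, while the exact pairing of parents is a simplifying assumption.

def health_improvement(population, init_pool, fitness, crossover, random_change):
    n = len(population)
    ranked = sorted(population, key=fitness, reverse=True)
    healthy = ranked[: n // 2]                       # Phase 1: keep the healthiest 50%
    parents = ranked[: n // 5]                       # Phase 2: cross the healthiest 20%
    offspring = []
    for a, b in zip(parents[0::2], parents[1::2]):
        offspring.extend(crossover(a, b))
    offspring = offspring[: n // 5]
    remaining = n - len(healthy) - len(offspring)    # Phase 3: ~30% from the initial pool
    refreshed = [random_change(c) for c in
                 sorted(init_pool, key=fitness, reverse=True)[:remaining]]
    return healthy + offspring + refreshed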
Fig. 6.1: Flowchart of HeMI algorithm
Component 9: Elitist Operation
The Elitist operation keeps track of the best chromosome throughout the generations in order to
ensure the continuous improvement of the quality of the best chromosome found so far over the
iterations. For finding the best chromosome, HeMI uses the same approach of elitist operation
that we used in DeRanClust (see Section 3.2 of Chapter 3).
Component 10: Neighbor Information Sharing
This is a new contribution of HeMI where neighboring streams share/exchange the best
chromosome among them, at a regular interval such as at every 10th generation. If a stream
somehow suffers from the low quality of its best chromosome then it gets an opportunity to
borrow the best chromosome from its neighboring streams.
For a stream $S_i$, HeMI first identifies its neighboring streams. The streams $S = \{S_1, S_2, \ldots, S_{|S|}\}$
are numbered sequentially, where $|S|$ is the user defined number of streams.
The default number of streams is four in this study. For any stream $S_i$, the two streams
$S_{(i+1) \bmod |S|}$ and $S_{(i+2) \bmod |S|}$ are considered to be the neighboring streams. The MOD operation
ensures that the neighbors are found in a wrap-around fashion, so that for the stream $S_{|S|}$ the
neighboring streams are $S_1$ and $S_2$.
For a stream 𝑆𝑖 , HeMI spots out the best chromosome 𝑃𝑏 out of its neighboring streams
and 𝑆𝑖 . It then replaces the worst chromosome of 𝑆𝑖 by 𝑃𝑏 , if 𝑃𝑏 comes from a neighboring
stream of 𝑆𝑖 . The chromosomes of 𝑆𝑖 are then sorted and the best chromosome is stored as 𝑃𝑏
which is the best chromosome found so far for $S_i$, as explained in Component 9. While the
sharing of the best chromosome from the neighboring streams increases the fitness of the best
chromosome of $S_i$, it maintains the divergence among the streams since the sets of neighboring
streams for any two streams $S_i$ and $S_j$ are different.
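The wrap-around neighborhood and replacement rule can be sketched in Python as follows; streams are lists of chromosomes (0-indexed here), and the names are hypothetical.

def share_with_neighbors(streams, fitness):
    m = len(streams)
    bests = [max(s, key=fitness) for s in streams]          # snapshot before replacement
    for i, stream in enumerate(streams):
        # Neighbors of stream i are streams (i+1) mod m and (i+2) mod m.
        best = max(bests[i], bests[(i + 1) % m], bests[(i + 2) % m], key=fitness)
        if best is not bests[i]:                            # only borrow from a neighbor
            worst = min(stream, key=fitness)
            stream[stream.index(worst)] = best
    return streams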
Component 11: Global Best Selection
This is another contribution of HeMI. At the end of all iterations/generations, each stream has
the best chromosome for the stream. HeMI compares all such best chromosomes from all
streams and then selects the best of the best chromosomes as the final clustering solution. The
genes of the best chromosome represent the cluster centers and records are allocated to their
closest seeds to form the final clusters.
6.2.3 The HeMI Algorithm
After introducing the main components we are now ready to present the overall algorithm of HeMI, which is also explained in Algorithm 6.1 and Fig. 6.1; we use the same notation in both. HeMI takes a data set 𝐷 as input. It first normalizes all numerical attributes separately as explained in Component 1. HeMI uses a user defined number of multiple streams as explained in Component 2; the use of multiple streams aims to improve the clustering results, and the default number of streams in this study is four (4). Each stream contains a subpopulation (see Component 2) in the sense that the total number of chromosomes is divided equally (or as equally as possible) among the streams.
HeMI then generates the initial chromosomes for each stream separately through its proposed Population Initialization component (see Component 3, Step 1 of Algorithm 6.1, and Fig. 6.1). It skips the Noise Based Selection operation in the first iteration, as shown in Step 2 of Algorithm 6.1 and Fig. 6.1; this operation is applied from the second iteration onwards.
The single point crossover, Twin Removal, Mutation, Health Improvement and Elitist operations are then applied sequentially. All of these operations are described above (see Components 5 to 9), and they can also be traced through Steps 3 to 7 of Algorithm 6.1.
In order to take advantage of the multiple streams, HeMI then performs the Neighbor Information Sharing operation at a regular interval, which is by default 10 iterations. This operation is explained in Component 10 and Step 8 of Algorithm 6.1. At the end of all iterations, HeMI applies the Global Best Selection operation in order to find the final clustering solution (see Component 11 and Step 9 of Algorithm 6.1).
6.3 Experimental Results and Discussion
6.3.1 The Data sets and the Evaluation Criteria
We empirically compare our proposed technique called HeMI with five existing techniques
namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), K-means (Lloyd,
1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman & Islam, 2014) on
twenty (20) natural data sets that are available from the UCI machine learning repository (M.
Lichman, 2013). HeMI is compared with these existing techniques because they are recent and
were shown to be better than many other high-quality techniques (Sanghamitra Bandyopadhyay
& Maulik, 2001, 2002; Lai, 2005; Lin, Yang, & Kao, 2005; Murthy & Chowdhury, 1996).
Detailed information on the data sets is provided in Table 6.1. We choose data sets with a wide variety of characteristics. For example, some data sets (such as Glass Identification) have only numerical attributes and some data sets (such as BC) have categorical attributes; the Credit Approval (CA) data set has 6 numerical and 9 categorical attributes. The reason why we choose mostly data sets with only numerical attributes is that all techniques (except HeMI and GenClust) used in this study can handle only numerical attributes.
Some data sets have a low number of attributes, such as Blood Transfusion (BT) with only 4 attributes, and some have a high number of attributes, such as Dermatology (DT) with 34 attributes. Similarly, some data sets have a low number of records, such as Glass Identification with 214 records, and some have a relatively high number of records, such as MGT with 19,020 records. Also, some data sets have a low number of class values (i.e. a low domain size of the class attribute), such as BC with only two class values, while some have a high number of class values, such as LF with 36 class values.
Table 6.1: A brief description of the data sets

Data set | Records (incl. missing) | Records (without missing) | Numerical attributes | Categorical attributes | Class size
Glass Identification (GI) | 214 | 214 | 10 | 0 | 7
Breast Cancer (BC) | 286 | 277 | 0 | 9 | 2
Vertebral Column (VC) | 310 | 310 | 6 | 0 | 2
Ecoli (EC) | 336 | 336 | 8 | 0 | 8
Leaf (LF) | 340 | 340 | 16 | 0 | 36
Liver Disorder (LD) | 345 | 345 | 6 | 0 | 2
Dermatology (DT) | 366 | 358 | 34 | 0 | 6
Credit Approval (CA) | 690 | 653 | 6 | 9 | 2
Breast Cancer Wisconsin Original (WBC) | 699 | 683 | 10 | 0 | 2
Blood Transfusion (BT) | 748 | 748 | 4 | 0 | 2
Pima Indian Diabetes (PID) | 768 | 768 | 8 | 0 | 2
Statlog Vehicle Silhouettes (SV) | 846 | 846 | 18 | 0 | 4
Mammographic Mass (MGM) | 961 | 830 | 5 | 0 | 2
Bank Note Authentication (BN) | 1372 | 1372 | 4 | 0 | 2
Contraceptive Method Choice (CMC) | 1473 | 1473 | 9 | 0 | 3
Yeast (YT) | 1484 | 1484 | 8 | 0 | 10
Image Segmentation (IS) | 2310 | 2310 | 18 | 0 | 7
Wine Quality (WQ) | 4898 | 4898 | 11 | 0 | 7
Page Blocks Classification (PBC) | 5473 | 5473 | 10 | 0 | 5
MAGIC Gamma Telescope (MGT) | 19020 | 19020 | 11 | 0 | 2
Class values are the labels of records, and they reflect an important property of a data set. Typically, clustering algorithms are applied on data sets that do not have any class values. Hence, we delete the class attributes from all data sets prior to any experimentation. Some of the data sets contain missing values, meaning that for some records some attribute values are missing. Column 2 of Table 6.1 shows the total number of records of each data set, including the records that have missing values. We first delete the records having any missing value/s; Column 3 of Table 6.1 shows the number of records without missing value/s. For example, the BC data set has altogether 286 records, but 9 of them have one or more missing values. Hence, after these 9 records are deleted the data set has 277 records without any missing values. In all experiments, we use the data sets without any missing values.
We evaluate and compare the clustering techniques based on two well-known evaluation
criteria namely Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael
Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979).
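For reference, the DB Index used throughout this chapter can be computed as in the following Python/NumPy sketch. This is our illustration of the standard Davies-Bouldin formulation, not the exact thesis implementation; records X, integer cluster labels in 0..k-1 and the centroids are assumed given.

import numpy as np

def db_index(X, labels, centroids):
    # Davies-Bouldin Index: lower values indicate a better clustering.
    k = len(centroids)
    # s[i]: average distance of cluster i's records to its centroid (scatter)
    s = np.array([np.mean(np.linalg.norm(X[labels == i] - centroids[i], axis=1))
                  for i in range(k)])
    total = 0.0
    for i in range(k):
        # similarity of cluster i to its "worst" other cluster
        ratios = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i]
        total += max(ratios)
    return total / k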
6.3.2 The Parameters used in the Experiments
In the experiments on AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014) and HeMI we consider the population size to be 20 and the number of iterations/generations to be 50. We maintain this consistency for all of these techniques in order to ensure a fair comparison among them.
In the experiments, the number of iterations of K-means/K-means++ within HeMI is set to 50, and the number of iterations of K-means within GenClust is also set to 50. The standalone K-means and K-means++ also run for 50 iterations, and the termination threshold value is set to 0.005 as suggested in GenClust. The cluster number in GAGR, K-means and K-means++ is user defined; however, in order to simulate a natural scenario, the cluster number for these techniques is generated randomly between 2 and √𝑛, where 𝑛 is the number of records in a data set. The values of 𝑟𝑚𝑎𝑥 and 𝑟𝑚𝑖𝑛 for AGCUK and HeMI are set to 1 and 0 respectively, as suggested in AGCUK.
6.3.3 The Experimental Setup
For each data set, we run HeMI 20 times since it can produce different clustering results in different runs. We also run all the other techniques, AGCUK, GAGR, K-means, K-means++ and GenClust, 20 times on each of the 20 data sets, and we present the average and standard deviation of the clustering results. Moreover, in order to evaluate the effectiveness of various components of HeMI we use 5 data sets on which we again run the techniques 20 times.
6.3.4 Experimental Results on All Techniques
In this section, we experimentally evaluate the performance of HeMI by comparing it with K-means, K-means++, GAGR, AGCUK and GenClust on all 20 data sets, where each technique runs 20 times on each data set. Since AGCUK, GAGR, K-means and K-means++ cannot handle the 2 data sets with categorical attributes, these techniques are tested on 18 (instead of 20) data sets, whereas HeMI and GenClust are tested on all 20 data sets.
Fig. 6.2: Comparative results between HeMI and the other techniques based on Silhouette Coefficient; panels (a) and (b) each cover ten of the twenty data sets
Fig. 6.2 (a) and Fig. 6.2 (b) show the average and standard deviation of the Silhouette Coefficient of the clustering solutions, where HeMI achieves better results than GenClust in 20 out of 20 data sets. That is, in 20 out of 20 data sets the average Silhouette Coefficient of 20 runs of HeMI is higher than that of 20 runs of GenClust. Moreover, in 17 out of 20 data sets the standard deviations of HeMI do not overlap the standard deviations of GenClust while the average Silhouette Coefficient of HeMI is higher. Note that the cases where the standard deviations of HeMI overlap with the standard deviations of other techniques are indicated by an arrow in Fig. 6.2 (a), Fig. 6.2 (b), Fig. 6.3 (a) and Fig. 6.3 (b). That is, for all cases without an arrow HeMI achieves a better result with no overlap of standard deviations.
HeMI achieves higher Silhouette Coefficient than AGCUK in 18 out of 18 data sets. The
standard deviations of HeMI do not overlap the standard deviations of AGCUK in 17 out of 18
data sets. HeMI also achieves higher Silhouette Coefficient than K-means, K-means++ and
GAGR in all 18 data sets. In 18 out of 18 data sets the standard deviations of HeMI do not
overlap with the standard deviations of any of these techniques.
Fig. 6.3: Comparative results between HeMI and the other techniques based on DB Index; panels (a) and (b) each cover ten of the twenty data sets
All bar graphs in Fig. 6.2 (a), Fig. 6.2 (b), Fig. 6.3 (a) and Fig. 6.3 (b) are in the same sequence: K-means, K-means++, GAGR, AGCUK, GenClust, and HeMI. As we can see in Fig. 6.3 (a) and Fig. 6.3 (b), HeMI achieves better clustering results (on average) than GenClust in 19 out of 20 data sets based on DB Index, for which a lower value indicates a better result. In 18 out of 20 data sets the standard deviations of HeMI do not overlap with the standard deviations of GenClust. Moreover, HeMI performs better than K-means, K-means++, GAGR and AGCUK on all 18 data sets based on DB Index, and in 18 out of 18 data sets the standard deviations of HeMI do not overlap with the standard deviations of these techniques.
The rightmost columns of Fig. 6.2 (b) and Fig. 6.3 (b) show the average Silhouette Coefficient and DB Index of all techniques over all data sets. HeMI achieves clearly better results on average than all other techniques, without any overlap of standard deviations. We believe that this is a very strong result demonstrating the superiority of HeMI over a number of recent and high-quality clustering techniques.
6.3.5 Comparative Results between HeMI and GCS
In this section, we compare HeMI with GCS (as presented in Chapter 5) through two cluster evaluation criteria, namely Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979), using the 15 real-life data sets from the UCI machine learning repository (M. Lichman, 2013) that we used in Chapter 5 (see Table 5.1).
As we can see in Table 6.2, HeMI obtains clustering results inferior to GCS in only 1 out of 15 data sets based on DB Index and 2 out of 15 data sets based on Silhouette Coefficient. The bottom row of Table 6.2 shows the average Silhouette Coefficient and DB Index of the techniques over all data sets. HeMI achieves clearly better results on average than GCS, indicating the effectiveness of the components of HeMI. Note that the results in Table 6.2 are averages of 20 runs for each data set.
Table 6.2: Comparative results between HeMI and GCS
Data set DB Index (lower the better) Silhouette Coefficient (higher the better)
HeMI GCS HeMI GCS
GI 0.24 0.27 0.80 0.81
VC 0.25 0.25 0.83 0.83
EC 0.32 0.32 0.80 0.77
LF 0.43 0.44 0.71 0.70
LD 0.26 0.27 0.81 0.81
DT 0.63 0.71 0.61 0.59
BT 0.17 0.21 0.86 0.81
PID 0.40 0.17 0.73 0.83
SV 0.43 0.53 0.68 0.63
BN 0.39 0.47 0.70 0.66
YT 0.23 0.30 0.80 0.78
IS 0.45 0.50 0.73 0.71
WQ 0.23 0.23 0.84 0.84
PBC 0.10 0.12 0.92 0.90
MGT 0.24 0.51 0.82 0.65
Average 0.31 0.35 0.77 0.75
6.3.6 Comparative Results among HeMI, GCS, GMC and DeRanClust
We now compare HeMI with GCS (presented in Chapter 5), GMC (see Chapter 4) and DeRanClust (see Chapter 3) on the 10 natural data sets (obtained from the UCI machine learning repository (M. Lichman, 2013)) that we used in Chapter 4 (see Table 4.1). We evaluate and compare the clustering techniques based on two well-known evaluation criteria, namely Silhouette Coefficient and DB Index.
Table 6.3 and Table 6.4 show that HeMI achieves better clustering results than DeRanClust and GCS in 8 out of 10 data sets based on Silhouette Coefficient and DB Index, and in 1 out of 10 data sets HeMI performs equally to DeRanClust and GCS based on both criteria.
Table 6.3: Comparative Results among HeMI, GCS, GMC and DeRanClust based on Silhouette Coefficient
Silhouette Coefficient (higher the better)
Data sets DeRanClust GMC GCS HeMI
GI 0.832 0.781 0.789 0.827
VC 0.830 0.461 0.830 0.830
LF 0.664 0.571 0.691 0.699
LD 0.802 0.648 0.812 0.815
DT 0.605 0.508 0.602 0.615
PID 0.468 0.518 0.818 0.788
SV 0.657 0.592 0.638 0.683
BN 0.695 0.600 0.671 0.696
YT 0.786 0.724 0.799 0.837
IS 0.698 0.613 0.719 0.734
Average 0.7037 0.6016 0.7369 0.7524
Table 6.4: Comparative Results among HeMI, GCS, GMC and DeRanClust based on DB Index
DB Index (lower the better)
Data sets DeRanClust GMC GCS HeMI
GI 0.249 0.331 0.280 0.265
VC 0.254 0.822 0.254 0.254
LF 0.509 0.653 0.475 0.440
LD 0.284 0.458 0.270 0.266
DT 0.724 1.005 0.724 0.635
PID 0.914 0.897 0.174 0.409
SV 0.495 0.582 0.525 0.437
BN 0.423 0.589 0.470 0.401
YT 0.272 0.499 0.323 0.233
IS 0.522 0.604 0.493 0.453
Average 0.464 0.644 0.398 0.379
We can see in Table 6.3 and Table 6.4 that HeMI achieves better results than GMC in all 10 data sets based on Silhouette Coefficient and DB Index. Moreover, the bottom rows of Table 6.3 and Table 6.4 show the average Silhouette Coefficient and DB Index of all techniques over all data sets. HeMI achieves clearly better results on average than GCS, GMC and DeRanClust based on both Silhouette Coefficient and DB Index. Note that all results presented in Table 6.3 and Table 6.4 are averages of 10 runs.
6.3.7 An Analysis of the Impact of Various Properties of HeMI
We now explore the effectiveness of some novel properties/components of HeMI in the
following subsections. For the experiments, we add a component of HeMI to an existing
technique called AGCUK, and investigate its impact on AGCUK.
6.3.7.1 An Analysis of the Impact of the Multiple Streams that Exchange Information
An important contribution of HeMI is its multiple streams that share/exchange information at a regular interval. Due to having multiple streams, HeMI can accommodate more chromosomes than existing techniques. Hence, in this section we carry out experiments to first investigate whether a higher number of chromosomes improves the clustering results. We next investigate the impact of exchanging information among the streams at a regular interval. The results justify the usefulness of these components.
Table 6.5 demonstrates that AGCUK achieves better clustering results (in terms of Silhouette Coefficient and DB Index) when it uses 40 chromosomes instead of 20, and 80 chromosomes instead of 40. We run 50 iterations as usual, and the average results of 20 runs are presented in the tables.
Table 6.5: Comparative results between AGCUK, AGCUK with 40 Population and AGCUK with 80 Population
(Columns 2-4: DB Index, lower the better; columns 5-7: Silhouette Coefficient, higher the better)

Data set | AGCUK | AGCUK (40 Pop.) | AGCUK (80 Pop.) | AGCUK | AGCUK (40 Pop.) | AGCUK (80 Pop.)
PID | 1.40 | 1.30 | 1.32 | 0.27 | 0.31 | 0.29
BT | 0.54 | 0.53 | 0.47 | 0.64 | 0.63 | 0.66
GI | 0.62 | 0.53 | 0.54 | 0.65 | 0.70 | 0.69
LD | 0.87 | 0.82 | 0.79 | 0.46 | 0.50 | 0.51
BN | 0.79 | 0.76 | 0.67 | 0.47 | 0.49 | 0.55
Average | 0.84 | 0.78 | 0.75 | 0.49 | 0.52 | 0.54
Table 6.6 presents the clustering results obtained by AGCUK with 80 chromosomes in a single stream and by AGCUK with 80 chromosomes equally divided among 4 streams, called AGCUK with Multiple Streams. In this case the streams do not exchange information, and we pick the best clustering result of the 4 streams at the end of 50 iterations. The table clearly shows that multiple streams help AGCUK to achieve better results.
Table 6.6: Comparative results between AGCUK with 80 Population and AGCUK with Multiple Streams
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | AGCUK (80 Pop.) | AGCUK (Multiple Streams) | AGCUK (80 Pop.) | AGCUK (Multiple Streams)
PID | 1.32 | 1.30 | 0.29 | 0.31
BT | 0.47 | 0.43 | 0.66 | 0.69
GI | 0.54 | 0.40 | 0.69 | 0.78
LD | 0.79 | 0.77 | 0.51 | 0.55
BN | 0.67 | 0.68 | 0.55 | 0.55
Average | 0.75 | 0.71 | 0.54 | 0.57
Table 6.7 compares the clustering results obtained by AGCUK with 4 streams that do not exchange information and by AGCUK with 4 streams, called AGCUK with Neighbor Exchange, that exchange information among neighbors at every 10 iterations. We can clearly see the impact of information exchange on the final clustering results.
Table 6.7: Comparative results between AGCUK with Multiple Streams and AGCUK with Neighbor Exchange
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | AGCUK (Multiple Streams) | AGCUK (Neighbor Exchange) | AGCUK (Multiple Streams) | AGCUK (Neighbor Exchange)
PID | 1.30 | 1.25 | 0.31 | 0.33
BT | 0.43 | 0.44 | 0.69 | 0.70
GI | 0.40 | 0.40 | 0.78 | 0.74
LD | 0.77 | 0.46 | 0.55 | 0.67
BN | 0.68 | 0.53 | 0.55 | 0.63
Average | 0.71 | 0.61 | 0.57 | 0.61
The total number of chromosomes and iterations in all versions of AGCUK is the same, yet AGCUK with multiple streams that exchange information achieves better results than the other versions. This clearly indicates the effectiveness of the component.
Similar experiments are then carried out on another existing technique called GenClust. The results are consistent with those for AGCUK and indicate the effectiveness of the proposed component (see Table 6.8).
Table 6.8: Comparative results between GenClust, GenClust with Multiple Streams and GenClust with Neighbor Exchange
(Columns 2-4: DB Index, lower the better; columns 5-7: Silhouette Coefficient, higher the better)

Data set | GenClust | GenClust (Multiple Streams) | GenClust (Neighbor Exchange) | GenClust | GenClust (Multiple Streams) | GenClust (Neighbor Exchange)
PID | 1.11 | 1.03 | 0.95 | 0.37 | 0.42 | 0.47
BT | 0.89 | 0.80 | 0.77 | 0.41 | 0.44 | 0.47
GI | 0.71 | 0.69 | 0.65 | 0.62 | 0.63 | 0.65
LD | 0.85 | 0.83 | 0.75 | 0.56 | 0.57 | 0.63
BN | 0.85 | 0.82 | 0.76 | 0.41 | 0.42 | 0.46
Average | 0.88 | 0.83 | 0.77 | 0.47 | 0.49 | 0.53
Since we see clear evidence of improvement in AGCUK and GenClust with the inclusion of multiple streams that exchange information, we now compare them with HeMI in order to investigate the effectiveness of the other components of HeMI.
Table 6.9: Comparative results between HeMI, AGCUK with Neighbor Exchange and GenClust with Neighbor Exchange
(Columns 2-4: DB Index, lower the better; columns 5-7: Silhouette Coefficient, higher the better)

Data set | AGCUK (Neighbor Exchange) | GenClust (Neighbor Exchange) | HeMI | AGCUK (Neighbor Exchange) | GenClust (Neighbor Exchange) | HeMI
PID | 1.25 | 0.95 | 0.40 | 0.33 | 0.47 | 0.73
BT | 0.44 | 0.77 | 0.17 | 0.70 | 0.47 | 0.86
GI | 0.40 | 0.65 | 0.26 | 0.74 | 0.65 | 0.82
LD | 0.46 | 0.75 | 0.26 | 0.67 | 0.63 | 0.81
BN | 0.53 | 0.76 | 0.39 | 0.63 | 0.46 | 0.70
Average | 0.61 | 0.77 | 0.29 | 0.61 | 0.53 | 0.78
Table 6.9 shows that HeMI achieves better clustering results than GenClust with multiple streams that exchange information in 5 out of 5 data sets based on Silhouette Coefficient and DB Index. HeMI also performs better than AGCUK with multiple streams that exchange information in all 5 data sets based on both criteria. Moreover, the average clustering results over the 5 data sets show the clear dominance of HeMI over AGCUK and GenClust with multiple streams that exchange information. All results presented in the tables are averages of 20 runs.
6.3.7.2 An Analysis of the Impact of the Population Initialization
In order to evaluate the effectiveness of our proposed population initialization we incorporate this component (see Component 3 in Section 6.2.2) into an existing technique called AGCUK, and then observe how the component impacts AGCUK. We generate 20 initial chromosomes through the proposed component, use these chromosomes in AGCUK as the initial population, and run AGCUK for 50 iterations on 5 data sets. We call this AGCUK with HeMI Population.
Table 6.10: Comparative results between AGCUK and AGCUK with HeMI Population
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | AGCUK | AGCUK with HeMI Population | AGCUK | AGCUK with HeMI Population
PID | 1.40 | 1.29 | 0.27 | 0.30
BT | 0.54 | 0.48 | 0.64 | 0.66
GI | 0.62 | 0.57 | 0.65 | 0.70
LD | 0.87 | 0.87 | 0.46 | 0.45
BN | 0.79 | 0.75 | 0.47 | 0.50
Average | 0.84 | 0.79 | 0.49 | 0.52
Table 6.10 clearly indicates that AGCUK with HeMI Population achieves better clustering results compared to AGCUK according to both the Silhouette Coefficient and the DB Index. The average clustering result of AGCUK with HeMI Population over the 5 data sets is also better than that of AGCUK in terms of both evaluation criteria.
6.3.7.3 An Analysis of the Impact of the Mutation Operation
In order to evaluate the effectiveness of the proposed mutation operation of HeMI we incorporate the component (see Component 7 in Section 6.2.2) into AGCUK, replacing its own mutation operation. We call this version AGCUK with HeMI Mutation. We run both AGCUK and AGCUK with HeMI Mutation for 50 iterations on 5 data sets.
Table 6.11 shows that AGCUK with HeMI Mutation achieves better clustering results compared to AGCUK in 5 out of 5 data sets according to both Silhouette Coefficient and DB Index.
Table 6.11: Comparative results between AGCUK and AGCUK with HeMI Mutation
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | AGCUK | AGCUK with HeMI Mutation | AGCUK | AGCUK with HeMI Mutation
PID | 1.40 | 0.84 | 0.27 | 0.53
BT | 0.54 | 0.23 | 0.64 | 0.83
GI | 0.62 | 0.32 | 0.65 | 0.79
LD | 0.87 | 0.32 | 0.46 | 0.78
BN | 0.79 | 0.60 | 0.47 | 0.52
Average | 0.84 | 0.46 | 0.49 | 0.69
We also extend this analysis by introducing a version of the proposed HeMI where we remove its mutation operation (let us call this version HeMI without Mutation) and then compare it with the complete HeMI.
Table 6.12: Comparative results between HeMI and HeMI without Mutation
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | HeMI | HeMI without Mutation | HeMI | HeMI without Mutation
PID | 0.40 | 0.46 | 0.73 | 0.69
BT | 0.17 | 0.23 | 0.86 | 0.82
GI | 0.26 | 0.28 | 0.82 | 0.80
LD | 0.26 | 0.31 | 0.81 | 0.78
BN | 0.39 | 0.47 | 0.70 | 0.66
Average | 0.29 | 0.35 | 0.78 | 0.75
We run both HeMI and HeMI without mutation for 50 iterations. Table 6.12 shows that HeMI
achieves better clustering results than HeMI without mutation in 5 out of 5 data sets based on
Silhouette Coefficient and DB Index. This clearly indicates the effectiveness of the proposed
mutation operation used in HeMI.
6.3.7.4 An Analysis of the Impact of the Health Improvement
We also explore the effectiveness of the health improvement operation (see Component 8) of HeMI. In Table 6.13 we compare HeMI with a different version of HeMI, called HeMI without Health Improvement Operation, that is exactly the same as HeMI except that it does not include Component 8. We run both versions for 50 iterations on 5 data sets.
Table 6.13: Comparative results between HeMI and HeMI without Health Improvement Operation
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | HeMI | HeMI without Health Improvement | HeMI | HeMI without Health Improvement
PID | 0.40 | 0.80 | 0.73 | 0.50
BT | 0.17 | 0.18 | 0.86 | 0.86
GI | 0.26 | 0.28 | 0.82 | 0.81
LD | 0.26 | 0.27 | 0.81 | 0.80
BN | 0.39 | 0.44 | 0.70 | 0.68
Average | 0.29 | 0.39 | 0.78 | 0.73
We can see in Table 6.13 that HeMI achieves better clustering results than HeMI without Health Improvement Operation based on both Silhouette Coefficient and DB Index.
6.3.7.5 An Analysis of the Impact of the Interval
We also explore the impact of various intervals for the Neighbor Information Sharing component (see Component 10 in Section 6.2.2) of HeMI. In Fig. 6.4 we compare HeMI with two other versions, called HeMI with Interval 5 and HeMI with Interval 15, in which the neighboring streams share/exchange the best chromosome among them at every 5th and every 15th generation, respectively. Note that in HeMI the neighboring streams share/exchange the best chromosome at every 10th generation. We run HeMI, HeMI with Interval 5 and HeMI with Interval 15 for 50 iterations on 5 data sets.
Fig. 6.4: Comparative results between HeMI and HeMI with different Intervals
We can see in Fig. 6.4 that HeMI with Interval 5 achieves better clustering results than HeMI and HeMI with Interval 15 based on Silhouette Coefficient. The results indicate that a shorter sharing interval achieves better clustering results than a longer one; however, a shorter interval also increases the execution time. Therefore, as a heuristic compromise, the interval in HeMI is set to 10.
6.3.7.6 An Analysis of the Impact of the number of Streams
In order to evaluate the impact of different numbers of multiple streams (see Component 2 in
Section 6.2.2), we compare HeMI (where we use 4 streams) with another version of HeMI with
8 streams. We run both HeMI and HeMI with 8 streams for 50 iterations on 5 data sets.
Fig. 6.5 shows that HeMI with 8 streams achieves better clustering results compared to HeMI according to Silhouette Coefficient. The results indicate that HeMI with a higher number of streams performs better than HeMI with a lower number of streams; however, a higher number of streams also increases the execution time. Considering this, we use 4 streams in HeMI in this study, although a user may use more streams as necessary.
Fig. 6.5: Comparative results between HeMI and HeMI with 8 Streams
6.3.7.7 An Analysis of the Improvement in Chromosomes over the Iterations
In Fig. 6.6 we present the average fitness (in terms of DB Index, where Fitness = 1/DB Index) values of the best chromosomes over 5 separate runs of HeMI. That is, we run HeMI 5 times and then present the average fitness of these 5 runs. The average fitness values are plotted against the iterations for all 20 data sets. Most of the data sets achieve a rapid improvement within the first 5 to 10 iterations, and the fitness then continues to increase steadily over the iterations. This is also clear from Fig. 6.7, which presents the grand average fitness of the best chromosomes over all 20 data sets, instead of each data set separately.
Fig. 6.6: Average Fitness versus Iteration. Each line represents the average fitness of the best chromosome of 5 consecutive
runs of HeMI on a data set
Fig. 6.7: Average Fitness (best chromosome) versus Iteration over the 20 data sets
In Fig. 6.8 we present the average fitness of the best chromosome of HeMI, AGCUK and
HeMI with a single stream on the PID data set. Both HeMI and AGCUK use the same fitness
function (DB Index (D L Davies & Bouldin, 1979)) to calculate the fitness of a chromosome.
Average fitness values of the best chromosomes of HeMI are always higher than those of HeMI
with a single stream and AGCUK, clearly indicating the effectiveness of various components of
HeMI including its multiple streams.
Fig. 6.8: Average Fitness (best chromosome) versus Iteration. Each line represents the average fitness of 5 consecutive runs
on PID data set
6.3.8 Statistical Analysis
We now analyze the results by using a statistical sign test (D. Mason, 1998; Triola, 2001) on all 20 data sets over all 20 runs to evaluate the superiority of the results (Silhouette Coefficient and DB Index) obtained by HeMI over the results obtained by the existing techniques. We observe that the results do not follow a normal distribution and thus do not satisfy the conditions for a parametric test.
Fig. 6.9: Sign test of HeMI based on Silhouette Coefficient; panels (a) and (b) each cover ten of the twenty data sets
Hence, we perform a non-parametric sign test on the Silhouette Coefficient and DB Index, as shown in Fig. 6.9 (a), Fig. 6.9 (b), Fig. 6.10 (a) and Fig. 6.10 (b). The first five bars for each data set show the z-values (test statistics) for HeMI against the five existing techniques, while the sixth bar shows the z(ref.) value. If a z-value is greater than the z(ref.) value then the results obtained by HeMI are significantly better than the results of the corresponding existing technique.
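The z-value of the sign test can be computed as in the following sketch, using the standard normal approximation to the one-sided sign test. This is our illustration, not the thesis code; the win count and the treatment of ties are hypothetical inputs, and z(ref.) = 1.96 is taken from the text.

import math

def sign_test_z(wins, n):
    # Normal approximation to the one-sided sign test.
    # wins: number of paired runs where HeMI beats the competing technique
    # n: total number of paired runs (ties assumed discarded beforehand)
    return (wins - n / 2.0) / (math.sqrt(n) / 2.0)

# Hypothetical example: if HeMI wins 18 of 20 paired runs, z = 3.58 > 1.96,
# so the improvement would be significant at p < 0.025 (right-tailed).
print(sign_test_z(18, 20))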
Fig. 6.10: Sign test of HeMI based on DB Index; panels (a) and (b) each cover ten of the twenty data sets
In Fig. 6.9 (a) and Fig. 6.9 (b) we present the sign test of HeMI compared with the existing techniques on the 20 data sets in terms of Silhouette Coefficient, where HeMI performs significantly better than the other techniques on 19 out of 20 data sets. Fig. 6.10 (a) and Fig. 6.10 (b) show the statistical significance of HeMI compared with the other existing techniques based on DB Index, where HeMI performs significantly better than the other techniques on 19 out of 20 data sets. We carry out the statistical analysis as a right-tailed test at p < 0.025, i.e. z(ref.) = 1.96, for both Silhouette Coefficient and DB Index. Note that the cases where the z-value is lower than z(ref.) are indicated with arrows in Fig. 6.9 (a) and Fig. 6.10 (b).
6.3.9 An Analysis on the use of K-means++ instead of K-means in HeMI
The HeMI algorithm allows us to use any lightweight clustering technique for the initial population, including K-means and K-means++. In our experiments so far we used K-means for the initial population, and we see that HeMI clearly outperforms all the other existing techniques used in this study. Table 6.14 indicates that HeMI with K-means++ for the initial population achieves better clustering results than HeMI with K-means; a sketch of the K-means++ seeding is given after Table 6.14. Hence, we are confident that HeMI with K-means++ would win against the other existing techniques even more strongly.
Table 6.14: Comparative results between HeMI and HeMI with K-means++
(Columns 2-3: DB Index, lower the better; columns 4-5: Silhouette Coefficient, higher the better)

Data set | HeMI | HeMI with K-means++ | HeMI | HeMI with K-means++
PID | 0.40 | 0.17 | 0.73 | 0.86
BT | 0.17 | 0.17 | 0.86 | 0.87
GI | 0.26 | 0.24 | 0.82 | 0.83
LD | 0.26 | 0.26 | 0.81 | 0.82
BN | 0.39 | 0.38 | 0.70 | 0.71
Average | 0.29 | 0.24 | 0.78 | 0.81
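For completeness, K-means++ differs from K-means only in how the initial seeds are drawn. The sketch below is our illustration of the standard D² seeding of Arthur and Vassilvitskii (2007), not the exact implementation used in HeMI; X is an n × m NumPy array of normalized records.

import numpy as np

def kmeanspp_seeds(X, k, rng=None):
    # Standard D^2 seeding: each new seed is drawn with probability
    # proportional to the squared distance to the nearest chosen seed.
    rng = rng or np.random.default_rng()
    seeds = [X[rng.integers(len(X))]]          # first seed uniformly at random
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)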
6.3.10 Complexity Analysis
In this section, we estimate the complexity of HeMI and compare it with the complexities of the existing techniques used in this study. The main factors determining the complexity of HeMI are the number of records 𝑛 in a data set 𝐷, the number of attributes 𝑚 in 𝐷, the number of genes 𝑘 in a chromosome, the number of chromosomes 𝑧 in the population of a stream, the number of iterations 𝑁′ of K-means and the number of iterations 𝑁 of HeMI. Out of these factors, 𝑛, 𝑚, 𝑘 and 𝑧 can be much bigger than the others and hence we compute the complexity in terms of them.
For the initial population, HeMI uses K-means to get a number of deterministic chromosomes, the complexity of which is 𝑂(𝑛𝑚𝑘𝑧). It also randomly selects some chromosomes, for which the complexity is 𝑂(𝑘𝑧). The fitness function is the DB Index, which has a complexity of 𝑂(𝑛𝑚𝑘𝑧). Once the fitness is computed, the noise based selection requires a pairwise comparison, which can be done in 𝑂(𝑧) complexity. The crossover operation requires the roulette wheel, for which we need 𝑂(𝑧²) complexity. For the twin removal we need 𝑂(𝑚𝑘²𝑧) complexity. In the mutation operation, the complexities of the division, absorption and random change are 𝑂(𝑛𝑚𝑘𝑧), 𝑂(𝑚𝑘𝑧) and 𝑂(𝑧), respectively. The complexities of Phase 1, Phase 2 and Phase 3 of the Health Improvement component are 𝑂(𝑛𝑚𝑘𝑧), 𝑂(𝑧) and 𝑂(𝑧), respectively.
The elitist operation has a complexity of 𝑂(𝑧) once the fitness is calculated at the cost of 𝑂(𝑛𝑚𝑘𝑧). Information exchange among neighboring streams requires 𝑂(𝑧) complexity. Similarly, the global best selection requires 𝑂(𝑛𝑚𝑘𝑧) + 𝑂(𝑧) complexity. Hence, the overall complexity of HeMI is 𝑂(𝑛𝑚𝑘²𝑧²). With respect to 𝑛 and 𝑚 (the two most significant factors), it has a linear complexity 𝑂(𝑛𝑚). The complexities of K-means, AGCUK, GAGR and GenClust are 𝑂(𝑛𝑚) (Lloyd, 1982), 𝑂(𝑛𝑚) (Y. Liu et al., 2011), 𝑂(𝑛𝑚) (D.-X. Chang et al., 2009) and 𝑂(𝑛𝑚² + 𝑛²𝑚) (Rahman & Islam, 2014), respectively.
6.3.11 Comparison between HeMI and Multiple Runs of K-means
Although the complexities of K-means and HeMI in terms of 𝑛 and 𝑚 are the same, i.e. 𝑂(𝑛𝑚), we realize that HeMI may require a higher execution time than K-means; for example, it executes the distance calculation more frequently than K-means. Therefore, by the time we run HeMI once we can perhaps run K-means multiple times. We compute that K-means executes the distance calculation approximately 𝑛𝑚𝑘𝑖 times, where 𝑖 is the number of iterations and 𝑘 is the number of seeds. Considering 50 iterations (i.e. 𝑖 = 50) for K-means, it requires the distance calculation 50𝑛𝑚𝑘 times.
On the other hand, HeMI executes the distance calculation approximately 𝑛𝑚𝑘𝑧𝑖 times for the initial population plus 8𝑛𝑚𝑘𝑧𝐺 times for the genetic operations, where 𝑧 is the number of chromosomes, 𝑖 is the number of iterations of K-means and 𝐺 is the number of generations. Considering 𝑖 = 50, 𝑧 = 20 and 𝐺 = 50, HeMI requires the distance calculation 1000𝑛𝑚𝑘 + 8000𝑛𝑚𝑘 = 9000𝑛𝑚𝑘 times. That is, HeMI requires 9000𝑛𝑚𝑘 / 50𝑛𝑚𝑘 = 180 times as many distance calculations as K-means. As a result, by the time we can run HeMI once we can run K-means approximately 180 times.
Fig. 6.11: Comparative result between HeMI and K-means
Therefore, in this section we run K-means up to 500 times and pick the best result out of these runs. We also run HeMI 20 times and pick the worst result out of the 20 runs. Finally, in Fig. 6.11 we compare the best result of K-means with the worst result of HeMI on three randomly chosen data sets. The top three lines in Fig. 6.11 show the worst results of HeMI on the three data sets, and the bottom three lines represent the best results of K-means for numbers of runs ranging from 50 to 500 on the same three data sets. The results clearly indicate that K-means cannot beat HeMI (in terms of the Silhouette Coefficient) even if K-means runs 500 times.
6.4 Summary
In this chapter, we propose a clustering technique that, in addition to selecting an initial population with a low complexity of 𝑂(𝑛), uses new components including multiple streams, information exchange between neighboring streams, regular health improvement of the chromosomes and a mutation operation that also aims to improve chromosome health/quality.
We evaluate the proposed technique (HeMI) by comparing its clustering quality with five
existing techniques namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009),
K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and GenClust (Rahman &
Islam, 2014) on twenty (20) natural data sets that are publicly available from the UCI machine
learning repository (M. Lichman, 2013) in terms of two well-known evaluation criteria:
Silhouette Coefficient and DB Index.
We run each technique 20 times on each data set, and we present the average clustering results of the 20 runs together with their standard deviation. The experimental results show that HeMI achieves better results than GenClust in 20 out of 20 data sets, and in 17 out of 20 data sets the standard deviations of HeMI do not overlap the standard deviations of GenClust while the average Silhouette Coefficient of HeMI is higher than that of GenClust.
HeMI achieves a higher Silhouette Coefficient than AGCUK in 18 out of 18 data sets (recall that AGCUK, GAGR, K-means and K-means++ can only be tested on the 18 data sets without categorical attributes). The standard deviations of HeMI do not overlap the standard deviations of AGCUK in 17 out of 18 data sets. HeMI also achieves a higher Silhouette Coefficient than K-means, K-means++ and GAGR in all 18 data sets, and in 18 out of 18 data sets the standard deviations of HeMI do not overlap with the standard deviations of any of these techniques.
Moreover, HeMI achieves better clustering results (on average) than GenClust in 19 out of 20 data sets based on DB Index, for which a lower value indicates a better result. In 18 out of 20 data sets HeMI does not have any overlap of standard deviations with GenClust. Moreover, HeMI performs better than K-means, K-means++, GAGR and AGCUK on all 18 data sets based on DB Index, and in 18 out of 18 data sets HeMI does not have any overlap of standard deviations with these techniques.
We also carry out a thorough investigation to evaluate the major components of HeMI one by one. It is evident that all of these components have a positive impact on the final clustering quality. We also present a complexity analysis, which shows that HeMI has a complexity that is linear in the number of records 𝑛 of a data set.
With this result of HeMI, we achieve our first goal of proposing a parameter-less clustering technique with a high-quality solution and low complexity. However, in order to achieve our second goal of producing sensible clustering solutions we need to carefully analyse the results obtained by HeMI and other existing techniques. We carry out this analysis in the next chapter.
Chapter 7
A Novel GA-based Clustering Technique and its
Suitability for Knowledge Discovery from a Brain
Data Set
7.1 Introduction
In this chapter, we present a GA-based Clustering technique called CSClust which aims to
produce sensible clusters. We realize that some recent clustering techniques do not produce
sensible clusters and fail to discover knowledge from underlying data sets. Sometimes, they
obtain a huge number of clusters and sometimes they obtain only two clusters, where one cluster
contains one record and the other cluster contains all remaining records. Interestingly, these
clustering solutions often achieve high fitness values based on existing evaluation criteria.
During the PhD candidature, we published the following paper based on this chapter.
Beg, A. H. and Islam, M. Z. (2016): A Novel Genetic Algorithm-Based Clustering Technique and its
Suitability for Knowledge Discovery from a Brain Data set, In Proc. of the IEEE Congress on
Evolutionary Computation (IEEE CEC 2016), Vancouver, Canada, July 24-29, 2016, pp. 948-956. (ERA
2010 Rank A).
Therefore, in CSClust we propose a new cleansing and cloning operation that helps to
produce sensible clusters with high fitness values, which are also useful for knowledge
discovery. We now briefly introduce the novel components/properties of CSClust and their
logical justifications as follows.
The central component of CSClust is a cleansing operation applied in each generation in order to ensure that all chromosomes in the population encode a sensible solution. Through our empirical analysis using GenClust and GAGR (see Section 7.2) we find that although they produce clustering solutions with high fitness values, they often end up producing non-sensible clustering results. Therefore, we introduce a cleansing operation that applies two conditions: (i) the number of clusters must be within a range between a minimum and a maximum number of clusters, which is learned by CSClust from some properties of the data set, and (ii) the number of records in each cluster must be greater than a threshold minimum number of records, which is again data-driven (i.e. not user defined). CSClust uses the initial population in order to learn the range between the minimum and maximum number of clusters and the threshold minimum number of records.
Another important component of CSClust is the selection of sensible properties, which makes better use of the initial population. CSClust produces a high-quality initial population (see Step 2 in Section 7.3) through a deterministic phase and a random phase, using the same population initialization approach as in Chapter 6 (see Section 6.2.2). It does not require users to determine the number of clusters and/or the radii of clusters, and it keeps the complexity of the initial population selection as low as 𝑂(𝑛). CSClust selects the top |P| chromosomes (|P| = 20 in this study) from the two phases. It then finds the necessary properties of a sensible clustering solution, which are used in each generation in order to ensure that the chromosomes of the population do not contradict them.
Another interesting idea associated with CSClust is the cloning operation used to replace sick chromosomes in each generation. In each generation, the cleansing operation identifies the sick chromosomes, which are then replaced through the cloning operation from a pool of healthy chromosomes found in the initial population.
We evaluate our technique CSClust by comparing its performance with five existing techniques, namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014), K-means (Lloyd, 1982) and K-means++ (Arthur & Vassilvitskii, 2007). We conduct experiments on 10 real-life data sets that are publicly available in the UCI machine learning repository (M. Lichman, 2013). The experimental results clearly indicate that CSClust performs better than the existing techniques in terms of two evaluation criteria: Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-Ning Tan, Michael Steinbach, 2005) and DB Index (D L Davies & Bouldin, 1979).
Moreover, we apply all the techniques on a brain data set, the CHB-MIT Scalp data set (see Section 7.4.4), in order to assess their quality in producing sensible clustering solutions. The empirical analysis presented in Section 7.4.5 indicates that CSClust produces better clustering solutions that are suitable for deriving knowledge from a data set, whereas all the other techniques typically fail to generate sensible clustering solutions.
The contributions of the chapter are as follows: (i) proposing CSClust, which aims to produce sensible clusters; and (ii) the evaluation of CSClust by comparing it with existing techniques.
The organization of the chapter is as follows: in Section 7.2 we discuss the motivation behind
the proposed technique; in Section 7.3 we present our proposed technique; the experimental
results and discussion are presented in Section 7.4, and the summary of the chapter is presented
in Section 7.5.
7.2 The Motivation Behind the Proposed Technique
In this section, we discuss the motivation behind our proposed technique. We first explore some
clustering solutions made by existing clustering techniques. We use a brain data set called CHB-
MIT Scalp EEG (Goldberger et al., 2000) as an example which is available from
https://physionet.org/cgi-bin/atm/ATM. We plot the data set so that we can graphically visualize
the clusters (see Fig. 7.1). Fig. 7.1 shows the structure of the data set where it has two clusters:
seizure and non-seizure.
Fig. 7.1: The three-dimensional CHB-MIT Scalp EEG (chb01-03) data set
The original data set (Goldberger et al., 2000) contains 9 non-class attributes and a class attribute having two possible values: seizure and non-seizure. We pick three non-class attributes (max, min and std) so that we can plot the records in three dimensions (see Fig. 7.1). In Fig. 7.1, dots represent non-seizure records and plus signs represent seizure records.
We apply GenClust (Rahman & Islam, 2014) on the CHB-MIT data set. The empirical analysis shows that GenClust generates 477 clusters (see Fig. 7.2), which appears to be a non-sensible clustering solution since the actual number of clusters of this data set is only two (seizure and non-seizure), as shown in Fig. 7.1. Note that while clustering the records only the non-class attributes are used; the class attribute is not used.
Fig. 7.2: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set
GenClust uses the fitness function called COSEC (Rahman & Islam, 2014) to evaluate the fitness of a chromosome. The COSEC value of a chromosome increases when the number of clusters of the chromosome increases. Therefore, due to the use of COSEC, GenClust tends to obtain a clustering solution with a large number of clusters.
Fig. 7.3: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set
We also apply GAGR (D.-X. Chang et al., 2009) on the CHB-MIT Scalp data set. GAGR generates 56 clusters (see Fig. 7.3), which is also not sensible as the original number of clusters of this data set is only two. GAGR uses SSE (Pang-Ning Tan, Michael Steinbach, 2005) as its fitness function, and as in GenClust, the fitness of a chromosome increases when the number of clusters of the chromosome increases. Accordingly, GAGR tends to generate a clustering solution with a high number of clusters.
We then explore DB Index (D L Davies & Bouldin, 1979) as the fitness function in GenClust instead of COSEC. Typically, the DB Index considers the compactness of the clusters and the seed distances between clusters. Therefore, a clustering solution with a low number of clusters, some containing very few records, can obtain a deceptively good DB Index value and hence a high fitness.
Fig. 7.4 shows that using DB Index as the fitness function GenClust generates two clusters, but one cluster contains a single record while all the remaining records belong to the other cluster, which is also not sensible. In order to handle such situations, CSClust introduces the cleansing operation.
Fig. 7.4: Clustering result of GenClust using DB Index on CHB-MIT Scalp EEG (chb01-03) data set
One crucial component of CSClust is the cleansing operation, which aims to ensure that the chromosomes in a population do not contradict the properties of a sensible clustering solution. The sensible chromosomes are selected by applying two conditions: (i) the number of clusters must be within the range between the minimum and maximum number of clusters, and (ii) the number of records in each cluster must be greater than a threshold minimum number of records. The threshold values are learned from the sensible chromosomes selected in the initial population. The chromosomes selected in the initial population are expected to be sensible due to the selection of the top chromosomes from a pool of high-quality chromosomes.
Another interesting idea associated with CSClust is the cloning operation used to replace the sick chromosomes found by the cleansing operation. CSClust replaces the sick chromosomes with chromosomes from the pool of high-quality chromosomes found in the initial population. This pool is expected to be reasonably healthy due to the repeated use of K-means.
7.3 CSClust: High-quality Chromosome Selection and Cleansing Operation in a GA for
Clustering
We first mention the main steps of CSClust as follows and then explain each of them in detail. Out of the following steps, Step 3, Step 7 and Step 8 are the novel contributions of this chapter.
BEGIN
Step 1: Normalization
Step 2: Population Initialization
Step 3: Sensible Properties Selection
DO: j = 1 to t /* t is the user defined number of iterations/generations */
Step 4: Crossover Operation
Step 5: Mutation Operation
Step 6: Twin Removal Operation
Step 7: Cleansing Operation
Step 8: Cloning Operation
Step 9: Elitist Operation
END
END
Step 1: Normalization
CSClust takes a data set 𝐷 as input. It first normalizes the data set 𝐷 in order to weigh each
attribute equally regardless of their domain sizes. For normalization, CSClust uses the same
approach of normalization that we used in DeRanClust (see Section 3.2 in Chapter 3).
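Assuming the min-max style normalization described for DeRanClust (our reading of the approach; the exact Chapter 3 formulation may differ in detail), each numerical attribute can be rescaled into [0, 1] as in the following sketch:

import numpy as np

def normalize(D):
    # Min-max normalize each numerical attribute (column) into [0, 1],
    # so that attributes with large domains do not dominate the distances.
    D = np.asarray(D, dtype=float)
    mins, maxs = D.min(axis=0), D.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # guard constant columns
    return (D - mins) / span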
Step 2: Population Initialization
For the population initialization, CSClust uses the same approach as HeMI (see Section 6.2.2 of Chapter 6). CSClust selects |P| chromosomes in the initial population, |P|/2 from the deterministic phase and |P|/2 from the random phase. In the experiments of this chapter, we set |P| to 20. CSClust uses the Davies-Bouldin (DB) Index (D L Davies & Bouldin, 1979) to calculate the fitness of a chromosome.
Step 3: Sensible Properties Selection
This is an original contribution of CSClust that makes better use of the initial population to find the necessary properties of a sensible clustering solution. In the population initialization step, CSClust selects the |𝑃| top chromosomes from the generated initial chromosomes based on their fitness. CSClust then learns the necessary properties of a sensible clustering solution, namely the minimum (𝑀𝑛) and maximum (𝑀𝑥) number of clusters and the minimum number of records (𝑀𝑟) in a cluster, from these |𝑃| chromosomes (see Step 3 of Algorithm 7.1). The properties of a sensible clustering solution are then used in the cleansing and cloning operations.
Step 4: Crossover Operation
All chromosomes participate in the crossover pair by pair. The best chromosome (available in the current population) is selected as the 1st chromosome of the pair, and the 2nd chromosome is selected probabilistically using the roulette wheel technique (D. Chang et al., 2012; Maulik & Bandyopadhyay, 2000; Mukhopadhyay & Maulik, 2009). The probability of a chromosome 𝑃𝑗 is computed as 𝑇𝑗 = 𝑓𝑗 / (𝑓1 + 𝑓2 + … + 𝑓|𝑃|), where 𝑓𝑗 is the fitness of the chromosome 𝑃𝑗 and |𝑃| is the size of the current population.
Once the pair of chromosomes is selected for crossover, CSClust applies the gene re-arrangement operation (Rahman & Islam, 2014) in order to avoid an inappropriate arrangement of genes. The pair of chromosomes then participates in a conventional single point crossover (Sanghamitra Bandyopadhyay & Maulik, 2002; D. Chang et al., 2012; Garai & Chaudhuri, 2004; Michael Laszlo & Mukherjee, 2007; Pakhira et al., 2005; Peng et al., 2014; Rahman & Islam, 2014; Song et al., 2009). CSClust applies the crossover operation on each pair of chromosomes of the population, producing |𝑃| offspring chromosomes altogether. It then applies the twin removal operation (Rahman & Islam, 2014) in order to delete/modify the twin genes (if any) of a chromosome.
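A sketch of the roulette wheel selection of the second parent follows. This is our illustration under the fitness-proportional probability defined above; the fitness array is a hypothetical input.

import numpy as np

def roulette_wheel_partner(fitness, rng=None):
    # Select the index of the 2nd crossover parent with probability
    # T_j = f_j / sum_i f_i, so fitter chromosomes are chosen more often.
    rng = rng or np.random.default_rng()
    f = np.asarray(fitness, dtype=float)
    return int(rng.choice(len(f), p=f / f.sum()))

# Usage: pair the best chromosome (index 0 after sorting) with a
# probabilistically chosen partner.
fitness = [0.9, 0.5, 0.3, 0.3]
best, partner = 0, roulette_wheel_partner(fitness)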
Step 5: Mutation Operation
The aim of the mutation operation is to randomly change some of the chromosomes in order to explore different solutions. CSClust uses the random change operation (D. Chang et al., 2012; Rahman & Islam, 2014) probabilistically, where a chromosome with low fitness has a high chance of being selected for the random change, and vice versa (D.-X. Chang et al., 2009; Rahman & Islam, 2014). The mutation probability of the 𝑗th chromosome is calculated as follows.
𝑀𝑗 = 𝑘1 × (𝑓𝑚𝑎𝑥 − 𝑓𝑗) / (𝑓𝑚𝑎𝑥 − 𝑓𝑚𝑒𝑎𝑛),  if 𝑓𝑗 > 𝑓𝑚𝑒𝑎𝑛
𝑀𝑗 = 𝑘2,  if 𝑓𝑗 ≤ 𝑓𝑚𝑒𝑎𝑛
Eq. 7.1
where 𝑘1 and 𝑘2 both equal 0.5, 𝑓𝑚𝑎𝑥 is the maximum fitness value of a chromosome in the population, 𝑓𝑚𝑒𝑎𝑛 is the average fitness value of the chromosomes in the population and 𝑓𝑗 is the fitness of the 𝑗th chromosome. Once a chromosome is selected for the random change operation, CSClust randomly selects an attribute of each gene and modifies the attribute value randomly.
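Eq. 7.1 can be implemented directly, as in the following sketch (our illustration, with k1 = k2 = 0.5 as in the text):

def mutation_probability(f_j, f_max, f_mean, k1=0.5, k2=0.5):
    # Eq. 7.1: above-average chromosomes mutate with a probability that
    # shrinks as their fitness approaches f_max; the rest mutate with k2.
    if f_j > f_mean:
        return k1 * (f_max - f_j) / (f_max - f_mean)
    return k2

# For instance, the best chromosome (f_j == f_max) gets probability 0,
# while any below-average chromosome gets 0.5.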
Step 6: Twin Removal Operation
CSClust uses the twin removal operation in order to remove/modify twin genes (if any) from
each chromosome. For twin removal, CSClust uses the same approach of twin removal of
DeRanClust (see Section 3.2 of Chapter 3).
Algorithm 7.1: CSClust
Input: A data set 𝐷 having 𝑛 records and |𝐴| attributes, where 𝐴 is the set of attributes
Output: A set of clusters C
Require:
Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
Pr ← ∅ /* Pr is the set of random chromosomes (45 chromosomes), initially empty */
end
Step 1: /* Normalization */
D′ ← Normalize(D) /* normalize each attribute of the data set into the normalized data set D′ */
end
Step 2: /* Population Initialization */
Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the numbers of initial seeds for K-means are chosen deterministically */
Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the numbers of initial seeds and the seeds themselves are chosen randomly */
Px ← SelectDeterministicChromosomes(Pd) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
Py ← SelectRandomChromosomes(Pr) /* select the top 10 chromosomes (50% of the initial population) based on fitness */
Ps ← Px ∪ Py /* Ps is the initial population (20 chromosomes) */
Pb ← FindBestChromosome(Ps) /* Pb is the chromosome with the maximum fitness value in Ps */
end
Step 3: /* Sensible Properties Selection */
(Mn, Mx, Mr) ← FindSensibleProperties(Ps) /* find the minimum number of clusters (Mn), the maximum number of clusters (Mx) and the minimum number of records in a cluster (Mr) */
end
for t = 1 to I do /* I is the user defined number of iterations (default I = 50) and t is the counter */
Step 4: /* Crossover Operation */
Po ← PerformCrossover(Ps) /* perform single point crossover on Ps and get a set of offspring chromosomes Po */
Po ← TwinRemoval(Po) /* perform twin removal on Po */
end
Step 5: /* Mutation Operation */
Pm ← PerformMutationOperation(Po) /* perform the mutation operation on Po and get a set of mutated chromosomes Pm */
end
Step 6: /* Twin Removal */
Pm ← TwinRemoval(Pm) /* perform twin removal on Pm */
end
Step 7: /* Cleansing Operation */
Sc ← FindSickChromosomes(Pm, Mn, Mx, Mr) /* find the set of sick chromosomes Sc in Pm based on Mn, Mx and Mr */
Pm ← Pm − Sc /* remove the Sc chromosomes from Pm */
end
Step 8: /* Cloning Operation */
while |Sc| > 0 do
Hc ← CloningOperation(Pd) /* replace the sick chromosomes from Pd and get a set of healthy chromosomes Hc */
end
Pm ← Pm ∪ Hc /* insert Hc into Pm */
end
Step 9: /* Elitist Operation */
Pb ← ElitistOperation(Pm, Pb) /* apply the elitist operation on Pm and Pb and find the best chromosome Pb */
C ← C ∪ Pb /* insert Pb into C */
Return C
end
end
Step 7: Cleansing Operation
This is an original contribution of CSClust. The aim of this component is to separate the chromosomes of a population into those with sensible and those with non-sensible solutions. CSClust first learns the necessary properties of a sensible clustering solution [the minimum (𝑀𝑛) and maximum (𝑀𝑥) number of clusters, and the minimum number of records (𝑀𝑟) in a cluster] through Step 3. In different runs CSClust may find different values of 𝑀𝑛, 𝑀𝑥 and 𝑀𝑟. Therefore, it avoids using a rigid set of values for 𝑀𝑛, 𝑀𝑥 and 𝑀𝑟 and relaxes the boundaries: it increases the value of 𝑀𝑥 by 𝑡% and decreases the values of 𝑀𝑛 and 𝑀𝑟 by 𝑡%. The value of 𝑡 in this study is set to 10.
CSClust applies the cleansing operation on each chromosome of the population based on the properties of a sensible clustering solution. If the length of a chromosome (i.e. the number of genes in the chromosome) is greater than or equal to 𝑀𝑛 and less than or equal to 𝑀𝑥, and the number of records in each cluster is greater than 𝑀𝑟, then the chromosome is accepted as a sensible solution; otherwise it is considered a sick chromosome (see Step 7 of Algorithm 7.1). The sick chromosomes are then removed from the population.
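A minimal sketch of the cleansing check follows. This is our illustration only; cluster_sizes(c) is an assumed helper that returns the number of records allocated to each gene/cluster of chromosome c.

def is_sensible(chromosome, Mn, Mx, Mr, cluster_sizes, t=0.10):
    # Cleansing test with the relaxed, data-driven boundaries:
    # Mx is increased by t%, while Mn and Mr are decreased by t%.
    lo, hi = Mn * (1 - t), Mx * (1 + t)
    min_records = Mr * (1 - t)
    k = len(chromosome)            # number of genes = number of clusters
    return lo <= k <= hi and all(s > min_records for s in cluster_sizes(chromosome))

def cleansing(population, Mn, Mx, Mr, cluster_sizes):
    # Split the population into sensible (healthy) and sick chromosomes.
    healthy = [c for c in population if is_sensible(c, Mn, Mx, Mr, cluster_sizes)]
    sick = [c for c in population if c not in healthy]
    return healthy, sick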
Step 8: Cloning Operation
This is another new contribution of CSClust. The cloning operation replaces the sick chromosomes found by the cleansing operation. To replace a sick chromosome, CSClust probabilistically selects a chromosome from the pool of chromosomes obtained through the deterministic phase of the population initialization (see Step 2). The chromosomes in this pool are expected to be generally of good health since they are obtained through multiple runs of K-means. CSClust then randomly changes an attribute value of a gene to another value within the domain of the attribute (see Step 8 of Algorithm 7.1). Thus, the chosen chromosome is slightly modified before being added to the population in place of a sick chromosome.
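The cloning step can then be sketched as follows. This is our illustration under the assumption that the probabilistic selection from the deterministic pool is fitness-proportional; tweak_random_gene is an assumed helper that randomly changes one attribute value of one gene within its domain.

import random

def clone_replacements(num_sick, pool, fitness, tweak_random_gene):
    # Replace each sick chromosome with a slightly mutated copy of a
    # chromosome drawn probabilistically (by fitness) from the K-means pool.
    weights = [fitness(c) for c in pool]
    chosen = random.choices(pool, weights=weights, k=num_sick)
    return [tweak_random_gene(c) for c in chosen]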
Step 9: Elitist Operation
The Elitist operation keeps track of the best chromosome throughout the generations in order to
ensure the continuous improvement of the quality of the best chromosome found so far over the
iterations. For finding the best chromosome, CSClust uses the same approach of elitist operation
that we used in DeRanClust (see Section 3.2 of Chapter 3).
7.4 Experimental Results and Discussion
7.4.1 The Data sets and the Cluster Evaluation Criteria
We empirically compare the performance of our proposed technique CSClust with five existing techniques, namely K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007), AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009) and GenClust (Rahman & Islam, 2014), on a brain data set (the CHB-MIT data set) (Goldberger et al., 2000). We also compare the performance of CSClust with the five existing techniques on 10 other real-life data sets that are available in the UCI machine learning repository (M. Lichman, 2013). Detailed information about the data sets is provided in Table 7.1. All the data sets used in the experiments have only numerical attributes, except the class attribute. We choose data sets with numerical attributes because the techniques (except CSClust and GenClust) used in this study can only handle numerical attributes.
Some of the data sets contain missing values; that is, some attribute values of some records are missing. We delete all records having any missing values. Two well-known evaluation criteria, namely the Silhouette Coefficient (Pang-Ning Tan, Michael Steinbach, 2005) and the DB Index (D. L. Davies & Bouldin, 1979), are used to compare the performance of the techniques. Note that a higher Silhouette Coefficient and a lower DB Index both indicate a better clustering result.
Table 7.1: Data sets at a glance
Data set | No. of records (with missing) | No. of records (without missing) | No. of numerical attributes | No. of categorical attributes | Class size
Glass Identification (GI) | 214 | 214 | 10 | 0 | 7
Vertebral Column (VC) | 310 | 310 | 6 | 0 | 2
Dermatology (DT) | 366 | 358 | 34 | 0 | 6
Blood Transfusion (BT) | 748 | 748 | 4 | 0 | 2
Bank Note Authentication (BN) | 1372 | 1372 | 4 | 0 | 2
Yeast (YT) | 1484 | 1484 | 8 | 0 | 10
Image Segmentation (IS) | 2310 | 2310 | 18 | 0 | 7
Wine Quality (WQ) | 4898 | 4898 | 11 | 0 | 7
Page Blocks Classification (PBC) | 5473 | 5473 | 10 | 0 | 5
MAGIC Gamma Telescope (MGT) | 19020 | 19020 | 11 | 0 | 2
7.4.2 The Parameters used in the Experiments
In the experiments on AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014) and CSClust, the population size is set to 20 and the number of iterations/generations is set to 50. We maintain this consistency for all techniques in order to ensure a fair comparison among them.
The number of iterations in K-means (Lloyd, 1982) and K-means++ (Arthur & Vassilvitskii, 2007) is set to 50, and the number of iterations of K-means within GenClust is also set to 50. The number of clusters k in GAGR, K-means and K-means++ is user defined. However, in order to simulate a natural scenario, the cluster numbers for GAGR, K-means and K-means++ are generated randomly in the range between 2 and √n, where n is the number of records in a data set. The threshold value for K-means is set to 0.005. The values of rmax and rmin in AGCUK are set to 1 and 0, respectively.
7.4.3 The Experimental Setup
On each data set, we run CSClust 10 times, since it can produce different clustering results in different runs. We also run all the other techniques (GenClust, GAGR, AGCUK, K-means and K-means++) 10 times each. We then present the average clustering results of CSClust and all the other techniques.
7.4.4 Brain Data set (CHB-MIT Scalp EEG) Pre-processing
Before experimentation, we first prepare the CHB-MIT Scalp EEG data set (Goldberger et al., 2000). This data set consists of EEG recordings of 22 epileptic patients from different age groups, collected at the Children's Hospital Boston from pediatric subjects with intractable seizures. All the EEG signals were sampled at 256 samples per second with 16-bit resolution. The International 10-20 system of EEG channel positions and nomenclature (Jasper, 1958; Oostenveld & Praamstra, 2001) was used.
The International 10-20 system is usually employed to record spontaneous EEG. In this system, 21 channels are located on the surface of the scalp, and three other channels are placed on each side equidistant from the neighboring points (Jasper, 1958) (see Fig. 7.6). In most recordings of this data set 23 channels were used; in some cases 24 or 26 channels were used.
For each channel, we divide the data into epochs of 10 seconds. We then calculate the Maximum (Max), Minimum (Min), Mean, Standard deviation (Std), Kurtosis, Skewness, Entropy, Line length and Energy of each epoch. Hence, from each channel of one hour of data we get 360 records containing nine attributes: Max, Min, Mean, Std, Kurtosis, Skewness, Entropy, Line Length and Energy.
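The feature extraction can be sketched as follows, assuming a 256 Hz signal held in a NumPy array. The entropy estimator here (Shannon entropy over a 20-bin amplitude histogram) is one plausible choice, since the exact estimator is not spelled out above.

import numpy as np
from scipy.stats import kurtosis, skew

FS = 256           # sampling rate (samples per second)
EPOCH = 10 * FS    # one 10-second epoch = 2560 samples

def epoch_features(x):
    """The nine per-epoch attributes for one channel."""
    hist, _ = np.histogram(x, bins=20)
    p = hist[hist > 0] / hist.sum()
    entropy = -np.sum(p * np.log2(p))          # assumed Shannon entropy
    line_length = np.sum(np.abs(np.diff(x)))   # sum of absolute sample-to-sample differences
    energy = np.sum(x ** 2)
    return [x.max(), x.min(), x.mean(), x.std(),
            kurtosis(x), skew(x), entropy, line_length, energy]

def channel_records(signal):
    """One hour at 256 Hz yields 360 epochs, hence 360 records."""
    n = len(signal) // EPOCH
    return np.array([epoch_features(signal[i * EPOCH:(i + 1) * EPOCH])
                     for i in range(n)])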
For example, we prepare one hour of data of one patient (chb01_03), an 11-year-old girl. This recording has 23 channels. Hence, from all 23 channels altogether, we get 360*23 = 8280 records. In this recording the patient experienced a seizure for around 40 seconds (from the 2996th second to the 3036th second). During this period we get 5 records from each channel. These records are considered seizure records and all other records are considered non-seizure records.
Therefore, from the chb01_03 data set we get 23*5 = 115 seizure records and 8165 non-seizure records altogether.
7.4.5 Experimental Results on Brain Data Set
In this section, we experimentally evaluate the performance of CSClust by comparing it with K-
means, K-means++, AGCUK, GAGR and GenClust on a brain data set (Goldberger et al., 2000).
Columns 2 and 3 of Table 7.2 present the clustering results obtained by CSClust and all other
techniques based on Silhouette Coefficient and DB Index.
Columns 2 and 3 of Table 7.2 indicate that CSClust achieves better clustering results than all the other techniques based on both the Silhouette Coefficient and the DB Index. Moreover, CSClust produces the actual number of clusters (see Column 4 in Table 7.2 and Fig. 7.5), whereas GenClust, AGCUK, GAGR, K-means and K-means++ all fail to generate the actual number of clusters on this data set. The original number of clusters in this data set is two (see Fig. 7.1 of Section 7.2). This result indicates the usefulness of the proposed components, including the cleansing and cloning operations, in producing a sensible clustering solution.
Table 7.2: Clustering results of all techniques on the CHB-MIT Scalp EEG (chb01-03) data set
Clustering Techniques | Silhouette Coefficient (higher the better) | DB Index (lower the better) | Number of Clusters
CSClust | 0.77 | 0.29 | 2
GenClust | 0.50 | 0.74 | 468.40
AGCUK | 0.54 | 0.60 | 2.80
GAGR | 0.17 | 1.06 | 45.5
K-means | 0.23 | 0.96 | 39.3
K-means++ | 0.12 | 1.07 | 35.7
7.4.6 Analysis of the Clustering Result obtained by CSClust on the Brain Data set
We now analyze the clustering results obtained by CSClust on CHB-MIT Scalp EEG (chb01-
03) data set in order to explore knowledge from this data set. We plot the clustering result of CSClust in Fig. 7.5. In Fig. 7.5, dots (black) and circles (red) represent Cluster 1 obtained by CSClust: dots (black) represent the records with the class value non-seizure and circles (red) represent the records with the class value seizure. Moreover, circles (green) and dots (green) represent Cluster 2 obtained by CSClust: circles (green) represent the records with the class value seizure and dots (green) represent the records with the class value non-seizure.
Detailed information of the records in Cluster 2 is presented in Table 7.3. The first column
in Table 7.3 shows the channel number and position of the channel on the surface of the scalp
according to the International 10-20 system. Columns 2 and 3 present the number of seizure and non-seizure records in Cluster 2 for every channel.
The number of seizure records in this data set is 115 across the 23 channels. Through our analysis of the original signals of the 23 channels, we find that around 11 of the 23 channels show a seizure signal during the seizure time. Therefore, we consider 11*5 = 55 records as seizure records. However, only around 20-25 of these 55 seizure records can be clearly identified (see Fig. 7.1); the other records overlap with non-seizure records. Each channel has 5 seizure records.
Fig. 7.5: Clustering result of CSClust on CHB-MIT Scalp EEG (chb01-03) data set
Table 7.3: Channel wise number of records in Cluster 2 of CSClust on the CHB-MIT Scalp EEG (chb01-03) data set
Channel (Position) | Number of seizure records | Number of non-seizure records
Channel 1 (FP1-F7) | 2 | 0
Channel 2 (F7-T7) | 1 | 0
Channel 5 (FP1-F3) | 2 | 0
Channel 6 (F3-C3) | 1 | 0
Channel 9 (FP2-F4) | 3 | 0
Channel 10 (F4-C4) | 3 | 0
Channel 11 (C4-P4) | 1 | 0
Channel 12 (P4-O2) | 1 | 0
Channel 13 (FP2-F8) | 3 | 0
Channel 14 (F8-T8) | 3 | 0
Channel 15 (T8-P8) | 3 | 0
Channel 16 (P8-O2) | 2 | 1
Channel 17 (FZ-CZ) | 2 | 2
Channel 18 (CZ-PZ) | 0 | 2
Channel 21 (FT9-FT10) | 3 | 2
Channel 23 (T8-P8) | 3 | 0
In order to visualize the seizure records we plot the records of Channel-5, Channel-13, Channel-17 and Channel-21 in Fig. 7.7 (a), (b), (c) and (d), respectively. Fig. 7.7 (a) to (d) show that in each channel around 2 to 3 of the 5 seizure records are clearly visible. CSClust finds 40 records in Cluster 2, of which 33 are seizure records. It places some non-seizure records in Cluster 2 because they are very similar to seizure records.
In addition, through the analysis of the original EEG signals of the 23 channels during the seizure time, we find that around 11 of the 23 channels show a seizure signal (see Fig. 7.9, Fig. 7.11 and Fig. 7.12), while the other channels show a non-seizure signal during the seizure time (see Fig. 7.10). Fig. 7.10 shows the signal of Channel-7 during the seizure time, but the amplitude of this channel is low (it varies between 200 uV and -200 uV). Fig. 7.8 shows the signal of Channel-5 during the non-seizure time, where the amplitude also varies between 200 uV and -200 uV. Fig. 7.9, Fig. 7.11 and Fig. 7.12 show the signals of Channel-5, Channel-9 and Channel-13, respectively, during the seizure time, where the signals show high amplitude (between 400 uV and -400 uV).
We also find that all 11 channels showing the seizure signal during the seizure time are located in the frontal and temporal lobes of the scalp (see Fig. 7.6), indicating that the seizure for this patient was a localized seizure that originated in the frontal and temporal lobes. Of these 11 channels, 8 are located in the frontal lobe and the other 3 (Channel-15, Channel-16 and Channel-23) are located in the temporal-parietal lobe. Interestingly, CSClust also finds most of the Cluster 2 records in these 11 channels. This again re-confirms the quality of the clustering results obtained by our proposed technique.
Fig. 7.6: Channel positions according to the International 10-20 system (Jasper, 1958; Sharbrough F et al., 1991)
(A = Ear lobe, C = Central, P = Parietal, F = Frontal, Fp = Frontal polar, O = Occipital; the frontal lobe and the temporal-parietal lobe are marked in the figure)
Fig. 7.7: Seizure records on different channels: (a) Channel-5, (b) Channel-9, (c) Channel-13, (d) Channel-21
Fig. 7.8: EEG signals (10 seconds) of channel-5 during the non-seizure time
Fig. 7.9: EEG signals (10 seconds) of channel-5 during the seizure time
Fig. 7.10: EEG signals (10 seconds) of channel-7 during the seizure time
Fig. 7.11: EEG signals (10 seconds) of channel-9 during the seizure time
Fig. 7.12: EEG signals (10 seconds) of channel-13 during the seizure time
7.4.7 Knowledge from Decision Tree on Brain Data set
In this section, we present a number of decision trees to discover logic rules for seizure and non-
seizure records from the CHB-MIT (chb01-03) data set. We apply an existing technique SysFor
(Islam & Giggins, 2011) on this data set. This technique requires the class values of a data set
in order to build a decision forest. We labeled the data set based on the clustering result of
CSClust. CSClust produces two clusters: Cluster 1 and Cluster 2. During the labeling, we
consider the records in Cluster 1 as non-seizure records and records in Cluster 2 as seizure
records. Thus, we get two class values, seizure and non-seizure, in the labeled data set.
SysFor generates four decision trees on the data set, shown in Fig. 7.13 (a) to (d). Typically, a decision tree consists of nodes and leaves. In Fig. 7.13 (a) to (d) the nodes are denoted by rectangles and the leaves by ovals. A path from the root node to a leaf forms a logic rule that represents the relationship between the class attribute and the non-class attributes (Adnan & Islam, 2014).
Fig. 7.13: Decision trees on the CHB-MIT (chb01-03) data set: (a) decision tree 1, (b) decision tree 2, (c) decision tree 3, (d) decision tree 4
Fig. 7.13 (a) shows decision tree 1, where the logic rules for leaf 1 and leaf 2 are "if Std <= 102.44 → Records = Non-seizure" and "if Std > 102.44 → Records = Seizure", respectively. In decision tree 2, shown in Fig. 7.13 (b), the logic rule for leaf 1 is "if Max <= 239.51 → Records = Non-seizure", the logic rule for leaf 2 is "if Max > 239.51 AND Std <= 102.44 → Records = Non-seizure" and the logic rule for leaf 3 is "if Max > 239.51 AND Std > 102.44 → Records = Seizure".
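Written out as code, the rules of decision tree 2 amount to a pair of nested threshold tests. The thresholds are the ones read off the tree above; the function itself is only an illustration.

def classify(record):
    """Apply the logic rules of decision tree 2 (Fig. 7.13 (b))."""
    if record["Max"] <= 239.51:
        return "Non-seizure"   # leaf 1
    if record["Std"] <= 102.44:
        return "Non-seizure"   # leaf 2
    return "Seizure"           # leaf 3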
Moreover, from all the decision trees shown in Fig. 7.13, it can be seen that the standard deviation attribute has the most influence in categorizing seizure and non-seizure records. In addition, the standard deviation of seizure signals is higher than that of non-seizure signals. This matches the logic rule "if Std > 102.44 → Seizure" (see Fig. 7.13) perfectly. Moreover, the data set is labeled based on the clustering results obtained by our technique. If the clustering results were inaccurate, the decision trees produced from the data set would not be
so accurate; here, 12 out of 13 leaves in the trees (see Fig. 7.13) give 100% accurate classification. This re-confirms the sensible clustering solution obtained by our technique.
7.4.8 Experimental Results on 10 Real Life Data sets
In Section 7.4.5, we empirically compare our proposed technique CSClust with K-means
(Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007), AGCUK (Y. Liu et al., 2011), GAGR
(D.-X. Chang et al., 2009) and GenClust (Rahman & Islam, 2014) on a brain data set (CHB-MIT Scalp EEG (Goldberger et al., 2000)). In this section, we experimentally evaluate the
performance of CSClust by comparing it with the five existing techniques on 10 real-life data
sets.
Fig. 7.14: Comparative results between CSClust and other techniques based on Silhouette Coefficient (higher the better)
Fig. 7.15: Comparative results between CSClust and other techniques based on DB Index (lower the better)
Fig. 7.14 shows the average Silhouette Coefficient of the clustering solutions, where CSClust achieves better clustering results than AGCUK in 9 out of 10 data sets based on the Silhouette Coefficient. Moreover, in 10 out of 10 data sets the average DB Index over 10 runs of CSClust is lower (i.e. better) than those of K-means, K-means++, GAGR and GenClust (see Fig. 7.15). The rightmost columns of Fig. 7.14 and Fig. 7.15 show the average Silhouette Coefficient and DB Index of all techniques over all data sets. On average, CSClust achieves a clearly better result than all the other techniques.
7.4.9 An Analysis of the Improvement in Chromosomes over the Iterations
In Fig. 7.16, we present the grand average fitness (in terms of the DB Index, where Fitness = 1/DB) of the best chromosomes of CSClust over the 10 data sets for 10 runs. The grand average fitness is plotted against the iterations. As Fig. 7.16 shows, the fitness of the best chromosome increases steadily over the iterations.
Fig. 7.16: Grand Average fitness versus iteration over the 10 data sets
7.4.10 Statistical Analysis
We now carry out a statistical sign test (D. Mason, 1998; Triola, 2001) in order to evaluate the superiority of the results (Silhouette Coefficient and DB Index) obtained by CSClust over the results obtained by the existing techniques. We observe that the results do not follow a normal distribution, so the conditions for a parametric test are not satisfied. Hence, we perform a non-parametric sign test on the Silhouette Coefficient and DB Index as shown in Fig. 7.17. The first five bars for the Silhouette Coefficient and DB Index in Fig. 7.17 show the z-values (test statistic values) for CSClust against the other techniques, while the sixth bar shows the z-ref value. If a z-value is greater than the z-ref value, then the result obtained by CSClust can be considered significantly better than the results obtained by the existing techniques.
Fig. 7.17: Sign test of CSClust on 11 data sets
We carry out the right-tailed sign test at z > 1.96, p < 0.025 in terms of the Silhouette Coefficient and DB Index. Fig. 7.17 shows that the results of CSClust are significantly better than those of the five existing techniques based on both the Silhouette Coefficient and the DB Index.
7.5 Summary
We realize that many existing clustering techniques do not produce sensible clustering solutions,
although their solutions achieve high fitness values based on existing evaluation criteria. These
solutions are typically not useful in knowledge discovery from underlying data sets. Therefore,
in this chapter, we propose a novel clustering technique called CSClust that first learns important
properties of sensible clustering solutions and then applies the information in producing its
clustering solutions.
We apply some existing techniques on a brain data set and find that their clustering solutions either have far too many clusters or have only two clusters where one cluster contains a single record and all other records belong to the other cluster. The proposed clustering technique overcomes this problem and produces the right number of clusters with the right records in the clusters. From the brain data set, it captures 40 seizure records in one cluster and the non-seizure records in another cluster.
While preparing the data set we consider all records from all channels within the seizure time
period to be labelled as seizure records. Thus, we get 115 seizure records in the prepared brain
data set. However, since it was a localized seizure some channels do not capture the seizure
signals even during the seizure period as evident in Fig. 7.10. Hence, although these records are
labelled as seizure records many of them actually do not exhibit any seizure properties and hence
behave like non-seizure records. This is also evident in Fig. 7.7 where we can see many seizure
records are overlapped with non-seizure records.
Interestingly, CSClust captures only the real seizure records in one cluster and does not capture the fake seizure records in that cluster. This demonstrates the suitability of CSClust for producing sensible clustering solutions for knowledge discovery. We then re-label the records based on our clustering solutions and produce a number of decision trees to discover logic rules for seizure and non-seizure records. The logic rules (such as "if Std > 102.44 → Seizure") obtained by the forest appear to be sensible, further confirming the accuracy of the clustering results obtained by the proposed clustering technique.
The data set from which the decision forest is built is labelled based on the clustering results obtained by our technique. If the clustering results were inaccurate, the decision trees produced from the data set would not be so accurate; here, 12 out of 13 leaves in the trees (see Fig. 7.13) give 100% accurate classification. Moreover, the logic rules obtained by the trees make perfect sense. For example, it is clear from the signals of a non-seizure period (see Fig. 7.8) and a seizure period (see Fig. 7.9) that the standard deviation of seizure signals is higher than that of non-seizure signals. This matches the logic rule "if Std > 102.44 → Seizure" (see Fig. 7.13) perfectly. This re-confirms the high quality of the clustering results obtained by our technique.
We then compare our clustering technique with five existing clustering techniques on 10 data
sets and demonstrate the statistically significant superiority of the proposed technique over the
existing techniques in terms of two evaluation criteria namely Silhouette Coefficient and DB
Index.
However, CSClust also has a number of limitations. CSClust learns the properties of sensible clustering solutions by applying the DB Index to the initial population (see Step 2 in Section 7.3). This approach can be problematic since the selection can be biased by the limitations of the DB Index. Therefore, in the next chapter we propose a clustering technique called HeMI++ as a further improvement of HeMI and CSClust. HeMI++ identifies the properties of a sensible clustering solution without any influence of the DB Index.
In addition, we find that while the existing clustering techniques produce non-sensible clusterings, they achieve better evaluation values based on the existing evaluation criteria. Therefore, a good evaluation technique is also highly desirable in order to distinguish sensible from non-sensible clustering solutions. In the next chapter, we also propose a new cluster evaluation technique which is suitable for evaluating both sensible and non-sensible clustering solutions.
Chapter 8
Application of a Novel GA-based Clustering and Tree
based Validation on a Brain Data Set for Knowledge
Discovery
8.1 Introduction
In this chapter we propose a new clustering technique and an evaluation technique. In the
proposed clustering technique, we combine our previous technique called CSClust (see Chapter
7) with HeMI (see Chapter 6), where we also significantly improve the components of CSClust and HeMI. Therefore, we call the proposed technique HeMI++. In this chapter, we achieve our second and third research goals (producing sensible clustering solutions, and a cluster evaluation technique for the better evaluation of sensible and non-sensible clustering results).
During the PhD candidature, we have published the following paper based on this chapter with PhD
supervisors.
Beg, A. H., Islam, M. Z., and Estivill-Castro. V. (2016): Application of a Novel GA-based Clustering
and Tree based Validation on a Brain Data Set for Knowledge Discovery, Information Systems, (Status:
Under Review). (ERA 2010 Rank A*, SJR 2016 Rank Q1, H Index 64).
We first explore the quality of HeMI and some existing clustering techniques. We also
explore the quality of existing evaluation techniques. In Chapter 7, we find that some existing
techniques do not produce sensible clusters. However, in this chapter, we carefully assess the
clustering quality of the existing techniques and HeMI through cluster visualization.
In order to assess the quality of the existing clustering techniques and cluster evaluation
techniques, we use a brain data set (CHB-MIT Scalp) (Goldberger et al., 2000) as an example
which is available from https://physionet.org/cgi-bin/atm/ATM. We plot the data set so that we
can graphically visualize the clusters (see Fig. 8.1). We know that this data set has two types of
records: seizure and non-seizure. We can also see in the figure (Fig. 8.1) that there are clearly
two clusters of records. We then apply the existing clustering techniques on this data set and
plot their clustering results.
We find that some recent, state-of-the-art clustering techniques such as GAGR (D.-X. Chang et al., 2009) and GenClust (Rahman & Islam, 2014) do not produce sensible clusters. We also find that our technique HeMI (as presented in Chapter 6) does not produce sensible clusters. GenClust produces 447 clusters (see Fig. 8.2), which is not sensible as the actual number of clusters in this data set is supposed to be only two. GAGR produces 56 clusters, as shown in Fig. 8.3. HeMI produces two clusters (see Fig. 8.4), where one cluster contains one record and the other cluster contains all the remaining records. Therefore, a clustering technique that can produce a sensible clustering solution is highly desirable.
We also evaluate the clustering quality of the existing techniques based on the internal and
external evaluation criteria (see Table 8.1). While the existing clustering techniques produce non-sensible clusterings, they achieve better evaluation values (compared to a sensible clustering solution) based on the existing evaluation criteria. Therefore, a good evaluation technique is also highly desirable in order to distinguish sensible from non-sensible clustering solutions.
In the proposed clustering technique HeMI++, we introduce a new cleansing and cloning operation that helps to produce a sensible clustering solution. We now briefly introduce the novel components/properties of HeMI++ and their logical justifications as follows.
The central component of HeMI++ is a cleansing operation in each generation that ensures all chromosomes in a population carry a sensible solution. Through our empirical analysis of the existing techniques (see Section 8.2.1) we find that although they produce clustering solutions with better fitness values, they often end up producing non-sensible clustering results.
Therefore, we introduce a cleansing operation that applies two conditions: (i) the number of clusters must be within the range of a minimum and a maximum number of clusters, which are learned by HeMI++ from some properties of a data set, and (ii) the number of records in each cluster must be greater than a threshold minimum number of records, which is again data driven (i.e. not user defined). HeMI++ uses the initial population in order to learn the range of a minimum and a maximum number of clusters and the threshold minimum number of records.
Another important component of HeMI++ is the initial population selection, targeting high-quality chromosomes and better use of the initial population. It produces a high-quality initial population using the same approach as HeMI (see Section 6.2.2 in Chapter 6). HeMI++ stores the information of all the chromosomes that it generates in the initial population. It then learns the necessary properties of a sensible clustering solution for a data set from this initial population, without requiring any user input.
Another interesting idea associated with HeMI++ is the cloning operation that replaces sick chromosomes in each generation/population. In each population, the cleansing operation identifies the sick chromosomes, which are then replaced by chromosomes drawn from a pool of healthy chromosomes found in the initial population. The pool of high-quality chromosomes created for the initial population is expected to contain reasonably healthy chromosomes due to the repeated use of K-means.
Through our empirical analysis of the existing cluster evaluation techniques (see Section 8.2.1) we also observe that they produce inaccurate evaluation values. Sometimes they produce higher evaluation values for non-sensible clustering solutions and lower evaluation values for sensible clustering solutions. Sometimes they produce high evaluation values for both sensible and non-sensible clustering solutions, which is not useful for measuring clustering quality.
Therefore, we also propose a new evaluation technique called Tree Index where we first label
a data set based on the clustering solutions and produce a decision tree (Quinlan, 1993, 1996).
We then calculate the entropy (Pang-Ning Tan, Michael Steinbach, 2005) for each leaf (i.e. the
entropy of the distribution of class values within the leaf) and learn the depth of the leaf in the
tree.
Based on the entropy and depth of a leaf, for all leaves, we then compute an evaluation value
that represents the clustering quality. The basic idea here is the fact that if a clustering result is
good then the labels assigned to the records based on the clustering result should lead to a
decision tree having homogeneous leaves (i.e. low entropy) with small depth in general. On the
other hand if the clustering result is bad, then the resulting tree should struggle to find a pattern
which will be reflected by heterogeneous leaves (i.e. high entropy) with high depth overall.
Imagine an extreme example where the labels are assigned completely randomly (i.e. a very bad
quality clustering) then it will be almost impossible for a tree to find any suitable pattern and
the leaves are likely to be very heterogeneous and perhaps deep.
We evaluate our technique HeMI++ by comparing its performance with the performance of
five existing techniques, namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014), K-means (Lloyd, 1982) and K-means++ (Arthur & Vassilvitskii, 2007). The existing techniques are recent, high-quality and better than many other
techniques as shown in the literature (Sanghamitra Bandyopadhyay & Maulik, 2001, 2002; Lai,
2005; Lin et al., 2005; Murthy & Chowdhury, 1996).
We also compare HeMI++ with HeMI (see Chapter 6). We conduct experiments with the
techniques on 21 real-life data sets that are publicly available in the UCI machine learning
repository (M. Lichman, 2013). The experimental results clearly indicate that HeMI++ performs
better than the existing techniques in terms of our new cluster evaluation technique. The validity
of our new cluster evaluation technique is assessed by applying it on the clustering results
(obtained by various techniques), some sensible and some non-sensible (see Section 8.2.2 and
Section 8.3.5).
We first apply all the techniques on a brain data set (CHB-MIT Scalp data set, see Section
8.3.4), in order to test their suitability in discovering knowledge from a data set. The empirical
analysis presented in Section 8.3.5 indicates that HeMI++ produces a sensible clustering solution which is suitable for deriving knowledge from a data set, whereas all the other techniques fail to generate sensible clustering solutions, making their results unsuitable for discovering knowledge from a data set.
The main contributions of this chapter can be summarized as follows.
• Selection of important properties of sensible clustering solutions (see Component 4 of Section 8.2.3);
• Application of the sensible properties in producing clustering solutions (see Components 10 and 11 of Section 8.2.3);
• A new cluster evaluation technique called Tree Index (see Section 8.2.5);
• Validation of the effectiveness of Tree Index by analyzing it on some ground truth clustering results, which are also graphically visualized (see Section 8.2.1, Section 8.2.2 and Section 8.3.5);
• Demonstration of the effectiveness of the proposed technique through knowledge discovery from a brain data set (see Section 8.3.6 and Section 8.3.8);
• Extensive evaluation on 21 publicly available data sets (see Section 8.3.7).
The rest of the chapter is organized as follows. The proposed technique is described in Section 8.2. In Section 8.3, we discuss the experimental results, and the summary of the chapter is presented in Section 8.4.
8.2 Our Technique
8.2.1 Basic Concepts of Our Clustering Technique HeMI++
In this section, we discuss the basic concepts behind our proposed clustering technique,
HeMI++. We first explore some sensible and non-sensible clustering solutions, and their
evaluations made by various existing evaluation techniques/metrics. We use a brain data set
called CHB-MIT Scalp EEG (Goldberger et al., 2000). Fig. 8.1 shows the structure of the data
set where it has two clusters: seizure and non-seizure.
The original data set (Goldberger et al., 2000) contains 9 non-class attributes and a class
attribute having two possible values: seizure and non-seizure. We pick three non-class attributes (max, min and standard deviation) so that the records can be plotted in three dimensions (see Fig. 8.1). Throughout the chapter we use these three attributes for this data set. The class values of the records are indicated by the shapes: dots represent non-seizure records and plus signs represent seizure records. From Fig. 8.1, we can clearly see the existence of two clusters, one containing mainly the seizure records and the other mainly the non-seizure records.
Fig. 8.1: The three dimensional CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.2: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.2 shows a non-sensible clustering result produced by GenClust on the brain data set. It appears to be a non-sensible clustering solution since it contains 477 clusters, whereas the actual number of clusters in this data set is only two (seizure and non-seizure), as shown in Fig. 8.1. The clusters are plotted with 477 shapes such as dots, plus signs and triangles (see Fig. 8.2). Note that while clustering the records only the non-class attributes are used; the class attribute is not used.
GenClust uses a fitness function called COSEC (see Eq. 8.3) to evaluate the fitness of a chromosome. It calculates the compactness Comp_j (see Eq. 8.1) and the separation Sep_j (see Eq. 8.2) of a cluster C_j, where |C_j| is the number of records belonging to the cluster C_j, s_j is the seed of C_j and x_a is a record of cluster C_j.

\mathrm{Comp}_j = \frac{1}{|C_j|} \sum_{x_a \in C_j} dist(x_a, s_j)    (Eq. 8.1)

\mathrm{Sep}_j = \min_{k \neq j} \{ d(s_j, s_k) \}    (Eq. 8.2)

\mathrm{Fitness} = \sum_{\forall j} \left( \mathrm{Sep}_j - \mathrm{Comp}_j \right)    (Eq. 8.3)
The COSEC value of a chromosome increases when the number of clusters in the chromosome increases. Therefore, due to its use of COSEC, GenClust tends to obtain clustering solutions with a large number of clusters.
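A small sketch of this fitness function follows, assuming the records, labels and seeds are NumPy arrays and that there are at least two non-empty clusters; the names are illustrative.

import numpy as np

def cosec_fitness(records, labels, seeds):
    """COSEC (Eqs. 8.1-8.3): sum over clusters of separation minus compactness.
    Assumes k >= 2 and every cluster non-empty."""
    k = len(seeds)
    fitness = 0.0
    for j in range(k):
        members = records[labels == j]
        comp = np.linalg.norm(members - seeds[j], axis=1).mean()      # Eq. 8.1
        sep = min(np.linalg.norm(seeds[j] - seeds[i])                 # Eq. 8.2
                  for i in range(k) if i != j)
        fitness += sep - comp                                         # Eq. 8.3
    return fitness

Because every additional cluster contributes another positive (separation − compactness) term to the sum, the value tends to grow with the number of clusters, which is the bias noted above.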
Although GenClust produces a non-sensible clustering solution, it surprisingly achieves
higher evaluation values based on the existing cluster evaluation techniques/metrics namely F-
measure, Purity, Silhouette Coefficient, XB Index and DB Index as shown in Table 8.1. Shaded
cells represent the best evaluation values among the techniques.
Fig. 8.3 shows another non-sensible clustering result, which is obtained by GAGR on the brain data set. GAGR generates 56 clusters, which is also not sensible as the original number of clusters in this data set is only two. GAGR uses SSE (see Eq. 8.4) as its fitness function, where k stands for the number of clusters and dist(s_j, x) denotes the distance between a record x and the seed s_j of cluster C_j. In GAGR, the fitness of a chromosome is computed as 1/SSE.
Fig. 8.3: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set
SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(s_j, x)^2    (Eq. 8.4)
In GAGR, as in GenClust, the fitness of a chromosome increases when the number of clusters in the chromosome increases. Accordingly, GAGR tends to generate clustering solutions with a high number of clusters. Interestingly, GAGR also achieves good evaluation values based on the existing cluster evaluation techniques, as shown in Table 8.1.
Fig. 8.4 shows another non-sensible clustering result, which is obtained by HeMI on the brain data set. HeMI generates two clusters, but one cluster contains only one record and all other records belong to the other cluster, which is also not sensible. However, it too achieves good evaluation values based on F-measure, Purity, Silhouette Coefficient, XB Index and DB Index, as shown in Table 8.1.
Table 8.1: Some sensible and non-sensible clustering solutions and their evaluation values based on the existing cluster evaluation metrics
Clustering | F-measure (higher the better) | Purity (higher the better) | Silhouette Coefficient (higher the better) | XB Index (lower the better) | SSE (lower the better) | DB Index (lower the better)
Non-sensible: GenClust | 0.99 | 0.99 | 0.50 | 0.25 | 65.68 | 0.78
Non-sensible: HeMI | 0.83 | 0.71 | 0.89 | 0.27 | 2441.59 | 0.13
Non-sensible: GAGR | 0.99 | 0.98 | 0.13 | 1.03 | 345.66 | 1.55
Sensible Clustering | 0.99 | 0.98 | 0.74 | 0.26 | 1949.67 | 0.33
Fig. 8.4: Clustering result of HeMI on the CHB-MIT Scalp EEG (chb01-03) data set
In Fig. 8.4, dots and plus signs represent Cluster 1 obtained by HeMI: dots represent the records with the class value non-seizure, and plus signs represent the records with the class value seizure. The circles represent Cluster 2 obtained by HeMI: circles represent the records with the class value seizure, and triangles would represent records with the class value non-seizure. There are no triangles in this figure, meaning that Cluster 2 contains no non-seizure records.
HeMI uses the DB Index (see Eq. 8.8) to calculate the fitness of a chromosome. The DB Index is a function of the ratio of the sum of within-cluster scatter to between-cluster separation. If s_j is the seed of the j-th cluster c_j, then the scatter n_{j,q} is calculated as follows.

n_{j,q} = \left( \frac{1}{|c_j|} \sum_{x \in c_j} \| x - s_j \|_2^{\,q} \right)^{1/q}    (Eq. 8.5)

If s_i is the seed of the i-th cluster c_i and s_j is the seed of the j-th cluster c_j, then the distance between them is

d_{ij,t} = \| s_i - s_j \|_t    (Eq. 8.6)

The Davies-Bouldin (DB) Index of K clusters is computed as follows.

R_{i,qt} = \max_{j,\, j \neq i} \left\{ \frac{n_{i,q} + n_{j,q}}{d_{ij,t}} \right\}    (Eq. 8.7)

DB = \frac{1}{K} \sum_{i=1}^{K} R_{i,qt}    (Eq. 8.8)
Fig. 8.5 shows an example of a sensible clustering solution with two clusters. Cluster 1 contains mostly the non-seizure records and Cluster 2 contains mostly the seizure records.
Therefore, our proposed clustering technique HeMI++ aims to produce a clustering solution like the sensible one shown in Fig. 8.5. In doing so, we first observe that HeMI produces the actual number of clusters, as shown in Fig. 8.4, but one cluster contains only one record and all other records belong to the other cluster. If we can handle this problem, then we will be able to produce a sensible clustering result using HeMI. Therefore, we introduce a cleansing and cloning operation aiming to ensure that all chromosomes in a population have a sensible solution.
Fig. 8.5: A sensible clustering result on the CHB-MIT Scalp EEG (chb01-03) data set
We apply a cleansing operation at the end of each iteration; based on the threshold values, it identifies the sick chromosomes that produce bad clustering solutions. We then replace the sick chromosomes, through our cloning operation, with chromosomes drawn from a pool of high-quality chromosomes found in the initial population.
8.2.2 Basic Concepts of Our Cluster Evaluation Technique Tree Index
From our empirical analysis in the previous section, we observe that the existing evaluation
techniques fail to produce accurate evaluation values to identify sensible and non-sensible
clustering solutions (see Table 8.1). Therefore, we realize that a good evaluation technique is
required.
Therefore, we propose a new cluster evaluation technique called Tree Index that can identify sensible and non-sensible clustering solutions. Table 8.2 shows the evaluation of some sensible and non-sensible clustering solutions based on Tree Index. From the evaluation results shown in Table 8.2 it is clear that the proposed evaluation technique is able to distinguish sensible from non-sensible clustering solutions. It produces a good evaluation value for the sensible clustering solution (shown in Fig. 8.5) and bad evaluation values for the non-sensible clustering solutions.
Table 8.2: Evaluation values of some sensible and non-sensible clustering solutions based on Tree Index
Clustering Techniques | Tree Index (lower the better)
GenClust | 5.36
HeMI | ∞
GAGR | 15.36
Sensible Clustering | 0.14
8.2.3 Main Components of HeMI++
We first mention the main steps of HeMI++ as follows and then explain each of them in detail.
Note that Components 4 and 10 are new contributions of HeMI++.
BEGIN
Step 1: Normalization
DO: k = 1 to m /* m is the user defined number of streams */
    Step 2: Population Initialization
END
Step 3: Selection of Sensible Properties
DO: j = 1 to G /* G is the user defined number of intervals */
    DO: k = 1 to m /* m is the user defined number of streams */
        DO: t = 1 to I /* I = 10; I is the user defined number of iterations */
            Step 4: Noise-Based Selection
            Step 5: Crossover Operation
            Step 6: Twin Removal
            Step 7: Three Steps Mutation Operation
            Step 8: Health Improvement Operation
            Step 9: Cleansing Operation
            Step 10: Cloning Operation
            Step 11: The Elitist Operation
        END
    END
    Step 12: Neighbor Information Sharing
END
Step 13: Global Best Selection
END
Component 1: Normalization
HeMI++ takes a data set 𝐷 as input. It first normalizes the data set 𝐷 in order to weigh each
attribute equally regardless of their domain sizes. For normalization, HeMI++ uses the same
approach of normalization of HeMI (see Section 6.2.2 of Chapter 6).
Component 2: Multiple Stream
HeMI++ uses the multiple stream approach of HeMI (see Section 6.2.2 of Chapter 6) in order to take advantage of a big population through multiple streams, where each stream contains a relatively small number of chromosomes.
Component 3: Population Initialization
For the population initialization, HeMI++ uses the same approach that we used in HeMI (see Section 6.2.2 of Chapter 6). HeMI++ selects |P| chromosomes for the initial population: |P|/2 from the deterministic phase and |P|/2 from the random phase. In the experiments of this chapter, we use |P| = 20. HeMI++ uses the Davies-Bouldin (DB) Index (D. L. Davies & Bouldin, 1979) to calculate the fitness of a chromosome.
Component 4: Selection of Sensible Properties
This is an original contribution of HeMI++ that makes better use of the initial population for finding the necessary properties of a sensible clustering solution. HeMI++ selects the |P| top chromosomes from the generated initial chromosomes based on their fitness (DB value). However, the DB Index is biased towards a low number of clusters and a low number of records in a cluster (see Section 8.2.1). Therefore, HeMI++ finds the necessary properties of a sensible clustering solution from all the generated initial chromosomes rather than only the top ones. The properties of the sensible clustering solution are then used in each generation in order to ensure that the chromosomes in a population do not contradict the properties.
In the initial population phase, HeMI++ produces 90×4 = 360 chromosomes, as each of its 4 streams generates 90 chromosomes. HeMI++ finds the minimum number of records in a cluster for each of the 360 chromosomes. It then sorts these numbers in descending order and calculates their median. The median value is then used as the property (of a sensible clustering solution) relating to the minimum number of records in a cluster. HeMI++ similarly finds the minimum and maximum number of clusters in a clustering solution based on the 360 chromosomes. These values are then used in the cleansing operation (see Component 10) in order to identify a sensible clustering solution.
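A sketch of this property selection follows. The median rule for Mr is as stated above; how exactly "similarly" applies to Mn and Mx is not spelled out, so the sketch simply takes the extremes of the observed cluster counts, which should be read as an assumption.

import statistics

def sensible_properties(cluster_sizes_per_chromosome):
    """cluster_sizes_per_chromosome: one list of cluster sizes per initial
    chromosome, e.g. 360 lists such as [120, 95, 40]."""
    Mr = statistics.median(min(sizes) for sizes in cluster_sizes_per_chromosome)
    counts = [len(sizes) for sizes in cluster_sizes_per_chromosome]
    Mn, Mx = min(counts), max(counts)   # assumed reading of "similarly"
    return Mn, Mx, Mr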
Note that CSClust (see Step 3 of Section 7.3 in Chapter 7) also uses a similar component. However, there are some significant differences. First, CSClust does not use multiple streams, and hence it finds the properties based on the 90 chromosomes of its single stream. Second, in order to find the minimum number of records in a cluster, CSClust identifies the best 20 chromosomes according to their DB Index values and then picks the minimum number of records in a cluster out of all the clusters in these 20 chromosomes. Since the best 20 chromosomes are selected based on their DB Index values, CSClust suffers from the drawbacks of the DB Index in identifying the properties of a sensible clustering solution.
Component 5: Noise-based Selection
At the beginning of each generation starting from Generation 2, we carry out the Noise Based
Selection (Y. Liu et al., 2011) in order to get a new population for subsequent genetic operations
such as crossover and health improvement. HeMI++ uses the same approach of noise-based
selection that we used in DeRanClust (see Section 3.2 in Chapter 3).
Algorithm 8.1: HeMI++
Input: A data set D having n records and |A| attributes, where A is the set of attributes
Output: A set of clusters C
Require:
Pd ← ∅ /* Pd is the set of deterministic chromosomes (45 chromosomes), initially empty */
Pr ← ∅ /* Pr is the set of random chromosomes (45 chromosomes), initially empty */
Step 1: /* Normalization */
D′ ← Normalize(D) /* normalize each attribute of the data set D into the normalized data set D′ */
for k = 1 to m do /* default m = 4; m is the user defined number of streams and k is the counter of m */
    Step 2: /* Population Initialization */
    Pd ← GenerateDeterministicChromosomes(D′) /* generate Pd through K-means; the initial seeds for K-means are chosen deterministically */
    Pr ← GenerateRandomChromosomes(D′) /* generate Pr randomly; the number of initial seeds and the seeds themselves are chosen randomly */
    Px ← SelectDeterministicChromosomes(Pd) /* select 10 chromosomes (50% of the initial population) based on fitness */
    Py ← SelectRandomChromosomes(Pr) /* select 10 chromosomes (50% of the initial population) based on fitness */
    Ps ← Ps ∪ (Px ∪ Py) /* Ps is the initial population (20 chromosomes) */
    Pb ← FindBestChromosome(Ps) /* Pb is the chromosome with the maximum fitness value in Ps */
end
Step 3: /* Selection of Sensible Properties */
(Mn, Mx, Mr) ← FindSensibleProperties(Pr, Pd) /* find the minimum number of clusters (Mn), the maximum number of clusters (Mx) and the minimum number of records in a cluster (Mr) */
for j = 1 to G do /* default G = 5; G is the number of intervals of the total number of iterations and j is the counter of G */
    for k = 1 to m do /* default m = 4; m is the user defined number of streams and k is the counter of m */
        for t = 1 to I do /* default I = 10; I is the user defined number of iterations per interval and t is the counter of I */
            Step 4: /* Noise-Based Selection */
            if t > 1 then
                Ps ← NoiseBasedSelection(Ps^t, Ps^(t+1)) /* perform noise based selection between the current (Ps^(t+1)) and previous (Ps^t) generation */
            end
            Step 5: /* Crossover Operation */
            Po ← PerformCrossover(Ps) /* perform single point crossover on Ps and get a set of offspring chromosomes Po */
            Step 6: /* Twin Removal */
            Po ← TwinRemoval(Po) /* perform twin removal on Po and get a set of chromosomes Po */
            Step 7: /* Three Steps Mutation Operation */
            Pm ← PerformMutationOperation(Po) /* perform the three steps mutation operation on Po and get a set of mutated chromosomes Pm */
            Step 8: /* Health Improvement Operation */
            Pc ← PerformHealthImprovementOperation(Pm) /* perform the three steps health improvement operation on Pm and get healthy chromosomes Pc */
            Step 9: /* Cleansing Operation */
            Sc ← FindSickChromosomes(Pc, Mn, Mx, Mr) /* find sick chromosomes Sc in Pc based on Mn, Mx and Mr */
            Pc ← Pc − Sc /* remove the Sc chromosomes from Pc */
            Step 10: /* Cloning Operation */
            while |Sc| > 0 do
                Hc ← CloningOperation(Pd) /* replace a sick chromosome from Pd and get a set of healthy chromosomes Hc */
            end
            Pc ← Pc ∪ Hc /* insert Hc into Pc */
            Step 11: /* Elitist Operation */
            Pb^k ← ElitistOperation(Pc, Pb) /* apply the elitist operation on Pc and Pb and find the best chromosome Pb^k */
            Pg ← Pg ∪ {Pb^k} /* insert Pb^k into Pg; Pg is the set of chromosomes containing the best chromosome of each stream */
        end
    end
    Step 12: /* Neighbor Information Sharing */
    for k = 1 to m do
        Pk ← NeighborInformationSharing(Pc, Pg) /* apply neighbor information sharing on Pc and get updated chromosomes Pk */
        Pb^k ← FindStreamBestChromosome(Pk) /* find the best chromosome of the stream */
        Sb ← Sb ∪ {Pb^k} /* insert Pb^k into Sb; Sb is the set of chromosomes containing the best chromosome of each stream */
    end
end
Step 13: /* Global Best Selection */
C ← FindGlobalBestChromosome(Sb) /* find the global best chromosome C from Sb */
Return C
Component 6: Crossover Operation
All chromosomes in a population participate in crossover pair by pair. To perform crossover,
HeMI++ uses the same approach of crossover that we used in HeMI (see Section 6.2.2 of
Chapter 6).
Component 7: Twin Removal
HeMI++ uses the twin removal operation in order to remove/modify twin genes (if any) from
each chromosome. For twin removal, HeMI++ uses the same approach of twin removal of
DeRanClust (see Section 3.2 of Chapter 3).
Component 8: Three Steps Mutation Operation
HeMI++ uses three steps of mutation operations: division, absorption and a random change.
The mutation operation of HeMI++ is the same as the mutation operation of HeMI (see Section
6.2.2 of Chapter 6).
Component 9: Health Improvement Operation
This component aims to continuously improve the health of chromosomes within a population
in order to ensure the presence of healthy (high-quality) chromosomes in each generation.
HeMI++ uses the same approach of health improvement operation of HeMI (see Section 6.2.2
of Chapter 6).
Component 10: Cleansing Operation
This is an original contribution of HeMI++. The aim of this component is to identify the chromosomes in a population with sensible and non-sensible solutions. HeMI++ first learns the necessary properties [the minimum (Mn) and maximum (Mx) number of clusters, and the minimum number of records (Mr) in a cluster] of a sensible clustering solution through Component 4. HeMI++ then applies the cleansing operation to each chromosome using the same approach of cleansing operation that we used in CSClust (see Section 7.3 of Chapter 7).
Component 11: Cloning Operation
The cloning operation replaces the sick chromosomes found in the cleansing operation. To
replace a sick chromosome, HeMI++ uses the same approach of cloning operation that we used
in CSClust (see Section 7.3 in Chapter 7).
Component 12: The Elitist Operation
The Elitist operation keeps track of the best chromosome throughout the generations in order to
ensure the continuous improvement of the quality of the best chromosome found so far over the
iterations. For finding the best chromosome, HeMI++ uses the same approach of elitist operation
that we used in DeRanClust (see Section 3.2 of Chapter 3).
Component 13: Neighbor Information Sharing
HeMI++ uses the Neighbor Information Sharing component of HeMI (see Section 6.2.2 of Chapter 6) in order to share/exchange the best chromosome among neighboring streams at a regular interval, such as every 10th iteration. The main idea of this operation is to give a stream a chance to borrow a good chromosome from its neighboring streams after every 10 iterations. Thus, streams can share their good chromosomes and help each other.
Component 14: Global Best Selection
HeMI++ uses this component in order to find the global best chromosome among the multiple streams. At the end of all iterations, each stream holds its best chromosome. HeMI++ compares the best chromosomes of all streams and selects the best of them as the final clustering solution. Each gene of the best chromosome represents the seed/centroid of a cluster, and records are allocated to their closest seeds to form the final clusters.
8.2.4 The HeMI++ Algorithm
We now present the HeMI++ algorithm, which is shown in Algorithm 8.1. HeMI++ first takes
a data set 𝐷 as an input and normalizes all attributes separately as explained in Component 1. It
then takes the user defined number of multiple streams as explained in Component 2. The default
number of multiple streams in HeMI++ is set to 4.
HeMI++ then produces initial chromosomes for each stream separately through the
Population Initialization as explained in Component 3 (see Step 2 of Algorithm 8.1). It then applies its proposed component, the Selection of Sensible Properties (see Component 4 and Step 3 of Algorithm 8.1), in order to find the necessary properties of a sensible clustering solution.
HeMI++ applies the noise-based selection operation from the 2nd iteration as explained in
Component 5 (see Step 4 of Algorithm 8.1).
HeMI++ then sequentially applies the Crossover, Twin Removal, Mutation and Health Improvement operations. All these operations are explained above (see Component 6 to Component 9, and Step 5 to Step 8 in Algorithm 8.1). HeMI++ applies the cleansing and cloning operations (Components 10 and 11, and Steps 9 and 10 of Algorithm 8.1) in order to increase the chance that all chromosomes in a population do not contradict the properties of a sensible solution.
It then performs the Elitist operation (see Component 12 and Step 11 of Algorithm 8.1) to
find the best chromosome. In order to take advantage of the multiple streams, HeMI++ then
applies the neighbor information sharing component (see Component 13 and Step 12 of
Algorithm 8.1) at a regular interval. In this study, the default value of the interval is 10 iterations.
At the end of all iterations, HeMI++ applies the Global Best Selection (see Component 14 and
see Step 13 of Algorithm 8.1) operation in order to find the final clustering solution.
8.2.5 Our Cluster Evaluation Technique (Tree Index)
We now propose a new cluster evaluation technique called Tree Index which is able to better
evaluate clustering solutions than conventional cluster evaluation metrics. The steps of the
proposed cluster evaluation technique are as follows.
Step 1. The proposed cluster evaluation technique first labels a data set based on the clustering result that it wants to evaluate. For example, if a clustering technique generates a clustering result with three clusters, then Tree Index labels the data set considering the three clusters as three different class values.
Step 2. It then builds a decision tree on the labelled data set to classify the records based on
their labels. It can use any existing decision tree algorithm. In this study we have used C4.5
(Quinlan, 1993, 1996).
Step 3. Tree Index then finds the entropy (Pang-Ning Tan, Michael Steinbach, 2005) of each leaf of the tree. The entropy is a well-known measure of the level of uncertainty in a distribution.
Step 4. It then finds the depth of each leaf of the tree. Typically, a tree having a lower depth
represents a higher agreement between the class labels and corresponding records of a data set
than a tree with a higher depth.
Step 5. It then computes the evaluation value E as follows.

E = \frac{\sum_{i=1}^{l} E_i \times k_i}{|c|}, \qquad k_i = \begin{cases} d_i & \text{if } d_i > 0 \\ \infty & \text{if } d_i = 0 \end{cases}    (Eq. 8.9)

where E_i is the entropy of the i-th leaf, l is the number of leaves, |c| is the number of possible class values (which is the same as the number of clusters), and d_i is the depth of the i-th leaf. The value of k_i is d_i when d_i is greater than 0, and ∞ when d_i is 0. A depth d_i = 0 means that the tree has a single leaf with depth zero; that is, the root node itself is the only leaf. This means that a tree has not been built, indicating that there is no strong pattern in the data set. This can happen when the records are labelled incorrectly, meaning that the clustering results are of poor quality. On the other hand, a good clustering will result in a good labeling of the records, which will then build a shallow tree with homogeneous leaves (zero entropy). This will obtain a very low E value in Eq. 8.9.
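Eq. 8.9 reduces to a few lines of code. The sketch below assumes the tree's leaves have already been extracted as (entropy, depth) pairs; this representation is illustrative.

def tree_index(leaves, num_classes):
    """Tree Index (Eq. 8.9). leaves: list of (entropy, depth) pairs;
    num_classes: |c|, the number of class values (= number of clusters)."""
    total = 0.0
    for entropy, depth in leaves:
        if depth == 0:                 # root is the only leaf: no pattern found
            return float("inf")
        total += entropy * depth       # E_i * k_i
    return total / num_classes

A single-leaf tree (depth 0) yields ∞, matching the HeMI entry in Table 8.2, while shallow homogeneous leaves drive the value towards zero, as for the sensible solution (0.14).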
8.3 Experimental Results and Discussion
8.3.1 The Data Sets and the Evaluation Criteria
We empirically compare the performance of our proposed technique HeMI++ with five existing
techniques namely AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust
(Rahman & Islam, 2014), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007),
and our technique HeMI (see Chapter 6) on a brain data set (CHB-MIT data set) (Goldberger et
al., 2000). We also compare the performance of HeMI++ with the existing techniques on 21
natural data sets that are available in the UCI machine learning repository (M. Lichman, 2013).
Detailed information about the data sets is presented in Table 8.3. We choose a wide variety of data sets. For example, some data sets (such as Glass Identification) have only numerical attributes and some data sets (such as Credit Approval) have both numerical and categorical attributes. We mostly choose data sets with numerical attributes because the techniques (except HeMI, HeMI++ and GenClust) that we use in this study can only handle numerical attributes.
Some data sets have a low number of attributes and some have a high number of attributes. For example, the Blood Transfusion (BT) data set has only 4 attributes while the Hepatitis (HT) data set has 19 attributes. Similarly, some data sets have a low number of records, such as the Zoo data set with only 101 records, and some have a relatively high number of records, such as the Chess (King-Rook vs. King) (CKRK) data set with 28,056 records. In addition, some data sets have a low number of class values, such as Vertebral Column (VC) with only two class values, and some have a high number of class values, such as Leaf (LF) with 36 class values.
Table 8.3: A brief description of the data sets
Data set No. of Records
with missing
No. of Records
without missing
No. of numerical
attributes
No. of categorical
attributes
Class size
Zoo 101 101 1 18 7
Hepatitis (HT) 155 80 6 13 2
Glass Identification (GI) 214 214 10 0 7
Statlog Heart (STH) 270 270 6 7 2
Vertebral Column (VC) 310 310 6 0 2
Ecoli (EC) 336 336 8 0 8
Leaf (LF) 340 340 16 0 36
Liver Disorder (LD) 345 345 6 0 2
Credit Approval (CA) 690 653 6 9 2
Breast Cancer Wisconsin Original (WBC) 699 683 10 0 2
Blood Transfusion (BT) 748 748 4 0 2
Pima Indian Diabetes (PID) 768 768 8 0 2
Statlog Vehicle Silhouettes (SV) 846 846 18 0 4
Bank Note Authentication (BN) 1,372 1,372 4 0 2
Contraceptive Method Choice (CMC) 1,473 1,473 2 7 3
Yeast (YT) 1,484 1,484 8 0 10
Image Segmentation (IS) 2,310 2,310 18 0 7
Wine Quality (WQ) 4,898 4,898 11 0 7
Page Blocks Classification (PBC) 5,473 5,473 10 0 5
MAGIC Gamma Telescope (MGT) 19,020 19,020 11 0 2
Chess (King-Rook vs. King) (CKRK) 28,056 28,056 3 3 18
Class values are the labels of records that represent an important property of a data set.
Typically the data set, for which a clustering technique is used, does not have class values for
its records. Therefore, before applying a clustering technique on a data set we remove the class
attribute.
Some of the data sets contain missing values; that is, some attribute values of some records are missing. We delete all records having any missing values. For example, the Credit Approval (CA) data set has altogether 690 records, but 37 of them have one or more missing values. Hence, after deleting these 37 records the data set has 653 records without any missing values. We evaluate and compare the techniques based on the proposed evaluation technique Tree Index, where a lower evaluation value represents a better clustering result.
8.3.2 The Parameters used in the Experiments
In the experiments on AGCUK, GAGR, GenClust, HeMI and HeMI++ the population size is
set to 20 and the number of iterations/generations is set to 50. In order to ensure a fair
comparison among the techniques we maintain this consistency. The number of iterations in K-means and K-means++ is set to 50, and the number of iterations of K-means within GenClust is also set to 50. The cluster number 𝑘 in GAGR, K-means and K-means++ is user defined. Hence, to simulate a user-defined 𝑘, in this study we generate 𝑘 randomly (for GAGR, K-means and K-means++) in the range 2 to √𝑛, where 𝑛 is the number of records in a data set. The threshold value for K-means is set to 0.005.
The value of 𝑟𝑚𝑎𝑥 and 𝑟𝑚𝑖𝑛 in AGCUK and HeMI are set to 1 and 0 respectively. For our
proposed cluster evaluation technique we need to build a decision tree from a data set where
records are labelled based on the clustering result that is being evaluated. While building the decision tree we need to assign a minimum number of records for each leaf. In this study we assign 1% of the records of a data set, as long as it stays within the range 2 to 15. If 1% of the records is less than 2 then we assign 2, and if 1% of the records is more than 15 then we assign 15.
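This clamping rule is simple enough to state as a one-liner; a small sketch follows (the function name is ours).

```python
def min_leaf_records(n):
    """Minimum number of records per leaf: 1% of the data set,
    clamped to the range [2, 15] as described above."""
    return min(15, max(2, round(0.01 * n)))

print(min_leaf_records(101))    # small data set (Zoo)  -> 2
print(min_leaf_records(28056))  # large data set (CKRK) -> 15
```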
8.3.3 The Experimental Setup
On each data set, we run HeMI++ 20 times, since it can give different clustering results in
different runs. We also run all the other techniques (GenClust, GAGR, AGCUK, K-means, K-means++ and HeMI) 20 times each. We then present the average clustering results. We understand
that clustering techniques are generally applied on unlabelled data sets that do not have any class
attribute. Hence, the class attributes are removed before clustering them. They are again used
for an external evaluation of the clustering results.
8.3.4 Brain Data set Pre-processing
Before experimentation, we first prepare the CHB-MIT Scalp EEG data set (Goldberger et al.,
2000). For the data pre-processing HeMI++ uses the same approach of data pre-processing that
we used in CSClust (see Section 7.4.4 of Chapter 7). In HeMI++, we prepare one hour of data of one patient (chb01_03), an 11-year-old girl. This data set has the recordings of 23 channels, each channel contributing 360 records; hence, from all 23 channels altogether we get 360×23 = 8280 records. In this data set the patient experienced a seizure for around 40 seconds (from the 2996th second to the 3036th second). During this period we get 5 records from each channel. These records are considered as seizure records and all other records are considered as non-seizure records. Therefore, from the chb01_03 data set altogether we get 23×5 = 115 seizure records and 8165 non-seizure records.
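These counts follow directly from the epoch length: one hour of signal divided into 10-second epochs (cf. Fig. 8.14 to Fig. 8.18) gives 3600/10 = 360 records per channel. A quick sketch of the bookkeeping, purely illustrative:

```python
channels = 23
epoch_seconds = 10          # 10-second epochs, cf. Fig. 8.14 - Fig. 8.18
hour_seconds = 3600

records_per_channel = hour_seconds // epoch_seconds    # 360
total_records = channels * records_per_channel         # 8280

# the ~40 s seizure (seconds 2996-3036) overlaps five 10 s epochs
seizure_per_channel = 5
seizure_records = channels * seizure_per_channel       # 115
non_seizure_records = total_records - seizure_records  # 8165
print(total_records, seizure_records, non_seizure_records)
```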
8.3.5 Clustering Quality Comparison between HeMI++ and Other Techniques on the
MIT-Chb01_03 Data Set
In this section, we empirically compare HeMI++ with AGCUK (Y. Liu et al., 2011), GAGR
(D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014), K-means (Lloyd, 1982), K-
means++ (Arthur & Vassilvitskii, 2007) and our technique HeMI (see Chapter 6) on a brain data
set (MIT-chb01_03) through visual analysis of clustering results. We also compare all the
techniques based on our proposed cluster evaluation technique. In this section, we use three
attributes (max, min and std) of the data set in order to plot the records so that we can see the
records and their orientations. Such plots also help us to see clustering results and their
appropriateness.
Fig. 8.6: Clustering result of HeMI on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.7 : Clustering result of AGCUK on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.6 shows the clustering result of HeMI on the CHB-MIT Scalp EEG (chb01-03) data
set. HeMI generates two clusters but one cluster contains only one record and all other records
belong to the other cluster. Clearly, this does not appear to be a sensible clustering. From Table
8.4 we can see that according to our proposed cluster evaluation technique HeMI receives a poor
evaluation result which is ∞. Therefore, the evaluation made by our proposed evaluation
technique matches with the manual evaluation through the visual analysis of the plotted records.
Fig. 8.8: Clustering result of GAGR on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.7 shows the clustering result of AGCUK where it generates two clusters: seizure and
non-seizure. Mainly, the non-seizure records found in Cluster 1 and mixed of seizure and non-
seizure records are found in Cluster 2. Cluster 1 has 2836 non-seizure records (dots in Fig. 8.7)
and 0 seizure records (plus signs in Fig. 8.7), while Cluster 2 has 5389 non-seizure records
(triangles in Fig. 8.7) and 55 seizure records (circles in Fig. 8.7). We can clearly see that while
the clustering result is more sensible than the clustering result of HeMI (see Fig. 8.6), it is still
not a good clustering result. Our proposed cluster evaluation technique also identifies this: as we can see in Table 8.4, AGCUK scores better than HeMI. This again re-confirms the effectiveness
of our proposed evaluation technique.
Fig. 8.8, Fig. 8.9, Fig. 8.10 and Fig. 8.11 show the clustering results of GAGR, GenClust, K-
means and K-means++ where GAGR, GenClust, K-means and K-means++ produce 56, 477, 28
and 13 clusters, respectively. Considering that the data set has only two types of records, seizure and non-seizure, these clustering results with so many clusters also do not make sense.
This is also identified by our proposed evaluation technique as shown in Table 8.4.
Fig. 8.9: Clustering result of GenClust on the CHB-MIT Scalp EEG (chb01-03) data set
As we can see in Fig. 8.12, HeMI++ produces a sensible clustering solution as it matches
with the original orientation of records in the data set. It produces two clusters: Cluster 1 and
Cluster 2. Cluster 1 contains 8219 non-seizure records and 38 seizure records, while Cluster 2
contains 6 non-seizure records and 17 seizure records. As a result, HeMI++ also achieves a good
evaluation value based on our proposed evaluation technique as shown in Table 8.4. This re-
confirms that the proposed evaluation technique produces better evaluation values for better clustering solutions.
Fig. 8.10: Clustering result of K-means on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.11: Clustering result of K-means++ on the CHB-MIT Scalp EEG (chb01-03) data set
Fig. 8.12: Clustering result of our proposed technique, HeMI++ on the CHB-MIT Scalp EEG (chb01-03) data set
Table 8.4: Clustering results of HeMI++ and other techniques based on Tree Index
Clustering Techniques Tree Index (lower the better)
HeMI++ 0.55
HeMI ∞
GenClust 5.27
GAGR 19.89
AGCUK 18.19
K-means 27.41
K-means++ 31.01
8.3.6 Analysis of the Clustering Result Obtained by HeMI++ from the CHB-MIT Scalp
EEG (chb01-03) Data Set
We now analyze the clustering results obtained by HeMI++ on the CHB-MIT Scalp EEG
(chb01-03) data set in order to explore knowledge from this data set. Detailed information of
the records in Cluster 2 is presented in Table 8.5. The first column in Table 8.5 shows the
channel number and position of the channel on the surface of the scalp according to the
International 10-20 system. Columns 2 and 3 present the number of seizure and non-seizure
records in Cluster 2 for every channel.
Table 8.5: Channel wise number of records in Cluster 2 of HeMI++ on the CHB-MIT Scalp EEG (chb01-03) data set
Channel (Position) | Number of seizure records | Number of non-seizure records
Channel 1 (FP1-F7) 2 0
Channel 5 (FP1-F3) 2 0
Channel 9 (FP2-F4) 2 0
Channel 10 (F4-C4) 1 0
Channel 13 (FP2-F8) 2 0
Channel 14 (F8-T8) 2 0
Channel 15 (T8-P8) 2 1
Channel 16 (P8-O2) 0 3
Channel 17 (FZ-CZ) 0 1
Channel 21 (FT9-FT10) 2 0
Channel 23 (T8-P8) 2 1
The number of seizure records in this data set is 115, considering that each of the 23 channels
has 5 seizure records. However, we have access to the original brain signal for all 23 channels.
10-second epochs of the signal for some of the channels are shown in Fig. 8.14 - Fig. 8.18.
Through our analysis of the signal for all 23 channels we realize that only 11 out of 23 channels
actually show a seizure pattern during the seizure time. Therefore, there are actually only 11×5 = 55 seizure records. When we plot these records in Fig. 8.1, around 25 of them are clearly visible and the others overlap with other records.
In order to visualize the seizure records we plot the records of Channel-5, Channel-13,
Channel-17 and Channel-21 as shown in Fig. 8.13 (a), (b), (c) and (d), respectively. Fig. 8.13
(a) to Fig. 8.13 (d) show that in each channel, around two or three out of the 5 seizure records are clearly
visible. Therefore, HeMI++ finds 23 records in Cluster 2, where 17 of them are seizure records.
It finds some non-seizure records in Cluster 2 because they are very similar to seizure records.
(a) Channel-5
(b) Channel-13
(c) Channel-17
(d) Channel-21
Fig. 8.13: Seizure records of different channels
Fig. 8.14 shows the signals of Channel-5 during the non-seizure time where the amplitude
varies between 200 uV and -200 uV. Fig. 8.15, Fig. 8.17 and Fig. 8.18 show the signals of
Channel-5, Channel-9 and Channel-13, respectively during the seizure time, where the signals
show high amplitude (between 400 uV and -400 uV). Hence, we realize that the seizure time signal generally displays high amplitude whereas the non-seizure time signal displays low amplitude. However, Fig. 8.16 displays overall low amplitude (much like any non-seizure time signal) although it is the signal of Channel-7 during the seizure time.
Thus, we realize that Channel-7 does not experience a seizure signal even during the seizure
time.
Fig. 8.14: EEG signals (10 seconds) of Channel-5 during the non-seizure time
Fig. 8.15: EEG signals (10 seconds) of Channel-5 during the seizure time
Fig. 8.16: EEG signals (10 seconds) of Channel-7 during the seizure time
Fig. 8.17: EEG signals (10 seconds) of Channel-9 during the seizure time
Fig. 8.18: EEG signals (10 seconds) of Channel-13 during the seizure time
We also find that all 11 channels showing the seizure signal during the seizure time are
located in the frontal lobe and temporal-parietal lobe of the scalp (see Fig. 8.19) indicating that
the seizure for this patient was a localized seizure that originated in the frontal lobe and temporal-parietal lobe. Out of these 11 channels, 8 are located in the frontal lobe and the 3 others (Channel-15, Channel-16 and Channel-23) are located in the temporal-parietal lobe.
Interestingly, most of the records that HeMI++ places in Cluster 2 also come from these 11 channels.
This again re-confirms the quality of the clustering results obtained by our proposed technique.
Fig. 8.19: Channel positions according to the International 10-20 system (Jasper, 1958; Sharbrough F et al., 1991). (A = Ear lobe, C = Central, P = Parietal, F = Frontal, Fp = Frontal polar, O = Occipital; the Frontal lobe and Temporal-Parietal lobe regions are marked.)
8.3.7 Evaluation of HeMI++ and Tree Index on the LD data set
In order to further evaluate Tree Index and HeMI++, in this section we empirically compare the clustering results of all the techniques on the LD data set based on Tree Index. We also graphically visualize the clustering results in order to validate the correctness of the Tree Index evaluation. In this section, we use three attributes of the data set, mcv (mean corpuscular volume), alkphos (alkaline phosphatase) and sgpt (alanine aminotransferase), in order to plot the records so that we can see the records and their orientations. Such plots of clustering results help us to evaluate the correctness of the Tree Index evaluation.
Fig. 8.20: Clustering result of HeMI on the LD data set
Fig. 8.21: Clustering result of AGCUK on the LD data set
Fig. 8.22: Clustering result of GAGR on the LD data set
Fig. 8.23: Clustering result of GenClust on the LD data set
Fig. 8.24: Clustering result of K-means on the LD data set
Fig. 8.25: Clustering result of K-means++ on the LD data set
Fig. 8.26: Clustering result of HeMI++ on the LD data set
Fig. 8.27: The three dimensional LD data set
Table 8.6: Comparative results of all the techniques on the LD data set based on Tree Index and other evaluation techniques
Techniques | F-measure (higher the better) | Entropy (lower the better) | Purity (higher the better) | Silhouette Coefficient (higher the better) | XB Index (lower the better) | SSE (lower the better) | DB Index (lower the better) | Tree Index (lower the better)
GenClust 0.82 0.43 0.82 0.67 0.19 2.64 0.50 3.53
HeMI 0.73 0.97 0.57 0.73 0.05 100.96 0.35 ∞
GAGR 0.63 0.92 0.60 0.09 1.04 59.56 1.98 5.62
HeMI++ 0.73 0.98 0.57 0.43 0.63 95.85 0.96 0.39
K-Means 0.60 0.94 0.60 0.21 0.34 16.60 1.10 6.21
K-Means++ 0.65 0.96 0.59 0.12 0.23 57.27 1.13 9.62
AGCUK 0.73 0.97 0.57 0.75 0.20 33.73 0.36 ∞
Fig. 8.20 to Fig. 8.26 show the clustering results of HeMI, AGCUK, GAGR, GenClust, K-means, K-means++ and HeMI++, respectively. Fig. 8.27 shows the original structure of the LD data set.
As we can see in Fig. 8.26, HeMI++ produces a sensible clustering solution as it closely matches the original orientation of records in the data set (Fig. 8.27). Therefore, HeMI++ achieves a good evaluation value based on the Tree Index evaluation as shown in Table 8.6. However, it achieves bad evaluation values based on Entropy, Purity, Silhouette Coefficient, XB Index and DB Index as shown in Table 8.6. Fig. 8.20 shows that HeMI produces a non-sensible clustering result. It produces two clusters where one cluster contains one record and the other cluster contains all the remaining records. Yet it achieves good evaluation values based on F-measure, Silhouette Coefficient, XB Index and DB Index as shown in Table 8.6. Similar to HeMI, AGCUK also produces a non-sensible clustering solution (see Fig. 8.21) and achieves good evaluation values based on F-measure, Silhouette Coefficient, XB Index and DB Index as shown in Table 8.6. However, Tree Index produces bad evaluation values for both HeMI and AGCUK as shown in Table 8.6. Similarly, Tree Index produces bad evaluation values for the other non-sensible clustering results produced by GAGR, GenClust, K-means and K-means++ (see column 8 of Table 8.6). This again re-confirms that Tree Index produces good evaluation values for good clustering solutions and bad evaluation values for bad clustering results.
8.3.8 Experimental Results on All Techniques on 21 Real Life Data Sets
In Section 8.3.5 we empirically compare our proposed technique HeMI++ with AGCUK (Y.
Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam, 2014), K-
means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and our technique HeMI (see
Chapter 6) on a brain data set (CHB-MIT Scalp EEG). We now experimentally evaluate the
performance of HeMI++ by comparing it with K-means, K-means++, GAGR, AGCUK,
GenClust and HeMI on 21 other real-life data sets. For each data set, we run each technique 20
times. Since there are 6 data sets with categorical attributes, which K-means, K-means++, AGCUK and GAGR cannot handle, these techniques are tested on 15 (instead of 21) data sets. However, HeMI++, HeMI and GenClust are tested on all 21 data sets.
In Section 8.3.5 we demonstrate the effectiveness of our proposed cluster evaluation
technique. We find that our cluster evaluation technique matches the expectation when we
investigate the clustering results manually through visual analysis. Besides, in Table 8.1 we have
presented the limitations of the existing cluster evaluation metrics. Therefore, in Table 8.7 we
present the average evaluation values based on our proposed evaluation technique. HeMI++ achieves better results than GenClust in 13 out of 15 numerical data sets. HeMI++ achieves better evaluation values than K-means, K-means++, AGCUK, GAGR and HeMI in all 15 data sets.
Table 8.7: Comparative results between HeMI++ and other techniques on 15 numerical data sets based on Tree Index
Data set Tree Index (lower the better)
K-means K-means++ GAGR AGCUK GenClust HeMI HeMI++
GI 2.11 1.87 3.47 0.92 2.47 ∞ 0.31
VC 4.94 4.11 5.37 ∞ 3.55 ∞ 1.53
EC ∞ ∞ 6.12 ∞ ∞ ∞ 2.94
LF 2.64 3.05 3.53 1.32 3.71 ∞ 0.95
LD 6.31 4.85 7.52 1.21 4.28 0.46 0.24
WBC 5.92 7.07 8.24 3.21 5.58 2.38 1.28
BT 5.81 5.73 0.47 ∞ ∞ ∞ 0.27
PID 13.40 14.19 6.20 ∞ 3.72 ∞ 6.19
SV 5.18 3.25 4.46 1.2 3.05 ∞ 0.00
BN 4.25 5.59 4.64 1.89 2.51 ∞ 0.77
YT 13.78 12.98 ∞ ∞ 4.87 ∞ 2.44
IS 3.12 2.48 5.19 2.11 ∞ ∞ 1.53
WQ 32.64 47.09 15.37 ∞ 7.98 ∞ 13.26
PBC 13.15 14.18 10.21 ∞ 4.77 ∞ 0.44
MGT 62.06 128.61 100.09 72.92 30.67 ∞ 18.89
Fig. 8.28 shows the total score of all techniques on 15 numerical data sets based on Tree
Index. In this scoring system, the technique with the best clustering result gets 7 points and the
technique with the worst result gets 1 point, for each data set. Fig. 8.28 shows the total scores
of a technique over all data sets. The bar graph shows that HeMI++ achieves a higher score than
all other techniques.
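The scoring can be reproduced with a simple rank-based tally. Below is a sketch, assuming per-data-set Tree Index values in a nested dictionary (lower is better; ties, including multiple ∞ values, are left to the sort order in this simplified version):

```python
def total_scores(results):
    """results: {data_set: {technique: tree_index}} with lower Tree
    Index being better. Per data set, the best technique gets k points
    (k = number of techniques) and the worst gets 1."""
    scores = {}
    for per_dataset in results.values():
        ranked = sorted(per_dataset, key=lambda t: per_dataset[t])
        k = len(ranked)
        for points, tech in zip(range(k, 0, -1), ranked):
            scores[tech] = scores.get(tech, 0) + points
    return scores

example = {"GI": {"HeMI++": 0.31, "AGCUK": 0.92, "K-means": 2.11}}
print(total_scores(example))  # {'HeMI++': 3, 'AGCUK': 2, 'K-means': 1}
```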
We compare HeMI++ with GenClust and HeMI on 6 categorical data sets (data sets that have
only categorical attributes or both the categorical and numerical attributes). For each data set we
run each technique 20 times (except on the CKRK data set) and present the average clustering result. We run each technique on the CKRK data set 5 times and present the average result. Table 8.8
shows that HeMI++ achieves better results in 5 out of 6 data sets than GenClust. HeMI++
performs better than HeMI in 6 out of 6 categorical data sets.
Table 8.8: Clustering results of HeMI ++ and other techniques on 6 categorical data sets based on Tree Index
Tree Index (lower the better)
Data set GenClust HeMI HeMI++
Zoo 0.87 0.86 0.08
HT 1.94 0.59 0.46
STH 1.75 ∞ 1.38
CA ∞ ∞ 1.57
CMC 0.78 ∞ 2.51
CKRK 55.79 1.76 0.64
Fig. 8.28: Scores of the techniques on 15 numerical data sets based on Tree Index
8.3.9 An Analysis of the Clustering Quality of HeMI++ on Different Data Sets
In the preceding sections, we present the detailed experimental results of HeMI++ on 21 real-life data sets and compare it with five other techniques. We now reanalyze the findings in order to investigate some of the factors that may have influenced the clustering quality of HeMI++. The principal elements are as follows:
Number of records (𝑛) in a data set;
Number of attributes (𝑚) in a data set; and
Types of the majority of the attributes in a data set.
We divide the data sets into six groups as follows:
Group A: This group contains the data sets having a number of records fewer than 5000.
From the 21 data sets (see Table 8.3), 18 data sets fall within this group.
Group B: We consider the data sets having a number of records 5000 or more in this
group. Therefore, from our analysis the PBC, MGT and CKRK data sets of Table 8.3
are part of this group.
Group C: This group contains the data set having a number of attributes fewer than ten.
Therefore, in our analysis nine data sets of Table 8.3 belong to this group.
Group D: We consider the data sets having a number of attributes 10 or more in this
group. From the 21 data sets (see Table 8.3), 12 data sets are in this group.
Group E: This group contains the data sets having a higher number of numerical
attributes than the categorical attributes. Therefore, in our analysis 15 data sets of Table
8.3 are in this group.
Group F: In this group, we consider the data sets having a higher number of categorical
attributes than numerical attributes. Five data sets (namely Zoo, Hepatitis, Statlog
Heart, Credit Approval and CMC) of Table 8.3 fall within this group.
We analyze the performance of HeMI++ by comparing its performance with that of five
existing techniques, namely K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii,
2007), AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), and GenClust (Rahman
& Islam, 2014) in terms of the number of wins, losses, and draws. We define these as follows.
Win: This means that HeMI++ performs better than an existing technique.
Loss: The term “Loss” means that HeMI++ does not perform better than an existing
technique.
Draw: This indicates that the performance of HeMI++ is the same as the performance
of an existing technique.
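Counting wins, losses and draws from the Tree Index values is straightforward; a minimal sketch under our own naming:

```python
def win_loss_draw(hemi_pp, other):
    """hemi_pp, other: {data_set: tree_index} (lower is better).
    Returns (wins, losses, draws) of HeMI++ against the other technique,
    skipping data sets the other technique cannot handle."""
    wins = losses = draws = 0
    for ds, value in hemi_pp.items():
        if ds not in other:
            continue
        if value < other[ds]:
            wins += 1
        elif value > other[ds]:
            losses += 1
        else:
            draws += 1
    return wins, losses, draws
```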
We now analyze the performance of HeMI++ based on the above factors and groups in the
following subsections.
8.3.9.1 Performance of HeMI++ compared to Existing Techniques, based on Number of
Records
In this subsection we analyze the results of HeMI++ in order to explore the influence of the
number of records on the performance of HeMI++. In Table 8.9 we present the number of wins,
losses, and draws for HeMI++ compared to AGCUK, GAGR, GenClust, K-means, and K-
means++ based on the cluster evaluation technique Tree Index (see Section 8.2.5 of Chapter 8)
for Group A and Group B. In the table, the percentage of wins, losses, and draws for HeMI++
compared to the existing techniques are also presented. As there are five data sets with
categorical attributes in Group A (see Table 8.3), AGCUK, GAGR, K-means and K-means++
cannot handle these data sets, and therefore, for Group A, HeMI++ is tested against these
techniques on 13 (instead of 18) data sets. However, HeMI++ and GenClust are tested on all 18
data sets.
From Table 8.9, it is evident that for Group A data sets, HeMI++ wins more than 80% of
cases (see Column 8 of the table) against GenClust. HeMI++ wins 100% of cases compared to
AGCUK, GAGR, K-means, and K-means++ for Group A data sets. HeMI++ loses in 18.75% of cases against GenClust for the data sets of Group A. However, HeMI++ achieves a 100% win record (see Column 11 of Table 8.9) against all the existing techniques, including GenClust, for large data sets with 5000 or more records (i.e. Group B).
Table 8.9: Performance of HeMI++ compared to existing techniques, based on number of records
Against techniques | Scores in numbers | Scores in percentage (%)
Group A Group B Group A Group B
No. of data sets = 18 No. of data sets = 3 No. of data sets = 18 No. of data sets = 3
Win Loss Draw Win Loss Draw Win Loss Draw Win Loss Draw
AGCUK 13 0 0 3 0 0 100.00 0.00 0.00 100.00 0.00 0.00
GAGR 13 0 0 3 0 0 100.00 0.00 0.00 100.00 0.00 0.00
GenClust 16 2 0 3 0 0 81.25 18.75 0.00 100.00 0.00 0.00
K-means 13 0 0 3 0 0 100.00 0.00 0.00 100.00 0.00 0.00
K-Means++ 13 0 0 3 0 0 100.00 0.00 0.00 100.00 0.00 0.00
8.3.9.2 Performance of HeMI++ compared to Existing Techniques, based on Number of
Attributes
The results comparing HeMI++ to the existing techniques based on the number of attributes will
now be analyzed. Table 8.10 shows the number of wins, losses, and draws for HeMI++ against
AGCUK, GAGR, GenClust, K-means, and K-means++ based on Tree Index (see Section 8.2.5
in Chapter 8) for Group C and Group D. As there are two data sets with categorical attributes in
Group C (see Table 8.3), AGCUK, GAGR, K-means, and K-means++ cannot process these data
sets, and therefore, for Group C, HeMI++ is tested against AGCUK, GAGR, K-means, and K-
means++ on 8 (instead of 9) data sets. However, HeMI++ and GenClust are tested on all nine
data sets. Similarly, in Group D there are four data sets with categorical attributes. Therefore,
for Group D, HeMI++ is tested against AGCUK, GAGR, K-means, and K-means++ on eight
(instead of twelve) data sets. HeMI++ and GenClust are tested on all twelve data sets.
Table 8.10 shows the results of HeMI++ alongside the existing techniques for Group C and
Group D data sets. HeMI++ achieves a 100% win rate compared to AGCUK, GAGR, K-means,
and K-means++ for the data sets of Group C and Group D. For Group C, HeMI++ suffers a loss
of 28.57% of cases against GenClust. However, HeMI++ performs better (i.e. only a 9.09% loss) against GenClust for the data sets having ten or more attributes (i.e. Group D), compared to the data sets having fewer than ten attributes (i.e. Group C), based on Tree Index.
Table 8.10: Performance of HeMI++ compared to existing techniques, based on number of attributes
Against techniques | Scores in numbers | Scores in percentage (%)
Group C Group D Group C Group D
No. of data sets = 9 No. of data sets = 12 No. of data sets = 9 No. of data sets = 12
Win Loss Draw Win Loss Draw Win Loss Draw Win Loss Draw
AGCUK 7 0 0 8 0 0 100.00 0.00 0.00 100.00 0.00 0.00
GAGR 7 0 0 8 0 0 100.00 0.00 0.00 100.00 0.00 0.00
GenClust 7 2 0 11 1 0 71.43 28.57 0.00 90.91 9.09 0.00
K-means 7 0 0 8 0 0 100.00 0.00 0.00 100.00 0.00 0.00
K-Means++ 7 0 0 8 0 0 100.00 0.00 0.00 100.00 0.00 0.00
8.3.9.3 Performance of HeMI++ compared to Existing Techniques, based on Type of the
Majority of Attributes
We now analyze the results of HeMI++ over AGCUK, GAGR, GenClust, K-means and K-
means++ based on the majority attribute types. Table 8.11 presents the performance of HeMI++ over the existing techniques based on Tree Index. Note that Group F contains 5 data sets, all of which have categorical attributes. Since AGCUK, GAGR, K-means and K-means++ cannot handle data sets with categorical attributes, for Group F HeMI++ is not tested against these techniques. However, HeMI++ is tested against all the techniques on all 15 data sets in Group E.
From Table 8.11 it appears that HeMI++ achieves a 100% win rate against AGCUK, GAGR, K-means, and K-means++ for the data sets that have a higher number of numerical attributes than categorical attributes (i.e. Group E). HeMI++ also performs better against GenClust for the data sets with a higher number of numerical attributes (i.e. Group E, a 15.38% loss), compared to the data sets with a higher number of categorical attributes (i.e. Group F, a 25.00% loss), based on Tree Index.
Table 8.11: Performance of HeMI++ compared to existing techniques, based on type of the majority of attributes
Against techniques | Scores in numbers | Scores in percentage (%)
Group E Group F Group E Group F
No. of data sets = 15 No. of data sets = 5 No. of data sets = 15 No. of data sets = 5
Win Loss Draw Win Loss Draw Win Loss Draw Win Loss Draw
AGCUK 15 0 0 0 0 0 100.00 0.00 0.00 0.00 0.00 0.00
GAGR 15 0 0 0 0 0 100.00 0.00 0.00 0.00 0.00 0.00
GenClust 13 2 0 4 1 0 84.62 15.38 0.00 75.00 25.00 0.00
K-means 15 0 0 0 0 0 100.00 0.00 0.00 0.00 0.00 0.00
K-Means++ 15 0 0 0 0 0 100.00 0.00 0.00 0.00 0.00 0.00
8.3.10 Knowledge from the Brain Data
The CHB-MIT (chb01-03) data set has altogether 8280 records. According to the data provider
the numbers of seizure and non-seizure records are as follows.
Seizure records: 23×5 = 115; non-seizure records: 8165.
However, from our previous discussion we know that actually there are 11 × 5 = 55 seizure
records, as this was a localized seizure and not all channels experience a seizure signal even
during the seizure time. HeMI++ groups the records in two clusters. Cluster 1 has mostly seizure
records and Cluster 2 has mostly non-seizure records. Hence, according to HeMI++ the numbers
of seizure and non-seizure records are as follows.
Seizure records (Cluster 1): 23; non-seizure records (Cluster 2): 8257.
We now make two variants of the Brain data set: 𝐷1 and 𝐷2. In 𝐷1 all 8280 records are
labelled as either seizure or non-seizure according to the HeMI++ clustering result. In 𝐷2 all
records are labelled as either seizure or non-seizure according to the information supplied by
the data provider.
We build a number of decision trees from 𝐷1 using two existing decision forest algorithms,
SysFor (Islam & Giggins, 2011) and Forest CERN (Adnan & Islam, 2016) as shown in Fig. 8.29
(see Fig. 8.29 (a) to Fig. 8.29 (e)). We also build a number of decision trees from 𝐷2 by using
Forest CERN. In Fig. 8.29 (f) we show one representative tree from the trees built from 𝐷2.
We can see from these figures that the trees built from 𝐷1 are clearly more accurate than the
tree built from 𝐷2. For example, the tree in Fig. 8.29 (a) has only 7 misclassified records compared to 83 in the tree in Fig. 8.29 (f). Moreover, the trees built from 𝐷1 are also shallower
than the tree from 𝐷2. This clearly indicates that the labeling in 𝐷1 is better than the labeling in
𝐷2, which in turn means the clustering results obtained by HeMI++ are more sensible than the
grouping of the records based on the knowledge/observation of the seizure time without
considering the localized seizure impact. We thus realize that HeMI++ correctly identifies the
seizure records and non-seizure records.
We now study the trees in Fig. 8.29 (a) to Fig. 8.29 (e) closely to discover knowledge. We
find that if the standard deviation of the signal is high (higher than 125 or so) then it generally represents a seizure signal, and otherwise a non-seizure signal. This matches our current understanding, suggesting that an erratic or abrupt signal (see Fig. 8.15 and Fig. 8.17) represents seizure.
We also find that if the max amplitude of the signal is high (say higher than 400 uV or so) then it represents a seizure signal, which again matches our existing knowledge. Although the decision trees discover knowledge that is already known or understood, they play a useful role in verifying it. Moreover, they also give a concrete figure (such as Std > 125) for seizure detection/prediction, as this figure may vary from patient to patient. Thus, we can clearly see the value of knowledge discovery from our clustering results. We aim to carry out a thorough knowledge discovery from the records in the Brain data set.
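Trees like those in Fig. 8.29 can be grown, in spirit, with any off-the-shelf decision tree learner. Below is a minimal scikit-learn sketch on synthetic stand-in data (the real experiments use SysFor and Forest CERN on the labelled EEG records); the feature names and the 125 threshold are only the kind of pattern we expect the tree to recover, not a guaranteed output:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the labelled EEG records: features are
# [max, min, std] of an epoch; label 1 = seizure, 0 = non-seizure.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [100.0, 100.0, 40.0] + [0.0, 0.0, 80.0]
y = (X[:, 2] > 125).astype(int)   # stand-in labels, not real clusters

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=2)
tree.fit(X, y)
print(export_text(tree, feature_names=["max", "min", "std"]))
# on such data the tree recovers a rule of the form: std > ~125 -> 1
```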
(a) Decision Tree on the CHB-MIT Scalp EEG (chb01-03) data
set (labelled by clustering result of HeMI++)
(b) Decision Tree on the CHB-MIT Scalp EEG (chb01-03) data set
(labelled by clustering result of HeMI++)
(c) Decision Tree on the CHB-MIT Scalp EEG (chb01-03) data
set (labelled by clustering result of HeMI++)
(d) Decision Tree on the CHB-MIT Scalp EEG (chb01-03) data set (labelled by clustering result of HeMI++)
(e) Decision Tree on the CHB-MIT Scalp EEG (chb01-03) data
set (labelled by clustering result of HeMI++)
(f) Decision Tree on the original Chb_MIT_01_03 data set
Fig. 8.29: Decision trees on the CHB-MIT Scalp EEG (chb01-03) data set
8.3.11 Complexity Analysis
In this section, we present the complexity of HeMI++ and compare it with the complexity of
AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et al., 2009), GenClust (Rahman & Islam,
2014), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007) and HeMI (see
Chapter 6). The main factors related to the complexity of HeMI++ are as follows: in a data set
𝐷 the number of records is 𝑛, the number of attributes is 𝑚, the number of genes in a chromosome is 𝑘, the number of chromosomes in a population is 𝑧, the number of iterations in K-means is 𝑁′ and the number
of iterations in HeMI++ is 𝑁. We realize that out of these factors 𝑛, 𝑚, 𝑘 and 𝑧 can be much
bigger than others. Hence, we consider 𝑛, 𝑚, 𝑘 and 𝑧 to compute the complexity.
In the initial population, HeMI++ uses a deterministic phase and a random phase to generate
a number of chromosomes. In the deterministic phase, it uses K-means to generate a number of
chromosomes, the complexity of which is 𝑂(𝑛𝑚𝑘𝑧). In the random phase, HeMI++ generates
initial chromosomes randomly. The complexity for this phase is 𝑂(𝑘𝑧). HeMI++ uses DB Index
as its fitness function to compute the fitness of a chromosome. The complexity of DB Index
is 𝑂(𝑛𝑚𝑘𝑧). HeMI++ selects the necessary properties for a sensible solution which can be done
in 𝑂(𝑛𝑚𝑘𝑧) complexity. The noising selection requires pairwise comparison, for which we
need 𝑂(𝑧) complexity.
The crossover operation uses roulette wheel for which the complexity is 𝑂(𝑧2). For twin
removal, it requires 𝑂(𝑚𝑘²𝑧) complexity. In the mutation operation, HeMI++ uses division, absorption and random change steps. The complexities for the division, absorption and random change are 𝑂(𝑛𝑚𝑘𝑧), 𝑂(𝑚𝑘𝑧) and 𝑂(𝑧), respectively. The complexities of the steps in the health improvement operation are 𝑂(𝑛𝑚𝑘𝑧), 𝑂(𝑧) and 𝑂(𝑧), respectively.
HeMI++ performs a cleansing operation with the complexity of 𝑂(𝑛𝑚𝑘𝑧). Similarly, the
cloning operation requires 𝑂(𝑧) complexity. The complexity of elitist operation is 𝑂(𝑧) once
the fitness is computed with the cost of 𝑂(𝑛𝑚𝑘𝑧). The neighbor information sharing requires
𝑂(𝑧) complexity. Similarly, the complexity of global best selection is 𝑂(𝑛𝑚𝑘𝑧). Hence, the
overall complexity of HeMI++ is 𝑂(𝑛𝑚𝑘²𝑧²). With respect to the two most significant factors (𝑛 and 𝑚), it has a linear complexity 𝑂(𝑛𝑚). The complexities of AGCUK, K-means, GAGR, GenClust and HeMI are 𝑂(𝑛𝑚) (Y. Liu et al., 2011), 𝑂(𝑛𝑚) (Lloyd, 1982), 𝑂(𝑛𝑚) (D.-X. Chang et al., 2009), 𝑂(𝑛𝑚² + 𝑛²𝑚) (Rahman & Islam, 2014) and 𝑂(𝑛𝑚) (see Chapter 6),
respectively.
8.3.12 Statistical Friedman Test
We now carry out the statistical Friedman test (Demšar, 2006; Friedman, 1940) in order to evaluate the superiority of the Tree Index results obtained by HeMI++ over the results obtained by the existing techniques and our proposed technique HeMI. We rank the algorithms on each data set according to the rank-ordering used in the Friedman test (Demšar, 2006; Friedman, 1940). Among the 7 competing algorithms, the one providing the best Tree Index value is assigned Rank 1, the second best Rank 2, and so on (hence, the lower the average rank the better the result). Ties are resolved by assigning the average of the sequential ranks they would have received. The rank of each competing algorithm for each data set is presented within parentheses. The bottom row of Table 8.12 presents (within parentheses) the average rank (in short, Rank) of each competing algorithm over all data sets considered.
From Table 8.12, we can see that K-means provides the best Tree Index value for 0 data sets (Rank: 4.26), K-means++ for 0 data sets (Rank: 4.40), GAGR for 0 data sets (Rank: 4.60), AGCUK for 0 data sets (Rank: 4.20), GenClust for 2 data sets (Rank: 3.50) and HeMI for 0 data sets (Rank: 5.90), whereas HeMI++ achieves the best Tree Index value in 13 out of 15 data sets (Rank: 1.13). We now conduct a statistical significance test (Demšar, 2006) in order to assess the superiority of HeMI++ over the existing techniques.
Table 8.12: Tree Index rank of the techniques based on the Friedman test (Demšar, 2006; Friedman, 1940)
Data set Tree Index (lower the better)
K-means K-means++ GAGR AGCUK GenClust HeMI HeMI++
GI 2.11 (4) 1.87 (3) 3.47 (6) 0.92 (2) 2.47 (5) ∞ (7) 0.31 (1)
VC 4.94 (4) 4.11 (3) 5.37 (5) ∞ (6.5) 3.55 (2) ∞ (6.5) 1.53 (1)
EC ∞ (5) ∞ (5) 6.12 (2) ∞ (5) ∞ (5) ∞ (5) 2.94 (1)
LF 2.64 (3) 3.05 (4) 3.53 (5) 1.32 (2) 3.71 (6) ∞ (7) 0.95 (1)
LD 6.31 (6) 4.85 (5) 7.52 (7) 1.21 (3) 4.28 (4) 0.46 (2) 0.24 (1)
WBC 5.92 (5) 7.07 (6) 8.24 (7) 3.21 (3) 5.58 (4) 2.38 (2) 1.28 (1)
BT 5.81 (4) 5.73 (3) 0.47 (2) ∞ (6) ∞ (6) ∞ (6) 0.27 (1)
PID 13.40 (4) 14.19 (5) 6.20 (3) ∞ (6.5) 3.72 (1) ∞ (6.5) 6.19 (2)
SV 5.18 (6) 3.25 (4) 4.46 (5) 1.2 (2) 3.05 (3) ∞ (7) 0.00 (1)
BN 4.25 (4) 5.59 (6) 4.64 (5) 1.89 (2) 2.51 (3) ∞ (7) 0.77 (1)
YT 13.78 (4) 12.98 (3) ∞ (6) ∞ (6) 4.87 (2) ∞ (6) 2.44 (1)
IS 3.12 (4) 2.48 (3) 5.19 (5) 2.11 (2) ∞ (6.5) ∞ (6.5) 1.53 (1)
WQ 32.64 (4) 47.09 (5) 15.37 (3) ∞ (6.5) 7.98 (1) ∞ (6.5) 13.26 (2)
PBC 13.15 (4) 14.18 (5) 10.21 (3) ∞ (6.5) 4.77 (2) ∞ (6.5) 0.44 (1)
MGT 62.06 (3) 128.61 (6) 100.09 (5) 72.92 (4) 30.67 (2) ∞ (7) 18.89 (1)
Average rank (4.26) (4.40) (4.60) (4.20) (3.50) (5.90) (1.13)
The Friedman (Friedman, 1940) test is a non-parametric test used to compare multiple
algorithms on multiple data sets. Using the Friedman test the null hypothesis is that all
algorithms are equivalent. If the null hypothesis is rejected, we can proceed with a post-hoc test
such as the Bonferroni-Dunn test (Demšar, 2006; O. J. Dunn, 1961). The Friedman statistic is distributed according to $\chi_F^2$ with $(k-1)$ degrees of freedom when $k$ (the number of competing algorithms) and $N$ (the number of data sets) are big enough (as a rule of thumb, $k > 5$ and $N > 10$) (Demšar, 2006). Iman and Davenport (Iman & Davenport, 1980) demonstrated that Friedman's $\chi_F^2$ is undesirably conservative and derived a better statistic:

$$F_F = \frac{(N-1)\chi_F^2}{N(k-1)-\chi_F^2}$$

With 7 algorithms and 15 data sets, the value of $F_F$ is calculated to be 11.64, which far exceeds the critical value of the F distribution at α = 0.05, so the null hypothesis is rejected. The critical difference (CD) of the post-hoc Bonferroni-Dunn test at α = 0.05 is calculated to be 2.13.
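The statistics quoted above can be re-derived from the average ranks in Table 8.12. The following Python sketch uses the standard Friedman and Iman-Davenport formulas (Demšar, 2006); since the table ranks are rounded, it only approximately reproduces the quoted values:

```python
import math

N, k = 15, 7                          # data sets, algorithms
avg_ranks = [4.26, 4.40, 4.60, 4.20, 3.50, 5.90, 1.13]  # Table 8.12

chi2_F = (12 * N / (k * (k + 1))) * (
    sum(r ** 2 for r in avg_ranks) - k * (k + 1) ** 2 / 4)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)

# Bonferroni-Dunn critical difference at alpha = 0.05;
# q_0.05 = 2.690 for 7 classifiers (Demsar, 2006)
CD = 2.690 * math.sqrt(k * (k + 1) / (6 * N))
print(round(chi2_F, 2), round(F_F, 2), round(CD, 2))  # ~40.6 ~11.5 ~2.12
```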
We can see that the critical difference remains lower than the pairwise differences of average ranks between the control clustering algorithm (HeMI++) and all other contending algorithms (K-means vs HeMI++: 3.13, K-means++ vs HeMI++: 3.26, GAGR vs HeMI++: 3.46, AGCUK vs HeMI++: 3.06, GenClust vs HeMI++: 2.36 and HeMI vs HeMI++: 4.76), indicating that HeMI++ performs significantly better (in terms of Tree Index) than all the other algorithms on the 15 real-life data sets.
8.4 Summary
We realize that some recent clustering techniques do not produce sensible clustering solutions
although their solutions achieve high fitness values based on existing evaluation criteria. These
solutions are therefore unlikely to be useful in knowledge discovery from underlying data sets.
We apply some existing techniques on a brain data set and realize that their clustering solutions
either have way too many clusters or only two clusters where one cluster contains only one
record and all other records are stored in the other cluster.
Hence, in this chapter we propose a new clustering technique HeMI++ that first learns
important properties of sensible clustering solutions and then applies the information in
producing its clustering solutions. When we apply HeMI++ on a brain data set we find that the
proposed clustering technique overcomes the existing problem and produces the right number
of clusters with the right records in the clusters.
During the development of the proposed clustering technique we realize that the existing
cluster evaluation techniques are biased towards either high number of clusters or very low
number of clusters. Hence, in this chapter we also propose a novel cluster evaluation technique
called Tree Index.
Tree Index first labels the records based on the clustering results that it wants to evaluate. It
then builds a decision tree from the data set with the labels. The basic idea here is that if the
labeling is good (i.e. sensible) then the produced tree is likely to classify the training records
more accurately and be shallow. Based on this basic concept Tree Index computes an evaluation
value of a clustering solution. Different clustering solutions can be compared based on their
Tree Index values.
In this chapter we graphically visualize two types of clustering solutions: sensible solutions
and non-sensible solutions (either having too many clusters or having one record in one cluster
and all other records in the other cluster) as shown in Fig. 8.2 to Fig. 8.5. While existing
evaluation techniques fail to correctly evaluate the cluster quality, Tree Index scores the sensible solutions better (i.e. lower) than the non-sensible solutions.
We then empirically compare our proposed clustering technique (HeMI++) with five existing
techniques on 21 publicly available data sets in terms of our Tree Index evaluation technique.
We find that HeMI++ achieves the best clustering solutions in 18 out of 21 data sets. Moreover,
we graphically visualize the clustering results of HeMI++ on a brain data set and find the results
to be more sensible than others. Additionally, we discover some useful knowledge from the
clustering results produced by HeMI++ indicating its usefulness in knowledge discovery.
Chapter 9
Discussion
9.1 Introduction
The main goals of this study are to propose a novel clustering technique with the ability to
produce sensible clusters, and to put forward a cluster evaluation technique suitable for
evaluating sensible and non-sensible clusters. A number of clustering techniques have been
outlined in the literature (Arthur & Vassilvitskii, 2007; D.-X. Chang et al., 2009; D. Chang et
al., 2012; He & Tan, 2012; Y. Liu et al., 2011; Lloyd, 1982; Rahman & Islam, 2014). However,
there are limitations to the existing clustering techniques, indicating there is potential for
improvement. Hence, in this study we propose a number of clustering techniques with sequential
quality improvement, developed by addressing the limitations of existing techniques.
Moreover, during the development of the proposed clustering techniques we observe that the
existing cluster evaluation techniques (Agustín-Blas et al., 2012; D L Davies & Bouldin, 1979;
Pang-Ning Tan, Michael Steinbach, 2005; Rahman & Islam, 2014) are biased towards either
high numbers of clusters or very low numbers of clusters. Consequently, in this study we also
propose a novel cluster evaluation technique which is demonstrated to be effective in evaluating
sensible and non-sensible clustering solutions.
In this chapter, we will present an overall discussion and comparison of the proposed
techniques and investigate their performances. The main contributions of the thesis will then be
discussed, followed by complexity analyses of the proposed techniques along with comparison
with existing techniques. Finally, comment is made on future research directions.
The structure of Chapter 9 is as follows: Section 9.2 features a comparison and discussion of
the proposed techniques; Section 9.3 assesses the main contributions of the thesis. Complexity
analyses of the proposed techniques are presented in Section 9.4, and comparison of
complexities is described in Section 9.5; while Section 9.6 summarises the proposed techniques;
and finally, future research directions are proposed in Section 9.7.
9.2 Comparison and Discussion of the Proposed Techniques
9.2.1 DeRanClust
In Chapter 3, we present a GA-based clustering technique called DeRanClust that produces
high-quality chromosomes in the initial population. The use of GA in clustering techniques can
help to avoid the local optima issue of K-means (Agustín-Blas et al., 2012; D.-X. Chang et al.,
2009; D. Chang et al., 2012; He & Tan, 2012; Y. Liu et al., 2011; Peng et al., 2014; Rahman &
Islam, 2014). Typically, a genetic algorithm-based technique does not require any user input for the number of clusters 𝑘.
However, many existing techniques (Y. Liu et al., 2011; Maio et al., 1995; Maulik &
Bandyopadhyay, 2000; Xiao et al., 2010) generate the number of genes of a chromosome
randomly, in population initialization. These techniques may also randomly choose records as
genes, instead of carefully choosing genes of a chromosome. Careful selection of genes can
create an initial population containing high-quality chromosomes, and a high-quality initial population typically increases the likelihood of obtaining a good clustering solution at the
completion of genetic processing (Diaz-Gomez & Hougen, 2007; Goldberg et al., 1991;
Rahman & Islam, 2014).
An existing technique known as GenClust (Rahman & Islam, 2014) finds a high-quality
initial population and thereby obtains good clustering solutions. However, its initial population
selection process is very complex; with a complexity of 𝑂(𝑛2), where 𝑛 is the number of records
in a data set. Moreover, GenClust requires user input in regard to the number of radius values
for the clusters in the initial population selection. It can be very difficult for a user to estimate
the set of radius values (i.e. radii).
Therefore, we propose DeRanClust to enable a high-quality initial population with a low
complexity of 𝑂(𝑛) to be produced. This technique automatically chooses the number of
clusters for the chromosomes in the initial population. Therefore, no user input is required for the
number of clusters 𝑘. The proposed population initialization approach uses chromosomes
obtained both deterministically and randomly. The effectiveness of this method of population
initialization is illustrated in Table 3.2 using five data sets. The table indicates that an existing
technique called AGCUK performs better when it uses our proposed population initialization
approach rather than using traditional population initialization techniques.
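The two-phase initialization can be outlined as follows. This is an illustrative sketch under our own naming, not DeRanClust's exact pseudocode; `kmeans` is assumed to be a user-supplied routine returning k cluster centres, and k is drawn from the 2 to √n range used elsewhere in this thesis.

```python
import random

def init_population(records, pop_size, kmeans):
    """Two-phase initialization: half the chromosomes come from a
    deterministic pass (K-means seeds), half from random record
    selection. `kmeans(records, k)` is a user-supplied routine that
    returns k cluster centres."""
    n = len(records)
    population = []
    for _ in range(pop_size // 2):               # deterministic phase
        k = random.randint(2, max(2, int(n ** 0.5)))
        population.append(kmeans(records, k))
    for _ in range(pop_size - len(population)):  # random phase
        k = random.randint(2, max(2, int(n ** 0.5)))
        population.append(random.sample(records, k))
    return population
```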
Lastly, we compare DeRanClust with AGCUK (Y. Liu et al., 2011), GAGR (D. Chang et al.,
2012), K-Means (Lloyd, 1982), and GenClust (Rahman & Islam, 2014) on five data sets in terms
of two well-known evaluation criteria: Silhouette Coefficient (Agustín-Blas et al., 2012; Pang-
Ning Tan, Michael Steinbach, 2005) and DB Index (D. L. Davies & Bouldin, 1979).
DeRanClust performs significantly better than the other techniques for all data sets and for both
evaluation criteria. Using the proposed DeRanClust technique, we make progress towards
achieving our first research goal.
9.2.2 GMC
Building on this, we contend that the cluster quality of DeRanClust (see Chapter 3) can be improved
by enhancing other genetic operations such as crossover and mutation operations. Consequently,
in Chapter 4, we propose a novel clustering technique titled GMC that uses new selection,
crossover and mutation operations in order to improve cluster quality. Chapter 4 involved
further progress towards achieving our first research goal.
The proposed crossover operation first classifies the chromosomes in a population into one of two groups: a Good group and a Non-good group. It then performs different types of crossover on the two different groups. Intuitively, this is to increase the possibility of obtaining
good-quality offspring chromosomes from a pair of good-quality parent chromosomes. Fig. 4.3
and Fig. 4.4 show the impact of the proposed crossover operation in producing better clustering
results.
Mutation is another important component in GA, in the quest to improve chromosome
quality. Therefore, in a similar manner to crossover, we also introduce a new mutation operation
in GMC with the aim of improving chromosome quality. The proposed mutation operation
reduces the number of changes on the good-quality chromosomes, and increases the number of
changes on the bad-quality chromosomes, in order to improve their overall quality. Fig. 4.5 and
Fig. 4.6 show the effectiveness of the proposed mutation operation. GMC also uses a new
selection operation comparing chromosomes with two generations, whereby a chromosome
with higher fitness value has a greater likelihood of being selected for other genetic operations,
such as crossover and mutation. Fig. 4.7 and Fig. 4.8 show the effectiveness of the proposed
selection operation.
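The idea of applying fewer changes to good chromosomes and more to bad ones can be pictured as a fitness-adaptive mutation rate. Below is a schematic sketch, assuming higher fitness is better; this is our simplification, not GMC's exact operation:

```python
import random

def adaptive_mutation(chromosome, fitness, best_fitness, mutate_gene):
    """Mutate each gene with a probability that shrinks as the
    chromosome's fitness approaches the best fitness seen so far
    (higher fitness assumed to be better)."""
    rate = 1.0 - fitness / best_fitness if best_fitness else 1.0
    return [mutate_gene(g) if random.random() < rate else g
            for g in chromosome]
```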
We evaluate GMC by comparing it with AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang
et al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007), and GenClust
(Rahman & Islam, 2014) on ten natural data sets available from the UCI Machine Learning
Repository (M. Lichman, 2013). GMC achieves significantly better clustering results (according
to the sign test analysis) than all the existing techniques on all ten data sets in terms of two
evaluation criteria (see Fig. 4.11). Fig. 4.1 and Fig. 4.2 demonstrate that GMC clearly achieves
better average results than all other techniques.
9.2.3 GCS
Although genetic operations such as crossover and mutation tend to improve the health/fitness of a chromosome, they can also deteriorate the health of some chromosomes. GCS (see Chapter 5) therefore introduces a new genetic operation known as the health check in order
to ensure the health of chromosomes in a population. The usefulness of the health check
operation is presented in Table 5.2. The work conducted in Chapter 5 allows us to move closer
to achieving research goal 1.
GCS also modifies the process by which a pair of chromosomes is selected for the crossover
operation through two phases in order to increase the potential for obtaining better quality
offspring chromosomes. GMC (as presented in Chapter 4) also uses a new crossover operation
where a chromosome with low fitness value always makes a pair with another low-quality
chromosome. Therefore, GCS introduces a new crossover operation where each chromosome
gets an opportunity to make a pair with the best chromosome. Table 5.3 shows that GCS
achieves better clustering results when it uses the proposed crossover operation rather than the
conventional crossover operation.
The proposed technique also uses a new selection operation in order to increase the quality
of chromosomes in a population. GCS uses the elitist operation after each genetic operation
within a generation, in order to keep track of the best solution obtained thus far. Fig. 5.7 shows
the gradual improvement of the best chromosome over the iterations. In Fig. 5.8, we can see that
all chromosomes in a population improve over the iterations, indicating the usefulness of
components including the health check and selection operation.
We empirically compare GCS with AGCUK (Y. Liu et al., 2011), GAGR (D.-X. Chang et
al., 2009), K-means (Lloyd, 1982), K-means++ (Arthur & Vassilvitskii, 2007), and GenClust
(Rahman & Islam, 2014) on 15 natural data sets available from the UCI Machine Learning
Repository (M. Lichman, 2013). GCS achieves significantly better clustering results than all the
existing techniques on all 15 data sets in terms of two evaluation criteria (see Fig. 5.1 to Fig.
5.4). Fig. 5.2 and Fig. 5.4 indicate that GCS clearly achieves better average results than all other
techniques, without any overlapping of the standard deviation.
9.2.4 HeMI
It is evident from the literature (Pourvaziri & Naderi, 2014; Straßburg et al., 2012) that the
population size has a positive impact on clustering quality. That is, a big population size is likely
to contribute towards a good clustering solution. However, a big population size also incurs a high time complexity.
Therefore, in Chapter 6, we propose a novel clustering method titled HeMI that uses a big
population through multiple streams where each stream contains a relatively small number of
chromosomes; this facilitates a low execution time, as the streams are suitable for
parallel processing when necessary. The effectiveness of the use of multiple streams is presented
in Table 6.6. Various genetic operations (such as crossover and mutation) are applied to each
stream in parallel. As a result, HeMI is likely to produce better quality clustering solutions.
Moreover, due to splitting chromosomes into a number of streams and processing the splits
separately, HeMI exhibits a greater ability to explore the solution space compared to the traditional
approach of processing all chromosomes in a single stream.
Note that some existing techniques use parallel genetic algorithms (Kumar et al., 2011; Y.
Y. Liu & Wang, 2015; Moore, 2004; Straßburg et al., 2012) where the total number of
chromosomes is divided into a number of parallel runs. In our technique, however, the total
number of chromosomes is increased. The main goal of the existing techniques is to reduce time
complexity through the parallelization of the genetic algorithms, whereas the main goal of HeMI
is to improve clustering results. The parallelization employed by these existing techniques does
not share information between the parallel streams, whereas HeMI introduces information
sharing across the streams at regular intervals in order to take advantage of the multiple streams
(see Table 6.6, Table 6.7, and Table 6.8). The impact of the number of streams is presented in
Fig. 6.5. HeMI also demonstrates the significance of the use of intervals on information sharing
in Fig. 6.4.
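The multi-stream structure can be pictured as a small loop: each stream evolves independently and, at a regular interval, the globally best chromosome is shared with every stream. Below is a schematic sketch (the function names and the replace-the-worst sharing policy are our assumptions, with higher fitness taken to be better):

```python
def evolve_multi_stream(streams, generations, interval,
                        evolve_one_generation, fitness):
    """streams: list of chromosome lists. Each stream evolves
    independently; every `interval` generations the globally best
    chromosome replaces the weakest member of every stream
    (higher fitness assumed to be better)."""
    for g in range(1, generations + 1):
        streams = [evolve_one_generation(s) for s in streams]
        if g % interval == 0:
            best = max((c for s in streams for c in s), key=fitness)
            for s in streams:
                s.sort(key=fitness)   # weakest chromosome first
                s[0] = best           # information sharing
    return streams
```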
In a similar way to DeRanClust (Chapter 3), GMC (Chapter 4), and GCS (Chapter 5), HeMI
also uses a high-quality initial population built in two phases: a deterministic phase and a random phase. However, in the random phase all these techniques except HeMI generate |P|/2 chromosomes, where |P| (set to 20 in our experiments) is the number of chromosomes in a population. We realize that, in a comparable way to the deterministic phase, good-quality chromosomes can also be produced through the random phase. Therefore, in HeMI we generate the same number of chromosomes (45 chromosomes) in the random phase and then select the top |P|/2 (i.e. 10) chromosomes from the random phase. The effectiveness of the high-quality initial
population selection is presented in Table 6.10.
The presence of healthy chromosomes (i.e. chromosomes with high fitness values) in a
population can increase the possibility of good clustering results. Hence, HeMI replaces the sick
chromosomes (i.e. chromosomes with low fitness) with healthy chromosomes. GCS (as
presented in Chapter 5) also uses a health check operation to select sick chromosomes in a
population, and probabilistically replaces them with healthy chromosomes found in the previous
20 generations. GCS applies the health check operation after 20 generations. However, we
empirically find that the chromosomes and best chromosome in a population improve their
quality over the iterations (see Fig. 6.6, Fig. 6.7 and Fig. 6.8). Hence, GCS’s approach of using
the pool of best chromosomes obtained from the first 20 iterations may not be effective in the
health improvement of later iterations, such as the 40th iteration.
Consequently, HeMI uses a new health check operation whereby some of the healthy
chromosomes are chosen from a pool of healthy chromosomes obtained from the initial
population, while other healthy chromosomes are generated through the crossover operation of
the existing healthy chromosomes of a generation, with the hope that the crossover of two
healthy chromosomes may generate new healthy chromosomes. Table 6.13 demonstrates the
impact of the health improvement operation.
HeMI uses a three-step mutation operation, which applies division and absorption operations in sequence if they improve the quality of the clustering solutions. Additionally, at the end of the division and absorption operations, it also applies a random change to the chromosomes. The usefulness of the proposed mutation operation is presented in Table 6.11 and Table 6.12. HeMI also maintains randomness through the noising selection and crossover operations, in order to explore the solution space.
Finally, we compare HeMI with AGCUK, GAGR, K-means, K-means++, and GenClust on
20 data sets in relation to two well-known evaluation criteria; Silhouette Coefficient and DB
Index. HeMI achieves significantly better clustering results (according to the sign test analysis)
than all existing techniques on all 20 data sets in terms of the two evaluation criteria (see Fig.
6.9 and Fig. 6.10). Fig. 6.2 and Fig. 6.3 demonstrate that HeMI clearly achieves better results
on average than all other techniques based on the two evaluation criteria, without any
overlapping of the standard deviation. In Chapter 6 we achieve research goal 1 (producing
parameter-less clustering techniques with high-quality solutions and low complexity).
9.2.5 CSClust
With the result of HeMI (as presented in Chapter 6), we achieve our first goal of proposing a
parameter-less clustering technique with a high-quality solution and low complexity. However,
in order to achieve our second goal of producing sensible clustering solutions we need to
carefully analyse the results obtained by HeMI and other existing techniques. This analysis is
undertaken in Chapter 7. From the empirical analysis (see Section 7.2 in Chapter 7) we realise
that many existing clustering techniques (AGCUK, GAGR, GenClust, K-means and K-
means++) do not produce sensible clustering solutions, although their solutions achieve high
fitness values based on existing evaluation criteria. These solutions are typically not useful in
knowledge discovery from underlying data sets (see Fig. 7.2, Fig. 7.3, and Fig. 7.4). Therefore,
in Chapter 7, we propose a novel clustering technique known as CSClust that first learns
important properties of sensible clustering solutions and then applies this information in
producing its clustering solutions.
We apply the existing techniques on a brain data set and realize that their clustering solutions
either have far too many clusters or only two clusters; where one cluster contains only one record
and all other records are stored in the other cluster (see Fig. 7.2, Fig. 7.3, and Fig. 7.4). CSClust
overcomes this problem and produces the right number of clusters, with the right records in the
clusters (see Table 7.3). From the brain data set, it captures 40 seizure records in one cluster and
non-seizure records in another cluster (see Fig. 7.5). To evaluate the clustering quality of
CSClust, we relabel the records based on the clustering solutions of CSClust and produce a
number of decision trees to discover logic rules for seizure and non-seizure records. The logic
rules (such as "if Std > 102.44 then Seizure") obtained by the forest (see Fig. 7.13) appear to be
sensible; further confirming the accuracy of clustering results obtained by the proposed
clustering technique.
We compare CSClust with AGCUK, GAGR, GenClust, K-means, and K-means++ using the
brain data set. Table 7.2 demonstrates that CSClust achieves better clustering results than the
existing techniques, based on the two evaluation criteria. The empirical results on the brain data
set also indicate that CSClust produces an appropriate number of clusters (see Column 4 of
Table 7.2). We also evaluate CSClust against existing clustering techniques using ten natural
data sets obtained from the UCI machine learning repository. Fig. 7.14 and Fig. 7.15 show that
CSClust clearly achieves better results than all other techniques, based on the two evaluation
criteria.
9.2.6 HeMI++
In Chapter 8, we propose a new clustering technique and an evaluation technique. In the
proposed clustering technique, we combine our previous technique – known as CSClust – with
HeMI, and also significantly improve the components of CSClust (see Chapter 7) and HeMI
(see Chapter 6). Therefore, we call the proposed technique HeMI++. In Chapter 8, we achieve
our second and third research goals (producing high-quality and sensible clustering solutions,
and a cluster evaluation technique for better evaluating sensible and non-sensible clustering
solutions).
First, we explore the quality of HeMI and several existing clustering techniques. We also
assess the quality of existing evaluation techniques. In Chapter 7, we find that some existing
techniques do not produce sensible clusters. However, in Chapter 8, we carefully assess the
clustering quality of the existing techniques and HeMI through cluster visualization. In order to
assess the quality of the existing clustering techniques and cluster evaluation techniques, we
plot the data set so that we can graphically visualize the clusters (see Fig. 8.1). We know that this
data set has two types of records: seizure and non-seizure. Fig. 8.1 also clearly demonstrates that
there are two clusters of records. We then apply the existing clustering techniques on this data
set and plot their clustering results so that we can graphically visualize the clusters.
Through this, we find that some existing clustering techniques – such as GAGR and GenClust
– do not produce sensible clusters. We also find that our technique HeMI (as presented in
Chapter 6) does not produce sensible clusters. GenClust produces 447 clusters (see Fig. 8.2)
which is not sensible as the actual number of clusters in this data set is supposed to be only two.
As shown in Fig. 8.3, GAGR produces 56 clusters, while HeMI produces two clusters (see Fig.
8.4) where one cluster contains one record, and the other cluster contains all remaining records.
In order to handle such a situation, in HeMI++ we propose a new component named Selection
of Sensible Properties (see Section 8.2.3 of Chapter 8). Through this component, HeMI++ first
learns important properties of sensible clustering solutions and then applies the information in
producing its clustering solutions.
Note that CSClust also learns the important properties of sensible clustering solutions.
However, CSClust does this by using the DB Index on the initial population (see Step 2 in
Section 7.3 in Chapter 7). This approach can be problematic, as the selection can be biased by
the limitations of the DB Index. HeMI++ therefore learns the properties of a sensible clustering
solution through a new approach, not via the DB Index (see Section 8.2.3 of Chapter 8). The
necessary properties of a sensible clustering solution for a data set is learned by HeMI++ from
the initial population, which is generated in the initial population through multiple streams.
The central component of HeMI++ is a cleansing operation (see Section 8.2.3 in Chapter 8),
applied in each generation in order to ensure that all chromosomes in a population have a sensible
solution. It applies the cleansing operation to each chromosome of a population by applying two
conditions: (i) the number of clusters must be within the range of a maximum and minimum
number of clusters, which is learned by HeMI++ from some of the properties of a data set, and
(ii) the minimum number of records in a cluster must be greater than a threshold minimum
number of records, which again is data-driven (i.e. not user defined). HeMI++ uses the initial
population in order to learn the range of a maximum and minimum number of clusters and the
threshold minimum number of records.
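As a minimal sketch, the cleansing check reduces to a simple predicate over the cluster sizes of a chromosome's solution. The threshold values in the example below are hypothetical; in HeMI++ they are learned from the initial population.

def is_sensible(cluster_sizes, min_k, max_k, min_records):
    # A solution is kept only if (i) its number of clusters lies within the
    # learned [min_k, max_k] range and (ii) its smallest cluster holds more
    # than min_records records.
    k = len(cluster_sizes)
    return min_k <= k <= max_k and min(cluster_sizes) > min_records

# For example, a 40-vs-460 split passes for min_k=2, max_k=10, min_records=5,
# whereas a 1-vs-499 split does not.
print(is_sensible([40, 460], 2, 10, 5))  # True
print(is_sensible([1, 499], 2, 10, 5))   # False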
Another interesting idea associated with HeMI++ is the cloning operation (see Section 8.2.3
in Chapter 8) that replaces sick chromosomes in each generation/population. In each population,
the cleansing operation identifies the sick chromosomes, which are then replaced by chromosomes
from a pool of healthy chromosomes found in the initial population. The pool of high-quality chromosomes
created for the initial population is expected to be reasonably healthy, due to the repeated use of
K-means.
During the development of the proposed clustering technique, we realize that the existing
cluster evaluation techniques (see Section 8.2.1) produce inaccurate evaluation results.
Sometimes they produce higher evaluation values for non-sensible clustering solutions and
lower evaluation values for sensible clustering solutions. Sometimes they produce higher
evaluation values both for the sensible and non-sensible clustering solutions, which is not as
useful for measuring clustering quality. Therefore, we propose a new evaluation technique titled
Tree Index (see Section 8.2.5).
Tree Index first labels the records based on the clustering results to be evaluated. It then
builds a decision tree from the data set with the labels. The premise is that if the labeling is
good (i.e. sensible) then the produced tree is likely to classify the training records more
accurately and to be shallow. Using this basic concept, Tree Index computes an evaluation
value of a clustering solution. Different clustering solutions can be compared based on their
Tree Index values.
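The premise can be illustrated with scikit-learn as follows. The exact scoring formula of Tree Index is defined in Section 8.2.5 of Chapter 8; the accuracy-to-depth ratio used in this sketch is only an illustrative stand-in.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_index_sketch(X, cluster_labels):
    # Label the records by their cluster, train a decision tree on those
    # labels, and reward solutions whose tree is both accurate and shallow.
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X, cluster_labels)
    accuracy = tree.score(X, cluster_labels)  # training accuracy
    return accuracy / (1 + tree.get_depth())  # illustrative: higher is better

# Compare two candidate labellings of the same data set.
X = np.random.rand(200, 4)
sensible = (X[:, 0] > 0.5).astype(int)         # separable labelling
arbitrary = np.random.randint(0, 2, size=200)  # noisy labelling
print(tree_index_sketch(X, sensible) > tree_index_sketch(X, arbitrary))  # typically True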
We graphically visualize two types of clustering solutions; sensible solutions and non-
sensible solutions (either having too many clusters or having one record in one cluster and all
other records in the other cluster) as shown in Fig. 8.2, Fig. 8.3, Fig. 8.4 and Fig. 8.5. While
existing evaluation techniques fail to correctly evaluate the cluster quality, Tree Index correctly
scores the sensible solutions higher than the non-sensible ones.
Our proposed clustering technique (HeMI++) is then empirically compared with five existing
techniques using 21 publicly available data sets, in terms of our Tree Index evaluation technique.
We find that HeMI++ achieves the best clustering solutions in 18 out of 21 data sets. Moreover,
we graphically visualize the clustering results of HeMI++ on a brain data set and find the results
to be more sensible than others. Additionally, we discover some useful knowledge from the
clustering results produced by HeMI++, indicating its practicality in knowledge discovery.
Therefore, we empirically demonstrate, through the proposed HeMI++ and Tree Index techniques,
the achievement of our second and third research goals: producing high-quality and sensible
clusters (superior to the other techniques used in this study) without requiring any user input,
and a cluster evaluation technique for better evaluation of sensible and non-sensible clustering
results (more accurate than the existing evaluation techniques used in this study).
9.3 Key Contributions of the Thesis
In this study, we present a number of clustering techniques for producing high-quality and
sensible clustering solutions, and a cluster evaluation technique for better evaluating sensible
and non-sensible clustering results. The main contributions of the thesis are listed as follows.
All proposed clustering techniques produce high-quality clusters with a low complexity
of 𝑂(𝑛);
All proposed clustering techniques do not require any user input;
All proposed clustering techniques avoid local optima while clustering the records;
We propose clustering techniques with the ability to process data sets with categorical
and/or numerical attributes;
We propose clustering techniques that generate appropriate cluster numbers through a
data-driven approach;
We propose clustering techniques that produce sensible clustering solutions, appropriate
for knowledge discovery;
We propose a cluster evaluation technique suitable for evaluating sensible and non-
sensible clustering solutions.
9.4 Complexity Analysis of the Techniques
In this section, we present a detailed complexity analysis of our proposed techniques and some
existing techniques.
9.4.1 Notations for Complexity Analysis
We use the following notations in order to analyze the complexity of the techniques. We consider
a data set 𝐷 with 𝑛 records and 𝑚 attributes; the maximum domain size of a categorical attribute is 𝑑;
the minimum number of records in a cluster is 𝑅; the number of genes in a chromosome is 𝑘; the number
of chromosomes in a population is 𝑃; the number of iterations in K-means is 𝑁′; and the number of
generations is 𝑁.
9.4.2 Complexity of DeRanClust
In Chapter 3, we present a novel clustering technique known as DeRanClust. The step-by-step
detailed complexity analysis of DeRanClust is as follows:
Step 1: Normalization
We apply DeRanClust on the data sets with numerical attributes only. To normalize the values
of a numerical attribute of a data set, we find the minimum and maximum domain values of the
attribute. The complexity of finding the minimum domain value of a numerical attribute is
𝑂(𝑛). Similarly, the complexity of finding the maximum domain value of a numerical attribute is
𝑂(𝑛). Therefore, the overall complexity for normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
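A minimal sketch of this min-max normalization, assuming the data set is held as a numerical matrix, is as follows.

import numpy as np

def normalize(data):
    # One pass finds each attribute's minimum and maximum (O(n) per
    # attribute), so normalizing m attributes into [0, 1] costs O(nm).
    data = np.asarray(data, dtype=float)
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant attributes
    return (data - lo) / span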
Step 2: Population Initialization
DeRanClust produces its initial chromosomes through a deterministic phase and a random
phase. In the deterministic phase, it applies K-means to produce each chromosome. If there are
𝑃 number of chromosomes, then the complexity for deterministic chromosomes is 𝑂(𝑛𝑚𝑘𝑃𝑁´),
where the number of attributes is 𝑚, total number of records is 𝑛, maximum number of genes
of a chromosome is 𝑘 and the total number of iterations in K-means is 𝑁´. The complexity of
𝑃 number of random chromosomes is 𝑂(𝑘𝑃), where 𝑘 is the maximum number of genes in
each chromosome.
The fitness of each chromosome is calculated using the DB Index, which estimates the
distance between all pairs of seeds. If there are 𝑃 chromosomes, then the complexity
of the fitness calculation is 𝑂(𝑛𝑚𝑘𝑃). Once the fitness values of the chromosomes are computed,
we need to sort them in descending order in order to select the 𝑃 best chromosomes. The
complexity of sorting is 𝑂(𝑃2). Therefore, the total complexity of population initialization
is 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2) = 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
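A minimal sketch of the two-phase initialization follows, under the assumptions that a chromosome is simply an array of cluster seeds, that the deterministic phase varies the number of clusters across K-means runs, and that random chromosomes draw their seeds from the records; scikit-learn's KMeans and davies_bouldin_score stand in for the thesis implementations.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def initial_population(data, pop_size, k_max, rng):
    # Deterministic phase: one chromosome per K-means run, varying k.
    candidates = []
    for i in range(pop_size):
        k = 2 + i % (k_max - 1)
        km = KMeans(n_clusters=k, n_init=1, random_state=i).fit(data)
        candidates.append(km.cluster_centers_)
    # Random phase: seeds drawn at random from the records.
    for _ in range(pop_size):
        k = int(rng.integers(2, k_max + 1))
        candidates.append(data[rng.choice(len(data), size=k, replace=False)])
    # Rank all candidates by DB Index (lower is better) and keep the best.
    def db_index(seeds):
        labels = ((data[:, None] - seeds[None]) ** 2).sum(-1).argmin(axis=1)
        return davies_bouldin_score(data, labels) if len(set(labels)) > 1 else np.inf
    candidates.sort(key=db_index)
    return candidates[:pop_size]

pop = initial_population(np.random.rand(300, 4), pop_size=10, k_max=6,
                         rng=np.random.default_rng(0))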
Step 3: Noise-based Selection
In noise-based selection, DeRanClust pairwise compares the fitness of the chromosomes of the
current 𝑖𝑡ℎ generation with the fitness of the chromosomes of the previous (𝑖 − 1)𝑡ℎ generation.
Therefore, for 𝑃 chromosomes the complexity is 𝑂(𝑃).
Step 4: Crossover Operation
For the crossover operation, pairs of chromosomes are selected from 𝑧 chromosomes. The best
chromosome (currently available in the population) is chosen as the first chromosome of the
pair while the second chromosome of the pair is chosen using the roulette approach. In the
roulette wheel technique, we need to calculate the probability of each chromosome in order to
select the second chromosome of the pair. There are 𝑃/2 crossover operations altogether.
Therefore, the overall complexity of the crossover operation is 𝑂(𝑃2).
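The roulette wheel step can be sketched as follows, assuming non-negative fitness values where higher is better (with the DB Index, which is minimized, an inverted fitness would be used in practice).

import random

def roulette_select(population, fitnesses):
    # Pick one chromosome with probability proportional to its fitness.
    # Computing every chromosome's selection probability is O(P); repeating
    # the selection for the P/2 pairs yields the O(P^2) cost quoted above.
    total = sum(fitnesses)
    r = random.uniform(0, total)
    acc = 0.0
    for chrom, fit in zip(population, fitnesses):
        acc += fit
        if acc >= r:
            return chrom
    return population[-1]  # numerical safety net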
Step 5: Twin Removal
The twin removal approach removes/changes the identical genes of a chromosome. If a
chromosome has 𝑘 genes then the complexity of finding the identical genes is 𝑂(𝑚𝑘2).
Therefore, the complexity of all 𝑃 chromosomes is 𝑂(𝑚𝑘2𝑃).
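A minimal sketch of twin removal, where a gene is one cluster seed (a tuple of 𝑚 attribute values) and random_gene is a hypothetical callable supplying a replacement seed; a set-based scan is shown here for brevity, whereas a naive pairwise comparison of the 𝑘 genes gives the 𝑂(𝑚𝑘2) bound quoted above.

def remove_twins(chromosome, random_gene):
    # Replace duplicate genes (identical seeds) within one chromosome.
    seen = set()
    cleaned = []
    for gene in chromosome:
        key = tuple(gene)
        if key in seen:
            cleaned.append(random_gene())  # twin found: replace it
        else:
            seen.add(key)
            cleaned.append(gene)
    return cleaned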
Step 6: Mutation Operation
Division
DeRanClust first calculates the fitness of 𝑃 chromosomes with a complexity of 𝑂(𝑛𝑚𝑘𝑃). It
then splits the sparse cluster using K-means, where the value of 𝑘 is 2. If there are 𝑃
chromosomes, then the complexity of splitting all 𝑃 chromosomes is 𝑂(𝑘𝑃). The overall
complexity of division operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑘𝑃)= 𝑂(𝑛𝑚𝑘𝑃).
Absorption
After the division operation, DeRanClust calculates the fitness of 𝑃 chromosomes with the
complexity of 𝑂(𝑛𝑚𝑘𝑃). It then compares the fitness of all 𝑃 chromosomes (that it obtains after
division operation) with all 𝑃 chromosomes (that it obtains after crossover operation) in order
to select the chromosomes for the absorption operation. The complexity of this is 𝑂(𝑃). In the
absorption operation, it identifies two closest clusters in a chromosome and then merges the two
closest clusters with a complexity of 𝑂(𝑘). The complexity for 𝑃 chromosomes is 𝑂(𝑘𝑃). The
overall complexity of the absorption operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃 + 𝑘𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
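The two mutation steps can be sketched as follows, assuming the seeds of a chromosome are held as a NumPy array and labels assign each record to its nearest seed. How the cluster to split is chosen and how clusters are merged may differ in detail from the thesis; this sketch splits the smallest cluster via 2-means and merges the Euclidean-closest pair of seeds.

import numpy as np
from sklearn.cluster import KMeans

def divide(seeds, data, labels):
    # Division: split the sparsest cluster into two by running 2-means on
    # its records, replacing one seed with two.
    sizes = np.bincount(labels, minlength=len(seeds))
    sparse = int(np.argmin(sizes))
    members = data[labels == sparse]
    if len(members) < 2:
        return seeds
    halves = KMeans(n_clusters=2, n_init=1, random_state=0).fit(members)
    return np.vstack([np.delete(seeds, sparse, axis=0), halves.cluster_centers_])

def absorb(seeds):
    # Absorption: merge the two closest seeds into their midpoint.
    dists = ((seeds[:, None] - seeds[None]) ** 2).sum(-1)
    np.fill_diagonal(dists, np.inf)
    i, j = np.unravel_index(np.argmin(dists), dists.shape)
    merged = (seeds[i] + seeds[j]) / 2.0
    return np.vstack([np.delete(seeds, [i, j], axis=0), merged])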
Step 7: Elitist Operation
After the mutation operation, DeRanClust calculates the fitness of 𝑃 chromosomes again with a
complexity of 𝑂(𝑛𝑚𝑘𝑃). In the elitist operation, it identifies the best and worst chromosome
of a generation. The complexity of this is 𝑂(𝑃) for all 𝑃 chromosomes. The overall complexity
of the elitist operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
If there are 𝑁 iterations, then Step 3 to Step 7 will be repeated 𝑁 times while the
normalization and population initialization will occur only once. Therefore, the total complexity
of the steps is 𝑂(𝑛𝑚 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃 + 𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 +
𝑛𝑚𝑘𝑃))= 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃)). If 𝑛 ≫ 𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑘,
𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of DeRanClust is 𝑂(𝑛𝑚).
9.4.3 Complexity of GMC
In Chapter 4, we present a novel clustering technique known as GMC. The step-by-step detailed
complexity analysis of GMC is as follows:
Step 1: Normalization
In GMC, the complexity of normalizing the values of numerical attributes of a data set is the
same as the complexity of normalizing the attribute values of DeRanClust. Therefore, the
complexity for normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
Step 2: Population Initialization
The complexity of population initialization of GMC is the same as the complexity of population
initialization of DeRanClust. Therefore, the complexity of population initialization is
𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 3: Probabilistic Selection
In the probabilistic selection, GMC first merges the chromosomes of the current 𝑖𝑡ℎ and the
previous (𝑖 − 1)𝑡ℎ generation. The complexity of this is 𝑂(𝑃). It then calculates the fitness of
each chromosome. If there are 𝑃 number of chromosomes, then the complexity of the fitness
calculation is 𝑂(𝑛𝑚𝑘𝑃). It then probabilistically selects a set of chromosomes from the merged
chromosomes based on their fitness value. The complexity of this is 𝑂(𝑃).
Therefore, the overall complexity of the probabilistic selection is 𝑂(𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃) =
𝑂(𝑛𝑚𝑘𝑃).
Step 4: Two Phases of Crossover Operation
Before applying crossover, GMC first divides the chromosomes in a population into two
groups: the good group and the non-good group. The complexity of this is 𝑂(𝑃).
Good group
In the good group, each chromosome makes a pair with every other chromosome. The complexity
of this is 𝑂(𝑃2).
Single-point crossover
For each pair of chromosomes in the good group, GMC applies single-point crossover with a
complexity of 𝑂(𝑃2).
Random crossover
In the random crossover phase, GMC combines a pair of chromosomes (𝑃𝑥 and 𝑃𝑦), and
generates a random number (𝑅𝑚) between 0 and the length of the combined chromosomes (𝑃𝑥 +
𝑃𝑦). If there are 𝑃 number of chromosomes, then the complexity of this is 𝑂(𝑘𝑃). For offspring
one, it then randomly selects 𝑅𝑚 genes from the combined chromosomes and deletes 𝑅𝑚 genes
from (𝑃𝑥 + 𝑃𝑦). The remaining genes ((𝑃𝑥 + 𝑃𝑦) − 𝑅𝑚) in the combined chromosomes are then
selected for offspring two. The complexity of this is 𝑂(𝑘). If there are 𝑃 number of
chromosomes, then complexity is 𝑂(𝑘𝑃).
Once the crossover operation is complete in this group, GMC then calculates the fitness of the
offspring chromosomes with a complexity of 𝑂(𝑛𝑚𝑘𝑃). It then sorts the offspring
chromosomes in descending order based on their fitness values. The complexity of this is
𝑂(𝑃2). It then selects 𝑃/2 offspring chromosomes with a complexity of 𝑂(𝑃). The overall
complexity of the random crossover is 𝑂(𝑘𝑃 + 𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑃) = 𝑂(𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2).
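A minimal sketch of this random crossover follows; the text allows 𝑅𝑚 anywhere between 0 and the combined length, whereas the sketch keeps 𝑅𝑚 strictly between 1 and the pool size minus one so that both offspring are non-empty, which is an assumption.

import random

def random_crossover(parent_x, parent_y, rng=random):
    # Pool the genes of both parents, draw a random count Rm, give Rm
    # randomly chosen genes to offspring one and the rest to offspring two.
    pool = list(parent_x) + list(parent_y)
    rm = rng.randint(1, len(pool) - 1)
    picked = set(rng.sample(range(len(pool)), rm))
    child_one = [g for i, g in enumerate(pool) if i in picked]
    child_two = [g for i, g in enumerate(pool) if i not in picked]
    return child_one, child_two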
Non-good group
In the non-good group, pairs of chromosomes are selected using the roulette wheel. The best
chromosome of this group is selected as the first chromosome of each pair, while the
second chromosome of the pair is selected using the roulette wheel. The complexity of this is 𝑂(𝑃2).
Single-point crossover
The complexity of the single point crossover for the non-good group is the same as the
complexity of the single-point crossover for the good group. Therefore, the complexity of the
single-point crossover of the non-good group is 𝑂(𝑃2).
Random crossover
The complexity of the random crossover for the non-good group is the same as the complexity
of the random crossover of the good group. Therefore, the complexity of the random crossover
of the non-good group is 𝑂(𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2).
The overall complexity of the crossover operation is 𝑂(𝑃 + 𝑃2 + 𝑃2 + 𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2 +
𝑃2 + 𝑃2 + 𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2) = 𝑂(𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 5: Twin Removal
In GMC, the complexity of twin removal is the same as the complexity of twin removal of
DeRanClust. Therefore, the complexity of twin removal is 𝑂(𝑚𝑘2𝑃).
Step 6: Three Steps of Mutation Operation
Before applying mutation, GMC first divides the chromosomes in a population into two
groups: the good group and the non-good group. The complexity of this is 𝑂(𝑃).
Good group
For the good group, GMC applies the division and absorption operations.
Division
The complexity of division operation is the same as the complexity of division operation of
DeRanClust. Therefore, the complexity of division operation is 𝑂(𝑛𝑚𝑘𝑃).
Absorption
The complexity of the absorption operation is the same as the complexity of the absorption
operation of DeRanClust. Therefore, the complexity of absorption operation is 𝑂(𝑛𝑚𝑘𝑃).
Non-good group
For the non-good group, division, absorption, and a random change operation are applied.
Division
The complexity of the division operation is the same as the complexity of the division operation
of DeRanClust. Therefore, the complexity of division operation is 𝑂(𝑛𝑚𝑘𝑃).
Absorption
The complexity of the absorption operation is the same as the complexity of the absorption
operation of DeRanClust. Therefore, the complexity of the absorption operation is 𝑂(𝑛𝑚𝑘𝑃).
Random Change
GMC changes one attribute value (randomly chosen) of a gene of the chromosome. The
complexity of this is 𝑂(𝑃), if there are 𝑃 number of chromosomes.
The overall complexity of mutation operation is 𝑂(𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 +
𝑛𝑚𝑘𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Step 7: Elitist Operation
In GMC, the complexity of the elitist operation is the same as the complexity of the elitist
operation of DeRanClust. Therefore, the complexity of elitist operation is 𝑂(𝑛𝑚𝑘𝑃).
If there are 𝑁 iterations then Steps 3 to 7 will be repeated 𝑁 times while the
normalization and population initialization will occur only once. Therefore, the total complexity
of the steps will be 𝑂(𝑛𝑚 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑛𝑚𝑘𝑃 + 𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃2 +
𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃)) = 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑛𝑚𝑘𝑃 + 𝑚𝑘2𝑃 + 𝑃2)). If 𝑛 ≫
𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of GMC
is 𝑂(𝑛𝑚).
9.4.4 Complexity of GCS
In Chapter 5, we present a novel clustering technique called GCS. The step-by-step detailed
complexity analysis of GCS is as follows:
Step 1: Normalization
In GCS, the complexity of normalizing the values of numerical attributes of a data set is the
same as the complexity of normalizing the attributes values of DeRanClust. Therefore, the
complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
Step 2: Population Initialization
The complexity of population initialization of GCS is the same as the complexity of population
initialization of DeRanClust. Therefore, the complexity of population initialization is
𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 3: Two Phases of Selection
Phase 1
In Phase 1, GCS selects the top |𝑃| chromosomes (according to the fitness values) from 2 × |𝑃|
chromosomes of the current population. The complexity of this is 𝑂(𝑃).
Phase 2
In Phase 2, GCS selects |𝑃| chromosomes probabilistically from a set of 3 × |𝑃| chromosomes,
which is made of the remaining bottom |𝑃| chromosomes of the current population and 2 × |𝑃|
chromosomes from the last population of the immediate previous generation. The complexity
of this is 𝑂(𝑃).
The overall complexity of Step 3 is 𝑂(𝑃 + 𝑃) = 𝑂(𝑃).
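The two phases can be sketched as follows, assuming the current population holds 2 × |𝑃| chromosomes, the previous generation contributes its 2 × |𝑃| chromosomes, and higher fitness is better; random.choices merely stands in for the probabilistic selection.

import random

def gcs_selection(current, previous, fitness):
    # Phase 1: keep the top |P| of the 2|P| current chromosomes by fitness.
    p = len(current) // 2
    ranked = sorted(current, key=fitness, reverse=True)
    phase1 = ranked[:p]
    # Phase 2: draw |P| probabilistically from the bottom |P| of the current
    # population plus the 2|P| chromosomes of the previous generation.
    pool = ranked[p:] + list(previous)
    weights = [fitness(c) for c in pool]
    phase2 = random.choices(pool, weights=weights, k=p)
    return phase1 + phase2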
Step 4: Crossover Operation
Phase 1
In Phase 1, GCS selects 2 × |𝑃| − 1 pairs of chromosomes, where in each pair the first
chromosome is always the best chromosome of the population. All other chromosomes are
chosen one by one as the second chromosome of a pair, so each pair has a different second
chromosome. Therefore, the total complexity of Phase 1 is 𝑂(𝑃2).
Phase 2
In Phase 2, pairs of chromosomes are selected from 𝑃 chromosomes. The best chromosome
(currently available in the population) is chosen as the first chromosome of the pair. The second
chromosome of the pair is chosen using the roulette approach. In the roulette wheel technique,
we need to calculate the probability of each chromosome in order to select the second
chromosome of the pair. Therefore, the overall complexity of Phase 2 is 𝑂(𝑃2).
The overall complexity of the crossover operation is 𝑂(𝑃2 + 𝑃2) = 𝑂(𝑃2).
Step 5: Twin Removal
In GCS, the complexity of twin removal is the same as the complexity of twin removal of
DeRanClust. Therefore, the complexity of twin removal is 𝑂(𝑚𝑘2𝑃).
Step 6: Mutation Operation
GCS applies the division and absorption mutation operations.
Division
The complexity of division operation is the same as the complexity of division operation of
DeRanClust. Therefore, the complexity of division operation is 𝑂(𝑛𝑚𝑘𝑃).
Absorption
The complexity of absorption operation is the same as the complexity of absorption operation
of DeRanClust. Therefore, the complexity of absorption operation is 𝑂(𝑛𝑚𝑘𝑃).
The overall complexity of the mutation operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Step 7: Health Check Operation
Phase 1
GCS prepares a set of chromosomes 𝑃, where it stores the best chromosome of each generation
for the first 𝐼 generations. It first calculates the fitness of the 𝑃 chromosomes with a complexity
of 𝑂(𝑛𝑚𝑘𝑃). It then calculates the average fitness 𝐹𝑑 of the 𝑃 chromosomes with a complexity
of 𝑂(𝑃).
Phase 2
GCS calculates the fitness of the chromosomes of the current population. If there are 𝑃
chromosomes in the population then the complexity of this is 𝑂(𝑛𝑚𝑘𝑃). It then compares the
fitness of the chromosomes with 𝐹𝑑 in order to find the sick chromosomes. The complexity of this
is 𝑂(𝑃). GCS then probabilistically selects a chromosome from the 𝑃 chromosomes (i.e. those
prepared in Phase 1) to replace a sick chromosome with a complexity of 𝑂(𝑃).
The overall complexity of the health check operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃 +
𝑃)= 𝑂(𝑛𝑚𝑘𝑃).
Step 8: Elitist Operation
The complexity of the elitist operation of GCS is the same as the complexity of the elitist
operation of DeRanClust. Therefore, the complexity of the elitist operation is 𝑂(𝑛𝑚𝑘𝑃).
If there are 𝑁 iterations then Steps 3 to 8 will be repeated 𝑁 times while the
normalization and population initialization will occur only once. Therefore, the total complexity
of the steps will be 𝑂(𝑛𝑚 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃 + 𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 +
𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃)) = 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃)). If 𝑛 ≫ 𝑃, 𝑚 ≫
𝑃, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of GCS is 𝑂(𝑛𝑚).
9.4.5 Complexity of HeMI
In Chapter 6, we present a novel clustering technique known as HeMI. The step-by-step
detailed complexity analysis of HeMI is as follows:
Step 1: Normalization
Numerical Attributes
In HeMI, the complexity of normalizing the values of numerical attributes of a data set is the
same as the complexity of normalizing the attribute values of DeRanClust. Therefore, the
complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
Categorical Attributes
To calculate the distance between two categorical values, we find their similarity following
the approach of VICUS (H. Giggins & Brankovic, 2012). The similarity between two
categorical attribute values (𝑎 and 𝑏) is calculated as follows (H. Giggins & Brankovic, 2012;
Md Anisur Rahman, 2014):
𝑆′𝑎,𝑏 = (∑𝑐=1..𝑙 √(𝑒𝑎𝑐 × 𝑒𝑏𝑐)) / √(𝑑(𝑎) × 𝑑(𝑏))        Eq. 9.1
where 𝑑(𝑎) is the degree of the attribute value 𝑎 (i.e. the number of other attribute values co-
appearing with the attribute value 𝑎 in the whole data set), 𝑒𝑎𝑐 is the number of edges between
two attribute values 𝑎 and 𝑐 (i.e. number of times the two categorical values 𝑎 and 𝑐 co-appear
in the whole data set) and 𝑙 is the total number of domain values of all attributes, excluding
the values 𝑎 and 𝑏.
Let us consider that the domain size of the largest categorical attribute (i.e. the attribute that
has the largest number of domain values) is 𝑑. Then the complexity of calculating the degrees
of the values of a categorical attribute is 𝑂((𝑚 − 1)𝑛). The complexity of calculating the
degrees of the values of all categorical attributes is 𝑂((𝑚 − 1)𝑛) + 𝑂((𝑚 − 2)𝑛) + … +
𝑂((𝑚 − (𝑚 − 1))𝑛) which is 𝑂(𝑛𝑚2). The complexity of calculating the edges between two
values 𝑎 and 𝑏; ∀ 𝑎, 𝑏 is 𝑂(𝑛𝑚2). If the domain size of an attribute is 𝑑, then the complexity
of calculating the similarity for all value pairs of the attribute is 𝑂(𝑑2). Therefore, the
complexity of calculating the similarity of all value pairs of 𝑚 attributes is 𝑂(𝑚𝑑2). The overall
complexity of normalizing the categorical attributes is 𝑂(𝑛𝑚2 + 𝑚𝑑2).
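A minimal sketch of Eq. 9.1 on a toy data set follows; note that the degree 𝑑(𝑎) is read here as the total number of co-appearances of 𝑎 with other values (rather than the number of distinct co-appearing values), which is an assumption about the definition above.

from collections import Counter
from itertools import combinations
from math import sqrt

def vicus_similarity(records, a, b):
    # Count co-appearances: edges[(x, c)] is how often values x and c appear
    # together in a record; degree[x] is the total co-appearance count of x.
    edges = Counter()
    for rec in records:
        for x, y in combinations(rec, 2):
            edges[(x, y)] += 1
            edges[(y, x)] += 1
    degree = Counter()
    for (x, _), cnt in edges.items():
        degree[x] += cnt
    others = {v for rec in records for v in rec} - {a, b}
    num = sum(sqrt(edges[(a, c)] * edges[(b, c)]) for c in others)
    return num / sqrt(degree[a] * degree[b]) if degree[a] and degree[b] else 0.0

# Toy data set: each record is the tuple of categorical values of one row.
rows = [("red", "small"), ("red", "small"), ("blue", "large"), ("red", "large")]
print(vicus_similarity(rows, "red", "blue"))  # 1 / sqrt(3) ≈ 0.577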
Step 2: Population Initialization
In HeMI, the complexity of population initialization is the same as the complexity of population
initialization of DeRanClust, GCS and GMC. Therefore, the complexity of population
initialization is 𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 3: Noise-based Selection
The complexity of noise-based selection of HeMI is the same as the complexity of noise-based
selection of DeRanClust. Therefore, the complexity of noise-based selection is 𝑂(𝑃).
Step 4: Crossover Operation
In HeMI, the complexity of the crossover operation is the same as the complexity of the
crossover operation of DeRanClust. Therefore, the complexity of the crossover operation is
𝑂(𝑃2).
Step 5: Twin Removal
The complexity of twin removal of HeMI is the same as the complexity of twin removal of
DeRanClust. Therefore, the complexity of twin removal is 𝑂(𝑚𝑘2𝑃).
Step 6: Mutation Operation
Division
In HeMI, the complexity of division operation is the same as the complexity of division
operation of DeRanClust. Therefore, the complexity of division operation is 𝑂(𝑛𝑚𝑘𝑃).
Absorption
The complexity of the absorption operation of HeMI is the same as the complexity of absorption
operation of DeRanClust. Therefore, the complexity of the absorption operation is 𝑂(𝑛𝑚𝑘𝑃).
Random Change
After the absorption operation, HeMI calculates the fitness of 𝑃 chromosomes with a
complexity of 𝑂(𝑛𝑚𝑘𝑃). It then compares the fitness of all 𝑃 chromosomes (that it obtains after
the absorption operation) with all 𝑃 chromosomes (that it obtains after the division operation) in
order to select the chromosomes for the random change operation. The complexity of this
is 𝑂(𝑃). HeMI also calculates the mutation probability for all 𝑃 chromosomes with a complexity
of 𝑂(𝑃). If a chromosome is chosen for a random change operation, it then changes one attribute
value (randomly chosen) of a gene of the chromosome. The complexity of this is 𝑂(𝑃) if there
are 𝑃 number of chromosomes. The overall complexity of the random change operation is
𝑂(𝑛𝑚𝑘𝑃 + 𝑃 + 𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
The overall complexity of the mutation operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Step 7: Health Improvement Operation
Phase 1
HeMI first calculates the fitness of 𝑃 chromosomes with a complexity of 𝑂(𝑛𝑚𝑘𝑃). It then
identifies the healthy chromosomes (50% of the population size) with a complexity of 𝑂(𝑃). The
overall complexity of Phase 1 is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Phase 2
HeMI identifies healthy chromosomes (20% of the population size) with a complexity of
𝑂(𝑃). Applying the same approach to that of Component 5 (see Section 6.2.2 of Chapter 6), it
then chooses pairs of chromosomes from these 20% of healthy chromosomes. The complexity
of this is 𝑂(𝑃). The overall complexity of Phase 2 is 𝑂(𝑃 + 𝑃) = 𝑂(𝑃).
Phase 3
HeMI identifies chromosomes (30% of the population size) from the pool of chromosomes
obtained through the deterministic phase of Component 3 (see Section 6.2.2 of Chapter 6) with
a complexity of 𝑂(𝑃). For each of these chromosomes HeMI then randomly changes an attribute
value of a gene within its original domain. The complexity of this is 𝑂(𝑃). The overall
complexity of Phase 3 is 𝑂(𝑃 + 𝑃)= 𝑂(𝑃).
The overall complexity of the health improvement operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃 +
𝑃)= 𝑂(𝑛𝑚𝑘𝑃).
Step 8: Elitist Operation
The complexity of the elitist operation of HeMI is the same as the complexity of the elitist
operation of DeRanClust. Therefore, the complexity of the elitist operation is 𝑂(𝑛𝑚𝑘𝑃).
Step 9: Neighbor Information Sharing
For the information exchange among neighboring streams, HeMI finds the best and worst
chromosome of each stream. The complexity of this is 𝑂(𝑃).
Step 10: Global Best Selection
HeMI compares all the best chromosomes of all streams and then selects the best of the best
chromosomes as the final clustering solution. The complexity of this is 𝑂(𝑃).
If there are 𝑁 iterations then Steps 3 to 9 will be repeated 𝑁 times while the
normalization, population initialization and global best selection will occur only once.
Therefore, the total complexity of the steps will be 𝑂(𝑛𝑚2 + 𝑚𝑑2 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 +
𝑃2 + 𝑃 + 𝑁(𝑃 + 𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃)) = 𝑂(𝑛𝑚2 + 𝑚𝑑2 +
𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃)). If 𝑛 ≫ 𝑑, 𝑚 ≫ 𝑑, 𝑛 ≫ 𝑃, 𝑚 ≫
𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of HeMI is 𝑂(𝑛𝑚).
9.4.6 Complexity of CSClust
In Chapter 7, we present a novel clustering technique titled CSClust. The step-by-step detailed
complexity analysis of CSClust is as follows:
Step 1: Normalization
Numerical Attributes
In CSClust, the complexity of normalizing the values of numerical attributes of a data set is the
same as the complexity of normalizing the attribute values of DeRanClust, GCS, GMC, and
HeMI. Therefore, the complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
Categorical Attributes
In CSClust, the complexity of normalizing the values of categorical attributes of a data set is
the same as the complexity of normalizing the categorical attributes values of HeMI. Therefore,
the complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚2 + 𝑚𝑑2).
The overall complexity of normalization is 𝑂(𝑛𝑚 + 𝑛𝑚2 + 𝑚𝑑2) = 𝑂(𝑛𝑚2 + 𝑚𝑑2).
Step 2: Population Initialization
In CSClust, the complexity of population initialization is the same as the complexity of
population initialization of HeMI. Therefore, the complexity of population initialization is
𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 3: Selection of Sensible Properties
CSClust selects the necessary properties of a sensible clustering solution: the minimum (𝑀𝑛)
and maximum (𝑀𝑥) number of clusters, and the minimum number of records (𝑀𝑟) in a cluster. To
find the minimum number of records in a cluster, CSClust calculates the distance between genes
and records to form clusters. The complexity of this is 𝑂(𝑛𝑚𝑘), where 𝑘 is the maximum
number of genes in each chromosome. If there are 𝑃 number of chromosomes, then the
complexity of selecting a minimum number of records in a cluster is 𝑂(𝑛𝑚𝑘𝑃).
Once the clusters are formed for each chromosome, it then finds the minimum and maximum
number of clusters from 𝑃 chromosomes. The complexity of this is 𝑂(𝑃). Therefore, the total
complexity of selecting sensible properties is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Step 4: Crossover Operation
The complexity of the crossover operation of CSClust is the same as the complexity of the
crossover operation of DeRanClust and HeMI. Therefore, the complexity of Step 4 is 𝑂(𝑃2).
Step 5: Mutation Operation
CSClust first calculates the fitness of 𝑃 chromosomes with a complexity of 𝑂(𝑛𝑚𝑘𝑃). It then
finds the maximum and minimum fitness of the chromosomes for calculating the mutation
probability. The complexity of this is 𝑂(𝑃). If a chromosome is chosen for mutation then
CSClust changes an attribute value (randomly chosen) of each and every gene of the
chromosome. The complexity of this is 𝑂(𝑘), if the number of genes in a chromosome is 𝑘.
Therefore, the overall complexity of the mutation operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃 +
𝑘𝑃)= 𝑂(𝑛𝑚𝑘𝑃).
Step 6: Twin Removal
In CSClust, the complexity of twin removal is the same as the complexity of twin removal of
DeRanClust. Therefore, the complexity of twin removal is 𝑂(𝑚𝑘2𝑃).
Step 7: Cleansing Operation
CSClust applies the cleansing operation to each chromosome in a population based on the
properties of a sensible clustering solution. In this operation, it finds the minimum (𝑀𝑛) and
maximum (𝑀𝑥) number of clusters, and the minimum number of records (𝑀𝑟) in a cluster, for
each chromosome. In order to find the minimum number of records in a cluster, CSClust
calculates the distance between genes and records to form clusters. The complexity of this is
𝑂(𝑛𝑚𝑘), where 𝑘 is the maximum number of genes in each chromosome. If there are 𝑃 number
of chromosomes, then the complexity of selecting a minimum number of records in a cluster
is 𝑂(𝑛𝑚𝑘𝑃).
Once the clusters are formed for each chromosome, it then finds the minimum and maximum
numbers of clusters from 𝑃 chromosomes. The complexity of this is 𝑂(𝑃). Therefore, the total
complexity of the cleansing operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃).
Step 8: Cloning Operation
In the cloning operation, CSClust replaces the sick chromosomes. To replace a sick
chromosome, another chromosome is probabilistically selected from the pool of chromosomes.
If there are 𝑃 chromosomes in the pool then the complexity of the probabilistic
selection is 𝑂(𝑃). CSClust then randomly changes an attribute value of a gene to another value
within the domain of the attribute. The complexity of this is 𝑂(𝑃). Therefore, the overall
complexity of Step 8 is 𝑂(𝑃 + 𝑃) = 𝑂(𝑃).
Step 9: Elitist Operation
In CSClust, the complexity of the elitist operation is the same as the complexity of the elitist
operation of DeRanClust. Therefore, the complexity of the elitist operation is 𝑂(𝑛𝑚𝑘𝑃).
If there are 𝑁 iterations, then Steps 3 to 9 will be repeated 𝑁 times while the
normalization, population initialization and selection of sensible properties will occur only once.
Therefore, the total complexity of the steps will be 𝑂(𝑛𝑚2 + 𝑚𝑑2 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 +
𝑃2 + 𝑛𝑚𝑘𝑃 + 𝑁(𝑃2 + 𝑛𝑚𝑘𝑃 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃 + 𝑛𝑚𝑘𝑃)) = 𝑂(𝑛𝑚2 + 𝑚𝑑2 +
𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃)). If 𝑛 ≫ 𝑑, 𝑚 ≫ 𝑑, 𝑛 ≫ 𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫
𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of CSClust is 𝑂(𝑛𝑚).
9.4.7 Complexity of HeMI++
In Chapter 8, we present a novel clustering technique titled HeMI++. The step-by-step detailed
complexity analysis of HeMI++ is as follows:
Step 1: Normalization
Numerical Attributes
In HeMI++, the complexity of normalizing the values of numerical attributes of a data set is the
same as the complexity of normalizing the attribute values of DeRanClust. Therefore, the
complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚).
Categorical Attributes
In HeMI++, the complexity of normalizing the values of categorical attributes of a data set is
the same as the complexity of normalizing the attribute values of HeMI. Therefore, the
complexity of normalizing 𝑚 attributes is 𝑂(𝑛𝑚2 + 𝑚𝑑2).
The overall complexity of normalization is 𝑂(𝑛𝑚 + 𝑛𝑚2 + 𝑚𝑑2) = 𝑂(𝑛𝑚2 + 𝑚𝑑2).
Step 2: Population Initialization
The complexity of population initialization of HeMI++ is the same as the complexity of
population initialization of HeMI. Therefore, the complexity of population initialization is
𝑂(𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2).
Step 3: Selection of Sensible Properties
In HeMI++, the complexity of the selection of sensible properties is the same as the complexity
of selection of sensible properties of CSClust. Therefore, the complexity of the selection of
sensible properties is 𝑂(𝑛𝑚𝑘𝑃).
Step 4: Noise-based Selection
The complexity of noise-based selection of HeMI++ is the same as the complexity of noise-
based selection of DeRanClust and HeMI. Therefore, the complexity of noise-based selection is
𝑂(𝑃).
Step 5: Crossover Operation
In HeMI++, the complexity of the crossover operation is the same as the complexity of the
crossover operation of DeRanClust and HeMI. Therefore, the complexity of the crossover
operation is 𝑂(𝑃2).
Step 6: Twin Removal
The complexity of twin removal is the same as the complexity of twin removal of DeRanClust.
Therefore, the complexity of twin removal is 𝑂(𝑚𝑘2𝑃).
Step 7: Three Steps Mutation Operation
In HeMI++, the complexity of the three step mutation is the same as the complexity of the three
step mutation of HeMI. Therefore, the complexity of the three step mutation is 𝑂(𝑛𝑚𝑘𝑃).
Step 8: Health Improvement Operation
The complexity of the health improvement operation of HeMI++ is the same as the complexity
of the health improvement operation of HeMI. Therefore, the complexity of the health
improvement operation is 𝑂(𝑛𝑚𝑘𝑃).
Step 9: Cleansing Operation
In HeMI++, the complexity of the cleansing operation is the same as the complexity of the
cleansing operation of CSClust. Therefore, the complexity of the cleansing operation is
𝑂(𝑛𝑚𝑘𝑃).
Step 10: Cloning Operation
In HeMI++, the complexity of the cloning operation is the same as the complexity of the cloning
operation of CSClust. Therefore, the complexity of the cloning operation is 𝑂(𝑃).
Step 11: Elitist Operation
In HeMI++, the complexity of the elitist operation is the same as the complexity of the elitist
operation of HeMI. Therefore, the complexity of the elitist operation is 𝑂(𝑛𝑚𝑘𝑃).
Step 12: Neighbor Information Sharing
In HeMI++, the complexity of neighbor information sharing is the same as the complexity of
neighbor information sharing of HeMI. Therefore, the complexity for neighbor information
sharing is 𝑂(𝑃).
Step 13: Global Best Selection
The complexity of global best selection of HeMI++ is the same as the complexity of global best
selection of HeMI. Therefore, the complexity of global best selection is 𝑂(𝑃).
If there are 𝑁 iterations, then Steps 4 to 12 will be repeated 𝑁 times while the
normalization, population initialization, selection of sensible properties and global best selection
will occur only once. Therefore, the total complexity of the steps will be 𝑂(𝑛𝑚2 + 𝑚𝑑2 +
𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑛𝑚𝑘𝑃 + 𝑃 + 𝑁(𝑃 + 𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 + 𝑛𝑚𝑘𝑃 +
𝑃 + 𝑛𝑚𝑘𝑃 + 𝑃)) = 𝑂(𝑛𝑚2 + 𝑚𝑑2 + 𝑛𝑚𝑘𝑃𝑁´ + 𝑛𝑚𝑘𝑃 + 𝑃2 + 𝑁(𝑃2 + 𝑚𝑘2𝑃 + 𝑛𝑚𝑘𝑃)). If
𝑛 ≫ 𝑑, 𝑚 ≫ 𝑑, 𝑛 ≫ 𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the
complexity of HeMI++ is 𝑂(𝑛𝑚).
9.4.8 Complexity of AGCUK
The complexity of AGCUK is taken from the AGCUK paper (Y. Liu et al., 2011) and the thesis
(Md Anisur Rahman, 2014). The complexity of each generation/iteration is 𝑂(𝑛𝑚𝑘𝑃). If there
are 𝑁 iterations, then the total complexity of AGCUK for 𝑁 iterations is 𝑂(𝑛𝑚𝑘𝑃𝑁).
If 𝑛 ≫ 𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the complexity of AGCUK
is 𝑂(𝑛𝑚).
9.4.9 Complexity of GAGR
The complexity of GAGR (D.-X. Chang et al., 2009) is derived from the thesis (Md Anisur
Rahman, 2014). The detailed complexity of GAGR is as follows:
Step 1:
In Step 1, GAGR generates 𝑃 number of random chromosomes with a complexity of 𝑂(𝑚𝑘𝑃).
Step 2:
In Step 2, the fitness of 𝑃 chromosomes is calculated using the sum of square error (SSE). The
complexity of the SSE calculation is 𝑂(𝑚𝑘𝑅𝑃), if there are 𝑃 number of chromosomes and the
minimum number of records in a cluster is 𝑅. The complexity of fitness calculation of 𝑃 number
of chromosomes is 𝑂(𝑃). GAGR then finds the best chromosome with a complexity of 𝑂(𝑃).
The best chromosome is then stored in a separate location. The complexity of this is 𝑂(𝑚𝑘).
The total complexity of Step 2 is 𝑂(𝑚𝑘𝑅𝑃 + 𝑃 + 𝑃 + 𝑚𝑘) = 𝑂(𝑚𝑘𝑅𝑃).
Step 3:
In Step 3, GAGR selects the best chromosome as the best cluster result with a complexity
of 𝑂(𝑚𝑘).
Step 4:
In Step 4, it selects the chromosomes for crossover and mutation operation with a complexity
of 𝑂(𝑚𝑘𝑃).
Step 5:
In Step 5, GAGR applies crossover operation. Before the crossover, it applies gene re-
arrangement on each chromosome (except the best chromosome) with a complexity of 𝑂(𝑚𝑘2).
The complexity of the gene rearrangement of (𝑃 − 1) chromosomes is 𝑂((𝑃 − 1)𝑘2𝑚)
= 𝑂(𝑚𝑘2𝑃). The complexity of the crossover of a pair of chromosomes is 𝑂(𝑘𝑚). The
complexity of the crossover of 𝑃/2 pairs of chromosomes is 𝑂((𝑃/2)𝑘𝑚). The overall complexity
of Step 5 is 𝑂(𝑚𝑘2𝑃) + 𝑂((𝑃/2)𝑘𝑚) = 𝑂(𝑚𝑘2𝑃).
Step 6:
In Step 6, GAGR applies mutation operation. It calculates the mutation probability of 𝑃
chromosomes with a complexity of 𝑂(𝑃). The complexity of performing a mutation operation
on a chromosome is 𝑂(𝑚𝑘). The complexity of inserting a mutated/un-mutated chromosome
is 𝑂(𝑚𝑘). The overall complexity of Step 6 is 𝑂(𝑃 + 𝑚𝑘 + 𝑚𝑘) = 𝑂(𝑃 + 𝑚𝑘).
Step 7:
In Step 7, GAGR calculates the fitness of the newly generated chromosomes with a complexity of 𝑂(𝑚𝑘𝑅𝑃).
Step 8:
In Step 8, it compares the worst chromosome in the new population with the best chromosome
(i.e. the best chromosome from all previous generations) in terms of their fitness value. The
complexity of this is 𝑂(𝑚𝑘).
Step 9:
In Step 9, GAGR finds the best chromosome in the new population and replaces the best
chromosome (i.e. the best chromosome from all previous generations). The complexity of this
is 𝑂(𝑚𝑘).
Step 10:
In Step 10, the best chromosome is selected as a reference for the gene re-arrangement. The
complexity of this is 𝑂(𝑚𝑘).
If there are 𝑁 iterations, then the total complexity of GAGR is 𝑂(𝑚𝑘𝑅𝑃𝑁 + 𝑚𝑘2𝑃𝑁).
Moreover, GAGR applies K-means on the best clustering solution. If there are 𝑁′ number of
iterations in K-means then the complexity of K-Means is 𝑂(𝑛𝑚𝑘𝑁′) (Md Anisur Rahman,
2014). Therefore, the total complexity of GAGR is 𝑂(𝑚𝑘𝑅𝑃𝑁 + 𝑚𝑘2𝑃𝑁 + 𝑛𝑚𝑘𝑁′). If 𝑛 ≫
𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑅, 𝑚 ≫ 𝑅, 𝑛 ≫ 𝑁′, 𝑚 ≫ 𝑁′, 𝑛 ≫ 𝑁 and 𝑚 ≫ 𝑁 then the
complexity of GAGR is 𝑂(𝑛𝑚).
9.4.10 Complexity of GenClust
The complexity of GenClust is taken from the GenClust paper (Rahman & Islam, 2014). The
detailed complexity of GenClust is as follows:
Step 1: Population Initialization
GenClust produces its initial population through two phases: deterministic and random. The
complexity of the deterministic phase is 𝑂(𝑃(𝑛𝑚2 + 𝑚𝑑2 + 𝑛2𝑚)), where 𝑃 is the number of
chromosomes. The complexity of the random phase is 𝑂(𝑘𝑃). The total complexity of the
population initialization is 𝑂(𝑃(𝑛𝑚2 + 𝑚𝑑2 + 𝑛2𝑚) + 𝑘𝑃).
Step 2: Selection Operation
GenClust uses COSEC to calculate the fitness of each chromosome. The complexity of
calculating the fitness of each chromosome is 𝑂(𝑛𝑚𝑘 + 𝑚𝑘2). If there are 𝑃 chromosomes
then the complexity of the fitness calculation for 𝑃 chromosomes is 𝑂(𝑛𝑚𝑘𝑃 +
𝑚𝑘2𝑃). Chromosomes are then sorted in descending order with a complexity of 𝑂(𝑃2) in order
to find the top 𝑃/2 chromosomes. The total complexity of the selection operation is 𝑂(𝑛𝑚𝑘𝑃 +
𝑚𝑘2𝑃 + 𝑃2).
Step 3: Crossover Operation
Before the crossover, GenClust applies gene re-arrangement on each chromosome (except the
best chromosome) with the complexity of 𝑂(𝑚𝑘2). The complexity of gene rearrangement of
(𝑃 − 1) chromosomes is 𝑂((𝑃 − 1)𝑘2𝑚) = 𝑂(𝑚𝑘2𝑃). GenClust uses the roulette wheel approach
to select a pair of chromosomes. For 𝑃 chromosomes there are 𝑃/2 crossovers
altogether. Therefore, the complexity of the crossover operation is 𝑂(𝑃2). After the crossover
operation, GenClust applies the twin removal operation with a complexity of 𝑂(𝑘2𝑚).
Therefore, the total complexity of this step is 𝑂(𝑃2 + 𝑚𝑘2𝑃 + 𝑚𝑘2𝑃) = 𝑂(𝑃2 + 𝑚𝑘2𝑃).
Step 4: Elitism Operation
GenClust calculates the fitness of 𝑃 chromosomes with a complexity of 𝑂(𝑛𝑚𝑘𝑃 + 𝑚𝑘2𝑃). It
then finds the best and worst chromosome with a complexity of 𝑂(𝑃). The total complexity of
the elitist operation is 𝑂(𝑛𝑚𝑘𝑃 + 𝑚𝑘2𝑃 + 𝑃) = 𝑂(𝑛𝑚𝑘𝑃 + 𝑚𝑘2𝑃).
Step 5: Mutation Operation
GenClust probabilistically selects a chromosome for the mutation operation. The complexity of
the probabilistic selection is 𝑂(𝑃). It then randomly changes an attribute value of each and every
gene of the selected chromosomes. The complexity of this is 𝑂(𝑘𝑃). GenClust again applies the
twin removal operation with a complexity of 𝑂(𝑚𝑘2𝑃). Therefore, the total complexity of this
step is 𝑂(𝑃 + 𝑘𝑃 + 𝑚𝑘2𝑃) = 𝑂(𝑘𝑃 + 𝑚𝑘2𝑃).
Step 6: K-Means
GenClust applies K-means to the best clustering result with a complexity of 𝑂(𝑛𝑚𝑘𝑁′).
The overall complexity of GenClust is 𝑂(𝑛𝑚2𝑃 + 𝑚𝑑2𝑃 + 𝑛2𝑚𝑃 + 𝑘𝑃 + 𝑁(𝑛𝑚𝑘𝑃 +
𝑚𝑘2𝑃 + 𝑃2) + 𝑛𝑚𝑘𝑁′). If 𝑛 ≫ 𝑑, 𝑚 ≫ 𝑑, 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑃, 𝑚 ≫ 𝑃, 𝑛 ≫ 𝑁, 𝑚 ≫ 𝑁,
𝑛 ≫ 𝑁′and 𝑚 ≫ 𝑁′ then the complexity of GenClust is 𝑂(𝑛𝑚2 + 𝑛2𝑚).
9.4.11 Complexity of K-means
The complexity of K-means is 𝑂(𝑛𝑚𝑘𝑁′) (Kolen & Hutcheson, 2002; Md Anisur Rahman,
2014; Xu & Wunsch, 2005). If 𝑛 ≫ 𝑘, 𝑚 ≫ 𝑘, 𝑛 ≫ 𝑁′ and 𝑚 ≫ 𝑁′ then the complexity of K-
Means is 𝑂(𝑛𝑚).
The complexities of some other existing GA-based clustering techniques such as
CLUSTERING (Tseng & Yang, 2001), GCA (Garai & Chaudhuri, 2004) and TGCA (He &
Tan, 2012) are 𝑂(𝑛2 + 𝑚2) (Tseng & Yang, 2001), 𝑂(𝑛2) (Garai & Chaudhuri, 2004) and
𝑂(𝑛𝑚2) (He & Tan, 2012), respectively.
9.5 Comparison of the Complexities of the Techniques
In Table 9.1, we present the complexities of our proposed technique. We also present the
complexities of some existing techniques in Table 9.1.
From Table 9.1, it is evident that the complexity of each of our proposed techniques is 𝑂(𝑛𝑚),
which is lower than the complexity of GenClust. The complexity of GenClust is 𝑂(𝑛𝑚2 +
𝑛2𝑚). Besides, the complexity of HeMI++ is 𝑂(𝑛𝑚) which is the same as the complexity of
AGCUK, GAGR and K-means. Although the complexity of AGCUK, GAGR and K-means is
equal to the complexity of HeMI++, the clustering quality of HeMI++ is better than the
clustering quality of AGCUK, GAGR, and K-means.
Table 9.1: The complexities of the techniques
Techniques Complexity
DeRanClust 𝑂(𝑛𝑚)
GCS 𝑂(𝑛𝑚)
GMC 𝑂(𝑛𝑚)
HeMI 𝑂(𝑛𝑚)
CSClust 𝑂(𝑛𝑚)
HeMI++ 𝑂(𝑛𝑚)
AGCUK 𝑂(𝑛𝑚)
GAGR 𝑂(𝑛𝑚)
GenClust 𝑂(𝑛𝑚2 + 𝑛2𝑚)
K-means 𝑂(𝑛𝑚)
CLUSTERING 𝑂(𝑛2 + 𝑚2)
GCA 𝑂(𝑛2)
TGCA 𝑂(𝑛𝑚2)
9.6 Summary of the Proposed Techniques
In this section, a summary (see Table 9.2) of the strengths and weaknesses of the proposed
clustering techniques is provided, in order to assist the reader in selecting a clustering
technique.
However, HeMI++ (presented in Chapter 8) is the most appropriate technique among our
proposed methods because it is suitable for knowledge discovery. CSClust (presented in Chapter
7) is another suitable technique for knowledge discovery. However, CSClust is affected by the
limitations of the DB Index, whereas HeMI++ is the more advanced version and does not suffer from
these limitations. HeMI++ is applicable to data sets with numerical and/or
categorical attributes. It generates the number of clusters automatically through the clustering
process. It learns the sensible clustering property from a data set and applies that to produce a
sensible clustering solution.
Table 9.2: Strengths and weaknesses of the proposed techniques

DeRanClust (presented in Chapter 3)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. The number of clusters is generated automatically through the clustering process.
Weaknesses: Inappropriate for knowledge discovery. Cannot handle data sets with categorical attributes. Affected by the limitations of the DB Index.

GMC (presented in Chapter 4)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. Produces good-quality offspring chromosomes through two phases of crossover operation. Improves chromosome quality through three steps of mutation operation. The number of clusters is generated automatically through the clustering process.
Weaknesses: Inappropriate for knowledge discovery. Cannot handle data sets with categorical attributes. Affected by the limitations of the DB Index.

GCS (presented in Chapter 5)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. Ensures the presence of good-quality chromosomes in a population at the beginning of each generation through a selection operation. Ensures the presence of healthy chromosomes in each population through a health check operation. The number of clusters is generated automatically through the clustering process.
Weaknesses: Inappropriate for knowledge discovery. Cannot handle data sets with categorical attributes. Affected by the limitations of the DB Index.

HeMI (presented in Chapter 6)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. Uses a big population through multiple streams. Improves chromosome quality through three steps of mutation operation. Ensures the presence of good-quality chromosomes in each population through a health check operation. The number of clusters is generated automatically through the clustering process. Applicable to data sets with numerical and/or categorical attributes.
Weaknesses: Inappropriate for knowledge discovery. Affected by the limitations of the DB Index.

CSClust (presented in Chapter 7)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. Learns the sensible clustering property from a data set and applies that to produce a sensible clustering solution. The number of clusters is generated automatically through the clustering process. Suitable for knowledge discovery. Applicable to data sets with numerical and/or categorical attributes.
Weaknesses: Affected by the limitations of the DB Index.

HeMI++ (presented in Chapter 8)
Strengths: Does not require any user input. Selects high-quality initial seeds. Solves the local minima issue of partition based clustering techniques. Learns the sensible clustering property from a data set and applies that to produce a sensible clustering solution. Uses a big population through multiple streams. Improves chromosome quality through three steps of mutation operation. Ensures the presence of good-quality chromosomes in each population through a health check operation. The number of clusters is generated automatically through the clustering process. Suitable for knowledge discovery. Applicable to data sets with numerical and/or categorical attributes.
Weaknesses: NA.

Tree Index (presented in Chapter 8)
Strengths: Suitable for evaluating sensible and non-sensible clustering solutions.
Weaknesses: NA.
9.7 Future Research Directions
A future research direction could involve making further modifications to the
proposed technique HeMI++. HeMI++ learns the necessary properties (the minimum and maximum
number of clusters, and the minimum number of records in a cluster) of sensible clustering
solutions based on the chromosomes that it generates in the initial population through multiple
streams. For the minimum and maximum number of clusters, HeMI++ finds the
minimum and maximum number of clusters of the chromosomes that it generates through
multiple streams. We could explore finding the appropriate number of clusters without using any range
(minimum and maximum number of clusters), although the number of clusters varies from
data set to data set.
Similarly, we can also explore the minimum number of records in a cluster. HeMI++ finds
the minimum number of records in a cluster for each of the chromosomes generated through
multiple streams. It then sorts these numbers in descending order and calculates the median of
these numbers. The median value is then used as the property, the minimum number of records
in a cluster. We plan to further explore the correctness of using the median value as the minimum
number of records in a cluster. Another future research direction could entail making further
modification of our cluster evaluation technique Tree Index.
Chapter 10
Conclusion
Clustering is an important and well-known technique in the area of data mining which has a
wide range of applications such as machine learning (Gan, 2013; Mukhopadhyay & Maulik,
2009), image segmentation (Cai et al., 2007; B. N. Li et al., 2011; F. Zhao et al., 2014), medical
imaging (Bai et al., 2013; Kannan et al., 2010; Kaya et al., 2017; Saha et al., 2016) and social
network analysis (Girvan & Newman, 2002).
Therefore, it is crucial to improve clustering techniques in order to obtain better quality
clusters from data sets. There are many approaches for clustering as presented in the literature.
However, the existing clustering techniques have some limitations and therefore, there is room
for further improvement. In this study we present a number of clustering techniques that produce
better quality clustering solutions than a number of recent existing techniques. In Chapter 9 we
present a detailed analysis and discussions based on the basic concepts and advantages of our
proposed techniques. We also present the main contributions of this study in Chapter 9 (see
Section 9.3).
Many existing techniques have some limitations such as the requirement for user input on
the number of clusters, the tendency of getting stuck at local optima while clustering the records,
and selection of high-quality initial seeds with a high complexity of 𝑂(𝑛2). Therefore, in this
study we present a number of clustering techniques that sequentially improve the cluster quality
and produce high-quality clustering solutions with low complexity, requiring no user input.
We propose DeRanClust (see Chapter 3) that does not require any user input, is less likely
to get stuck at local optima and explores high-quality initial seeds with a low complexity
of 𝑂(𝑛). In Chapter 3, we progress towards achieving our first research goal. We realize that
there is room for further improvement of cluster quality of DeRanClust by improving other
genetic operations such as crossover and mutation operations. Therefore, we propose GMC (see
Chapter 4) that uses a new selection, crossover and mutation operation in order to improve the
chromosome quality. Chapter 4 involved further progress towards achieving our research goal
1.
Typically, genetic operations such as the crossover and mutation operations tend to improve
the health of a chromosome, but they can also cause the health of some chromosomes to
deteriorate. Therefore, we propose GCS (see Chapter 5) that uses a health check operation in
order to ensure the presence of healthy chromosomes in a population. GCS also uses a new
crossover and selection operation. Chapter 5 further refines the techniques proposed in the
previous two chapters, and allows us to move closer to achieving research goal 1.
In addition, from the literature (Pourvaziri & Naderi, 2014; Straßburg et al., 2012), and
through our empirical analysis (carried out in Chapter 6) we find that the population size has a
positive impact on the clustering quality. That is, a big population size is likely to contribute
towards a good clustering solution. However, a big population size requires a high execution time.
Therefore, we propose HeMI (see Chapter 6) that uses a big population in multiple streams,
where each stream contains a relatively small number of chromosomes and thus facilitates
a low execution time, since the streams are suitable for parallel processing when necessary.
HeMI also introduces information sharing among the streams at a regular interval in order to
take advantage of the multiple streams. HeMI also uses a new health improvement operation in
order to ensure healthy chromosomes in a population. We compare HeMI with five
existing techniques on 20 publicly available data sets in terms of two well-known evaluation
criteria (see Section 6.3.4 of Chapter 6). We also carry out thorough experimentation to
investigate the usefulness of the new components of HeMI (see Section 6.3.7 of Chapter 6). In
Chapter 6 we achieve our first goal of proposing a parameter-less clustering technique with high-
quality solutions and low complexity.
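The multi-stream idea with periodic information sharing can be sketched as follows. The evolve function (one generation of selection, crossover and mutation on a single stream) and the fitness function are assumed to be supplied, and the sharing rule shown (broadcast the globally best chromosome, replacing each stream's worst member) is an illustrative choice rather than HeMI's exact operation.

def multi_stream_ga(streams, evolve, fitness, generations, share_every=5):
    for generation in range(1, generations + 1):
        # Each stream evolves independently, so this loop is trivially
        # parallelisable when needed.
        streams = [evolve(stream) for stream in streams]
        if generation % share_every == 0:
            best = max((c for stream in streams for c in stream), key=fitness)
            for stream in streams:
                worst = min(range(len(stream)), key=lambda i: fitness(stream[i]))
                stream[worst] = best  # inject the shared chromosome
    return streams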
However, in order to achieve our second goal of producing sensible clustering solutions, we
carefully assess the results obtained by HeMI and other existing techniques. We find that some
recent clustering techniques do not produce sensible clusters and fail to discover knowledge
from the underlying data sets. Therefore, we propose CSClust (see Chapter 7) that uses a new
cleansing and cloning operation which helps to produce sensible clusters with high fitness
values.
Finally, we propose HeMI++, which combines our previous technique CSClust with HeMI
and significantly improves the components of both. In HeMI++, we first explore the quality
of HeMI and some existing clustering techniques. We observe that some existing clustering
techniques do not produce sensible clusters (see Fig. 8.2 and Fig. 8.3). We find that our
technique HeMI also does not produce sensible clusters (see Fig. 8.4). Sometimes these
techniques obtain a huge number of clusters, and sometimes they obtain only two clusters,
where one cluster contains one record and the other cluster contains all remaining records.
In order to handle such a situation, HeMI++ uses a new component called Selection of
Sensible Properties. Through this component HeMI++ first learns important properties of
sensible clustering solutions and then applies this information in producing its clustering
solutions.
HeMI++ also proposes a cleansing and cloning operation that helps to produce sensible
clusters. HeMI++ learns the necessary properties of a sensible clustering solution for a data set from
a high-quality initial population without requiring any user input. It then disqualifies the
chromosomes that do not satisfy the properties through its cleansing operation. In the cloning
operation, the disqualified chromosomes are then replaced by high-quality chromosomes found
in the initial population.
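A minimal sketch of the cleansing and cloning idea follows, assuming a satisfies predicate that tests the learned properties (such as the minimum number of records in a cluster) and a fitness function for ranking chromosomes; neither name reflects the thesis's actual implementation.

def cleanse_and_clone(population, initial_population, satisfies, fitness):
    # Cleansing: disqualify chromosomes violating the learned properties.
    survivors = [c for c in population if satisfies(c)]
    # Cloning: refill with the best chromosomes of the initial population.
    donors = sorted(initial_population, key=fitness, reverse=True)
    for donor in donors:
        if len(survivors) >= len(population):
            break
        survivors.append(donor)
    return survivors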
During the development of the proposed clustering techniques, we realize that the existing
cluster evaluation techniques are biased towards either a high number of clusters or a very low
number of clusters. Therefore, we also evaluate the existing cluster evaluation techniques by
analyzing them on some ground-truth results, which we also visualize graphically (see Section
8.2.1 of Chapter 8). We find that the existing evaluation techniques produce better evaluation
values for non-sensible clustering solutions than for sensible clustering solutions.
Hence, in this study we propose Tree Index, which scores sensible solutions higher than
non-sensible solutions (see Sections 8.2.1, 8.2.2 and 8.3.5 of Chapter 8).
We then empirically compare our proposed clustering technique (HeMI++) with five existing
techniques on 21 publicly available data sets in terms of our Tree Index. We find that HeMI++
achieves the best clustering solutions in 18 out of 21 data sets (see Section 8.3.8 of Chapter 8).
Moreover, we graphically visualize the clustering results of HeMI++ on a brain data set and find
the results to be more sensible than those of other techniques. Additionally, we discover some
useful knowledge from the clustering results produced by HeMI++, indicating its usefulness in
knowledge discovery. In Chapter 8 we achieve our second and third research goals: producing
high-quality and sensible clusters with no user input, and providing a cluster evaluation
technique that better distinguishes sensible from non-sensible clustering results. Future
research directions for the proposed techniques are presented in Chapter 9 (see Section 9.7).
References
Aalaei, A., Fazlollahtabar, H., Mahdavi, I., Mahdavi-Amiri, N., & Yahyanejad, M. H. (2013).
A genetic algorithm for a creativity matrix cubic space clustering: A case study in
Mazandaran Gas Company. Applied Soft Computing, 13(4), 1661–1673.
http://doi.org/10.1016/j.asoc.2012.12.011
Abolhassani, B., Salt, J. E., & Dodds, D. E. (2004). A Two-Phase Genetic K-Means Algorithm
for Placement of Radioports in Cellular Networks. IEEE Transactions on Systems, Man
and Cybernetics, Part B (Cybernetics), 34(1), 533–538.
http://doi.org/10.1109/TSMCB.2003.817073
Abonyi, J., & Feil, B. (2007). Cluster Analysis for Data Mining and System
Identification. Basel: Birkhäuser. http://doi.org/10.1007/978-3-7643-7988-9
Abshouri, A. A., & Bakhtiary, A. (2012). A new clustering method based on Firefly and KHM.
Journal of Communication and Computer, 9, 387–391.
Adnan, M. N., & Islam, M. Z. (2014). ComboSplit: Combining Various Splitting Criteria for
Building a Single Decision Tree. In International Conference on Artificial Intelligence
and Pattern Recognition (pp. 1–8).
Adnan, M. N., & Islam, M. Z. (2016). Forest CERN: A New Decision Forest Building
Technique (pp. 304–315). Springer, Cham. http://doi.org/10.1007/978-3-319-31753-3_25
Agustín-Blas, L. E., Salcedo-Sanz, S., Jiménez-Fernández, S., Carro-Calvo, L., Del Ser, J., &
Portilla-Figueras, J. A. (2012). A new grouping genetic algorithm for clustering problems.
Expert Systems with Applications, 39(10), 9695–9703.
http://doi.org/10.1016/j.eswa.2012.02.149
Ahmad, A., & Dey, L. (2007a). A k-mean clustering algorithm for mixed numeric and
categorical data. Data & Knowledge Engineering, 63(2), 503–527.
http://doi.org/10.1016/j.datak.2007.03.016
Ahmad, A., & Dey, L. (2007b). A method to compute distance between two categorical values
of same attribute in unsupervised learning for categorical data set. Pattern Recognition
Letters (Vol. 28). http://doi.org/10.1016/j.patrec.2006.06.006
Alexander, G. J., & Peterson, M. A. (2007). An analysis of trade-size clustering and its relation
to stealth trading. Journal of Financial Economics, 84(2), 435–471.
http://doi.org/10.1016/j.jfineco.2006.02.005
Andreopoulos, B., An, A., & Wang, X. (2007). Hierarchical Density-Based Clustering of
Categorical Data and a Simplification. In Advances in Knowledge Discovery and Data
Mining (pp. 11–22). Berlin, Heidelberg: Springer Berlin Heidelberg.
http://doi.org/10.1007/978-3-540-71701-0_5
Andreopoulos, W. (2006). Clustering Algorithms for Categorical Data. York University,
Toronto, Ontario.
ap Gwilym, O., & Verousis, T. (2010). Price clustering and underpricing in the IPO
aftermarket. International Review of Financial Analysis, 19(2), 89–97.
http://doi.org/10.1016/j.irfa.2010.01.007
Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms,
SODA 2007 (pp. 1027–1035). New Orleans, Louisiana, USA.
Ashton, J. K., & Hudson, R. S. (2008). Interest rate clustering in UK financial services markets.
Journal of Banking & Finance, 32(7), 1393–1403.
http://doi.org/10.1016/j.jbankfin.2007.11.002
Bador, M., Gilleland, E., Castellà, M., & Arivelo, T. (2015). Spatial clustering of summer
temperature maxima from the CNRM-CM5 climate model ensembles & E-OBS over
Europe. Weather and Climate Extremes, 9, 17–24.
http://doi.org/10.1016/j.wace.2015.05.003
Bai, P. R., Liu, Q. Y., Li, L., Teng, S. H., Li, J., & Cao, M. Y. (2013). A novel region-based
level set method initialized with mean shift clustering for automated medical image
segmentation. Computers in Biology and Medicine, 43(11), 1827–1832.
http://doi.org/10.1016/j.compbiomed.2013.08.024
Bandyopadhyay, S., & Maulik, U. (2001). Nonparametric genetic clustering: comparison of
validity indices. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 31(1),
120–125.
Bandyopadhyay, S., & Maulik, U. (2002). An evolutionary technique based on K-Means
algorithm for optimal clustering in RN. Information Sciences, 146(1), 221–237.
http://doi.org/10.1016/S0020-0255(02)00208-6
Bandyopadhyay, S., Maulik, U., & Mukhopadhyay, A. (2007). Multiobjective Genetic
Clustering for Pixel Classification in Remote Sensing Imagery. IEEE Transactions on
Geoscience and Remote Sensing, 45(5), 1506–1511.
http://doi.org/10.1109/TGRS.2007.892604
Banharnsakun, A., Sirinaovakul, B., & Achalakul, T. (2013). The best-so-far ABC with
multiple patrilines for clustering problems. Neurocomputing, 116, 355–366.
http://doi.org/10.1016/j.neucom.2012.02.047
Beauchemin, M. (2015). On affinity matrix normalization for graph cuts and spectral
clustering. Pattern Recognition Letters (Vol. 68).
http://doi.org/10.1016/j.patrec.2015.08.020
Bello-Orgaz, G., Jung, J. J., & Camacho, D. (2016). Social big data: Recent achievements and
new challenges. Information Fusion, 28, 45–59.
http://doi.org/10.1016/j.inffus.2015.08.005
Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms.
Plenum Press.
Brameier, M., & Wiuf, C. (2007). Co-clustering and visualization of gene expression data and
gene ontology terms for Saccharomyces cerevisiae using self-organizing maps. Journal
of Biomedical Informatics, 40(2), 160–173. http://doi.org/10.1016/j.jbi.2006.05.001
Brown, P., Chua, A., & Mitchell, J. (2002). The influence of cultural factors on price clustering:
Evidence from Asia–Pacific stock markets. Pacific-Basin Finance Journal, 10(3), 307–
332. http://doi.org/10.1016/S0927-538X(02)00049-5
Cagnina, L., Errecalde, M., Ingaramo, D., & Rosso, P. (2014). An efficient Particle Swarm
Optimization approach to cluster short texts. Information Sciences, 265, 36–49.
http://doi.org/10.1016/j.ins.2013.12.010
Cai, W., Chen, S., & Zhang, D. (2007). Fast and robust fuzzy c-means clustering algorithms
incorporating local information for image segmentation. Pattern Recognition, 40(3), 825–
838. http://doi.org/10.1016/j.patcog.2006.07.011
Chan, K. Y., Kwong, C. K., & Hu, B. Q. (2012). Market segmentation and ideal point
identification for new product design using fuzzy data compression and fuzzy clustering
methods. Applied Soft Computing, 12(4), 1371–1378.
http://doi.org/10.1016/j.asoc.2011.11.026
Chang, D.-X., Zhang, X.-D., & Zheng, C.-W. (2009). A genetic algorithm with gene
rearrangement for K-means clustering. Pattern Recognition, 42(7), 1210–1222.
http://doi.org/10.1016/j.patcog.2008.11.006
Chang, D., Zhao, Y., Zheng, C., & Zhang, X. (2012). A genetic clustering algorithm using a
message-based similarity measure. Expert Systems with Applications, 39(2), 2194–2202.
http://doi.org/10.1016/j.eswa.2011.07.009
Chapelle, O., Scholkopf, B., & Zien, A. (Eds.). (2006). Semi-Supervised Learning. The MIT
Press. http://doi.org/10.7551/mitpress/9780262033589.001.0001
Chen, M.-Y. (2013). A hybrid ANFIS model for business failure prediction utilizing particle
swarm optimization and subtractive clustering. Information Sciences, 220, 180–195.
http://doi.org/10.1016/j.ins.2011.09.013
Chen, Y., Wang, L., Li, F., Du, B., Choo, K.-K. R., Hassan, H., & Qin, W. (2017). Air quality
data clustering using EPLS method. Information Fusion, 36, 225–232.
http://doi.org/10.1016/j.inffus.2016.11.015
Chen, Z., & Ji, H. (2010). Graph-based Clustering for Computational Linguistics: A Survey.
In Proceedings of the 2010 Workshop on Graph-based Methods for Natural Language
Processing, ACL 2010 (pp. 1–9). Uppsala, Sweden.
Cheng, C. H., Lee, W. K., & Wong, K. F. (2002). A genetic algorithm-based clustering
approach for database partitioning. IEEE Transactions on Systems, Man and Cybernetics
Part C: Applications and Reviews, 32(3), 215–230.
http://doi.org/10.1109/TSMCC.2002.804444
Chiou, Y. C., & Lan, L. W. (2001). Genetic clustering algorithms. European Journal of
Operational Research, 135(2), 413–427. http://doi.org/10.1016/S0377-2217(00)00320-9
Chuang, K.-T., & Chen, M.-S. (2004). Clustering Categorical Data by Utilizing the Correlated-
Force Ensemble. In M. W. Berry, U. Dayal, C. Kamath, & D. Skillicorn (Eds.),
Proceedings of the 2004 SIAM International Conference on Data Mining. Philadelphia,
PA: Society for Industrial and Applied Mathematics.
http://doi.org/10.1137/1.9781611972740
Chuang, L.-Y., Hsiao, C.-J., & Yang, C.-H. (2011). Chaotic particle swarm optimization for
data clustering. Expert Systems with Applications, 38(12), 14555–14563.
http://doi.org/10.1016/j.eswa.2011.05.027
Cost, S., & Salzberg, S. (1993). A Weighted Nearest Neighbor Algorithm for Learning with
Symbolic Features. Machine Learning, 10(1), 57–78.
http://doi.org/10.1023/A:1022664626993
Cowgill, M. C., Harvey, R. J., & Watson, L. T. (1999). A genetic algorithm approach to cluster
analysis. Computers & Mathematics with Applications, 37(7), 99–108.
http://doi.org/10.1016/S0898-1221(99)00090-5
Cucchiara, R. (1998). Genetic algorithms for clustering in machine vision. Machine Vision and
Applications, 11(1), 1–6. http://doi.org/10.1007/s001380050084
Cura, T. (2012). A particle swarm optimization approach to clustering. Expert Systems with
Applications, 39(1), 1582–1588. http://doi.org/10.1016/j.eswa.2011.07.123
Mason, R. D. (1998). Statistics: An Introduction (5th ed.). Brooks/Cole Publishing Company.
Daraganova, G., Pattison, P., Koskinen, J., Mitchell, B., Bill, A., Watts, M., & Baum, S. (2012).
Networks and geography: Modelling community network structures as the outcome of
both spatial and network processes. Social Networks, 34(1), 6–17.
http://doi.org/10.1016/j.socnet.2010.12.001
Das, S., Abraham, A., & Konar, A. (2008). Automatic kernel clustering with a Multi-Elitist
Particle Swarm Optimization Algorithm. Pattern Recognition Letters (Vol. 29).
http://doi.org/10.1016/j.patrec.2007.12.002
Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on
Pattern Analysis and Machine Intelligence, PAMI-1(2), 224–227.
http://doi.org/10.1109/TPAMI.1979.4766909
de Arruda, G. F., Costa, L. da F., & Rodrigues, F. A. (2012). A complex networks approach
for data clustering. Physica A: Statistical Mechanics and Its Applications, 391(23), 6174–
6183. http://doi.org/10.1016/j.physa.2012.07.007
Deckersbach, T., Peters, A. T., Sylvia, L. G., Gold, A. K., da Silva Magalhaes, P. V., Henry,
D. B., … Miklowitz, D. J. (2016). A cluster analytic approach to identifying predictors
and moderators of psychosocial treatment for bipolar depression: Results from STEP-BD.
Journal of Affective Disorders, 203, 152–157. http://doi.org/10.1016/j.jad.2016.03.064
Demiriz, A., Bennett, K., & Embrechts, M. J. (1999). Semi-Supervised Clustering Using
Genetic Algorithms. In Artificial Neural Networks in Engineering (ANNIE-99) (pp.
809–814). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.3696
Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of
Machine Learning Research, 7, 1–30.
Deng, S., He, Z., & Xu, X. (2010). G-ANMI: A mutual information based genetic clustering
algorithm for categorical data. Knowledge-Based Systems, 23(2), 144–149.
http://doi.org/10.1016/j.knosys.2009.11.001
Diaz-Gomez, P., & Hougen, D. (2007). Initial Population for Genetic Algorithms: A Metric
Approach. In Proceedings of the 2007 International Conference on Genetic and
Evolutionary Methods (pp. 43–49). Las Vegas, Nevada, USA.
Dimopoulos, C., & Mort, N. (2001). A hierarchical clustering methodology based on genetic
programming for the solution of simple cell-formation problems. International Journal of
Production Research, 39(1), 1–19. http://doi.org/10.1080/00207540150208835
Dipnall, J. F., Pasco, J. A., Berk, M., Williams, L. J., Dodd, S., Jacka, F. N., & Meyer, D.
(2017). Why so GLUMM? Detecting depression clusters through graphing lifestyle-
environs using machine-learning methods (GLUMM). European Psychiatry, 39, 40–50.
http://doi.org/10.1016/j.eurpsy.2016.06.003
Dunn, J. C. (1974). Well-Separated Clusters and Optimal Fuzzy Partitions. Journal of
Cybernetics, 4(1), 95–104. http://doi.org/10.1080/01969727408546059
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical
Association, 56, 52–64.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density Based Notion of Clusters in
Large Spatial Databases with Noise. In 2nd International Conference on Knowledge
Discovery and Data Mining (KDD-96).
Fathian, M., Amiri, B., & Maroosi, A. (2007). Application of honey-bee mating optimization
algorithm on clustering. Applied Mathematics and Computation, 190(2), 1502–1513.
http://doi.org/10.1016/j.amc.2007.02.029
Festa, P. (2013). A biased random-key genetic algorithm for data clustering. Mathematical
Biosciences, 245(1), 76–85. http://doi.org/10.1016/j.mbs.2013.07.011
Firat, A., Chatterjee, S., & Yilmaz, M. (2007). Genetic clustering of social networks using
random walks. Computational Statistics & Data Analysis, 51, 6285–6294.
http://doi.org/10.1016/j.csda.2007.01.010
Firestone, S. M., Ward, M. P., Christley, R. M., & Dhand, N. K. (2011). The importance of
location in contact networks: Describing early epidemic spread using spatial social
network analysis. Preventive Veterinary Medicine, 102(3), 185–195.
http://doi.org/10.1016/j.prevetmed.2011.07.006
Forsati, R., Keikha, A., & Shamsfard, M. (2015). An improved bee colony optimization
algorithm with an application to document clustering. Neurocomputing, 159, 9–26.
http://doi.org/10.1016/j.neucom.2015.02.048
Friedman, M. (1940). A Comparison of Alternative Tests of Significance for the Problem of m
Rankings. Source: The Annals of Mathematical Statistics, 11(1), 86–92. Retrieved from
http://www.jstor.org/stable/2235971
Fuzzy Clustering. (2017). Retrieved February 27, 2017, from
http://reference.wolfram.com/legacy/applications/fuzzylogic/Manual/12.html
Galluccio, L., Michel, O., Comon, P., & Hero, A. O. (2012). Graph based k-means clustering.
Signal Processing, 92(9), 1970–1984. http://doi.org/10.1016/j.sigpro.2011.12.009
Gan, G. (2013). Application of data clustering and machine learning in variable annuity
valuation. Insurance: Mathematics and Economics, 53(3), 795–801.
http://doi.org/10.1016/j.insmatheco.2013.09.021
Ganti, V., Gehrke, J., & Ramakrishnan, R. (1999). CACTUS-Clustering Categorical Data
Using Summaries. In KDD-99 (pp. 73–83). San Diego, CA, USA.
Garai, G., & Chaudhuri, B. B. (2004). A novel genetic algorithm for automatic clustering.
Pattern Recognition Letters (Vol. 25). http://doi.org/10.1016/j.patrec.2003.09.012
Ghahramani, Z. (2004). Unsupervised Learning (pp. 72–112). Springer Berlin Heidelberg.
http://doi.org/10.1007/978-3-540-28650-9_5
Giebultowicz, S., Ali, M., Yunus, M., & Emch, M. (2011). A comparison of spatial and social
clustering of cholera in Matlab, Bangladesh. Health & Place, 17(2), 490–497.
http://doi.org/10.1016/j.healthplace.2010.12.004
Giggins, H., & Brankovic, L. (2012). VICUS: a noise addition technique for categorical data.
Proceedings of the Tenth Australasian Data Mining Conference - Volume 134, 139–148.
Giggins, H. P. (2009). Security of genetic databases. University of Newcastle,
Newcastle, NSW, Australia. Retrieved from
http://trove.nla.gov.au/work/31926869?selectedversion=NBD44558520
Girvan, M., & Newman, M. E. J. (2002). Community structure in social and biological
networks. Proceedings of the National Academy of Sciences, 99(12), 7821–7826.
http://doi.org/10.1073/pnas.122653799
Goldberg, D. E., Deb, K., & Clark, J. H. (1991). Genetic Algorithms, Noise, and the Sizing of
Populations. Complex Systems, 6, 333–362.
Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G.,
… Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet. Circulation,
101(23).
Gu, X., Zhang, Q., Singh, V. P., Chen, Y. D., & Shi, P. (2016). Temporal clustering of floods
and impacts of climate indices in the Tarim River basin, China. Global and Planetary
Change, 147, 12–24. http://doi.org/10.1016/j.gloplacha.2016.10.011
Gunaratne, G. H., Nicol, M., Seemann, L., & Török, A. (2009). Clustering of volatility in
variable diffusion processes. Physica A: Statistical Mechanics and Its Applications,
388(20), 4424–4430. http://doi.org/10.1016/j.physa.2009.06.050
Han, J., & Kamber, M. (2006). Data Mining Concepts and Techniques. San Francisco: Morgan
Kaufmann.
Hanagandi, V., & Nikolaou, M. (1998). A hybrid approach to global optimization using a
clustering algorithm in a genetic search framework. Computers & Chemical Engineering,
22(12), 1913–1925. http://doi.org/10.1016/S0098-1354(98)00251-8
Hassanzadeh, T., & Meybodi, M. R. (2012). A new hybrid approach for data clustering using
Firefly algorithm and k-means. In The 16th CSI International Symposium on Artificial
Intelligence and Signal Processing (AISP 2012) (pp. 7–11). IEEE.
http://doi.org/10.1109/AISP.2012.6313708
Hatamlou, A. (2013). Black hole: A new heuristic optimization approach for data clustering.
Information Sciences, 222, 175–184. http://doi.org/10.1016/j.ins.2012.08.023
He, H., & Tan, Y. (2012). A two-stage genetic algorithm for automatic clustering.
Neurocomputing, 81, 49–59. http://doi.org/10.1016/j.neucom.2011.11.001
Holland, J. H. (1975). Adaptation in natural and artificial systems: An introductory
analysis with applications to biology, control, and artificial intelligence. University of
Michigan Press.
Hong, T.-P., Chen, C.-H., & Lin, F.-S. (2015). Using group genetic algorithm to improve
performance of attribute clustering. Applied Soft Computing, 29, 371–378.
http://doi.org/10.1016/j.asoc.2015.01.001
Hong, X., Wang, J., & Qi, G. (2014). Comparison of spectral clustering, K-clustering and
hierarchical clustering on e-nose datasets: Application to the recognition of material
freshness, adulteration levels and pretreatment approaches for tomato juices.
Chemometrics and Intelligent Laboratory Systems, 133, 17–24.
http://doi.org/10.1016/j.chemolab.2014.01.017
Hong, Y., & Kwong, S. (2008). To combine steady-state genetic algorithm and ensemble
learning for data clustering. Pattern Recognition Letters, 29(9), 1416–1423.
http://doi.org/10.1016/j.patrec.2008.02.017
Hruschka, H., Fettes, W., & Probst, M. (2004). Market segmentation by maximum likelihood
clustering using choice elasticities. European Journal of Operational Research, 154(3),
779–786. http://doi.org/10.1016/S0377-2217(02)00807-X
Hsieh, M.-H., & Magee, C. L. (2008). An algorithm and metric for network decomposition
from similarity matrices: Application to positional analysis. Social Networks, 30(2), 146–
158. http://doi.org/10.1016/j.socnet.2007.11.002
Huang, A. (2008). Similarity measures for text document clustering. In Sixth New Zealand
Computer Science Research Student Conference.
Huang, C.-L., Huang, W.-C., Chang, H.-Y., Yeh, Y.-C., & Tsai, C.-Y. (2013). Hybridization
strategies for continuous ant colony optimization and particle swarm optimization applied
to data clustering. Applied Soft Computing, 13(9), 3864–3872.
http://doi.org/10.1016/j.asoc.2013.05.003
Huang, Z. (1997). Clustering large data sets with mixed numeric and categorical values. In
The First Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 21–34).
Hulse, J. D. Van, Khoshgoftaar, T. M., & Huang, H. (2007). The pairwise attribute noise
detection algorithm. Knowl Inf Syst, 11(2), 171–190. http://doi.org/10.1007/s10115-006-
0022-x
Iman, R. L., & Davenport, J. M. (1980). Approximations of the critical region of the Friedman
statistic. Communications in Statistics - Theory and Methods, 9(6), 571–595.
http://doi.org/10.1080/03610928008827904
İnkaya, T., Kayalıgil, S., & Özdemirel, N. E. (2015). Ant Colony Optimization based clustering
methodology. Applied Soft Computing, 28, 301–311.
http://doi.org/10.1016/j.asoc.2014.11.060
Islam, M. Z., & Brankovic, L. (2011). Privacy preserving data mining: A noise addition
framework using a novel clustering technique. Knowledge-Based Systems, 24(8), 1214–
1223. http://doi.org/10.1016/j.knosys.2011.05.011
Islam, M. Z., & Giggins, H. (2011). Knowledge Discovery through SysFor -a Systematically
Developed Forest of Multiple Decision Trees. 9th Australian Data Mining Conference,
195–204.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters,
31(8), 651–666. http://doi.org/10.1016/j.patrec.2009.09.011
Jarboui, B., Cheikh, M., Siarry, P., & Rebai, A. (2007). Combinatorial particle swarm
optimization (CPSO) for partitional clustering problem. Applied Mathematics and
Computation, 192(2), 337–345. http://doi.org/10.1016/j.amc.2007.03.010
Jasper, H. H. (1958). Report of the committee on methods of clinical examination in
electroencephalography: 1957. Electroencephalography and Clinical Neurophysiology.
http://doi.org/10.1016/0013-4694(58)90053-1
Ji, J., Pang, W., Zhou, C., Han, X., & Wang, Z. (2012). A fuzzy k-prototype clustering
algorithm for mixed numeric and categorical data. Knowledge-Based Systems, 30, 129–
135. http://doi.org/10.1016/j.knosys.2012.01.006
Jiang, B., Wang, N., & Wang, L. (2013). Particle swarm optimization with age-group topology
for multimodal functions and data clustering. Communications in Nonlinear Science and
Numerical Simulation, 18(11), 3134–3145. http://doi.org/10.1016/j.cnsns.2013.03.011
Kalyani, S., & Swarup, K. S. (2011). Particle swarm optimization based K-means clustering
approach for security assessment in power systems. Expert Systems with Applications,
38(9), 10839–10846. http://doi.org/10.1016/j.eswa.2011.02.086
Kannan, S. R., Ramathilagam, S., Sathya, A., & Pandiyarajan, R. (2010). Effective fuzzy c-
means based kernel function in segmenting medical images. Computers in Biology and
Medicine, 40(6), 572–579. http://doi.org/10.1016/j.compbiomed.2010.04.001
Karaboga, D., & Ozturk, C. (2011). A novel clustering approach: Artificial Bee Colony (ABC)
algorithm. Applied Soft Computing, 11(1), 652–657.
http://doi.org/10.1016/j.asoc.2009.12.025
Kashef, R., & Kamel, M. S. (2009). Enhanced bisecting k-means clustering using intermediate
cooperation. Pattern Recognition, 42(11), 2557–2569.
http://doi.org/10.1016/j.patcog.2009.03.011
Kaya, I. E., Pehlivanlı, A. Ç., Sekizkardeş, E. G., & Ibrikci, T. (2017). PCA based clustering
for brain tumor segmentation of T1w MRI images. Computer Methods and Programs in
Biomedicine, 140, 19–28. http://doi.org/10.1016/j.cmpb.2016.11.011
Kerr, G., Ruskin, H. J., Crane, M., & Doolan, P. (2008). Techniques for clustering gene
expression data. Computers in Biology and Medicine.
http://doi.org/10.1016/j.compbiomed.2007.11.001
Kolen, J. F., & Hutcheson, T. (2002). Reducing the Time Complexity of the Fuzzy C-Means
Algorithm. IEEE Transactions on Fuzzy Systems, 10(2), 263–267.
Korürek, M., & Nizam, A. (2008). A new arrhythmia clustering technique based on Ant Colony
Optimization. Journal of Biomedical Informatics, 41(6), 874–881.
http://doi.org/10.1016/j.jbi.2008.01.014
Kumar, J., Mills, R. T., Hoffman, F. M., & Hargrove, W. W. (2011). Parallel k-Means
Clustering for Quantitative Ecoregion Delineation Using Large Data Sets. Procedia
Computer Science, 4, 1602–1611. http://doi.org/10.1016/j.procs.2011.04.173
Kuo, R. J., Huang, Y. D., Lin, C.-C., Wu, Y.-H., & Zulvia, F. E. (2014). Automatic kernel
clustering with bee colony optimization algorithm. Information Sciences, 283, 107–122.
http://doi.org/10.1016/j.ins.2014.06.019
Kuo, R. J., Syu, Y. J., Chen, Z.-Y., & Tien, F. C. (2012). Integration of particle swarm
optimization and genetic algorithm for dynamic clustering. Information Sciences, 195,
124–140. http://doi.org/10.1016/j.ins.2012.01.021
Lai, C.-C. (2005). A novel clustering approach using hierarchical genetic algorithms.
Intelligent Automation and Soft Computing, 11(3), 143–153.
Laszlo, M., & Mukherjee, S. (2006). A genetic algorithm using hyper-quadtrees for low-
dimensional k-means clustering. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 28(4), 533–543. http://doi.org/10.1109/TPAMI.2006.66
Laszlo, M., & Mukherjee, S. (2007). A genetic algorithm that exchanges neighboring centers
for k-means clustering. Pattern Recognition Letters (Vol. 28).
http://doi.org/10.1016/j.patrec.2007.08.006
Lee, M., & Pedrycz, W. (2009). The fuzzy C-means algorithm with fuzzy P-mode prototypes
for clustering objects having mixed features. Fuzzy Sets and Systems, 160(24), 3590–
3600. http://doi.org/10.1016/j.fss.2009.06.015
Lei, X., Tian, J., Ge, L., & Zhang, A. (2013). The clustering model and algorithm of PPI
network based on propagating mechanism of artificial bee colony. Information Sciences,
247, 21–39. http://doi.org/10.1016/j.ins.2013.05.027
Lei, X., Wang, F., Wu, F. X., Zhang, A., & Pedrycz, W. (2016). Protein complex identification
through Markov clustering with firefly algorithm on dynamic protein-protein interaction
networks. Information Sciences, 329, 303–316. http://doi.org/10.1016/j.ins.2015.09.028
Levine, S. S., & Kurzban, R. (2006). Explaining clustering in social networks: towards an
evolutionary theory of cascading benefits. Managerial and Decision Economics, 27(2–3),
173–187. http://doi.org/10.1002/mde.1291
Li, B. N., Chui, C. K., Chang, S., & Ong, S. H. (2011). Integrating spatial fuzzy clustering with
level set methods for automated medical image segmentation. Computers in Biology and
Medicine, 41(1), 1–10. http://doi.org/10.1016/j.compbiomed.2010.10.007
Li, C.-T., & Chiao, R. (2003). Multiresolution genetic clustering algorithm for texture
segmentation. Image and Vision Computing, 21(11), 955–966.
http://doi.org/10.1016/S0262-8856(03)00120-3
Liao, L., Lin, T., & Li, B. (2008). MRI brain image segmentation and bias field correction
based on fast spatially constrained kernel clustering approach. Pattern Recognition
Letters (Vol. 29). http://doi.org/10.1016/j.patrec.2008.03.012
Lin, H.-J., Yang, F.-W., & Kao, Y.-T. (2005). An Efficient GA-based Clustering Technique.
Tamkang Journal of Science and Engineering, 8(2), 113–122.
Tseng, L. Y., & Yang, S. B. (1997). Genetic algorithms for clustering, feature
selection and classification. In Proceedings of International Conference on Neural
Networks (ICNN’97) (Vol. 3, pp. 1612–1616). IEEE.
http://doi.org/10.1109/ICNN.1997.614135
Liu, B. (2011). Supervised Learning. In Web Data Mining: Exploring Hyperlinks, Contents, and
Usage Data (2nd ed.). Springer-Verlag Berlin Heidelberg.
Liu, Y., Wu, X., & Shen, Y. (2011). Automatic clustering using genetic algorithms. Applied
Mathematics and Computation, 218(4), 1267–1279.
http://doi.org/10.1016/j.amc.2011.06.007
Liu, Y. Y., & Wang, S. (2015). A scalable parallel genetic algorithm for the Generalized
Assignment Problem. Parallel Computing, 46, 98–119.
http://doi.org/10.1016/j.parco.2014.04.008
Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information
Theory, 28(2), 129–137.
Lozano, J. A., & Larrañaga, P. (1999). Applying genetic algorithms to search for the best
hierarchical clustering of a dataset. Pattern Recognition Letters, 20(9), 911–918.
http://doi.org/10.1016/S0167-8655(99)00057-4
Lichman, M. (2013). UCI Machine Learning Repository. Retrieved June 22, 2013, from
http://archive.ics.uci.edu/ml/
Ma, Y., Cheng, G., Liu, Z., & Xie, F. (2017). Fuzzy nodes recognition based on spectral
clustering in complex networks. Physica A: Statistical Mechanics and Its Applications,
465, 792–797. http://doi.org/10.1016/j.physa.2016.08.022
Maimon, O., & Rokach, L. (2010). Data mining and knowledge discovery handbook. New
York: Springer.
Maio, D., Maltoni, D., & Rizzi, S. (1995). Topological clustering of maps using a genetic
algorithm. Pattern Recognition Letters, 16(1), 89–96. http://doi.org/10.1016/0167-
8655(94)00069-F
Mann, C. F., Matula, D. W., & Olinick, E. V. (2008). The use of sparsest cuts to reveal the
hierarchical community structure of social networks. Social Networks, 30(3), 223–234.
http://doi.org/10.1016/j.socnet.2008.03.004
Maraziotis, I. A. (2012). A semi-supervised fuzzy clustering algorithm applied to gene
expression data. Pattern Recognition, 45(1), 637–648.
http://doi.org/10.1016/j.patcog.2011.05.007
Masulli, F., & Schenone, A. (1999). A fuzzy clustering based segmentation system as support
to diagnosis in medical imaging. Artificial Intelligence in Medicine, 16(2), 129–147.
http://doi.org/10.1016/S0933-3657(98)00069-4
Mathew, J., & Vijayakumar, R. (2014). Scalable parallel clustering approach for large data
using parallel K means and firefly algorithms. In 2014 International Conference on High
Performance Computing and Applications (ICHPCA) (pp. 1–8). IEEE.
http://doi.org/10.1109/ICHPCA.2014.7045322
Matthias, B., & Juri, S. (2009). Spectral Clustering.
Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique.
Pattern Recognition, 33(9), 1455–1465. http://doi.org/10.1016/S0031-3203(99)00137-5
Maulik, U., & Mukhopadhyay, A. (2010). Simulated annealing based automatic fuzzy
clustering combined with ANN classification for analyzing microarray data. Computers
& Operations Research, 37(8), 1369–1380. http://doi.org/10.1016/j.cor.2009.02.025
Rahman, M. A. (2014). Automatic Selection of High Quality Initial Seeds for Generating
High Quality Clusters without requiring any User Inputs. Charles Sturt University.
Menon, N., & Ramakrishnan, R. (2015). Brain Tumor Segmentation in MRI Images Using
Unsupervised Artificial Bee Colony Algorithm and FCM Clustering. In International
Conference on Communications and Signal Processing, ICCSP 2015 (pp. 0006–0009).
Merz, B., Nguyen, V. D., & Vorogushyn, S. (2016). Temporal clustering of floods in Germany:
Do flood-rich and flood-poor periods exist? Journal of Hydrology, 541, 824–838.
http://doi.org/10.1016/j.jhydrol.2016.07.041
Miller, G. E., & Cole, S. W. (2012). Clustering of Depression and Inflammation in Adolescents
Previously Exposed to Childhood Adversity. Biological Psychiatry, 72(1), 34–40.
http://doi.org/10.1016/j.biopsych.2012.02.034
Mo, J., Kiang, M. Y., Zou, P., & Li, Y. (2010). A two-stage clustering approach for multi-
region segmentation. Expert Systems with Applications, 37(10), 7120–7131.
http://doi.org/10.1016/j.eswa.2010.03.003
Mohd, W. M. B. W., Beg, A. H., Herawan, T., & Rabbi, K. F. (2012). An Improved Parameter
less Data Clustering Technique based on Maximum Distance of Data and Lloyd k-means
Algorithm. Procedia Technology, 1, 367–371.
http://doi.org/10.1016/j.protcy.2012.02.076
Montani, S., & Leonardi, G. (2014). Retrieval and clustering for supporting business process
adjustment and analysis. Information Systems, 40, 128–141.
http://doi.org/10.1016/j.is.2012.11.006
Moore, M. (2004). An accurate parallel genetic algorithm to schedule tasks on a cluster.
Parallel Computing, 30(5), 567–583. http://doi.org/10.1016/j.parco.2003.12.005
Mukhopadhyay, A., & Maulik, U. (2009). Towards improving fuzzy clustering using support
vector machine: Application to gene expression data. Pattern Recognition, 42(11), 2744–
2763. http://doi.org/10.1016/j.patcog.2009.04.018
Mungle, S., Benyoucef, L., Son, Y. J., & Tiwari, M. K. (2013). A fuzzy clustering-based
genetic algorithm approach for time-cost-quality trade-off problems: A case study of
highway construction project. Engineering Applications of Artificial Intelligence, 26(8),
1953–1966. http://doi.org/10.1016/j.engappai.2013.05.006
Mur, A., Dormido, R., Duro, N., Dormido-Canto, S., & Vega, J. (2016). Determination of the
optimal number of clusters using a spectral clustering optimization. Expert Systems with
Applications, 65, 304–314. http://doi.org/10.1016/j.eswa.2016.08.059
Murthy, C. A., & Chowdhury, N. (1996). In search of optimal clusters using genetic algorithms.
Pattern Recognition Letters, 17(8), 825–832. http://doi.org/10.1016/0167-
8655(96)00043-8
Nanda, S. R., Mahanty, B., & Tiwari, M. K. (2010). Clustering Indian stock market data for
portfolio management. Expert Systems with Applications, 37(12), 8793–8798.
http://doi.org/10.1016/j.eswa.2010.06.026
Narayan, P. K., Narayan, S., & Popp, S. (2011). Investigating price clustering in the oil futures
market. Applied Energy (Vol. 88). http://doi.org/10.1016/j.apenergy.2010.07.034
Narayan, P. K., Narayan, S., Popp, S., & D’Rosario, M. (2011). Share price clustering in
Mexico. International Review of Financial Analysis (Vol. 20).
http://doi.org/10.1016/j.irfa.2011.02.003
Nascimento, M. C. V., & de Carvalho, A. C. P. L. F. (2011). Spectral methods for graph
clustering – A survey. European Journal of Operational Research, 211(2), 221–231.
http://doi.org/10.1016/j.ejor.2010.08.012
Neto, J. C., Meyer, G. E., & Jones, D. D. (2006). Individual leaf extractions from young canopy
images using Gustafson-Kessel clustering and a genetic algorithm. Computers and
Electronics in Agriculture, 51(1–2), 66–85. http://doi.org/10.1016/j.compag.2005.11.002
Omran, M. G. H., Engelbrecht, A. P., & Salman, A. (2007). An overview of clustering methods.
Intelligent Data Analysis, 11(6), 583–605.
Oostenveld, R., & Praamstra, P. (2001). The five percent electrode system for high-resolution
EEG and ERP measurements. Clinical Neurophysiology, 112(4), 713–719.
http://doi.org/10.1016/S1388-2457(00)00527-7
Opsahl, T., & Panzarasa, P. (2009). Clustering in weighted networks. Social Networks, 31(2),
155–163. http://doi.org/10.1016/j.socnet.2009.02.002
Ozturk, C., Hancer, E., & Karaboga, D. (2015). Dynamic clustering with improved binary
artificial bee colony algorithm. Applied Soft Computing Journal, 28, 69–80.
http://doi.org/10.1016/j.asoc.2014.11.040
Pakhira, M. K., Bandyopadhyay, S., & Maulik, U. (2005). A study of some fuzzy cluster
validity indices, genetic clustering and application to pixel classification. Fuzzy Sets and
Systems, 155(2), 191–214. http://doi.org/10.1016/j.fss.2005.04.009
Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining (1st ed.).
Pearson Addison Wesley.
Parente, J., Pereira, M. G., & Tonini, M. (2016). Space-time clustering analysis of wildfires:
The influence of dataset characteristics, fire prevention policy decisions, weather and
climate. Science of The Total Environment, 559, 151–165.
http://doi.org/10.1016/j.scitotenv.2016.03.129
Paterlini, S., & Krink, T. (2006). Differential evolution and particle swarm optimisation in
partitional clustering. Computational Statistics & Data Analysis, 50(5), 1220–1247.
http://doi.org/10.1016/j.csda.2004.12.004
Peng, P., Addam, O., Elzohbi, M., Özyer, S. T., Elhajj, A., Gao, S., … Alhajj, R. (2014).
Reporting and analyzing alternative clustering solutions by employing multi-objective
genetic algorithm and conducting experiments on cancer data. Knowledge-Based Systems,
56, 108–122. http://doi.org/10.1016/j.knosys.2013.11.003
Pirim, H., Ekşioğlu, B., Perkins, A. D., & Yüceer, Ç. (2012). Clustering of high throughput
gene expression data. Computers & Operations Research, 39(12), 3046–3061.
http://doi.org/10.1016/j.cor.2012.03.008
Pourvaziri, H., & Naderi, B. (2014). A hybrid multi-population genetic algorithm for the
dynamic facility layout problem. Applied Soft Computing, 24, 457–469.
http://doi.org/10.1016/j.asoc.2014.06.051
Pyle, D. (1999). Data preparation for data mining. San Francisco: Morgan Kaufmann
Publishers.
Qiao, S., Li, T., Li, H., Peng, J., & Chen, H. (2012). A new blockmodeling based hierarchical
clustering algorithm for web social networks. Engineering Applications of Artificial
Intelligence, 25(3), 640–647. http://doi.org/10.1016/j.engappai.2012.01.003
Qing, L., Gang, W., Zaiyue, Y., & Qiuping, W. (2008). Crowding clustering genetic algorithm
for multimodal function optimization. Applied Soft Computing, 8(1), 88–95.
http://doi.org/10.1016/j.asoc.2006.10.014
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, USA: Morgan
Kaufmann Publishers.
Quinlan, J. R. (1996). Improved Use of Continuous Attributes in C4.5. Journal of Artificial
Intelligence Research, 4, 77–90.
Rafailidis, D., Constantinou, E., & Manolopoulos, Y. (2017). Landmark selection for spectral
clustering based on Weighted PageRank. Future Generation Computer Systems, 68, 465–
472. http://doi.org/10.1016/j.future.2016.03.006
Rahman, M. A., & Islam, M. Z. (2014). A hybrid clustering technique combining a novel
genetic algorithm with K-Means. Knowledge-Based Systems, 71, 345–365.
http://doi.org/10.1016/j.knosys.2014.08.011
Ramos, G. N., Hatakeyama, Y., Dong, F., & Hirota, K. (2009). Hyperbox clustering with Ant
Colony Optimization (HACO) method and its application to medical risk profile
recognition. Applied Soft Computing, 9(2), 632–640.
http://doi.org/10.1016/j.asoc.2008.09.004
Rivera-Baltanas, T., Olivares, J. M., Martinez-Villamarin, J. R., Fenton, E. Y., Kalynchuk,
L. E., & Caruncho, H. J. (2014). Serotonin 2A receptor clustering in peripheral lymphocytes
is altered in major depression and may be a biomarker of therapeutic efficacy. Journal of
Affective Disorders, 163, 47–55. http://doi.org/10.1016/j.jad.2014.03.011
Roiger, R. J., & Geatz, M. (2003). Data mining : A tutorial-based primer. Addison Wesley.
Roy, A., & Parui, S. K. (2014). Pair-copula based mixture models and their application in
clustering. Pattern Recognition, 47(4), 1689–1697.
http://doi.org/10.1016/j.patcog.2013.10.004
Saha, S., Alok, A. K., & Ekbal, A. (2016). Brain image segmentation using semi-supervised
clustering. Expert Systems with Applications, 52, 50–63.
http://doi.org/10.1016/j.eswa.2016.01.005
Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1(1), 27–64.
http://doi.org/10.1016/j.cosrev.2007.05.001
Scheunders, P. (1997). A genetic c-Means clustering algorithm applied to color image
quantization. Pattern Recognition, 30(6), 859–866. http://doi.org/10.1016/S0031-
3203(96)00131-8
Schulz, J. (2008). Minkowski distance. Retrieved May 15, 2013, from
http://www.code10.info/index.php?option=com_content&view=article&id=61:articlemi
nkowski-distance&catid=38:cat_coding_algorithms_data-similarity&Itemid=57
Senthilnath, J., Omkar, S. N., & Mani, V. (2011). Clustering using firefly algorithm:
Performance study. Swarm and Evolutionary Computation, 1(3), 164–171.
http://doi.org/10.1016/j.swevo.2011.06.003
Shang, R., Zhang, Z., Jiao, L., Wang, W., & Yang, S. (2016). Global discriminative-based
nonnegative spectral clustering. Pattern Recognition, 55, 172–182.
http://doi.org/10.1016/j.patcog.2016.01.035
Sharbrough, F., Chatrian, G.-E., Lesser, R. P., Lüders, H., Nuwer, M., & Picton, T. W. (1991).
American Electroencephalographic Society guidelines for standard electrode position
nomenclature. Journal of Clinical Neurophysiology, 8(2), 200–202. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/2050819
Sheikh, R. H., Raghuwanshi, M. M., & Jaiswal, A. N. (2008). Genetic Algorithm Based
Clustering: A Survey. In 2008 First International Conference on Emerging Trends in
Engineering and Technology (pp. 314–319). IEEE.
http://doi.org/10.1109/ICETET.2008.48
Shelokar, P. S., Jayaraman, V. K., & Kulkarni, B. D. (2004). An ant colony approach for clustering.
Analytica Chimica Acta, 509(2), 187–195. http://doi.org/10.1016/j.aca.2003.12.032
Sheng, W., Howells, G., Fairhurst, M., & Deravi, F. (2008). Template-Free Biometric-Key
Generation by Means of Fuzzy Genetic Clustering. IEEE Transactions on Information
Forensics and Security, 3(2), 183–191.
Sisodia, D., Singh, L., Sisodia, S., & Saxena, K. (2012). Clustering Techniques: A Brief Survey
of Different Clustering Algorithms. International Journal of Latest Trends in Engineering
and Technology (IJLTET), 1(3), 82–87.
Son, L. H., & Tuan, T. M. (2017). Dental segmentation from X-ray images using semi-
supervised fuzzy clustering with spatial constraints. Engineering Applications of Artificial
Intelligence, 59, 186–195. http://doi.org/10.1016/j.engappai.2017.01.003
Song, W., Li, C. H., & Park, S. C. (2009). Genetic algorithm for text clustering using ontology
and evaluating the validity of various semantic similarity measures. Expert Systems with
Applications, 36(5), 9095–9104. http://doi.org/10.1016/j.eswa.2008.12.046
Sonğur, C., & Top, M. (2016). Regional clustering of medical imaging technologies.
Computers in Human Behavior, 61, 333–343. http://doi.org/10.1016/j.chb.2016.03.056
Srikanth, R., George, R., Warsi, N., Prabhu, D., Petry, F. E., & Buckles, B. P. (1995). A
variable-length genetic algorithm for clustering and classification. Pattern Recognition
Letters, 16(8), 789–800. http://doi.org/10.1016/0167-8655(95)00043-G
Srinivas, M., & Patnaik, L. M. (1994). Adaptive probabilities of crossover and mutation in
genetic algorithms. IEEE Transactions on Systems, Man and Cybernetics, 24(4), 656–667.
Stockman, G., & Shapiro, L. G. (2001). Computer Vision. New Jersey: Prentice-Hall.
Straßburg, J., Gonzàlez-Martel, C., & Alexandrov, V. (2012). Parallel genetic algorithms for
stock market trading rules. Procedia Computer Science, 9, 1306–1313.
http://doi.org/10.1016/j.procs.2012.04.143
Sumathi, S., & Sivanandam, S. N. (2006). Introduction to data mining and its applications.
Springer.
Sun, J., Chen, W., Fang, W., Wu, X., & Xu, W. (2012). Gene expression data analysis with
the clustering method based on an improved quantum-behaved Particle Swarm
Optimization. Engineering Applications of Artificial Intelligence, 25(2), 376–391.
http://doi.org/10.1016/j.engappai.2011.09.017
Suzuki, T., Shiga, T., Kuwahara, K., Kobayashi, S., Suzuki, S., Nishimura, K., … Hagiwara,
N. (2014). Impact of clustered depression and anxiety on mortality and rehospitalization
in patients with heart failure. Journal of Cardiology, 64(6), 456–462.
http://doi.org/10.1016/j.jjcc.2014.02.031
Szeto, L. K., Liew, A. W.-C., Yan, H., & Tang, S. (2003). Gene expression data clustering and
visualization based on a binary hierarchical clustering framework. Journal of Visual
Languages & Computing, 14(4), 341–362. http://doi.org/10.1016/S1045-
926X(03)00033-8
Teknomo, K. (2015a). Jaccard’s Coefficient. Retrieved May 15, 2013, from
http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
Teknomo, K. (2015b). Minkowski Distance. Retrieved January 2, 2017, from
http://people.revoledu.com/kardi/tutorial/Similarity/MinkowskiDistance.html
Traud, A. L., Mucha, P. J., & Porter, M. A. (2012). Social structure of Facebook networks.
Physica A: Statistical Mechanics and Its Applications, 391(16), 4165–4180.
http://doi.org/10.1016/j.physa.2011.12.021
Triola, M. F. (2001). Elementary Statistics (8th ed.). Boston San Francisco New York: Addison
Wesley Longman, Inc.
Tsai, C.-Y., & Kao, I.-W. (2011). Particle swarm optimization with selective particle
regeneration for data clustering. Expert Systems with Applications, 38(6), 6565–6576.
http://doi.org/10.1016/j.eswa.2010.11.082
Tseng, L. Y., & Yang, S. B. (2001). A genetic approach to the automatic clustering problem.
Pattern Recognition, 34(2), 415–424.
Turgut, D., Das, S. K., Elmasri, R., & Turgut, B. (2002). Optimizing clustering algorithm in
mobile ad hoc networks using genetic algorithmic approach. In Global
Telecommunications Conference, 2002. GLOBECOM ’02. IEEE (Vol. 1, pp. 62–66).
IEEE. http://doi.org/10.1109/GLOCOM.2002.1188042
Tzes, A., Peng, P.-Y., & Guthy, J. (1998). Genetic-based fuzzy clustering for DC-motor
friction identification and compensation. IEEE Transactions on Control Systems
Technology, 6(4), 462–472. http://doi.org/10.1109/87.701338
Van Lancker, A., Beeckman, D., Verhaeghe, S., Van Den Noortgate, N., & Van Hecke, A.
(2016). Symptom clustering in hospitalised older palliative cancer patients: A cross-
sectional study. International Journal of Nursing Studies, 61, 72–81.
http://doi.org/10.1016/j.ijnurstu.2016.05.010
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4),
395–416. http://doi.org/10.1007/s11222-007-9033-z
Ma, E. W. M., & Chow, T. W. S. (2004). A new shifting grid clustering algorithm. Pattern
Recognition, 37(3), 503–514. http://doi.org/10.1016/j.patcog.2003.08.014
Wan, M., Wang, C., Li, L., & Yang, Y. (2012). Chaotic ant swarm approach for data clustering.
Applied Soft Computing, 12(8), 2387–2393. http://doi.org/10.1016/j.asoc.2012.03.037
Wang, C.-H. (2009). Outlier identification and market segmentation using kernel-based
clustering techniques. Expert Systems with Applications, 36(2), 3744–3750.
http://doi.org/10.1016/j.eswa.2008.02.037
Wang, C., Cao, L., Li, J., Wei, W., Ou, Y., & Wang, M. (2011). Coupled Nominal Similarity
in Unsupervised Learning.
Wang, W., Yang, J., & Muntz, R. (1997). STING : A Statistical Information Grid Approach to
Spatial Data Mining. In The 23rd VLDB Conference Athens (pp. 186–195). Greece.
Wikaisuksakul, S. (2014). A multi-objective genetic algorithm with fuzzy c-means for
automatic data clustering. Applied Soft Computing, 24, 679–691.
http://doi.org/10.1016/j.asoc.2014.08.036
Xiao, J., Yan, Y., Zhang, J., & Tang, Y. (2010). A quantum-inspired genetic algorithm for k-
means clustering. Expert Systems with Applications, 37(7), 4966–4973.
http://doi.org/10.1016/j.eswa.2009.12.017
Xie, X. L., & Beni, G. (1991). A validity measure for fuzzy clustering. IEEE Transactions on
Pattern Analysis and Machine Intelligence, PAMI, 13(8), 841–847.
Xu, R., Damelin, S., Nadler, B., & Wunsch, D. C. (2010). Clustering of high-dimensional gene
expression data with feature filtering methods and diffusion maps. Artificial Intelligence
in Medicine, 48(2), 91–98. http://doi.org/10.1016/j.artmed.2009.06.001
Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural
Networks, 16(3), 645–678.
Yan, X., Zhu, Y., Zou, W., & Wang, L. (2012). A new approach for data clustering using hybrid
artificial bee colony algorithm. Neurocomputing, 97, 241–250.
http://doi.org/10.1016/j.neucom.2012.04.025
Yang, F., Sun, T., & Zhang, C. (2009). An efficient hybrid data clustering method based on K-
harmonic means and Particle Swarm Optimization. Expert Systems with Applications,
36(6), 9847–9852. http://doi.org/10.1016/j.eswa.2009.02.003
Yang, J. Y., & Ersoy, O. K. (2003). Combined Supervised and Unsupervised Learning in
Genomic Data Mining. Electrical and Computer Engineering, Purdue University.
Yang, Y., Wang, Y., & Xue, X. (2016). A novel spectral clustering method with superpixels
for image segmentation. Optik - International Journal for Light and Electron Optics,
127(1), 161–167. http://doi.org/10.1016/j.ijleo.2015.10.053
Yücenur, G. N., & Demirel, N. Ç. (2011). A new geometric shape-based genetic clustering
algorithm for the multi-depot vehicle routing problem. Expert Systems with Applications,
38(9), 11859–11865. http://doi.org/10.1016/j.eswa.2011.03.077
Zeng, Y., & Garcia-Frias, J. (2006). A novel HMM-based clustering algorithm for the analysis
of gene expression time-course data. Computational Statistics & Data Analysis, 50(9),
2472–2494. http://doi.org/10.1016/j.csda.2005.07.007
Zhang, C., Ouyang, D., & Ning, J. (2010). An artificial bee colony approach for clustering.
Expert Systems with Applications, 37(7), 4761–4767.
http://doi.org/10.1016/j.eswa.2009.11.003
Zhang, L., & Cao, Q. (2011). A novel ant-based clustering algorithm using the kernel method.
Information Sciences, 181(20), 4658–4672. http://doi.org/10.1016/j.ins.2010.11.005
Zhang, L., Cao, Q., & Lee, J. (2013). A novel ant-based clustering algorithm using Renyi
entropy. Applied Soft Computing, 13(5), 2643–2657.
http://doi.org/10.1016/j.asoc.2012.11.022
Zhao, F., Fan, J., & Liu, H. (2014). Optimal-selection-based suppressed fuzzy c-means
clustering algorithm with self-tuning non local spatial information for image
segmentation. Expert Systems with Applications, 41(9), 4083–4093.
http://doi.org/10.1016/j.eswa.2014.01.003
Zhao, L., Yang, Y., & Zeng, Y. (2009). Eliciting compact T-S fuzzy models using subtractive
clustering and coevolutionary particle swarm optimization. Neurocomputing, 72(10–12),
2569–2575. http://doi.org/10.1016/j.neucom.2008.11.001
Zhao, P., & Zhang, C.-Q. (2011). A new clustering method and its application in social
networks. Pattern Recognition Letters, 32(15), 2109–2118.
http://doi.org/10.1016/j.patrec.2011.06.008
Zhao, Z., Feng, S., Wang, Q., Huang, J. Z., Williams, G. J., & Fan, J. (2012). Topic oriented
community detection through social objects and link analysis in social networks.
Knowledge-Based Systems, 26, 164–173. http://doi.org/10.1016/j.knosys.2011.07.017
Zhong, C., Miao, D., & Wang, R. (2010). A graph-theoretical clustering method based on two
rounds of minimum spanning trees. Pattern Recognition, 43(3), 752–766.
http://doi.org/10.1016/j.patcog.2009.07.010